kiwi.data.datasets.wmt_qe_dataset

Module Contents

Classes

InputConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

OutputConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

TrainingConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

TestConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

WMTQEDataset

An abstract class representing a Dataset.

Functions

read_file(path, reader)

kiwi.data.datasets.wmt_qe_dataset.logger
class kiwi.data.datasets.wmt_qe_dataset.InputConfig

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

source :FilePath

Path to a corpus file in the source language.

target :FilePath

Path to a corpus file in the target language.

alignments :Optional[FilePath]

Path to alignments between source and target.

post_edit :Optional[FilePath]

Path to file containing post-edited target.

source_pos :Optional[FilePath]

Path to input file with POS tags for source.

target_pos :Optional[FilePath]

Path to input file with POS tags for target.

class kiwi.data.datasets.wmt_qe_dataset.OutputConfig

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

target_tags :Optional[FilePath]

Path to label file for target.

source_tags :Optional[FilePath]

Path to label file for source.

sentence_scores :Optional[FilePath]

Path to file containing sentence level scores (HTER).

class kiwi.data.datasets.wmt_qe_dataset.TrainingConfig

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

input :InputConfig
output :OutputConfig
class kiwi.data.datasets.wmt_qe_dataset.TestConfig

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

input :InputConfig
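
The nesting of these config classes can be pictured with plain dictionaries. This is a minimal sketch, not the library's loading code: the key names mirror the fields documented above, every file name is a made-up placeholder, and missing_input_fields is a hypothetical helper. It assumes source and target are required because they are typed FilePath rather than Optional[FilePath].

```python
# Key names mirror the documented fields; every file name is a made-up placeholder.
training_config = {
    "input": {
        "source": "train.src",
        "target": "train.mt",
        "alignments": "train.src-mt.alignments",  # optional field
    },
    "output": {
        "target_tags": "train.tags",  # optional field
        "sentence_scores": "train.hter",  # optional field
    },
}

# A test configuration only carries an "input" section.
test_config = {"input": {"source": "test.src", "target": "test.mt"}}

# Assumption: only the non-Optional fields are mandatory.
REQUIRED_INPUT_FIELDS = ("source", "target")


def missing_input_fields(config: dict) -> list:
    """Return the required input fields that a config dictionary lacks."""
    given = config.get("input", {})
    return [name for name in REQUIRED_INPUT_FIELDS if name not in given]


print(missing_input_fields(training_config))
print(missing_input_fields({"input": {"source": "only.src"}}))
```

The real classes are pydantic models, and pydantic's FilePath type also checks that each path exists and points to a file; the dictionary sketch performs no such check.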
class kiwi.data.datasets.wmt_qe_dataset.WMTQEDataset(columns: Dict[Any, Union[Iterable, List]])

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should override __getitem__() to support fetching a data sample for a given key. Subclasses may also override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset.

Note

By default, DataLoader constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

class Config

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

buffer_size :int

Number of consecutive instances to be temporarily stored in the buffer, which will be used later for batching/bucketing.

train :TrainingConfig
valid :TrainingConfig
test :TestConfig
split :Optional[confloat(gt=0.0, lt=1.0)]

Proportion used to split the training dataset when no validation set is given; must be strictly between 0.0 and 1.0.

ensure_there_is_validation_data(cls, v, values)
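
The validator's name and the description of split suggest that a configuration needs either a validation set or a split value. The following is a sketch of that rule as a plain function, under that assumption; the function name, its arguments, and the error messages are hypothetical and are not taken from the library.

```python
from typing import Optional


def check_validation_source(
    split: Optional[float], valid: Optional[dict]
) -> Optional[float]:
    """Assumed rule: require a validation set or a split in (0.0, 1.0)."""
    if split is not None and not (0.0 < split < 1.0):
        # Mirrors the confloat(gt=0.0, lt=1.0) constraint on the field.
        raise ValueError("split must be strictly between 0.0 and 1.0")
    if split is None and valid is None:
        raise ValueError("provide either a validation set or a split value")
    return split


print(check_validation_source(0.9, None))
print(check_validation_source(None, {"input": {}}))
```

With this rule, a configuration that names neither valid nor split is rejected early instead of failing later during training.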
static build(config: Config, directory=None, train=False, valid=False, test=False, split=0)

Build training, validation, and test datasets.

Parameters
  • config – configuration object with file paths and processing flags; check out the docs for Config.

  • directory – if provided and paths in configuration are not absolute, use it to anchor them.

  • train – whether to build the training dataset.

  • valid – whether to build the validation dataset.

  • test – whether to build the testing dataset.

  • split (float) – if no validation set is provided, randomly sample a fraction 1 - split of the training examples as the validation set.
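
The effect of split can be illustrated with a small index-based sketch. It is not the library's implementation: split_indices, its seed argument, and the rounding choice are assumptions made for illustration. It keeps a fraction split of the examples for training and holds out the remaining 1 - split for validation.

```python
import random
from typing import List, Tuple


def split_indices(n: int, split: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly partition range(n) into training and validation indices."""
    if not 0.0 < split < 1.0:
        raise ValueError("split must be strictly between 0.0 and 1.0")
    indices = list(range(n))
    # A seeded generator keeps the partition reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(round(n * split))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, valid_idx = split_indices(10, 0.8)
print(len(train_idx), len(valid_idx))
```

Splitting indices rather than rows keeps all columns (source, target, tags, scores) aligned, since the same indices can be applied to each of them.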

__getitem__(self, index_or_field: Union[int, str]) → Union[List[Any], Dict[str, Any]]

Get a row with data from all fields (when given an integer index) or all rows for a given field (when given a field name).

__len__(self)
__contains__(self, item)
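
The column-oriented access pattern described by these methods can be sketched with a small stand-in class. ColumnDataset is hypothetical and does not subclass torch.utils.data.Dataset; it assumes that an integer index returns a dictionary with one value per field, that a string returns the full list for that field, and that __contains__ tests for field names.

```python
from typing import Any, Dict, List, Union


class ColumnDataset:
    """Stand-in for a dataset stored as equally long, named columns."""

    def __init__(self, columns: Dict[str, List[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        self.columns = columns

    def __getitem__(
        self, index_or_field: Union[int, str]
    ) -> Union[List[Any], Dict[str, Any]]:
        if isinstance(index_or_field, str):
            # Field name: return every row of that column.
            return self.columns[index_or_field]
        # Integer index: return one row across all columns.
        return {name: values[index_or_field] for name, values in self.columns.items()}

    def __len__(self) -> int:
        # Number of rows, taken from any column (0 if there are no columns).
        return len(next(iter(self.columns.values()), []))

    def __contains__(self, item: str) -> bool:
        # Assumption: membership is tested against field names.
        return item in self.columns


data = ColumnDataset({"source": ["ein Haus", "ein Hund"], "target": ["a house", "a dog"]})
print(data[0])
print(data["target"])
```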
sort_key(self, field='source')
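
A sort key of this kind is typically combined with a buffer to group examples of similar length into the same batch. The sketch below shows that idea with stdlib code only; length_key, bucketed_batches, and batch_size are hypothetical names, and sorting by the length of the chosen field is an assumption about what sort_key returns.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Example = Dict[str, Any]


def length_key(field: str = "source") -> Callable[[Example], int]:
    """Return a key function that orders examples by the length of one field."""

    def key(example: Example) -> int:
        return len(example[field])

    return key


def bucketed_batches(
    examples: Iterable[Example],
    buffer_size: int,
    batch_size: int,
    key: Callable[[Example], int],
) -> Iterator[List[Example]]:
    """Fill a buffer with consecutive examples, sort it, and emit batches."""
    buffer: List[Example] = []
    for example in examples:
        buffer.append(example)
        if len(buffer) == buffer_size:
            buffer.sort(key=key)
            for start in range(0, len(buffer), batch_size):
                yield buffer[start:start + batch_size]
            buffer = []
    if buffer:
        # Flush the last, possibly smaller, buffer.
        buffer.sort(key=key)
        for start in range(0, len(buffer), batch_size):
            yield buffer[start:start + batch_size]


rows = [{"source": ["w"] * n} for n in (3, 1, 2, 5, 4, 6)]
for batch in bucketed_batches(rows, buffer_size=4, batch_size=2, key=length_key()):
    print([len(row["source"]) for row in batch])
```

Sorting only within a buffer, rather than over the whole dataset, keeps memory bounded while still reducing padding inside each batch.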
kiwi.data.datasets.wmt_qe_dataset.read_file(path, reader)