kiwi.data.datasets.parallel_dataset

Module Contents

Classes

InputConfig

TrainingConfig

TestConfig

ParallelDataset

An abstract class representing a Dataset.

Functions

read_file(path, reader)

kiwi.data.datasets.parallel_dataset.logger

class kiwi.data.datasets.parallel_dataset.InputConfig

Bases: pydantic.BaseModel

source :FilePath

Path to a corpus file in the source language

target :FilePath

Path to a corpus file in the target language

class kiwi.data.datasets.parallel_dataset.TrainingConfig

Bases: pydantic.BaseModel

input :InputConfig

class kiwi.data.datasets.parallel_dataset.TestConfig

Bases: pydantic.BaseModel

input :InputConfig

class kiwi.data.datasets.parallel_dataset.ParallelDataset(columns: Dict[Any, Union[Iterable, List]])

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should override __getitem__(), which fetches the data sample for a given key. Subclasses may also override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset.

Note

By default, DataLoader constructs an index sampler that yields integral indices. To make it work with a map-style dataset that has non-integral indices or keys, a custom sampler must be provided.
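The note above can be illustrated with a minimal pure-Python sketch. KeyedDataset and KeySampler are hypothetical names that are not part of kiwi or PyTorch, and torch is deliberately not imported so that the example stays self-contained. In PyTorch, an object like KeySampler would be passed to DataLoader through its sampler argument.

```python
from typing import Dict, Iterator, List


class KeyedDataset:
    """Hypothetical map-style dataset whose keys are strings, not integers."""

    def __init__(self, samples: Dict[str, str]):
        self.samples = samples

    def __getitem__(self, key: str) -> str:
        return self.samples[key]

    def __len__(self) -> int:
        return len(self.samples)


class KeySampler:
    """Hypothetical sampler that yields the dataset's own keys
    instead of the integers 0..len(dataset)-1."""

    def __init__(self, dataset: KeyedDataset):
        # Sort the keys so that iteration order is deterministic.
        self.keys: List[str] = sorted(dataset.samples)

    def __iter__(self) -> Iterator[str]:
        return iter(self.keys)

    def __len__(self) -> int:
        return len(self.keys)


dataset = KeyedDataset({"doc-b": "second", "doc-a": "first"})
# Fetch every sample through the keys produced by the sampler.
fetched = [dataset[key] for key in KeySampler(dataset)]
```

An integer-yielding sampler would fail here with a KeyError, because the dataset has no entry for 0 or 1.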

class Config

Bases: pydantic.BaseModel

buffer_size :int

Number of consecutive instances to be temporarily stored in the buffer, which will be used later for batching/bucketing.
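One plausible way to use such a buffer is sketched below: read buffer_size consecutive instances, sort them by length inside the buffer, and cut the sorted buffer into batches. This is only an illustration of the general bucketing idea, not kiwi's implementation. The function bucketed_batches and its batch_size parameter are assumptions introduced for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def bucketed_batches(
    instances: Iterable[str], buffer_size: int, batch_size: int
) -> Iterator[List[str]]:
    """Hypothetical helper: group similar-length sentences into batches."""
    iterator = iter(instances)
    while True:
        # Temporarily store up to buffer_size consecutive instances.
        buffer = list(islice(iterator, buffer_size))
        if not buffer:
            return
        # Sort only within the buffer, by number of whitespace tokens.
        buffer.sort(key=lambda sentence: len(sentence.split()))
        # Cut the sorted buffer into consecutive batches.
        for start in range(0, len(buffer), batch_size):
            yield buffer[start:start + batch_size]


sentences = ["a b c", "a", "a b", "a b c d", "a b", "a"]
batches = list(bucketed_batches(sentences, buffer_size=4, batch_size=2))
```

A larger buffer lets sentences of similar length end up in the same batch more often, at the cost of holding more instances in memory at once.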

train :Optional[TrainingConfig]

valid :Optional[TrainingConfig]

test :Optional[TestConfig]

split :Optional[confloat(gt=0.0, lt=1.0)]

Fraction used to split the training dataset when no validation set is given.

ensure_there_is_validation_data(cls, v, values)

static build(config: Config, directory=None, train=False, valid=False, test=False, split=0)

Build training, validation, and test datasets.

Parameters
  • config – configuration object with file paths and processing flags; check out the docs for Config.

  • directory – if provided and paths in configuration are not absolute, use it to anchor them.

  • train – whether to build the training dataset.

  • valid – whether to build the validation dataset.

  • test – whether to build the testing dataset.

  • split (float) – if no validation set is provided, randomly sample 1 - split of the training examples as the validation set.
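The directory and split parameters can be pictured with the following sketch. It is not the code behind build(): anchor_path, split_examples, and the seed parameter are hypothetical helpers written for illustration, assuming that relative paths are joined onto directory and that a fraction split of the examples stays in the training set.

```python
import random
from pathlib import Path
from typing import List, Optional, Sequence, Tuple


def anchor_path(path: str, directory: Optional[str] = None) -> Path:
    """Hypothetical helper: anchor a relative path at `directory`."""
    candidate = Path(path)
    if directory is not None and not candidate.is_absolute():
        return Path(directory) / candidate
    # Absolute paths (or a missing directory) are left untouched.
    return candidate


def split_examples(
    examples: Sequence[str], split: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: keep `split` of the examples for training and
    randomly sample the remaining 1 - split as validation data."""
    if not 0.0 < split < 1.0:
        raise ValueError("split must be strictly between 0 and 1")
    indices = list(range(len(examples)))
    # A seeded generator keeps the random split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = round(split * len(examples))
    train = [examples[i] for i in indices[:n_train]]
    valid = [examples[i] for i in indices[n_train:]]
    return train, valid


examples = [f"sentence {i}" for i in range(10)]
train_part, valid_part = split_examples(examples, split=0.8)
```

With ten examples and split=0.8, eight examples remain for training and two are held out for validation.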

__getitem__(self, index_or_field: Union[int, str]) → Union[List[Any], Dict[str, Any]]

Get a row with data from all fields (given an integer index), or all rows for a given field (given a field name).
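The dual behaviour of __getitem__ can be sketched with a small column-oriented container. ColumnarSketch is a hypothetical stand-in, not ParallelDataset itself. It assumes that an integer selects one row across all fields and a string selects one whole field.

```python
from typing import Any, Dict, List, Union


class ColumnarSketch:
    """Hypothetical column-oriented container mirroring the documented signature."""

    def __init__(self, columns: Dict[str, List[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        self.columns = columns

    def __getitem__(
        self, index_or_field: Union[int, str]
    ) -> Union[List[Any], Dict[str, Any]]:
        if isinstance(index_or_field, str):
            # A field name returns all rows of that field.
            return self.columns[index_or_field]
        # An integer returns one row with data from all fields.
        return {
            name: values[index_or_field]
            for name, values in self.columns.items()
        }

    def __len__(self) -> int:
        # Number of rows, taken from the first column (0 if there is none).
        return len(next(iter(self.columns.values()), []))


data = ColumnarSketch(
    {"source": ["ein Haus", "hallo"], "target": ["a house", "hello"]}
)
row = data[1]
column = data["target"]
```

Storing data by column keeps whole-field access cheap, while row access assembles a dictionary on demand.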

__len__(self)

__contains__(self, item)

sort_key(self, field='source')

kiwi.data.datasets.parallel_dataset.read_file(path, reader)