kiwi.data.datasets.parallel_dataset

Module contents:
- InputConfig
- TrainingConfig
- TestConfig
- ParallelDataset
- read_file(path, reader)
logger
Module-level logger.
InputConfig

Bases: pydantic.BaseModel

source
Path to a corpus file in the source language.

target
Path to a corpus file in the target language.
input
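Since these configuration classes extend pydantic.BaseModel, fields are validated and coerced on construction. A minimal sketch of the shape described above (the Path field types and the example file names are assumptions, not taken from the actual class):

```python
from pathlib import Path

from pydantic import BaseModel


# Sketch of the InputConfig shape described above; the Path annotations
# are assumptions — the real class may declare other types or validators.
class InputConfig(BaseModel):
    source: Path  # path to a corpus file in the source language
    target: Path  # path to a corpus file in the target language


cfg = InputConfig(source="corpus.de", target="corpus.en")
```

pydantic coerces the string arguments to Path objects and raises a validation error when a required field is missing.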
ParallelDataset

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.

Note: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
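The map-style protocol described above amounts to just two methods. A minimal sketch in plain Python (no torch dependency; PairDataset and the sample sentences are illustrative, not part of the module):

```python
# Minimal sketch of the map-style protocol that torch.utils.data.Dataset
# expects subclasses to implement (plain Python; names are illustrative).
class PairDataset:
    def __init__(self, sources, targets):
        self.sources = sources
        self.targets = targets

    def __getitem__(self, index):
        # Fetch the sample stored under the given integer key.
        return {"source": self.sources[index], "target": self.targets[index]}

    def __len__(self):
        # Samplers and DataLoader use this to know the dataset size.
        return len(self.sources)


dataset = PairDataset(["ein Haus", "zwei Autos"], ["a house", "two cars"])
```

An instance with these two methods can be handed to a DataLoader as-is; with non-integral keys, a custom Sampler yielding those keys would be needed, as the note above says.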
Config
buffer_size
Number of consecutive instances to be temporarily stored in the buffer, which will be used later for batching/bucketing.
train
valid
test
split
Fraction used to split the train dataset when no validation set is given.
ensure_there_is_validation_data
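The split behaviour can be illustrated with a small helper (hypothetical; split_examples is not part of the module's API): keep split of the shuffled examples for training and the remaining 1 - split as validation data.

```python
import random


def split_examples(examples, split=0.9, seed=42):
    # Hypothetical helper, not OpenKiwi API: keep `split` of the shuffled
    # examples for training and the remaining 1 - split for validation.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(examples) * split)
    train = [examples[i] for i in indices[:cut]]
    valid = [examples[i] for i in indices[cut:]]
    return train, valid


train, valid = split_examples(list(range(100)), split=0.9)
```

With split=0.9, 100 examples yield 90 for training and 10 for validation.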
build
Build training, validation, and test datasets.

Parameters:
config – configuration object with file paths and processing flags; check out the docs for Config.
directory – if provided and paths in configuration are not absolute, use it to anchor them.
train – whether to build the training dataset.
valid – whether to build the validation dataset.
test – whether to build the testing dataset.
split (float) – If no validation set is provided, randomly sample \(1 - split\) of the training examples as the validation set.
__getitem__
Get a row with data from all fields, or all rows for a given field.
__len__
__contains__
sort_key
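The dual indexing behaviour of __getitem__ (an integer key returns one row across all fields; a field name returns the whole column) can be sketched as follows. FieldsDataset is hypothetical, not the actual implementation:

```python
# Hypothetical sketch of a dataset whose __getitem__ serves both access
# patterns described above; FieldsDataset is not the actual implementation.
class FieldsDataset:
    def __init__(self, columns):
        self.columns = columns  # e.g. {"source": [...], "target": [...]}

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]  # all rows for a given field
        # a row with data from all fields
        return {field: col[key] for field, col in self.columns.items()}

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __contains__(self, field):
        return field in self.columns


ds = FieldsDataset({"source": ["ein Haus"], "target": ["a house"]})
```

Here ds[0] returns a row dict, ds["target"] returns the target column, and __contains__ checks field membership.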