kiwi.data.datasets.parallel_dataset

Module contents:
- InputConfig
- TrainingConfig
- TestConfig
- ParallelDataset
- read_file(path, reader)
logger
Module-level logger.
InputConfig

Bases: pydantic.BaseModel

source
Path to a corpus file in the source language.

target
Path to a corpus file in the target language.
input
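Since these configuration classes extend pydantic.BaseModel, fields are validated and coerced on construction. A minimal sketch of the shape described above (the Path field types and the example file names are assumptions, not taken from the actual class):

```python
from pathlib import Path

from pydantic import BaseModel


# Sketch of the InputConfig shape described above; the Path annotations
# are assumptions — the real class may declare other types or validators.
class InputConfig(BaseModel):
    source: Path  # path to a corpus file in the source language
    target: Path  # path to a corpus file in the target language


cfg = InputConfig(source="corpus.de", target="corpus.en")
```

pydantic coerces the string arguments to Path objects and raises a validation error when a required field is missing.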
ParallelDataset

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.

Note: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
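The map-style protocol described above amounts to just two methods. A minimal sketch in plain Python (no torch dependency; PairDataset and the sample sentences are illustrative, not part of the module):

```python
# Minimal sketch of the map-style protocol that torch.utils.data.Dataset
# expects subclasses to implement (plain Python; names are illustrative).
class PairDataset:
    def __init__(self, sources, targets):
        self.sources = sources
        self.targets = targets

    def __getitem__(self, index):
        # Fetch the sample stored under the given integer key.
        return {"source": self.sources[index], "target": self.targets[index]}

    def __len__(self):
        # Samplers and DataLoader use this to know the dataset size.
        return len(self.sources)


dataset = PairDataset(["ein Haus", "zwei Autos"], ["a house", "two cars"])
```

An instance with these two methods can be handed to a DataLoader as-is; with non-integral keys, a custom Sampler yielding those keys would be needed, as the note above says.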
Config
buffer_size
Number of consecutive instances to be temporarily stored in the buffer, which will be used later for batching/bucketing.
train
valid
test
split
Fraction used to split the train dataset when no validation set is given.
ensure_there_is_validation_data
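The split behaviour can be illustrated with a small helper (hypothetical; split_examples is not part of the module's API): keep split of the shuffled examples for training and the remaining 1 - split as validation data.

```python
import random


def split_examples(examples, split=0.9, seed=42):
    # Hypothetical helper, not OpenKiwi API: keep `split` of the shuffled
    # examples for training and the remaining 1 - split for validation.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(examples) * split)
    train = [examples[i] for i in indices[:cut]]
    valid = [examples[i] for i in indices[cut:]]
    return train, valid


train, valid = split_examples(list(range(100)), split=0.9)
```

With split=0.9, 100 examples yield 90 for training and 10 for validation.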
build
Build training, validation, and test datasets.

Parameters:
config – configuration object with file paths and processing flags; check out the docs for Config.
directory – if provided and paths in configuration are not absolute, use it to anchor them.
train – whether to build the training dataset.
valid – whether to build the validation dataset.
test – whether to build the testing dataset.
split (float) – If no validation set is provided, randomly sample \(1 - split\) of the training examples as the validation set.
__getitem__
Get a row with data from all fields, or all rows for a given field.
__len__
__contains__
sort_key
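The dual indexing behaviour of __getitem__ (an integer key returns one row across all fields; a field name returns the whole column) can be sketched as follows. FieldsDataset is hypothetical, not the actual implementation:

```python
# Hypothetical sketch of a dataset whose __getitem__ serves both access
# patterns described above; FieldsDataset is not the actual implementation.
class FieldsDataset:
    def __init__(self, columns):
        self.columns = columns  # e.g. {"source": [...], "target": [...]}

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]  # all rows for a given field
        # a row with data from all fields
        return {field: col[key] for field, col in self.columns.items()}

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __contains__(self, field):
        return field in self.columns


ds = FieldsDataset({"source": ["ein Haus"], "target": ["a house"]})
```

Here ds[0] returns a row dict, ds["target"] returns the target column, and __contains__ checks field membership.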