kiwi.data package
Submodules
kiwi.data.builders module
kiwi.data.builders.build_test_dataset(fieldset, load_vocab=None, **kwargs)
Build a test QE dataset.

Parameters:
- fieldset (Fieldset) – the specific set of fields to be used (depends on the model).
- load_vocab – a path to a saved vocabulary.

Returns: a Dataset object.
kiwi.data.builders.build_training_datasets(fieldset, split=0.0, valid_source=None, valid_target=None, load_vocab=None, **kwargs)
Build training and validation QE datasets.

Required Args:
- fieldset (Fieldset): the specific set of fields to be used (depends on the model).
- train_source: train source
- train_target: train target (MT)

Optional Args (depending on the model):
- train_pe: train post-edited
- train_target_tags: train target tags
- train_source_tags: train source tags
- train_sentence_scores: train HTER scores
- valid_source: validation source
- valid_target: validation target (MT)
- valid_pe: validation post-edited
- valid_target_tags: validation target tags
- valid_source_tags: validation source tags
- valid_sentence_scores: validation HTER scores
- split (float): if no validation sets are provided, randomly sample 1 - split of the training examples as the validation set.
- target_vocab_size: maximum size of the target vocabulary
- source_vocab_size: maximum size of the source vocabulary
- target_max_length: maximum length for the target field
- target_min_length: minimum length for the target field
- source_max_length: maximum length for the source field
- source_min_length: minimum length for the source field
- target_vocab_min_freq: minimum word frequency for the target field
- source_vocab_min_freq: minimum word frequency for the source field
- load_vocab: path to an existing vocabulary file

Returns: a training and a validation Dataset.
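The fallback validation split described for the split argument can be sketched in plain Python. This is only an illustration of the documented semantics (the function name and example format are hypothetical, not kiwi's actual implementation):

```python
import random

def split_examples(examples, split=0.8, seed=42):
    """Keep a `split` fraction of the examples for training and use the
    remaining 1 - split fraction as the validation set (sketch only)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * split))
    return shuffled[:cut], shuffled[cut:]

train, valid = split_examples(range(100), split=0.8)
# 80 training examples, 20 validation examples
```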
kiwi.data.corpus module
class kiwi.data.corpus.Corpus(fields_examples=None, dataset_fields=None)
Bases: object

classmethod from_files(fields, files)
Create a QualityEstimationDataset given paths and fields.

Parameters:
- fields – a dict mapping field names to field objects.
- files – a dict mapping field names to file dicts (with 'name' and 'format' keys).
classmethod from_tabular_file(fields, file_fields, file_path, sep='\t')
Create a QualityEstimationDataset given paths and fields.

Parameters:
- fields – a dict mapping field names to field objects.
- file_fields – a list of field names, one per column of the file (in order). File fields not in fields are ignored, but every field in fields must correspond to some column.
- file_path – path to the tabular file.
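The column-to-field mapping that from_tabular_file describes can be sketched with plain file parsing. This hypothetical helper only illustrates how file_fields names each column and how columns absent from fields are dropped; it is not kiwi's actual code:

```python
import io

def read_tabular(file_obj, file_fields, fields, sep="\t"):
    """Map each tab-separated column to the field named at the same
    position in `file_fields`; ignore columns not present in `fields`."""
    columns = {name: [] for name in file_fields if name in fields}
    for line in file_obj:
        values = line.rstrip("\n").split(sep)
        for name, value in zip(file_fields, values):
            if name in fields:  # columns not named in `fields` are ignored
                columns[name].append(value)
    return columns

data = io.StringIO("hello world\tHallo Welt\t0.1\nbye\tTschuess\t0.0\n")
cols = read_tabular(data, ["source", "target", "hter"], {"source", "target"})
# cols == {"source": ["hello world", "bye"], "target": ["Hallo Welt", "Tschuess"]}
```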
kiwi.data.iterators module
kiwi.data.qe_dataset module
class kiwi.data.qe_dataset.QEDataset(examples, fields, filter_pred=None)
Bases: torchtext.data.dataset.Dataset

Defines a dataset for quality estimation. Based on the WMT 201X data.
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
Create train-test(-valid?) splits from the instance's examples.

Parameters:
- split_ratio (float or list of floats) – a number in [0, 1] giving the fraction of the data used for the training split (the rest is used for validation), or a list giving the relative sizes of the train, test, and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default: 0.7 (for the train set).
- stratified (bool) – whether the sampling should be stratified. Default: False.
- strata_field (str) – name of the examples' Field to stratify over. Default: 'label', the conventional label field.
- random_state (tuple) – the random seed used for shuffling; a return value of random.getstate().

Returns: datasets for the train, validation, and test splits, in that order, if the splits are provided.
Return type: Tuple[Dataset]
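How the split_ratio argument is interpreted can be sketched as follows. This hypothetical helper only mirrors the documented semantics (float means train fraction, list means relative sizes); it is not torchtext's actual implementation:

```python
def resolve_split_ratio(split_ratio):
    """A float in [0, 1] yields (train, rest-for-validation); a list is
    normalized into the relative sizes of train, test, and optional valid."""
    if isinstance(split_ratio, float):
        return (split_ratio, 1.0 - split_ratio)
    total = sum(split_ratio)
    return tuple(r / total for r in split_ratio)

resolve_split_ratio(0.7)        # train and validation fractions, ~ (0.7, 0.3)
resolve_split_ratio([8, 1, 1])  # (0.8, 0.1, 0.1)
```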
kiwi.data.tokenizers module
kiwi.data.tokenizers.align_reversed_tokenizer(s)
Return a list of integer pairs for each sentence.
kiwi.data.utils module
kiwi.data.utils.deserialize_fields_from_vocabs(fields, vocabs)
Load serialized vocabularies into their fields.
kiwi.data.utils.fields_from_vocabs(fields, vocabs)
Load Field objects from a vocabs dict. Adapted from OpenNMT.
kiwi.data.utils.fields_to_vocabs(fields)
Extract a vocabulary dictionary from a fields dictionary.

Args:
- fields: a dict mapping field names to Field objects

Returns:
- vocab: a dict mapping field names to Vocabularies
kiwi.data.utils.filter_len(x, source_min_length=1, source_max_length=inf, target_min_length=1, target_max_length=inf)
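Given its signature, filter_len is presumably a length-filter predicate over examples. A minimal sketch under that assumption (the dict-based example structure is also an assumption, not kiwi's actual code):

```python
import math

def filter_len(example, source_min_length=1, source_max_length=math.inf,
               target_min_length=1, target_max_length=math.inf):
    """Keep an example only if both its source and target token counts
    fall inside the given bounds (illustrative sketch)."""
    return (source_min_length <= len(example["source"]) <= source_max_length
            and target_min_length <= len(example["target"]) <= target_max_length)

ex = {"source": "a b c".split(), "target": "x y".split()}
filter_len(ex)                       # True
filter_len(ex, source_max_length=2)  # False: 3 source tokens > 2
```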
kiwi.data.utils.load_vocabularies_to_fields(vocab_path, fields)
Load serialized vocabularies from disk into fields.
kiwi.data.utils.project(batch, *args)
Projection onto the first argument. Needed to create a postprocessing pipeline that implements the identity.
kiwi.data.utils.save_datasets(directory, **named_datasets)
Pickle datasets to a standard file in directory.
Note that fields cannot be saved as part of a dataset, so they are ignored.
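The behavior described for save_datasets can be sketched with the standard pickle module. The exact filename scheme below is an assumption, not kiwi's actual one:

```python
import pickle
from pathlib import Path

def save_datasets(directory, **named_datasets):
    """Pickle each keyword-named dataset to <name>.pickle inside
    `directory` (illustrative sketch; filename scheme assumed)."""
    for name, dataset in named_datasets.items():
        with open(Path(directory) / f"{name}.pickle", "wb") as f:
            pickle.dump(dataset, f)

# e.g. save_datasets("runs/data", train=train_dataset, valid=valid_dataset)
```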
kiwi.data.utils.save_vocabularies_from_fields(directory, fields, include_vectors=False)
Save the Vocab objects in Field objects to a vocab.pt file. Adapted from OpenNMT.
kiwi.data.utils.serialize_fields_to_vocabs(fields)
Save the Vocab objects in Field objects to a vocab.pt file. Adapted from OpenNMT.
kiwi.data.vectors module
kiwi.data.vocabulary module
class kiwi.data.vocabulary.Vocabulary(counter, max_size=None, min_freq=1, specials=None, vectors=None, unk_init=None, vectors_cache=None, rare_with_vectors=True, add_vectors_vocab=False)
Bases: torchtext.vocab.Vocab

Defines a vocabulary object that will be used to numericalize a field.

freqs
A collections.Counter object holding the frequencies of the tokens in the data used to build the Vocab.

stoi
A collections.defaultdict instance mapping token strings to numerical identifiers.

itos
A list of token strings indexed by their numerical identifiers.
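How the freqs, stoi, and itos attributes fit together can be shown with plain collections. This is a sketch of the documented data structures, not kiwi's or torchtext's actual construction logic (the sort order and specials are assumptions):

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat".split()
freqs = Counter(tokens)            # token -> frequency
specials = ["<unk>", "<pad>"]      # assumed special tokens, indexed first
itos = specials + sorted(freqs, key=lambda t: (-freqs[t], t))
stoi = defaultdict(lambda: itos.index("<unk>"),
                   {tok: i for i, tok in enumerate(itos)})

itos[stoi["cat"]]     # "cat": stoi and itos are inverse mappings
stoi["unseen-token"]  # falls back to the <unk> index, 0
```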