kiwi.data package

Submodules

kiwi.data.builders module

kiwi.data.builders.build_dataset(fieldset, prefix='', filter_pred=None, **kwargs)[source]
kiwi.data.builders.build_test_dataset(fieldset, load_vocab=None, **kwargs)[source]

Build a test QE dataset.

Parameters:
  • fieldset (Fieldset) – specific set of fields to be used (depends on the model to be used).
  • load_vocab – A path to a saved vocabulary.
Returns:

A Dataset object.
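
A usage sketch; the fieldset comes from the model being run, and the paths and extra keyword arguments below are hypothetical:

    from kiwi.data.builders import build_test_dataset

    # `fieldset` is provided by the chosen model; paths are hypothetical.
    test_dataset = build_test_dataset(
        fieldset,
        load_vocab='runs/vocab.torch',  # path to a saved vocabulary (hypothetical)
        test_source='data/test.src',    # extra kwargs depend on the fieldset
        test_target='data/test.mt',
    )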

kiwi.data.builders.build_training_datasets(fieldset, split=0.0, valid_source=None, valid_target=None, load_vocab=None, **kwargs)[source]

Build training and validation QE datasets.

Required Args:
  • fieldset (Fieldset) – specific set of fields to be used (depends on the model to be used).
  • train_source – Train Source
  • train_target – Train Target (MT)

Optional Args (depend on the model):
  • train_pe – Train Post-edited
  • train_target_tags – Train Target Tags
  • train_source_tags – Train Source Tags
  • train_sentence_scores – Train HTER scores
  • valid_source – Valid Source
  • valid_target – Valid Target (MT)
  • valid_pe – Valid Post-edited
  • valid_target_tags – Valid Target Tags
  • valid_source_tags – Valid Source Tags
  • valid_sentence_scores – Valid HTER scores
  • split (float) – if no validation sets are provided, randomly sample 1 - split of the training examples as the validation set.
  • target_vocab_size – maximum size of the target vocabulary
  • source_vocab_size – maximum size of the source vocabulary
  • target_max_length – maximum length for the target field
  • target_min_length – minimum length for the target field
  • source_max_length – maximum length for the source field
  • source_min_length – minimum length for the source field
  • target_vocab_min_freq – minimum word frequency for the target field
  • source_vocab_min_freq – minimum word frequency for the source field
  • load_vocab – path to an existing vocabulary file

Returns:

A training and a validation Dataset.
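
A minimal usage sketch, assuming a model-provided fieldset; the data paths are hypothetical and the accepted keyword arguments depend on the fieldset:

    from kiwi.data.builders import build_training_datasets

    # `fieldset` comes from the chosen model; all paths below are hypothetical.
    train_dataset, valid_dataset = build_training_datasets(
        fieldset,
        split=0.9,  # keep 90% for training, sample 10% as validation
        train_source='data/train.src',
        train_target='data/train.mt',
        train_target_tags='data/train.tags',
        source_vocab_size=50000,
        target_vocab_size=50000,
    )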

kiwi.data.corpus module

class kiwi.data.corpus.Corpus(fields_examples=None, dataset_fields=None)[source]

Bases: object

examples_per_field()[source]
classmethod from_files(fields, files)[source]

Create a QualityEstimationDataset given paths and fields.

Parameters:
  • fields – A dict between field name and field object.
  • files – A dict between field name and file dict (with ‘name’ and ‘format’ keys).
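
A sketch of how the two dicts line up, with hypothetical Field objects and file specs; the 'text' format value is an assumption:

    from kiwi.data.corpus import Corpus

    fields = {'source': source_field, 'target': target_field}  # hypothetical Fields
    files = {
        'source': {'name': 'data/train.src', 'format': 'text'},
        'target': {'name': 'data/train.mt', 'format': 'text'},
    }
    corpus = Corpus.from_files(fields=fields, files=files)
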
classmethod from_tabular_file(fields, file_fields, file_path, sep='\t')[source]

Create a QualityEstimationDataset given paths and fields.

Parameters:
  • fields – A dict between field name and field object.
  • file_fields – A list of field names for each column of the file (by order). File fields not in fields will be ignored, but every field in fields should correspond to some column.
  • file_path – Path to the tabular file.
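
A sketch with hypothetical fields, reading a tab-separated file; a column listed in file_fields but absent from fields is simply skipped:

    from kiwi.data.corpus import Corpus

    corpus = Corpus.from_tabular_file(
        fields={'source': source_field, 'target': target_field},
        file_fields=['source', 'target', 'score'],  # 'score' has no Field, so it is ignored
        file_path='data/train.tsv',
        sep='\t',
    )
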
paste_fields(corpus)[source]

Pastes (appends) fields from another corpus.

Parameters:
  • corpus – A corpus object. Must have the same number of examples as the current corpus.
static read_tabular_file(file_path, sep='\t', extract_column=None)[source]

kiwi.data.iterators module

kiwi.data.iterators.build_bucket_iterator(dataset, device, batch_size, is_train)[source]
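
A usage sketch; the expected type of device follows torchtext conventions and is an assumption here:

    from kiwi.data.iterators import build_bucket_iterator

    iterator = build_bucket_iterator(train_dataset, device=None,  # None: torchtext default device
                                     batch_size=64, is_train=True)
    for batch in iterator:
        pass  # batches are bucketed by similar length to reduce padding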

kiwi.data.qe_dataset module

class kiwi.data.qe_dataset.QEDataset(examples, fields, filter_pred=None)[source]

Bases: torchtext.data.dataset.Dataset

Defines a dataset for quality estimation, based on the WMT 201X data.

static sort_key(ex)[source]
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)[source]

Create train/test (and optionally validation) splits from the instance’s examples.

Parameters:
  • split_ratio (float or List of floats) – a number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test, and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
  • stratified (bool) – whether the sampling should be stratified. Default is False.
  • strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
  • random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
Returns:

Datasets for train, validation, and test splits in that order, if the splits are provided.

Return type:

Tuple[Dataset]
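
A sketch of both calling conventions described above; note that the ratio list is ordered (train, test, valid) while the returned tuple is (train, valid, test):

    # Three-way split: 80% train, 10% test, 10% valid (list order: train, test, valid).
    train, valid, test = dataset.split(split_ratio=[0.8, 0.1, 0.1])

    # Two-way split: 70% train, the rest used for validation.
    train, valid = dataset.split(split_ratio=0.7)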

kiwi.data.tokenizers module

kiwi.data.tokenizers.align_reversed_tokenizer(s)[source]

Return a list of pairs of integers for each sentence.

kiwi.data.tokenizers.align_tokenizer(s)[source]

Return a list of pairs of integers for each sentence.
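
An illustrative sketch; the 'src-tgt' alignment input format (as produced by tools like fast_align) and the exact pair type are assumptions:

    >>> align_tokenizer('0-0 1-2 2-1')
    [(0, 0), (1, 2), (2, 1)]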

kiwi.data.tokenizers.tokenizer(sentence)[source]

Implement your own tokenization procedure.

kiwi.data.utils module

kiwi.data.utils.build_vocabulary(fields_vocab_options, *datasets)[source]
kiwi.data.utils.cross_split_dataset(dataset, splits)[source]
kiwi.data.utils.deserialize_fields_from_vocabs(fields, vocabs)[source]

Load serialized vocabularies into their fields.

kiwi.data.utils.deserialize_vocabs(vocabs)[source]

Restore defaultdict lost in serialization.

kiwi.data.utils.fields_from_vocabs(fields, vocabs)[source]

Load Field objects from a vocabs dict. Adapted from OpenNMT.

kiwi.data.utils.fields_to_vocabs(fields)[source]

Extract a Vocab dictionary from a Fields dictionary.

Parameters:
  • fields – A dict mapping field names to Field objects.
Returns:

A dict mapping field names to Vocabularies.
kiwi.data.utils.filter_len(x, source_min_length=1, source_max_length=inf, target_min_length=1, target_max_length=inf)[source]
kiwi.data.utils.hter_to_binary(x)[source]

Transform an HTER score into a binary OK/BAD label.
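
A minimal sketch of the intended transformation, not the library's actual implementation; the zero threshold and string labels are assumptions:

    def hter_to_binary_sketch(score, threshold=0.0):
        # OK if the sentence needs no post-editing, BAD otherwise (assumed convention)
        return 'OK' if score <= threshold else 'BAD'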

kiwi.data.utils.load_datasets(directory, *datasets_names)[source]
kiwi.data.utils.load_training_datasets(directory, fieldset)[source]
kiwi.data.utils.load_vocabularies_to_datasets(vocab_path, *datasets)[source]
kiwi.data.utils.load_vocabularies_to_fields(vocab_path, fields)[source]

Load serialized Vocabularies from disk into fields.

kiwi.data.utils.project(batch, *args)[source]

Projection onto the first argument.

Needed to create a postprocessing pipeline that implements the identity.

kiwi.data.utils.read_file(path)[source]

Reads a file into a list of lists of words.
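
Illustrative behaviour, assuming whitespace tokenization of each line:

    >>> # data/example.txt contains two lines: "the quick fox" and "hello world"
    >>> read_file('data/example.txt')
    [['the', 'quick', 'fox'], ['hello', 'world']]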

kiwi.data.utils.save_datasets(directory, **named_datasets)[source]

Pickle datasets to a standard file in the given directory.

Note that fields cannot be saved as part of a dataset, so they are ignored.

Parameters:
  • directory (str or Path) – directory where the dataset pickles will be saved.
  • named_datasets (dict) – mapping of names to their respective datasets.
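
A usage sketch with a hypothetical output directory; the keyword names become the dataset names on disk:

    from kiwi.data.utils import save_datasets

    save_datasets('runs/datasets', train=train_dataset, valid=valid_dataset)
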
kiwi.data.utils.save_file(file_path, data, token_sep=' ', example_sep='\n')[source]
kiwi.data.utils.save_predicted_probabilities(directory, predictions, prefix='')[source]
kiwi.data.utils.save_training_datasets(directory, train_dataset, valid_dataset)[source]
kiwi.data.utils.save_vocabularies_from_datasets(directory, *datasets)[source]
kiwi.data.utils.save_vocabularies_from_fields(directory, fields, include_vectors=False)[source]

Save the Vocab objects contained in Field objects to a vocab.pt file. Adapted from OpenNMT.

kiwi.data.utils.serialize_fields_to_vocabs(fields)[source]

Save the Vocab objects contained in Field objects to a vocab.pt file. Adapted from OpenNMT.

kiwi.data.utils.serialize_vocabs(vocabs, include_vectors=False)[source]

Make vocab dictionary serializable.

kiwi.data.utils.vocab_loaded_if_needed(field)[source]
kiwi.data.utils.wmt18_to_gaps(batch, *args)[source]

Extract gap tags from a WMT18-format file.

kiwi.data.utils.wmt18_to_target(batch, *args)[source]

Extract target tags from a WMT18-format file.

kiwi.data.vectors module

class kiwi.data.vectors.WordEmbeddings(name, emb_format='polyglot', binary=True, map_fn=<function WordEmbeddings.<lambda>>, **kwargs)[source]

Bases: torchtext.vocab.Vectors

cache(name, cache, url=None, max_vectors=None)[source]
kiwi.data.vectors.map_to_polyglot(token)[source]

kiwi.data.vocabulary module

class kiwi.data.vocabulary.Vocabulary(counter, max_size=None, min_freq=1, specials=None, vectors=None, unk_init=None, vectors_cache=None, rare_with_vectors=True, add_vectors_vocab=False)[source]

Bases: torchtext.vocab.Vocab

Defines a vocabulary object that will be used to numericalize a field.

freqs

A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.

stoi

A collections.defaultdict instance mapping token strings to numerical identifiers.

itos

A list of token strings indexed by their numerical identifiers.
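
A sketch of how these attributes are typically used, assuming a vocabulary already built for some (hypothetical) field:

    vocab = target_field.vocab    # hypothetical built Vocabulary
    idx = vocab.stoi['house']     # token -> id; unknown tokens fall back to the <unk> id
    token = vocab.itos[idx]       # id -> token
    count = vocab.freqs['house']  # raw frequency in the data used to build the vocab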

kiwi.data.vocabulary.merge_vocabularies(vocab_a, vocab_b, max_size=None, vectors=None, **kwargs)[source]

Module contents