kiwi.data package

Submodules

kiwi.data.builders module

kiwi.data.builders.build_dataset(fieldset, prefix='', filter_pred=None, **kwargs)[source]
kiwi.data.builders.build_test_dataset(fieldset, load_vocab=None, **kwargs)[source]

Build a test QE dataset.

Parameters:
  • fieldset (Fieldset) – specific set of fields to be used (depends on the model to be used).
  • load_vocab – A path to a saved vocabulary.
Returns:

A Dataset object.
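
A usage sketch; the fieldset comes from the model being run, and the paths and extra keyword arguments below are hypothetical:

    from kiwi.data.builders import build_test_dataset

    # `fieldset` is provided by the chosen model; paths are hypothetical.
    test_dataset = build_test_dataset(
        fieldset,
        load_vocab='runs/vocab.torch',  # path to a saved vocabulary (hypothetical)
        test_source='data/test.src',    # extra kwargs depend on the fieldset
        test_target='data/test.mt',
    )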

kiwi.data.builders.build_training_datasets(fieldset, split=0.0, valid_source=None, valid_target=None, load_vocab=None, **kwargs)[source]

Build training and validation QE datasets.

Required Args:
  • fieldset (Fieldset) – specific set of fields to be used (depends on the model to be used).
  • train_source – Train Source
  • train_target – Train Target (MT)

Optional Args (depend on the model):
  • train_pe – Train Post-edited
  • train_target_tags – Train Target Tags
  • train_source_tags – Train Source Tags
  • train_sentence_scores – Train HTER scores
  • valid_source – Valid Source
  • valid_target – Valid Target (MT)
  • valid_pe – Valid Post-edited
  • valid_target_tags – Valid Target Tags
  • valid_source_tags – Valid Source Tags
  • valid_sentence_scores – Valid HTER scores
  • split (float) – if no validation sets are provided, randomly sample 1 - split of the training examples as the validation set.
  • target_vocab_size – maximum size of the target vocabulary
  • source_vocab_size – maximum size of the source vocabulary
  • target_max_length – maximum length for the target field
  • target_min_length – minimum length for the target field
  • source_max_length – maximum length for the source field
  • source_min_length – minimum length for the source field
  • target_vocab_min_freq – minimum word frequency for the target field
  • source_vocab_min_freq – minimum word frequency for the source field
  • load_vocab – path to an existing vocabulary file

Returns:

A training and a validation Dataset.
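
A minimal usage sketch, assuming a model-provided fieldset; the data paths are hypothetical and the accepted keyword arguments depend on the fieldset:

    from kiwi.data.builders import build_training_datasets

    # `fieldset` comes from the chosen model; all paths below are hypothetical.
    train_dataset, valid_dataset = build_training_datasets(
        fieldset,
        split=0.9,  # keep 90% for training, sample 10% as validation
        train_source='data/train.src',
        train_target='data/train.mt',
        train_target_tags='data/train.tags',
        source_vocab_size=50000,
        target_vocab_size=50000,
    )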

kiwi.data.corpus module

class kiwi.data.corpus.Corpus(fields_examples=None, dataset_fields=None)[source]

Bases: object

examples_per_field()[source]
classmethod from_files(fields, files)[source]

Create a QualityEstimationDataset given paths and fields.

Parameters:
  • fields – A dict between field name and field object.
  • files – A dict between field name and file dict (with ‘name’ and ‘format’ keys).
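
A sketch of how the two dicts line up, with hypothetical Field objects and file specs; the 'text' format value is an assumption:

    from kiwi.data.corpus import Corpus

    fields = {'source': source_field, 'target': target_field}  # hypothetical Fields
    files = {
        'source': {'name': 'data/train.src', 'format': 'text'},
        'target': {'name': 'data/train.mt', 'format': 'text'},
    }
    corpus = Corpus.from_files(fields=fields, files=files)
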
classmethod from_tabular_file(fields, file_fields, file_path, sep='\t')[source]

Create a QualityEstimationDataset given paths and fields.

Parameters:
  • fields – A dict between field name and field object.
  • file_fields – A list of field names for each column of the file (by order). File fields not in fields will be ignored, but every field in fields should correspond to some column.
  • file_path – Path to the tabular file.
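
A sketch with hypothetical fields, reading a tab-separated file; a column listed in file_fields but absent from fields is simply skipped:

    from kiwi.data.corpus import Corpus

    corpus = Corpus.from_tabular_file(
        fields={'source': source_field, 'target': target_field},
        file_fields=['source', 'target', 'score'],  # 'score' has no Field, so it is ignored
        file_path='data/train.tsv',
        sep='\t',
    )
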
paste_fields(corpus)[source]

Pastes (appends) fields from another corpus.

Parameters:
  • corpus – A corpus object. Must have the same number of examples as the current corpus.
static read_tabular_file(file_path, sep='\t', extract_column=None)[source]

kiwi.data.iterators module

kiwi.data.iterators.build_bucket_iterator(dataset, device, batch_size, is_train)[source]
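
A usage sketch; the expected type of device follows torchtext conventions and is an assumption here:

    from kiwi.data.iterators import build_bucket_iterator

    iterator = build_bucket_iterator(train_dataset, device=None,  # None: torchtext default device
                                     batch_size=64, is_train=True)
    for batch in iterator:
        pass  # batches are bucketed by similar length to reduce padding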

kiwi.data.qe_dataset module

class kiwi.data.qe_dataset.QEDataset(examples, fields, filter_pred=None)[source]

Bases: torchtext.data.dataset.Dataset

Defines a dataset for quality estimation, based on the WMT 201X data.

static sort_key(ex)[source]
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)[source]

Create train/test (and optionally validation) splits from the instance’s examples.

Parameters:
  • split_ratio (float or List of floats) – a number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test, and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
  • stratified (bool) – whether the sampling should be stratified. Default is False.
  • strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
  • random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
Returns:

Datasets for train, validation, and test splits in that order, if the splits are provided.

Return type:

Tuple[Dataset]
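
A sketch of both calling conventions described above; note that the ratio list is ordered (train, test, valid) while the returned tuple is (train, valid, test):

    # Three-way split: 80% train, 10% test, 10% valid (list order: train, test, valid).
    train, valid, test = dataset.split(split_ratio=[0.8, 0.1, 0.1])

    # Two-way split: 70% train, the rest used for validation.
    train, valid = dataset.split(split_ratio=0.7)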

kiwi.data.tokenizers module

kiwi.data.tokenizers.align_reversed_tokenizer(s)[source]

Return a list of pairs of integers for each sentence.

kiwi.data.tokenizers.align_tokenizer(s)[source]

Return a list of pairs of integers for each sentence.
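
An illustrative sketch; the 'src-tgt' alignment input format (as produced by tools like fast_align) and the exact pair type are assumptions:

    >>> align_tokenizer('0-0 1-2 2-1')
    [(0, 0), (1, 2), (2, 1)]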

kiwi.data.tokenizers.tokenizer(sentence)[source]

Implement your own tokenization procedure.

kiwi.data.utils module

kiwi.data.utils.build_vocabulary(fields_vocab_options, *datasets)[source]
kiwi.data.utils.cross_split_dataset(dataset, splits)[source]
kiwi.data.utils.deserialize_fields_from_vocabs(fields, vocabs)[source]

Load serialized vocabularies into their fields.

kiwi.data.utils.deserialize_vocabs(vocabs)[source]

Restore defaultdict lost in serialization.

kiwi.data.utils.fields_from_vocabs(fields, vocabs)[source]

Load Field objects from a vocabs dict. Adapted from OpenNMT.

kiwi.data.utils.fields_to_vocabs(fields)[source]

Extract a Vocab dictionary from a Fields dictionary.

Parameters:
  • fields – A dict mapping field names to Field objects.
Returns:

A dict mapping field names to Vocabularies.
kiwi.data.utils.filter_len(x, source_min_length=1, source_max_length=inf, target_min_length=1, target_max_length=inf)[source]
kiwi.data.utils.hter_to_binary(x)[source]

Transform an HTER score into a binary OK/BAD label.
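
A minimal sketch of the intended transformation, not the library's actual implementation; the zero threshold and string labels are assumptions:

    def hter_to_binary_sketch(score, threshold=0.0):
        # OK if the sentence needs no post-editing, BAD otherwise (assumed convention)
        return 'OK' if score <= threshold else 'BAD'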

kiwi.data.utils.load_datasets(directory, *datasets_names)[source]
kiwi.data.utils.load_training_datasets(directory, fieldset)[source]
kiwi.data.utils.load_vocabularies_to_datasets(vocab_path, *datasets)[source]
kiwi.data.utils.load_vocabularies_to_fields(vocab_path, fields)[source]

Load serialized Vocabularies from disk into fields.

kiwi.data.utils.project(batch, *args)[source]

Projection onto the first argument.

Needed to create a postprocessing pipeline that implements the identity.

kiwi.data.utils.read_file(path)[source]

Reads a file into a list of lists of words.
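
Illustrative behaviour, assuming whitespace tokenization of each line:

    >>> # data/example.txt contains two lines: "the quick fox" and "hello world"
    >>> read_file('data/example.txt')
    [['the', 'quick', 'fox'], ['hello', 'world']]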

kiwi.data.utils.save_datasets(directory, **named_datasets)[source]

Pickle datasets to a standard file in the given directory.

Note that fields cannot be saved as part of a dataset, so they are ignored.

Parameters:
  • directory (str or Path) – directory where the dataset pickles will be saved.
  • named_datasets (dict) – mapping of names to their respective datasets.
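
A usage sketch with a hypothetical output directory; the keyword names become the dataset names on disk:

    from kiwi.data.utils import save_datasets

    save_datasets('runs/datasets', train=train_dataset, valid=valid_dataset)
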
kiwi.data.utils.save_file(file_path, data, token_sep=' ', example_sep='\n')[source]
kiwi.data.utils.save_predicted_probabilities(directory, predictions, prefix='')[source]
kiwi.data.utils.save_training_datasets(directory, train_dataset, valid_dataset)[source]
kiwi.data.utils.save_vocabularies_from_datasets(directory, *datasets)[source]
kiwi.data.utils.save_vocabularies_from_fields(directory, fields, include_vectors=False)[source]

Save the Vocab objects contained in Field objects to a vocab.pt file. Adapted from OpenNMT.

kiwi.data.utils.serialize_fields_to_vocabs(fields)[source]

Save the Vocab objects contained in Field objects to a vocab.pt file. Adapted from OpenNMT.

kiwi.data.utils.serialize_vocabs(vocabs, include_vectors=False)[source]

Make vocab dictionary serializable.

kiwi.data.utils.vocab_loaded_if_needed(field)[source]
kiwi.data.utils.wmt18_to_gaps(batch, *args)[source]

Extract gap tags from a WMT18-format file.

kiwi.data.utils.wmt18_to_target(batch, *args)[source]

Extract target tags from a WMT18-format file.

kiwi.data.vectors module

class kiwi.data.vectors.WordEmbeddings(name, emb_format='polyglot', binary=True, map_fn=<function WordEmbeddings.<lambda>>, **kwargs)[source]

Bases: torchtext.vocab.Vectors

cache(name, cache, url=None, max_vectors=None)[source]
kiwi.data.vectors.map_to_polyglot(token)[source]

kiwi.data.vocabulary module

class kiwi.data.vocabulary.Vocabulary(counter, max_size=None, min_freq=1, specials=None, vectors=None, unk_init=None, vectors_cache=None, rare_with_vectors=True, add_vectors_vocab=False)[source]

Bases: torchtext.vocab.Vocab

Defines a vocabulary object that will be used to numericalize a field.

freqs

A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.

stoi

A collections.defaultdict instance mapping token strings to numerical identifiers.

itos

A list of token strings indexed by their numerical identifiers.
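
A sketch of how these attributes are typically used, assuming a vocabulary already built for some (hypothetical) field:

    vocab = target_field.vocab    # hypothetical built Vocabulary
    idx = vocab.stoi['house']     # token -> id; unknown tokens fall back to the <unk> id
    token = vocab.itos[idx]       # id -> token
    count = vocab.freqs['house']  # raw frequency in the data used to build the vocab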

kiwi.data.vocabulary.merge_vocabularies(vocab_a, vocab_b, max_size=None, vectors=None, **kwargs)[source]

Module contents