`kiwi.data.encoders.parallel_data_encoder`

Module Contents

`InputFields`
`EmbeddingsConfig`	Paths to word embeddings file for each input field.
`VocabularyConfig`	Base class for all pydantic configs. Used to configure base behaviour of configs.
`ParallelDataEncoder`

class kiwi.data.encoders.parallel_data_encoder.InputFields

Bases: pydantic.generics.GenericModel, Generic[T]

class kiwi.data.encoders.parallel_data_encoder.EmbeddingsConfig

Paths to word embeddings file for each input field.

format :Literal['polyglot', 'word2vec', 'fasttext', 'glove', 'text'] = 'polyglot': Word embeddings format. See the README for specific formatting instructions.

class kiwi.data.encoders.parallel_data_encoder.VocabularyConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

min_frequency :InputFields[PositiveInt] = 1: Only add to the vocabulary words that occur more than this number of times in the training dataset (doesn't apply to loaded or pretrained vocabularies).

max_size :InputFields[Optional[PositiveInt]]: Only create a vocabulary with up to this many words (doesn't apply to loaded or pretrained vocabularies).

keep_rare_words_with_embeddings = False: Keep words that occur fewer than min_frequency times but are in the embeddings vocabulary.

add_embeddings_vocab = False: Add words from the embeddings vocabulary to the source/target vocabulary.
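
To make the interaction between these four options concrete, here is a minimal, standard-library sketch of a vocabulary builder. It is not OpenKiwi's implementation: `build_vocabulary` and its `embeddings_vocab` argument are hypothetical, the frequency threshold is assumed to be inclusive (`count >= min_frequency`), and `max_size` is assumed to cap only the words taken from the training data, not those added from the embeddings vocabulary.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set


def build_vocabulary(
    tokens: Iterable[str],
    min_frequency: int = 1,
    max_size: Optional[int] = None,
    embeddings_vocab: Optional[Set[str]] = None,
    keep_rare_words_with_embeddings: bool = False,
    add_embeddings_vocab: bool = False,
) -> List[str]:
    """Hypothetical helper illustrating the options; not part of OpenKiwi."""
    embeddings_vocab = embeddings_vocab or set()
    counts = Counter(tokens)
    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    vocab: List[str] = []
    for word, count in ranked:
        if max_size is not None and len(vocab) >= max_size:
            break
        # Assumption: the threshold is inclusive.
        frequent_enough = count >= min_frequency
        # Rare words survive only if they have a pretrained embedding.
        rescued = keep_rare_words_with_embeddings and word in embeddings_vocab
        if frequent_enough or rescued:
            vocab.append(word)

    if add_embeddings_vocab:
        # Assumption: embedding-only words are appended regardless of max_size.
        known = set(vocab)
        vocab.extend(sorted(w for w in embeddings_vocab if w not in known))
    return vocab


tokens = "the cat sat on the mat the cat".split()
print(build_vocabulary(tokens, min_frequency=2))
print(
    build_vocabulary(
        tokens,
        min_frequency=2,
        embeddings_vocab={"mat"},
        keep_rare_words_with_embeddings=True,
    )
)
```

With `min_frequency=2`, only "the" and "cat" pass the threshold; enabling `keep_rare_words_with_embeddings` additionally keeps "mat" because it appears in the (hypothetical) embeddings vocabulary.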

class kiwi.data.encoders.parallel_data_encoder.ParallelDataEncoder(config: Config, field_encoders: Dict[str, TextEncoder] = None)

class Config

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

share_input_fields_encoders :bool = False: Share encoding/vocabs between source and target fields.
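
As a rough illustration of what sharing means here, the sketch below uses a toy encoder to show the difference between one encoder object serving both fields and two independent ones. `ToyEncoder` and `make_field_encoders` are hypothetical names invented for this example, and the `"source"`/`"target"` keys are assumed; the real class may wire its encoders differently.

```python
from typing import Dict, List


class ToyEncoder:
    """Hypothetical stand-in for a text encoder; maps tokens to integer ids."""

    def __init__(self) -> None:
        self.token_to_id: Dict[str, int] = {}

    def fit(self, tokens: List[str]) -> None:
        # Assign the next free id to every token not seen before.
        for token in tokens:
            self.token_to_id.setdefault(token, len(self.token_to_id))


def make_field_encoders(share_input_fields_encoders: bool) -> Dict[str, ToyEncoder]:
    """Return one encoder per field, reusing the same object when sharing."""
    source_encoder = ToyEncoder()
    target_encoder = source_encoder if share_input_fields_encoders else ToyEncoder()
    return {"source": source_encoder, "target": target_encoder}


shared = make_field_encoders(share_input_fields_encoders=True)
shared["source"].fit(["a", "b"])
shared["target"].fit(["b", "c"])
print(shared["target"].token_to_id)  # one joint vocabulary: a, b, c

separate = make_field_encoders(share_input_fields_encoders=False)
separate["source"].fit(["a", "b"])
separate["target"].fit(["b", "c"])
print(separate["target"].token_to_id)  # target-only vocabulary: b, c
```

In the shared case a token such as "b" receives the same id on both sides, which is typically why one would share encoders between fields that draw on overlapping vocabularies.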

load_vocabularies(self, load_vocabs_from: Path = None, overwrite: bool = False): Load serialized Vocabularies from disk into fields.

vocabularies_from_dict(self, vocabs_dict: Dict, overwrite: bool = False)

property vocabularies(self)

Return the vocabularies for all encoders that have one.
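
The following toy classes sketch how a `vocabularies` property and a dict-based loader could fit together. Everything here is hypothetical (`ToyFieldEncoder`, `ToyParallelEncoder`, the `vocab` attribute), and the `overwrite` behaviour shown, where an existing vocabulary is replaced only when `overwrite=True`, is one plausible reading of the parameter name rather than something this page confirms.

```python
from typing import Dict, List, Optional


class ToyFieldEncoder:
    """Hypothetical field encoder that may or may not hold a vocabulary."""

    def __init__(self, vocab: Optional[List[str]] = None) -> None:
        self.vocab = vocab


class ToyParallelEncoder:
    """Hypothetical container mirroring the method names documented above."""

    def __init__(self, field_encoders: Dict[str, ToyFieldEncoder]) -> None:
        self.field_encoders = field_encoders

    @property
    def vocabularies(self) -> Dict[str, List[str]]:
        # Only encoders that actually hold a vocabulary are reported.
        return {
            name: encoder.vocab
            for name, encoder in self.field_encoders.items()
            if encoder.vocab is not None
        }

    def vocabularies_from_dict(
        self, vocabs_dict: Dict[str, List[str]], overwrite: bool = False
    ) -> None:
        for name, vocab in vocabs_dict.items():
            encoder = self.field_encoders.get(name)
            if encoder is None:
                continue  # no encoder registered for this field
            # Assumption: an existing vocabulary is kept unless overwrite=True.
            if encoder.vocab is None or overwrite:
                encoder.vocab = list(vocab)


encoder = ToyParallelEncoder(
    {"source": ToyFieldEncoder(["a"]), "target": ToyFieldEncoder()}
)
print(encoder.vocabularies)
encoder.vocabularies_from_dict({"source": ["x"], "target": ["y"]})
print(encoder.vocabularies)
```

In this sketch the first call reports only the source vocabulary, and after loading, the target gains a vocabulary while the source keeps its original one because `overwrite` was left at `False`.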