kiwi.data.encoders.parallel_data_encoder

Module Contents

Classes

InputFields

EmbeddingsConfig

Paths to word embeddings file for each input field.

VocabularyConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

ParallelDataEncoder

kiwi.data.encoders.parallel_data_encoder.logger
kiwi.data.encoders.parallel_data_encoder.T
class kiwi.data.encoders.parallel_data_encoder.InputFields

Bases: pydantic.generics.GenericModel, Generic[T]

source :T
target :T
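The generic container above holds one value of the same type for each input field. As a rough illustration only, the following stdlib sketch mirrors that shape with a dataclass; `InputFieldsSketch` is a hypothetical stand-in, not the pydantic `GenericModel` used by the library, and it performs no validation.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class InputFieldsSketch(Generic[T]):
    """Hypothetical stand-in: one value of type T per input field."""

    source: T
    target: T


# The same container shape can carry different payload types,
# e.g. a frequency threshold or an optional size limit per field.
min_frequency = InputFieldsSketch[int](source=1, target=2)
max_size = InputFieldsSketch[Optional[int]](source=None, target=50000)
```

Parametrizing by `T` is what lets options such as `min_frequency` and `max_size` reuse one source/target structure with different value types.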
class kiwi.data.encoders.parallel_data_encoder.EmbeddingsConfig

Bases: kiwi.utils.io.BaseConfig

Paths to word embeddings file for each input field.

source :Optional[Path]
target :Optional[Path]
format :Literal['polyglot', 'word2vec', 'fasttext', 'glove', 'text'] = 'polyglot'

Word embeddings format. See README for specific formatting instructions.
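To make the three fields concrete, here is a minimal sketch of a configuration object with the same field names. `EmbeddingsConfigSketch` is hypothetical and uses a plain dataclass rather than pydantic; the `None` defaults for `source` and `target` and the example file path are assumptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Format names copied from the `format` field documented above.
SUPPORTED_FORMATS = ("polyglot", "word2vec", "fasttext", "glove", "text")


@dataclass
class EmbeddingsConfigSketch:
    """Hypothetical stand-in for EmbeddingsConfig (no pydantic validation)."""

    source: Optional[Path] = None  # assumption: no embeddings unless a path is given
    target: Optional[Path] = None
    format: str = "polyglot"

    def __post_init__(self):
        # Mimic the Literal constraint by rejecting unknown format names.
        if self.format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported embeddings format: {self.format!r}")


# Hypothetical path; only the source field gets pretrained embeddings here.
config = EmbeddingsConfigSketch(source=Path("embeddings/source.vec"), format="fasttext")
```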

class kiwi.data.encoders.parallel_data_encoder.VocabularyConfig

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

min_frequency :InputFields[PositiveInt] = 1

Add a word to the vocabulary only if it occurs more than this number of times in the training dataset (does not apply to loaded or pretrained vocabularies).

max_size :InputFields[Optional[PositiveInt]]

Limit the vocabulary to at most this many words (does not apply to loaded or pretrained vocabularies).

keep_rare_words_with_embeddings = False

Keep words that occur less often than min_frequency but are in the embeddings vocabulary.

add_embeddings_vocab = False

Add words from the embeddings vocabulary to the source/target vocabulary.

check_nested_options(cls, v)
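The four VocabularyConfig options above interact, so a small sketch may help. This is not the library's implementation: `build_vocab_sketch` is a hypothetical function, the threshold is treated as inclusive (`count >= min_frequency`), and the size cap is applied before embedding-only words are appended. Both choices are assumptions, and the actual behaviour may differ.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set


def build_vocab_sketch(
    tokens: Iterable[str],
    min_frequency: int = 1,
    max_size: Optional[int] = None,
    embeddings_vocab: Optional[Set[str]] = None,
    keep_rare_words_with_embeddings: bool = False,
    add_embeddings_vocab: bool = False,
) -> List[str]:
    counts = Counter(tokens)
    embeddings_vocab = embeddings_vocab or set()

    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    vocab = []
    for word, count in ranked:
        frequent = count >= min_frequency  # assumption: inclusive threshold
        rescued = keep_rare_words_with_embeddings and word in embeddings_vocab
        if frequent or rescued:
            vocab.append(word)

    # Assumption: the size cap is applied before embedding-only words are added.
    if max_size is not None:
        vocab = vocab[:max_size]

    if add_embeddings_vocab:
        seen = set(vocab)
        vocab.extend(sorted(w for w in embeddings_vocab if w not in seen))
    return vocab


tokens = "the cat sat on the mat the cat".split()
emb = {"mat", "dog"}
```

With these inputs, `min_frequency=2` alone keeps only "the" and "cat"; enabling `keep_rare_words_with_embeddings` also keeps "mat", and enabling `add_embeddings_vocab` appends "dog" even though it never occurs in the training tokens.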
class kiwi.data.encoders.parallel_data_encoder.ParallelDataEncoder(config: Config, field_encoders: Dict[str, TextEncoder] = None)

Bases: kiwi.data.encoders.base.DataEncoders

class Config

Bases: kiwi.utils.io.BaseConfig

Base class for all pydantic configs. Used to configure base behaviour of configs.

share_input_fields_encoders :bool = False

Share encoders and vocabularies between the source and target fields.

vocab :VocabularyConfig
embeddings :Optional[EmbeddingsConfig]
warn_missing_feature(cls, v)
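One way to picture share_input_fields_encoders is that both fields point at the same encoder object, so fitting either field grows a single joint vocabulary. The sketch below illustrates that idea only; `TextEncoderSketch` and `make_field_encoders` are hypothetical names, and the library's actual wiring may differ.

```python
from typing import Dict, List


class TextEncoderSketch:
    """Hypothetical minimal encoder that owns a growable vocabulary."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, sentences: List[str]) -> None:
        # Assign the next free index to every unseen whitespace token.
        for sentence in sentences:
            for token in sentence.split():
                self.vocab.setdefault(token, len(self.vocab))


def make_field_encoders(share_input_fields_encoders: bool) -> Dict[str, TextEncoderSketch]:
    source_encoder = TextEncoderSketch()
    # Sharing means both fields reference the same encoder object.
    target_encoder = source_encoder if share_input_fields_encoders else TextEncoderSketch()
    return {"source": source_encoder, "target": target_encoder}


shared = make_field_encoders(share_input_fields_encoders=True)
shared["source"].fit(["a house"])
shared["target"].fit(["uma casa"])

separate = make_field_encoders(share_input_fields_encoders=False)
separate["source"].fit(["a house"])
separate["target"].fit(["uma casa"])
```

In the shared case the single vocabulary ends up with all four tokens; in the separate case each field only knows its own two.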
fit_vocabularies(self, dataset: ParallelDataset)
load_vocabularies(self, load_vocabs_from: Path = None, overwrite: bool = False)

Load serialized Vocabularies from disk into fields.

vocabularies_from_dict(self, vocabs_dict: Dict, overwrite: bool = False)
property vocabularies(self)

Return the vocabularies for all encoders that have one.

Returns

A dict mapping encoder names to Vocabulary instances.
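The behaviour described for the vocabularies property, returning entries only for encoders that have a vocabulary, can be sketched as a filtered dict comprehension. The classes below are hypothetical stand-ins, and representing "no vocabulary" as `None` is an assumption.

```python
from typing import Dict, Optional


class EncoderSketch:
    """Hypothetical encoder; `vocab` stays None when it has no vocabulary."""

    def __init__(self, vocab: Optional[Dict[str, int]] = None) -> None:
        self.vocab = vocab


class DataEncodersSketch:
    def __init__(self, field_encoders: Dict[str, EncoderSketch]) -> None:
        self.field_encoders = field_encoders

    @property
    def vocabularies(self) -> Dict[str, Dict[str, int]]:
        # Skip encoders without a vocabulary instead of returning None entries.
        return {
            name: encoder.vocab
            for name, encoder in self.field_encoders.items()
            if encoder.vocab is not None
        }


encoders = DataEncodersSketch(
    {
        "source": EncoderSketch({"hello": 0}),
        "target": EncoderSketch({"olá": 0}),
        "alignments": EncoderSketch(),  # hypothetical field with no vocabulary
    }
)
```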

collate_fn(self, samples, device=None)