NuQE training
usage: kiwi train [-h] --train-source TRAIN_SOURCE --train-target TRAIN_TARGET
--train-alignments TRAIN_ALIGNMENTS
[--train-source-tags TRAIN_SOURCE_TAGS]
[--train-target-tags TRAIN_TARGET_TAGS] --valid-source
VALID_SOURCE --valid-target VALID_TARGET --valid-alignments
VALID_ALIGNMENTS [--valid-source-tags VALID_SOURCE_TAGS]
[--valid-target-tags VALID_TARGET_TAGS]
[--valid-source-pos VALID_SOURCE_POS]
[--valid-target-pos VALID_TARGET_POS]
[--predict-target [PREDICT_TARGET]]
[--predict-gaps [PREDICT_GAPS]]
[--predict-source [PREDICT_SOURCE]]
[--wmt18-format [WMT18_FORMAT]]
[--source-max-length SOURCE_MAX_LENGTH]
[--source-min-length SOURCE_MIN_LENGTH]
[--target-max-length TARGET_MAX_LENGTH]
[--target-min-length TARGET_MIN_LENGTH]
[--source-vocab-size SOURCE_VOCAB_SIZE]
[--target-vocab-size TARGET_VOCAB_SIZE]
[--source-vocab-min-frequency SOURCE_VOCAB_MIN_FREQUENCY]
[--target-vocab-min-frequency TARGET_VOCAB_MIN_FREQUENCY]
[--keep-rare-words-with-embeddings [KEEP_RARE_WORDS_WITH_EMBEDDINGS]]
[--add-embeddings-vocab [ADD_EMBEDDINGS_VOCAB]]
[--embeddings-format {polyglot,word2vec,fasttext,glove,text}]
[--embeddings-binary [EMBEDDINGS_BINARY]]
[--source-embeddings SOURCE_EMBEDDINGS]
[--target-embeddings TARGET_EMBEDDINGS]
[--bad-weight BAD_WEIGHT] [--window-size WINDOW_SIZE]
[--max-aligned MAX_ALIGNED]
[--source-embeddings-size SOURCE_EMBEDDINGS_SIZE]
[--target-embeddings-size TARGET_EMBEDDINGS_SIZE]
[--freeze-embeddings [FREEZE_EMBEDDINGS]]
[--embeddings-dropout EMBEDDINGS_DROPOUT]
[--hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]]
[--dropout DROPOUT]
[--init-type {uniform,normal,constant,glorot_uniform,glorot_normal}]
[--init-support INIT_SUPPORT]
data
--train-source | Path to training source file |
--train-target | Path to training target file |
--train-alignments | Path to training alignments between source and target |
--train-source-tags | Path to training label file for source (WMT18 format) |
--train-target-tags | Path to training label file for target |
--valid-source | Path to validation source file |
--valid-target | Path to validation target file |
--valid-alignments | Path to validation alignments between source and target |
--valid-source-tags | Path to validation label file for source (WMT18 format) |
--valid-target-tags | Path to validation label file for target |
--valid-source-pos | Path to validation PoS tags file for source |
--valid-target-pos | Path to validation PoS tags file for target |
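As an illustration of how the data options fit together, the following minimal sketch assembles a `kiwi train` argument list from the flags listed above and prints it. The file paths are hypothetical placeholders, and the helper name `build_train_command` is invented for this example. The sketch only builds the command line and does not run it; a real invocation may need options that this page does not cover.

```python
import shlex

# Hypothetical file locations; replace them with your own data paths.
data_files = {
    "train-source": "data/train.src",
    "train-target": "data/train.mt",
    "train-alignments": "data/train.src-mt.alignments",
    "train-target-tags": "data/train.tags",
    "valid-source": "data/dev.src",
    "valid-target": "data/dev.mt",
    "valid-alignments": "data/dev.src-mt.alignments",
    "valid-target-tags": "data/dev.tags",
}


def build_train_command(files):
    """Return an argument list for `kiwi train` (illustrative helper)."""
    argv = ["kiwi", "train"]
    for flag, path in files.items():
        # Each key is a flag name from the data section, without the leading dashes.
        argv += [f"--{flag}", path]
    return argv


command = build_train_command(data_files)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the paths in a dictionary makes it easy to check that the training and validation files come in matching pairs before launching a run.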
data processing options
--predict-target | Predict target tags. Leave unchanged for WMT17 format. Default: True |
--predict-gaps | Predict gap tags. Default: False |
--predict-source | Predict source tags. Default: False |
--wmt18-format | Read target tags in WMT18 format. Default: False |
--source-max-length | Maximum source sequence length to keep. Default: inf |
--source-min-length | Minimum source sequence length to keep. Default: 1 |
--target-max-length | Maximum target sequence length to keep. Default: inf |
--target-min-length | Minimum target sequence length to keep. Default: 1 |
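The length options can be read as a filter over sentence pairs. The sketch below shows one plausible interpretation, assuming a pair is kept only when both sides fall inside their minimum and maximum bounds. The function name `keep_pair` and the sample sentences are invented for illustration, and the actual filtering logic in the tool may differ.

```python
import math


def keep_pair(source_tokens, target_tokens,
              source_min_length=1, source_max_length=math.inf,
              target_min_length=1, target_max_length=math.inf):
    """Return True when both sequences lie inside their length bounds."""
    return (source_min_length <= len(source_tokens) <= source_max_length
            and target_min_length <= len(target_tokens) <= target_max_length)


# Made-up example pairs: (source tokens, target tokens).
pairs = [
    ("das ist gut".split(), "this is good".split()),
    ([], "empty source".split()),
    ("ein sehr langer satz mit vielen wörtern".split(), "a long sentence".split()),
]

# Mirrors `--source-max-length 5` with every other bound left at its default.
kept = [pair for pair in pairs if keep_pair(*pair, source_max_length=5)]
print(len(kept))
```

With the defaults (minimum 1, maximum inf), only empty sequences would be dropped under this reading.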
vocabulary options
--source-vocab-size | Size of the source vocabulary. |
--target-vocab-size | Size of the target vocabulary. |
--source-vocab-min-frequency | Minimum word frequency for source vocabulary. Default: 1 |
--target-vocab-min-frequency | Minimum word frequency for target vocabulary. Default: 1 |
--keep-rare-words-with-embeddings | Keep words that occur fewer than min-frequency times but are in the embeddings vocabulary. Default: False |
--add-embeddings-vocab | Add words from the embeddings vocabulary to the source/target vocabulary. Default: False |
--embeddings-format | Possible choices: polyglot, word2vec, fasttext, glove, text. Word embeddings format. See README for specific formatting instructions. Default: “polyglot” |
--embeddings-binary | Load embeddings stored in binary format. Default: False |
--source-embeddings | Path to word embeddings file for source. |
--target-embeddings | Path to word embeddings file for target. |
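To make the interaction between the vocabulary options concrete, here is a minimal sketch of how a size cap, a minimum frequency, and the two embeddings-related switches could combine. It is an assumption-laden illustration, not the tool's implementation: the function `build_vocab`, the order in which the cap and the frequency threshold are applied, and the toy data are all hypothetical.

```python
from collections import Counter


def build_vocab(tokens, max_size=None, min_frequency=1,
                embedding_words=frozenset(),
                keep_rare_words_with_embeddings=False,
                add_embeddings_vocab=False):
    """Build a word list from tokens (illustrative only)."""
    counts = Counter(tokens)
    vocab = []
    # Assumption: the size cap is applied first, to the most frequent words.
    for word, freq in counts.most_common(max_size):
        is_rare = freq < min_frequency
        rescued = keep_rare_words_with_embeddings and word in embedding_words
        if is_rare and not rescued:
            continue
        vocab.append(word)
    if add_embeddings_vocab:
        # Append embedding words that are not in the vocabulary yet.
        vocab += sorted(w for w in embedding_words if w not in vocab)
    return vocab


tokens = "the cat sat on the mat the cat".split()
embedding_words = {"mat", "dog"}  # words covered by a pretend embeddings file

plain = build_vocab(tokens, min_frequency=2)
rescued = build_vocab(tokens, min_frequency=2,
                      embedding_words=embedding_words,
                      keep_rare_words_with_embeddings=True)
extended = build_vocab(tokens, min_frequency=2,
                       embedding_words=embedding_words,
                       keep_rare_words_with_embeddings=True,
                       add_embeddings_vocab=True)
print(plain, rescued, extended)
```

In this toy run, "mat" occurs only once but survives when rare words with embeddings are kept, and "dog" enters the vocabulary only when embedding words are added.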
hyper-parameters
--bad-weight | Relative weight for bad labels. Default: 3.0 |
--window-size | Sliding window size. Default: 3 |
--max-aligned | Max number of alignments between source and target. Default: 5 |
--source-embeddings-size | Word embedding size for source. Default: 50 |
--target-embeddings-size | Word embedding size for target. Default: 50 |
--freeze-embeddings | Freeze embedding weights during training. Default: False |
--embeddings-dropout | Dropout rate for embedding layers. Default: 0.0 |
--hidden-sizes | List of hidden sizes. Default: [400, 200, 100, 50] |
--dropout | Dropout rate for linear layers. Default: 0.0 |
--init-type | Possible choices: uniform, normal, constant, glorot_uniform, glorot_normal. Distribution type for parameter initialization. Default: “uniform” |
--init-support | Parameters are initialized over a uniform distribution with support (-param_init, param_init), where param_init is the value given here. Use 0 to disable initialization. Default: 0.1 |
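Two of these hyper-parameters are easier to picture with a small example. The sketch below shows one way a sliding window of size 3 could gather context around each token, and how a relative weight of 3.0 for BAD labels could enter a negative log-likelihood. Both functions, the padding token, and the choice to normalize by the sum of weights are assumptions made for illustration and are not taken from the tool's source.

```python
import math

PAD = "<pad>"  # hypothetical padding token for positions outside the sentence


def sliding_windows(tokens, window_size=3):
    """Return one window of window_size tokens for each token position."""
    half = window_size // 2
    padded = [PAD] * half + list(tokens) + [PAD] * half
    return [padded[i:i + window_size] for i in range(len(tokens))]


def weighted_nll(prob_bad, labels, bad_weight=3.0):
    """Weighted mean negative log-likelihood over OK/BAD labels.

    prob_bad[i] is the predicted probability that token i is BAD.
    Assumption: the loss is normalized by the sum of the label weights.
    """
    total, norm = 0.0, 0.0
    for p, label in zip(prob_bad, labels):
        if label == "BAD":
            total += -bad_weight * math.log(p)
            norm += bad_weight
        else:
            total += -math.log(1.0 - p)
            norm += 1.0
    return total / norm


windows = sliding_windows(["a", "b", "c"], window_size=3)
loss = weighted_nll([0.5, 0.5], ["BAD", "OK"], bad_weight=3.0)
print(windows)
print(loss)
```

Raising the BAD weight makes errors on BAD tokens cost more than errors on OK tokens, which is a common way to counter the usual imbalance between the two labels.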