NuQE training

Contents

data
data processing options
vocabulary options
hyper-parameters

usage: kiwi train [-h] --train-source TRAIN_SOURCE --train-target TRAIN_TARGET
                  --train-alignments TRAIN_ALIGNMENTS
                  [--train-source-tags TRAIN_SOURCE_TAGS]
                  [--train-target-tags TRAIN_TARGET_TAGS] --valid-source
                  VALID_SOURCE --valid-target VALID_TARGET --valid-alignments
                  VALID_ALIGNMENTS [--valid-source-tags VALID_SOURCE_TAGS]
                  [--valid-target-tags VALID_TARGET_TAGS]
                  [--valid-source-pos VALID_SOURCE_POS]
                  [--valid-target-pos VALID_TARGET_POS]
                  [--predict-target [PREDICT_TARGET]]
                  [--predict-gaps [PREDICT_GAPS]]
                  [--predict-source [PREDICT_SOURCE]]
                  [--wmt18-format [WMT18_FORMAT]]
                  [--source-max-length SOURCE_MAX_LENGTH]
                  [--source-min-length SOURCE_MIN_LENGTH]
                  [--target-max-length TARGET_MAX_LENGTH]
                  [--target-min-length TARGET_MIN_LENGTH]
                  [--source-vocab-size SOURCE_VOCAB_SIZE]
                  [--target-vocab-size TARGET_VOCAB_SIZE]
                  [--source-vocab-min-frequency SOURCE_VOCAB_MIN_FREQUENCY]
                  [--target-vocab-min-frequency TARGET_VOCAB_MIN_FREQUENCY]
                  [--keep-rare-words-with-embeddings [KEEP_RARE_WORDS_WITH_EMBEDDINGS]]
                  [--add-embeddings-vocab [ADD_EMBEDDINGS_VOCAB]]
                  [--embeddings-format {polyglot,word2vec,fasttext,glove,text}]
                  [--embeddings-binary [EMBEDDINGS_BINARY]]
                  [--source-embeddings SOURCE_EMBEDDINGS]
                  [--target-embeddings TARGET_EMBEDDINGS]
                  [--bad-weight BAD_WEIGHT] [--window-size WINDOW_SIZE]
                  [--max-aligned MAX_ALIGNED]
                  [--source-embeddings-size SOURCE_EMBEDDINGS_SIZE]
                  [--target-embeddings-size TARGET_EMBEDDINGS_SIZE]
                  [--freeze-embeddings [FREEZE_EMBEDDINGS]]
                  [--embeddings-dropout EMBEDDINGS_DROPOUT]
                  [--hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]]
                  [--dropout DROPOUT]
                  [--init-type {uniform,normal,constant,glorot_uniform,glorot_normal}]
                  [--init-support INIT_SUPPORT]

data

`--train-source`	Path to training source file
`--train-target`	Path to training target file
`--train-alignments`
	Path to training alignments between source and target.
`--train-source-tags`
	Path to training label file for source (WMT18 format)
`--train-target-tags`
	Path to training label file for target
`--valid-source`	Path to validation source file
`--valid-target`	Path to validation target file
`--valid-alignments`
	Path to validation alignments between source and target.
`--valid-source-tags`
	Path to validation label file for source (WMT18 format)
`--valid-target-tags`
	Path to validation label file for target
`--valid-source-pos`
	Path to validation PoS tags file for source
`--valid-target-pos`
	Path to validation PoS tags file for target
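
As a minimal sketch of how the data options fit together, the snippet below assembles a `kiwi train` command line from the required paths plus the target tag files. All file paths are hypothetical placeholders, and `build_command` is an illustrative helper, not part of the tool. A real run may also need general options that are not documented on this page.

```python
import shlex

# Hypothetical file locations; replace them with your own data files.
data_args = {
    "--train-source": "data/train.src",
    "--train-target": "data/train.mt",
    "--train-alignments": "data/train.src-mt.alignments",
    "--train-target-tags": "data/train.tags",
    "--valid-source": "data/dev.src",
    "--valid-target": "data/dev.mt",
    "--valid-alignments": "data/dev.src-mt.alignments",
    "--valid-target-tags": "data/dev.tags",
}


def build_command(args):
    """Flatten an option-to-value mapping into an argv list for `kiwi train`."""
    argv = ["kiwi", "train"]
    for flag, value in args.items():
        argv += [flag, str(value)]
    return argv


command = build_command(data_args)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

The resulting list can also be passed directly to `subprocess.run(command, check=True)` instead of being pasted into a shell.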

data processing options

`--predict-target`
	Predict Target Tags. Leave unchanged for WMT17 format. Default: True
`--predict-gaps`	Predict Gap Tags. Default: False
`--predict-source`
	Predict Source Tags. Default: False
`--wmt18-format`	Read target tags in WMT18 format. Default: False
`--source-max-length`
	Maximum source sequence length to keep. Default: inf
`--source-min-length`
	Minimum source sequence length to keep. Default: 1
`--target-max-length`
	Maximum target sequence length to keep. Default: inf
`--target-min-length`
	Minimum target sequence length to keep. Default: 1
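
The four length options can be pictured as a filter over sentence pairs. The sketch below assumes that a pair whose source or target length falls outside the configured bounds is dropped rather than shortened; `keep_example` is a hypothetical helper written for illustration, not the tool's own function.

```python
import math


def keep_example(source_tokens, target_tokens,
                 source_min_length=1, source_max_length=math.inf,
                 target_min_length=1, target_max_length=math.inf):
    """Return True if both sides fall inside their length bounds (assumed behavior)."""
    return (source_min_length <= len(source_tokens) <= source_max_length
            and target_min_length <= len(target_tokens) <= target_max_length)


pairs = [
    ("a b c".split(), "x y z".split()),      # within bounds on both sides
    ("".split(), "x".split()),               # empty source, below the minimum of 1
    ("a b c d e f".split(), "x y".split()),  # source longer than the maximum of 5
]

# Keep only the pairs that pass the filter with a source maximum of 5 tokens.
kept = [pair for pair in pairs if keep_example(*pair, source_max_length=5)]
print(len(kept))
```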

vocabulary options

`--source-vocab-size`
	Size of the source vocabulary.
`--target-vocab-size`
	Size of the target vocabulary.
`--source-vocab-min-frequency`
	Min word frequency for source vocabulary. Default: 1
`--target-vocab-min-frequency`
	Min word frequency for target vocabulary. Default: 1
`--keep-rare-words-with-embeddings`
	Keep words that occur fewer than min-frequency times but are in the embeddings vocabulary. Default: False
`--add-embeddings-vocab`
	Add words from embeddings vocabulary to source/target vocabulary. Default: False
`--embeddings-format`
	Possible choices: polyglot, word2vec, fasttext, glove, text. Word embeddings format. See README for specific formatting instructions. Default: “polyglot”
`--embeddings-binary`
	Load embeddings stored in binary. Default: False
`--source-embeddings`
	Path to word embeddings file for source.
`--target-embeddings`
	Path to word embeddings file for target.
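
To illustrate how the vocabulary options might interact, here is a minimal sketch of a vocabulary builder. It rests on several assumptions: the size cap is applied after the frequency filter, words from the embeddings vocabulary are appended after the cap, and special tokens are ignored. `build_vocab` and its parameter names are hypothetical and only mirror the options above.

```python
from collections import Counter


def build_vocab(sentences, max_size=None, min_frequency=1,
                embeddings_vocab=frozenset(),
                keep_rare_with_embeddings=False,
                add_embeddings_vocab=False):
    """Build a word list from tokenized sentences (illustrative, assumed semantics)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # Most frequent words first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    vocab = []
    for word, frequency in ranked:
        frequent_enough = frequency >= min_frequency
        # A rare word is rescued only if it has a pretrained embedding.
        rescued = keep_rare_with_embeddings and word in embeddings_vocab
        if frequent_enough or rescued:
            vocab.append(word)

    if max_size is not None:
        vocab = vocab[:max_size]

    if add_embeddings_vocab:
        seen = set(vocab)
        vocab += sorted(w for w in embeddings_vocab if w not in seen)
    return vocab


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
pretrained = {"cat", "fish"}

print(build_vocab(sentences, min_frequency=2))
print(build_vocab(sentences, min_frequency=2, embeddings_vocab=pretrained,
                  keep_rare_with_embeddings=True))
```

With a minimum frequency of 2, only "the" and "sat" survive the first call. The second call additionally keeps "cat", which is rare in the data but present in the hypothetical embeddings vocabulary.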

hyper-parameters

`--bad-weight`	Relative weight for bad labels. Default: 3.0
`--window-size`	Sliding window size. Default: 3
`--max-aligned`	Max number of alignments between source and target. Default: 5
`--source-embeddings-size`
	Word embedding size for source. Default: 50
`--target-embeddings-size`
	Word embedding size for target. Default: 50
`--freeze-embeddings`
	Freeze embedding weights during training. Default: False
`--embeddings-dropout`
	Dropout rate for embedding layers. Default: 0.0
`--hidden-sizes`	List of hidden sizes. Default: [400, 200, 100, 50]
`--dropout`	Dropout rate for linear layers. Default: 0.0
`--init-type`	Possible choices: uniform, normal, constant, glorot_uniform, glorot_normal. Distribution type for parameters initialization. Default: “uniform”
`--init-support`	Parameters are initialized from a uniform distribution with support (-INIT_SUPPORT, INIT_SUPPORT). Use 0 to skip initialization. Default: 0.1
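
Two of these hyper-parameters are easier to picture with a small example. The sketch below is an assumption about their meaning, not the model's actual code: `weighted_nll` treats `--bad-weight` as a class weight on BAD labels in a weighted mean negative log-likelihood, and `sliding_windows` treats `--window-size` as the width of a padded context window centered on each token. Both function names and the `<pad>` symbol are hypothetical.

```python
import math


def weighted_nll(probs_bad, labels, bad_weight=3.0):
    """Weighted mean negative log-likelihood; BAD tokens count bad_weight times."""
    total, weight_sum = 0.0, 0.0
    for p_bad, label in zip(probs_bad, labels):
        if label == "BAD":
            weight, p_correct = bad_weight, p_bad
        else:
            weight, p_correct = 1.0, 1.0 - p_bad
        total += -weight * math.log(p_correct)
        weight_sum += weight
    # Normalize by the sum of the weights, as in a weighted mean.
    return total / weight_sum


def sliding_windows(tokens, window_size=3, pad="<pad>"):
    """Return one window per token, padding the sequence edges (odd sizes assumed)."""
    half = window_size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + window_size] for i in range(len(tokens))]


loss = weighted_nll([0.5, 0.5], ["OK", "BAD"], bad_weight=3.0)
windows = sliding_windows(["a", "b", "c"], window_size=3)
print(loss)
print(windows)
```

Raising the weight on BAD labels is a common way to counter label imbalance when BAD tokens are much rarer than OK tokens, because errors on BAD tokens then contribute more to the loss.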