Predictor training

These options are used to pre-train the predictor component of the predictor-estimator model.

Contents

data
validation data
data processing options
vocabulary options
PredEst data
predictor training
model-embeddings

usage: kiwi train [-h] --train-source TRAIN_SOURCE
                  [--train-target TRAIN_TARGET]
                  [--train-source-tags TRAIN_SOURCE_TAGS]
                  [--train-target-tags TRAIN_TARGET_TAGS]
                  [--train-pe TRAIN_PE]
                  [--train-sentence-scores TRAIN_SENTENCE_SCORES]
                  [--split SPLIT] [--valid-source VALID_SOURCE]
                  [--valid-target VALID_TARGET]
                  [--valid-alignments VALID_ALIGNMENTS]
                  [--valid-source-tags VALID_SOURCE_TAGS]
                  [--valid-target-tags VALID_TARGET_TAGS]
                  [--valid-pe VALID_PE]
                  [--valid-sentence-scores VALID_SENTENCE_SCORES]
                  [--predict-side {tags,source_tags,gap_tags}]
                  [--wmt18-format [WMT18_FORMAT]]
                  [--source-max-length SOURCE_MAX_LENGTH]
                  [--source-min-length SOURCE_MIN_LENGTH]
                  [--target-max-length TARGET_MAX_LENGTH]
                  [--target-min-length TARGET_MIN_LENGTH]
                  [--source-vocab-size SOURCE_VOCAB_SIZE]
                  [--target-vocab-size TARGET_VOCAB_SIZE]
                  [--source-vocab-min-frequency SOURCE_VOCAB_MIN_FREQUENCY]
                  [--target-vocab-min-frequency TARGET_VOCAB_MIN_FREQUENCY]
                  [--extend-source-vocab EXTEND_SOURCE_VOCAB]
                  [--extend-target-vocab EXTEND_TARGET_VOCAB]
                  [--warmup WARMUP] [--rnn-layers-pred RNN_LAYERS_PRED]
                  [--dropout-pred DROPOUT_PRED] [--hidden-pred HIDDEN_PRED]
                  [--out-embeddings-size OUT_EMBEDDINGS_SIZE]
                  [--embedding-sizes EMBEDDING_SIZES]
                  [--share-embeddings [SHARE_EMBEDDINGS]]
                  [--predict-inverse [PREDICT_INVERSE]]
                  [--source-embeddings-size SOURCE_EMBEDDINGS_SIZE]
                  [--target-embeddings-size TARGET_EMBEDDINGS_SIZE]

data

`--train-source`	Path to training source file
`--train-target`	Path to training target file
`--train-source-tags`
	Path to training label file for source (WMT18 format)
`--train-target-tags`
	Path to training label file for target
`--train-pe`	Path to file containing post-edited target.
`--train-sentence-scores`
	Path to file containing sentence level scores.
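
As a rough illustration of how the training data flags fit together, the sketch below assembles a `kiwi train` command line in Python without running it. The file paths are hypothetical placeholders, and any general options documented elsewhere are left out.

```python
import shlex

# Hypothetical file locations; substitute your own corpus paths.
train_files = {
    "--train-source": "data/train.src",
    "--train-target": "data/train.tgt",
}

# Start from the subcommand shown in the usage string,
# then append each flag followed by its value.
command = ["kiwi", "train"]
for flag, path in train_files.items():
    command += [flag, path]

# shlex.join quotes arguments safely for display or for pasting into a shell.
print(shlex.join(command))
```

Building the argument list first and joining it only for display keeps paths with spaces intact if the list is later passed to `subprocess.run`.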

validation data

`--split`	Split the training dataset when no validation set is given.
`--valid-source`	Path to validation source file
`--valid-target`	Path to validation target file
`--valid-alignments`
	Path to validation alignments between source and target.
`--valid-source-tags`
	Path to validation label file for source (WMT18 format)
`--valid-target-tags`
	Path to validation label file for target
`--valid-pe`	Path to file containing post-edited target.
`--valid-sentence-scores`
	Path to file containing sentence level scores.
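
The choice between explicit validation files and `--split` could be sketched as follows. The helper `validation_args` is hypothetical, and the value passed to `--split` is only a placeholder, since the accepted format is not described on this page.

```python
def validation_args(valid_source=None, valid_target=None, split=None):
    """Return validation flags: explicit files if given, otherwise --split.

    Hypothetical helper for illustration; it is not part of the kiwi CLI.
    """
    if valid_source and valid_target:
        return ["--valid-source", valid_source, "--valid-target", valid_target]
    if split is not None:
        return ["--split", str(split)]
    raise ValueError("provide validation files or a split value")


# Explicit validation files (hypothetical paths).
print(validation_args(valid_source="data/dev.src", valid_target="data/dev.tgt"))

# No validation set available: fall back to splitting the training data.
# The value 0.9 is a placeholder, not a documented default.
print(validation_args(split=0.9))
```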

data processing options

`--predict-side`	Possible choices: tags, source_tags, gap_tags. Tagset to predict. Leave unchanged for WMT17 format. Default: “tags”
`--wmt18-format`	Read target tags in WMT18 format. Default: False
`--source-max-length`
	Maximum source sequence length to keep. Default: inf
`--source-min-length`
	Minimum source sequence length to keep. Default: 0
`--target-max-length`
	Maximum target sequence length to keep. Default: inf
`--target-min-length`
	Minimum target sequence length to keep. Default: 0
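
One plausible reading of the length options is a filter that keeps a sentence pair only when both sides fall inside their length bounds. The sketch below illustrates that reading; the function `keep_pair` and the toy data are hypothetical, and the actual filtering in the toolkit may differ.

```python
import math


def keep_pair(source, target,
              source_min=0, source_max=math.inf,
              target_min=0, target_max=math.inf):
    """Return True if both token lists fall within their length bounds."""
    return (source_min <= len(source) <= source_max
            and target_min <= len(target) <= target_max)


# Toy tokenized sentence pairs (source tokens, target tokens).
pairs = [
    ("a b c".split(), "x y z".split()),
    ("a".split(), "x y".split()),
    ("a b c d e f".split(), "x y z".split()),
]

# Keep only pairs whose source side has between 2 and 5 tokens.
kept = [pair for pair in pairs if keep_pair(*pair, source_min=2, source_max=5)]
print(len(kept))
```

With the default bounds (0 and inf), every pair passes the filter.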

vocabulary options

Options for loading a vocabulary from a previous run, e.g. for training a source predictor with predict-inverse: True. If set, other vocabulary options are ignored.

`--source-vocab-size`
	Size of the source vocabulary.
`--target-vocab-size`
	Size of the target vocabulary.
`--source-vocab-min-frequency`
	Min word frequency for source vocabulary. Default: 1
`--target-vocab-min-frequency`
	Min word frequency for target vocabulary. Default: 1
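
To make the interaction between a vocabulary size cap and a minimum frequency concrete, here is a generic sketch using `collections.Counter`. The function `build_vocab` is hypothetical, it ignores special tokens such as unknown or padding symbols, and it is not the toolkit's own vocabulary builder.

```python
from collections import Counter


def build_vocab(sentences, max_size=None, min_frequency=1):
    """Return tokens ordered by frequency, filtered and truncated.

    max_size=None means no cap on the vocabulary size.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    # most_common() sorts by descending count.
    frequent = [tok for tok, n in counts.most_common() if n >= min_frequency]
    return frequent[:max_size]


sentences = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]

# Drop tokens seen fewer than 2 times.
print(build_vocab(sentences, min_frequency=2))

# Keep only the single most frequent token.
print(build_vocab(sentences, max_size=1))
```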

PredEst data

Predictor Estimator specific data options (POSTECH). Extending the vocabulary is useful for reducing out-of-vocabulary (OOV) words when the parallel data and the QE data come from different domains.

`--extend-source-vocab`
	Path to additional source data (Predictor), optionally loaded and used only for vocabulary creation.
`--extend-target-vocab`
	Path to additional target data (Predictor), optionally loaded and used only for vocabulary creation.
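
The effect of extending the vocabulary with extra data can be illustrated by comparing OOV rates on a toy example. All data and the helper `oov_rate` below are hypothetical; the sketch only shows why additional in-domain text reduces unknown words.

```python
def oov_rate(vocab, sentences):
    """Fraction of tokens in `sentences` that are missing from `vocab`."""
    tokens = [token for sentence in sentences for token in sentence]
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


# Toy corpora: parallel training data, extra in-domain text, and QE data.
parallel = [["the", "house", "is", "big"]]
extra = [["the", "translation", "is", "wrong"]]
qe = [["the", "translation", "is", "big"]]

base_vocab = {token for sentence in parallel for token in sentence}
# Extra data contributes tokens to the vocabulary only, not to training.
extended_vocab = base_vocab | {token for sentence in extra for token in sentence}

print(oov_rate(base_vocab, qe))
print(oov_rate(extended_vocab, qe))
```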

predictor training

Predictor Estimator (POSTECH)

`--warmup`	Pretrain Predictor for this number of steps. Default: 0
`--rnn-layers-pred`
	Number of layers in the predictor RNN. Default: 2
`--dropout-pred`	Dropout in predictor. Default: 0.0
`--hidden-pred`	Size of hidden layers in LSTM. Default: 100
`--out-embeddings-size`
	Word embedding size in the output layer. Default: 200
`--embedding-sizes`
	If set, takes precedence over other embedding params. Default: 0
`--share-embeddings`
	Tie input and output embeddings for target. Default: False
`--predict-inverse`
	Predict target -> source instead of source -> target. Default: False
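
The usage string shows `--share-embeddings` and `--predict-inverse` with an optional value (`[--flag [VALUE]]`), which in argparse corresponds to `nargs="?"`. The stand-in parser below mimics that pattern to show how a bare flag and an explicit value would be read. It is not the toolkit's own parser, and the `str_to_bool` conversion is an assumption.

```python
import argparse


def str_to_bool(value):
    """Assumed conversion: treat 'true', '1' and 'yes' as True."""
    return value.lower() in ("true", "1", "yes")


# Stand-in parser that only reproduces the optional-value flag pattern.
parser = argparse.ArgumentParser(prog="flag-demo")
for flag in ("--share-embeddings", "--predict-inverse"):
    # nargs="?" lets the flag appear alone (const is used) or with a value.
    parser.add_argument(flag, nargs="?", const=True, default=False,
                        type=str_to_bool)

bare = parser.parse_args(["--share-embeddings"])
explicit = parser.parse_args(["--predict-inverse", "true",
                              "--share-embeddings", "false"])
absent = parser.parse_args([])

print(bare.share_embeddings, explicit.predict_inverse, absent.predict_inverse)
```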

model-embeddings

Embedding layer sizes, used when pre-trained embeddings are not provided.

`--source-embeddings-size`
	Word embedding size for source. Default: 50
`--target-embeddings-size`
	Word embedding size for target. Default: 50
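
Since `--embedding-sizes` is described as taking precedence over the other embedding parameters, the resolution could look roughly like the sketch below. The function `resolve_embedding_sizes` is hypothetical, and treating `--out-embeddings-size` as one of the overridden parameters is an assumption.

```python
def resolve_embedding_sizes(embedding_sizes=0,
                            source_embeddings_size=50,
                            target_embeddings_size=50,
                            out_embeddings_size=200):
    """Return the effective embedding sizes as a dict.

    A non-zero embedding_sizes overrides all individual settings
    (assumed behaviour, including the output embeddings).
    """
    if embedding_sizes:
        return {"source": embedding_sizes,
                "target": embedding_sizes,
                "out": embedding_sizes}
    return {"source": source_embeddings_size,
            "target": target_embeddings_size,
            "out": out_embeddings_size}


# Documented defaults apply when --embedding-sizes is left at 0.
print(resolve_embedding_sizes())

# A single value replaces every individual size.
print(resolve_embedding_sizes(embedding_sizes=128, source_embeddings_size=64))
```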
