Predictor training

This command is used to pre-train the predictor side of the predictor-estimator model.

usage: kiwi train [-h] --train-source TRAIN_SOURCE
                  [--train-target TRAIN_TARGET]
                  [--train-source-tags TRAIN_SOURCE_TAGS]
                  [--train-target-tags TRAIN_TARGET_TAGS]
                  [--train-pe TRAIN_PE]
                  [--train-sentence-scores TRAIN_SENTENCE_SCORES]
                  [--split SPLIT] [--valid-source VALID_SOURCE]
                  [--valid-target VALID_TARGET]
                  [--valid-alignments VALID_ALIGNMENTS]
                  [--valid-source-tags VALID_SOURCE_TAGS]
                  [--valid-target-tags VALID_TARGET_TAGS]
                  [--valid-pe VALID_PE]
                  [--valid-sentence-scores VALID_SENTENCE_SCORES]
                  [--predict-side {tags,source_tags,gap_tags}]
                  [--wmt18-format [WMT18_FORMAT]]
                  [--source-max-length SOURCE_MAX_LENGTH]
                  [--source-min-length SOURCE_MIN_LENGTH]
                  [--target-max-length TARGET_MAX_LENGTH]
                  [--target-min-length TARGET_MIN_LENGTH]
                  [--source-vocab-size SOURCE_VOCAB_SIZE]
                  [--target-vocab-size TARGET_VOCAB_SIZE]
                  [--source-vocab-min-frequency SOURCE_VOCAB_MIN_FREQUENCY]
                  [--target-vocab-min-frequency TARGET_VOCAB_MIN_FREQUENCY]
                  [--extend-source-vocab EXTEND_SOURCE_VOCAB]
                  [--extend-target-vocab EXTEND_TARGET_VOCAB]
                  [--warmup WARMUP] [--rnn-layers-pred RNN_LAYERS_PRED]
                  [--dropout-pred DROPOUT_PRED] [--hidden-pred HIDDEN_PRED]
                  [--out-embeddings-size OUT_EMBEDDINGS_SIZE]
                  [--embedding-sizes EMBEDDING_SIZES]
                  [--share-embeddings [SHARE_EMBEDDINGS]]
                  [--predict-inverse [PREDICT_INVERSE]]
                  [--source-embeddings-size SOURCE_EMBEDDINGS_SIZE]
                  [--target-embeddings-size TARGET_EMBEDDINGS_SIZE]
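
As a rough illustration of how the synopsis maps onto an invocation, the following Python sketch assembles a kiwi train command line using only flags listed above. The helper build_train_command, the file paths, and the chosen values are hypothetical placeholders and not part of the tool. A complete invocation may also need options that this page does not list.

```python
import shlex


def build_train_command(options):
    """Turn a dict of option names and values into an argument list.

    A value of True emits the bare flag; None and False are skipped.
    """
    args = ["kiwi", "train"]
    for name, value in options.items():
        if value is None or value is False:
            continue
        args.append("--" + name)
        if value is not True:
            args.append(str(value))
    return args


# Hypothetical paths and values, for illustration only.
options = {
    "train-source": "data/train.src",
    "train-target": "data/train.tgt",
    "valid-source": "data/dev.src",
    "valid-target": "data/dev.tgt",
    "rnn-layers-pred": 2,
    "hidden-pred": 100,
    "share-embeddings": True,
    "predict-inverse": False,
}

command = build_train_command(options)
print(shlex.join(command))
```

The argument list can be passed to subprocess.run as is; shlex.join is used here only to print a copy-pasteable form.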

data

--train-source Path to training source file
--train-target Path to training target file
--train-source-tags
 Path to training label file for source (WMT18 format)
--train-target-tags
 Path to training label file for target
--train-pe Path to file containing post-edited target.
--train-sentence-scores
 Path to file containing sentence level scores.

validation data

--split Split the training dataset when no validation set is given.
--valid-source Path to validation source file
--valid-target Path to validation target file
--valid-alignments
 Path to validation alignments between source and target.
--valid-source-tags
 Path to validation label file for source (WMT18 format)
--valid-target-tags
 Path to validation label file for target
--valid-pe Path to file containing post-edited target.
--valid-sentence-scores
 Path to file containing sentence level scores.
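
The text above does not say what value --split expects. The sketch below assumes it is the fraction of examples kept for training, with the remainder held out for validation. The function split_train_valid and the seed are illustrative only and do not reflect how the tool itself performs the split.

```python
import random


def split_train_valid(examples, train_fraction, seed=0):
    """Shuffle examples and split them into (train, valid) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy sentence pairs standing in for the training files.
examples = [("src %d" % i, "tgt %d" % i) for i in range(10)]
train, valid = split_train_valid(examples, 0.8)
print(len(train), len(valid))
```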

data processing options

--predict-side

Possible choices: tags, source_tags, gap_tags

Tagset to predict. Leave unchanged for WMT17 format.

Default: “tags”

--wmt18-format

Read target tags in WMT18 format.

Default: False

--source-max-length

Maximum source sequence length to keep.

Default: inf

--source-min-length

Minimum source sequence length to keep.

Default: 0

--target-max-length

Maximum target sequence length to keep.

Default: inf

--target-min-length

Minimum target sequence length to keep.

Default: 0
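
A minimal sketch of how the four length options could interact, assuming they act as a filter that drops sentence pairs whose token counts fall outside the bounds (this behaviour is an assumption, not something the option descriptions confirm). The function keep_example is hypothetical; its defaults mirror the defaults listed above.

```python
def keep_example(source_tokens, target_tokens,
                 source_min_length=0, source_max_length=float("inf"),
                 target_min_length=0, target_max_length=float("inf")):
    """Return True if both sides fall within their length bounds."""
    return (source_min_length <= len(source_tokens) <= source_max_length
            and target_min_length <= len(target_tokens) <= target_max_length)


pairs = [
    ("a b c".split(), "x y z".split()),
    ("a".split(), "x y".split()),              # source too short
    ("a b c d e f".split(), "x y z".split()),  # source too long
]

# Keep only pairs whose source has between 2 and 5 tokens.
kept = [p for p in pairs
        if keep_example(*p, source_min_length=2, source_max_length=5)]
print(len(kept))
```

With the default bounds (0 and inf) every pair is kept, which matches the defaults documented above.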

vocabulary options

Options for loading vocabulary from a previous run. This is used, for example, when training a source predictor with predict-inverse set to True. If set, other vocabulary options are ignored.

--source-vocab-size
 Size of the source vocabulary.
--target-vocab-size
 Size of the target vocabulary.
--source-vocab-min-frequency

Minimum word frequency for source vocabulary.

Default: 1

--target-vocab-min-frequency

Minimum word frequency for target vocabulary.

Default: 1
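
The size and minimum-frequency options can be pictured with the sketch below. It assumes the common convention that words rarer than the minimum frequency are dropped first and the remaining words are then capped at the most frequent ones. The function build_vocab is hypothetical and ignores special tokens such as padding or unknown-word symbols.

```python
from collections import Counter


def build_vocab(sentences, max_size=None, min_frequency=1):
    """Return vocabulary words, most frequent first."""
    counts = Counter(tok for s in sentences for tok in s.split())
    frequent = [(tok, c) for tok, c in counts.items() if c >= min_frequency]
    # Sort by descending count; ties are broken alphabetically.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    if max_size is not None:
        frequent = frequent[:max_size]
    return [tok for tok, _ in frequent]


sentences = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(sentences, max_size=3, min_frequency=2)
print(vocab)
```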

PredEst data

Predictor Estimator specific data options. (POSTECH)

--extend-source-vocab
 Path to additional data (predictor). Optionally load more data, which is used only for vocabulary creation. This is useful to reduce OOV words if the parallel data and QE data are from different domains.
--extend-target-vocab
 Path to additional data (predictor). Optionally load more data, which is used only for vocabulary creation.
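
To illustrate why extending the vocabulary helps, the sketch below measures the out-of-vocabulary (OOV) rate of a toy QE sentence against a vocabulary built from parallel data alone and against one extended with extra text. All sentences and the helper oov_rate are made up for the example.

```python
def oov_rate(vocab, sentences):
    """Fraction of tokens in sentences that are missing from vocab."""
    tokens = [tok for s in sentences for tok in s.split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)


parallel = ["the cat sat", "the dog ran"]   # data used to train the predictor
extra = ["the printer driver crashed"]      # hypothetical in-domain text
qe_data = ["the driver crashed"]            # QE data from another domain

base_vocab = {tok for s in parallel for tok in s.split()}
extended_vocab = base_vocab | {tok for s in extra for tok in s.split()}

print(oov_rate(base_vocab, qe_data), oov_rate(extended_vocab, qe_data))
```

In this toy case two of the three QE tokens are unknown to the base vocabulary, and none are unknown once the extra text is included.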

predictor training

Predictor Estimator (POSTECH)

--warmup

Pretrain Predictor for this number of steps.

Default: 0

--rnn-layers-pred

Number of layers in the predictor RNN.

Default: 2

--dropout-pred

Dropout in predictor

Default: 0.0

--hidden-pred

Size of hidden layers in LSTM

Default: 100

--out-embeddings-size

Word embedding size in the output layer.

Default: 200

--embedding-sizes

If set, takes precedence over the other embedding size parameters.

Default: 0

--share-embeddings

Tie input and output embeddings for target.

Default: False

--predict-inverse

Predict target -> source instead of source -> target.

Default: False
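
Conceptually, --predict-inverse swaps the roles of the two sides. The sketch below shows that idea on plain sentence pairs. The function make_examples and the keys "input" and "output" are illustrative names only and do not correspond to the tool's internal data structures.

```python
def make_examples(sources, targets, predict_inverse=False):
    """Pair sentences, optionally swapping the prediction direction."""
    pairs = list(zip(sources, targets))
    if predict_inverse:
        # Target becomes the conditioning side, source the predicted side.
        return [{"input": t, "output": s} for s, t in pairs]
    return [{"input": s, "output": t} for s, t in pairs]


sources = ["das Haus", "ein Hund"]
targets = ["the house", "a dog"]

forward = make_examples(sources, targets)
inverse = make_examples(sources, targets, predict_inverse=True)
print(forward[0], inverse[0])
```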

model-embeddings

Embedding layers size in case pre-trained embeddings are not used.

--source-embeddings-size

Word embedding size for source.

Default: 50

--target-embeddings-size

Word embedding size for target.

Default: 50
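
The precedence rule stated for --embedding-sizes can be sketched as follows. This assumes that a non-zero value overrides the source, target, and output embedding sizes alike. Which parameters are actually overridden is not spelled out above, so treat the function resolve_embedding_sizes as a hypothetical reading rather than the tool's behaviour. The defaults mirror the documented ones.

```python
def resolve_embedding_sizes(embedding_sizes=0,
                            source_embeddings_size=50,
                            target_embeddings_size=50,
                            out_embeddings_size=200):
    """Return the effective embedding sizes after applying precedence."""
    if embedding_sizes:
        # Assumption: one shared value replaces all three sizes.
        return {
            "source": embedding_sizes,
            "target": embedding_sizes,
            "out": embedding_sizes,
        }
    return {
        "source": source_embeddings_size,
        "target": target_embeddings_size,
        "out": out_embeddings_size,
    }


defaults = resolve_embedding_sizes()
overridden = resolve_embedding_sizes(embedding_sizes=300,
                                     source_embeddings_size=100)
print(defaults, overridden)
```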