Running COMET

Command Line Interface

Our CLI supports 4 different commands:

comet-score: the Scoring Command is used to evaluate MT.
comet-compare: the Compare Command is used to compare two MT systems using statistical significance tests.
comet-mbr: the MBR Command is used for Minimum Bayes Risk Decoding.
comet-train: the Train Command is used to train your own evaluation metric.

Before we get started, please create the following dummy test data:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." > src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" > hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" > hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" > ref.en
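Because the source, hypothesis, and reference files are paired line by line, it can help to confirm that they all have the same number of lines before scoring. The following is a minimal sketch of such a check; the function names count_lines and check_parallel are hypothetical helpers and are not part of COMET.

```python
def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(paths):
    """Return the shared line count, or raise ValueError if the files differ."""
    counts = {str(path): count_lines(path) for path in paths}
    if not counts:
        raise ValueError("No files given")
    if len(set(counts.values())) > 1:
        raise ValueError(f"Line counts differ: {counts}")
    return next(iter(counts.values()))


# Example call on the dummy data (hypothetical usage, files must exist):
# check_parallel(["src.de", "hyp1.en", "hyp2.en", "ref.en"])
```

If the dummy files were created by running the commands above exactly once, each file should contain two lines.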

Scoring Command

Basic scoring command:

comet-score -s src.de -t hyp1.en -r ref.en

Use --gpus 0 to run on CPU.

Scoring multiple systems:

comet-score -s src.de -t hyp1.en hyp2.en -r ref.en

You can also test your system on public benchmarks such as WMT20 en-de via SacreBLEU:

comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS

The default setting of comet-score prints the score for each segment individually. If you are only interested in the score for the whole dataset (computed as the average of the segment scores), run the following command:

comet-score -s src.de -t hyp1.en -r ref.en --quiet --only_system
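As a small illustration of the relationship described above, the system-level score is the arithmetic mean of the segment-level scores. The sketch below uses made-up segment scores, not real COMET output, and the function name system_score is hypothetical.

```python
from statistics import fmean


def system_score(segment_scores):
    """Average the segment-level scores into one system-level score."""
    if not segment_scores:
        raise ValueError("Need at least one segment score")
    return fmean(segment_scores)


# Made-up segment scores for illustration only.
example_scores = [0.5, 1.0]
print(system_score(example_scores))  # 0.75
```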

COMET provides several models/metrics that you can use to evaluate your systems. You can select the one you want with the --model flag.

NOTE: For reference-free (QE-as-a-metric) models you don’t need to pass a reference. For example:

comet-score -s src.de -t hyp1.en --model Unbabel/wmt20-comet-qe-da

Compare Command

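When comparing multiple MT systems, we encourage you to run the comet-compare command to test for statistical significance with a paired t-test and bootstrap resampling. In the example below, hyp3.en stands for a third system output, which the dummy data above does not create.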

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
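To give an intuition for bootstrap resampling, here is a minimal sketch of a paired bootstrap over segment-level scores. It illustrates the general technique only and is not the implementation used by comet-compare; the function name paired_bootstrap, its parameters, and the example scores are assumptions made for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A's mean beats system B's.

    Both lists hold segment scores for the same segments in the same order,
    so each resample draws the same segment indices for both systems.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_resamples):
        # Sample n segment indices with replacement.
        indices = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in indices) / n
        mean_b = sum(scores_b[i] for i in indices) / n
        if mean_a > mean_b:
            wins += 1
    return wins / num_resamples


# Made-up segment scores for two systems, for illustration only.
print(paired_bootstrap([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0
```

A win fraction close to 1.0 (or close to 0.0) suggests that the difference between the two systems is stable across resamples, while a value near 0.5 suggests it is not.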

MBR Command

Minimum Bayes-Risk (MBR) decoding aims to find the candidate hypothesis that has the least expected loss under a given metric. Recent studies showed that MBR with neural fine-tuned metrics such as COMET leads to significant improvements in both automatic and human evaluations [Fernandes et al., NAACL 2022; Zhang et al., 2022].
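The selection rule can be illustrated with a minimal sketch. It assumes the common sampling-based formulation in which each candidate is scored against the other candidates as pseudo-references, and the candidate with the highest average utility (equivalently, the lowest expected loss) is returned. The token_overlap utility and the mbr_select function are hypothetical stand-ins; comet-mbr uses a COMET model as the metric, and this is not its implementation.

```python
def token_overlap(hypothesis, reference):
    """Toy utility: Jaccard overlap of lowercased tokens (a stand-in for COMET)."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 1.0


def mbr_select(candidates, utility=token_overlap):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("Need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(index):
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != index]
        return sum(utility(candidates[index], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Made-up candidate list for illustration only.
print(mbr_select(["the cat sat down", "the cat", "the cat sat"]))  # the cat sat
```

In this toy example the selected candidate is the one that agrees most with the rest of the list, which is the intuition behind MBR decoding.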

Example:

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt

If you are working with a very large candidate list, you can use the --rerank_top_k flag to prune it to the top-k most promising candidates according to a reference-free metric.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt20-comet-qe-da
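Conceptually, this kind of pruning keeps only the k candidates with the highest reference-free scores before the more expensive MBR step runs. The sketch below illustrates that idea under the assumption that a score per candidate is already available; the function name prune_top_k and the example scores are hypothetical, and this is not the code behind --rerank_top_k.

```python
def prune_top_k(candidates, qe_scores, k):
    """Keep the k candidates with the highest reference-free (QE) scores."""
    if len(candidates) != len(qe_scores):
        raise ValueError("Need exactly one score per candidate")
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank candidate indices from highest to lowest score.
    ranked = sorted(range(len(candidates)), key=lambda i: qe_scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]


# Made-up candidates and scores for illustration only.
print(prune_top_k(["a", "b", "c"], [0.1, 0.9, 0.5], 2))  # ['b', 'c']
```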

Example with 2 source sentences and 3 samples per source:

source.txt	samples.txt
Obama receives Netanyahu	Obama empfängt Netanjahu
	Obama empfing Netanjahu
	Obama trifft Netanjahu
Lamb grew up in the area.	Lamm wuchs in der Gegend auf.
	Lamb wuchs in der Gegend auf.
	Lamb wuchs in dieser Gegend auf.
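The layout above implies that the samples file holds the candidates for each source sentence in consecutive lines, so it has (number of sources) x (number of samples) lines in total. The following sketch groups the lines accordingly and rejects files whose sizes do not match; the function name group_samples is a hypothetical helper, not part of COMET.

```python
def group_samples(sources, samples, num_sample):
    """Pair each source line with its consecutive block of num_sample candidates."""
    if num_sample <= 0:
        raise ValueError("num_sample must be positive")
    if len(samples) != len(sources) * num_sample:
        raise ValueError(
            f"Expected {len(sources) * num_sample} samples, got {len(samples)}"
        )
    return [
        (source, samples[i * num_sample:(i + 1) * num_sample])
        for i, source in enumerate(sources)
    ]


sources = ["Obama receives Netanyahu", "Lamb grew up in the area."]
samples = [
    "Obama empfängt Netanjahu",
    "Obama empfing Netanjahu",
    "Obama trifft Netanjahu",
    "Lamm wuchs in der Gegend auf.",
    "Lamb wuchs in der Gegend auf.",
    "Lamb wuchs in dieser Gegend auf.",
]
for source, candidates in group_samples(sources, samples, 3):
    print(source, candidates)
```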