Running COMET
Command Line Interface
Our CLI supports 4 different commands:
comet-score
: the Scoring Command is used to evaluate MT.
comet-compare
: the Compare Command is used to compare two MT systems using statistical significance tests.
comet-mbr
: the MBR Command is used for Minimum Bayes Risk decoding.
comet-train
: used to train your own evaluation metric.
Before we get started, please create the following dummy test data (a German source, two candidate translations, and an English reference):
echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en
Scoring Command
Basic scoring command:
comet-score -s src.de -t hyp1.en -r ref.en
Use --gpus 0 to test on CPU.
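For example, the same scoring command forced onto CPU:
comet-score -s src.de -t hyp1.en -r ref.en --gpus 0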
Scoring multiple systems:
comet-score -s src.de -t hyp1.en hyp2.en -r ref.en
You can also test your system on public benchmarks such as WMT20 en-de, with test sets downloaded via SacreBLEU:
comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS
The default setting of comet-score
prints the score for each segment individually.
If you are only interested in the score for the whole dataset (computed as the average of the segment scores),
run the following command:
comet-score -s src.de -t hyp1.en -r ref.en --quiet --only_system
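If you also want to save the segment-level scores to a file, recent COMET releases provide a --to_json flag (check comet-score --help to confirm it is available in your version):
comet-score -s src.de -t hyp1.en -r ref.en --to_json scores.json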
COMET provides several models/metrics that you can use to evaluate your systems. You can select the one you want with the --model flag.
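For example, assuming the Unbabel/wmt22-comet-da checkpoint is available on the Hugging Face Hub:
comet-score -s src.de -t hyp1.en -r ref.en --model Unbabel/wmt22-comet-da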
NOTE: For reference-free (QE-as-a-metric) models you don't need to pass a reference, e.g.:
comet-score -s src.de -t hyp1.en --model Unbabel/wmt20-comet-qe-da
Compare Command
When comparing multiple MT systems, we encourage you to run the comet-compare command to test for statistical significance using paired t-tests and bootstrap resampling.
comet-compare -s src.de -t hyp1.en hyp2.en -r ref.en
MBR Command
Minimum Bayes-Risk (MBR) decoding aims to find the candidate hypothesis that has the least expected loss under a given metric. Recent studies have shown that MBR with neural fine-tuned metrics such as COMET leads to significant improvements in automatic and human evaluations [Fernandes et al., NAACL 2022; Zhang et al., 2022].
Example:
comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt
If you are working with a very large candidate list, you can use the --rerank_top_k flag to prune the list to the top-k most promising candidates according to a reference-free metric.
comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt20-comet-qe-da
Example with 2 sources and 3 samples per source:
| source.txt | samples.txt |
|---|---|
| Obama receives Netanyahu | Obama empfängt Netanjahu |
| | Obama empfing Netanjahu |
| | Obama trifft Netanjahu |
| Lamb grew up in the area. | Lamm wuchs in der Gegend auf. |
| | Lamb wuchs in der Gegend auf. |
| | Lamb wuchs in dieser Gegend auf. |
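Assuming samples.txt lists all candidates for each source consecutively, as in the table above, this example can be reproduced with the same dummy-data pattern used earlier (mbr_output.txt is just an illustrative output name):
echo -e "Obama receives Netanyahu\nLamb grew up in the area." > source.txt
echo -e "Obama empfängt Netanjahu\nObama empfing Netanjahu\nObama trifft Netanjahu\nLamm wuchs in der Gegend auf.\nLamb wuchs in der Gegend auf.\nLamb wuchs in dieser Gegend auf." > samples.txt
comet-mbr -s source.txt -t samples.txt --num_sample 3 -o mbr_output.txt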