Frequently Asked Questions

Since we released COMET, we have received several questions related to the interpretability of its scores and to general usage. In this section we try to address these questions as best we can!

Interpreting Scores:

When using COMET to evaluate machine translation, it’s important to understand how to interpret the scores it produces.

In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.

However, for the latest COMET models like Unbabel/wmt22-comet-da, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
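
For example, scoring translations with Unbabel/wmt22-comet-da from Python looks roughly like the sketch below. This is based on the comet package's download_model / load_from_checkpoint API; the exact interface may differ between versions, so check the documentation of the version you have installed.

```python
from comet import download_model, load_from_checkpoint

# Download the model checkpoint and load it.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample needs a source, a machine translation, and a reference.
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire.",
    }
]

# gpus=0 runs on CPU; set gpus=1 if a GPU is available.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # segment-level scores (between 0 and 1 for this model)
print(output.system_score)  # corpus-level score
```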

When using COMET to compare the performance of two different translation systems, it’s important to run the comet-compare command to obtain statistical significance measures. This command compares the outputs of the two systems using a statistical hypothesis test and estimates the probability that the observed difference in scores is due to chance, so that any reported difference between systems is statistically significant.
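
A typical invocation looks something like the following sketch. The flag names have changed between COMET versions, so treat this as an illustration and check comet-compare --help for the exact interface:

```bash
# Compare two systems' translations of the same source against a reference.
comet-compare -s src.de -t system_a.en system_b.en -r ref.en
```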

Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, makes COMET a valuable tool for evaluating machine translation.

Which COMET model should I use?

For general-purpose MT evaluation we recommend using Unbabel/wmt22-comet-da. This is the most stable model we have and an improved version of our previous model, Unbabel/wmt20-comet-da.

For evaluating translations without a reference we recommend Unbabel/wmt20-comet-qe-da for higher correlations with DA, and downloading wmt21-comet-qe-mqm for higher correlations with MQM.
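
Reference-free models are used the same way as above, except that the input samples omit the ref field. A minimal sketch, assuming the same Python API as in the earlier example:

```python
from comet import download_model, load_from_checkpoint

# Quality-estimation models score translations from source and hypothesis alone.
model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-qe-da"))

data = [
    {"src": "Dem Feuer konnte Einhalt geboten werden", "mt": "The fire could be stopped"},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```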

Where can I find the data used to train COMET models?

Direct Assessments

year data paper
2017 🔗 Findings of the 2017 Conference on Machine Translation (WMT17)
2018 🔗 Findings of the 2018 Conference on Machine Translation (WMT18)
2019 🔗 Findings of the 2019 Conference on Machine Translation (WMT19)
2020 🔗 Findings of the 2020 Conference on Machine Translation (WMT20)
2021 🔗 Findings of the 2021 Conference on Machine Translation (WMT21)
2022 🔗 Findings of the 2022 Conference on Machine Translation (WMT22)

Another large source of DA annotations is the MLQE-PE corpus, which is typically used for quality estimation shared tasks (Specia et al. 2020, 2021; Zerva et al. 2022).

You can download MLQE-PE using the following 🔗.

Direct Assessments: Relative Ranks

Before 2021 the WMT Metrics shared task used relative ranks to evaluate metrics.

Relative ranks can be created whenever we have at least two DA scores for translations of the same source input: the DA scores are converted into a relative ranking judgement if the difference between them is large enough to conclude that one translation is better than the other (usually at least 25 points), as sketched below.
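
In code, the conversion can be sketched as follows (a hypothetical helper for illustration, not part of the COMET codebase, with a configurable minimum score difference):

```python
def da_to_relative_ranks(da_scores, min_diff=25.0):
    """Convert DA scores for translations of the same source segment
    into pairwise (better, worse) relative-ranking judgements."""
    systems = list(da_scores.items())  # e.g. {"system_a": 78.0, "system_b": 41.0}
    pairs = []
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            (sys_a, score_a), (sys_b, score_b) = systems[i], systems[j]
            if score_a - score_b >= min_diff:
                pairs.append((sys_a, sys_b))  # sys_a judged better than sys_b
            elif score_b - score_a >= min_diff:
                pairs.append((sys_b, sys_a))
            # A smaller difference does not support a ranking judgement.
    return pairs
```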

To make it easier to replicate results from previous Metrics shared tasks (2017-2020), you can find the preprocessed DA relative ranks in the table below:

year relative ranks paper
2017 🔗 Results of the WMT17 Metrics Shared Task
2018 🔗 Results of the WMT18 Metrics Shared Task
2019 🔗 Results of the WMT19 Metrics Shared Task
2020 🔗 Results of the WMT20 Metrics Shared Task

Direct Assessment + Scalar Quality Metric:

In 2022, several changes were made to the annotation procedure used in the WMT Translation task. In contrast to the standard DA (a sliding scale from 0 to 100) used in previous years, in 2022 annotators performed DA+SQM (Direct Assessment + Scalar Quality Metric). In DA+SQM, annotators still provide a raw score between 0 and 100, but are also presented with seven labeled tick marks. DA+SQM helps to stabilize scores across annotators (as compared to DA).

year data paper
2022 🔗 Findings of the 2022 Conference on Machine Translation (WMT22)

Multidimensional Quality Metrics:

Since 2021, the WMT Metrics task has performed its own expert-based evaluation based on the Multidimensional Quality Metrics (MQM) framework. In the table below you can find MQM annotations from previous years.

year data paper
2020 🔗 A Large-Scale Study of Human Evaluation for Machine Translation
2021 🔗 Results of the WMT21 Metrics Shared Task
2022 🔗 Results of the WMT22 Metrics Shared Task

Note: You can find the original MQM data here.