---
license: apache-2.0
---

# MetricX-24
|
|
|
*This is not an officially supported Google product.*
|
|
|
**GitHub repository**: https://github.com/google-research/metricx
|
|
|
The repository contains the code for running inference on MetricX-24 models,
a family of models for automatic evaluation of translations that were proposed
in the WMT'24 Metrics Shared Task submission
[MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task](https://aclanthology.org/2024.wmt-1.35/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.
|
|
|
|
|
## Available Models
|
|
|
There are 3 MetricX-24 models available on Hugging Face that vary in the number
of parameters. Unlike the MetricX-23 models, the MetricX-24 models are all
hybrid models that can do both reference-based and reference-free (also known
as quality estimation, or QE) inference:
|
|
|
* [MetricX-24-Hybrid-XXL](https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6)
* [MetricX-24-Hybrid-XL](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6)
* [MetricX-24-Hybrid-Large](https://huggingface.co/google/metricx-24-hybrid-large-v2p6)
|
|
|
We recommend the XXL versions for the best agreement with human judgments of
translation quality, the Large versions for the fastest inference, and the XL
versions as an intermediate trade-off between quality and speed.
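
To make the hybrid usage concrete, below is a minimal sketch of scoring
translations with the inference script from the GitHub repository. The JSONL
field names, the `metricx24.predict` entry point, and the flags shown in the
comment follow the repository's README, which should be treated as the
authoritative reference; verify them there before use.

```python
# Minimal sketch: preparing an input file for the repository's inference
# script (metricx24/predict.py). Field names and flags follow the GitHub
# README at https://github.com/google-research/metricx; verify before use.
import json

examples = [
    {
        "source": "Der schnelle braune Fuchs springt.",
        "hypothesis": "The quick brown fox jumps.",
        "reference": "The quick brown fox jumps.",
    },
]

with open("input.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# The script is then invoked along these lines (reference-based mode):
#
#   python -m metricx24.predict \
#     --tokenizer google/mt5-xl \
#     --model_name_or_path google/metricx-24-hybrid-xl-v2p6 \
#     --max_input_length 1536 \
#     --batch_size 1 \
#     --input_file input.jsonl \
#     --output_file output.jsonl
#
# Passing --qe runs the same checkpoint in reference-free (QE) mode, in
# which the reference is not used.
```

The predicted score for each example is written to the output file.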
|
|
|
|
|
## Changes to the WMT'24 Submission
|
|
|
The MetricX-24 models available here are most similar to the primary submission
to the WMT'24 Metrics Shared Task. They are initialized with
[mT5](https://aclanthology.org/2021.naacl-main.41/) and then fine-tuned on a
combination of direct assessment and MQM data from WMT'15-'22. However, we made
two small changes that distinguish these models from the WMT'24 submissions.
|
|
|
First, the metric scores are automatically clipped to the [0, 25] range, where
lower means a better translation. Because these are regression models, the raw
scores could otherwise occasionally fall outside this range.
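
Concretely, the clipping is equivalent to the following post-processing step (a
trivial sketch; the released inference code applies it internally):

```python
def clip_score(raw_score: float) -> float:
    """Clips a raw regression output to the documented [0, 25] range."""
    return min(max(raw_score, 0.0), 25.0)
```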
|
|
|
Second, we included one additional type of synthetic training example that was
not ready in time for the official submission: perfect translations of
multi-sentence segments, generated from the MQM data from WMT'20-'22. The
purpose of this category of synthetic data is to reduce the model's bias
against longer translations when the source segment and/or reference are also
long.
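
For intuition, such an example might be constructed by reusing the reference as
the candidate and assigning it the best possible label. The sketch below is one
plausible reading of the description above, not the actual data-generation
recipe, which is not part of the released inference code:

```python
# Hypothetical illustration only: one plausible way to build a "perfect
# translation" training example from an MQM segment.
def make_perfect_example(source: str, reference: str) -> dict:
    return {
        "source": source,
        "hypothesis": reference,  # candidate identical to the reference
        "reference": reference,
        "label": 0.0,  # 0 is the best score on the 0-25 scale
    }
```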
|
|
|
|
|
## Model Performance
|
|
|
For comparison with the submissions to the
[WMT'24 Metrics Shared Task](https://www2.statmt.org/wmt24/pdf/2024.wmt-1.2.pdf),
we provide an overview of the system- and segment-level correlations between
the MetricX-24 scores and the MQM ratings of translation quality, calculated on
the shared task's test sets. SPA refers to the shared task's system-level soft
pairwise accuracy, and Acc to its segment-level pairwise accuracy:
|
|
|
| Model | Sys-Level SPA (en-de) | Seg-Level Acc (en-de) | Sys-Level SPA (en-es) | Seg-Level Acc (en-es) | Sys-Level SPA (ja-zh) | Seg-Level Acc (ja-zh) |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| MetricX-24-Hybrid-XXL | 0.865 | 0.543 | 0.785 | 0.685 | 0.878 | 0.541 |
| MetricX-24-Hybrid-XL | 0.884 | 0.522 | 0.806 | 0.683 | 0.859 | 0.528 |
| MetricX-24-Hybrid-Large | 0.879 | 0.511 | 0.795 | 0.686 | 0.845 | 0.514 |
| MetricX-24-Hybrid-QE-XXL | 0.884 | 0.525 | 0.789 | 0.685 | 0.863 | 0.527 |
| MetricX-24-Hybrid-QE-XL | 0.879 | 0.502 | 0.774 | 0.683 | 0.849 | 0.509 |
| MetricX-24-Hybrid-QE-Large | 0.809 | 0.490 | 0.762 | 0.684 | 0.847 | 0.508 |
|
|
|
The table below averages these correlation scores per model; this average is
what the shared task used to determine the final ranking of the submissions:
|
|
|
| Model | Average Correlation |
| -------------------------- | ----- |
| MetricX-24-Hybrid-XXL | 0.716 |
| MetricX-24-Hybrid-XL | 0.714 |
| MetricX-24-Hybrid-Large | 0.705 |
| MetricX-24-Hybrid-QE-XXL | 0.712 |
| MetricX-24-Hybrid-QE-XL | 0.699 |
| MetricX-24-Hybrid-QE-Large | 0.683 |
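
Each entry is the plain mean of the six correlations in the first table; for
example, the XXL average can be reproduced with:

```python
# Reproduces the "Average Correlation" entry for MetricX-24-Hybrid-XXL
# from its six per-language-pair scores in the first table.
xxl_scores = [0.865, 0.543, 0.785, 0.685, 0.878, 0.541]
print(round(sum(xxl_scores) / len(xxl_scores), 3))  # 0.716
```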
|
|
|
NOTE: Since the MetricX-24 models are hybrid models, MetricX-24-Hybrid-\<size\>
and MetricX-24-Hybrid-QE-\<size\> above refer to the same model, evaluated
*with* and *without* the references, respectively.
|
|
|
|
|
## Citation
|
|
|
If you use MetricX-24 in your research, please cite the following publication:
|
|
|
```bibtex
@inproceedings{juraska-etal-2024-metricx,
    title = "{M}etric{X}-24: The {G}oogle Submission to the {WMT} 2024 Metrics Shared Task",
    author = "Juraska, Juraj  and
      Deutsch, Daniel  and
      Finkelstein, Mara  and
      Freitag, Markus",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.35",
    pages = "492--504",
}
```