---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---

## Projecte Aina’s Chinese-Catalan machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

This machine translation model is built on M2M100 1.2B and fine-tuned specifically for Chinese-Catalan translation.
It was trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs. Of these, 113,305 sentence pairs are parallel data collected from the web, while the remaining 94,074,553 sentence pairs are synthetic parallel data created using the
[Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
The model was evaluated on the Flores, NTREX, and Projecte Aina's Catalan-Chinese evaluation datasets, achieving results comparable to those of Google Translate.

## Intended uses and limitations

You can use this model for machine translation from Simplified Chinese to Catalan.

## How to use

### Usage

Translate a sentence using Python:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-zh-ca"

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "欢迎来到 Aina 项目!"

# Tokenize the source sentence and generate with beam search.
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

# Decode the best hypothesis, dropping special tokens.
generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(generated_translation)
#Benvingut al projecte Aina!
```

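To translate several sentences at once, the single-sentence example above can be extended to batches. The following is a minimal sketch, not part of the original usage instructions: `translate_batch`, `batched`, and the `batch_size` parameter are hypothetical names introduced here, and the code assumes that `model` and `tokenizer` have already been loaded as shown above. It reuses the same generation settings (`max_length=200`, `num_beams=5`).

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_batch(
    sentences: Sequence[str],
    model,
    tokenizer,
    batch_size: int = 8,      # hypothetical default, tune to available memory
    max_length: int = 200,
    num_beams: int = 5,
) -> List[str]:
    """Translate sentences in batches, preserving the input order."""
    translations: List[str] = []
    for chunk in batched(sentences, batch_size):
        # Pad to the longest sentence in the batch so the inputs form one tensor.
        inputs = tokenizer(list(chunk), return_tensors="pt", padding=True)
        output_ids = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        translations.extend(text.strip() for text in decoded)
    return translations
```

For example, `translate_batch(["你好。", "谢谢。"], model, tokenizer)` would return a list with one Catalan string per input sentence, in the same order.
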
## Limitations and bias
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if that research is completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset           | Sentences before cleaning |
|-------------------|----------------|
| OpenSubtitles     | 139,300        |
| WikiMatrix        | 90,643         |
| Wikipedia         | 68,623         |
| **Total**         | **298,566**    |

The 94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset           | Sentences before cleaning |
|-------------------|----------------|
| NLLB              | 24,051,233     |
| UNPC              | 17,599,223     |
| MultiUN           | 9,847,770      |
| OpenSubtitles     | 9,319,658      |
| MultiParaCrawl    | 3,410,087      |
| MultiCCAligned    | 3,006,694      |
| WikiMatrix        | 1,214,322      |
| News Commentary   | 375,982        |
| Tatoeba           | 9,404          |
| **Total**         | **68,834,373** |

**English-Chinese:**

| Dataset           | Sentences before cleaning |
|-------------------|----------------|
| NLLB              | 71,383,325     |
| CCAligned         | 15,181,415     |
| Paracrawl         | 14,170,869     |
| WikiMatrix        | 2,595,119      |
| **Total**         | **103,330,728** |

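The table totals can be cross-checked against the final corpus sizes given in the model description. The sketch below is only an illustration: it assumes that the 113,305 web pairs and 94,074,553 synthetic pairs reported there are what remains of these sources after cleaning and deduplication, and the `retention` helper is a hypothetical name. Under that assumption, about 38% of the web data and about 55% of the synthetic source data survive.

```python
# Pre-cleaning sentence counts, copied from the tables above.
web_sources = {"OpenSubtitles": 139_300, "WikiMatrix": 90_643, "Wikipedia": 68_623}
es_zh_sources = {
    "NLLB": 24_051_233, "UNPC": 17_599_223, "MultiUN": 9_847_770,
    "OpenSubtitles": 9_319_658, "MultiParaCrawl": 3_410_087,
    "MultiCCAligned": 3_006_694, "WikiMatrix": 1_214_322,
    "News Commentary": 375_982, "Tatoeba": 9_404,
}
en_zh_sources = {
    "NLLB": 71_383_325, "CCAligned": 15_181_415,
    "Paracrawl": 14_170_869, "WikiMatrix": 2_595_119,
}

# Final corpus sizes, as reported in the model description.
FINAL_WEB_PAIRS = 113_305
FINAL_SYNTHETIC_PAIRS = 94_074_553


def retention(kept: int, before: int) -> float:
    """Fraction of sentence pairs that remain after cleaning (hypothetical helper)."""
    return kept / before


web_total = sum(web_sources.values())
synthetic_total = sum(es_zh_sources.values()) + sum(en_zh_sources.values())

print(f"Web: {web_total:,} -> {FINAL_WEB_PAIRS:,} "
      f"({retention(FINAL_WEB_PAIRS, web_total):.1%} kept)")
print(f"Synthetic: {synthetic_total:,} -> {FINAL_SYNTHETIC_PAIRS:,} "
      f"({retention(FINAL_SYNTHETIC_PAIRS, synthetic_total):.1%} kept)")
```
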
### Training procedure

#### Data preparation

The Chinese side of all datasets was first processed using the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).

All data was then filtered according to two specific criteria:

- Alignment: sentence-level alignment scores were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability of being the target language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py) and sentences with a language probability score below 0.5 were discarded.

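The two filtering criteria above can be sketched as a simple threshold check. This is an illustration under stated assumptions rather than the pipeline that was actually used: the `ScoredPair` class and the `keep_pair` and `filter_pairs` functions are hypothetical, and the alignment score and language probability are assumed to have been computed beforehand (with LaBSE and Lingua.py respectively). Only the two thresholds come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List

ALIGNMENT_THRESHOLD = 0.75  # pairs scoring below this are discarded
LANGUAGE_THRESHOLD = 0.5    # sentences with a lower probability are discarded


@dataclass(frozen=True)
class ScoredPair:
    """A sentence pair with precomputed quality scores (hypothetical structure)."""
    source: str
    target: str
    alignment_score: float  # assumed: LaBSE similarity of source and target
    language_prob: float    # assumed: Lingua.py probability of the expected language


def keep_pair(pair: ScoredPair) -> bool:
    """Return True if the pair passes both the alignment and language criteria."""
    return (
        pair.alignment_score >= ALIGNMENT_THRESHOLD
        and pair.language_prob >= LANGUAGE_THRESHOLD
    )


def filter_pairs(pairs: Iterable[ScoredPair]) -> List[ScoredPair]:
    """Keep only the pairs that satisfy both criteria, preserving order."""
    return [pair for pair in pairs if keep_pair(pair)]
```

A pair is kept only when both scores reach their thresholds, so a well-aligned pair in the wrong language is still discarded.
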
Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.

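The concatenation and deduplication step can be sketched as follows. The deduplication criterion is not stated in this card, so the sketch assumes exact matching on whitespace-stripped (Chinese, Catalan) pairs; the `concatenate_and_deduplicate` function is a hypothetical name.

```python
from itertools import chain
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (Chinese sentence, Catalan sentence)


def concatenate_and_deduplicate(*datasets: Iterable[Pair]) -> List[Pair]:
    """Merge datasets and drop repeated pairs, keeping the first occurrence.

    Assumption: two pairs are duplicates when both sides match exactly
    after stripping surrounding whitespace.
    """
    seen: Set[Pair] = set()
    unique: List[Pair] = []
    for source, target in chain.from_iterable(datasets):
        key = (source.strip(), target.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Keeping the first occurrence means that, when the web-collected data is passed first, an identical synthetic pair is the one that gets dropped.
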
#### Training

The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
The model was trained for 244,500 updates.
Weights were saved every 500 updates.

## Evaluation

### Variables and metrics

The model was evaluated on [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200),
[NTREX](https://github.com/MicrosoftTranslator/NTREX), and Projecte Aina's Catalan-Chinese test sets (unpublished), and compared to Google Translate for the ZH-CA direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) following the standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics:

- BLEU: Sacrebleu implementation, version: 2.4.0.
- ChrF: Sacrebleu implementation.
- Comet: Model checkpoint: "Unbabel/wmt22-comet-da".
- Comet-kiwi: Model checkpoint: "Unbabel/wmt22-cometkiwi-da".

### Evaluation results

Below are the evaluation results for machine translation from Chinese to Catalan, compared to [Google Translate](https://translate.google.com/):

#### Flores200-dev

|                        | BLEU ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-----------------------|-------:|------:|-------:|--------:|
| aina-translator-zh-ca  | 26.74 | 54.49 | **0.86** | **0.82** |
| Google Translate       | **27.71** | **55.37** | **0.86** | 0.81 |

#### Flores200-devtest

|                        | BLEU ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-----------------------|-------:|------:|-------:|--------:|
| aina-translator-zh-ca  | 27.17 | 55.02 | **0.86** | **0.81** |
| Google Translate       | **27.47** | **55.51** | **0.86** | **0.81** |

#### NTREX

|                        | BLEU ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-----------------------|-------:|------:|-------:|--------:|
| aina-translator-zh-ca  | 22.43 | 50.65 | **0.83** | **0.79** |
| Google Translate       | **23.49** | **51.29** | **0.83** | **0.79** |

#### Projecte Aina's Catalan-Chinese evaluation dataset

|                        | BLEU ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-----------------------|-------:|------:|-------:|--------:|
| aina-translator-zh-ca  | **29.21** | 57.41 | **0.87** | **0.82** |
| Google Translate       | 28.86 | **57.73** | **0.87** | **0.82** |

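To make the comparison with Google Translate concrete, the per-metric gaps can be computed directly from the tables above. The snippet below is a small sketch for reading the results: the `score_gaps` helper is a hypothetical name and the numbers are copied from the tables. The largest BLEU gap is about one point, and the Comet scores are identical at the reported precision.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float, float, float]
METRICS = ("BLEU", "ChrF", "Comet", "Comet-kiwi")

# Scores per test set in METRICS order, copied from the tables above.
aina: Dict[str, Scores] = {
    "Flores200-dev": (26.74, 54.49, 0.86, 0.82),
    "Flores200-devtest": (27.17, 55.02, 0.86, 0.81),
    "NTREX": (22.43, 50.65, 0.83, 0.79),
    "Projecte Aina": (29.21, 57.41, 0.87, 0.82),
}
google: Dict[str, Scores] = {
    "Flores200-dev": (27.71, 55.37, 0.86, 0.81),
    "Flores200-devtest": (27.47, 55.51, 0.86, 0.81),
    "NTREX": (23.49, 51.29, 0.83, 0.79),
    "Projecte Aina": (28.86, 57.73, 0.87, 0.82),
}


def score_gaps(
    ours: Dict[str, Scores], theirs: Dict[str, Scores]
) -> Dict[str, Dict[str, float]]:
    """Per-metric difference (ours minus theirs), rounded to two decimals."""
    return {
        test_set: {
            metric: round(a - b, 2)
            for metric, a, b in zip(METRICS, ours[test_set], theirs[test_set])
        }
        for test_set in ours
    }


gaps = score_gaps(aina, google)
for test_set, gap in gaps.items():
    print(test_set, gap)
```

Negative values mean Google Translate scored higher on that metric; positive values mean this model did.
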
## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.

</details>