---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---

## Projecte Aina’s Catalan-Chinese machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

This machine translation model is built upon M2M100 1.2B, fine-tuned specifically for Catalan-Chinese translation. It is trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs. Of these, 113,305 sentence pairs were parallel data collected from the web, while the remaining 94,074,553 sentence pairs were synthetic parallel data created using the [Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).

The model was evaluated on Projecte Aina's Catalan-Chinese evaluation dataset (unpublished), achieving results comparable to those of Google Translate.

## Intended uses and limitations

You can use this model for machine translation from Catalan to Simplified Chinese.

## How to use

### Usage

Translate a sentence using Python:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-ca-zh"

# Load the fine-tuned M2M100 model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "Benvingut al projecte Aina!"

# Tokenize the Catalan input and generate the Chinese translation
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

generated_translation = tokenizer.decode(
    output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False
).strip()
print(generated_translation)
# 欢迎来到 Aina 项目!
```
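
Note that the generation settings above (beam search with 5 beams and a 200-token length limit) mirror the standard setting used for the evaluation reported below.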

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|---------------------------|
| OpenSubtitles | 139,300                   |
| WikiMatrix    | 90,643                    |
| Wikipedia     | 68,623                    |
| **Total**     | **298,566**               |

94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset         | Sentences before cleaning |
|-----------------|---------------------------|
| NLLB            | 24,051,233                |
| UNPC            | 17,599,223                |
| MultiUN         | 9,847,770                 |
| OpenSubtitles   | 9,319,658                 |
| MultiParaCrawl  | 3,410,087                 |
| MultiCCAligned  | 3,006,694                 |
| WikiMatrix      | 1,214,322                 |
| News Commentary | 375,982                   |
| Tatoeba         | 9,404                     |
| **Total**       | **68,834,373**            |

**English-Chinese:**

| Dataset    | Sentences before cleaning |
|------------|---------------------------|
| NLLB       | 71,383,325                |
| CCAligned  | 15,181,415                |
| Paracrawl  | 14,170,869                |
| WikiMatrix | 2,595,119                 |
| **Total**  | **103,330,728**           |

### Training procedure

#### Data preparation

**Catalan-Chinese parallel data**

The Chinese side of all datasets was first processed with the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).
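
As an illustration of this step, the snippet below detects Traditional Chinese with `hanzidentifier` and converts it with OpenCC. This is a minimal sketch of the described pipeline, not the exact script used; in particular, the `t2s` conversion profile is an assumption.

```python
import hanzidentifier
from opencc import OpenCC

# Assumption: the plain Traditional-to-Simplified profile; the OpenCC
# configuration used in the actual pipeline is not documented.
t2s = OpenCC("t2s")

def to_simplified(text: str) -> str:
    """Convert a Chinese sentence to Simplified Chinese if needed."""
    script = hanzidentifier.identify(text)
    if script in (hanzidentifier.TRADITIONAL, hanzidentifier.MIXED):
        return t2s.convert(text)
    return text  # already Simplified, or script undetermined

print(to_simplified("歡迎來到 Aina 項目!"))  # -> 欢迎来到 Aina 项目!
```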

All data was then filtered according to two specific criteria (a combined sketch of both filters follows the list):

- Alignment: sentence-level alignment scores were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability of being the target language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py), and sentences with a language probability score below 0.5 were discarded.

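The sketch below shows how such filtering could be implemented with `sentence-transformers` and `lingua-py`. It is an illustrative reconstruction under the stated thresholds (0.75 and 0.5), not the project's actual filtering code.

```python
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(
    Language.CATALAN, Language.CHINESE
).build()

def keep_pair(ca: str, zh: str) -> bool:
    """Apply the alignment and language-identification filters."""
    # Alignment: cosine similarity between LaBSE sentence embeddings.
    emb = labse.encode([ca, zh])
    if util.cos_sim(emb[0], emb[1]).item() < 0.75:
        return False
    # Language identification: confidence that each side is in its
    # expected language (threshold 0.5, as stated above).
    if detector.compute_language_confidence(ca, Language.CATALAN) < 0.5:
        return False
    if detector.compute_language_confidence(zh, Language.CHINESE) < 0.5:
        return False
    return True

print(keep_pair("Benvingut al projecte Aina!", "欢迎来到 Aina 项目!"))
```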

Next, the Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while the English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.

**Catalan-Chinese Contrastive Preference Optimization dataset**

The CPO dataset is built by comparing the quality of translations across four distinct sources:

- Reference translation: Chinese sentences from the Flores dev set, the Flores devtest set, and the NTREX dataset.
- aina-translator-ca-zh: a specialized bilingual model for Catalan-Chinese translation.
- Google Translate: a widely used general-purpose machine translation system.
- OpenAI GPT-4: a large-scale language model capable of performing a wide range of tasks in conversational settings, including high-quality translation.

To evaluate the quality of translations without relying on human annotations, we employ two reference-free evaluation models:

- [Unbabel/wmt23-cometkiwi-da-xxl](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl)
- [Unbabel/XCOMET-XXL](https://huggingface.co/Unbabel/XCOMET-XXL)

These models provide direct assessment scores for each translation. The scores from both models are averaged to determine the relative quality of each translation. Based on this evaluation, the highest-scoring ("chosen") and lowest-scoring ("rejected") translations are identified for each source sentence, forming contrastive pairs. The CPO dataset comprises a total of 4,006 such pairs of "chosen" and "rejected" translations.

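Once each candidate translation has its two quality-estimation scores, the pair-selection logic reduces to simple bookkeeping. The sketch below illustrates it in plain Python; the dictionary layout and field names are invented for the example.

```python
# Hypothetical per-sentence scores: each candidate carries its
# CometKiwi-XXL and XCOMET-XXL quality-estimation scores.
candidates = {
    "reference":             {"text": "...", "scores": (0.86, 0.90)},
    "aina-translator-ca-zh": {"text": "...", "scores": (0.84, 0.88)},
    "google-translate":      {"text": "...", "scores": (0.82, 0.85)},
    "gpt-4":                 {"text": "...", "scores": (0.80, 0.83)},
}

def average(scores):
    return sum(scores) / len(scores)

# Rank the four systems by averaged score for this source sentence.
ranked = sorted(candidates.values(), key=lambda c: average(c["scores"]))
rejected, chosen = ranked[0], ranked[-1]

# One contrastive pair per source sentence, as used for CPO training.
pair = {"chosen": chosen["text"], "rejected": rejected["text"]}
```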

#### Training

The training was executed on NVIDIA GPUs using the Hugging Face Transformers framework. The model was trained for 245,000 updates.

Following fine-tuning of the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates, as sketched below.

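A minimal sketch of the CPO phase with TRL's `CPOTrainer` might look as follows, assuming a recent TRL release where the trainer accepts a seq2seq model and a `prompt`/`chosen`/`rejected` dataset. The checkpoint path, hyperparameters, and dataset rows are placeholders; only the use of the CPO Trainer on the chosen/rejected pairs is taken from the description above.

```python
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Start from the fine-tuned checkpoint (path assumed for illustration).
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/finetuned-m2m100-ca-zh")
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-m2m100-ca-zh")

# The CPO dataset: one Catalan prompt with a chosen and a rejected
# Chinese translation per row (4,006 rows in total).
train_dataset = Dataset.from_dict({
    "prompt":   ["Benvingut al projecte Aina!"],
    "chosen":   ["欢迎来到 Aina 项目!"],
    "rejected": ["欢迎项目 Aina!"],
})

args = CPOConfig(output_dir="cpo-ca-zh", max_steps=1500)  # 1,500 updates

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```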

## Evaluation

### Variables and metrics

Below are the evaluation results on Projecte Aina's Catalan-Chinese test set (unpublished), compared to Google Translate for the CA-ZH direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) following the standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics (a reproduction sketch follows the list):

- BLEU: SacreBLEU implementation, version 2.4.0.
- ChrF: SacreBLEU implementation.
- Comet: model checkpoint "Unbabel/wmt22-comet-da".
- Comet-kiwi: model checkpoint "Unbabel/wmt22-cometkiwi-da".

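For reference, the string-based metrics can be reproduced directly with the `sacrebleu` Python package (the actual evaluation ran through `tower-eval`, which wraps these implementations; the single-sentence data here is illustrative):

```python
import sacrebleu

# Model hypotheses and reference translations, one entry per segment.
hypotheses = ["欢迎来到 Aina 项目!"]
references = [["欢迎来到 Aina 项目!"]]  # list of reference streams

# SacreBLEU's Chinese tokenizer is needed for meaningful BLEU on ZH text.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
chrf = sacrebleu.corpus_chrf(hypotheses, references)

print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```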

### Evaluation results

Below are the evaluation results for machine translation from Catalan to Chinese, compared to [Google Translate](https://translate.google.com/):

#### Projecte Aina's Catalan-Chinese evaluation dataset

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-ca-zh | 43.88     | 40.19     | **0.87** | **0.81**     |
| Google Translate      | **44.64** | **41.15** | **0.87** | 0.80         |


## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>