xixianliao committed 5af834f (1 parent: c4f731f)
Files changed (1): README.md (+28 −28)
README.md CHANGED
@@ -35,7 +35,7 @@ were parallel synthetic data created using the
 
 Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).
 
- The model was evaluated on the Projecte Aina's Catalan-Chinese evaluation dataset, which contains 1022 sentences.
+ The model was evaluated on Projecte Aina's Catalan-Chinese evaluation dataset, achieving results comparable to those of Google Translate.
 
 ## Intended uses and limitations
 
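The card does not show how the "chosen" and "rejected" translations were selected, though a later hunk notes that scoring models "provide direct assessment scores for each translation." Below is a hedged sketch of one common recipe consistent with that description: rank candidate translations with a reference-free quality-estimation model and keep the best and worst. The QE checkpoint and the selection rule are assumptions, not the project's documented setup.

```python
# Hedged illustration: score candidate translations of one source sentence with
# a reference-free QE model, keep the best as "chosen" and the worst as "rejected".
from comet import download_model, load_from_checkpoint

kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))  # assumed QE model

src = "Benvingut al projecte Aina!"
candidates = ["欢迎来到 Aina 项目!", "欢迎你来艾娜计划!"]  # e.g. outputs of different systems

scores = kiwi.predict([{"src": src, "mt": mt} for mt in candidates], batch_size=8).scores
chosen = candidates[scores.index(max(scores))]
rejected = candidates[scores.index(min(scores))]
pair = {"prompt": src, "chosen": chosen, "rejected": rejected}  # one CPO training example
```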
@@ -59,7 +59,7 @@ sentence = "Benvingut al projecte Aina!"
 input_ids = tokenizer(sentence, return_tensors="pt").input_ids
 output_ids = model.generate(input_ids, max_length=200, num_beams=5)
 
- generated_translation= tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
+ generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False).strip()
 print(generated_translation)
 #欢迎来到 Aina 项目!
 ```
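Since the hunk above shows only a fragment of the README's usage snippet, here is a self-contained sketch of the whole call for readers of this diff. The checkpoint id and the language-forcing steps (`src_lang`, `forced_bos_token_id`) are assumptions based on standard M2M100 usage, not lines quoted from the card.

```python
# Self-contained usage sketch; repo id and language codes are assumptions.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_id = "projecte-aina/aina-translator-ca-zh-v2"  # assumed repo id
tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

tokenizer.src_lang = "ca"  # Catalan source
sentence = "Benvingut al projecte Aina!"
input_ids = tokenizer(sentence, return_tensors="pt").input_ids

# Force Chinese as the first decoded token so generation targets the right language.
output_ids = model.generate(
    input_ids,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),
    max_length=200,
    num_beams=5,
)
generated_translation = tokenizer.decode(
    output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False
).strip()
print(generated_translation)  # 欢迎来到 Aina 项目!
```

The `spaces_between_special_tokens=False` argument added by this commit presumably stops the decoder from inserting spurious spaces when joining decoded pieces, which matters for an unspaced script like Chinese.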
@@ -77,37 +77,37 @@ The Catalan-Chinese data collected from the web was a combination of the followi
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | OpenSubtitles | 139.300 |
- | WikiMatrix | 90.643 |
- | Wikipedia | 68.623|
- | **Total** | **298.566** |
+ | OpenSubtitles | 139,300 |
+ | WikiMatrix | 90,643 |
+ | Wikipedia | 68,623 |
+ | **Total** | **298,566** |
 
- 94.074.553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets and English-Chinese datasets:
+ 94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:
 
 **Spanish-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | NLLB |24.051.233|
- | UNPC | 17.599.223 |
- | MultiUN | 9.847.770 |
- | OpenSubtitles | 9.319.658 |
- | MultiParaCrawl | 3.410.087 |
- | MultiCCAligned | 3.006.694 |
- | WikiMatrix | 1.214.322 |
- | News Commentary | 375.982 |
- | Tatoeba | 9.404 |
- | **Total** | **68.834.373** |
+ | NLLB | 24,051,233 |
+ | UNPC | 17,599,223 |
+ | MultiUN | 9,847,770 |
+ | OpenSubtitles | 9,319,658 |
+ | MultiParaCrawl | 3,410,087 |
+ | MultiCCAligned | 3,006,694 |
+ | WikiMatrix | 1,214,322 |
+ | News Commentary | 375,982 |
+ | Tatoeba | 9,404 |
+ | **Total** | **68,834,373** |
 
 **English-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | NLLB |71.383.325|
- | CCAligned | 15.181.415 |
- | Paracrawl | 14.170.869|
- | WikiMatrix | 2.595.119|
- | **Total** | **103.330.728** |
+ | NLLB | 71,383,325 |
+ | CCAligned | 15,181,415 |
+ | Paracrawl | 14,170,869 |
+ | WikiMatrix | 2,595,119 |
+ | **Total** | **103,330,728** |
 
 
 
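For readers of the diff: the synthetic-data step described above (and completed in the next hunk) reduces to a simple mapping over sentence pairs: translate the Spanish or English side into Catalan, keep the Chinese side. A minimal sketch follows; the translation callable is left abstract because the card does not specify how the es-ca and en-ca checkpoints are loaded.

```python
# Sketch of synthetic parallel-data creation: replace the Spanish side of each
# es-zh pair with its Catalan machine translation, keeping the Chinese side.
# `translate_es_to_ca` stands in for projecte-aina/aina-translator-es-ca.
from typing import Callable, Iterable, Iterator, Tuple

def make_synthetic_ca_zh(
    es_zh_pairs: Iterable[Tuple[str, str]],
    translate_es_to_ca: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    for es, zh in es_zh_pairs:
        yield translate_es_to_ca(es), zh  # new Catalan-Chinese pair

# The English-Chinese data goes through the same mapping with an en-ca translator.
```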
@@ -127,7 +127,7 @@ All data was then filtered according to two specific criteria:
 
 Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
 
- The filtered and translated datasets are then concatenated and deduplicated to form a final corpus of 94.187.858.
+ The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.
 
 **Catalan-Chinese Contrastive Preference Optimization dataset**
 
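The concatenate-and-deduplicate step in this hunk is straightforward; here is a hedged sketch with pandas. File names and formats are invented for illustration, and a corpus of roughly 94M pairs would realistically need a streaming deduplication tool rather than an in-memory DataFrame.

```python
# Illustrative concatenation + exact-duplicate removal over TSV sentence pairs.
import pandas as pd

parts = [
    pd.read_csv(path, sep="\t", names=["ca", "zh"], quoting=3)
    for path in ["web_crawled.tsv", "from_es.tsv", "from_en.tsv"]  # hypothetical files
]
corpus = pd.concat(parts, ignore_index=True).drop_duplicates()
corpus.to_csv("ca_zh.final.tsv", sep="\t", index=False, header=False)
print(len(corpus))  # the card reports 94,187,858 pairs at this point
```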
@@ -149,7 +149,7 @@ These models provide direct assessment scores for each translation. The scores f
 #### Training
 
 The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
- The model was trained for 245.000 updates.
+ The model was trained for 245,000 updates.
 
 Following fine-tuning on the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates.
 
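The CPO phase in this hunk names the Hugging Face CPO Trainer but gives no code, so the following is only a rough sketch with TRL's `CPOTrainer`. Hyperparameters and dataset contents are placeholders, the tokenizer argument is named differently across trl versions (`tokenizer` vs. `processing_class`), and encoder-decoder support for M2M100 should be verified against your trl release.

```python
# Rough sketch of the CPO phase with TRL; all values are placeholders.
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

ckpt = "path/to/finetuned-m2m100-1.2B"  # the fine-tuned checkpoint (placeholder)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# CPO trains on preference triplets: source prompt, preferred and dispreferred translation.
train_dataset = Dataset.from_dict({
    "prompt":   ["Benvingut al projecte Aina!"],
    "chosen":   ["欢迎来到 Aina 项目!"],
    "rejected": ["欢迎你来艾娜计划!"],
})

args = CPOConfig(output_dir="m2m100-cpo", max_steps=1500, beta=0.1)  # 1,500 updates per the card
trainer = CPOTrainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```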
@@ -167,15 +167,15 @@ Below are the evaluation results on the Projecte Aina's Catalan-Chinese test set
 
 ### Evaluation results
 
- Below are the evaluation results on the machine translation from Chinese to Catalan compared to [Google Translate](https://translate.google.com/):
+ Below are the evaluation results for machine translation from Catalan to Chinese, compared to [Google Translate](https://translate.google.com/):
 
 
 #### Projecte Aina's Catalan-Chinese evaluation dataset
 
 | | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
- |:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
- | aina-translator-zh-ca-v2 | **28.55** | **57.64** | **0.87** | **0.82** |
- | Google Translate | 26.84 | 55.7 | 0.86 | **0.82** |
+ |:-----------------------|-------:|------:|-------:|--------:|
+ | aina-translator-ca-zh-v2 | 43.88 | 40.19 | **0.87** | **0.81** |
+ | Google Translate | **44.64** | **41.15** | **0.87** | 0.80 |
 
 
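The four metrics in the corrected table can be computed with standard tooling; here is a hedged sketch using sacrebleu for BLEU/ChrF and Unbabel COMET for the neural metrics. The exact checkpoints and settings used for the card's numbers are not stated, so these are assumptions.

```python
# Hedged evaluation sketch: BLEU/ChrF via sacrebleu, COMET via Unbabel's package.
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["Benvingut al projecte Aina!"]   # Catalan sources
hyps = ["欢迎来到 Aina 项目!"]            # system outputs
refs = ["欢迎来到 Aina 项目!"]            # Chinese references

print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh").score)  # BLEU, Chinese tokenizer
print(sacrebleu.corpus_chrf(hyps, [refs]).score)                 # ChrF

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))  # assumed checkpoint
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet.predict(data, batch_size=8).system_score)
# COMET-Kiwi is the same call with a reference-free model, e.g. Unbabel/wmt22-cometkiwi-da.
```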
 