xixianliao committed
Commit: 5af834f
Parent(s): c4f731f
Update

README.md CHANGED
@@ -35,7 +35,7 @@ were parallel synthetic data created using the
 
 Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).
 
-The model was evaluated on the Projecte Aina's Catalan-Chinese evaluation dataset,
+The model was evaluated on the Projecte Aina's Catalan-Chinese evaluation dataset, achieving results comparable to those of Google Translate.
 
 ## Intended uses and limitations
 
@@ -59,7 +59,7 @@ sentence = "Benvingut al projecte Aina!"
 input_ids = tokenizer(sentence, return_tensors="pt").input_ids
 output_ids = model.generate(input_ids, max_length=200, num_beams=5)
 
-generated_translation= tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
+generated_translation= tokenizer.decode(output_ids[0], skip_special_tokens=True, spaces_between_special_tokens = False).strip()
 print(generated_translation)
 # 欢迎来到 Aina 项目!
 ```
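The changed line above is only a fragment of the card's usage example. For context, a self-contained sketch of the same flow follows; the repo id and the M2M100-style target-language forcing are assumptions for illustration, not content taken from this diff.

```python
# Minimal, self-contained sketch of the usage snippet above.
# Assumptions (not shown in this diff): the repo id below, and the standard
# M2M100 recipe of forcing the target-language token at generation time.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_id = "projecte-aina/aina-translator-ca-zh-v2"  # assumed repo id
tokenizer = M2M100Tokenizer.from_pretrained(model_id, src_lang="ca")
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

sentence = "Benvingut al projecte Aina!"
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),  # translate into Chinese
    max_length=200,
    num_beams=5,
)

# spaces_between_special_tokens=False, which this commit adds to the decode
# call, avoids spurious spaces in the Chinese output.
generated_translation = tokenizer.decode(
    output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False
).strip()
print(generated_translation)  # 欢迎来到 Aina 项目!
```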
@@ -77,37 +77,37 @@ The Catalan-Chinese data collected from the web was a combination of the followi
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| OpenSubtitles | 139
-| WikiMatrix | 90
-| Wikipedia | 68
-| **Total** | **298
+| OpenSubtitles | 139,300 |
+| WikiMatrix | 90,643 |
+| Wikipedia | 68,623|
+| **Total** | **298,566** |
 
-94
+94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets and English-Chinese datasets:
 
 **Spanish-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| NLLB |24
-| UNPC | 17
-| MultiUN | 9
-| OpenSubtitles | 9
-| MultiParaCrawl | 3
-| MultiCCAligned | 3
-| WikiMatrix | 1
-| News Commentary | 375
-| Tatoeba | 9
-| **Total** | **68
+| NLLB |24,051,233|
+| UNPC | 17,599,223 |
+| MultiUN | 9,847,770 |
+| OpenSubtitles | 9,319,658 |
+| MultiParaCrawl | 3,410,087 |
+| MultiCCAligned | 3,006,694 |
+| WikiMatrix | 1,214,322 |
+| News Commentary | 375,982 |
+| Tatoeba | 9,404 |
+| **Total** | **68,834,373** |
 
 **English-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| NLLB |71
-| CCAligned | 15
-| Paracrawl | 14
-| WikiMatrix | 2
-| **Total** | **103
+| NLLB |71,383,325|
+| CCAligned | 15,181,415 |
+| Paracrawl | 14,170,869|
+| WikiMatrix | 2,595,119|
+| **Total** | **103,330,728** |
 
 
 
@@ -127,7 +127,7 @@ All data was then filtered according to two specific criteria:
 
 Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
 
-The filtered and translated datasets are then concatenated and deduplicated to form a final corpus of 94
+The filtered and translated datasets are then concatenated and deduplicated to form a final corpus of 94,187,858.
 
 **Catalan-Chinese Contrastive Preference Optimization dataset**
 
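This hunk's context describes the pivot-and-merge step: the Spanish and English sides are machine-translated into Catalan, then all pairs are pooled and exact duplicates removed. A rough sketch under stated assumptions (that the es-ca model loads with the stock `transformers` translation pipeline; batching, the card's two filtering criteria, and file I/O are omitted):

```python
# Rough sketch of the pivot-and-deduplicate step. Assumes the es-ca model
# works with the stock translation pipeline; this is illustrative, not the
# authors' actual data-processing code.
from transformers import pipeline

es_ca = pipeline("translation", model="projecte-aina/aina-translator-es-ca")

def pivot_es(pairs):
    """Turn (Spanish, Chinese) pairs into synthetic (Catalan, Chinese) pairs."""
    for es, zh in pairs:
        ca = es_ca(es, max_length=512)[0]["translation_text"]
        yield ca, zh

seen, corpus = set(), []
for pair in pivot_es([("¡Bienvenido al proyecto Aina!", "欢迎来到 Aina 项目!")]):
    if pair not in seen:  # exact-duplicate removal across all pooled sources
        seen.add(pair)
        corpus.append(pair)
```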
@@ -149,7 +149,7 @@ These models provide direct assessment scores for each translation. The scores f
 #### Training
 
 The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
-The model was trained for 245
+The model was trained for 245,000 updates.
 
 Following fine-tuning on the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates.
 
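The "Hugging Face CPO Trainer" referenced in this hunk is presumably TRL's `CPOTrainer`. A hedged sketch of such a phase is shown below; apart from the 1,500-update count stated in the card, everything (the dataset record, hyperparameters, and the encoder-decoder setup) is an illustrative assumption, not the authors' configuration.

```python
# Hedged sketch of a CPO phase with TRL's CPOTrainer, assuming it accepts an
# encoder-decoder model as its sibling DPOTrainer does. Only the 1,500-update
# count comes from the card; all other values are illustrative.
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

base = "facebook/m2m100_1.2B"  # the card fine-tunes M2M100 1.2B before CPO
model = AutoModelForSeq2SeqLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Each record pairs a preferred ("chosen") with a dispreferred ("rejected")
# translation of the same source sentence.
train_dataset = Dataset.from_list([
    {
        "prompt": "Benvingut al projecte Aina!",
        "chosen": "欢迎来到 Aina 项目!",
        "rejected": "欢迎项目 Aina!",
    }
])

args = CPOConfig(output_dir="cpo-ca-zh", max_steps=1500)
trainer = CPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)  # tokenizer= in older TRL
trainer.train()
```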
@@ -167,15 +167,15 @@ Below are the evaluation results on the Projecte Aina's Catalan-Chinese test set
 
 ### Evaluation results
 
-Below are the evaluation results on the machine translation from
+Below are the evaluation results on the machine translation from Catalan to Chinese compared to [Google Translate](https://translate.google.com/):
 
 
 #### Projecte Aina's Catalan-Chinese evaluation dataset
 
 | | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
-
-| aina-translator-zh-
-| Google Translate |
+|:-----------------------|-------:|------:|-------:|--------:|
+| aina-translator-ca-zh-v2 | 43.88 | 40.19 | **0.87** | **0.81** |
+| Google Translate | **44.64** | **41.15** | **0.87** | 0.80 |
 
 
 
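For readers reproducing this table, a sketch of how these four metrics are commonly computed follows. The card does not name its tooling or checkpoints, so sacreBLEU, the `Unbabel/wmt22-comet-da` checkpoint, and Chinese tokenization for BLEU are all assumptions here.

```python
# Sketch of scoring with sacreBLEU and Unbabel COMET; everything here is an
# assumption about tooling, not taken from the card.
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["Benvingut al projecte Aina!"]  # Catalan sources
hyps = ["欢迎来到 Aina 项目!"]            # system outputs
refs = ["欢迎来到 Aina 项目!"]            # Chinese references

print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh").score)  # Bleu
print(sacrebleu.corpus_chrf(hyps, [refs]).score)                 # ChrF

# Reference-based COMET; Comet-kiwi would instead use a quality-estimation
# checkpoint (e.g. Unbabel/wmt22-cometkiwi-da) and omit "ref".
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet.predict(data, batch_size=8, gpus=0).system_score)    # Comet
```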