xixianliao committed 5af834f (1 parent: c4f731f)
Files changed (1): README.md (+28 −28)
README.md CHANGED
@@ -35,7 +35,7 @@ were parallel synthetic data created using the
 
 Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).
 
- The model was evaluated on the Projecte Aina's Catalan-Chinese evaluation dataset, which contains 1022 sentences.
+ The model was evaluated on Projecte Aina's Catalan-Chinese evaluation dataset, achieving results comparable to those of Google Translate.
 
 ## Intended uses and limitations
 
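The card does not show how the "chosen" and "rejected" translations were selected, though a later hunk notes that scoring models "provide direct assessment scores for each translation." Below is a hedged sketch of one common recipe consistent with that description: rank candidate translations with a reference-free quality-estimation model and keep the best and worst. The QE checkpoint and the selection rule are assumptions, not the project's documented setup.

```python
# Hedged illustration: score candidate translations of one source sentence with
# a reference-free QE model, keep the best as "chosen" and the worst as "rejected".
from comet import download_model, load_from_checkpoint

kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))  # assumed QE model

src = "Benvingut al projecte Aina!"
candidates = ["欢迎来到 Aina 项目!", "欢迎你来艾娜计划!"]  # e.g. outputs of different systems

scores = kiwi.predict([{"src": src, "mt": mt} for mt in candidates], batch_size=8).scores
chosen = candidates[scores.index(max(scores))]
rejected = candidates[scores.index(min(scores))]
pair = {"prompt": src, "chosen": chosen, "rejected": rejected}  # one CPO training example
```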
@@ -59,7 +59,7 @@ sentence = "Benvingut al projecte Aina!"
 input_ids = tokenizer(sentence, return_tensors="pt").input_ids
 output_ids = model.generate(input_ids, max_length=200, num_beams=5)
 
- generated_translation= tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
+ generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False).strip()
 print(generated_translation)
 #欢迎来到 Aina 项目!
 ```
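Since the hunk above shows only a fragment of the README's usage snippet, here is a self-contained sketch of the whole call for readers of this diff. The checkpoint id and the language-forcing steps (`src_lang`, `forced_bos_token_id`) are assumptions based on standard M2M100 usage, not lines quoted from the card.

```python
# Self-contained usage sketch; repo id and language codes are assumptions.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_id = "projecte-aina/aina-translator-ca-zh-v2"  # assumed repo id
tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

tokenizer.src_lang = "ca"  # Catalan source
sentence = "Benvingut al projecte Aina!"
input_ids = tokenizer(sentence, return_tensors="pt").input_ids

# Force Chinese as the first decoded token so generation targets the right language.
output_ids = model.generate(
    input_ids,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),
    max_length=200,
    num_beams=5,
)
generated_translation = tokenizer.decode(
    output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False
).strip()
print(generated_translation)  # 欢迎来到 Aina 项目!
```

The `spaces_between_special_tokens=False` argument added by this commit presumably stops the decoder from inserting spurious spaces when joining decoded pieces, which matters for an unspaced script like Chinese.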
@@ -77,37 +77,37 @@ The Catalan-Chinese data collected from the web was a combination of the followi
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | OpenSubtitles | 139.300 |
- | WikiMatrix | 90.643 |
- | Wikipedia | 68.623|
- | **Total** | **298.566** |
+ | OpenSubtitles | 139,300 |
+ | WikiMatrix | 90,643 |
+ | Wikipedia | 68,623 |
+ | **Total** | **298,566** |
 
- 94.074.553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets and English-Chinese datasets:
+ 94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:
 
 **Spanish-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | NLLB |24.051.233|
- | UNPC | 17.599.223 |
- | MultiUN | 9.847.770 |
- | OpenSubtitles | 9.319.658 |
- | MultiParaCrawl | 3.410.087 |
- | MultiCCAligned | 3.006.694 |
- | WikiMatrix | 1.214.322 |
- | News Commentary | 375.982 |
- | Tatoeba | 9.404 |
- | **Total** | **68.834.373** |
+ | NLLB | 24,051,233 |
+ | UNPC | 17,599,223 |
+ | MultiUN | 9,847,770 |
+ | OpenSubtitles | 9,319,658 |
+ | MultiParaCrawl | 3,410,087 |
+ | MultiCCAligned | 3,006,694 |
+ | WikiMatrix | 1,214,322 |
+ | News Commentary | 375,982 |
+ | Tatoeba | 9,404 |
+ | **Total** | **68,834,373** |
 
 **English-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
- | NLLB |71.383.325|
- | CCAligned | 15.181.415 |
- | Paracrawl | 14.170.869|
- | WikiMatrix | 2.595.119|
- | **Total** | **103.330.728** |
+ | NLLB | 71,383,325 |
+ | CCAligned | 15,181,415 |
+ | Paracrawl | 14,170,869 |
+ | WikiMatrix | 2,595,119 |
+ | **Total** | **103,330,728** |
 
 
 
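For readers of the diff: the synthetic-data step described above (and completed in the next hunk) reduces to a simple mapping over sentence pairs: translate the Spanish or English side into Catalan, keep the Chinese side. A minimal sketch follows; the translation callable is left abstract because the card does not specify how the es-ca and en-ca checkpoints are loaded.

```python
# Sketch of synthetic parallel-data creation: replace the Spanish side of each
# es-zh pair with its Catalan machine translation, keeping the Chinese side.
# `translate_es_to_ca` stands in for projecte-aina/aina-translator-es-ca.
from typing import Callable, Iterable, Iterator, Tuple

def make_synthetic_ca_zh(
    es_zh_pairs: Iterable[Tuple[str, str]],
    translate_es_to_ca: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    for es, zh in es_zh_pairs:
        yield translate_es_to_ca(es), zh  # new Catalan-Chinese pair

# The English-Chinese data goes through the same mapping with an en-ca translator.
```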
@@ -127,7 +127,7 @@ All data was then filtered according to two specific criteria:
 
 Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
 
- The filtered and translated datasets are then concatenated and deduplicated to form a final corpus of 94.187.858.
+ The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.
 
 **Catalan-Chinese Contrastive Preference Optimization dataset**
 
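The concatenate-and-deduplicate step in this hunk is straightforward; here is a hedged sketch with pandas. File names and formats are invented for illustration, and a corpus of roughly 94M pairs would realistically need a streaming deduplication tool rather than an in-memory DataFrame.

```python
# Illustrative concatenation + exact-duplicate removal over TSV sentence pairs.
import pandas as pd

parts = [
    pd.read_csv(path, sep="\t", names=["ca", "zh"], quoting=3)
    for path in ["web_crawled.tsv", "from_es.tsv", "from_en.tsv"]  # hypothetical files
]
corpus = pd.concat(parts, ignore_index=True).drop_duplicates()
corpus.to_csv("ca_zh.final.tsv", sep="\t", index=False, header=False)
print(len(corpus))  # the card reports 94,187,858 pairs at this point
```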
@@ -149,7 +149,7 @@ These models provide direct assessment scores for each translation. The scores f
 #### Training
 
 The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
- The model was trained for 245.000 updates.
+ The model was trained for 245,000 updates.
 
 Following fine-tuning on the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates.
 
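The CPO phase in this hunk names the Hugging Face CPO Trainer but gives no code, so the following is only a rough sketch with TRL's `CPOTrainer`. Hyperparameters and dataset contents are placeholders, the tokenizer argument is named differently across trl versions (`tokenizer` vs. `processing_class`), and encoder-decoder support for M2M100 should be verified against your trl release.

```python
# Rough sketch of the CPO phase with TRL; all values are placeholders.
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

ckpt = "path/to/finetuned-m2m100-1.2B"  # the fine-tuned checkpoint (placeholder)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# CPO trains on preference triplets: source prompt, preferred and dispreferred translation.
train_dataset = Dataset.from_dict({
    "prompt":   ["Benvingut al projecte Aina!"],
    "chosen":   ["欢迎来到 Aina 项目!"],
    "rejected": ["欢迎你来艾娜计划!"],
})

args = CPOConfig(output_dir="m2m100-cpo", max_steps=1500, beta=0.1)  # 1,500 updates per the card
trainer = CPOTrainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```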
@@ -167,15 +167,15 @@ Below are the evaluation results on the Projecte Aina's Catalan-Chinese test set
 
 ### Evaluation results
 
- Below are the evaluation results on the machine translation from Chinese to Catalan compared to [Google Translate](https://translate.google.com/):
+ Below are the evaluation results for machine translation from Catalan to Chinese, compared to [Google Translate](https://translate.google.com/):
 
 
 #### Projecte Aina's Catalan-Chinese evaluation dataset
 
 | | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
- |:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
- | aina-translator-zh-ca-v2 | **28.55** | **57.64** | **0.87** | **0.82** |
- | Google Translate | 26.84 | 55.7 | 0.86 | **0.82** |
+ |:-----------------------|-------:|------:|-------:|--------:|
+ | aina-translator-ca-zh-v2 | 43.88 | 40.19 | **0.87** | **0.81** |
+ | Google Translate | **44.64** | **41.15** | **0.87** | 0.80 |
 
 
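The four metrics in the corrected table can be computed with standard tooling; here is a hedged sketch using sacrebleu for BLEU/ChrF and Unbabel COMET for the neural metrics. The exact checkpoints and settings used for the card's numbers are not stated, so these are assumptions.

```python
# Hedged evaluation sketch: BLEU/ChrF via sacrebleu, COMET via Unbabel's package.
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["Benvingut al projecte Aina!"]   # Catalan sources
hyps = ["欢迎来到 Aina 项目!"]            # system outputs
refs = ["欢迎来到 Aina 项目!"]            # Chinese references

print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh").score)  # BLEU, Chinese tokenizer
print(sacrebleu.corpus_chrf(hyps, [refs]).score)                 # ChrF

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))  # assumed checkpoint
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet.predict(data, batch_size=8).system_score)
# COMET-Kiwi is the same call with a reference-free model, e.g. Unbabel/wmt22-cometkiwi-da.
```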
 