---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---

## Projecte Aina’s Catalan-Chinese machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

This machine translation model is built upon M2M100 1.2B, fine-tuned specifically for Catalan-Chinese translation. It is trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs. Of these, 113,305 sentence pairs were parallel data collected from the web, while the remaining 94,074,553 sentence pairs were synthetic parallel data created using the [Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).

The model was evaluated on Projecte Aina's Catalan-Chinese evaluation dataset (unpublished), achieving results comparable to those of Google Translate.

## Intended uses and limitations

You can use this model for machine translation from Catalan to Simplified Chinese.

## How to use

### Usage

Translate a sentence using Python:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-ca-zh"

# Load the fine-tuned M2M100 model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "Benvingut al projecte Aina!"

# Tokenize the Catalan input and generate the Chinese translation
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

generated_translation = tokenizer.decode(
    output_ids[0], skip_special_tokens=True, spaces_between_special_tokens=False
).strip()
print(generated_translation)
# 欢迎来到 Aina 项目!
```
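
Note that the generation settings above (beam search with 5 beams and a 200-token length limit) mirror the standard setting used for the evaluation reported below.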

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|---------------------------|
| OpenSubtitles | 139,300                   |
| WikiMatrix    | 90,643                    |
| Wikipedia     | 68,623                    |
| **Total**     | **298,566**               |

94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset         | Sentences before cleaning |
|-----------------|---------------------------|
| NLLB            | 24,051,233                |
| UNPC            | 17,599,223                |
| MultiUN         | 9,847,770                 |
| OpenSubtitles   | 9,319,658                 |
| MultiParaCrawl  | 3,410,087                 |
| MultiCCAligned  | 3,006,694                 |
| WikiMatrix      | 1,214,322                 |
| News Commentary | 375,982                   |
| Tatoeba         | 9,404                     |
| **Total**       | **68,834,373**            |

**English-Chinese:**

| Dataset    | Sentences before cleaning |
|------------|---------------------------|
| NLLB       | 71,383,325                |
| CCAligned  | 15,181,415                |
| Paracrawl  | 14,170,869                |
| WikiMatrix | 2,595,119                 |
| **Total**  | **103,330,728**           |

### Training procedure

#### Data preparation

**Catalan-Chinese parallel data**

The Chinese side of all datasets was first processed with the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).
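
As an illustration of this step, the snippet below detects Traditional Chinese with `hanzidentifier` and converts it with OpenCC. This is a minimal sketch of the described pipeline, not the exact script used; in particular, the `t2s` conversion profile is an assumption.

```python
import hanzidentifier
from opencc import OpenCC

# Assumption: the plain Traditional-to-Simplified profile; the OpenCC
# configuration used in the actual pipeline is not documented.
t2s = OpenCC("t2s")

def to_simplified(text: str) -> str:
    """Convert a Chinese sentence to Simplified Chinese if needed."""
    script = hanzidentifier.identify(text)
    if script in (hanzidentifier.TRADITIONAL, hanzidentifier.MIXED):
        return t2s.convert(text)
    return text  # already Simplified, or script undetermined

print(to_simplified("歡迎來到 Aina 項目!"))  # -> 欢迎来到 Aina 项目!
```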

All data was then filtered according to two specific criteria (a combined sketch of both filters follows the list):

- Alignment: sentence-level alignment scores were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability of being the target language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py), and sentences with a language probability score below 0.5 were discarded.

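The sketch below shows how such filtering could be implemented with `sentence-transformers` and `lingua-py`. It is an illustrative reconstruction under the stated thresholds (0.75 and 0.5), not the project's actual filtering code.

```python
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(
    Language.CATALAN, Language.CHINESE
).build()

def keep_pair(ca: str, zh: str) -> bool:
    """Apply the alignment and language-identification filters."""
    # Alignment: cosine similarity between LaBSE sentence embeddings.
    emb = labse.encode([ca, zh])
    if util.cos_sim(emb[0], emb[1]).item() < 0.75:
        return False
    # Language identification: confidence that each side is in its
    # expected language (threshold 0.5, as stated above).
    if detector.compute_language_confidence(ca, Language.CATALAN) < 0.5:
        return False
    if detector.compute_language_confidence(zh, Language.CHINESE) < 0.5:
        return False
    return True

print(keep_pair("Benvingut al projecte Aina!", "欢迎来到 Aina 项目!"))
```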

Next, the Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while the English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.

**Catalan-Chinese Contrastive Preference Optimization dataset**

The CPO dataset is built by comparing the quality of translations across four distinct sources:

- Reference translation: Chinese sentences from the Flores dev set, the Flores devtest set, and the NTREX dataset.
- aina-translator-ca-zh: a specialized bilingual model for Catalan-Chinese translation.
- Google Translate: a widely used general-purpose machine translation system.
- OpenAI GPT-4: a large-scale language model capable of performing a wide range of tasks in conversational settings, including high-quality translation.

To evaluate the quality of translations without relying on human annotations, we employ two reference-free evaluation models:

- [Unbabel/wmt23-cometkiwi-da-xxl](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl)
- [Unbabel/XCOMET-XXL](https://huggingface.co/Unbabel/XCOMET-XXL)

These models provide direct assessment scores for each translation. The scores from both models are averaged to determine the relative quality of each translation. Based on this evaluation, the highest-scoring ("chosen") and lowest-scoring ("rejected") translations are identified for each source sentence, forming contrastive pairs. The CPO dataset comprises a total of 4,006 such pairs of "chosen" and "rejected" translations.

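Once each candidate translation has its two quality-estimation scores, the pair-selection logic reduces to simple bookkeeping. The sketch below illustrates it in plain Python; the dictionary layout and field names are invented for the example.

```python
# Hypothetical per-sentence scores: each candidate carries its
# CometKiwi-XXL and XCOMET-XXL quality-estimation scores.
candidates = {
    "reference":             {"text": "...", "scores": (0.86, 0.90)},
    "aina-translator-ca-zh": {"text": "...", "scores": (0.84, 0.88)},
    "google-translate":      {"text": "...", "scores": (0.82, 0.85)},
    "gpt-4":                 {"text": "...", "scores": (0.80, 0.83)},
}

def average(scores):
    return sum(scores) / len(scores)

# Rank the four systems by averaged score for this source sentence.
ranked = sorted(candidates.values(), key=lambda c: average(c["scores"]))
rejected, chosen = ranked[0], ranked[-1]

# One contrastive pair per source sentence, as used for CPO training.
pair = {"chosen": chosen["text"], "rejected": rejected["text"]}
```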

#### Training

The training was executed on NVIDIA GPUs using the Hugging Face Transformers framework. The model was trained for 245,000 updates.

Following fine-tuning of the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates, as sketched below.

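A minimal sketch of the CPO phase with TRL's `CPOTrainer` might look as follows, assuming a recent TRL release where the trainer accepts a seq2seq model and a `prompt`/`chosen`/`rejected` dataset. The checkpoint path, hyperparameters, and dataset rows are placeholders; only the use of the CPO Trainer on the chosen/rejected pairs is taken from the description above.

```python
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Start from the fine-tuned checkpoint (path assumed for illustration).
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/finetuned-m2m100-ca-zh")
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-m2m100-ca-zh")

# The CPO dataset: one Catalan prompt with a chosen and a rejected
# Chinese translation per row (4,006 rows in total).
train_dataset = Dataset.from_dict({
    "prompt":   ["Benvingut al projecte Aina!"],
    "chosen":   ["欢迎来到 Aina 项目!"],
    "rejected": ["欢迎项目 Aina!"],
})

args = CPOConfig(output_dir="cpo-ca-zh", max_steps=1500)  # 1,500 updates

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```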

## Evaluation

### Variables and metrics

Below are the evaluation results on Projecte Aina's Catalan-Chinese test set (unpublished), compared to Google Translate for the CA-ZH direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) following the standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics (a reproduction sketch follows the list):

- BLEU: SacreBLEU implementation, version 2.4.0.
- ChrF: SacreBLEU implementation.
- Comet: model checkpoint "Unbabel/wmt22-comet-da".
- Comet-kiwi: model checkpoint "Unbabel/wmt22-cometkiwi-da".

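For reference, the string-based metrics can be reproduced directly with the `sacrebleu` Python package (the actual evaluation ran through `tower-eval`, which wraps these implementations; the single-sentence data here is illustrative):

```python
import sacrebleu

# Model hypotheses and reference translations, one entry per segment.
hypotheses = ["欢迎来到 Aina 项目!"]
references = [["欢迎来到 Aina 项目!"]]  # list of reference streams

# SacreBLEU's Chinese tokenizer is needed for meaningful BLEU on ZH text.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
chrf = sacrebleu.corpus_chrf(hypotheses, references)

print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```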

### Evaluation results

Below are the evaluation results for machine translation from Catalan to Chinese, compared to [Google Translate](https://translate.google.com/):

#### Projecte Aina's Catalan-Chinese evaluation dataset

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-ca-zh | 43.88     | 40.19     | **0.87** | **0.81**     |
| Google Translate      | **44.64** | **41.15** | **0.87** | 0.80         |


## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>