xixianliao committed
Commit 419b296
1 Parent(s): 064902a

Update README.md


Browse files
Files changed (1) hide show
  1. README.md +218 -218
README.md CHANGED
@@ -1,218 +1,218 @@
---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---
## Projecte Aina’s Chinese-Catalan machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

This machine translation model is built upon M2M100 1.2B, fine-tuned specifically for Chinese-Catalan translation.
It was trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs: 113,305 pairs of parallel data collected from the web,
and the remaining 94,074,553 pairs of synthetic parallel data created using the
[Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
The model was evaluated on the Flores, NTREX, and Projecte Aina's Catalan-Chinese evaluation datasets, achieving results comparable to those of Google Translate.

## Intended uses and limitations

You can use this model for machine translation from Simplified Chinese to Catalan.

## How to use

### Usage

Translate a sentence using Python:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-zh-ca"

# Load the fine-tuned model and its tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "欢迎来到 Aina 项目!"

# Beam search with beam size 5 and a 200-token limit, matching the evaluation setting.
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(generated_translation)
# Benvingut al projecte Aina!
```
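
To translate several sentences at once, batching is usually more efficient. Below is a minimal sketch under the same decoding settings; the example sentences and the device handling are illustrative additions, not part of the original card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-zh-ca"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentences = ["今天天气很好。", "这本书很有意思。"]  # illustrative inputs

# Pad to the longest sentence so the batch forms one rectangular tensor.
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(device)
output_ids = model.generate(**batch, max_length=200, num_beams=5)
# As an M2M100-based model, generate() also accepts
# forced_bos_token_id=tokenizer.get_lang_id("ca") if the target
# language ever needs to be pinned explicitly.

for src, tgt in zip(sentences, tokenizer.batch_decode(output_ids, skip_special_tokens=True)):
    print(f"{src} -> {tgt}")
```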

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and this model card will be updated once that work is completed.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|---------------------------|
| OpenSubtitles | 139,300                   |
| WikiMatrix    | 90,643                    |
| Wikipedia     | 68,623                    |
| **Total**     | **298,566**               |

94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset         | Sentences before cleaning |
|-----------------|---------------------------|
| NLLB            | 24,051,233                |
| UNPC            | 17,599,223                |
| MultiUN         | 9,847,770                 |
| OpenSubtitles   | 9,319,658                 |
| MultiParaCrawl  | 3,410,087                 |
| MultiCCAligned  | 3,006,694                 |
| WikiMatrix      | 1,214,322                 |
| News Commentary | 375,982                   |
| Tatoeba         | 9,404                     |
| **Total**       | **68,834,373**            |

**English-Chinese:**

| Dataset    | Sentences before cleaning |
|------------|---------------------------|
| NLLB       | 71,383,325                |
| CCAligned  | 15,181,415                |
| Paracrawl  | 14,170,869                |
| WikiMatrix | 2,595,119                 |
| **Total**  | **103,330,728**           |

### Training procedure

#### Data preparation

The Chinese side of all datasets was first processed using the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).
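
For illustration, a minimal sketch of this script-normalization step, assuming the `hanzidentifier` and `opencc` Python packages; the helper name is ours:

```python
# A minimal sketch of the normalization step described above; the helper
# name normalize_to_simplified is illustrative.
import hanzidentifier
from opencc import OpenCC

t2s = OpenCC("t2s")  # Traditional-to-Simplified converter

def normalize_to_simplified(text: str) -> str:
    # Convert only text detected as Traditional Chinese; leave the rest unchanged.
    if hanzidentifier.is_traditional(text):
        return t2s.convert(text)
    return text
```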

All data was then filtered according to two specific criteria (a sketch of both checks follows the list):

- Alignment: sentence-level alignment scores were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability of being the target language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py), and sentences with a language probability score below 0.5 were discarded.
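
A minimal sketch of these two checks, assuming the sentence-transformers release of LaBSE and the lingua-language-detector package; the thresholds come from the text above, while the function name is ours:

```python
# Thresholds (0.75, 0.5) are from the card; the function name and package
# choices are illustrative.
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(Language.CHINESE, Language.CATALAN).build()

def keep_pair(zh: str, ca: str) -> bool:
    # Criterion 1: discard pairs whose LaBSE alignment score is below 0.75.
    zh_emb, ca_emb = labse.encode([zh, ca], normalize_embeddings=True)
    if float(cos_sim(zh_emb, ca_emb)) < 0.75:
        return False
    # Criterion 2: discard sentences whose target-language probability is below 0.5.
    return detector.compute_language_confidence(ca, Language.CATALAN) >= 0.5
```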

Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.

#### Training

The training was executed on NVIDIA GPUs using the Hugging Face Transformers framework.
The model was trained for 244,500 updates, with weights saved every 500 updates.
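
The card does not publish the full training configuration; as a hedged sketch, the reported schedule could be expressed with the Transformers trainer as follows, where only `max_steps` and `save_steps` come from the text and every other value is an illustrative placeholder:

```python
# Only max_steps and save_steps are from the card; all other
# hyperparameters here are illustrative placeholders.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100-zh-ca",       # illustrative
    max_steps=244_500,               # total updates reported above
    save_steps=500,                  # checkpoint every 500 updates
    per_device_train_batch_size=8,   # illustrative
    learning_rate=5e-5,              # illustrative
    predict_with_generate=True,
)
```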

## Evaluation

### Variables and metrics

Below are the evaluation results on [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200),
[NTREX](https://github.com/MicrosoftTranslator/NTREX), and Projecte Aina's Catalan-Chinese test sets (unpublished), compared to Google Translate for the ZH-CA direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) following the standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics (a minimal scoring sketch follows the list):

- BLEU: SacreBLEU implementation, version 2.4.0.
- ChrF: SacreBLEU implementation.
- Comet: model checkpoint "Unbabel/wmt22-comet-da".
- Comet-kiwi: model checkpoint "Unbabel/wmt22-cometkiwi-da".
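
For orientation, a minimal sketch of corpus-level BLEU and ChrF scoring with sacrebleu (the COMET scores come from the separate unbabel-comet package); the strings below are illustrative, not drawn from the test sets:

```python
# Illustrative sacrebleu usage; hypothesis/reference strings are made up.
import sacrebleu

hypotheses = ["Benvingut al projecte Aina!"]
references = [["Benvingut al projecte Aina!"]]  # one inner list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```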

### Evaluation results

Below are the evaluation results for machine translation from Chinese to Catalan, compared to [Google Translate](https://translate.google.com/):

#### Flores200-dev

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca | 26.74     | 54.49     | **0.86** | **0.82**     |
| Google Translate      | **27.71** | **55.37** | **0.86** | 0.81         |

#### Flores200-devtest

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca | 27.17     | 55.02     | **0.86** | **0.81**     |
| Google Translate      | **27.47** | **55.51** | **0.86** | **0.81**     |

#### NTREX

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca | 22.43     | 50.65     | **0.83** | **0.79**     |
| Google Translate      | **23.49** | **51.29** | **0.83** | **0.79**     |

#### Projecte Aina's Catalan-Chinese evaluation dataset

|                       | BLEU ↑    | ChrF ↑    | Comet ↑  | Comet-kiwi ↑ |
|:----------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca | **29.21** | 57.41     | **0.87** | **0.82**     |
| Google Translate      | 28.86     | **57.73** | **0.87** | **0.82**     |

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it),
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.

</details>