Commit 0af6df0 by xixianliao (parent: e136825): Update

README.md
## Model description

This machine translation model is built upon M2M100 1.2B, fine-tuned specifically for Chinese-Catalan translation.
It is trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs: 113,305 sentence pairs were parallel data collected from the web, while the remaining 94,074,553 sentence pairs were parallel synthetic data created using the [Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
The model was evaluated on the Flores, NTREX, and Projecte Aina's Catalan-Chinese evaluation datasets, achieving results comparable to those of Google Translate.
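As a quick sanity check, the two portions of the corpus described above add up to the stated total:

```python
# Corpus composition figures as stated in the model description.
web_pairs = 113_305           # parallel data collected from the web
synthetic_pairs = 94_074_553  # synthetic data from ES-CA and EN-CA translation
total_pairs = 94_187_858      # stated size of the final training corpus

assert web_pairs + synthetic_pairs == total_pairs
print(f"{total_pairs:,} sentence pairs")  # prints: 94,187,858 sentence pairs
```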

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|---------------------------|
| OpenSubtitles | 139,300                   |
| WikiMatrix    | 90,643                    |
| Wikipedia     | 68,623                    |
| **Total**     | **298,566**               |
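The per-dataset counts in the table are consistent with the stated total:

```python
# Web-crawled Catalan-Chinese data, sentences before cleaning (from the table above).
web_datasets = {
    "OpenSubtitles": 139_300,
    "WikiMatrix": 90_643,
    "Wikipedia": 68_623,
}
assert sum(web_datasets.values()) == 298_566
```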

94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets and English-Chinese datasets:
**Spanish-Chinese:**

| Dataset         | Sentences before cleaning |
|-----------------|---------------------------|
| NLLB            | 24,051,233                |
| UNPC            | 17,599,223                |
| MultiUN         | 9,847,770                 |
| OpenSubtitles   | 9,319,658                 |
| MultiParaCrawl  | 3,410,087                 |
| MultiCCAligned  | 3,006,694                 |
| WikiMatrix      | 1,214,322                 |
| News Commentary | 375,982                   |
| Tatoeba         | 9,404                     |
| **Total**       | **68,834,373**            |

**English-Chinese:**

| Dataset    | Sentences before cleaning |
|------------|---------------------------|
| NLLB       | 71,383,325                |
| CCAligned  | 15,181,415                |
| Paracrawl  | 14,170,869                |
| WikiMatrix | 2,595,119                 |
| **Total**  | **103,330,728**           |

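Both totals check out against the per-dataset counts. Note that these are raw sizes before cleaning: the 172,165,101 combined Spanish-Chinese and English-Chinese pairs are reduced to the 94,074,553 synthetic pairs that survive filtering and deduplication.

```python
# Per-dataset counts from the two tables above, sentences before cleaning.
spanish_chinese = {
    "NLLB": 24_051_233, "UNPC": 17_599_223, "MultiUN": 9_847_770,
    "OpenSubtitles": 9_319_658, "MultiParaCrawl": 3_410_087,
    "MultiCCAligned": 3_006_694, "WikiMatrix": 1_214_322,
    "News Commentary": 375_982, "Tatoeba": 9_404,
}
english_chinese = {
    "NLLB": 71_383_325, "CCAligned": 15_181_415,
    "Paracrawl": 14_170_869, "WikiMatrix": 2_595_119,
}
assert sum(spanish_chinese.values()) == 68_834_373
assert sum(english_chinese.values()) == 103_330_728
```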
### Training procedure

All data was then filtered according to two specific criteria. Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.
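The card does not specify the exact deduplication criteria, but the concatenate-and-deduplicate step can be sketched minimally as dropping exact duplicate sentence pairs while preserving order:

```python
def concatenate_and_deduplicate(*corpora):
    """Merge lists of (source, target) pairs, keeping the first occurrence
    of each exact pair. A minimal illustration only; the authors' actual
    deduplication criteria are not specified in the card."""
    seen = set()
    merged = []
    for corpus in corpora:
        for pair in corpus:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged

web = [("Bon dia", "早上好"), ("Gràcies", "谢谢")]
synthetic = [("Gràcies", "谢谢"), ("Adéu", "再见")]
print(concatenate_and_deduplicate(web, synthetic))
# prints: [('Bon dia', '早上好'), ('Gràcies', '谢谢'), ('Adéu', '再见')]
```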
#### Training

The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
The model was trained for 244,500 updates.
Weights were saved every 500 updates.
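Assuming a checkpoint is written at every 500-update boundary, this schedule yields 489 saved checkpoints over the full run:

```python
total_updates = 244_500  # total training updates stated above
save_interval = 500      # checkpoint frequency stated above

assert total_updates % save_interval == 0  # run ends exactly on a save boundary
checkpoints = total_updates // save_interval
print(checkpoints)  # prints: 489
```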
## Evaluation