xixianliao committed on
Commit 0af6df0
1 Parent(s): e136825
Files changed (1)
  1. README.md +24 -24
README.md CHANGED
@@ -27,9 +27,9 @@ base_model:
 
 ## Model description
 
-This machine translation model is built upon the foundation of M2M100 1.2B.
+This machine translation model is built upon M2M100 1.2B and fine-tuned specifically for Chinese-Catalan translation.
 It is trained on a combination of Catalan-Chinese datasets
-totalling 94.187.858 sentence pairs. 113.305 sentence pairs were parallel data collected from the web, while the remaining 94.074.553 sentence pairs
+totalling 94,187,858 sentence pairs. 113,305 sentence pairs were parallel data collected from the web, while the remaining 94,074,553 sentence pairs
 were parallel synthetic data created using the
 [Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
 The model was evaluated on the Flores, NTREX, and Projecte Aina's Catalan-Chinese evaluation datasets, achieving results comparable to those of Google Translate.
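
Since the card names M2M100 1.2B as the base, inference should go through the standard Transformers M2M100 API. A minimal Chinese-to-Catalan sketch (the checkpoint id below is a placeholder, not confirmed by this card; `"zh"` and `"ca"` are M2M100's language codes):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Placeholder id for illustration; substitute this repo's actual model id.
ckpt = "projecte-aina/m2m100-ca-zh"

tokenizer = M2M100Tokenizer.from_pretrained(ckpt)
model = M2M100ForConditionalGeneration.from_pretrained(ckpt)

tokenizer.src_lang = "zh"  # Chinese source
encoded = tokenizer("这是一个测试句子。", return_tensors="pt")

# M2M100 selects the target language by forcing its BOS token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ca"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```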
@@ -74,37 +74,37 @@ The Catalan-Chinese data collected from the web was a combination of the following
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| OpenSubtitles | 139.300 |
-| WikiMatrix | 90.643 |
-| Wikipedia | 68.623|
-| **Total** | **298.566** |
+| OpenSubtitles | 139,300 |
+| WikiMatrix | 90,643 |
+| Wikipedia | 68,623 |
+| **Total** | **298,566** |
 
-94.074.553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets and English-Chinese datasets:
+94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:
 
 **Spanish-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| NLLB |24.051.233|
-| UNPC | 17.599.223 |
-| MultiUN | 9.847.770 |
-| OpenSubtitles | 9.319.658 |
-| MultiParaCrawl | 3.410.087 |
-| MultiCCAligned | 3.006.694 |
-| WikiMatrix | 1.214.322 |
-| News Commentary | 375.982 |
-| Tatoeba | 9.404 |
-| **Total** | **68.834.373** |
+| NLLB | 24,051,233 |
+| UNPC | 17,599,223 |
+| MultiUN | 9,847,770 |
+| OpenSubtitles | 9,319,658 |
+| MultiParaCrawl | 3,410,087 |
+| MultiCCAligned | 3,006,694 |
+| WikiMatrix | 1,214,322 |
+| News Commentary | 375,982 |
+| Tatoeba | 9,404 |
+| **Total** | **68,834,373** |
 
 **English-Chinese:**
 
 | Dataset | Sentences before cleaning |
 |-------------------|----------------|
-| NLLB |71.383.325|
-| CCAligned | 15.181.415 |
-| Paracrawl | 14.170.869|
-| WikiMatrix | 2.595.119|
-| **Total** | **103.330.728** |
+| NLLB | 71,383,325 |
+| CCAligned | 15,181,415 |
+| Paracrawl | 14,170,869 |
+| WikiMatrix | 2,595,119 |
+| **Total** | **103,330,728** |
 
 
 ### Training procedure
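
The synthetic portion of the corpus is built by pivoting: the Spanish or English side of an existing *-Chinese corpus is machine-translated into Catalan and re-paired with the untouched Chinese side. A sketch of that step, assuming a `translate` callable wrapping the relevant Aina model (its actual interface is not specified in this card):

```python
# Pivot-based synthetic data creation (sketch).
# `translate` is an assumed callable wrapping the Aina es-ca (or en-ca) model;
# the card does not specify how those models are invoked.
def make_synthetic_pairs(pivot_zh_pairs, translate):
    """Turn Spanish/English-Chinese pairs into Catalan-Chinese pairs."""
    ca_zh_pairs = []
    for pivot_sentence, zh_sentence in pivot_zh_pairs:
        ca_sentence = translate(pivot_sentence)       # es/en -> ca
        ca_zh_pairs.append((ca_sentence, zh_sentence))  # Chinese side kept verbatim
    return ca_zh_pairs
```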
@@ -121,13 +121,13 @@ All data was then filtered according to two specific criteria:
 
 Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).
 
-The filtered and translated datasets are then concatenated and deduplicated to form a final corpus of 94.187.858.
+The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.
 
 
 #### Training
 
 The training was executed on NVIDIA GPUs utilizing the Hugging Face Transformers framework.
-The model was trained for 244.500 updates.
+The model was trained for 244,500 updates.
 Weights were saved every 500 updates.
 
 ## Evaluation
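
The concatenate-and-deduplicate step above is not specified further; a common minimal recipe is exact deduplication on whitespace-normalized pairs, sketched here:

```python
def dedup_pairs(pairs):
    """Exact deduplication of (source, target) pairs after whitespace normalization.

    A sketch; the card does not say which deduplication criterion was used.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (" ".join(src.split()), " ".join(tgt.split()))
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique
```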
 
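The card states only that training used the Hugging Face Transformers framework for 244,500 updates, saving weights every 500 updates. In `Trainer` terms those two facts map onto something like this (every other hyperparameter below is a placeholder, not taken from the card):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-ca-zh-checkpoints",
    max_steps=244_500,    # "trained for 244,500 updates"
    save_strategy="steps",
    save_steps=500,       # "weights were saved every 500 updates"
    per_device_train_batch_size=8,  # placeholder
    learning_rate=5e-5,             # placeholder
    fp16=True,                      # plausible on NVIDIA GPUs, not stated
)
```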
 
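For the evaluation on Flores, NTREX, and the Projecte Aina test set, the card does not name the metric; BLEU via sacrebleu is the usual choice, with its `zh` tokenizer when Chinese is the target side:

```python
import sacrebleu

# Hypothetical lists of system outputs and references for one test set.
hypotheses = ["..."]    # model translations, one string per segment
references = [["..."]]  # one (or more) lists of reference translations

# tokenize="zh" is appropriate when the target language is Chinese.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"BLEU = {bleu.score:.1f}")
```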