Update README.md
README.md
CHANGED
@@ -232,9 +232,11 @@ print("Generated Translations:", results_detokenized)
 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
 
-This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
+[Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 
-![](./
+![](./main/treemap.png)
 
 Click the expand button below to see the full list of corpora included in the training data.
 
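The synthetic-data step described in the added README lines (translating the Spanish side of Spanish <-> xx pairs into Catalan to obtain Catalan <-> xx pairs) could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `synthesize_catalan_pairs`, `toy_translator`, and the `batch_size` parameter are hypothetical names introduced here, and the translator is passed in as a plain callable so that any Spanish-to-Catalan model, such as the one linked in the README, could be wrapped and plugged in.

```python
from typing import Callable, Iterable, List, Tuple


def synthesize_catalan_pairs(
    es_xx_pairs: Iterable[Tuple[str, str]],
    translate_es_to_ca: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[Tuple[str, str]]:
    """Turn Spanish <-> xx pairs into synthetic Catalan <-> xx pairs.

    `translate_es_to_ca` is a hypothetical callable that maps a batch of
    Spanish sentences to Catalan sentences, one output per input.
    """
    pairs = list(es_xx_pairs)
    synthetic: List[Tuple[str, str]] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        spanish = [es for es, _ in batch]
        catalan = translate_es_to_ca(spanish)
        if len(catalan) != len(spanish):
            raise ValueError("translator must return one output per input")
        # Keep the xx side untouched; only the Spanish side is replaced.
        synthetic.extend((ca, xx) for ca, (_, xx) in zip(catalan, batch))
    return synthetic


def toy_translator(sentences: List[str]) -> List[str]:
    # Placeholder lookup standing in for a real Spanish-Catalan model.
    lookup = {"Buenos días": "Bon dia", "Gracias": "Gràcies"}
    return [lookup.get(s, s) for s in sentences]


example = synthesize_catalan_pairs(
    [("Buenos días", "Good morning"), ("Gracias", "Thank you")],
    toy_translator,
    batch_size=1,
)
```

Passing the translator as a callable keeps the pairing logic independent of any particular inference library; the real model, batching strategy, and any filtering of the synthetic output are not described in the diff and are left out here.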