BSC-LT
/

salamandraTA-2B

text-generation

text-generation-inference

Inference Endpoints

🇪🇺 Region: EU

Model card Files Files and versions Community

AudreyVM commited on Nov 4, 2024

Commit

2612287

·

verified ·

1 Parent(s): 7b0fc6b

Update README.md

Files changed (1) hide show

README.md +4 -2

README.md CHANGED Viewed

@@ -232,9 +232,11 @@ print("Generated Translations:", results_detokenized)
 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
-This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora. Where little parallel Catalan <-> data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using Project Aina’s es-> ca model. (link and correct name). The final distribution of languages was as below:
-![](./images/treemap.png)
 Click the expand button below to see the full list of corpora included in the training data.

 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
+This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
+Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
+![](./main/treemap.png)
 Click the expand button below to see the full list of corpora included in the training data.