### Pretraining Data
The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
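
Part of this corpus was built by pivoting through Spanish: where a language pair lacked direct Catalan data, the Spanish side of a Spanish<->xx pair was machine-translated into Catalan. A minimal sketch of that pivoting step (the `translate_es_to_ca` function here is a hypothetical stand-in for the actual es->ca model, not Project Aina's API):

```python
# Sketch: turn Spanish<->xx parallel pairs into synthetic Catalan<->xx pairs
# by translating the Spanish side. `translate_es_to_ca` is a placeholder;
# a real run would call an es->ca machine translation model.

def translate_es_to_ca(text: str) -> str:
    # Placeholder for the actual es->ca translation model.
    raise NotImplementedError("plug in a real es->ca translator")

def pivot_to_catalan(es_xx_pairs, translate=None):
    """Replace the Spanish side of (es, xx) pairs with translated Catalan."""
    translate = translate or translate_es_to_ca
    return [(translate(es), xx) for es, xx in es_xx_pairs]

# Toy usage with a dummy "translator" that just tags the text:
pairs = [("Hola, mundo", "Hello, world")]
synthetic = pivot_to_catalan(pairs, translate=lambda s: f"[ca] {s}")
# synthetic == [("[ca] Hola, mundo", "Hello, world")]
```

The same pattern extends to every Spanish<->xx corpus in the collection; only the translation callable changes.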

This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina's existing corpora. Where little parallel Catalan<->xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish<->xx corpora using Project Aina's es->ca model (link and correct name). The final distribution of languages is shown below: