AudreyVM committed on
Commit 7b0fc6b · verified
1 Parent(s): 80a2ca8

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -229,7 +229,8 @@ print("Generated Translations:", results_detokenized)

  ### Pretraining Data

- The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese. It amounts to xxxxx parallel sentence pairs.
+ The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
+ Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.

  This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora. Where little parallel Catalan <-> data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using Project Aina’s es-> ca model. (link and correct name). The final distribution of languages was as below: