Fairseq · Catalan · German

set back-translation as main #1

Files changed (5)
  1. README.md +14 -21
  2. config.json +0 -1
  3. model.bin +2 -2
  4. shared_vocabulary.json +0 -0
  5. spm.model +2 -2
README.md CHANGED
@@ -13,8 +13,8 @@ library_name: fairseq
 
 ## Model description
 
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets, totalling 100.000.000 sentence pairs.
-6.258.272 sentence pairs were parallel data collected from the web while the remaining 93.741.728 sentence pairs were parallel synthetic data created using the ES-CA translator of [PlanTL](https://huggingface.co/PlanTL-GOB-ES/mt-plantl-es-ca). The model was evaluated on the Flores and NTREX evaluation datasets.
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
+which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
@@ -36,7 +36,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")
 
-tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.50k.model")
+tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
 tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
 translator = ctranslate2.Translator(model_dir)
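This hunk only changes the SentencePiece file the README points at (`spm.50k.model` to `spm.model`, matching the renamed file in this PR). For context, a minimal end-to-end sketch of the updated usage, assuming the README's surrounding lines and the current `ctranslate2`/`pyonmttok` APIs:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Fetch the model files, including the SentencePiece model updated in this PR.
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize the Catalan source with the bundled SentencePiece model.
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokens, _ = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate with CTranslate2 and detokenize the best hypothesis.
translator = ctranslate2.Translator(model_dir)
results = translator.translate_batch([tokens])
print(tokenizer.detokenize(results[0].hypotheses[0]))
```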
@@ -52,7 +52,7 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The Catalan-German data collected from the web was a combination of the following datasets:
+The model was trained on a combination of the following datasets:
 
 | Dataset | Sentences | Sentences after Cleaning|
 |-------------------|----------------|-------------------|
@@ -60,27 +60,19 @@ The Catalan-German data collected from the web was a combination of the followin
 | WikiMatrix | 180.322 | 125.811 |
 | GNOME | 12.333| 1.241|
 | KDE4 | 165.439 | 105.098 |
+| QED | 63.041 | 49.181 |
+| TED2020 v1 | 46.680 | 38.428 |
 | OpenSubtitles | 303.329 | 171.376 |
 | GlobalVoices| 4.636 | 3.578|
 | Tatoeba | 732 | 655 |
 | Books | 4.445 | 2049 |
 | Europarl | 1.734.643 | 1.734.643 |
 | Tilde | 3.434.091 | 3.434.091 |
+| **Total** | **7.427.843** | **6.258.272** |
 
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
-The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-German corpora by [SoftCatalà](https://github.com/Softcatala).
+The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
 
-The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
-
-| Dataset | Sentences before cleaning |
-|-------------------|----------------|
-|globalvoices_es-de_20230901 | 70.097 |
-|multiparacrawl_es-de_20230901 | 56.873.541 |
-|dgt_es-de_20240129 | 4.899.734 |
-|eubookshop_es-de_20240129 | 4.750.170 |
-|nllb_es-de_20240129 | 112.444.838 |
-|opensubtitles_es-de_20240129 | 18.951.214 |
-| **Total** | **197.989.594** |
 
 ### Training procedure
 
@@ -88,7 +80,8 @@ The 93.741.728 sentence pairs of synthetic parallel data were created from the f
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized
+using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
 #### Tokenization
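The LaBSE-based filter this hunk describes is easy to reproduce in outline. A minimal sketch, not the project's actual pipeline, using the `sentence-transformers` package and the 0.75 cosine-similarity threshold stated in the README:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# LaBSE embeds sentences from different languages into one shared space,
# so genuinely parallel pairs should have high cosine similarity.
model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(src_sents, tgt_sents, threshold=0.75):
    """Keep only sentence pairs whose LaBSE cosine similarity is >= threshold."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    # With L2-normalized embeddings, the row-wise dot product is the cosine similarity.
    sims = np.sum(src_emb * tgt_emb, axis=1)
    return [(s, t) for s, t, sim in zip(src_sents, tgt_sents, sims) if sim >= threshold]
```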
@@ -137,10 +130,10 @@ and [Google Translate](https://translate.google.es/?hl=es):
 
 | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev | 26,2 | **34,8** | 34,1 |
-| Flores 101 devtest |26,3 | **34,0** | 33,3 |
-| NTREX | 21,7 | **28,8** | 27,8 |
-| Average | 24,7 | **32,5** | 31,7 |
+| Flores 101 dev | 26,2 | **34,8** | 27,5 |
+| Flores 101 devtest |26,3 | **34,0** | 26,9 |
+| NTREX | 21,7 | **28,8** | 20,4 |
+| Average | 24,7 | **32,5** | 24,9 |
 
 ## Additional information
 
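The figures in this table are BLEU scores on the Flores and NTREX test sets. The exact scoring setup is not given in the diff; a hedged sketch of how such scores are typically computed with `sacrebleu` (the inputs below are hypothetical):

```python
import sacrebleu

# Hypothetical inputs: system outputs and one German reference per sentence.
hypotheses = ["Willkommen beim Aina-Projekt!"]
references = [["Willkommen beim Projecte Aina!"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```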
config.json CHANGED
@@ -5,6 +5,5 @@
   "decoder_start_token": "</s>",
   "eos_token": "</s>",
   "layer_norm_epsilon": null,
-  "multi_query_attention": false,
   "unk_token": "<unk>"
 }
model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:38e98a6e9dd5e6c00cd94647c0695666d78c070bfb1c73ab70a36f7c6557a9e6
-size 1860811612
+oid sha256:e55f9e37a616e6d7cf7cc6111920e95133be662bbe4792cc6131f2df4ec25788
+size 1860745998
shared_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
spm.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ec1143f19e1763ed954c4060f440ade847163d8267ec9e131b7c829080df8eb7
-size 1182306
+oid sha256:8999682b24246eb4bac8e43e528a47fe5555a7101710f04f4d3780804a703a77
+size 1182213