Correct some stuff
README.md
CHANGED
@@ -156,7 +156,8 @@ print(tokenizer.decode(outputs[0]))
 ### Training
 
 - Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
-- Finetuning
+- Number of Finetuning Samples: 25M
+- Batch size: 256
 - Hardware: TPUv4-128
 - Software: T5X, Jax
 
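The training bullets above describe an mt5-xxl-style encoder-decoder checkpoint, and the hunk context (`print(tokenizer.decode(outputs[0]))`) shows the README already carries a generation snippet further up. As a rough illustration of what that snippet does, here is a minimal seq2seq generation sketch; the checkpoint id is a placeholder, not taken from this diff:

```python
# Hedged sketch, not the README's exact snippet; the repo id below is hypothetical.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "org/aya-model"  # placeholder id; substitute the real Hub repo

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)  # mt5-xxl-style seq2seq

inputs = tokenizer("Translate to English: Comment vas-tu?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For scale, the new bullets imply roughly 25,000,000 / 256 ≈ 98k optimizer steps per pass over the finetuning mixture; the card itself does not state a step count, so treat this as back-of-envelope arithmetic.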
@@ -174,47 +175,15 @@ All datasets are subset to the 101 languages supported by [mT5]. See the [paper]
 
 ## Evaluation
 
-> We introduce extensive new evaluation suites that broaden the state-of-the-art for multilingual eval across 99 languages – including discriminative, generative tasks, human evaluation and simulated win rates that cover both held-out tasks and
-> in-distribution performance.
-
-Below, we provide evaluation results for the Aya model on unseen discriminative tasks and in-distribution generative tasks, compared to mT0, BLOOMZ, Bactrian-X 13B, and mT0x. To ensure a fair comparison with our Aya model in terms of language coverage, we finetune a new variant of mT5, which we dub mT0x. It is trained on the original datasets from the xP3 collection, extended to 101 languages (xP3x).
-
-For Multilingual MMLU and Simulated and Human Win-rates, please refer to the [paper](arxiv.com).
-
-### Discriminative Tasks
-
-| Model             | Base Model | IFT Mixture | XCOPA (Acc %) | XNLI (Acc %) | XSC (Acc %) | XWG (Acc %) | **<u>Avg</u>** |
-| :---------------- | :--------- | :---------: | :-----------: | :----------: | :---------: | :---------: | :------------: |
-| **46 Languages**  |            |             |               |              |             |             |                |
-| mT0               | mT5 13B    | xP3         | 75.6          | 55.3         | 87.2        | 73.6        | 72.9           |
-| BLOOMZ            | BLOOM 176B | xP3         | 64.3          | 52.0         | 82.6        | 63.3        | 65.5           |
-| **52 Languages**  |            |             |               |              |             |             |                |
-| Bactrian-X 13B    | Llama 13B  | Bactrian-X  | 52.4          | 34.5         | 51.8        | 50.5        | 47.3           |
-| **101 Languages** |            |             |               |              |             |             |                |
-| mT0x              | mT5 13B    | xP3x        | 71.7          | 45.9         | 85.1        | 60.6        | 65.8           |
-| Aya model         | mT5 13B    | All Mixture | 76.7          | 58.3         | 90.0        | 70.7        | 73.9           |
-
-### Generative Tasks
-
-| Model             | Base Model | IFT Mixture | FLORES-200 X → En (spBleu) | FLORES-200 En → X (spBleu) | XLSum (RougeLsum) | Tydi-QA (F1) |
-| :---------------- | :--------: | :---------- | :------------------------: | :------------------------: | :---------------: | :----------: |
-| **101 Languages** |            |             |                            |                            |                   |              |
-| mT0x              | mT5 13B    | xP3x        | 20.2                       | 14.5                       | 21.4              | 76.1         |
-| Aya model         | mT5 13B    | All Mixture | 29.1                       | 19.0                       | 22.0              | 77.8         |
-
-Note: We cannot compare mT0 and BLOOMZ on the above generative tasks, as the validation splits are part of mT0's and BLOOMZ's training data.
+We refer to Section 5 of our paper for multilingual evaluation across 99 languages – including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance.
 
 ## Bias, Risks, and Limitations
 
-Like any base language model or fine-tuned model without safety filtering, it is relatively easy for a user to prompt these models to generate harmful and generally sensitive content.
-The Aya model, as released, does not include any safety filtering.
-We hope that the release of the Aya model will make community-based red-teaming efforts possible, by exposing an open-source massively multilingual model for community research.
 
 For a detailed overview of our effort at safety mitigation and benchmarking toxicity and bias across multiple languages, we refer to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
 
+We hope that the release of the Aya model will make community-based red-teaming efforts possible, by exposing an open-source massively multilingual model for community research.
+
 ## Citation
 
 **BibTeX:**
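A note on the metrics in the removed tables: spBleu is BLEU computed over SentencePiece pieces rather than word tokens, and RougeLsum is the summary-level ROUGE-L variant used for XLSum. A minimal sketch of how such scores are commonly computed, assuming sacrebleu's SentencePiece tokenizers and Google's rouge-score package; this is not the paper's actual evaluation harness:

```python
# Metric sketch only; library versions and tokenizer names are assumptions.
import sacrebleu                       # pip install "sacrebleu>=2.2"
from rouge_score import rouge_scorer   # pip install rouge-score

hyps = ["The cat sat on the mat."]
refs = ["The cat is sitting on the mat."]

# spBleu: assumes a sacrebleu release that ships the "flores200"
# SentencePiece tokenizer (older releases expose "spm" instead).
spbleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="flores200")
print(f"spBleu: {spbleu.score:.1f}")

# RougeLsum: split multi-sentence texts with newlines so the "sum"
# variant differs from plain rougeL.
scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)
rl = scorer.score(refs[0], hyps[0])["rougeLsum"].fmeasure
print(f"RougeLsum: {100 * rl:.1f}")
```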