Text2Text Generation
Transformers
Safetensors
t5
text-generation-inference
Inference Endpoints
Tomlim committed on
Commit
ac4d8b3
1 Parent(s): 1f3179d

Model Card

Files changed (1)
  1. README.md +522 -0
README.md CHANGED
@@ -1,3 +1,525 @@
  ---
  license: mit
+ language:
+ - af
+ - am
+ - ar
+ - az
+ - be
+ - bg
+ - bn
+ - ca
+ - ceb
+ - co
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fil
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - haw
+ - he
+ - hi
+ - hmn
+ - ht
+ - hu
+ - hy
+ - id
+ - ig
+ - is
+ - it
+ - iw
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lb
+ - lo
+ - lt
+ - lv
+ - mg
+ - mi
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - mt
+ - my
+ - ne
+ - nl
+ - 'no'
+ - ny
+ - pa
+ - pl
+ - ps
+ - pt
+ - ro
+ - ru
+ - sd
+ - si
+ - sk
+ - sl
+ - sm
+ - sn
+ - so
+ - sq
+ - sr
+ - st
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - tg
+ - th
+ - tr
+ - uk
+ - und
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - yo
+ - zh
+ - zu
+ datasets:
+ - mc4
  ---
+
+ # MyT5
+
+
+ ## Model Details
+
+ MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
+ The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
+ - **Funded by:** University of Washington Fellowship, Charles University Grant Agency
+ - **Model type:** T5
+ - **Language(s) (NLP):** Multilingual
+ - **License:** MIT
+
+ ### Model Sizes
+
+ - **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
+ - **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
+ - **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters
+
+ ### Model Sources
+
+ <!-- Provide the basic links for the model. -->
+
+ - **[Repository](https://github.com/tomlimi/MYTE)**
+ - **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**
+
+ ## How to Get Started with the Model
+
+ The snippet below shows the basic usage of the model for multilingual language modeling.
+ The custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
+ We also plan to release it on the Hugging Face Hub in the future.
+
+ ```python
+ from transformers import T5ForConditionalGeneration
+ from src.myt5.myt5_tokenizer import MyT5Tokenizer
+ import torch
+
+ MODEL_SIZE = "large"  # small, base, or large
+
+ model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/MyT5_{MODEL_SIZE}", use_safetensors=True)
+ tokenizer = MyT5Tokenizer()
+
+ pre_texts = ['"We now have',
+              '„Mamy teraz myszy w wieku',
+              '"""எங்களிடம் இப்போது']
+ post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
+               '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
+               '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']
+
+ # Tokenize the prefixes (encoder inputs) and continuations (decoder targets).
+ inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
+ targets = tokenizer(post_texts, padding="longest", return_tensors="pt")
+
+ # Teacher-forced forward pass; probabilities over the MYTE byte vocabulary.
+ outputs = model(**inputs, labels=targets.input_ids)
+ probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+ ```
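+
+ The same inputs can also be used for generation. The short follow-up below is a sketch that assumes `MyT5Tokenizer` exposes the standard Hugging Face `batch_decode` method (see `src/myt5/myt5_tokenizer.py` in the repository):
+
+ ```python
+ # Continuing the snippet above: generate continuations for the prefixes
+ # and decode the MYTE byte sequences back to text.
+ generated = model.generate(**inputs, max_new_tokens=128)
+ print(tokenizer.batch_decode(generated, skip_special_tokens=True))
+ ```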
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the standard T5 task of restoring corrupted spans in the multilingual mC4 dataset.
+
+ ### Preprocessing
+
+ Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation.
+ See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
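+
+ As a rough illustration, the sketch below compares the length of a MYTE-encoded sequence with the raw UTF-8 byte length of the same text, reusing the custom `MyT5Tokenizer` and assuming it follows the standard Hugging Face call interface shown in the usage snippet above; the example strings are arbitrary:
+
+ ```python
+ from src.myt5.myt5_tokenizer import MyT5Tokenizer
+
+ tokenizer = MyT5Tokenizer()
+
+ # MYTE aims at more balanced sequence lengths across scripts than raw UTF-8 bytes.
+ samples = {
+     "en": "We now have",
+     "ta": "எங்களிடம் இப்போது",
+ }
+
+ for lang, text in samples.items():
+     myte_len = len(tokenizer(text).input_ids)  # length of the MYTE representation
+     utf8_len = len(text.encode("utf-8"))       # raw UTF-8 byte length
+     print(f"{lang}: MYTE tokens = {myte_len}, UTF-8 bytes = {utf8_len}")
+ ```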
+
+
+ ### Training Hyperparameters
+
+ We used the same hyperparameters as in the original ByT5 paper.
+ The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.
+
+ ### Computational Infrastructure
+
+ Models were trained on TPUs available through the TPU Research Cloud (TRC).
+ We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model.
+ Training each model took:
+
+ - **Small**: 90h
+ - **Base**: 230h
+ - **Large**: 190h
+
201
+ # Evaluation
202
+
203
+ <!-- This section describes the evaluation protocols and provides the results. -->
204
+
205
+ MyT5 models are compared with reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.
206
+
207
+ ## Language Modeling
208
+
209
+ We have evaluated LM performance on multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
210
+ To compare the scores across languages and models, we used a normalized metric, i.e., Bit-per-English-Byte (BPEB).
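+
+ As a rough sketch of how such a normalized score can be computed (assuming BPEB is the model's total surprisal over the target, in bits, divided by the UTF-8 byte length of the parallel English sentence; see the paper for the exact protocol):
+
+ ```python
+ import math
+ import torch
+
+ def bpeb(model, tokenizer, prefix, target, english_target):
+     """Hypothetical helper: bits-per-English-byte for a single example."""
+     inputs = tokenizer([prefix], return_tensors="pt")
+     labels = tokenizer([target], return_tensors="pt").input_ids
+     with torch.no_grad():
+         loss = model(**inputs, labels=labels).loss  # mean cross-entropy in nats
+     total_bits = loss.item() * labels.numel() / math.log(2)
+     return total_bits / len(english_target.encode("utf-8"))
+ ```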
211
+
212
+ ### Results
213
+
214
+ | | | ByT5 | | MyT5 | |
215
+ |-------|-----------|------|--------|------|--------|
216
+ | | | BPEB | T (ms) | BPEB | T (ms) |
217
+ | small | All | 10.1 | 7.0 | 4.6 | 6.7 |
218
+ | | Latin | 4.6 | 5.9 | 4.2 | 6.6 |
219
+ | | Non Latin | 18.1 | 8.5 | 5.1 | 6.8 |
220
+ | base | All | 8.2 | 11.5 | 5.8 | 8.9 |
221
+ | | Latin | 4.9 | 9.4 | 5.0 | 8.7 |
222
+ | | Non Latin | 13.0 | 14.6 | 6.9 | 9.1 |
223
+ | large | All | 13.4 | 31.8 | 4.6 | 26.7 |
224
+ | | Latin | 10.1 | 28.1 | 4.0 | 26.6 |
225
+ | | Non Latin | 18.2 | 37.3 | 5.4 | 27.0 |
226
+
227
+ Byte-per-English-Bits and Inference times (average per Flores 200 sentence) averaged for three language groupings.
228
+ The inference was run on an A40 GPU core.
229
+
230
+ ## Downstream Tasks
231
+
232
+ We tested the large model in four end-tasks: question answering, NER, semantic parsing, and machine translation.
233
+ The test data come from XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which covers mainly low-resource languages
234
+
235
+ ### Fine-tuning
236
+
237
+ In each task, we fine-tuned for all languages jointly.
238
+ We used 1e-3 learning rate with square root decay and dropout of 0.1.
239
+ The batch size and training varied across tasks:
240
+
241
+ - **NER**: 128 examples per batch, 6000 steps
242
+ - **QA**: 64 examples per batch, 6500 steps
243
+ - **Semantic Parsing**: 64 examples per batch, 1000 steps
244
+ - **MT**: 64 examples per batch, 10000 steps
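+
+ A minimal, hypothetical PyTorch sketch of the learning-rate schedule described above (peak 1e-3 with inverse-square-root decay); the optimizer choice and warm-up length are illustrative assumptions, not values from the paper:
+
+ ```python
+ import torch
+ from transformers import T5ForConditionalGeneration
+
+ # Load the large model (as linked under Model Sizes) with the fine-tuning dropout.
+ model = T5ForConditionalGeneration.from_pretrained(
+     "Tomlim/myt5-large", dropout_rate=0.1, use_safetensors=True
+ )
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # peak learning rate
+ warmup_steps = 1000  # assumed warm-up length, not from the paper
+
+ def inverse_sqrt(step: int) -> float:
+     # Constant LR during warm-up, then decay proportional to 1/sqrt(step).
+     return min(1.0, (warmup_steps / max(step, 1)) ** 0.5)
+
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
+ # Call scheduler.step() after each optimizer.step() in the training loop.
+ ```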
+
+
+ #### Results
+
+ | Task                                | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
+ |-------------------------------------|---------|----------|-----------------------|-----------|
+ | Flan-PaLM*                          | 22.9    | 12.0     | 0.1                   | ---       |
+ | mT5*                                | 59.7    | 74.0     | 21.8                  | ---       |
+ | ByT5                                | 73.2    | 81.5     | 25.1                  | 20.1      |
+ | MyT5                                | 75.3    | 80.8     | 19.6                  | 20.4      |
+ | **Inference time per example (ms)** |         |          |                       |           |
+ | ByT5                                | 36.2    | 13.8     | 13.2                  | 15.9      |
+ | MyT5                                | 35.6    | 12.6     | 12.4                  | 12.6      |
+
+ Average results on XTREME-UP tasks across low-resource languages.
+ The baseline results of mT5 and Flan-PaLM (marked with *, in-context-learning evaluation) are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf).
+ The reported inference time is an average across evaluation examples; inference was run on an A40 GPU.
+
+ ## Citation
+
+ ```bibtex
+ @misc{limisiewicz2024myte,
+   title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
+   author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
+   year={2024},
+   eprint={2403.10691},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
+ ## Model Card Author
+
+ [Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)