jbochi committed on
Commit 924f0dd
1 Parent(s): df8a496

Update README.md

Files changed (1)
  1. README.md +182 -14
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  license: apache-2.0
  language:
  - en
  - ru
  - es
@@ -421,40 +422,207 @@ language:
  - msb
  library_name: transformers
  tags:
  - text-generation-inference
  datasets:
  - allenai/MADLAD-400
  pipeline_tag: translation
  ---

- T5ForConditionalGeneration files for Google's [Madlad-400](https://github.com/google-research/google-research/tree/master/madlad_400) 10B parameter MT-BT model.

- Available models:
- - [3B](https://huggingface.co/jbochi/madlad400-3b-mt)
- - [7B](https://huggingface.co/jbochi/madlad400-7b-mt)
- - [7B-BT](https://huggingface.co/jbochi/madlad400-7b-mt-bt)
- - [10B](https://huggingface.co/jbochi/madlad400-10b-mt)

- Article: [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662)

- Abstract:

- > We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train a 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.

  ```python
  from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig

- model = T5ForConditionalGeneration.from_pretrained('jbochi/madlad400-10b-mt')
- tokenizer = T5Tokenizer.from_pretrained('jbochi/madlad400-10b-mt')

  text = "<2pt> I love pizza!"
- input_ids = tokenizer(text, return_tensors="pt").input_ids
  outputs = model.generate(input_ids=input_ids)

  tokenizer.decode(outputs[0], skip_special_tokens=True)
- # Eu amo pizza!
  ```

- Colab to generate these files is [here](https://colab.research.google.com/drive/13DQoNLMgtLnWkpfibdYpQsKcEISppQhO#scrollTo=9FWzbjqcPE5U).
  ---
  license: apache-2.0
  language:
+ - multilingual
  - en
  - ru
  - es

  - msb
  library_name: transformers
  tags:
+ - text2text-generation
  - text-generation-inference
  datasets:
  - allenai/MADLAD-400
  pipeline_tag: translation
+
+ widget:
+ - text: "<2en> Como vai, amigo?"
+   example_title: "Translation to English"
+ - text: "<2de> Do you speak German?"
+   example_title: "Translation to German"
+
  ---

+ # Model Card for MADLAD-400-10B-MT
+
+ # Table of Contents
+
+ 0. [TL;DR](#TL;DR)
+ 1. [Model Details](#model-details)
+ 2. [Usage](#usage)
+ 3. [Uses](#uses)
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
+ 5. [Training Details](#training-details)
+ 6. [Evaluation](#evaluation)
+ 7. [Environmental Impact](#environmental-impact)
+ 8. [Citation](#citation)
+
+ # TL;DR
+
+ MADLAD-400-10B-MT is a multilingual machine translation model based on the T5 architecture that was
+ trained on 250 billion tokens covering over 450 languages using publicly available data.
+ It is competitive with models that are significantly larger.
+
+ **Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
+ the original weights and wrote the contents of this model card based on the original paper and Flan-T5.

+ # Model Details
+
+ ## Model Description
+
+ - **Model type:** Language model
+ - **Language(s) (NLP):** Multilingual (400+ languages)
+ - **License:** Apache 2.0
+ - **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
+ - **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
+ - **Resources for more information:**
+   - [Research paper](https://arxiv.org/abs/2309.04662)
+   - [GitHub Repo](https://github.com/google-research/t5x)
+   - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)

+ # Usage
+
+ Find below some example scripts showing how to use the model:
+
+ ## Using the PyTorch model with `transformers`
+
+ ### Running the model on a CPU or GPU
+
+ <details>
+ <summary> Click to expand </summary>

  ```python
  from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig

+ model_name = 'jbochi/madlad400-10b-mt'
+ model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
+ tokenizer = T5Tokenizer.from_pretrained(model_name)

  text = "<2pt> I love pizza!"
+ input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
  outputs = model.generate(input_ids=input_ids)

  tokenizer.decode(outputs[0], skip_special_tokens=True)
+ # Eu adoro pizza!
  ```

+ </details>
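The same checkpoint can also be driven through the high-level `text2text-generation` pipeline. The sketch below is illustrative only and is not part of the original card; the `<2xx>` prefix selects the target language exactly as in the snippet above.

```python
# Illustrative sketch (not from the original card): the same checkpoint used
# through the text2text-generation pipeline. The <2xx> prefix still selects
# the target language.
from transformers import pipeline

translator = pipeline("text2text-generation", model="jbochi/madlad400-10b-mt")
result = translator("<2es> I love pizza!", max_new_tokens=40)
print(result[0]["generated_text"])  # expected output: a Spanish translation
```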
+
+ ## Running the model with Candle
+
+ <details>
+ <summary> Click to expand </summary>
+
+ Usage with [candle](https://github.com/huggingface/candle):
+
+ ```bash
+ $ cargo run --example t5 --release -- \
+   --model-id "jbochi/madlad400-10b-mt" \
+   --prompt "<2de> How are you, my friend?" \
+   --decode --temperature 0
+ ```
+
+ We also provide a quantized model (1.65 GB vs the original 11.8 GB file):
+
+ ```
+ cargo run --example quantized-t5 --release -- \
+   --model-id "jbochi/madlad400-10b-mt" --weight-file "model-q4k.gguf" \
+   --prompt "<2de> How are you, my friend?" \
+   --temperature 0
+ ...
+ Wie geht es dir, mein Freund?
+ ```
+
+ </details>
+
+ # Uses
+
+ ## Direct Use and Downstream Use
+
+ > Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
+ > Primary intended users: Research community.
+
+ ## Out-of-Scope Use
+
+ > These models are trained on general domain data and are therefore not meant to
+ > work on domain-specific tasks out of the box. Moreover, these research models have not been assessed
+ > for production use cases.
+
+ # Bias, Risks, and Limitations
+
+ > We note that we evaluate on only 204 of the languages supported by these models and on machine translation
+ > and few-shot machine translation tasks. Users must consider use of this model carefully for their own
+ > use case.
+
+ ## Ethical considerations and risks
+
+ > We trained these models with MADLAD-400 and publicly available data to create baseline models that
+ > support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
+ > Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
+ > otherwise low-quality content despite extensive preprocessing, it is still possible that these issues in the
+ > underlying training data may cause differences in model performance and toxic (or otherwise problematic)
+ > output for certain domains. Moreover, large models are dual use technologies that have specific risks
+ > associated with their use and development. We point the reader to surveys such as those written by
+ > Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
+ > et al. for a thorough discussion of the risks of machine translation systems.
+
+ ## Known Limitations
+
+ More information needed
+
+ ## Sensitive Use
+
+ More information needed
+
+ # Training Details
+
+ > We train models of various sizes: a 3B-parameter, 32-layer model,
+ > a 7.2B-parameter, 48-layer model, and a 10.7B-parameter, 32-layer model.
+ > We share all parameters of the model across language pairs,
+ > and use a SentencePiece model with 256k tokens shared on both the encoder and decoder
+ > side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
+ > language.
+
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
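To make the `<2xx>` convention above concrete, here is a minimal illustrative helper (an assumption for illustration, not code from the paper or the card) that builds such a prompt for an arbitrary target-language code:

```python
# Illustrative sketch: build the "<2xx> ..." prompt expected by MADLAD-400 MT
# checkpoints, where "xx" is the target-language code (e.g. "pt", "de", "en").
def build_translation_prompt(text: str, target_lang: str) -> str:
    return f"<2{target_lang}> {text}"

print(build_translation_prompt("I love pizza!", "pt"))  # -> "<2pt> I love pizza!"
```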
+
+ ## Training Data
+
+ > For both the machine translation and language model, MADLAD-400 is used. For the machine translation
+ > model, a combination of parallel data sources covering 157 languages is also used. Further details are
+ > described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
+
+ ## Training Procedure
+
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
+
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ > For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 of the [paper](https://arxiv.org/pdf/2309.04662.pdf).
+
+ > The translation quality of this model varies based on language, as seen in the paper, and likely varies on
+ > domain, though we have not assessed this.
+
+ ## Results
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)
+
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
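For readers reproducing the evaluation, the sketch below shows one common way to score translations with sacrebleu; this is an assumption for illustration only, and the paper's actual evaluation setup is the one described in Section 4.3.

```python
# Illustrative sketch (not from the paper): scoring hypotheses against
# references with sacrebleu. The sentences here are dummy examples.
import sacrebleu

hypotheses = ["Eu adoro pizza!"]
references = [["Eu amo pizza!"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```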
+
+ # Environmental Impact
+
+ More information needed
+
+ # Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{kudugunta2023madlad400,
+       title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
+       author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
+       year={2023},
+       eprint={2309.04662},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```