KhangHatto committed
Commit f7ff895 · 1 Parent(s): 29cd61c

Create README.md


First README test

Files changed (1): README.md (+39 −0)

---
license: mit
datasets:
- bigscience-data/roots_vi_binhvq_news_corpus
- wikipedia
language:
- vi
- en
- zh
library_name: transformers
tags:
- t5
- flant5
- summarization
- translation
- question-answering
---
## HattoFlanT5-Large
We used [SentencePiece](https://github.com/google/sentencepiece) to retrain a tokenizer for Vietnamese, English, and Chinese, then merged its vocabulary with Flan-T5's original vocabulary, removing duplicate tokens. The merged vocabulary contains 106,611 tokens.
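
The merge script itself is not part of this README, but the procedure is standard; below is a minimal sketch. It assumes a SentencePiece model retrained on vi/en/zh text (`vi_en_zh.model` is a hypothetical path) and, for brevity, uses `add_tokens` rather than editing the SentencePiece proto as a production merge typically would.

```python
# Minimal vocabulary-merge sketch; "vi_en_zh.model" is a hypothetical
# path to the retrained SentencePiece model.
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

flan_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
sp = spm.SentencePieceProcessor(model_file="vi_en_zh.model")

# Collect the retrained pieces and keep only those Flan-T5 lacks.
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
existing = set(flan_tokenizer.get_vocab())
flan_tokenizer.add_tokens([p for p in new_pieces if p not in existing])
print(len(flan_tokenizer))  # 106,611 in the released model

# The embedding matrix must grow to match the new vocabulary
# before continual pretraining can start.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.resize_token_embeddings(len(flan_tokenizer))
```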

We then ran one epoch of continual pretraining (also called incremental pretraining) of Flan-T5-Large on a diverse dataset of more than 100 GB drawn from the following sources (a toy sketch of the pretraining objective follows the list):
- [NewsCorpus](https://github.com/binhvq/news-corpus)
- Vietnamese Wikipedia
- Vietnamese books
- Vietnamese legal documents
- Vietnamese legal text
- English Wikipedia
- Chinese text
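
The README does not state the training objective; since Flan-T5 descends from T5's span-corruption pretraining, the sketch below assumes a denoising objective was kept. It corrupts a single span with T5's standard sentinel tokens; real pretraining masks roughly 15% of tokens across many spans and streams batches from the corpus above.

```python
# Toy, single-example illustration of T5-style span corruption.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

words = "Hà Nội là thủ đô của Việt Nam".split()
# Mask "thủ đô" with a sentinel in the input; the target reproduces
# the masked span between sentinels.
source = " ".join(words[:3] + ["<extra_id_0>"] + words[5:])
target = "<extra_id_0> " + " ".join(words[3:5]) + " <extra_id_1>"

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss  # denoising seq2seq loss
loss.backward()  # an optimizer step would follow in training
```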

## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
if torch.cuda.is_available():  # move to GPU when one is present
    model.cuda()
```
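
A quick inference sketch; the `summarize:` prefix is an assumption carried over from Flan-T5's instruction format, so adjust the prompt to your task.

```python
# Hypothetical usage example: summarize a short Vietnamese input.
inputs = tokenizer(
    "summarize: Hà Nội là thủ đô của Việt Nam ...",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```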
## Citation
Hatto
IpTech