Hatto/Vietnamese-FlanT5-Large


KhangHatto committed on Nov 22, 2023

Commit f7ff895 · 1 Parent(s): 29cd61c

Create README.md

First README test

Files changed (1)

README.md +39 -0

README.md ADDED

	@@ -0,0 +1,39 @@

+---
+license: mit
+datasets:
+- bigscience-data/roots_vi_binhvq_news_corpus
+- wikipedia
+language:
+- vi
+- en
+- zh
+library_name: transformers
+tags:
+- t5
+- flant5
+- summarization
+- translation
+- question-answering
+---
+## HattoFlanT5-Large
+We utilized [SentencePiece](https://github.com/google/sentencepiece) to retrain a tokenizer for Vietnamese, English, and Chinese. This newly trained tokenizer's vocabulary was then combined with Flan-T5's original vocabulary, eliminating any duplicate tokens. The resulting merged vocabulary consists of 106611 tokens.
+We then continued pretraining Flan-T5-Large for a single epoch (continual pretraining, also referred to as incremental pretraining) on a diverse corpus of more than 100 GB drawn from the following sources:
+- [NewsCorpus](https://github.com/binhvq/news-corpus)
+- Vietnamese Wikipedia
+- Vietnamese books
+- Vietnamese legal documents
+- Vietnamese legal text
+- English Wikipedia
+- Chinese text
+## How to use
+```python
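+import torch
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+# Load the merged-vocabulary tokenizer and the continually pretrained model.
+tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
+model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
+# Move the model to the GPU when one is available; otherwise keep it on the CPU.
+model.to("cuda" if torch.cuda.is_available() else "cpu")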
+```
+## Citation
+Hatto
+IpTech
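
The README says the retrained tokenizer's vocabulary was combined with Flan-T5's original vocabulary and that duplicate tokens were eliminated, but it does not show how. The sketch below illustrates only that deduplicating merge, on toy token lists. The function name `merge_vocabularies` and both sample vocabularies are hypothetical and not taken from the commit. A real merge would presumably also edit the SentencePiece model file and resize the model's embedding matrix, neither of which is shown here.

```python
def merge_vocabularies(base_vocab, new_vocab):
    """Return base_vocab followed by the tokens of new_vocab it does not already contain.

    Order is preserved so that existing token ids in base_vocab stay valid.
    """
    seen = set(base_vocab)
    merged = list(base_vocab)
    for token in new_vocab:
        if token not in seen:
            seen.add(token)
            merged.append(token)
    return merged


# Toy stand-ins (hypothetical): a few pieces from an "original" vocabulary
# and a few from a "retrained" multilingual vocabulary.
flan_t5_vocab = ["▁the", "▁model", "▁data", "s"]
retrained_vocab = ["▁the", "▁mô", "▁hình", "▁data", "▁模型"]

merged = merge_vocabularies(flan_t5_vocab, retrained_vocab)
print(len(merged))  # 7: four original tokens plus three new ones
```

Appending new tokens after the original ones, rather than re-sorting, keeps the pretrained embeddings aligned with their original ids. That alignment is the usual reason for extending a vocabulary in this order.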