IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on Indic languages and English.
<ul>
<li>Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu and English. Not all of these languages are supported by mBART50 and mT5.</li>
<li>The model is much smaller than the mBART and mT5(-base) models, so it is less computationally expensive for fine-tuning and decoding.</li>
<li>Trained on large Indic-language corpora (452 million sentences and 9 billion tokens), which also include Indian English content.</li>
<li>All languages except English are represented in the Devanagari script to encourage transfer learning among the related languages.</li>
</ul>

You can read more about IndicBART in this <a href="https://arxiv.org/abs/2109.02903">paper</a>.

1. This is compatible with the latest version of transformers, but it was developed with version 4.3.2, so consider using 4.3.2 if possible.
2. While I have only shown how to get logits and loss and how to generate outputs, you can do pretty much everything the MBartForConditionalGeneration class can do, as documented at https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartForConditionalGeneration
3. Note that the tokenizer I have used is based on sentencepiece and not BPE. Therefore, I used the AlbertTokenizer class and not the MBartTokenizer class. (See the loading sketch after these notes.)
4. If you wish to use any language written in a non-Devanagari script (except English), you should first convert it to Devanagari using the <a href="https://github.com/anoopkunchukuttan/indic_nlp_library">Indic NLP Library</a>. After you get the output, you should convert it back into the original script. (See the transliteration sketch below.)
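
Putting notes 2 and 3 together, here is a minimal sketch of loading the model with AlbertTokenizer and driving it through the standard MBartForConditionalGeneration interfaces. The checkpoint name (ai4bharat/IndicBART), the tokenizer flags, and the `<2xx>` language-tag input format are illustrative assumptions, not guaranteed by this README:

```python
from transformers import AlbertTokenizer, MBartForConditionalGeneration

# Assumed checkpoint name; sentencepiece-based, hence AlbertTokenizer (note 3).
tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")

# Assumed convention: inputs end with "</s> <2xx>", targets start with "<2xx>".
inp = tokenizer("I am a boy </s> <2en>", add_special_tokens=False,
                return_tensors="pt").input_ids
tgt = tokenizer("<2hi> मैं एक लड़का हूँ </s>", add_special_tokens=False,
                return_tensors="pt").input_ids

# Teacher-forced forward pass gives loss and logits (note 2).
outputs = model(input_ids=inp, decoder_input_ids=tgt[:, :-1], labels=tgt[:, 1:])
print(outputs.loss, outputs.logits.shape)

# Beam-search generation, starting decoding from the target-language tag.
generated = model.generate(
    inp,
    num_beams=4,
    max_length=20,
    early_stopping=True,
    decoder_start_token_id=tokenizer.convert_tokens_to_ids("<2hi>"),
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```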
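To make note 4 concrete, here is a sketch of the round trip through Devanagari using the Indic NLP Library's script converter. The specific class, the two-letter language codes, and the Tamil example sentence are assumptions; check the library's documentation for the exact entry points:

```python
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

# Illustrative Tamil input ("I am a boy").
tamil_text = "நான் ஒரு பையன்"

# Convert the Tamil script to Devanagari before tokenizing for IndicBART.
devanagari_text = UnicodeIndicTransliterator.transliterate(tamil_text, "ta", "hi")

# ... run IndicBART on devanagari_text and decode its output here ...
model_output = devanagari_text  # placeholder standing in for the model's output

# Convert the model's Devanagari output back into the original script.
restored = UnicodeIndicTransliterator.transliterate(model_output, "hi", "ta")
print(devanagari_text)
print(restored)
```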
# Fine-tuning on a downstream task