---
language:
- bn
licenses:
- cc-by-nc-sa-4.0
---
# BanglaBERT
This repository contains the pretrained discriminator checkpoint of **BanglaBERT**, an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.
For finetuning on downstream tasks such as `Sentiment classification`, `Named Entity Recognition`, and `Natural Language Inference`, refer to the scripts in the official [repository](https://github.com/csebuetnlp/banglabert).
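
To make the Replaced Token Detection objective concrete, here is a minimal sketch of the per-token labels a discriminator is trained to predict: 1 where a token was swapped, 0 where it was kept. The helper `rtd_labels` is hypothetical and written only for illustration, and whitespace splitting stands in for the model's actual subword tokenizer.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where a token was replaced, 0 where it was kept."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("RTD assumes one-to-one token replacement")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Assumption: whitespace splitting replaces the real subword tokenizer.
original = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।".split()
corrupted = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।".split()

labels = rtd_labels(original, corrupted)
print(labels)  # only the second token differs
```

In actual ELECTRA pretraining the replacements come from a small generator network rather than a hand-edited sentence; the labels are derived the same way.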
## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
from normalizer import normalize  # pip install git+https://github.com/abhik1505040/normalizer
import torch

model = ElectraForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = ElectraTokenizerFast.from_pretrained("csebuetnlp/banglabert")

original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
# Normalization is required before tokenizing the text.
fake_sentence = normalize(fake_sentence)

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    discriminator_outputs = model(fake_inputs).logits
# Positive logit -> 1 (token flagged as replaced), otherwise 0 (original).
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

# Print each token above its prediction, skipping the [CLS] and [SEP] positions.
print("".join("%7s" % token for token in fake_tokens))
print("-" * 50)
print("".join("%7s" % int(p) for p in predictions.squeeze().tolist()[1:-1]))
print("-" * 50)
```
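
The thresholding step in the snippet above reduces to checking the sign of each logit. The sketch below restates that step in plain Python so the flagged tokens can be collected directly. The helper `flag_replaced` and the logit values are hypothetical and are not real model output; the logits are assumed to already exclude the `[CLS]` and `[SEP]` positions.

```python
def flag_replaced(tokens, logits):
    """Pair each token with a 0/1 label; expects one logit per token."""
    if len(tokens) != len(logits):
        raise ValueError("expected one logit per token")
    # A positive logit means the discriminator considers the token replaced.
    return [(token, int(logit > 0)) for token, logit in zip(tokens, logits)]


# Hypothetical logits for illustration only, not real model output.
tokens = ["আমি", "হতাশ", "কারণ"]
logits = [-3.2, 1.7, -2.5]

flagged = [token for token, label in flag_replaced(tokens, logits) if label == 1]
print(flagged)
```

A logit of exactly zero maps to 0 here, which matches the `torch.round((torch.sign(x) + 1) / 2)` expression, since `torch.round` rounds 0.5 to the nearest even integer.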
## Citation
If you use this model, please cite the following paper:
```
@misc{bhattacharjee2021banglabert,
title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
year={2021},
eprint={2101.00204},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```