---
language:
- bn
licenses:
- cc-by-nc-sa-4.0
---

# BanglaBERT

This repository contains the pretrained discriminator checkpoint of **BanglaBERT**, an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many NLP tasks in Bengali.
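
To make the RTD objective concrete, here is a minimal sketch of how per-token training targets can be derived from an original sequence and a corrupted copy of it. The helper name `rtd_targets` and the whitespace tokenization are assumptions made for illustration; the actual pretraining pipeline works on subword tokens and is not reproduced here.

```python
def rtd_targets(original_tokens, corrupted_tokens):
    """Return 1 for each position whose token was replaced, else 0."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("RTD assumes the two sequences are aligned token by token")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Whitespace splitting stands in for the real subword tokenizer (an assumption).
original = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।".split()
corrupted = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।".split()

targets = rtd_targets(original, corrupted)
print(targets)  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```

The discriminator is trained to predict these 0/1 targets for every position, so only the second token counts as replaced in this pair.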

For finetuning on downstream tasks such as `Sentiment classification`, `Named Entity Recognition`, and `Natural Language Inference`, refer to the scripts in the official [repository](https://github.com/csebuetnlp/banglabert).

## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
from normalizer import normalize  # pip install git+https://github.com/abhik1505040/normalizer
import torch

model = ElectraForPreTraining.from_pretrained("banglabert")
tokenizer = ElectraTokenizerFast.from_pretrained("banglabert")

original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = normalize(fake_sentence)  # this normalization step is required before tokenizing the text

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# Positive logits mark tokens the discriminator believes were replaced.
with torch.no_grad():
    discriminator_outputs = model(fake_inputs).logits
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

# Print each token above its prediction (1 = replaced, 0 = original).
for token in fake_tokens:
    print("%7s" % token, end="")
print("\n" + "-" * 50)
# Skip the first and last positions ([CLS] and [SEP]); they have no entry in fake_tokens.
for prediction in predictions.squeeze().tolist()[1:-1]:
    print("%7s" % int(prediction), end="")
print("\n" + "-" * 50)
```
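
The post-processing step `torch.round((torch.sign(x) + 1) / 2)` turns each logit into a 0/1 label. The dependency-free sketch below mirrors that arithmetic in plain Python to show what it computes. The helper name `logits_to_labels` and the sample logits are made up for illustration and are not model outputs.

```python
def logits_to_labels(logits):
    """Map each logit to 1 (flagged as replaced) if it is positive, else 0."""
    labels = []
    for x in logits:
        sign = (x > 0) - (x < 0)              # -1, 0, or 1
        labels.append(round((sign + 1) / 2))  # -1 -> 0, 1 -> 1
    return labels


example_logits = [-3.2, 4.7, -0.4, 0.0]  # hypothetical values, not real model outputs
labels = logits_to_labels(example_logits)
print(labels)  # [0, 1, 0, 0]

# In this sketch the mapping is the same as thresholding the logit at zero.
assert labels == [int(x > 0) for x in example_logits]
```

A label of 1 means the discriminator flags the token as replaced, and 0 means it judges the token to be original.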

## Citation

If you use this model, please cite the following paper:

```
@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```