---
language:
- bn
licenses:
- cc-by-nc-sa-4.0
---
# BanglaBERT
This repository contains the pretrained discriminator checkpoint of BanglaBERT, an ELECTRA model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.
For finetuning on downstream tasks such as sentiment classification, named entity recognition, and natural language inference, refer to the scripts in the official repository.
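The official scripts cover these tasks end to end; as a rough orientation, the following is only a minimal sketch of loading this checkpoint for a classification task with the `transformers` Auto classes. The `num_labels=2` setting is a placeholder assumption, and the classification head is randomly initialized, so the outputs are meaningless until the model is finetuned.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from normalizer import normalize  # same normalization as in the discriminator example below
import torch

# Hypothetical 2-way sentiment setup; the actual label set depends on your dataset.
# The classification head on top of the pretrained encoder is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = normalize("আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Arbitrary until finetuned; train with your own labeled data first.
print(logits.argmax(dim=-1))
```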
## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
from normalizer import normalize  # pip install git+https://github.com/abhik1505040/normalizer
import torch

model = ElectraForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = ElectraTokenizerFast.from_pretrained("csebuetnlp/banglabert")

original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"

# This normalization step is required before tokenizing the text.
fake_sentence = normalize(fake_sentence)

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# Per-token logits: a positive value means the discriminator predicts "replaced".
discriminator_outputs = model(fake_inputs).logits
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

# Print each token alongside its prediction, dropping the [CLS] and [SEP] positions.
for token in fake_tokens:
    print("%7s" % token, end="")
print("\n" + "-" * 50)
for prediction in predictions.squeeze().tolist()[1:-1]:
    print("%7s" % int(prediction), end="")
print("\n" + "-" * 50)
```
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{bhattacharjee2021banglabert,
  title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
  author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
  year={2021},
  eprint={2101.00204},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```