---
language:
  - bn
licenses:
  - cc-by-nc-sa-4.0
---

# BanglaBERT

This repository contains the pretrained discriminator checkpoint of the **BanglaBERT** model: an ELECTRA discriminator pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.

For finetuning on downstream tasks such as sentiment classification, named entity recognition, and natural language inference, refer to the scripts in the official repository.
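
As a minimal sketch (not the official training script), finetuning this checkpoint for binary sentiment classification with `transformers` might look like the following. The toy sentences, labels, and hyperparameters are illustrative only:

```python
# Minimal finetuning sketch (illustrative only; see the official repository's
# scripts for the actual training setup and hyperparameters).
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizerFast
from normalizer import normalize  # pip install git+https://github.com/abhik1505040/normalizer

# Loading a pretraining checkpoint into a classification head initializes the
# head randomly; transformers will warn about this, which is expected.
model = ElectraForSequenceClassification.from_pretrained("banglabert", num_labels=2)
tokenizer = ElectraTokenizerFast.from_pretrained("banglabert")

# Toy labeled data: 1 = positive, 0 = negative. Normalize before tokenizing.
texts = [normalize(t) for t in [
    "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।",
    "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।",
]]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```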

## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
from normalizer import normalize  # pip install git+https://github.com/abhik1505040/normalizer
import torch

model = ElectraForPreTraining.from_pretrained("banglabert")
tokenizer = ElectraTokenizerFast.from_pretrained("banglabert")

original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"  # "I am grateful because you have done a lot for me."
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"  # "I am disappointed because you have done a lot for me."

# This normalization step is required before tokenizing the text
fake_sentence = normalize(fake_sentence)

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = model(fake_inputs).logits

# A positive logit means the discriminator flags the token as replaced (1); otherwise original (0)
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

for token in fake_tokens:
    print("%7s" % token, end="")
print("\n" + "-" * 50)
# [1:-1] drops the [CLS] and [SEP] predictions so they align with the printed tokens
for prediction in predictions.squeeze().tolist()[1:-1]:
    print("%7s" % int(prediction), end="")
print("\n" + "-" * 50)
```
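
In the printed output, `1` marks a token the discriminator flags as replaced (here the substituted word হতাশ) and `0` marks a token it considers part of the original sentence.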

## Citation

If you use this model, please cite the following paper:

```bibtex
@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```