---
language:
- bn
licenses:
- cc-by-nc-sa-4.0
---

# BanglaBERT

This repository contains the pretrained discriminator checkpoint of the model **BanglaBERT**. This is an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Finetuned models using this checkpoint achieve state-of-the-art results on many NLP tasks in Bengali.

For finetuning on different downstream tasks such as `Sentiment classification`, `Named Entity Recognition`, `Natural Language Inference` etc., refer to the scripts in the official [repository](https://github.com/csebuetnlp/banglabert).

## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
from normalizer import normalize # pip install git+https://github.com/abhik1505040/normalizer
import torch

model = ElectraForPreTraining.from_pretrained("banglabert")
tokenizer = ElectraTokenizerFast.from_pretrained("banglabert")

original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = normalize(fake_sentence) # this normalization step is required before tokenizing the text

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# inference only, so gradient tracking is not needed
with torch.no_grad():
    discriminator_outputs = model(fake_inputs).logits
# positive logit -> 1 (token predicted as replaced), otherwise 0 (original)
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

# print the tokens, then the per-token predictions underneath
for token in fake_tokens:
    print("%7s" % token, end="")
print("\n" + "-" * 50)
# [1:-1] drops the predictions for the special [CLS] and [SEP] tokens
for prediction in predictions.squeeze().tolist()[1:-1]:
    print("%7s" % int(prediction), end="")
print("\n" + "-" * 50)
```

## Citation

If you use this model, please cite the following paper:

```
@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
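
## Note on the prediction step

The discriminator example turns each token's logit into a 0/1 label with `torch.round((torch.sign(logits) + 1) / 2)`. As a minimal pure-Python sketch of what that expression does, the snippet below compares it with thresholding the sigmoid probability at 0.5. The logit values are made up for illustration, and the helper names `label_from_sign` and `label_from_sigmoid` are hypothetical (they are not part of `transformers` or this repository).

```python
import math


def label_from_sign(logit: float) -> int:
    """Mirror round((sign(x) + 1) / 2) for a single logit."""
    sign = (logit > 0) - (logit < 0)  # -1, 0, or 1
    # Python's round() rounds halves to even, so a logit of exactly 0 gives 0
    return int(round((sign + 1) / 2))


def label_from_sigmoid(logit: float, threshold: float = 0.5) -> int:
    """Label a token 1 when its sigmoid probability exceeds the threshold."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return int(prob > threshold)


# Hypothetical per-token logits (not real model output)
logits = [-3.2, 0.7, -1.5, 4.1, -0.3]

sign_labels = [label_from_sign(x) for x in logits]
sigmoid_labels = [label_from_sigmoid(x) for x in logits]

print(sign_labels)     # [0, 1, 0, 1, 0]
print(sigmoid_labels)  # [0, 1, 0, 1, 0]
```

Both forms agree because the sigmoid exceeds 0.5 exactly when the logit is positive. The sigmoid form is handy if you want a threshold other than 0.5.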