	---
	language:
	- bn
	licenses:
	- cc-by-nc-sa-4.0
	---

	# BanglaBERT

	This repository contains the pretrained discriminator checkpoint of BanglaBERT, an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.

	For finetuning on downstream tasks such as `Sentiment classification`, `Named Entity Recognition`, and `Natural Language Inference`, refer to the scripts in the official [repository](https://github.com/csebuetnlp/banglabert).

	## Using this model as a discriminator in `transformers` (tested on 4.11.0.dev0)

	```python
	from transformers import ElectraForPreTraining, ElectraTokenizerFast
	from normalizer import normalize # pip install git+https://github.com/abhik1505040/normalizer
	import torch

	# Load the checkpoint by its full Hub ID (namespace/name)
	model = ElectraForPreTraining.from_pretrained("csebuetnlp/banglabert")
	tokenizer = ElectraTokenizerFast.from_pretrained("csebuetnlp/banglabert")

	original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
	fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
	fake_sentence = normalize(fake_sentence) # this normalization step is required before tokenizing the text

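	fake_tokens = tokenizer.tokenize(fake_sentence)
	fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
	with torch.no_grad():  # inference only, no gradients needed
	    discriminator_outputs = model(fake_inputs).logits
	# A positive logit means "replaced" (1); a negative logit means "original" (0)
	predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

	# Print each token above its prediction, skipping the [CLS] and [SEP] positions
	for token in fake_tokens:
	    print("%7s" % token, end="")
	print("\n" + "-" * 50)
	for prediction in predictions.squeeze().tolist()[1:-1]:
	    print("%7s" % int(prediction), end="")
	print("\n" + "-" * 50)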
	```
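
The prediction step in the snippet above turns each logit into a 0/1 label based on its sign. As a minimal sketch of that mapping in plain Python (no model download needed), the helpers below reproduce the same thresholding and also convert a logit into a probability with a sigmoid. The function names `rtd_predictions` and `replaced_probability` and the logit values are hypothetical, chosen only for illustration; they are not part of `transformers` or of this repository.

```python
import math


def rtd_predictions(logits):
    """Map discriminator logits to labels: 1 = flagged as replaced, 0 = judged original."""
    # Same result as torch.round((torch.sign(x) + 1) / 2): positive logits give 1,
    # negative logits give 0, and an exact zero gives 0 (0.5 rounds to even).
    return [1 if x > 0 else 0 for x in logits]


def replaced_probability(logit):
    """Sigmoid of a logit, read as the estimated probability that the token was replaced."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Hypothetical logits for a five-token sentence (illustrative values, not real model output).
example_logits = [-3.2, 2.7, -1.5, -0.4, -2.9]
labels = rtd_predictions(example_logits)
probabilities = [round(replaced_probability(x), 3) for x in example_logits]

for logit, label, prob in zip(example_logits, labels, probabilities):
    print("%6.1f %3d %7.3f" % (logit, label, prob))
```

With tensors, the same labels can be obtained with `(discriminator_outputs > 0).long()`. The sigmoid view can be handy when a score per token is more useful than a hard 0/1 decision.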

	## Citation

	If you use this model, please cite the following paper:
	```
	@misc{bhattacharjee2021banglabert,
	title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
	author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
	year={2021},
	eprint={2101.00204},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```