--- language: - bn licenses: - cc-by-nc-sa-4.0 --- # banglat5_banglaparaphrase This repository contains the pretrained checkpoint of the model **BanglaT5** finetuned on [BanglaParaphrase](https://huggingface.co/datasets/csebuetnlp/BanglaParaphrase) dataset. This is a sequence to sequence transformer model pretrained with the ["Span Corruption"]() objective. Finetuned models using this checkpoint achieve competitive results on the dataset. For finetuning and inference, refer to the scripts in the official GitHub repository of [BanglaNLG](https://github.com/csebuetnlp/BanglaNLG). **Note**: This model was pretrained using a specific normalization pipeline available [here](https://github.com/csebuetnlp/normalizer). All finetuning scripts in the official GitHub repository use this normalization by default. If you need to adapt the pretrained model for a different task make sure the text units are normalized using this pipeline before tokenizing to get best results. A basic example is given below: ## Using this model in `transformers` ```python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer from normalizer import normalize # pip install git+https://github.com/csebuetnlp/normalizer model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_banglaparaphrase") tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_banglaparaphrase", use_fast=False) input_sentence = "" input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids generated_tokens = model.generate(input_ids) decoded_tokens = tokenizer.batch_decode(generated_tokens)[0] print(decoded_tokens) ``` ## Benchmarks * Supervised fine-tuning | Test Set | Model | sacreBLEU | ROUGE-L | PINC | BERTScore | BERT-iBLEU | | -------- | ----- | --------- | ------- | ---- | --------- | ---------- | | [BanglaParaphrase](https://huggingface.co/datasets/csebuetnlp/BanglaParaphrase) | [BanglaT5](https://huggingface.co/csebuetnlp/banglat5)
[IndicBART](https://huggingface.co/ai4bharat/IndicBART)
[IndicBARTSS](https://huggingface.co/ai4bharat/IndicBARTSS)| 32.8
5.60
4.90 | 63.58
35.61
33.66 | 74.40
80.26
82.10 | 94.80
91.50
91.10 | 92.18
91.16
90.95 | | [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase) |BanglaT5
IndicBART
IndicBARTSS| 11.0
12.0
10.7| 19.99
21.58
20.59| 74.50
76.83
77.60| 94.80
93.30
93.10 | 87.738
90.65
90.54| The dataset can be found in the link below: * **[BanglaParaphrase](https://huggingface.co/datasets/csebuetnlp/BanglaParaphrase)** ## Citation If you use this model, please cite the following paper: ``` @article{akil2022banglaparaphrase, title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset}, author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat}, journal={arXiv preprint arXiv:2210.05109}, year={2022} } ```