---
library_name: peft
license: cc-by-4.0
datasets:
- SPRINGLab/shiksha
- SPRINGLab/BPCC_cleaned
language:
- bn
- gu
- hi
- mr
- ml
- kn
- ta
- te
- en
metrics:
- bleu
base_model:
- facebook/nllb-200-3.3B
pipeline_tag: translation
---
# Shiksha MT Model Card
## Model Details
### 1. Model Description
- **Developed by:** [SPRING Lab](https://asr.iitm.ac.in)
- **Model type:** LoRA Adaptor
- **Language(s) (NLP):** English, Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Tamil, Telugu
- **License:** CC-BY-4.0
- **Finetuned from model:** [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
### 2. Model Sources
- **Paper:** https://arxiv.org/abs/2412.09025
- **Demo:** https://asr.iitm.ac.in/demo/ttt
## Uses
This adapter is intended for technical-domain machine translation (for example, scientific and engineering lecture content) between English and the supported Indian languages, in line with the Shiksha paper linked above.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the LoRA adapter from the Hub (the NLLB-200 3.3B base weights are fetched automatically) and the base tokenizer
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")  # source language: English

input_text = "Welcome back to the lecture series in Cell Culture."

# Target language codes follow FLORES-200: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text[0])
```
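The same adapter covers all the listed target languages; only the forced BOS token changes. Continuing from the snippet above, here is a small sketch that loops over the other supported targets (the loop itself is illustrative; the FLORES-200 codes correspond to the languages listed in this card):

```python
# Continuing from the snippet above: translate the same English input into the other supported targets.
for code in ["ben_Beng", "guj_Gujr", "mar_Deva", "mal_Mlym", "kan_Knda", "tam_Taml", "tel_Telu"]:
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(code),
    )
    print(code, tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```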
## Training Details
### 1. Training Data
We used the following datasets for training this adapter:
- Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
- BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned
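Both corpora are available on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal loading sketch; the `"train"` split name is an assumption, so check each dataset card for its actual configurations and splits:

```python
from datasets import load_dataset

# Loading sketch; the "train" split is an assumption, see the dataset cards for actual configs/splits.
shiksha = load_dataset("SPRINGLab/shiksha", split="train")
bpcc = load_dataset("SPRINGLab/BPCC_cleaned", split="train")

print(shiksha)
print(bpcc)
```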
### 2. Training Hyperparameters
- PEFT type: LoRA
- rank: 256
- LoRA alpha: 256
- LoRA dropout: 0.1
- rsLoRA: True
- target modules: all-linear
- learning rate: 4e-5
- optimizer: Adafactor
- data type: bfloat16 (BF16)
- epochs: 1
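For reference, the settings above roughly correspond to the following PEFT configuration. This is a hedged sketch reconstructed from the list, not the original training script; training-argument values other than those listed (e.g. `output_dir`) are placeholders:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

# Base model in bfloat16, matching the data type listed above
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B", torch_dtype=torch.bfloat16)

# LoRA settings taken directly from the hyperparameter list
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.1,
    use_rslora=True,
    target_modules="all-linear",
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)

# Only learning rate, optimizer, precision, and epochs come from the card; the rest is illustrative
training_args = Seq2SeqTrainingArguments(
    output_dir="shiksha-nllb-lora",  # hypothetical output path
    learning_rate=4e-5,
    optim="adafactor",
    bf16=True,
    num_train_epochs=1,
)
```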
### 3. Compute Infrastructure
We used 8 x A100 40GB GPUs for training this adapter. We would like to thank [CDAC](https://cdac.in) for providing the compute resources.
## Citation
If you use this model in your work, please cite us:
**BibTeX:**
```bibtex
@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages},
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025},
}
```