---
library_name: peft
license: cc-by-4.0
datasets:
- SPRINGLab/shiksha
- SPRINGLab/BPCC_cleaned
language:
- bn
- gu
- hi
- mr
- ml
- kn
- ta
- te
- en
metrics:
- bleu
base_model:
- facebook/nllb-200-3.3B
pipeline_tag: translation
---

# Shiksha MT Model Card

## Model Details

### 1. Model Description

- **Developed by:** [SPRING Lab](https://asr.iitm.ac.in)
- **Model type:** LoRA Adapter
- **Language(s) (NLP):** Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Tamil, Telugu, English
- **License:** CC-BY-4.0
- **Finetuned from model:** [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B)

### 2. Model Sources

- **Paper:** https://arxiv.org/abs/2412.09025
- **Demo:** https://asr.iitm.ac.in/demo/ttt

## Uses

This adapter is intended for translating technical and scientific educational content, such as lecture transcripts, between English and the supported Indian languages.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the LoRA adapter (together with its NLLB-200 base model) and the base tokenizer from the Hub
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B")

input_text = "Welcome back to the lecture series in Cell Culture."

# Target language codes follow FLORES-200: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

output = model.generate(
    input_ids=inputs["input_ids"].to(device),
    max_new_tokens=256,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
)

output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text[0])
```

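If you want to serve the model without going through the PEFT wrapper, you can optionally merge the adapter into the base weights first. The sketch below continues from the snippet above and uses PEFT's standard `merge_and_unload()` API; the output directory name is just an example, not an official checkpoint.

```python
# Optional: merge the LoRA weights into the NLLB-200 base model so the result can be
# loaded later with plain transformers. The output directory name is only an example.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("shiksha-nllb-3.3B-merged")
tokenizer.save_pretrained("shiksha-nllb-3.3B-merged")
```
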
## Training Details

### 1. Training Data

We used the following datasets for training this adapter:

- Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
- BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned

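Both corpora are hosted on the Hugging Face Hub, so they can be pulled directly with the `datasets` library. The sketch below is a minimal loading example; the `train` split name is an assumption, so check each dataset card for the actual splits and column names.

```python
from datasets import load_dataset

# Load the two training corpora from the Hub.
# The "train" split name is an assumption; see each dataset card for its actual schema.
shiksha = load_dataset("SPRINGLab/shiksha", split="train")
bpcc = load_dataset("SPRINGLab/BPCC_cleaned", split="train")

print(shiksha)
print(bpcc)
```
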
### 2. Training Hyperparameters

- peft type: LoRA
- rank: 256
- lora alpha: 256
- lora dropout: 0.1
- rslora: True
- target modules: all-linear
- learning rate: 4e-5
- optimizer: Adafactor
- data type: BF16
- epochs: 1

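For reference, the sketch below reconstructs these settings as a PEFT `LoraConfig` applied to the base model. It is an illustrative approximation inferred from the list above, not the exact training script; in particular the `SEQ_2_SEQ_LM` task type is assumed from the model architecture.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Hedged reconstruction of the adapter configuration from the hyperparameters listed above.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # assumed from the seq2seq base model
    r=256,
    lora_alpha=256,
    lora_dropout=0.1,
    use_rslora=True,
    target_modules="all-linear",
)

# Attach the adapter to the NLLB-200 3.3B base model in BF16, as listed above.
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-3.3B", torch_dtype=torch.bfloat16
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```
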
### 3. Compute Infrastructure

We used 8 x A100 40GB GPUs for training this adapter. We would like to thank [CDAC](https://cdac.in) for providing the compute resources.

## Citation

If you use this model in your work, please cite us:

**BibTeX:**

```bibtex
@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages},
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025},
}
```