Shiksha MT Model Card

Model Details

1. Model Description

Developed by: SPRING Lab
Model type: LoRA Adaptor
Language(s) (NLP): Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Tamil, Telugu
License: CC-BY-4.0
Finetuned from model: NLLB-200 3.3B

2. Model Sources

Paper: https://arxiv.org/abs/2412.09025
Demo: https://asr.iitm.ac.in/demo/ttt

Uses

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer from local checkpoint
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B")

input_text = "Welcome back to the lecture series in Cell Culture."

# Lang codes: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

output = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=256, forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang))

output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

print(output_text[0])

Training Details

1. Training Data

We used the following datasets for training this adapter:

Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned

2. Training Hyperparameters

peft-type: LORA
rank: 256
lora alpha: 256
lora dropout: 0.1
rslora: True
target modules: all-linear
learning rate: 4e-5
optimizer: adafactor
data-type: BF-16
epochs: 1

3. Compute Infrastructure

We used 8 x A100 40GB GPUs for training this adapter. We would like to thank CDAC for providing the compute resources.

Citation

If you use this model in your work, please cite us:

BibTeX:

@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages}, 
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025}, 
}

SPRINGLab
/

shiksha-MT-nllb-3.3B