---
library_name: peft
license: cc-by-4.0
datasets:
- SPRINGLab/shiksha
- SPRINGLab/BPCC_cleaned
language:
- bn
- gu
- hi
- mr
- ml
- kn
- ta
- te
- en
metrics:
- bleu
base_model:
- facebook/nllb-200-3.3B
pipeline_tag: translation
---

# Shiksha MT Model Card

## Model Details

### 1. Model Description

- **Developed by:** [SPRING Lab](https://asr.iitm.ac.in)
- **Model type:** LoRA Adapter
- **Language(s) (NLP):** Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Tamil, Telugu
- **License:** CC-BY-4.0
- **Finetuned from model:** [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B)

### 2. Model Sources

- **Paper:** https://arxiv.org/abs/2412.09025
- **Demo:** https://asr.iitm.ac.in/demo/ttt

## Uses

This adapter is intended for technical-domain translation, such as scientific and engineering lecture content, between English and the supported Indian languages.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the LoRA adapter (together with its NLLB-200 3.3B base model) from the Hub,
# and the tokenizer from the base model
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B")

input_text = "Welcome back to the lecture series in Cell Culture."

# Language codes: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

output = model.generate(
    input_ids=inputs["input_ids"].to(device),
    max_new_tokens=256,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
)

output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text[0])
```

## Training Details

### 1. Training Data

We used the following datasets to train this adapter:

- Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
- BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned

### 2. Training Hyperparameters

- peft type: LoRA
- rank: 256
- lora alpha: 256
- lora dropout: 0.1
- rslora: True
- target modules: all-linear
- learning rate: 4e-5
- optimizer: Adafactor
- data type: BF16
- epochs: 1
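For reference, a minimal PEFT configuration matching the hyperparameters above might look like the sketch below. This is an illustrative reconstruction, not the original training script: the `task_type`, the use of `get_peft_model`, and the base-model loading are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Illustrative LoRA config mirroring the hyperparameters listed above.
# Not the original training script; task_type and model loading are assumptions.
lora_config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=256,                        # rank
    lora_alpha=256,
    lora_dropout=0.1,
    use_rslora=True,              # rank-stabilized LoRA scaling
    target_modules="all-linear",  # apply LoRA to all linear layers
)

base_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

The optimizer (Adafactor), learning rate (4e-5), BF16 precision, and single epoch listed above would be set in the training loop or trainer configuration, which is omitted here.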
### 3. Compute Infrastructure

We used 8 x A100 40GB GPUs to train this adapter. We would like to thank [CDAC](https://cdac.in) for providing the compute resources.

## Citation

If you use this model in your work, please cite us:

**BibTeX:**

```bibtex
@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages},
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025},
}
```