---
library_name: peft
license: cc-by-4.0
datasets:
- SPRINGLab/shiksha
- SPRINGLab/BPCC_cleaned
language:
- bn
- gu
- hi
- mr
- ml
- kn
- ta
- te
- en
metrics:
- bleu
base_model:
- facebook/nllb-200-3.3B
pipeline_tag: translation
---
# Shiksha MT Model Card
## Model Details
### 1. Model Description
- **Developed by:** [SPRING Lab](https://asr.iitm.ac.in)
- **Model type:** LoRA Adaptor
- **Language(s) (NLP):** English, Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Tamil, Telugu
- **License:** CC-BY-4.0
- **Finetuned from model:** [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
### 2. Model Sources
- **Paper:** https://arxiv.org/abs/2412.09025
- **Demo:** https://asr.iitm.ac.in/demo/ttt
## Uses
This adapter is intended for technical-domain machine translation (for example, scientific and engineering lecture content) between English and the supported Indian languages, in line with the Shiksha paper linked above.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the LoRA adapter from the Hub (the NLLB-200 3.3B base weights are fetched automatically) and the base tokenizer
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")  # source language: English

input_text = "Welcome back to the lecture series in Cell Culture."

# Target language codes follow FLORES-200: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text[0])
```
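The same adapter covers all the listed target languages; only the forced BOS token changes. Continuing from the snippet above, here is a small sketch that loops over the other supported targets (the loop itself is illustrative; the FLORES-200 codes correspond to the languages listed in this card):

```python
# Continuing from the snippet above: translate the same English input into the other supported targets.
for code in ["ben_Beng", "guj_Gujr", "mar_Deva", "mal_Mlym", "kan_Knda", "tam_Taml", "tel_Telu"]:
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(code),
    )
    print(code, tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```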
## Training Details
### 1. Training Data
We used the following datasets for training this adapter:
- Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
- BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned
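Both corpora are available on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal loading sketch; the `"train"` split name is an assumption, so check each dataset card for its actual configurations and splits:

```python
from datasets import load_dataset

# Loading sketch; the "train" split is an assumption, see the dataset cards for actual configs/splits.
shiksha = load_dataset("SPRINGLab/shiksha", split="train")
bpcc = load_dataset("SPRINGLab/BPCC_cleaned", split="train")

print(shiksha)
print(bpcc)
```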
### 2. Training Hyperparameters
- PEFT type: LoRA
- rank: 256
- LoRA alpha: 256
- LoRA dropout: 0.1
- rsLoRA: True
- target modules: all-linear
- learning rate: 4e-5
- optimizer: Adafactor
- data type: bfloat16 (BF16)
- epochs: 1
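For reference, the settings above roughly correspond to the following PEFT configuration. This is a hedged sketch reconstructed from the list, not the original training script; training-argument values other than those listed (e.g. `output_dir`) are placeholders:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

# Base model in bfloat16, matching the data type listed above
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B", torch_dtype=torch.bfloat16)

# LoRA settings taken directly from the hyperparameter list
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.1,
    use_rslora=True,
    target_modules="all-linear",
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)

# Only learning rate, optimizer, precision, and epochs come from the card; the rest is illustrative
training_args = Seq2SeqTrainingArguments(
    output_dir="shiksha-nllb-lora",  # hypothetical output path
    learning_rate=4e-5,
    optim="adafactor",
    bf16=True,
    num_train_epochs=1,
)
```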
### 3. Compute Infrastructure
We used 8 x A100 40GB GPUs for training this adapter. We would like to thank [CDAC](https://cdac.in) for providing the compute resources.
## Citation
If you use this model in your work, please cite us:
**BibTeX:**
```bibtex
@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages},
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025},
}
```