Mudra-VLM: Bharatanatyam Mudra Recognition
A parameter-efficient vision-language model for recognizing Bharatanatyam hand mudras. Built on Gemma-3 4B with LoRA adaptation, it achieves 96.7% accuracy across 52 mudra classes while training only 0.1% of the model's parameters.
Model Details
- Base Model: Gemma-3 4B (unsloth/gemma-3-4b-it)
- Adaptation: LoRA (r=32)
- Classes: 52 Bharatanatyam mudras (28 Asamyukta + 24 Samyukta)
- Training Data: 28,431 images
- Accuracy: 96.7%
Usage
Installation
pip install modal torch unsloth pillow requests
Local Inference
from unsloth import FastVisionModel
import torch
from PIL import Image
# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Identify the Bharatanatyam mudra shown in this image."}
]}]
image = Image.open("mudra.jpg")
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate (greedy decoding; temperature has no effect when do_sample=False)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
result = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True).strip()
print(f"Mudra: {result}")
Inference on Modal (L40S GPU)
import modal
app = modal.App("bharatanatyam-mudra-inference")
image = (
    modal.Image.debian_slim(python_version="3.10")
    .uv_pip_install(["pip3-autoremove"])
    .uv_pip_install(
        ["torch", "torchvision", "torchaudio", "xformers"],
        index_url="https://download.pytorch.org/whl/cu124",
    )
    .uv_pip_install(["unsloth"])
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

@app.function(
    image=image,
    gpu="L40S",
    timeout=600,
    volumes={"/cache": modal.Volume.from_name("unsloth-cache", create_if_missing=True)},
)
def run_inference(
    image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg",
    instruction: str = "Identify the Bharatanatyam mudra shown in this image.",
):
    import os
    os.environ["HF_HOME"] = "/cache"

    from unsloth import FastVisionModel
    import torch
    from PIL import Image
    import requests
    from io import BytesIO

    # Load model
    model, tokenizer = FastVisionModel.from_pretrained(
        "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )
    FastVisionModel.for_inference(model)

    # Load image
    response = requests.get(image_url)
    response.raise_for_status()
    img = Image.open(BytesIO(response.content))

    # Prepare input
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ]}
    ]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(
        img,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")

    # Generate (greedy decoding; temperature has no effect when do_sample=False)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            use_cache=True,
            do_sample=False,
        )

    # Decode only the newly generated tokens
    result = tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):],
        skip_special_tokens=True,
    ).strip()
    print(f"Predicted Mudra: {result}")
    return {"mudra": result, "image_url": image_url}
@app.local_entrypoint()
def main(image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg"):
    result = run_inference.remote(image_url=image_url)
    return result
Run:
modal run mudra_vlm_inference.py --image-url "https://your-image-url.jpg"
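To score many images in one run, the same script can expose a second local entrypoint that fans calls out in parallel with Modal's Function.map. A hedged sketch; the batch entrypoint and the urls.txt file are illustrative additions, not part of the released script:

@app.local_entrypoint()
def batch(urls_file: str = "urls.txt"):
    # Hypothetical text file with one image URL per line.
    with open(urls_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    # Each URL becomes one remote call; Modal schedules them in parallel.
    for result in run_inference.map(urls):
        print(result["image_url"], "->", result["mudra"])

Run it by selecting the entrypoint explicitly:

modal run mudra_vlm_inference.py::batch --urls-file urls.txt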
Training
- Method: Instruction-based supervised fine-tuning with LoRA
- LoRA Config: rank=32, alpha=64, dropout=0.05
- Hardware: NVIDIA A100 GPU
- Training Time: ~6 hours
- Format: Visual question-answering style
User: <image> Identify the Bharatanatyam mudra shown in this image.
Assistant: [Mudra Name]
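For reference, a setup like this can be reproduced with unsloth's vision fine-tuning API. The sketch below uses the hyperparameters listed above (rank=32, alpha=64, dropout=0.05), but the base-model handle, the to_conversation helper, and the sample["image"] / sample["label"] keys are assumptions rather than the released training script:

from unsloth import FastVisionModel

# Load the instruction-tuned base model in 4-bit and attach LoRA adapters.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
model = FastVisionModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    finetune_vision_layers=True,
    finetune_language_layers=True,
)

INSTRUCTION = "Identify the Bharatanatyam mudra shown in this image."

def to_conversation(sample):
    """Map one (image, label) record to the VQA-style chat format shown above."""
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": INSTRUCTION},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["label"]},  # e.g. "Pataka"
        ]},
    ]}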
Performance
| Model | Accuracy |
|---|---|
| Mudra-VLM (Ours) | 96.7% |
| Haridas et al. (2022) | 95.0% |
| Naaz et al. (2023) | 94.9% |
| MudraGyaan (2024) | 91.9% |
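For a generative classifier like this, accuracy is naturally measured as an exact match between the generated mudra name and the ground-truth class label. A minimal sketch of that comparison, assuming you have collected (prediction, label) pairs from your own test split:

def normalize(name: str) -> str:
    # Compare mudra names case- and whitespace-insensitively.
    return " ".join(name.lower().split())

def exact_match_accuracy(pairs):
    """pairs: iterable of (predicted_name, true_name) tuples."""
    pairs = list(pairs)
    return sum(normalize(p) == normalize(t) for p, t in pairs) / len(pairs)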
Limitations
- Trained on static images only (not suitable for video/continuous gestures)
- Best performance in controlled lighting conditions
- May struggle with extreme occlusions or motion blur
Citation
@article{samarth2025mudravlm,
title={Mudra-VLM: Adapting Vision-Language Models for Fine-Grained Bharatanatyam Mudra Recognition},
author={Samarth P and Sakshi Rajani},
year={2025},
institution={PES University, Bengaluru, India}
}
Authors
Samarth P, Sakshi Rajani (PES University, Bengaluru, India)
License
Apache 2.0