Mudra-VLM: Bharatanatyam Mudra Recognition


A parameter-efficient vision-language model for recognizing Bharatanatyam hand mudras. Built on Gemma-3 4B with LoRA adaptation, it achieves 96.7% accuracy across 52 mudra classes while training only about 0.1% of the base model's parameters.
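If you want to verify the trainable-parameter fraction yourself, the sketch below counts LoRA adapter parameters by name against the full parameter count. It is illustrative only: it loads the weights in 16-bit (so counts are not distorted by 4-bit weight packing) and assumes the adapter loads through unsloth as in the usage examples further down.

# Rough sanity check of the parameter-efficiency claim (sketch only).
# Loading in 16-bit needs roughly 8 GB of memory.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
    load_in_4bit=False,
)

lora_params  = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
total_params = sum(p.numel() for p in model.parameters())
print(f"LoRA params: {lora_params:,} of {total_params:,} "
      f"({100 * lora_params / total_params:.3f}%)")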

Model Details

  • Base Model: Gemma-3 4B (unsloth/gemma-3-4b-it)
  • Adaptation: LoRA (r=32)
  • Classes: 52 Bharatanatyam mudras (28 Asamyukta + 24 Samyukta)
  • Training Data: 28,431 images
  • Accuracy: 96.7%

Usage

Installation

pip install modal torch unsloth pillow requests

Local Inference

from unsloth import FastVisionModel
import torch
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Identify the Bharatanatyam mudra shown in this image."}
]}]

image = Image.open("mudra.jpg")
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate
outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.1, do_sample=False)
result = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True).strip()

print(f"Mudra: {result}")
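To classify a whole folder of images while loading the model only once, you can wrap the steps above in a small helper. This is an illustrative extension of the snippet above; the directory name and helper function are hypothetical, and it reuses the model, tokenizer, and messages already defined.

# Illustrative helper: reuse the loaded model/tokenizer to classify every
# image in a directory (directory name is hypothetical).
from pathlib import Path

def classify(image: Image.Image) -> str:
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(image, input_text, add_special_tokens=False,
                       return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(outputs[0][len(inputs.input_ids[0]):],
                            skip_special_tokens=True).strip()

for path in sorted(Path("mudra_images").glob("*.jpg")):
    print(path.name, "->", classify(Image.open(path)))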

Inference on Modal (L40S GPU)

import modal

app = modal.App("bharatanatyam-mudra-inference")

image = (
    modal.Image.debian_slim(python_version="3.10")
    .uv_pip_install(["pip3-autoremove"])
    .uv_pip_install(
        ["torch", "torchvision", "torchaudio", "xformers"], 
        index_url="https://download.pytorch.org/whl/cu124"
    )
    .uv_pip_install(["unsloth"])
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

@app.function(
    image=image,
    gpu="L40S",
    timeout=600,
    volumes={"/cache": modal.Volume.from_name("unsloth-cache", create_if_missing=True)},
)
def run_inference(
    image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg",
    instruction: str = "Identify the Bharatanatyam mudra shown in this image.",
):
    import os
    os.environ["HF_HOME"] = "/cache"
    
    from unsloth import FastVisionModel
    import torch
    from PIL import Image
    import requests
    from io import BytesIO
    
    # Load model
    model, tokenizer = FastVisionModel.from_pretrained(
        "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )
    FastVisionModel.for_inference(model)
    
    # Load image
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    
    # Prepare input
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(
        img,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            use_cache=True,
            temperature=0.1,
            do_sample=False
        )
    
    # Decode
    result = tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):],
        skip_special_tokens=True
    ).strip()
    
    print(f"Predicted Mudra: {result}")
    return {"mudra": result, "image_url": image_url}

@app.local_entrypoint()
def main(image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg"):
    result = run_inference.remote(image_url=image_url)
    return result

Run:

modal run mudra_vlm_inference.py --image-url "https://your-image-url.jpg"
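If you deploy the app with modal deploy mudra_vlm_inference.py, the function can also be invoked from any other Python process through the Modal client. This is a sketch: the app and function names must match the script above, and the image URL is a placeholder.

# Sketch: call the deployed Modal function programmatically.
# Assumes a prior `modal deploy mudra_vlm_inference.py`.
import modal

run_inference = modal.Function.from_name(
    "bharatanatyam-mudra-inference",  # app name from the script above
    "run_inference",
)

result = run_inference.remote(
    image_url="https://example.com/mudra.jpg",  # hypothetical image URL
    instruction="Identify the Bharatanatyam mudra shown in this image.",
)
print(result["mudra"])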

Training

  • Method: Instruction-based supervised fine-tuning with LoRA
  • LoRA Config: rank=32, alpha=64, dropout=0.05
  • Hardware: NVIDIA A100 GPU
  • Training Time: ~6 hours
  • Format: Visual question-answering style:
      User: <image> Identify the Bharatanatyam mudra shown in this image.
      Assistant: [Mudra Name]
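A minimal sketch of how the LoRA configuration above could be reproduced with unsloth's vision fine-tuning API. Only the LoRA settings (r=32, alpha=64, dropout=0.05) come from this card; the dataset variable, batch size, learning rate, and other hyperparameters are assumptions, not the exact training recipe.

# Sketch of the fine-tuning setup; everything except the LoRA settings is illustrative.
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
model = FastVisionModel.get_peft_model(
    model,
    r=32, lora_alpha=64, lora_dropout=0.05,   # LoRA config from this card
    finetune_vision_layers=True,
    finetune_language_layers=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,              # conversations in the VQA format above
    args=SFTConfig(
        per_device_train_batch_size=2,        # illustrative values below
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=is_bf16_supported(),
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
        output_dir="outputs",
    ),
)
trainer.train()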

Performance

Model                   Accuracy
Mudra-VLM (Ours)        96.7%
Haridas et al. (2022)   95.0%
Naaz et al. (2023)      94.9%
MudraGyaan (2024)       91.9%

Limitations

  • Trained on static images only (not suitable for video/continuous gestures)
  • Best performance in controlled lighting conditions
  • May struggle with extreme occlusions or motion blur

Citation

@article{samarth2025mudravlm,
  title={Mudra-VLM: Adapting Vision-Language Models for Fine-Grained Bharatanatyam Mudra Recognition},
  author={Samarth P and Sakshi Rajani},
  year={2025},
  institution={PES University, Bengaluru, India}
}

Authors

Samarth P, Sakshi Rajani (PES University, Bengaluru, India)

License

Apache 2.0
