Mudra-VLM: Bharatanatyam Mudra Recognition


A parameter-efficient vision-language model for recognizing Bharatanatyam hand mudras. Built on Gemma-3 4B with LoRA adaptation, it achieves 96.7% accuracy across 52 mudra classes while training only about 0.1% of the base model's parameters.
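If you want to verify the trainable-parameter fraction yourself, the sketch below counts LoRA adapter parameters by name against the full parameter count. It is illustrative only: it loads the weights in 16-bit (so counts are not distorted by 4-bit weight packing) and assumes the adapter loads through unsloth as in the usage examples further down.

# Rough sanity check of the parameter-efficiency claim (sketch only).
# Loading in 16-bit needs roughly 8 GB of memory.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
    load_in_4bit=False,
)

lora_params  = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
total_params = sum(p.numel() for p in model.parameters())
print(f"LoRA params: {lora_params:,} of {total_params:,} "
      f"({100 * lora_params / total_params:.3f}%)")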

Model Details

  • Base Model: Gemma-3 4B (unsloth/gemma-3-4b-it)
  • Adaptation: LoRA (r=32)
  • Classes: 52 Bharatanatyam mudras (28 Asamyukta + 24 Samyukta)
  • Training Data: 28,431 images
  • Accuracy: 96.7%

Usage

Installation

pip install modal torch unsloth pillow requests

Local Inference

from unsloth import FastVisionModel
import torch
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Identify the Bharatanatyam mudra shown in this image."}
]}]

image = Image.open("mudra.jpg")
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate
outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.1, do_sample=False)
result = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True).strip()

print(f"Mudra: {result}")
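To classify a whole folder of images while loading the model only once, you can wrap the steps above in a small helper. This is an illustrative extension of the snippet above; the directory name and helper function are hypothetical, and it reuses the model, tokenizer, and messages already defined.

# Illustrative helper: reuse the loaded model/tokenizer to classify every
# image in a directory (directory name is hypothetical).
from pathlib import Path

def classify(image: Image.Image) -> str:
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(image, input_text, add_special_tokens=False,
                       return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(outputs[0][len(inputs.input_ids[0]):],
                            skip_special_tokens=True).strip()

for path in sorted(Path("mudra_images").glob("*.jpg")):
    print(path.name, "->", classify(Image.open(path)))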

Inference on Modal (L40S GPU)

import modal

app = modal.App("bharatanatyam-mudra-inference")

image = (
    modal.Image.debian_slim(python_version="3.10")
    .uv_pip_install(["pip3-autoremove"])
    .uv_pip_install(
        ["torch", "torchvision", "torchaudio", "xformers"], 
        index_url="https://download.pytorch.org/whl/cu124"
    )
    .uv_pip_install(["unsloth"])
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

@app.function(
    image=image,
    gpu="L40S",
    timeout=600,
    volumes={"/cache": modal.Volume.from_name("unsloth-cache", create_if_missing=True)},
)
def run_inference(
    image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg",
    instruction: str = "Identify the Bharatanatyam mudra shown in this image.",
):
    import os
    os.environ["HF_HOME"] = "/cache"
    
    from unsloth import FastVisionModel
    import torch
    from PIL import Image
    import requests
    from io import BytesIO
    
    # Load model
    model, tokenizer = FastVisionModel.from_pretrained(
        "Samarth0710/bharatanatyam-mudra-model-gemma-3-4b-it-r32",
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )
    FastVisionModel.for_inference(model)
    
    # Load image
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    
    # Prepare input
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(
        img,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            use_cache=True,
            temperature=0.1,
            do_sample=False
        )
    
    # Decode
    result = tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):],
        skip_special_tokens=True
    ).strip()
    
    print(f"Predicted Mudra: {result}")
    return {"mudra": result, "image_url": image_url}

@app.local_entrypoint()
def main(image_url: str = "https://www.shutterstock.com/image-photo/woman-hand-showing-hamsapaksha-hasta-260nw-37167580.jpg"):
    result = run_inference.remote(image_url=image_url)
    return result

Run:

modal run mudra_vlm_inference.py --image-url "https://your-image-url.jpg"
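If you deploy the app with modal deploy mudra_vlm_inference.py, the function can also be invoked from any other Python process through the Modal client. This is a sketch: the app and function names must match the script above, and the image URL is a placeholder.

# Sketch: call the deployed Modal function programmatically.
# Assumes a prior `modal deploy mudra_vlm_inference.py`.
import modal

run_inference = modal.Function.from_name(
    "bharatanatyam-mudra-inference",  # app name from the script above
    "run_inference",
)

result = run_inference.remote(
    image_url="https://example.com/mudra.jpg",  # hypothetical image URL
    instruction="Identify the Bharatanatyam mudra shown in this image.",
)
print(result["mudra"])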

Training

  • Method: Instruction-based supervised fine-tuning with LoRA
  • LoRA Config: rank=32, alpha=64, dropout=0.05
  • Hardware: NVIDIA A100 GPU
  • Training Time: ~6 hours
  • Format: Visual question-answering style:
      User: <image> Identify the Bharatanatyam mudra shown in this image.
      Assistant: [Mudra Name]
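A minimal sketch of how the LoRA configuration above could be reproduced with unsloth's vision fine-tuning API. Only the LoRA settings (r=32, alpha=64, dropout=0.05) come from this card; the dataset variable, batch size, learning rate, and other hyperparameters are assumptions, not the exact training recipe.

# Sketch of the fine-tuning setup; everything except the LoRA settings is illustrative.
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
model = FastVisionModel.get_peft_model(
    model,
    r=32, lora_alpha=64, lora_dropout=0.05,   # LoRA config from this card
    finetune_vision_layers=True,
    finetune_language_layers=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,              # conversations in the VQA format above
    args=SFTConfig(
        per_device_train_batch_size=2,        # illustrative values below
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=is_bf16_supported(),
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
        output_dir="outputs",
    ),
)
trainer.train()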

Performance

Model                   Accuracy
Mudra-VLM (Ours)        96.7%
Haridas et al. (2022)   95.0%
Naaz et al. (2023)      94.9%
MudraGyaan (2024)       91.9%

Limitations

  • Trained on static images only (not suitable for video/continuous gestures)
  • Best performance in controlled lighting conditions
  • May struggle with extreme occlusions or motion blur

Citation

@article{samarth2025mudravlm,
  title={Mudra-VLM: Adapting Vision-Language Models for Fine-Grained Bharatanatyam Mudra Recognition},
  author={Samarth P and Sakshi Rajani},
  year={2025},
  institution={PES University, Bengaluru, India}
}

Authors

Samarth P, Sakshi Rajani (PES University, Bengaluru, India)

License

Apache 2.0
