I built a Multimodal Vision-Language Model using Gemma-270M + CLIP!
Just finished training my multimodal model on the full LLaVA-Instruct-150K dataset (157K samples) and wanted to share the results!
What I Built: A vision-language model that can understand images and answer questions about them, combining:
- Google Gemma-3-270M (language)
- OpenAI CLIP ViT-Large/14 (vision)
- LoRA fine-tuning for efficiency
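For anyone curious how the two backbones get glued together, here is a minimal sketch of the usual LLaVA-style recipe: encode the image with CLIP, project the patch embeddings into the language model's hidden size, and prepend them to the text tokens. The model names match the ones above, but the projector shape and forward logic are my own assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class TinyVLM(nn.Module):
    """Minimal LLaVA-style wrapper: CLIP vision tower + linear projector + Gemma LM."""

    def __init__(self,
                 lm_name="google/gemma-3-270m",                 # assumed HF model id
                 vision_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        # Project CLIP patch features (1024-dim) into the LM embedding space.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids, attention_mask=None):
        # CLIP patch embeddings, dropping the [CLS] token: (B, 256, 1024)
        patches = self.vision(pixel_values).last_hidden_state[:, 1:, :]
        image_embeds = self.projector(patches)                   # (B, 256, d_model)
        text_embeds = self.lm.get_input_embeddings()(input_ids)  # (B, T, d_model)
        # Prepend the projected image tokens to the text sequence.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        if attention_mask is not None:
            image_mask = torch.ones(image_embeds.shape[:2],
                                    dtype=attention_mask.dtype,
                                    device=attention_mask.device)
            attention_mask = torch.cat([image_mask, attention_mask], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```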
Training Stats:
- 157,712 training samples (full LLaVA dataset)
- 3 epochs on A100 40GB
- ~9 hours training time
- Final loss: 1.333 training / 1.430 validation
- Only 18.6M trainable params (3.4% of 539M total)
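The small trainable-parameter count comes from freezing both backbones and training only LoRA adapters (plus the projector). A rough peft setup along these lines, applied to the TinyVLM sketch above, illustrates the idea; the rank, alpha, and target modules here are illustrative guesses, not the exact values used.

```python
from peft import LoraConfig, get_peft_model

# LoRA on the LM's attention projections; values below are assumptions.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = TinyVLM()
model.lm = get_peft_model(model.lm, lora_cfg)
for p in model.vision.parameters():
    p.requires_grad = False              # keep the CLIP tower frozen
model.lm.print_trainable_parameters()    # only the adapters (and projector) train
```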
sagar007/multigemma Benchmark Results:
- VQA Accuracy: 53.8%
- Works great for: animal detection, room identification, scene understanding
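To give a sense of what a VQA query looks like end to end, here is a hedged usage sketch against the TinyVLM wrapper above. The prompt format, image filename, and generation settings are assumptions for illustration, not the repo's exact interface.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("living_room.jpg")  # hypothetical example image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
prompt = "Question: What room is shown in this image? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

model.eval()
with torch.no_grad():
    # Same image-then-text embedding layout as in the forward pass above.
    patches = model.vision(pixel_values).last_hidden_state[:, 1:, :]
    image_embeds = model.projector(patches)
    text_embeds = model.lm.get_input_embeddings()(inputs.input_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    output_ids = model.lm.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```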