base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
Model Card for Model ID
Finetuned LLaVA-Next model for Visual QA on Stack Overflow questions with images.
Model Details
Model Description
This model is a finetuned version of LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf) specifically for visual question answering (VQA) on Stack Overflow questions containing images. The model was finetuned using QLoRA with 4-bit quantization, optimized to handle both text and image inputs.
The training dataset was filtered from the mirzaei2114/stackoverflowVQA-filtered-small dataset. Only samples with a maximum input length of 1024 (for both question and answer combined) were used. Images were kept to size to capture detail needed for methods such as optical character recognition.
- Developed by: Adam Cassidy
- Model type: Visual QA
- Language(s) (NLP): EN
- License: Apache License, Version 2.0
- Finetuned from model: llava-hf/llava-v1.6-mistral-7b-hf
Model Sources
- Repository: llava-hf/llava-v1.6-mistral-7b-hf
Uses
Drag a snipping rectangle for a screenshot around the exact focus/context for a question related to software development(usually front end) and accompany it with the question for inference.
Direct Use
Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.
Out-of-Scope Use
General-purpose VQA tasks, though performance on non-technical domains may vary.
Bias, Risks, and Limitations
Model Capacity: The model was trained using 4-bit QLoRA. Dataset Size: The training dataset is relatively small, and this may impact generalization to other VQA datasets or domains outside of Stack Overflow.
How to Get Started with the Model
To use this model, ensure you have the following dependencies installed: torch==2.4.1+cu121 transformers==4.45.1
Do inference according to this multi-image inference llava-next example: https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can
Training Details
Training Data
mirzaei2114/stackoverflowVQA-filtered-small
Training Procedure
Training Hyperparameters
TrainingArguments( per_device_train_batch_size=4, per_device_eval_batch_size=4, max_grad_norm=0.1, evaluation_strategy="steps", eval_steps=15, group_by_length=True, logging_steps=15, gradient_checkpointing=True, gradient_accumulation_steps=2, num_train_epochs=3, weight_decay=0.1, warmup_steps=10, lr_scheduler_type="cosine", learning_rate=1e-5, save_steps=15, save_total_limit=5, bf16=True, remove_unused_columns=False )
Speeds, Sizes, Times
checkpoint-240
Evaluation
Evaluation Loss (Pre-finetuning): 2.93 Validation Loss (Post-finetuning): 1.78
Testing Data, Factors & Metrics
Testing Data
mirzaei2114/stackoverflowVQA-filtered-small
Compute Infrastructure
Hardware
L4 GPU
Software
Google Colab
Framework versions
- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1