metadata

base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
  - mirzaei2114/stackoverflowVQA-filtered-small
language:
  - en
tags:
  - llava
  - llava-next
  - fine-tuned
  - stack-overflow
  - qlora
  - images
  - vqa
  - 4bit

Model Card for Model ID

Finetuned LLaVA-Next model for Visual QA on Stack Overflow questions with images.

Model Details

Model Description

This model is a finetuned version of LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf) specifically for visual question answering (VQA) on Stack Overflow questions containing images. The model was finetuned using QLoRA with 4-bit quantization, optimized to handle both text and image inputs.

The training dataset was filtered from the mirzaei2114/stackoverflowVQA-filtered-small dataset. Only samples with a maximum input length of 1024 (for both question and answer combined) were used. Images were kept to size to capture detail needed for methods such as optical character recognition.

Developed by: Adam Cassidy
Model type: Visual QA
Language(s) (NLP): EN
License: Apache License, Version 2.0
Finetuned from model: llava-hf/llava-v1.6-mistral-7b-hf

Model Sources

Repository: llava-hf/llava-v1.6-mistral-7b-hf

Uses

Drag a snipping rectangle for a screenshot around the exact focus/context for a question related to software development(usually front end) and accompany it with the question for inference.

Direct Use

Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.

Out-of-Scope Use

General-purpose VQA tasks, though performance on non-technical domains may vary.

Bias, Risks, and Limitations

Model Capacity: The model was trained using 4-bit QLoRA. Dataset Size: The training dataset is relatively small, and this may impact generalization to other VQA datasets or domains outside of Stack Overflow.

How to Get Started with the Model

To use this model, ensure you have the following dependencies installed: torch==2.4.1+cu121 transformers==4.45.1

Do inference according to this multi-image inference llava-next example: https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can

Training Details

Training Data

mirzaei2114/stackoverflowVQA-filtered-small

Training Procedure

Training Hyperparameters

TrainingArguments( per_device_train_batch_size=4, per_device_eval_batch_size=4, max_grad_norm=0.1, evaluation_strategy="steps", eval_steps=15, group_by_length=True, logging_steps=15, gradient_checkpointing=True, gradient_accumulation_steps=2, num_train_epochs=3, weight_decay=0.1, warmup_steps=10, lr_scheduler_type="cosine", learning_rate=1e-5, save_steps=15, save_total_limit=5, bf16=True, remove_unused_columns=False )

Speeds, Sizes, Times

checkpoint-240

Evaluation

Evaluation Loss (Pre-finetuning): 2.93 Validation Loss (Post-finetuning): 1.78

Testing Data, Factors & Metrics

Testing Data

mirzaei2114/stackoverflowVQA-filtered-small

Compute Infrastructure

Hardware

L4 GPU

Software

Google Colab

Framework versions

PEFT 0.13.1.dev0
PyTorch 2.4.1+cu121
Transformers 4.45.1