---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---

# Model Card for LLaVA-Next Stack Overflow VQA (QLoRA)

A finetuned LLaVA-Next model for visual question answering (VQA) on Stack Overflow questions that include images.

## Model Details

### Model Description

This model is a finetuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)** specifically for visual question answering (VQA)
on Stack Overflow questions containing images. The model was finetuned using **QLoRA** with 4-bit quantization, optimized to handle both
text and image inputs.

The training dataset was filtered from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset.
Only samples with a combined question-and-answer length of at most 1024 tokens were used. Images were kept at a resolution
high enough to capture the detail needed for methods such as optical character recognition.
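The filtering code itself is not part of this card; a minimal sketch of the length filter described above might look like the following. The column names (`question`, `answer`) are assumptions about the dataset schema and may need to be adjusted.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
dataset = load_dataset("mirzaei2114/stackoverflowVQA-filtered-small")

MAX_LEN = 1024  # combined question + answer token budget

def within_budget(sample):
    # Column names are assumed; adjust to the actual dataset schema.
    text = sample["question"] + " " + sample["answer"]
    return len(tokenizer(text).input_ids) <= MAX_LEN

dataset = dataset.filter(within_budget)
```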



- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf

### Model Sources

- **Repository (base model):** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)

## Uses

Take a screenshot (e.g., by dragging a snipping rectangle) around the exact focus/context of a software-development question (usually front-end related)
and pass it to the model together with the question text for inference.

### Direct Use

Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.

### Out-of-Scope Use

General-purpose VQA outside of technical, software-related content; performance on non-technical domains may vary.

## Bias, Risks, and Limitations

- **Model capacity:** The model was trained with 4-bit QLoRA, so only low-rank adapter weights were updated on top of a 4-bit quantized base model, which limits how far it can adapt beyond the base model's behavior.
- **Dataset size:** The training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside of Stack Overflow.

## How to Get Started with the Model

To use this model, install the following dependencies:

- torch==2.4.1+cu121
- transformers==4.45.1

For inference, follow the multi-image inference example for LLaVA-Next in the Transformers documentation: https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can
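A minimal single-image sketch is shown below, assuming the adapter is loaded with PEFT on top of the 4-bit quantized base model. The adapter repository id is a placeholder, since this card does not state where the adapter weights are hosted.

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "<your-adapter-repo-id>"  # placeholder: the Hub id of this adapter

# Load the base model in 4-bit, matching the QLoRA training setup described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# A screenshot of the code/UI the question is about.
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Why does this CSS grid overflow its container?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```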

## Training Details

### Training Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)

### Training Procedure

#### Training Hyperparameters

    TrainingArguments(
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        max_grad_norm=0.1,
        evaluation_strategy="steps",
        eval_steps=15,
        group_by_length=True,
        logging_steps=15,
        gradient_checkpointing=True,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        weight_decay=0.1,
        warmup_steps=10,
        lr_scheduler_type="cosine",
        learning_rate=1e-5,
        save_steps=15,
        save_total_limit=5,
        bf16=True,
        remove_unused_columns=False
    )
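The adapter configuration itself is not listed in this card. A minimal QLoRA setup consistent with the description above might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values actually used.

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the base model (QLoRA), as described in the model description.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA settings; the actual values are not documented in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```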

#### Speeds, Sizes, Times

`checkpoint-240` (saved at training step 240)

## Evaluation

- Validation loss (pre-finetuning): 2.93
- Validation loss (post-finetuning): 1.78

### Testing Data, Factors & Metrics

#### Testing Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)

### Compute Infrastructure

#### Hardware

NVIDIA L4 GPU

#### Software

Google Colab

### Framework versions

- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1