---
license: apache-2.0
datasets:
- FreedomIntelligence/ALLaVA-4V
- Vision-Flan/vision-flan_191-task_1k
language:
- en
base_model:
- Lin-Chen/open-llava-next-llama3-8b
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the **visual-instruction synthesizer** from our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. 
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.** 
**(2) Training Pipeline**: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. 
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/bRu85CWwP9129bSCRzos2.png" width="1000">
</p>

## Resources
**🤗 We share our data and models with example usages. Feel free to open issues or discussions! 🤗**

| Model                                                                       | Repo ID in HF 🤗                           | Domain       | Base Model              | Training Data                                                                                  | Evaluation Benchmark |
|:----------------------------------------------------------------------------|:--------------------------------------------|:--------------|:-------------------------|:------------------------------------------------------------------------------------------------|-----------------------|
| [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer     | -  | open-llava-next-llama3-8b    | VisionFLAN and ALLaVA | -                   |
| [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct     | Biomedicine  | Qwen2-VL-2B-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct     | Food  | Qwen2-VL-2B-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
| [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B     | Biomedicine  | open-llava-next-llama3-8b    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) |AdaptLLM/food-LLaVA-NeXT-Llama3-8B     | Food  | open-llava-next-llama3-8b    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
| [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct     | Biomedicine  | Llama-3.2-11B-Vision-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct     | Food | Llama-3.2-11B-Vision-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |

**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)

### 1. Basic Usage: Synthesize a task triplet based on a given image-caption pair
The example below synthesizes an "instruction-informative response-precise response" triplet from the following image-caption pair.

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg" width="200">
</p>

<details>
<summary> Click to expand </summary>

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Define your input image-caption pair here:
## image 
url = "https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

## Caption
caption = "Dish: Strawberry Waffles\n\nSteps to prepare:\na). Preheat and grease a waffle iron according to manufacturer's instructions.\nb). Sift flour, baking powder, and salt together in a bowl. Whisk buttermilk, yogurt, butter, eggs, and sugar together in a separate bowl; stir into flour mixture until batter is smooth. Fold strawberries into batter.\nc). Pour about 1/3 cup batter into preheated waffle iron; cook until lightly browned, 5 to 7 minutes. Repeat with remaining batter.\n\nIngredients you'll need:\n(a). 2 1/2 cups all-purpose flour\n(b). 4 teaspoons baking powder\n(c). 3/4 teaspoon salt\n(d). 2 cups buttermilk\n(e). 1/2 cup vanilla Greek-style yogurt\n(f). 1/2 cup butter, melted\n(g). 2 eggs, beaten\n(h). 1 1/2 tablespoons white sugar\n(i). 3/4 cup chopped strawberries, or more to taste"

# =========================== You do NOT need to modify anything below ===============================

# Path to synthesizer
model_path = "AdaptLLM/visual-instruction-synthesizer"

# Prompt Hints
caption_hint = "Describe the image."
precise_hint = "Answer with a precise response.\n"
informative_hint = "Answer with an informative response.\n"

# Function to parse predictions
def parse_pred(pred):
    if not pred.endswith("<|end_of_text|>"):
        return []

    pred = pred[:-len("<|end_of_text|>")]

    QA_str_list = pred.split("<|start_header_id|>user<|end_header_id|>\n\n")
    if not pred.endswith("<|eot_id|>"):
        QA_str_list = QA_str_list[:-1]

    QA_list = []
    for QA_str in QA_str_list:
        try:
            assert "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" in QA_str
            Q_str, A_str = QA_str.split("<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
            Q_str, A_str = Q_str.strip(), A_str[:-len("<|eot_id|>")].strip()
            assert Q_str and A_str
            QA_list.append({"Q": Q_str, "A": A_str})
        except AssertionError:
            pass  # Skip invalid entries

    conversations = []
    for qa_entry in QA_list:
        conversations.append({"from": "human", "value": qa_entry["Q"]})
        conversations.append({"from": "gpt", "value": qa_entry["A"]})
    return conversations

# Function to extract task triplets
def get_task_triplet(pred):
    pred_QAs = parse_pred(pred)
    precise_QAs = {}
    informative_QAs = {}
    collected_QA = None

    for idx in range(0, len(pred_QAs), 2):  # Iterate over question-answer pairs
        question = pred_QAs[idx]["value"]
        answer = pred_QAs[idx + 1]["value"]
        if question.startswith(precise_hint):
            precise_q = question[len(precise_hint):]
            if precise_q in informative_QAs:
                collected_QA = {
                    "Q": precise_q,
                    "precise_A": answer,
                    "informative_A": informative_QAs[precise_q],
                }
                break
            else:
                precise_QAs[precise_q] = answer
        elif question.startswith(informative_hint):
            informative_q = question[len(informative_hint):]
            if informative_q in precise_QAs:
                collected_QA = {
                    "Q": informative_q,
                    "precise_A": precise_QAs[informative_q],
                    "informative_A": answer,
                }
                break
            else:
                informative_QAs[informative_q] = answer

    return collected_QA

# Load the processor
processor = LlavaNextProcessor.from_pretrained(model_path)

# Define image token
image_token = "<|reserved_special_token_4|>"

# Format the prompt
prompt = (
    f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
    f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{image_token}\n{caption_hint}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    f"{caption}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
)

# Load the model
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Prepare inputs and generate output
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
answer_start = int(inputs["input_ids"].shape[-1])
output = model.generate(**inputs, max_new_tokens=512)

# Decode predictions
pred = processor.decode(output[0][answer_start:], skip_special_tokens=False)
print(f"## Synthesizer predictions:\n{pred}")

# Extract task triplets
task_triplet = get_task_triplet(pred)
print(f"## Synthesized Task triplet:\n{task_triplet}")
```
</details>
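
To make the output format concrete, here is a minimal sketch that applies the same pairing idea to a hand-written prediction string instead of real model output. The mock string, the example question, and the `pair_answers` helper are assumptions made for illustration; actual synthesizer output will differ.

```python
# Illustrative only: a hand-written string mimicking the Llama-3 chat format
# parsed in the basic-usage example. Real synthesizer output will differ.
USER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"
PRECISE_HINT = "Answer with a precise response.\n"
INFORMATIVE_HINT = "Answer with an informative response.\n"

question = "What fruit is folded into the batter?"  # hypothetical question
mock_pred = (
    f"{INFORMATIVE_HINT}{question}{ASSISTANT}"
    f"The recipe folds chopped strawberries into the batter.{EOT}"
    f"{USER}{PRECISE_HINT}{question}{ASSISTANT}Strawberries{EOT}"
)


def pair_answers(pred):
    """Hypothetical helper: return the first question that has both answer styles."""
    answers = {}  # question -> {"precise_A": ..., "informative_A": ...}
    for turn in pred.split(USER):
        if ASSISTANT not in turn:
            continue  # skip malformed turns
        q, a = turn.split(ASSISTANT, 1)
        if a.endswith(EOT):
            a = a[: -len(EOT)]
        a = a.strip()
        for hint, key in ((PRECISE_HINT, "precise_A"), (INFORMATIVE_HINT, "informative_A")):
            if q.startswith(hint):
                bare_q = q[len(hint):].strip()
                entry = answers.setdefault(bare_q, {})
                entry[key] = a
                if len(entry) == 2:  # both answer styles collected
                    return {"Q": bare_q, **entry}
    return None


triplet = pair_answers(mock_pred)
print(triplet)
```

A question yields a triplet only when the prediction contains both a precise and an informative answer to it; otherwise the helper returns `None`.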

### 2. Advanced Usage: Convert Image-Caption Pairs into Visual Instructions at Scale
The following steps show how to convert your own data into visual instructions for post-training MLLMs. 

We leverage vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 12.5 hours to convert 100K image-caption pairs.
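
As a rough planning aid, the reported figure can be turned into a per-pair cost and scaled to other dataset sizes. The sketch below assumes throughput scales linearly with the number of GPUs and that your hardware and caption lengths are comparable to ours; `estimate_hours` is a hypothetical helper, not part of our code.

```python
# Reported figure: about 12.5 hours for 100K pairs on one A100-80GB GPU.
REPORTED_HOURS = 12.5
REPORTED_PAIRS = 100_000


def estimate_hours(num_pairs, num_gpus=1):
    """Rough estimate; assumes linear scaling with GPU count (an assumption)."""
    seconds_per_pair = REPORTED_HOURS * 3600 / REPORTED_PAIRS
    return num_pairs * seconds_per_pair / num_gpus / 3600


print(f"{REPORTED_HOURS * 3600 / REPORTED_PAIRS:.2f} s per pair")  # 0.45 s per pair
print(f"{estimate_hours(1_000_000, num_gpus=8):.1f} h")  # 15.6 h
```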

<details>
<summary> Click to expand </summary>

### 1) Setup
Install vLLM using `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).  
```bash
pip install vllm
```

Clone our code repository and navigate to the inference directory:
```bash
git clone https://github.com/bigai-ai/QA-Synthesizer.git
cd QA-Synthesizer/vllm_inference
SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B  # Language model for consistency checks  
```  

### 2) Prepare Your Image-Caption Pairs
Format your `image_caption_pairs` file to match the following structure (similar to ShareGPT), or use our [data_samples/image_caption_pairs.json](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/data_samples/image_caption_pairs.json) for a quick try.

```json
[
  {
    "images": ["image_xxx.jpg"],
    "messages": [
      {
        "content": "<image>instruction",
        "role": "user"
      },
      {
        "content": "response",
        "role": "assistant"
      }
    ]
  },
  ...
]
```
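
If your captions live in another format, a small script can produce this structure. The following is a minimal sketch: the `build_entry` helper, the file names, the captions, and the default instruction text (borrowed from `caption_hint` in the basic-usage example) are all assumptions, so adjust them to your data.

```python
import json
from pathlib import Path


def build_entry(image_name, caption, instruction="Describe the image."):
    """Wrap one image-caption pair in the structure shown above.

    The default instruction text is an assumption for illustration.
    """
    return {
        "images": [image_name],
        "messages": [
            {"content": f"<image>{instruction}", "role": "user"},
            {"content": caption, "role": "assistant"},
        ],
    }


# Hypothetical input: (image file name, caption) tuples.
pairs = [
    ("image_001.jpg", "A plate of strawberry waffles."),
    ("image_002.jpg", "A bowl of tomato soup with basil."),
]

entries = [build_entry(name, caption) for name, caption in pairs]
out_path = Path("image_caption_pairs.json")
out_path.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"Wrote {len(entries)} entries to {out_path}")
```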

### 3) Run Synthesis

The following command generates task triplets using the synthesizer and applies consistency-based filtering to enhance data quality:

```bash  
IMAGE_CAPTION='../data_samples/image_caption_pairs.json'  # Path to image-caption pairs  
IMAGE_FOLDER='../data_samples/images'  # Path to the image folder  
OUTPUT_DIR='../data_samples/'  # Output directory for synthesized data  

# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}  
```  

The synthesized output will be saved at:  
```bash
${OUTPUT_DIR}/image_caption_and_synthetic_task.json
```  

This output can be used directly for single-stage post-training with code repositories such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
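
For LLaMA-Factory, the dataset usually has to be registered in its `dataset_info.json`. The sketch below prints one possible entry. It assumes the synthesized file keeps the `messages`/`images` layout of the input file, and the dataset name is hypothetical; the keys follow the ShareGPT-style multimodal format as we understand it, so verify them against the LLaMA-Factory version you install.

```python
import json

# Assumption: the synthesized file keeps the `messages`/`images` layout of the input.
dataset_info_entry = {
    "domain_visual_instructions": {  # hypothetical dataset name
        "file_name": "image_caption_and_synthetic_task.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}

# Paste the printed entry into LLaMA-Factory's dataset_info.json.
print(json.dumps(dataset_info_entry, indent=2))
```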

</details>


## Citation
If you find our work helpful, please cite us.

AdaMLLM
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[Instruction Pre-Training](https://huggingface.co/papers/2406.14491) (EMNLP 2024)
```bibtex
@article{cheng2024instruction,
  title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
  author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
  journal={arXiv preprint arXiv:2406.14491},
  year={2024}
}
```

[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}
```