---
license: apache-2.0
datasets:
- FreedomIntelligence/ALLaVA-4V
- Vision-Flan/vision-flan_191-task_1k
language:
- en
base_model:
- Lin-Chen/open-llava-next-llama3-8b
---
# Adapting Multimodal Large Language Models to Domains via Post-Training
This repo contains the **visual-instruction synthesizer** from our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).
The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)
We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
**(2) Training Pipeline**: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.
<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/bRu85CWwP9129bSCRzos2.png" width="1000">
</p>
## Resources
**🤗 We share our data and models with example usage. Feel free to open an issue or start a discussion! 🤗**
| Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
|:----------------------------------------------------------------------------|:--------------------------------------------|:--------------|:-------------------------|:------------------------------------------------------------------------------------------------|-----------------------|
| [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
| [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
| [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
| [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)
### 1. Basic Usage: Synthesize a task triplet based on a given image-caption pair
The example below synthesizes an "instruction-informative response-precise response" triplet from the following image-caption pair:
<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg" width="200">
</p>
<details>
<summary> Click to expand </summary>
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Define your input image-caption pair here:
## Image
url = "https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

## Caption
caption = "Dish: Strawberry Waffles\n\nSteps to prepare:\na). Preheat and grease a waffle iron according to manufacturer's instructions.\nb). Sift flour, baking powder, and salt together in a bowl. Whisk buttermilk, yogurt, butter, eggs, and sugar together in a separate bowl; stir into flour mixture until batter is smooth. Fold strawberries into batter.\nc). Pour about 1/3 cup batter into preheated waffle iron; cook until lightly browned, 5 to 7 minutes. Repeat with remaining batter.\n\nIngredients you'll need:\n(a). 2 1/2 cups all-purpose flour\n(b). 4 teaspoons baking powder\n(c). 3/4 teaspoon salt\n(d). 2 cups buttermilk\n(e). 1/2 cup vanilla Greek-style yogurt\n(f). 1/2 cup butter, melted\n(g). 2 eggs, beaten\n(h). 1 1/2 tablespoons white sugar\n(i). 3/4 cup chopped strawberries, or more to taste"

# =========================== No need to modify the following ===============================
# Path to synthesizer
model_path = "AdaptLLM/visual-instruction-synthesizer"

# Prompt hints
caption_hint = "Describe the image."
precise_hint = "Answer with a precise response.\n"
informative_hint = "Answer with an informative response.\n"

# Function to parse predictions into a flat list of human/gpt turns
def parse_pred(pred):
    # A complete generation must end with the end-of-text token
    if not pred.endswith("<|end_of_text|>"):
        return []
    pred = pred[:-len("<|end_of_text|>")]

    QA_str_list = pred.split("<|start_header_id|>user<|end_header_id|>\n\n")
    # Drop the last chunk if it was cut off before its closing <|eot_id|>
    if not pred.endswith("<|eot_id|>"):
        QA_str_list = QA_str_list[:-1]

    QA_list = []
    for QA_str in QA_str_list:
        try:
            assert "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" in QA_str
            Q_str, A_str = QA_str.split("<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
            Q_str, A_str = Q_str.strip(), A_str[:-len("<|eot_id|>")].strip()
            assert Q_str and A_str
            QA_list.append({"Q": Q_str, "A": A_str})
        except AssertionError:
            pass  # Skip invalid entries

    conversations = []
    for qa_entry in QA_list:
        conversations.append({"from": "human", "value": qa_entry["Q"]})
        conversations.append({"from": "gpt", "value": qa_entry["A"]})
    return conversations

# Function to extract the first question that has both a precise and an informative answer
def get_task_triplet(pred):
    pred_QAs = parse_pred(pred)
    precise_QAs = {}
    informative_QAs = {}
    collected_QA = None
    for idx in range(0, len(pred_QAs), 2):  # Iterate over question-answer pairs
        question = pred_QAs[idx]["value"]
        answer = pred_QAs[idx + 1]["value"]
        if question.startswith(precise_hint):
            precise_q = question[len(precise_hint):]
            if precise_q in informative_QAs:
                collected_QA = {
                    "Q": precise_q,
                    "precise_A": answer,
                    "informative_A": informative_QAs[precise_q],
                }
                break
            else:
                precise_QAs[precise_q] = answer
        elif question.startswith(informative_hint):
            informative_q = question[len(informative_hint):]
            if informative_q in precise_QAs:
                collected_QA = {
                    "Q": informative_q,
                    "precise_A": precise_QAs[informative_q],
                    "informative_A": answer,
                }
                break
            else:
                informative_QAs[informative_q] = answer
    return collected_QA

# Load the processor
processor = LlavaNextProcessor.from_pretrained(model_path)

# Define image token
image_token = "<|reserved_special_token_4|>"

# Format the prompt: the caption is given as the assistant's first turn,
# and the trailing user header lets the model write the next question
prompt = (
    f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
    f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{image_token}\n{caption_hint}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    f"{caption}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
)

# Load the model
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Prepare inputs and generate output
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
answer_start = int(inputs["input_ids"].shape[-1])
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, keeping special tokens for parsing
pred = processor.decode(output[0][answer_start:], skip_special_tokens=False)
print(f"## Synthesizer predictions:\n{pred}")

# Extract task triplets
task_triplet = get_task_triplet(pred)
print(f"## Synthesized Task triplet:\n{task_triplet}")
```
</details>
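To see how the parsing step pairs answers without loading the model, the sketch below runs a simplified re-implementation of the parsing logic on a hand-written prediction string. `SAMPLE_PRED` is an illustrative assumption, not real synthesizer output, and `split_turns` and `first_triplet` are hypothetical helper names. Unlike the full example, this version skips any chunk that does not end with `<|eot_id|>`.

```python
# Special-token strings and hints copied from the Basic Usage example.
USER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"
END = "<|end_of_text|>"
PRECISE = "Answer with a precise response.\n"
INFORMATIVE = "Answer with an informative response.\n"

# Hand-written stand-in for a model prediction (an assumption for illustration).
SAMPLE_PRED = (
    INFORMATIVE + "What fruit is folded into the batter?" + ASSISTANT
    + "Chopped strawberries are folded into the batter before cooking." + EOT
    + USER
    + PRECISE + "What fruit is folded into the batter?" + ASSISTANT
    + "Strawberries" + EOT + END
)


def split_turns(pred):
    """Return (question, answer) pairs found in a raw prediction string."""
    if not pred.endswith(END):
        return []  # generation stopped before the end-of-text token
    body = pred[: -len(END)]
    pairs = []
    for chunk in body.split(USER):
        # Skip chunks with no assistant turn or with a truncated answer
        if ASSISTANT not in chunk or not chunk.endswith(EOT):
            continue
        question, answer = chunk.split(ASSISTANT, 1)
        question, answer = question.strip(), answer[: -len(EOT)].strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs


def first_triplet(pairs):
    """Return the first question that has both a precise and an informative answer."""
    precise, informative = {}, {}
    for question, answer in pairs:
        for hint, store in ((PRECISE, precise), (INFORMATIVE, informative)):
            if question.startswith(hint):
                key = question[len(hint):]
                store[key] = answer
                if key in precise and key in informative:
                    return {
                        "Q": key,
                        "precise_A": precise[key],
                        "informative_A": informative[key],
                    }
    return None


if __name__ == "__main__":
    print(first_triplet(split_turns(SAMPLE_PRED)))
```

A triplet is returned only when the same question text appears under both hints, so predictions that contain only one response style yield `None`.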
### 2. Advanced Usage: Convert Image-Caption Pairs into Visual Instructions at Scale
The following steps show how to convert your own data into visual instructions for post-training MLLMs.
We leverage vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 12.5 hours to convert 100K image-caption pairs.
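For rough planning, the figure above can be scaled to other dataset sizes. The sketch below assumes synthesis time grows linearly with the number of pairs and shrinks linearly with the number of GPUs under data parallelism. That scaling is an assumption, not a measured result, and `estimate_hours` is a hypothetical helper.

```python
# Reference figure from the text: about 12.5 hours for 100K pairs on one A100-80GB.
REF_PAIRS = 100_000
REF_HOURS = 12.5


def estimate_hours(num_pairs: int, num_gpus: int = 1) -> float:
    """Estimate synthesis time in hours, ASSUMING linear scaling in pairs and GPUs."""
    if num_pairs < 0 or num_gpus < 1:
        raise ValueError("num_pairs must be >= 0 and num_gpus must be >= 1")
    hours_per_pair = REF_HOURS / REF_PAIRS
    return num_pairs * hours_per_pair / num_gpus


if __name__ == "__main__":
    pairs_per_second = REF_PAIRS / (REF_HOURS * 3600)
    print(f"Single-GPU throughput: {pairs_per_second:.2f} pairs/s")
    print(f"100K pairs on 8 GPUs: {estimate_hours(100_000, num_gpus=8):.2f} h")
```

Actual throughput depends on caption length, image resolution, and hardware, so treat the result as an order-of-magnitude estimate.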
<details>
<summary> Click to expand </summary>
### 1) Setup
Install vLLM using `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).
```bash
pip install vllm
```
Clone our code repository and navigate to the inference directory:
```bash
git clone https://github.com/bigai-ai/QA-Synthesizer.git
cd QA-Synthesizer/vllm_inference
SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B # Language model for consistency checks
```
### 2) Prepare Your Image-Caption Pairs
Format your `image_caption_pairs` file to match the following structure (similar to ShareGPT), or use our [data_samples/image_caption_pairs.json](https://github.com/bigai-ai/QA-Synthesizer/blob/main/data_samples/image_caption_pairs.json) for a quick trial.
```json
[
{
"images": ["image_xxx.jpg"],
"messages": [
{
"content": "<image>instruction",
"role": "user"
},
{
"content": "response",
"role": "assistant"
}
]
},
...
]
```
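If your captions live in another format, a small script can write them into this structure. The sketch below assumes the instruction text is the caption hint from Basic Usage ("Describe the image.") and that image names are resolved against the image folder passed to the synthesis script. Both are assumptions, so compare against the sample file before a large run. `to_record` and `write_pairs` are hypothetical helpers.

```python
import json

# Assumed instruction text; it mirrors the caption hint in the Basic Usage example.
CAPTION_HINT = "Describe the image."


def to_record(image_file: str, caption: str) -> dict:
    """Build one record in the ShareGPT-like structure shown above."""
    return {
        "images": [image_file],
        "messages": [
            {"content": f"<image>{CAPTION_HINT}", "role": "user"},
            {"content": caption, "role": "assistant"},
        ],
    }


def write_pairs(pairs, out_path: str) -> list:
    """Write an iterable of (image_file, caption) tuples to a JSON file."""
    records = [to_record(image_file, caption) for image_file, caption in pairs]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


if __name__ == "__main__":
    # Illustrative file names and captions only.
    demo_pairs = [
        ("image_001.jpg", "Dish: Strawberry Waffles"),
        ("image_002.jpg", "Dish: Tomato Soup"),
    ]
    write_pairs(demo_pairs, "image_caption_pairs.json")
```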
### 3) Run Synthesis
The following command generates task triplets using the synthesizer and applies consistency-based filtering to enhance data quality:
```bash
IMAGE_CAPTION='../data_samples/image_caption_pairs.json' # Path to image-caption pairs
IMAGE_FOLDER='../data_samples/images' # Path to the image folder
OUTPUT_DIR='../data_samples/' # Output directory for synthesized data
# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```
The synthesized output will be saved at:
```bash
${OUTPUT_DIR}/image_caption_and_synthetic_task.json
```
This output can be used directly for single-stage post-training with code repositories such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
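Before handing a file to a training framework, it can help to sanity-check its structure. The sketch below assumes the synthesized output keeps the same `images`/`messages` layout as the input format in step 2, which this README does not state explicitly. It also assumes the number of `<image>` placeholders should match the number of listed images. `check_record` is a hypothetical helper; adjust the checks after inspecting your own output.

```python
def check_record(record: dict) -> list:
    """Return a list of problems found in one record (an empty list means OK)."""
    problems = []
    images = record.get("images")
    messages = record.get("messages")

    if not isinstance(images, list) or not all(isinstance(i, str) for i in images):
        problems.append("'images' must be a list of file names")
        images = []
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
        return problems

    placeholders = 0
    for idx, msg in enumerate(messages):
        # Assumed convention: turns alternate user, assistant, user, ...
        expected = "user" if idx % 2 == 0 else "assistant"
        if not isinstance(msg, dict) or msg.get("role") != expected:
            problems.append(f"message {idx}: expected role '{expected}'")
            continue
        content = msg.get("content")
        if not isinstance(content, str):
            problems.append(f"message {idx}: 'content' must be a string")
            continue
        placeholders += content.count("<image>")

    if placeholders != len(images):
        problems.append(
            f"found {placeholders} <image> placeholder(s) for {len(images)} image(s)"
        )
    return problems


if __name__ == "__main__":
    # Illustrative record only; real data would be loaded from the output JSON file.
    sample = {
        "images": ["image_001.jpg"],
        "messages": [
            {"content": "<image>Describe the image.", "role": "user"},
            {"content": "Dish: Strawberry Waffles", "role": "assistant"},
        ],
    }
    print(check_record(sample))
```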
</details>
## Citation
If you find our work helpful, please cite us.
AdaMLLM
```bibtex
@article{adamllm,
title={On Domain-Specific Post-Training for Multimodal Large Language Models},
author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
journal={arXiv preprint arXiv:2411.19930},
year={2024}
}
```
[Instruction Pre-Training](https://huggingface.co/papers/2406.14491) (EMNLP 2024)
```bibtex
@article{cheng2024instruction,
title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
journal={arXiv preprint arXiv:2406.14491},
year={2024}
}
```
[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}
```