Update README.md
README.md
**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)

### 1. Basic Usage: Synthesize a task triplet based on a given image-caption pair

To synthesize an "instruction-informative response-precise response" triplet from the following image-caption pair:
<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg" width="200">
</p>

<details>
<summary> Click to expand </summary>

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

# ... (model loading and generation steps omitted here; see the full README for the complete snippet) ...

task_triplet = get_task_triplet(pred)
print(f"## Synthesized Task triplet:\n{task_triplet}")
```
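
Because only the first and last lines of the snippet appear in this excerpt, here is a self-contained sketch of generic LLaVA-NeXT inference with the same `transformers` classes. It is illustrative only, not the repo's exact code: the prompt format, generation settings, and example caption are assumptions, and the repo's `get_task_triplet()` parser is still needed to split the prediction into the triplet.

```python
# Illustrative sketch only -- not the repo's snippet. The prompt format,
# generation settings, and example caption are assumptions; use the official
# README / QA-Synthesizer code for the exact prompt and get_task_triplet().
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "AdaptLLM/visual-instruction-synthesizer"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The image-caption pair: the image above plus a caption of your own.
url = "https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
caption = "Put your caption text here."

# Build a single-turn prompt that pairs the image with its caption.
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": caption}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
pred = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"## Synthesizer predictions:\n{pred}")
```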
</details>

### 2. Advanced Usage: Convert Image-Caption Pairs into Visual Instructions at Scale

The following steps show how to convert your own data into visual instructions for post-training MLLMs.

We leverage vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 12.5 hours to convert 100K image-caption pairs.
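
For planning purposes, that figure works out to roughly 8,000 pairs per GPU-hour. The sketch below (ours, not part of the repo) extrapolates it to other dataset sizes and GPU counts, assuming throughput scales linearly; treat the result as a ballpark, not a benchmark.

```python
# Back-of-the-envelope runtime estimate based on the figure quoted above
# (100K pairs in ~12.5 h on one A100-80GB). Linear scaling with dataset size
# and GPU count is an assumption, not a measured benchmark.
HOURS_PER_100K_ON_ONE_GPU = 12.5

def estimate_hours(num_pairs: int, num_gpus: int = 1) -> float:
    return HOURS_PER_100K_ON_ONE_GPU * (num_pairs / 100_000) / num_gpus

print(f"{estimate_hours(100_000, 1):.1f} h")   # 12.5 h, the quoted setting
print(f"{estimate_hours(250_000, 8):.1f} h")   # ~3.9 h on an 8-GPU node
```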
<details>
<summary> Click to expand </summary>

### 1) Setup

Install vLLM using `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

Clone our code repository and navigate to the inference directory:

```bash
git clone https://github.com/bigai-ai/QA-Synthesizer.git
cd QA-Synthesizer/vllm_inference
SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B  # Language model for consistency checks
```

### 2) Prepare Your Image-Caption Pairs

Format your `image_caption_pairs` file to match the following structure (similar to ShareGPT), or use our [data_samples/image_caption_pairs.json](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/data_samples/image_caption_pairs.json) for a quick try:
```json
[
  {
    "images": ["image_xxx.jpg"],
    "messages": [
      {
        "content": "<image>instruction",
        "role": "user"
      },
      {
        "content": "response",
        "role": "assistant"
      }
    ]
  },
  ...
]
```
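
If your raw data is just a list of image paths and captions, a small script along the following lines can emit that structure. This is a sketch, not part of the repo; in particular, the generic "Describe the image." instruction is our placeholder, so mirror whatever instruction wording your captioning data actually uses.

```python
# Sketch: build image_caption_pairs.json in the ShareGPT-like format shown above.
# The "<image>" tag marks where the image goes; the caption becomes the response.
import json

pairs = [  # (image file, caption) -- replace with your own data source
    ("image_0001.jpg", "A close-up of a printed circuit board."),
    ("image_0002.jpg", "A chest X-ray of an adult patient."),
]

records = []
for image_file, caption in pairs:
    records.append({
        "images": [image_file],
        "messages": [
            # Placeholder instruction (our assumption); adjust to your data.
            {"content": "<image>Describe the image.", "role": "user"},
            {"content": caption, "role": "assistant"},
        ],
    })

with open("image_caption_pairs.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```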

### 3) Run Synthesis

The following command generates task triplets using the synthesizer and applies consistency-based filtering to enhance data quality:
```bash
IMAGE_CAPTION='../data_samples/image_caption_pairs.json'  # Path to image-caption pairs
IMAGE_FOLDER='../data_samples/images'                     # Path to the image folder
OUTPUT_DIR='../data_samples/'                             # Output directory for synthesized data

# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```

The synthesized output will be saved at:

```bash
${OUTPUT_DIR}/image_caption_and_synthetic_task.json
```
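
Before moving on to post-training, it is worth eyeballing a few synthesized records. A minimal check, assuming the output is a JSON list like the input format above and using the sample paths from this example:

```python
# Minimal sanity check (illustrative): count records and print the first one.
import json

with open("../data_samples/image_caption_and_synthetic_task.json") as f:
    data = json.load(f)

print(f"{len(data)} records synthesized")
print(json.dumps(data[0], indent=2, ensure_ascii=False)[:2000])  # truncate long records
```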

This output can be used directly for single-stage post-training with frameworks such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
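
As one possible route, the sketch below registers the file as a ShareGPT-style multimodal dataset in LLaMA-Factory's `data/dataset_info.json`. The field names follow LLaMA-Factory's documented ShareGPT format, but verify them against the version you install; the dataset name `synthetic_visual_tasks` and the paths are placeholders.

```python
# Sketch: add an entry for the synthesized file to LLaMA-Factory's dataset registry.
# Dataset name and paths are placeholders; check the fields against your
# LLaMA-Factory version before training.
import json
from pathlib import Path

info_path = Path("LLaMA-Factory/data/dataset_info.json")
info = json.loads(info_path.read_text())

info["synthetic_visual_tasks"] = {
    "file_name": "image_caption_and_synthetic_task.json",  # copied into LLaMA-Factory/data/
    "formatting": "sharegpt",
    "columns": {"messages": "messages", "images": "images"},
    "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant",
    },
}

info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False))
```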

</details>

## Citation