Commit b20e40d by AdaptLLM
1 parent: 7880460

Update README.md

Files changed (1): README.md (+71 -1)

README.md CHANGED
@@ -38,13 +38,16 @@ We investigate domain adaptation of MLLMs through post-training, focusing on dat

**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)

- ## How to use
+ ### 1. Basic Usage: Synthesize a task triplet based on a given image-caption pair
To synthesize an "instruction-informative response-precise response" triplet based on the following image-caption pair:

<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/mgI_Ayj12_Q_kviWvfAVb.jpeg" width="200">
</p>

+ <details>
+ <summary> Click to expand </summary>
+
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
 
@@ -163,6 +166,73 @@ print(f"## Synthesizer predictions:\n{pred}")
task_triplet = get_task_triplet(pred)
print(f"## Synthesized Task triplet:\n{task_triplet}")
```
+ </details>
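The hunk above shows only the first and last lines of the basic-usage script, so the sketch below is purely illustrative rather than the committed code: a minimal LLaVA-NeXT generation pass over one image-caption pair with the synthesizer checkpoint, assuming the processor ships a chat template; the file name, prompt wording, and generation settings are all assumptions, and the authoritative script is the one in the repository README linked above.

```python
# Illustrative sketch (not the repo's script): run the synthesizer checkpoint with the
# standard Hugging Face LLaVA-NeXT API on a single image-caption pair.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "AdaptLLM/visual-instruction-synthesizer"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("image_xxx.jpg")  # hypothetical local image file
caption = "A short caption describing the image."

# One user turn containing the image placeholder plus the caption text.
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": caption}]}
]
# Assumes the processor config includes a chat template; otherwise build the prompt
# manually following the model card.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```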
+
+ ### 2. Advanced Usage: Convert Image-Caption Pairs into Visual Instructions at Scale
+ The following steps show how to convert your own data into visual instructions for post-training MLLMs.
+
+ We leverage vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 12.5 hours to convert 100K image-caption pairs.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ### 1) Setup
+ Install vLLM using `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).
+ ```bash
+ pip install vllm
+ ```
+
+ Clone our code repository and navigate to the inference directory:
+ ```bash
+ git clone https://github.com/bigai-ai/QA-Synthesizer.git
+ cd QA-Synthesizer/vllm_inference
+ SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer
+ CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B # Language model for consistency checks
+ ```
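Before kicking off a multi-hour run, a quick environment check can save time; the snippet below is a hypothetical pre-flight check, not one of the repository's scripts.

```python
# Hypothetical pre-flight check (not part of the repo): confirm vLLM is installed and
# the GPUs you intend to list in CUDA_VISIBLE_DEVICES are actually visible.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPU count:", torch.cuda.device_count())
```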
+
+ ### 2) Prepare Your Image-Caption Pairs
+ Format your `image_caption_pairs` file to match the following structure (similar to ShareGPT), or you can use our [data_samples/image_caption_pairs.json](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/data_samples/image_caption_pairs.json) for a quick try.
+
+ ```json
+ [
+   {
+     "images": ["image_xxx.jpg"],
+     "messages": [
+       {
+         "content": "<image>instruction",
+         "role": "user"
+       },
+       {
+         "content": "response",
+         "role": "assistant"
+       }
+     ]
+   },
+   ...
+ ]
+ ```
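If your captions live in something simpler, such as a filename-to-caption mapping, a small conversion script along the following lines can emit the structure shown above; the `captions` dict, the output filename, and the generic "Describe the image." instruction are assumptions to adapt to your data.

```python
# Conversion sketch (assumption: captions are available as a filename -> caption dict).
# Writes the ShareGPT-like structure expected by the synthesis script, using a generic
# describe-the-image instruction as the user turn and the caption as the response.
import json

captions = {
    "image_001.jpg": "A caption describing the first image.",
    "image_002.jpg": "A caption describing the second image.",
}

records = []
for image_file, caption in captions.items():
    records.append(
        {
            "images": [image_file],
            "messages": [
                {"content": "<image>Describe the image.", "role": "user"},
                {"content": caption, "role": "assistant"},
            ],
        }
    )

with open("image_caption_pairs.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```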
+
+ ### 3) Run Synthesis
+
+ The following command generates task triplets using the synthesizer and applies consistency-based filtering to enhance data quality:
+
+ ```bash
+ IMAGE_CAPTION='../data_samples/image_caption_pairs.json' # Path to image-caption pairs
+ IMAGE_FOLDER='../data_samples/images' # Path to the image folder
+ OUTPUT_DIR='../data_samples/' # Output directory for synthesized data
+
+ # Run synthesis with data parallelism; adjust CUDA devices as needed:
+ CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
+ ```
+
+ The synthesized output will be saved at:
+ ```bash
+ ${OUTPUT_DIR}/image_caption_and_synthetic_task.json
+ ```
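A quick way to eyeball the result before training is to load the file and print a record or two; the sketch below assumes the output is a JSON list that mirrors the ShareGPT-like input layout, which should be verified against the actual file.

```python
# Inspection sketch (assumes a JSON list of ShareGPT-like records; adjust if the
# actual output layout differs).
import json

with open("../data_samples/image_caption_and_synthetic_task.json", encoding="utf-8") as f:
    data = json.load(f)

print("Number of records:", len(data))
first = data[0]
print("Keys in first record:", sorted(first.keys()))
for turn in first.get("messages", [])[:4]:
    print(f"{turn.get('role')}: {str(turn.get('content'))[:120]}")
```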
+
+ This output can be used directly for single-stage post-training with frameworks such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
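As an illustration of that hand-off, registering the file in LLaMA-Factory usually amounts to adding an entry to its `data/dataset_info.json`; the dataset name below is arbitrary and the field names follow my reading of LLaMA-Factory's ShareGPT/multimodal schema, so check that repository's data documentation before relying on them.

```python
# Hypothetical LLaMA-Factory registration entry (verify the schema against
# LLaMA-Factory's data/README); merge the printed object into data/dataset_info.json.
import json

dataset_entry = {
    "adaptmllm_synthetic": {  # arbitrary dataset name
        "file_name": "image_caption_and_synthetic_task.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
    }
}
print(json.dumps(dataset_entry, indent=2))
```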
+
+ </details>


## Citation