---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- food
- recipe
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the **food MLLM developed from Qwen2-VL-2B-Instruct** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
**(2) Training Pipeline**: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/bRu85CWwP9129bSCRzos2.png" width="1000">
</p>

## How to use
1. Set up
```bash
pip install qwen-vl-utils
```
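
If they are not already installed in your environment, the inference example below also needs `torch`, `accelerate` (used by `device_map="auto"`), and a `transformers` release that includes Qwen2-VL support (4.45 or newer).
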
2. Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AdaptLLM/food-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "AdaptLLM/food-Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("AdaptLLM/food-Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("AdaptLLM/food-Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Since our model architecture aligns with the base model, you can refer to the official repository of [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) for more advanced usage instructions.
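
For a concrete food-domain query, the same pipeline applies end to end. The sketch below is illustrative: the local image path and the recipe-style prompt are placeholders to replace with your own.

```python
# Illustrative food-domain query; the image path and prompt are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AdaptLLM/food-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("AdaptLLM/food-Qwen2-VL-2B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            # Local files can be passed via a file:// URI (or an absolute path).
            {"type": "image", "image": "file:///path/to/your/food_photo.jpg"},
            {"type": "text", "text": "What dish is this, and how would I prepare it?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```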

## Citation
If you find our work helpful, please cite us.

AdaMLLM
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[AdaptLLM](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```