---
license: apache-2.0
datasets:
- AdaptLLM/medicine-visual-instructions
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- biology
- medical
- chemistry
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the **biomedicine MLLM developed from Qwen2-VL-2B-Instruct** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930). The corresponding training dataset is available at [medicine-visual-instructions](https://huggingface.co/datasets/AdaptLLM/medicine-visual-instructions).

The main project page is [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains).

We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
**(2) Training Pipeline**: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.

<p align='center'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/-Jp7pAsCR2Tj4WwfwsbCo.png" width="600">
</p>

## How to use
1. Set up
```bash
pip install qwen-vl-utils
```
2. Inference
```python
import torch  # needed if you enable the bfloat16 / flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AdaptLLM/medicine-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "AdaptLLM/medicine-Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("AdaptLLM/medicine-Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("AdaptLLM/medicine-Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
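
The example above also assumes a PyTorch environment with a `transformers` release that includes Qwen2-VL support and `accelerate` for `device_map="auto"`; these dependencies are not listed in the setup step, and the version pin below is a suggestion rather than an official requirement:
```bash
pip install torch "transformers>=4.45.0" accelerate qwen-vl-utils
```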

Since our model architecture aligns with the base model, you can refer to the official repository of [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) for more advanced usage instructions.
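
As a quick illustration of that interface, the sketch below swaps in local files and multiple images per turn; it reuses the `model` and `processor` loaded above, and the file paths and prompt are placeholders rather than assets shipped with this repo:
```python
from qwen_vl_utils import process_vision_info

# Multiple local images in one user turn; local paths use the file:// scheme.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scan_1.png"},
            {"type": "image", "image": "file:///path/to/scan_2.png"},
            {"type": "text", "text": "Compare these two images and describe any differences."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```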

## Citation
If you find our work helpful, please cite us.

AdaMLLM
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[AdaptLLM](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```