Update README.md
README.md CHANGED
@@ -39,9 +39,15 @@ The model is developed to process diverse inputs, including images and text, fac

Cephalo provides a robust framework for multimodal interaction and understanding, including the development of complex generative pipelines to create 2D and 3D renderings of material microstructures as input for additive manufacturing methods.

-This version of Cephalo, lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta, is a Mixture-of-Expert model based on the Phi-3-Vision-128K-Instruct model.

-

```python
import torch
@@ -67,7 +73,12 @@ count_parameters(moe_model)

## Make a Phi-3-V-MoE model from several pre-trained models

-Download .py files
```python
from huggingface_hub import HfApi, hf_hub_download
from tqdm.notebook import tqdm
@@ -162,6 +173,8 @@ In the following example we show how it is done. The training set consists of im

Sample training set and process to train (for simplicity we use only three images, one characteristic of each expert):
```python

image_1 = Image.open(requests.get("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg", stream=True).raw)
image_2 = Image.open(requests.get("https://media.wired.com/photos/5aa32b912ba43111d1213e0c/master/w_2240,c_limit/akhacouple.jpg", stream=True).raw)
@@ -169,13 +182,13 @@ image_3 = Image.open(requests.get("https://upload.wikimedia.org/wikipedia/common

prompts_per_expert = [
    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 1<|end|>\n<|assistant|>\n", "image": [image_1]},
-    {"text": "<|user|>\n<|image_1|>\nPrompt

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 2<|end|>\n<|assistant|>\n", "image": [image_2]},
-    {"text": "<|user|>\n<|image_1|>\nPrompt

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 3<|end|>\n<|assistant|>\n", "image": [image_3]},
-    {"text": "<|user|>\n<|image_1|>\nPrompt
]

# Train gating layers using the provided prompts
@@ -185,8 +198,20 @@ gating_layer_params = moe_model.preselect_gating_layer_params(processor, prompts
moe_model.set_gating_layer_params(gating_layer_params)
```

-


### Chat Format

@@ -233,7 +258,7 @@ prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_g
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
-    "max_new_tokens":
    "temperature": 0.1,
    "do_sample": True,
    "stop_strings": ['<|end|>',

Cephalo provides a robust framework for multimodal interaction and understanding, including the development of complex generative pipelines to create 2D and 3D renderings of material microstructures as input for additive manufacturing methods.

+This version of Cephalo, lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta, is a Mixture-of-Expert model based on the Phi-3-Vision-128K-Instruct model. The model architecture is as follows:

+
+
+### Download MoE Model
+
+```markdown
+pip install transformers -U
+```
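The Python snippet that follows is cut off by the hunk boundary after its first import. As orientation only, here is a minimal sketch of how a remote-code vision model like this is commonly loaded with transformers; the repo ID comes from the text above, while the dtype, device, and the use of AutoModelForCausalLM/AutoProcessor are assumptions rather than part of this commit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Illustrative sketch, not part of the commit; adjust dtype and device to your setup.
model_id = "lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta"

# trust_remote_code=True is needed because the MoE wrapper ships as custom .py files in the repo.
moe_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```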

```python
import torch

## Make a Phi-3-V-MoE model from several pre-trained models

+Download .py files that implement the Phi-3-V and the Mixture-of-Expert Vision model
+
+```markdown
+pip install huggingface_hub
+```
+
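The fenced block below is likewise truncated after its imports. As a hedged illustration of the hf_hub_download call it presumably builds on, the filename used here is a placeholder and not taken from the repo:

```python
from huggingface_hub import hf_hub_download

# Hypothetical filename, for illustration only; the real .py files are listed in the full README.
local_path = hf_hub_download(
    repo_id="lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta",
    filename="moe_phi3_v.py",
    local_dir="./",
)
print(local_path)
```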

```python
from huggingface_hub import HfApi, hf_hub_download
from tqdm.notebook import tqdm

Sample training set and process to train (for simplicity we use only three images, one characteristic of each expert):
```python
+from PIL import Image
+import requests

image_1 = Image.open(requests.get("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg", stream=True).raw)
image_2 = Image.open(requests.get("https://media.wired.com/photos/5aa32b912ba43111d1213e0c/master/w_2240,c_limit/akhacouple.jpg", stream=True).raw)

prompts_per_expert = [
    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 1<|end|>\n<|assistant|>\n", "image": [image_1]},
+    {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 1<|end|>\n<|assistant|>\n", "image": [image_1]}],

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 2<|end|>\n<|assistant|>\n", "image": [image_2]},
+    {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 2<|end|>\n<|assistant|>\n", "image": [image_2]}],

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 3<|end|>\n<|assistant|>\n", "image": [image_3]},
+    {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 3<|end|>\n<|assistant|>\n", "image": [image_3]}],
]

# Train gating layers using the provided prompts

moe_model.set_gating_layer_params(gating_layer_params)
```

+### Preparing gating network for training
+
+To freeze all parameters in the model except for the gating neural networks, you can use:
+
+```python
+freeze_except_gating_layers(moe_model)
+count_parameters(moe_model)
+```
+You can unfreeze:
+```python
+un_freeze_all(moe_model)
+```
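Building on the freeze/unfreeze helpers added above, a short sketch of how one might confirm that only the gating networks remain trainable and set up an optimizer over them; the helper names come from the README, while the parameter scan and optimizer choice are assumptions:

```python
import torch

# Assumes moe_model and the helper functions from the snippet above are already defined.
freeze_except_gating_layers(moe_model)

# Only the gating networks should still require gradients after freezing.
trainable = sum(p.numel() for p in moe_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in moe_model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")

# Optimize only the parameters that remain trainable (the gating layers).
optimizer = torch.optim.AdamW(
    (p for p in moe_model.parameters() if p.requires_grad), lr=1e-4
)
```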

+## Inference

### Chat Format

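The body of the Chat Format section lies outside this hunk. For orientation, a sketch of the Phi-3-style message list and chat-template call that the later context lines (apply_chat_template, then processor(prompt, [image], ...)) appear to assume; the question text and the image are placeholders:

```python
# Placeholder user turn; <|image_1|> refers to the image passed to the processor below.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat does this microstructure suggest for materials design?"}
]

# Render the conversation into a single prompt string, appending the assistant generation prompt.
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```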
|
|
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
+    "max_new_tokens": 256,
    "temperature": 0.1,
    "do_sample": True,
    "stop_strings": ['<|end|>',
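The hunk ends inside generation_args. As a hedged sketch, not part of the commit, arguments of this shape are typically passed to generate and the newly produced tokens decoded along these lines:

```python
# Assumes moe_model, processor, inputs, and the completed generation_args from above.
generate_ids = moe_model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    tokenizer=processor.tokenizer,  # needed so stop_strings can be honored
    **generation_args,
)

# Drop the prompt tokens and decode only the generated continuation.
new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
```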