Is there any tutorial on how to use the multimodal features?
#8 opened by lucasjin
miniG looks awesome. The benchmark leaderboard is not the only metric that reveals a model's performance, but if a model is extremely good, it should handle these benchmarks.
For model inference, refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b.
It should be like this:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Tokenizer with the GLM-4V chat template (handles the image placeholder tokens)
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = 'Describe the image.'
image = Image.open("your image").convert('RGB')

# Build chat-formatted inputs; the "image" key attaches the PIL image to the user turn
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)  # chat mode
inputs = inputs.to(device)

model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "temperature": 0.3, "top_p": 0.8}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens so only the newly generated reply is decoded
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
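For text-only prompts, the same chat template should work with the "image" key omitted, mirroring the THUDM/glm-4-9b-chat usage mentioned above. This is a minimal sketch under that assumption (it reuses the tokenizer, model, and gen_kwargs already loaded; the example query is arbitrary), not something taken from the model card:

# Text-only sketch, assuming miniG keeps the glm-4-9b-chat interface
text_inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
).to(device)

with torch.no_grad():
    out = model.generate(**text_inputs, **gen_kwargs)
    # Drop the prompt tokens before decoding, same as in the multimodal example
    out = out[:, text_inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(out[0]))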
JosephusCheung changed discussion status to closed
Hi, what's the vision encoder used here, and what's the input resolution?