Sample code for inference in Google Colab? RuntimeError: "slow_conv2d_cuda" not implemented for 'Byte'
#12
by
sanjeev-bhandari01
- opened
Hi, I want to test inference with this model in Google Colab (free tier). I have tried several methods, but none of them worked. Here is one of the scripts I tried, which produced the error shown below it:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_kwargs = dict(
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", **model_kwargs).eval()

query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
But it raises an error at the line pred = model.generate(**inputs):
RuntimeError: "slow_conv2d_cuda" not implemented for 'Byte'
I think quantization_config is causing the issue. You probably just need to pass load_in_4bit=True directly to AutoModelForCausalLM.from_pretrained.
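For reference, here is a minimal sketch of what that suggestion might look like. It is untested against Qwen-VL and only illustrates the idea: the explicit BitsAndBytesConfig is dropped and load_in_4bit=True is passed straight to from_pretrained. The helper names build_load_kwargs and load_model are made up for this example, and it assumes a transformers release that still accepts load_in_4bit as a direct keyword (newer releases may warn about it or require quantization_config instead).

```python
def build_load_kwargs():
    """Keyword arguments for from_pretrained, following the suggestion above.

    Hypothetical helper: the point is that no explicit BitsAndBytesConfig
    is built; 4-bit loading is requested with a single flag.
    """
    return dict(
        device_map="auto",
        trust_remote_code=True,
        load_in_4bit=True,  # assumption: accepted directly by this transformers version
    )


def load_model(model_id="Qwen/Qwen-VL"):
    """Load the model in 4-bit mode (needs a GPU, bitsandbytes and accelerate)."""
    # Imported here so the kwargs helper can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs()).eval()


# Usage in Colab (downloads the checkpoint, so it is left commented out):
# model = load_model()
```

The tokenizer and generate calls from the original script would stay the same. Whether this actually avoids the "slow_conv2d_cuda" error has not been confirmed in this thread.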