Error when using vLLM

by sheliak

```
vllm serve unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit --port 8000 --max-model-len 8192 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --cpu-offload-gb 26 --dtype auto --max-num-seqs 1 --gpu-memory-utilization 0.9
```

Leads to:

```
AssertionError: Attempted to load weight (torch.Size([1024, 1])) into parameter (torch.Size([1, 2048]))
```

I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.
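
For the bf16 route, a minimal sketch using vLLM's in-flight bitsandbytes quantization; it assumes the upstream full-precision repo is Qwen/Qwen3-Next-80B-A3B-Instruct (check the model card) and that a vLLM build with bitsandbytes installed supports this architecture. The flags mirror the command above:

```python
# Minimal sketch: download the full bf16 checkpoint and let vLLM quantize it
# to bnb 4-bit on load, instead of using the pre-quantized -bnb-4bit repo.
# Assumes: pip install vllm bitsandbytes, and that the upstream repo name
# Qwen/Qwen3-Next-80B-A3B-Instruct is the bf16 source.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # full bf16 weights
    quantization="bitsandbytes",               # quantize in-flight at load time
    max_model_len=8192,
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
    cpu_offload_gb=26,                         # same offload budget as the command above
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```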

> I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.

How many tokens per second did you get with the 4-bit model here if you loaded it on CPU only? The model seems great for 64GB RAM.

Unsloth AI org

Currently, new bnb models usually only work in transformers or Unsloth! They usually work in vLLM, but only for non-new models.
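
For example, loading the pre-quantized checkpoint through transformers works because the bnb quantization config ships inside the repo. A minimal sketch, assuming a transformers version recent enough to know the Qwen3-Next architecture, plus bitsandbytes and accelerate installed:

```python
# Minimal sketch: load the pre-quantized bnb-4bit checkpoint with transformers.
# The quantization settings are read from the checkpoint's config, so no
# explicit BitsAndBytesConfig is needed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # shard across GPU(s), offload to CPU if needed
    torch_dtype=torch.bfloat16,  # compute dtype for the non-quantized modules
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```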

Unsloth AI org

> I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.

Which ones did you try? They should usually work for models with standard architectures, e.g. Llama or normal Qwen3.
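
As an illustration, a standard-architecture bnb checkpoint is expected to load with the same vLLM API. The repo name below is just an example of a mainline Qwen3 quant, not something tested in this thread:

```python
# Sketch: pre-quantized bnb-4bit checkpoint of a standard architecture,
# which vLLM's bitsandbytes loader is expected to handle.
# unsloth/Qwen3-8B-bnb-4bit is an example repo name; substitute your own.
from vllm import LLM

llm = LLM(model="unsloth/Qwen3-8B-bnb-4bit", quantization="bitsandbytes")
print(llm.generate(["Hello!"])[0].outputs[0].text)
```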

Thanks for your reply. I remember it was gemma3n-4b-it.
