Error when using vLLM

by sheliak

```
vllm serve unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit --port 8000 --max-model-len 8192 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --cpu-offload-gb 26 --dtype auto --max-num-seqs 1 --gpu-memory-utilization 0.9
```

Leads to:

```
AssertionError: Attempted to load weight (torch.Size([1024, 1])) into parameter (torch.Size([1, 2048]))
```

I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.
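
For the bf16 route, a minimal sketch using vLLM's in-flight bitsandbytes quantization; it assumes the upstream full-precision repo is Qwen/Qwen3-Next-80B-A3B-Instruct (check the model card) and that a vLLM build with bitsandbytes installed supports this architecture. The flags mirror the command above:

```python
# Minimal sketch: download the full bf16 checkpoint and let vLLM quantize it
# to bnb 4-bit on load, instead of using the pre-quantized -bnb-4bit repo.
# Assumes: pip install vllm bitsandbytes, and that the upstream repo name
# Qwen/Qwen3-Next-80B-A3B-Instruct is the bf16 source.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # full bf16 weights
    quantization="bitsandbytes",               # quantize in-flight at load time
    max_model_len=8192,
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
    cpu_offload_gb=26,                         # same offload budget as the command above
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```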

> I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.

How many tokens per second did you get with the 4-bit model here if you loaded it on CPU only? The model seems great for 64GB RAM.

Unsloth AI org

Currently, new bnb models usually only work in transformers or Unsloth! They usually work in vLLM, but only for non-new models.
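
For example, loading the pre-quantized checkpoint through transformers works because the bnb quantization config ships inside the repo. A minimal sketch, assuming a transformers version recent enough to know the Qwen3-Next architecture, plus bitsandbytes and accelerate installed:

```python
# Minimal sketch: load the pre-quantized bnb-4bit checkpoint with transformers.
# The quantization settings are read from the checkpoint's config, so no
# explicit BitsAndBytesConfig is needed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # shard across GPU(s), offload to CPU if needed
    torch_dtype=torch.bfloat16,  # compute dtype for the non-quantized modules
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```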

Unsloth AI org

> I tried vLLM with other unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You could, however, download the full bf16 model and load it in bnb 4-bit.

Which ones did you try? They should usually work for models with standard architectures, e.g. Llama or normal Qwen3.
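
As an illustration, a standard-architecture bnb checkpoint is expected to load with the same vLLM API. The repo name below is just an example of a mainline Qwen3 quant, not something tested in this thread:

```python
# Sketch: pre-quantized bnb-4bit checkpoint of a standard architecture,
# which vLLM's bitsandbytes loader is expected to handle.
# unsloth/Qwen3-8B-bnb-4bit is an example repo name; substitute your own.
from vllm import LLM

llm = LLM(model="unsloth/Qwen3-8B-bnb-4bit", quantization="bitsandbytes")
print(llm.generate(["Hello!"])[0].outputs[0].text)
```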

Thanks for your reply. I remember it was gemma3n-4b-it.
