Error when using vLLM
vllm serve unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit --port 8000 --max-model-len 8192 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --cpu-offload-gb 26 --dtype auto --max-num-seqs 1 --gpu-memory-utilization 0.9
Leads to:
AssertionError: Attempted to load weight (torch.Size([1024, 1])) into parameter (torch.Size([1, 2048]))
I tried vLLM with other Unsloth bnb models, and my conclusion is that you currently cannot run them with vLLM. You can, however, download the full bf16 model and then load it in bnb 4-bit.
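For reference, here is a rough sketch of that workaround using vLLM's in-flight bitsandbytes quantization. The full-precision repo name and the exact option support for this architecture are assumptions; check your vLLM version.

from vllm import LLM

# Sketch: point vLLM at the full bf16 checkpoint and quantize it on the fly
# with bitsandbytes, instead of loading the pre-quantized -bnb-4bit repo.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed full-precision repo
    quantization="bitsandbytes",
    max_model_len=8192,
)
print(llm.generate(["Hello"])[0].outputs[0].text)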
How many tokens per second did you get with 4-bit here if you loaded it on CPU only? The model seems great for 64 GB of RAM.
Currently, new bnb models usually only work in transformers or Unsloth! They should generally work in vLLM, but only for models that aren't brand new.
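For comparison, a minimal transformers sketch for a pre-quantized Unsloth bnb-4bit checkpoint (assuming bitsandbytes and a recent enough transformers are installed; the quantization config is read from the checkpoint itself):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: load a pre-quantized bnb-4bit checkpoint directly; the embedded
# quantization_config handles the 4-bit setup, no BitsAndBytesConfig needed.
model_id = "unsloth/Qwen3-Next-80B-A3B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")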
Which ones did you try? They should usually work for models with standard architectures, e.g. Llama, normal Qwen3, etc.
Thanks for your reply. I remember it was gemma3n-4b-it.