Error when running in VLLM


I get KeyError: 'layers.31.mlp.shared_expert.down_proj.weight' when I run this quant on the latest vLLM (v0.10.2rc3.dev13+gfdb09c77d) with an H100 GPU, installed via

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Run command:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit  --gpu-memory-utilization 0.9 --host 0.0.0.0 --port 8080 --max-model-len 60000

I have the same error on both this and the Thinking version.

The weight is in model-00007-of-00010.safetensors. Could you check the SHA256?

lrwxrwxrwx 1 owner owner 76 Sep 12 10:05 model-00007-of-00010.safetensors -> ../../blobs/f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37

sha256sum f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37
f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37 f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37
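For what it's worth, the missing key can also be checked directly inside the shard. Below is a minimal Python sketch, assuming the safetensors package is installed and the shard path points at your local HF cache (the path here is illustrative):

from safetensors import safe_open

# Illustrative path; point this at the shard in your local HF cache
shard = "model-00007-of-00010.safetensors"

with safe_open(shard, framework="pt") as f:
    keys = list(f.keys())

# Checkpoint keys may carry a "model." prefix, so match on a substring
matches = [k for k in keys if "layers.31.mlp.shared_expert.down_proj" in k]
print(matches or "tensor not found in this shard")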

Same issue here.

I get a similar error:

KeyError: 'layers.20.mlp.shared_expert.down_proj.weight'

...using the main branch of vLLM as recommended (0.10.2rc3.dev23+gb0d1213ac).

Hi everyone, I am really sorry for this.

In addition to the loading error, some important components were over-quantized, so the model outputs gibberish. It is now being re-quantized, and that should complete in the next 16-18 hours.

In the meantime, the model can be loaded by replacing /vllm/vllm/model_executor/models/qwen3_next.py in your vLLM installation with the updated qwen3_next.py.
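If you try that workaround, here is a minimal sketch for finding the installed copy of the file to replace (assuming vLLM is installed in the active environment; the import path mirrors the file path above):

import vllm.model_executor.models.qwen3_next as qwen3_next

# Prints the location of the module inside the installed vLLM package;
# this is the file to swap out for the updated qwen3_next.py
print(qwen3_next.__file__)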

Thanks for the update. I will check back tomorrow!

Hey, I have reuploaded the weights and it works!

It turns out that ignoring shared_expert during the quantization process prevents the model from loading properly in vLLM afterwards.
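As a side note, which modules a quantized checkpoint skipped can be read from its config. A minimal sketch, assuming the repository's config.json has been downloaded locally (the exact field name depends on the quantization tooling):

import json

# Illustrative local path to the quantized model's config.json
with open("config.json") as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config", {})
# Excluded modules are typically listed under "ignore" or
# "modules_to_not_convert", depending on the tooling used
print(qcfg.get("ignore") or qcfg.get("modules_to_not_convert"))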

For some reason, NCCL_SHM_DISABLE=1 is required to avoid NCCL errors in my local environment; I don't know whether others will need it. Please consider setting NCCL_SHM_DISABLE=1 if any NCCL problem occurs.

Please redownload the weights, and let me know what you think!

I'm able to get a response from the model via the API endpoints on my build. vLLM isn't optimal for my setup because I have one 4090 and three 3090s; however, I'm able to get responses! The speed isn't the best right now, but this model seems to be working in its current form.

I have no clue why, but sometimes upon startup, or after startup with the first prompt, or when loading large contexts, my CUDA device 2 crashes hard, to the point where the computer stops detecting it. It's always this device, number 2. The model is spread evenly across all 4 cards, so I'm chalking it up to an issue with the mixed GPU architectures. Just thought it should be noted; if this happens to others, perhaps something else is up.

Startup Command:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit   --tensor-parallel-size 4   --max-model-len 8192   --dtype float16 --enforce-eager

Speeds:

INFO 09-13 10:56:46 [loggers.py:123] Engine 000: Avg prompt throughput: 80.5 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

I'll admit, I'm an ik_llama.cpp type of guy, but I've been dying to test this model.
Great work @cpatonn! I look forward to future optimizations.

Great!
I ran the following command, and it works perfectly:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 2 \
  --max-model-len 95000 \
  --gpu-memory-utilization 0.88 \
  --host 0.0.0.0 \
  --port 11435 \
  --dtype float16

Works for me now, although it takes quite some time to see the first output. It feels like it does reasoning. Isn't this variant a "non-thinker"? ;-) I will have to check the vLLM issues that might pop up.

(EngineCore_DP0 pid=414811) /opt/pluski/svc/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (31) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=414811)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=414811) /opt/pluski/svc/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (31) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=414811)   return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=414728) INFO 09-13 15:48:53 [loggers.py:123] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 106.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

I do not get any NCCL errors, only these warnings.

Running v0.10.2rc3.dev50+g15b8fef45 on an NVIDIA H100 NVL.

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --dtype float16

Thanks for all your efforts!

So it is working now, although on the first request after startup it takes about 30 seconds to start outputting tokens.
If I start with:
NCCL_SHM_DISABLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,3,1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --port 8000 --tensor-parallel-size 2 --pipeline-parallel-size 2 --max-model-len 235000
I see this in the logs:

(Worker_PP0_TP0 pid=84334) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP0_TP0 pid=84334)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

And if I start with:
NCCL_SHM_DISABLE=1 VLLM_PP_LAYER_PARTITION="15,9,9,15" CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,3,1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --port 8000 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 235000
I see this in the logs:

(Worker_PP0 pid=81631) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (21) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_PP0 pid=81631)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_PP0 pid=81631) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP0 pid=81631)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

I am running this inside WSL2, but I almost exclusively use AWQ models from cpatonn and have never seen issues like this before.

Can this run on 2 × 24 GB cards? Mine are A5000s and I can't figure out the parameters to run it. It seems like --cpu-offload-gb is non-functional at the moment: when I enable offload I get "AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled. See https://github.com/vllm-project/vllm/pull/18298 for more details.", and if I don't enable offloading it always OOMs at some point, despite setting --max-model-len 1024 --gpu-memory-utilization 0.98 -tp 2 --max-num-seqs 4 (I tried many other combinations as well).

About the 30 seconds to get the first answer: it's happening to me too with the bf16 weights, though I'm using pipeline parallelism.

I'm unable to run this on vLLM 0.10.2 (on an RTX Pro 6000 96GB):

I see KeyError: 'layers.24.mlp.shared_expert.down_proj.weight'

Does it only run on the 0.10.2rc3 release candidate?

My apologies, the fix was merged 3 days ago and is not in the nightly build yet. Please build vLLM from source to use the latest model update.
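If it helps, a quick way to confirm which vLLM build is actually active in an environment (a minimal sketch):

import vllm

# A source build reports a dev version string such as 0.10.2rc3.devNN+g<commit>,
# so this shows whether the freshly built package is the one being imported
print(vllm.__version__)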

Got it, thanks. I can confirm that it works.
