# Qwen3-Next-80B-A3B-Thinking FP8 Dynamic Quantization with LLM Compressor
## Introduction

This repository provides an FP8 dynamic quantized version of Qwen/Qwen3-Next-80B-A3B-Thinking, produced with LLM Compressor. FP8 dynamic quantization is data-free (no calibration set is needed) and roughly halves weight memory relative to BF16.
## Environment Requirements

- Python 3.10+
- NVIDIA GPU with native FP8 support (Ada Lovelace or Hopper architecture, e.g., H100; Ampere GPUs such as the A100 do not support FP8)
- CUDA 12.x
- PyTorch 2.6
- Dependencies installation:
```bash
uv pip install llmcompressor torch
uv pip install git+https://github.com/huggingface/transformers.git@main
```
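Before quantizing, it can be worth verifying that the GPU actually supports FP8. A minimal sanity check with PyTorch (the compute-capability threshold is an assumption based on NVIDIA's FP8 hardware support starting with Ada Lovelace/Hopper):

```python
import torch

# Native FP8 (E4M3/E5M2) kernels require compute capability >= 8.9
# (Ada Lovelace / Hopper); the A100 (8.0) does not qualify.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 9), "This GPU does not support native FP8"
```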
## Usage Steps
Save the following script as `quantize.py`:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Thinking"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_name, dtype="auto", device_map="auto"
)

# Configure simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",            # Ignore standard gate layers
        "re:.*shared_expert_gate$",  # Ignore shared expert gate layers
        "re:.*router$",              # Ignore router layers
    ],
)

# Apply quantization algorithm
oneshot(model=model, recipe=recipe)

# Save model
SAVE_DIR = model_name.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

Run the script:

```bash
python quantize.py
```

The quantized model will be saved in the `Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic` directory.
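To confirm the export succeeded, you can inspect the quantization metadata that LLM Compressor writes into the checkpoint's `config.json` (a rough sketch; the exact keys depend on the installed llmcompressor/compressed-tensors versions):

```python
import json

with open("Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic/config.json") as f:
    config = json.load(f)

# The applied recipe is recorded under "quantization_config".
print(json.dumps(config.get("quantization_config", {}), indent=2))
```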
Serve the quantized model with vLLM:

```bash
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic \
    --port 8080 \
    --tensor-parallel-size 2 \
    --api-key 123 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 2 \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser deepseek_r1
    # --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```
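Once the server is running, it exposes an OpenAI-compatible API. A minimal client call, reusing the port and API key from the command above (the prompt is just a placeholder), might look like this:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="123")

response = client.chat.completions.create(
    model="Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic",
    messages=[{"role": "user", "content": "Briefly explain FP8 quantization."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```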
## Notes

- There are compatibility issues between the quantized model and MTP (multi-token prediction), which is why the `--speculative-config` option is commented out in the serve command above.
## Model Tree

Base model for llllwxxx/Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic: [Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)