Qwen3-Next-80B-A3B-Thinking FP8 Dynamic Quantization with LLMCompressor

Introduction

This model is an FP8 dynamic quantization of Qwen/Qwen3-Next-80B-A3B-Thinking produced with LLMCompressor: weights are stored in FP8 and activation scales are computed dynamically per token at inference time, roughly halving weight memory relative to BF16. The quantized checkpoint can be served with vLLM.

Environment Requirements

  • Python 3.10+
  • NVIDIA GPU with native FP8 support (compute capability 8.9+, e.g., Hopper H100; note that the Ampere A100 lacks native FP8). A quick check is sketched after this list.
  • CUDA 12.x
  • PyTorch 2.6
  • Dependencies installation:
    uv pip install llmcompressor torch
    uv pip install git+https://github.com/huggingface/transformers.git@main
    
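To confirm the GPU requirement above, here is a minimal sanity check, a sketch using PyTorch's device-capability query (native FP8 kernels generally require compute capability 8.9 or higher):

    import torch

    # FP8 (E4M3/E5M2) tensor-core kernels require compute capability
    # >= 8.9 (Ada Lovelace) or 9.0 (Hopper)
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    print("Native FP8 support:", (major, minor) >= (8, 9))
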

Usage Steps

  1. Save the following script as quantize.py:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model_name = "Qwen/Qwen3-Next-80B-A3B-Thinking"

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        dtype="auto",
        device_map="auto",
    )

    # Configure simple PTQ: quantize all Linear layers to FP8 with
    # dynamic per-token activation scales, keeping the output head and
    # the numerically sensitive MoE gate/router layers in full precision
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",            # Ignore standard gate layers
            "re:.*shared_expert_gate$",  # Ignore shared expert gate layers
            "re:.*router$",              # Ignore router layers
        ],
    )

    # Apply the quantization algorithm in one shot; FP8_DYNAMIC needs
    # no calibration data
    oneshot(model=model, recipe=recipe)

    # Save the compressed model and tokenizer
    SAVE_DIR = model_name.split("/")[1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
    
  2. Run the script:

    python quantize.py
    
  3. The quantized model will be saved in the Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic directory.
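Before serving, it can be worth confirming that the saved checkpoint actually carries FP8 quantization metadata. A minimal sketch, assuming LLMCompressor's compressed-tensors backend wrote the scheme into config.json (its default behavior):

    import json
    import os

    SAVE_DIR = "Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic"

    # The quantization scheme is serialized under "quantization_config";
    # an FP8_DYNAMIC checkpoint should report float8 weights with
    # dynamic per-token activation quantization.
    with open(os.path.join(SAVE_DIR, "config.json")) as f:
        config = json.load(f)

    print(json.dumps(config.get("quantization_config"), indent=2))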

Serving with vLLM

Serve the quantized checkpoint with vLLM's OpenAI-compatible server (MTP speculative decoding is left commented out; see Notes below):

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic \
        --port 8080 \
        --tensor-parallel-size 2 \
        --api-key 123 \
        --gpu-memory-utilization 0.95 \
        --max-num-seqs 2 \
        --max-model-len 131072 \
        --enable-auto-tool-choice \
        --tool-call-parser hermes \
        --reasoning-parser deepseek_r1
        # --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
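Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the openai Python package is installed; the port, API key, and served model name come from the command above:

    from openai import OpenAI

    # vllm serve defaults the served model name to the path it was given
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="123")

    response = client.chat.completions.create(
        model="Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic",
        messages=[
            {"role": "user", "content": "Summarize FP8 dynamic quantization in two sentences."}
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)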


Notes

  1. There are known compatibility issues between this FP8 quantized checkpoint and MTP (multi-token prediction) speculative decoding, which is why the --speculative-config option is commented out in the serve command above.
