Qwen3-Next-80B-A3B-Thinking FP8 Dynamic Quantization with LLMCompressor

Introduction

This model is an FP8 dynamic quantization of Qwen/Qwen3-Next-80B-A3B-Thinking produced with LLMCompressor: weights are stored in FP8 and activation scales are computed dynamically per token at inference time, roughly halving weight memory relative to BF16. The quantized checkpoint can be served with vLLM.

Environment Requirements

  • Python 3.10+
  • NVIDIA GPU with native FP8 support (compute capability 8.9+, e.g., Hopper H100; note that the Ampere A100 lacks native FP8). A quick check is sketched after this list.
  • CUDA 12.x
  • PyTorch 2.6
  • Dependencies installation:
    uv pip install llmcompressor torch
    uv pip install git+https://github.com/huggingface/transformers.git@main
    
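To confirm the GPU requirement above, here is a minimal sanity check, a sketch using PyTorch's device-capability query (native FP8 kernels generally require compute capability 8.9 or higher):

    import torch

    # FP8 (E4M3/E5M2) tensor-core kernels require compute capability
    # >= 8.9 (Ada Lovelace) or 9.0 (Hopper)
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    print("Native FP8 support:", (major, minor) >= (8, 9))
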

Usage Steps

  1. Save the following script as quantize.py:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model_name = "Qwen/Qwen3-Next-80B-A3B-Thinking"

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        dtype="auto",
        device_map="auto",
    )

    # Configure simple PTQ: quantize all Linear layers to FP8 with
    # dynamic per-token activation scales, keeping the output head and
    # the numerically sensitive MoE gate/router layers in full precision
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",            # Ignore standard gate layers
            "re:.*shared_expert_gate$",  # Ignore shared expert gate layers
            "re:.*router$",              # Ignore router layers
        ],
    )

    # Apply the quantization algorithm in one shot; FP8_DYNAMIC needs
    # no calibration data
    oneshot(model=model, recipe=recipe)

    # Save the compressed model and tokenizer
    SAVE_DIR = model_name.split("/")[1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
    
  2. Run the script:

    python quantize.py
    
  3. The quantized model will be saved in the Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic directory.
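Before serving, it can be worth confirming that the saved checkpoint actually carries FP8 quantization metadata. A minimal sketch, assuming LLMCompressor's compressed-tensors backend wrote the scheme into config.json (its default behavior):

    import json
    import os

    SAVE_DIR = "Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic"

    # The quantization scheme is serialized under "quantization_config";
    # an FP8_DYNAMIC checkpoint should report float8 weights with
    # dynamic per-token activation quantization.
    with open(os.path.join(SAVE_DIR, "config.json")) as f:
        config = json.load(f)

    print(json.dumps(config.get("quantization_config"), indent=2))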

Serving with vLLM

Serve the quantized checkpoint with vLLM's OpenAI-compatible server (MTP speculative decoding is left commented out; see Notes below):

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic \
        --port 8080 \
        --tensor-parallel-size 2 \
        --api-key 123 \
        --gpu-memory-utilization 0.95 \
        --max-num-seqs 2 \
        --max-model-len 131072 \
        --enable-auto-tool-choice \
        --tool-call-parser hermes \
        --reasoning-parser deepseek_r1
        # --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
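Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the openai Python package is installed; the port, API key, and served model name come from the command above:

    from openai import OpenAI

    # vllm serve defaults the served model name to the path it was given
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="123")

    response = client.chat.completions.create(
        model="Qwen3-Next-80B-A3B-Thinking-FP8-Dynamic",
        messages=[
            {"role": "user", "content": "Summarize FP8 dynamic quantization in two sentences."}
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)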


Notes

  1. There are known compatibility issues between this FP8 quantized checkpoint and MTP (multi-token prediction) speculative decoding, which is why the --speculative-config option is commented out in the serve command above.
