Model Overview

Qwen3.5-397B-A17B-eagle3 is a specialized EAGLE3 draft model designed to accelerate inference for the Qwen3.5-397B-A17B ecosystem.

Built for speculative decoding, this model predicts multiple future tokens which are then verified by the target model. By reducing expensive target-model decoding steps, Eagle3 can improve practical end-to-end throughput while preserving the output distribution of the base model.

Compared with MTP, this Eagle3 draft model achieves competitive or higher throughput on several text reasoning and coding benchmarks. Although the current training scale limits the average acceptance length, Eagle3 still delivers stronger throughput on multiple workloads due to its efficient draft-and-verify behavior.

Performance & Acceleration

The following results are measured with bs 1. Each result is averaged over three runs.

Throughput Comparison

image

Benchmark Eagle3 MTP Difference
MT-Bench 224.09 224.92 MTP +0.4%
GSM8K 248.71 241.88 Eagle3 +2.8%
Math500 257.60 250.10 Eagle3 +3.0%
HumanEval 252.36 246.74 Eagle3 +2.3%
MMStar 188.95 208.57 MTP +10.4%
CEval 35.19 35.61 MTP +1.2%

Eagle3 shows higher throughput on GSM8K, Math500, and HumanEval, indicating strong acceleration potential for math reasoning and code generation workloads.

Average Acceptance Length

image

Benchmark Eagle3 MTP Difference
MT-Bench 3.03 3.28 MTP +8.3%
GSM8K 3.40 3.54 MTP +4.1%
Math500 3.53 3.66 MTP +3.7%
HumanEval 3.47 3.62 MTP +4.3%
MMStar 2.67 3.21 MTP +20.2%
CEval 1.77 2.34 MTP +32.2%

MTP currently has higher average acceptance length across these benchmarks. This is mainly due to the limited training scale of the current Eagle3 draft model. Even so, Eagle3 achieves higher throughput on several important text benchmarks, showing that acceptance length is not the only factor determining practical decoding speed.

Recommended Speculative Decoding Configuration

Qwen3.5 MoE + EAGLE3 requires Spec V2 and the extra-buffer Mamba scheduler strategy in SGLang.

export SGLANG_ENABLE_SPEC_V2=1

--mamba-scheduler-strategy extra_buffer
--speculative-algorithm EAGLE3
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

Quick Start

Requirements

  • NVIDIA GPU
  • CUDA 12.0+
  • PyTorch 2.0+
  • SGLang with Qwen3.5 MoE + EAGLE3 support

Important: Qwen3.5 MoE + EAGLE3 currently requires the SGLang fix in sgl-project/sglang#25408. Please use a version of SGLang that includes this PR, or apply the patch manually before running this draft model.

Installation

Please install SGLang from a version that includes sgl-project/sglang#25408.

Until the fix is included in an official SGLang release, please build SGLang from source with the PR applied.

git clone https://github.com/sgl-project/sglang.git
cd sglang

# Apply or check out a branch that includes PR #25408 before installing.
pip install -e "python[all]"

Inference with SGLang

export SGLANG_ENABLE_SPEC_V2=1

python3 -m sglang.launch_server \
    --model-path /models/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 30012 \
    --trust-remote-code \
    --mem-fraction-static 0.9 \
    --tp-size 8 \
    --mamba-scheduler-strategy extra_buffer \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /models/Qwen3.5-397B-A17B-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Adjust --model-path, --speculative-draft-model-path, --tp-size, and memory-related parameters according to your deployment environment.

Notes

This release focuses on practical throughput acceleration for Qwen3.5-397B-A17B. The current Eagle3 draft model has not yet matched MTP in average acceptance length, but it already achieves better throughput on multiple reasoning and coding benchmarks. Further improvements are expected with larger-scale training and continued optimization.

Qwen3.5 MoE + EAGLE3 requires SGLANG_ENABLE_SPEC_V2=1 and --mamba-scheduler-strategy extra_buffer when running with SGLang. Please make sure your SGLang installation includes sgl-project/sglang#25408.

Citation

If you use this model in your research or application, please cite:

@misc{qwen35eagle3,
  title={Qwen3.5-397B-A17B-eagle3: Accelerating Qwen3.5 Inference with EAGLE3},
  author={Ant AQ Team},
  year={2026},
}
Downloads last month
71
Safetensors
Model size
0.6B params
Tensor type
I64
BF16
BOOL
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support