Model Overview
Qwen3.5-397B-A17B-eagle3 is a specialized EAGLE3 draft model designed to accelerate inference for the Qwen3.5-397B-A17B ecosystem.
Built for speculative decoding, this model predicts multiple future tokens which are then verified by the target model. By reducing expensive target-model decoding steps, Eagle3 can improve practical end-to-end throughput while preserving the output distribution of the base model.
Compared with MTP, this Eagle3 draft model achieves competitive or higher throughput on several text reasoning and coding benchmarks. Although the current training scale limits the average acceptance length, Eagle3 still delivers stronger throughput on multiple workloads due to its efficient draft-and-verify behavior.
Performance & Acceleration
The following results are measured with bs 1. Each result is averaged over three runs.
Throughput Comparison
| Benchmark | Eagle3 | MTP | Difference |
|---|---|---|---|
| MT-Bench | 224.09 | 224.92 | MTP +0.4% |
| GSM8K | 248.71 | 241.88 | Eagle3 +2.8% |
| Math500 | 257.60 | 250.10 | Eagle3 +3.0% |
| HumanEval | 252.36 | 246.74 | Eagle3 +2.3% |
| MMStar | 188.95 | 208.57 | MTP +10.4% |
| CEval | 35.19 | 35.61 | MTP +1.2% |
Eagle3 shows higher throughput on GSM8K, Math500, and HumanEval, indicating strong acceleration potential for math reasoning and code generation workloads.
Average Acceptance Length
| Benchmark | Eagle3 | MTP | Difference |
|---|---|---|---|
| MT-Bench | 3.03 | 3.28 | MTP +8.3% |
| GSM8K | 3.40 | 3.54 | MTP +4.1% |
| Math500 | 3.53 | 3.66 | MTP +3.7% |
| HumanEval | 3.47 | 3.62 | MTP +4.3% |
| MMStar | 2.67 | 3.21 | MTP +20.2% |
| CEval | 1.77 | 2.34 | MTP +32.2% |
MTP currently has higher average acceptance length across these benchmarks. This is mainly due to the limited training scale of the current Eagle3 draft model. Even so, Eagle3 achieves higher throughput on several important text benchmarks, showing that acceptance length is not the only factor determining practical decoding speed.
Recommended Speculative Decoding Configuration
Qwen3.5 MoE + EAGLE3 requires Spec V2 and the extra-buffer Mamba scheduler strategy in SGLang.
export SGLANG_ENABLE_SPEC_V2=1
--mamba-scheduler-strategy extra_buffer
--speculative-algorithm EAGLE3
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
Quick Start
Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
- SGLang with Qwen3.5 MoE + EAGLE3 support
Important: Qwen3.5 MoE + EAGLE3 currently requires the SGLang fix in sgl-project/sglang#25408. Please use a version of SGLang that includes this PR, or apply the patch manually before running this draft model.
Installation
Please install SGLang from a version that includes sgl-project/sglang#25408.
Until the fix is included in an official SGLang release, please build SGLang from source with the PR applied.
git clone https://github.com/sgl-project/sglang.git
cd sglang
# Apply or check out a branch that includes PR #25408 before installing.
pip install -e "python[all]"
Inference with SGLang
export SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server \
--model-path /models/Qwen3.5-397B-A17B \
--host 0.0.0.0 \
--port 30012 \
--trust-remote-code \
--mem-fraction-static 0.9 \
--tp-size 8 \
--mamba-scheduler-strategy extra_buffer \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path /models/Qwen3.5-397B-A17B-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
Adjust --model-path, --speculative-draft-model-path, --tp-size, and memory-related parameters according to your deployment environment.
Notes
This release focuses on practical throughput acceleration for Qwen3.5-397B-A17B. The current Eagle3 draft model has not yet matched MTP in average acceptance length, but it already achieves better throughput on multiple reasoning and coding benchmarks. Further improvements are expected with larger-scale training and continued optimization.
Qwen3.5 MoE + EAGLE3 requires SGLANG_ENABLE_SPEC_V2=1 and --mamba-scheduler-strategy extra_buffer when running with SGLang. Please make sure your SGLang installation includes sgl-project/sglang#25408.
Citation
If you use this model in your research or application, please cite:
@misc{qwen35eagle3,
title={Qwen3.5-397B-A17B-eagle3: Accelerating Qwen3.5 Inference with EAGLE3},
author={Ant AQ Team},
year={2026},
}
- Downloads last month
- 71

