jiahang01 commited on
Commit
6b50dc1
·
verified ·
1 Parent(s): 8dead17

Update README.md

Browse files

![image](https://cdn-uploads.huggingface.co/production/uploads/68b135e5cbe2378afae384d2/c6mofi3fnEF-t_7_CdVg8.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/68b135e5cbe2378afae384d2/Zp7UaAMJZk6dGJEgjB9Fg.png)

Files changed (1) hide show
  1. README.md +109 -0
README.md CHANGED
@@ -1,3 +1,112 @@
1
  ---
2
  license: mit
3
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: mit
3
  ---
4
+
5
+ # Qwen3.5-397B-A17B-eagle3
6
+
7
+ **Eagle3 Optimized Draft Model for Qwen3.5-397B-A17B**
8
+
9
+ Thanks to the **SpecForge** framework for its foundational contributions to speculative decoding and EAGLE-style draft model acceleration.
10
+
11
+ ## Model Overview
12
+
13
+ **Qwen3.5-397B-A17B-eagle3** is a specialized EAGLE3 draft model designed to accelerate inference for the **Qwen3.5-397B-A17B** ecosystem.
14
+
15
+ Built for speculative decoding, this model predicts multiple future tokens which are then verified by the target model. By reducing expensive target-model decoding steps, Eagle3 can improve practical end-to-end throughput while preserving the output distribution of the base model.
16
+
17
+ Compared with MTP, this Eagle3 draft model achieves competitive or higher throughput on several text reasoning and coding benchmarks. Although the current training scale limits the average acceptance length, Eagle3 still delivers stronger throughput on multiple workloads due to its efficient draft-and-verify behavior.
18
+
19
+ ## Performance & Acceleration
20
+
21
+ The following results are measured with **bs 1**. Each result is averaged over **three runs**.
22
+
23
+ ### Throughput Comparison
24
+
25
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/68b135e5cbe2378afae384d2/Zp7UaAMJZk6dGJEgjB9Fg.png)
26
+
27
+ | Benchmark | Eagle3 | MTP | Difference |
28
+ | :-- | --: | --: | :-- |
29
+ | **MT-Bench** | 224.09 | 224.92 | MTP +0.4% |
30
+ | **GSM8K** | 248.71 | 241.88 | **Eagle3 +2.8%** |
31
+ | **Math500** | 257.60 | 250.10 | **Eagle3 +3.0%** |
32
+ | **HumanEval** | 252.36 | 246.74 | **Eagle3 +2.3%** |
33
+ | **MMStar** | 188.95 | 208.57 | MTP +10.4% |
34
+ | **CEval** | 35.19 | 35.61 | MTP +1.2% |
35
+
36
+ Eagle3 shows higher throughput on **GSM8K**, **Math500**, and **HumanEval**, indicating strong acceleration potential for math reasoning and code generation workloads.
37
+
38
+ ### Average Acceptance Length
39
+
40
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/68b135e5cbe2378afae384d2/c6mofi3fnEF-t_7_CdVg8.png)
41
+
42
+ | Benchmark | Eagle3 | MTP | Difference |
43
+ | :-- | --: | --: | :-- |
44
+ | **MT-Bench** | 3.03 | 3.28 | MTP +8.3% |
45
+ | **GSM8K** | 3.40 | 3.54 | MTP +4.1% |
46
+ | **Math500** | 3.53 | 3.66 | MTP +3.7% |
47
+ | **HumanEval** | 3.47 | 3.62 | MTP +4.3% |
48
+ | **MMStar** | 2.67 | 3.21 | MTP +20.2% |
49
+ | **CEval** | 1.77 | 2.34 | MTP +32.2% |
50
+
51
+ MTP currently has higher average acceptance length across these benchmarks. This is mainly due to the limited training scale of the current Eagle3 draft model. Even so, Eagle3 achieves higher throughput on several important text benchmarks, showing that acceptance length is not the only factor determining practical decoding speed.
52
+
53
+ ## Recommended Speculative Decoding Configuration
54
+
55
+ ```bash
56
+ --speculative-algorithm EAGLE3
57
+ --speculative-num-steps 3
58
+ --speculative-eagle-topk 1
59
+ --speculative-num-draft-tokens 4
60
+ ```
61
+
62
+ ## Quick Start
63
+
64
+ ### Requirements
65
+
66
+ - NVIDIA GPU
67
+ - CUDA 12.0+
68
+ - PyTorch 2.0+
69
+ - SGLang with EAGLE3 support
70
+
71
+ ### Installation
72
+
73
+ ```bash
74
+ pip install sglang==0.5.10
75
+ ```
76
+
77
+ Please make sure your SGLang installation includes EAGLE3 support.
78
+
79
+ ### Inference with SGLang
80
+
81
+ ```bash
82
+ python3 -m sglang.launch_server \
83
+ --model-path /models/Qwen3.5-397B-A17B \
84
+ --host 0.0.0.0 \
85
+ --port 30012 \
86
+ --trust-remote-code \
87
+ --mem-fraction-static 0.9 \
88
+ --tp-size 8 \
89
+ --speculative-algorithm EAGLE3 \
90
+ --speculative-draft-model-path /models/Qwen3.5-397B-A17B-eagle3 \
91
+ --speculative-num-steps 3 \
92
+ --speculative-eagle-topk 1 \
93
+ --speculative-num-draft-tokens 4
94
+ ```
95
+
96
+ Adjust `--model-path`, `--speculative-draft-model-path`, `--tp-size`, and memory-related parameters according to your deployment environment.
97
+
98
+ ## Notes
99
+
100
+ This release focuses on practical throughput acceleration for Qwen3.5-397B-A17B. The current Eagle3 draft model has not yet matched MTP in average acceptance length, but it already achieves better throughput on multiple reasoning and coding benchmarks. Further improvements are expected with larger-scale training and continued optimization.
101
+
102
+ ## Citation
103
+
104
+ If you use this model in your research or application, please cite:
105
+
106
+ ```bibtex
107
+ @misc{qwen35eagle3,
108
+ title={Qwen3.5-397B-A17B-eagle3: Accelerating Qwen3.5 Inference with EAGLE3},
109
+ author={Ant AQ Team},
110
+ year={2026},
111
+ }
112
+ ```