Update README.md
Browse files

README.md
CHANGED
|
@@ -1,3 +1,112 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
---
|
| 4 |
+
|
| 5 |
+
# Qwen3.5-397B-A17B-eagle3
|
| 6 |
+
|
| 7 |
+
**Eagle3 Optimized Draft Model for Qwen3.5-397B-A17B**
|
| 8 |
+
|
| 9 |
+
Thanks to the **SpecForge** framework for its foundational contributions to speculative decoding and EAGLE-style draft model acceleration.
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
**Qwen3.5-397B-A17B-eagle3** is a specialized EAGLE3 draft model designed to accelerate inference for the **Qwen3.5-397B-A17B** ecosystem.
|
| 14 |
+
|
| 15 |
+
Built for speculative decoding, this model predicts multiple future tokens which are then verified by the target model. By reducing expensive target-model decoding steps, Eagle3 can improve practical end-to-end throughput while preserving the output distribution of the base model.
|
| 16 |
+
|
| 17 |
+
Compared with MTP, this Eagle3 draft model achieves competitive or higher throughput on several text reasoning and coding benchmarks. Although the current training scale limits the average acceptance length, Eagle3 still delivers stronger throughput on multiple workloads due to its efficient draft-and-verify behavior.
|
| 18 |
+
|
| 19 |
+
## Performance & Acceleration
|
| 20 |
+
|
| 21 |
+
The following results are measured with **bs 1**. Each result is averaged over **three runs**.
|
| 22 |
+
|
| 23 |
+
### Throughput Comparison
|
| 24 |
+
|
| 25 |
+

|
| 26 |
+
|
| 27 |
+
| Benchmark | Eagle3 | MTP | Difference |
|
| 28 |
+
| :-- | --: | --: | :-- |
|
| 29 |
+
| **MT-Bench** | 224.09 | 224.92 | MTP +0.4% |
|
| 30 |
+
| **GSM8K** | 248.71 | 241.88 | **Eagle3 +2.8%** |
|
| 31 |
+
| **Math500** | 257.60 | 250.10 | **Eagle3 +3.0%** |
|
| 32 |
+
| **HumanEval** | 252.36 | 246.74 | **Eagle3 +2.3%** |
|
| 33 |
+
| **MMStar** | 188.95 | 208.57 | MTP +10.4% |
|
| 34 |
+
| **CEval** | 35.19 | 35.61 | MTP +1.2% |
|
| 35 |
+
|
| 36 |
+
Eagle3 shows higher throughput on **GSM8K**, **Math500**, and **HumanEval**, indicating strong acceleration potential for math reasoning and code generation workloads.
|
| 37 |
+
|
| 38 |
+
### Average Acceptance Length
|
| 39 |
+
|
| 40 |
+

|
| 41 |
+
|
| 42 |
+
| Benchmark | Eagle3 | MTP | Difference |
|
| 43 |
+
| :-- | --: | --: | :-- |
|
| 44 |
+
| **MT-Bench** | 3.03 | 3.28 | MTP +8.3% |
|
| 45 |
+
| **GSM8K** | 3.40 | 3.54 | MTP +4.1% |
|
| 46 |
+
| **Math500** | 3.53 | 3.66 | MTP +3.7% |
|
| 47 |
+
| **HumanEval** | 3.47 | 3.62 | MTP +4.3% |
|
| 48 |
+
| **MMStar** | 2.67 | 3.21 | MTP +20.2% |
|
| 49 |
+
| **CEval** | 1.77 | 2.34 | MTP +32.2% |
|
| 50 |
+
|
| 51 |
+
MTP currently has higher average acceptance length across these benchmarks. This is mainly due to the limited training scale of the current Eagle3 draft model. Even so, Eagle3 achieves higher throughput on several important text benchmarks, showing that acceptance length is not the only factor determining practical decoding speed.
|
| 52 |
+
|
| 53 |
+
## Recommended Speculative Decoding Configuration
|
| 54 |
+
|
| 55 |
+
```bash
|
| 56 |
+
--speculative-algorithm EAGLE3
|
| 57 |
+
--speculative-num-steps 3
|
| 58 |
+
--speculative-eagle-topk 1
|
| 59 |
+
--speculative-num-draft-tokens 4
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
## Quick Start
|
| 63 |
+
|
| 64 |
+
### Requirements
|
| 65 |
+
|
| 66 |
+
- NVIDIA GPU
|
| 67 |
+
- CUDA 12.0+
|
| 68 |
+
- PyTorch 2.0+
|
| 69 |
+
- SGLang with EAGLE3 support
|
| 70 |
+
|
| 71 |
+
### Installation
|
| 72 |
+
|
| 73 |
+
```bash
|
| 74 |
+
pip install sglang==0.5.10
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
Please make sure your SGLang installation includes EAGLE3 support.
|
| 78 |
+
|
| 79 |
+
### Inference with SGLang
|
| 80 |
+
|
| 81 |
+
```bash
|
| 82 |
+
python3 -m sglang.launch_server \
|
| 83 |
+
--model-path /models/Qwen3.5-397B-A17B \
|
| 84 |
+
--host 0.0.0.0 \
|
| 85 |
+
--port 30012 \
|
| 86 |
+
--trust-remote-code \
|
| 87 |
+
--mem-fraction-static 0.9 \
|
| 88 |
+
--tp-size 8 \
|
| 89 |
+
--speculative-algorithm EAGLE3 \
|
| 90 |
+
--speculative-draft-model-path /models/Qwen3.5-397B-A17B-eagle3 \
|
| 91 |
+
--speculative-num-steps 3 \
|
| 92 |
+
--speculative-eagle-topk 1 \
|
| 93 |
+
--speculative-num-draft-tokens 4
|
| 94 |
+
```
|
| 95 |
+
|
| 96 |
+
Adjust `--model-path`, `--speculative-draft-model-path`, `--tp-size`, and memory-related parameters according to your deployment environment.
|
| 97 |
+
|
| 98 |
+
## Notes
|
| 99 |
+
|
| 100 |
+
This release focuses on practical throughput acceleration for Qwen3.5-397B-A17B. The current Eagle3 draft model has not yet matched MTP in average acceptance length, but it already achieves better throughput on multiple reasoning and coding benchmarks. Further improvements are expected with larger-scale training and continued optimization.
|
| 101 |
+
|
| 102 |
+
## Citation
|
| 103 |
+
|
| 104 |
+
If you use this model in your research or application, please cite:
|
| 105 |
+
|
| 106 |
+
```bibtex
|
| 107 |
+
@misc{qwen35eagle3,
|
| 108 |
+
title={Qwen3.5-397B-A17B-eagle3: Accelerating Qwen3.5 Inference with EAGLE3},
|
| 109 |
+
author={Ant AQ Team},
|
| 110 |
+
year={2026},
|
| 111 |
+
}
|
| 112 |
+
```
|