MeganEFlynn committed on
Commit c11c4cd · verified · 1 Parent(s): 0a5be04

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +209 -0
  2. config.json +70 -0
  3. metadata.json +15 -0
  4. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ license: other
+ license_name: nvidia-open-model-license
+ license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
+ base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
+ tags:
+ - speculative-decoding
+ - eagle3
+ - llama3
+ - llama4
+ - vllm
+ - speculators
+ ---
+
+ # Llama4-Maverick-Eagle3-Speculators
+
+ ## Model Description
+
+ **⚠️ Development Reference Model**: This model was converted as a reference for vLLM development. Once development is complete, it can be served with:
+ ```bash
+ vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
+ ```
+
+ This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.
+
+ ### Development Status
+ 🚧 **Reference Implementation for vLLM Development**
+ - This model serves as a reference implementation for vLLM Eagle3 support
+ - Contains non-standard features (auxiliary hidden states) that require vLLM extensions
+ - Once vLLM development is complete, it will support direct serving
+
+ ### Key Features
+ - **Architecture**: Eagle3 speculator with Llama3-based draft head
+ - **Target Verifier**: Llama4 Maverick 17B (quantized w4a16)
+ - **Vocabulary Size**: 202,048-token embedding vocabulary (unusually large for a draft model), with a 64,000-token draft output vocabulary per `config.json`
+ - **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]
+
+ ## Configuration Details
+
+ This model is a hybrid configuration:
+ - **Draft Model**: Llama3-based Eagle3 head (single transformer layer)
+ - **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
+ - **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier
+
+ ### Non-Standard Features
+
+ This model preserves several non-standard Eagle3 features from the NVIDIA checkpoint:
+ - Auxiliary hidden states taken from verifier layers [1, 23, 44] (see the sketch below)
+ - Custom layer normalization configurations
+ - Large embedding vocabulary matching the target model
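+
+ As a rough illustration, here is a minimal sketch of how states captured from several verifier layers can be fused into a single draft-head input. The `AuxHiddenStateCombiner` class is hypothetical; this is not the actual Speculators implementation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class AuxHiddenStateCombiner(nn.Module):
+     """Hypothetical fusion of per-layer verifier states for an Eagle3-style head."""
+
+     def __init__(self, hidden_size: int, num_aux_layers: int = 3):
+         super().__init__()
+         # Project the concatenated per-layer states back to the draft hidden size
+         self.fc = nn.Linear(num_aux_layers * hidden_size, hidden_size, bias=False)
+
+     def forward(self, aux_states: list[torch.Tensor]) -> torch.Tensor:
+         # aux_states: hidden states captured from verifier layers [1, 23, 44],
+         # each of shape (batch, seq_len, hidden_size)
+         return self.fc(torch.cat(aux_states, dim=-1))
+
+ combiner = AuxHiddenStateCombiner(hidden_size=5120)  # hidden_size from config.json
+ aux = [torch.randn(1, 8, 5120) for _ in range(3)]
+ print(combiner(aux).shape)  # torch.Size([1, 8, 5120])
+ ```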
+
+ ## Usage
+
+ ### With vLLM (After Development Complete)
+
+ ```bash
+ # Once vLLM development is complete, serve directly:
+ vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
+ ```
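+
+ Once a server is up, it exposes vLLM's OpenAI-compatible API. A minimal query sketch, assuming the default `localhost:8000` endpoint:
+
+ ```python
+ import requests
+
+ # Query a running `vllm serve` instance (assumes the default port 8000)
+ resp = requests.post(
+     "http://localhost:8000/v1/completions",
+     json={
+         "model": "nm-testing/Llama4-Maverick-Eagle3-Speculators",
+         "prompt": "Explain speculative decoding in one sentence.",
+         "max_tokens": 64,
+     },
+ )
+ print(resp.json()["choices"][0]["text"])
+ ```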
+
+ ### With Speculators Library
+
+ ```python
+ from speculators import SpeculatorModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the speculator
+ speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")
+
+ # Load and attach the verifier
+ verifier = AutoModelForCausalLM.from_pretrained(
+     "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
+     trust_remote_code=True,
+ )
+ speculator.attach_verifier(verifier)
+
+ # Tokenize a prompt and generate
+ tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16")
+ input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
+ outputs = speculator.generate(input_ids, max_length=100)
+ ```
+
+ ## Configuration Structure
+
+ The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features (excerpted from the shipped `config.json`):
+
+ ```json
+ {
+   "speculators_model_type": "eagle3",
+   "architectures": ["Eagle3Speculator"],
+   "draft_vocab_size": 64000,
+   "transformer_layer_config": {
+     "vocab_size": 202048,
+     "rope_scaling": {
+       "rope_type": "llama3"  // Confirms Llama3 architecture
+     }
+   },
+   "eagle_aux_hidden_state_layer_ids": [1, 23, 44]
+ }
+ ```
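+
+ To double-check these fields against the shipped checkpoint, the config can be pulled and inspected directly (a small sketch using `huggingface_hub`):
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Fetch the speculator config and confirm the vocabulary split and aux layers
+ path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "config.json")
+ with open(path) as f:
+     cfg = json.load(f)
+
+ print(cfg["draft_vocab_size"])                        # 64000
+ print(cfg["transformer_layer_config"]["vocab_size"])  # 202048
+ print(cfg["eagle_aux_hidden_state_layer_ids"])        # [1, 23, 44]
+ ```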
+
+ ## Benchmarking
+
+ ### Text-Only Inference
+
+ **Command:**
+ ```bash
+ python examples/offline_inference/spec_decode.py \
+     --method "eagle3" \
+     --tp 8 \
+     --print-output \
+     --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
+     --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
+     --dataset_name "hf" \
+     --dataset_path "philschmid/mt-bench" \
+     --num-spec-tokens 3
+ ```
+
+ **Results:**
+ - Mean acceptance length: 2.53
+ - Per-position acceptance rates: 0.71, 0.48, 0.34
+ - Auxiliary layers used: [1, 23, 44] (configured via the speculator config)
+
+ ```text
+ --------------------------------------------------
+ total_num_output_tokens: 227215
+ num_drafts: 90393
+ num_draft_tokens: 271179
+ num_accepted_tokens: 136677
+ mean acceptance length: 2.53
+ --------------------------------------------------
+ acceptance at token 0: 0.71
+ acceptance at token 1: 0.48
+ acceptance at token 2: 0.34
+ ```
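+
+ As a rough sanity check on these numbers (a sketch of the standard accounting; the script's exact definition may differ slightly):
+
+ ```python
+ # Expected accepted draft tokens per draft step, from the per-position rates
+ per_position = [0.71, 0.48, 0.34]
+ print(sum(per_position))      # 1.53
+
+ # The same quantity from the raw counters
+ print(136677 / 90393)         # ~1.51 (num_accepted_tokens / num_drafts)
+
+ # Adding the one verified token emitted every step gives the acceptance length
+ print(1 + sum(per_position))  # 2.53, matching the reported value
+ ```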
+
+ ### Multimodal Inference
+
+ **Command:**
+ ```bash
+ python examples/offline_inference/spec_decode.py \
+     --method "eagle3" \
+     --tp 8 \
+     --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
+     --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
+     --custom-mm-prompts \
+     --num-spec-tokens 3
+ ```
+
+ **Results:**
+ - Mean acceptance length: 2.12
+ - Per-position acceptance rates: 0.60, 0.34, 0.19
+ - Note: The acceptance rate is lower than for text-only inference. Multimodal support will be investigated and expanded in a future PR.
+
+ ```text
+ --------------------------------------------------
+ total_num_output_tokens: 181036
+ num_drafts: 85369
+ num_draft_tokens: 256107
+ num_accepted_tokens: 95711
+ mean acceptance length: 2.12
+ --------------------------------------------------
+ acceptance at token 0: 0.60
+ acceptance at token 1: 0.34
+ acceptance at token 2: 0.19
+ ```
+
+ **Benchmarking Script:** [vLLM spec_decode.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/spec_decode.py)
+
+ ## Performance Notes
+
+ - **Vocabulary Size**: The 202K embedding vocabulary is unusually large and may impact memory usage (see the rough estimate below)
+ - **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
+ - **Acceptance Rate**: Achieves ~2.5 tokens per forward pass on text-only tasks, ~2.1 on multimodal tasks
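+
+ A back-of-envelope sketch of the vocabulary's memory cost, assuming the head stores a full bf16 embedding table (the actual checkpoint may instead share the verifier's embeddings):
+
+ ```python
+ # Rough bf16 (2 bytes/param) sizes for the vocabulary-dependent tensors,
+ # using hidden_size=5120 from config.json
+ hidden = 5120
+ embed_gb = 202048 * hidden * 2 / 1e9  # full embedding table
+ head_gb = 64000 * hidden * 2 / 1e9    # 64K-entry draft output head
+ print(f"embedding ~{embed_gb:.2f} GB, draft lm_head ~{head_gb:.2f} GB")
+ # embedding ~2.07 GB, draft lm_head ~0.66 GB
+ ```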
+
+ ## Model Weights
+
+ - **Format**: SafeTensors
+ - **Precision**: bfloat16
+ - **Size**: ~2.0 GB (2,019,265,328 bytes)
+
+ ## Citation
+
+ If you use this model, please cite both the original NVIDIA model and the Speculators library:
+
+ ```bibtex
+ @misc{nvidia2025llama4maverick,
+   title={Llama 4 Maverick 17B Eagle3},
+   author={NVIDIA Corporation},
+   year={2025},
+   publisher={Hugging Face}
+ }
+
+ @misc{speculators2024,
+   title={Speculators: A Unified Library for Speculative Decoding},
+   author={Neural Magic},
+   year={2024},
+   url={https://github.com/neuralmagic/speculators}
+ }
+ ```
+
+ ## License
+
+ This model is subject to the NVIDIA Open Model License. Please review the license terms before use.
+
+ ## Acknowledgments
+
+ - Original model by NVIDIA Corporation
+ - Conversion and formatting for Speculators/vLLM compatibility
+ - Based on the Eagle3 architecture with a Llama3 draft head targeting a Llama4 verifier
config.json ADDED
@@ -0,0 +1,70 @@
+ {
+   "architectures": [
+     "Eagle3Speculator"
+   ],
+   "speculators_model_type": "eagle3",
+   "speculators_version": "0.1.0.dev42",
+   "draft_vocab_size": 64000,
+   "norm_before_residual": false,
+   "target_hidden_size": null,
+   "eagle_aux_hidden_state_layer_ids": [
+     1,
+     23,
+     44
+   ],
+   "transformer_layer_config": {
+     "model_type": "llama",
+     "vocab_size": 202048,
+     "hidden_size": 5120,
+     "intermediate_size": 32768,
+     "num_hidden_layers": 1,
+     "num_attention_heads": 40,
+     "num_key_value_heads": 8,
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "max_position_embeddings": 1048576,
+     "initializer_range": 0.02,
+     "rms_norm_eps": 1e-05,
+     "pretraining_tp": 1,
+     "use_cache": true,
+     "rope_theta": 500000.0,
+     "rope_scaling": {
+       "factor": 8.0,
+       "high_freq_factor": 4.0,
+       "low_freq_factor": 1.0,
+       "original_max_position_embeddings": 8192,
+       "rope_type": "llama3"
+     },
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "mlp_bias": false,
+     "tie_word_embeddings": false
+   },
+   "speculators_config": {
+     "algorithm": "eagle3",
+     "default_proposal_method": "greedy",
+     "proposal_methods": [
+       {
+         "proposal_type": "greedy",
+         "speculative_tokens": 3,
+         "verifier_accept_k": 1,
+         "accept_tolerance": 0.0
+       }
+     ],
+     "verifier": {
+       "name_or_path": "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
+       "architectures": [
+         "Llama4ForConditionalGeneration"
+       ]
+     }
+   },
+   "torch_dtype": "bfloat16",
+   "_comment": "Eagle3 head based on Llama3 architecture targeting Llama4 Maverick verifier",
+   "_conversion_notes": {
+     "source": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
+     "architecture_notes": "Eagle3 head uses Llama3 rope_type, targets Llama4 verifier",
+     "vocabulary_notes": "Large 202K vocabulary, same for draft and target",
+     "auxiliary_layers": "Uses hidden states from verifier layers 1, 23, 44",
+     "implementation_note": "May require Eagle3Speculator extensions for aux hidden states"
+   }
+ }
metadata.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "conversion_tool": "create_final_eagle3_config.py",
+   "source_checkpoint": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
+   "format": "speculators-eagle3",
+   "architecture": "Llama3-based Eagle3 head",
+   "verifier": "Llama4 Maverick",
+   "notes": [
+     "Eagle3 head based on Llama3 architecture (rope_type: llama3)",
+     "Targets Llama4 Maverick verifier (Llama4ForConditionalGeneration)",
+     "Large vocabulary of 202,048 tokens",
+     "Uses auxiliary hidden states from layers 1, 23, 44",
+     "NVIDIA-specific fields preserved as extra configuration",
+     "May require Eagle3Speculator implementation extensions"
+   ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a8f075fce4f9ad4a109167c703d40d4470b7318390864a8850b7de23cb99647b
+ size 2019265328