Upload folder using huggingface_hub

Files changed:
- README.md +209 -0
- config.json +70 -0
- metadata.json +15 -0
- model.safetensors +3 -0

README.md ADDED
@@ -0,0 +1,209 @@
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
tags:
- speculative-decoding
- eagle3
- llama3
- llama4
- vllm
- speculators
---

# Llama4-Maverick-Eagle3-Speculators

## Model Description

**⚠️ Development Reference Model**: This model has been converted as a reference for Eagle3 development in vLLM. Once that development is complete, it can be served with:

```bash
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.

### Development Status
🚧 **Reference Implementation for vLLM Development**
- This model serves as a reference implementation for vLLM Eagle3 support
- Contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, the model will support direct serving

### Key Features
- **Architecture**: Eagle3 speculator with a Llama3-based draft head
- **Target Verifier**: Llama4 Maverick 17B (quantized w4a16)
- **Vocabulary Size**: 202,048 tokens (unusually large for a draft model)
- **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]

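The speculative-decoding loop these components plug into can be sketched generically: the draft head proposes a few tokens, the verifier accepts a prefix of them and always contributes one token of its own. A toy, self-contained sketch — the `propose`/`verify_step` bodies are stand-ins for the real draft head and verifier, not their actual implementations:

```python
import random

random.seed(1)

DRAFT_ACCEPT_P = 0.6  # stand-in for per-position acceptance (cf. the benchmark rates below)

def propose(prefix, k=3):
    # Hypothetical draft head: propose k tokens after the prefix
    return [(sum(prefix) + i) % 100 for i in range(1, k + 1)]

def verify_step(prefix, drafted):
    # Hypothetical verifier: accept a prefix of the drafted tokens,
    # then always emit one token of its own (the "bonus" token)
    accepted = []
    for tok in drafted:
        if random.random() < DRAFT_ACCEPT_P:
            accepted.append(tok)
        else:
            break
    bonus = (sum(prefix) + sum(accepted)) % 100
    return accepted + [bonus]

prefix = [1, 2, 3]
emitted = verify_step(prefix, propose(prefix))
print(len(emitted))  # 1 to 4 tokens per verifier forward pass
```

With `--num-spec-tokens 3`, each verifier pass therefore emits between 1 and 4 tokens, which is where the "mean acceptance length" metric in the benchmarks below comes from.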
## Configuration Details

This model represents a unique hybrid configuration:
- **Draft Model**: Llama3-based Eagle3 head (a single transformer layer)
- **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
- **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier

### Non-Standard Features

This model includes several non-standard Eagle3 features preserved from the NVIDIA checkpoint:
- Auxiliary hidden states taken from verifier layers [1, 23, 44]
- Custom layer normalization configurations
- A large vocabulary matching the target model

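In Eagle3, the auxiliary-hidden-state mechanism concatenates hidden states captured at several verifier layers and projects them down to the draft hidden size before the single draft layer runs. A minimal NumPy sketch of that combination step — the shapes come from this model's config, but the projection weight and sequence length are illustrative stand-ins, not the checkpoint's actual tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 5120           # draft hidden size from config.json
seq_len = 4                  # illustrative sequence length
aux_layer_ids = [1, 23, 44]  # verifier layers tapped for auxiliary states

# One hidden-state tensor per tapped verifier layer: (seq_len, hidden_size)
aux_states = [rng.standard_normal((seq_len, hidden_size)) for _ in aux_layer_ids]

# Concatenate along the feature dimension: (seq_len, 3 * hidden_size)
combined = np.concatenate(aux_states, axis=-1)

# Project back down to the draft hidden size (hypothetical fc weight)
w_fc = rng.standard_normal((3 * hidden_size, hidden_size)) * 0.01
draft_input = combined @ w_fc

print(combined.shape)     # (4, 15360)
print(draft_input.shape)  # (4, 5120)
```

Standard Eagle3 implementations only tap the verifier's final hidden state, which is why this three-layer tap requires the vLLM extensions mentioned above.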
## Usage

### With vLLM (After Development Complete)

```bash
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

### With Speculators Library

```python
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True
)
speculator.attach_verifier(verifier)

# Tokenize a prompt and generate with speculative decoding
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16")
input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
outputs = speculator.generate(input_ids, max_length=100)
```

## Configuration Structure

The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features:

```json
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 64000,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3"
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
```

The `"rope_type": "llama3"` entry confirms that the draft head follows the Llama3 architecture.
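Since the extra NVIDIA-specific fields are plain JSON, downstream tooling can read them with the standard library. A small sketch — the inline string mirrors the fields above rather than the full shipped config:

```python
import json

config_text = """
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 64000,
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
"""

config = json.loads(config_text)

# Pull out the NVIDIA-specific auxiliary-layer field with a safe default
aux_layers = config.get("eagle_aux_hidden_state_layer_ids", [])
print(config["speculators_model_type"], aux_layers)  # eagle3 [1, 23, 44]
```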

## Benchmarking

### Text-Only Inference

**Command:**
```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --print-output \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --dataset_name "hf" \
    --dataset_path "philschmid/mt-bench" \
    --num-spec-tokens 3
```

**Results:**
- Mean acceptance length: 2.53
- Per-position acceptance rates: 0.71, 0.48, 0.34
- Auxiliary layers used: [1, 23, 44] (configured via the speculator config)

```text
--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34
```
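The logged mean acceptance length is consistent with the per-position rates: it equals the one token the verifier always emits plus the expected number of accepted draft tokens per step.

```python
# Per-position acceptance rates from the text-only run above
rates = [0.71, 0.48, 0.34]

# Mean acceptance length = 1 (verifier's own token) + expected accepted draft tokens
mean_acceptance_length = 1 + sum(rates)
print(round(mean_acceptance_length, 2))  # 2.53
```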

### Multimodal Inference

**Command:**
```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --custom-mm-prompts \
    --num-spec-tokens 3
```

**Results:**
- Mean acceptance length: 2.12
- Per-position acceptance rates: 0.60, 0.34, 0.19
- Note: The acceptance rate is lower than for text-only inference. Multimodal support will be investigated and expanded in a future PR.

```text
--------------------------------------------------
total_num_output_tokens: 181036
num_drafts: 85369
num_draft_tokens: 256107
num_accepted_tokens: 95711
mean acceptance length: 2.12
--------------------------------------------------
acceptance at token 0: 0.60
acceptance at token 1: 0.34
acceptance at token 2: 0.19
```
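The same quantity can be recovered from the raw counters of both runs, which makes the text-only vs. multimodal gap easy to compare (the values match the logged means up to rounding of the per-position rates):

```python
# Raw counters from the two benchmark logs above
runs = {
    "text-only":  {"num_drafts": 90393, "num_accepted_tokens": 136677},
    "multimodal": {"num_drafts": 85369, "num_accepted_tokens": 95711},
}

for name, r in runs.items():
    # Tokens emitted per verifier step: accepted draft tokens + the verifier's own token
    per_step = 1 + r["num_accepted_tokens"] / r["num_drafts"]
    print(f"{name}: {per_step:.2f} tokens/step")
```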

**Benchmarking Script:** [vLLM spec_decode.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/spec_decode.py)

## Performance Notes

- **Vocabulary Size**: The 202K vocabulary is unusually large and may increase memory usage
- **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
- **Acceptance Rate**: Achieves ~2.5 tokens per forward pass on text-only tasks, ~2.1 on multimodal tasks

## Model Weights

- **Format**: SafeTensors
- **Precision**: bfloat16
- **Size**: ~2.0 GB (2,019,265,328 bytes)

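Given the bfloat16 precision and the checkpoint size recorded in the Git LFS pointer, the parameter count of the draft head can be estimated with a rough back-of-the-envelope calculation (the safetensors header adds only a small overhead, which this sketch ignores):

```python
# Size of model.safetensors from its Git LFS pointer, in bytes
size_bytes = 2_019_265_328

# bfloat16 stores each parameter in 2 bytes
approx_params = size_bytes // 2
print(f"~{approx_params / 1e9:.2f}B parameters")  # ~1.01B parameters
```

Most of that budget is the 202K-row embedding and output matrices rather than the single transformer layer.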
## Citation

If you use this model, please cite both the original NVIDIA model and the Speculators library:

```bibtex
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
```

## License

This model is subject to the NVIDIA Open Model License. Please review the license terms before use.

## Acknowledgments

- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on the Eagle3 architecture with a Llama3 draft head targeting a Llama4 verifier
config.json ADDED
@@ -0,0 +1,70 @@
{
  "architectures": [
    "Eagle3Speculator"
  ],
  "speculators_model_type": "eagle3",
  "speculators_version": "0.1.0.dev42",
  "draft_vocab_size": 64000,
  "norm_before_residual": false,
  "target_hidden_size": null,
  "eagle_aux_hidden_state_layer_ids": [
    1,
    23,
    44
  ],
  "transformer_layer_config": {
    "model_type": "llama",
    "vocab_size": 202048,
    "hidden_size": 5120,
    "intermediate_size": 32768,
    "num_hidden_layers": 1,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "hidden_act": "silu",
    "max_position_embeddings": 1048576,
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-05,
    "pretraining_tp": 1,
    "use_cache": true,
    "rope_theta": 500000.0,
    "rope_scaling": {
      "factor": 8.0,
      "high_freq_factor": 4.0,
      "low_freq_factor": 1.0,
      "original_max_position_embeddings": 8192,
      "rope_type": "llama3"
    },
    "attention_bias": false,
    "attention_dropout": 0.0,
    "mlp_bias": false,
    "tie_word_embeddings": false
  },
  "speculators_config": {
    "algorithm": "eagle3",
    "default_proposal_method": "greedy",
    "proposal_methods": [
      {
        "proposal_type": "greedy",
        "speculative_tokens": 3,
        "verifier_accept_k": 1,
        "accept_tolerance": 0.0
      }
    ],
    "verifier": {
      "name_or_path": "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
      "architectures": [
        "Llama4ForConditionalGeneration"
      ]
    }
  },
  "torch_dtype": "bfloat16",
  "_comment": "Eagle3 head based on Llama3 architecture targeting Llama4 Maverick verifier",
  "_conversion_notes": {
    "source": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
    "architecture_notes": "Eagle3 head uses Llama3 rope_type, targets Llama4 verifier",
    "vocabulary_notes": "Large 202K vocabulary, same for draft and target",
    "auxiliary_layers": "Uses hidden states from verifier layers 1, 23, 44",
    "implementation_note": "May require Eagle3Speculator extensions for aux hidden states"
  }
}
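The attention geometry declared in `transformer_layer_config` is self-consistent and can be checked mechanically:

```python
# Attention geometry from the transformer_layer_config above
hidden_size = 5120
num_attention_heads = 40
num_key_value_heads = 8
head_dim = 128

# Query heads exactly tile the hidden dimension
assert num_attention_heads * head_dim == hidden_size

# Grouped-query attention: each KV head serves a group of query heads
gqa_group_size = num_attention_heads // num_key_value_heads
print(gqa_group_size)  # 5 query heads per KV head
```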
metadata.json ADDED
@@ -0,0 +1,15 @@
{
  "conversion_tool": "create_final_eagle3_config.py",
  "source_checkpoint": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
  "format": "speculators-eagle3",
  "architecture": "Llama3-based Eagle3 head",
  "verifier": "Llama4 Maverick",
  "notes": [
    "Eagle3 head based on Llama3 architecture (rope_type: llama3)",
    "Targets Llama4 Maverick verifier (Llama4ForConditionalGeneration)",
    "Large vocabulary of 202,048 tokens",
    "Uses auxiliary hidden states from layers 1, 23, 44",
    "NVIDIA-specific fields preserved as extra configuration",
    "May require Eagle3Speculator implementation extensions"
  ]
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a8f075fce4f9ad4a109167c703d40d4470b7318390864a8850b7de23cb99647b
size 2019265328