---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
tags:
- speculative-decoding
- eagle3
- llama3
- llama4
- vllm
- speculators
---

# Llama4-Maverick-Eagle3-Speculators

## Model Description

**⚠️ Development Reference Model**: This model has been converted as a reference for vLLM development. Once development is complete, it can be served with:

```bash
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.

### Development Status

🚧 **Reference Implementation for vLLM Development**

- This model serves as a reference implementation for vLLM Eagle3 support
- It contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, it will support direct serving

### Key Features

- **Architecture**: Eagle3 speculator with a Llama3-based draft head
- **Target Verifier**: Llama 4 Maverick 17B (quantized w4a16)
- **Vocabulary Size**: 202,048 tokens (unusually large for a draft model)
- **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]

## Configuration Details

This model is a hybrid configuration:

- **Draft Model**: Llama3-based Eagle3 head (single transformer layer)
- **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
- **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier

### Non-Standard Features

This model preserves several non-standard Eagle3 features from the NVIDIA checkpoint:

- Auxiliary hidden states taken from verifier layers [1, 23, 44]
- Custom layer normalization configurations
- A large vocabulary matching the target model
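As an illustration of how these non-standard fields are laid out, the sketch below reads them from a config dictionary. The field names match the configuration excerpt shown later in this card; the parsing logic itself is a hypothetical example, not the actual Speculators loader.

```python
import json

# Illustrative sketch: reading the non-standard Eagle3 fields from a config
# shaped like the excerpt in this card. Only the field names are taken from
# the card; the surrounding logic is a hypothetical example.
config = json.loads("""
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
""")

assert config["speculators_model_type"] == "eagle3"

# The draft head taps verifier hidden states at these layer indices, so the
# verifier's forward pass must be extended to return them.
aux_layers = (
    config["eagle_aux_hidden_state_layer_ids"]
    if config.get("use_aux_hidden_state")
    else []
)
print(aux_layers)                   # [1, 23, 44]
print(config["draft_vocab_size"])   # 202048
```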
## Usage

### With vLLM (After Development Complete)

```bash
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

### With Speculators Library

```python
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True
)
speculator.attach_verifier(verifier)

# Use for generation
outputs = speculator.generate(input_ids, max_length=100)
```

## Configuration Structure

The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features:

```json
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3"
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
```

(`"rope_type": "llama3"` confirms the Llama3 draft architecture.)

## Benchmarking

### Text-Only Inference

**Command:**

```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --print-output \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --dataset_name "hf" \
    --dataset_path "philschmid/mt-bench" \
    --num-spec-tokens 3
```

**Results:**

- Mean acceptance length: 2.53
- Per-position acceptance rates: 0.71, 0.48, 0.34
- Auxiliary layers used: [1, 23, 44] (configured via the speculator config)

```bash
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34
```

### Multimodal Inference

**Command:**

```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --custom-mm-prompts \
    --num-spec-tokens 3
```

**Results:**

- Mean acceptance length: 2.12
- Per-position acceptance rates: 0.60, 0.34, 0.19
- Note: acceptance is lower than for text-only inference; multimodal support will be investigated and expanded in a future PR.

```bash
--------------------------------------------------
total_num_output_tokens: 181036
num_drafts: 85369
num_draft_tokens: 256107
num_accepted_tokens: 95711
mean acceptance length: 2.12
--------------------------------------------------
acceptance at token 0: 0.60
acceptance at token 1: 0.34
acceptance at token 2: 0.19
```

**Benchmarking Script:** [vLLM spec_decode.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/spec_decode.py)

## Performance Notes

- **Vocabulary Size**: The 202K vocabulary is unusually large for a draft model and may increase memory usage
- **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
- **Acceptance Rate**: Achieves ~2.5 accepted tokens per forward pass on text-only tasks, ~2.1 on multimodal tasks

## Model Weights

- **Format**: SafeTensors
- **Precision**: bfloat16
- **Size**: ~3.2 GB

## Citation

If you use this model, please cite both the original NVIDIA model and the Speculators library:

```bibtex
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
```

## License

This model is subject to the NVIDIA Open Model License.
Please review the license terms before use.

## Acknowledgments

- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on the Eagle3 architecture, with a Llama3 draft head targeting a Llama4 verifier
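The mean acceptance lengths reported in the benchmarking section can be sanity-checked from the per-position acceptance rates: each drafting step yields one guaranteed verifier token plus, in expectation, the sum of the per-position acceptance rates in accepted draft tokens. A minimal sketch (the small discrepancy on the multimodal figure comes from the rates being rounded to two decimals):

```python
# Sanity check: mean acceptance length is 1 (the verifier's guaranteed token)
# plus the sum of the per-position acceptance rates reported by spec_decode.py.
def mean_acceptance_length(rates):
    return round(1 + sum(rates), 2)

print(mean_acceptance_length([0.71, 0.48, 0.34]))  # text-only: 2.53, matching the reported value
print(mean_acceptance_length([0.60, 0.34, 0.19]))  # multimodal: 2.13 vs. reported 2.12 (rates are rounded)
```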