---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
tags:
- speculative-decoding
- eagle3
- llama3
- llama4
- vllm
- speculators
---

# Llama4-Maverick-Eagle3-Speculators

## Model Description

**⚠️ Development Reference Model**: This model has been converted as a reference for vLLM development. Once development is complete, it can be served with:

```bash
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.

### Development Status

🚧 **Reference Implementation for vLLM Development**

- This model serves as a reference implementation for vLLM Eagle3 support
- It contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, it will support direct serving

### Key Features

- **Architecture**: Eagle3 speculator with a Llama3-based draft head
- **Target Verifier**: Llama 4 Maverick 17B (quantized w4a16)
- **Vocabulary Size**: 202,048 tokens (unusually large for a draft model)
- **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]

## Configuration Details

This model is a hybrid configuration:

- **Draft Model**: Llama3-based Eagle3 head (single transformer layer)
- **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
- **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier

### Non-Standard Features

This model preserves several non-standard Eagle3 features from the NVIDIA checkpoint:

- Auxiliary hidden states taken from verifier layers [1, 23, 44]
- Custom layer normalization configurations
- A large vocabulary matching the target model
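As an illustration of how these non-standard fields are laid out, the sketch below reads them from a config dictionary. The field names match the configuration excerpt shown later in this card; the parsing logic itself is a hypothetical example, not the actual Speculators loader.

```python
import json

# Illustrative sketch: reading the non-standard Eagle3 fields from a config
# shaped like the excerpt in this card. Only the field names are taken from
# the card; the surrounding logic is a hypothetical example.
config = json.loads("""
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
""")

assert config["speculators_model_type"] == "eagle3"

# The draft head taps verifier hidden states at these layer indices, so the
# verifier's forward pass must be extended to return them.
aux_layers = (
    config["eagle_aux_hidden_state_layer_ids"]
    if config.get("use_aux_hidden_state")
    else []
)
print(aux_layers)                   # [1, 23, 44]
print(config["draft_vocab_size"])   # 202048
```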
## Usage

### With vLLM (After Development Complete)

```bash
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

### With Speculators Library

```python
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True
)
speculator.attach_verifier(verifier)

# Use for generation
outputs = speculator.generate(input_ids, max_length=100)
```

## Configuration Structure

The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features:

```json
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3"
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
```

(`"rope_type": "llama3"` confirms the Llama3 draft architecture.)

## Benchmarking

### Text-Only Inference

**Command:**

```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --print-output \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --dataset_name "hf" \
    --dataset_path "philschmid/mt-bench" \
    --num-spec-tokens 3
```

**Results:**

- Mean acceptance length: 2.53
- Per-position acceptance rates: 0.71, 0.48, 0.34
- Auxiliary layers used: [1, 23, 44] (configured via the speculator config)

```bash
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34
```

### Multimodal Inference

**Command:**

```bash
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 8 \
    --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
    --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
    --custom-mm-prompts \
    --num-spec-tokens 3
```

**Results:**

- Mean acceptance length: 2.12
- Per-position acceptance rates: 0.60, 0.34, 0.19
- Note: acceptance is lower than for text-only inference; multimodal support will be investigated and expanded in a future PR.

```bash
--------------------------------------------------
total_num_output_tokens: 181036
num_drafts: 85369
num_draft_tokens: 256107
num_accepted_tokens: 95711
mean acceptance length: 2.12
--------------------------------------------------
acceptance at token 0: 0.60
acceptance at token 1: 0.34
acceptance at token 2: 0.19
```

**Benchmarking Script:** [vLLM spec_decode.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/spec_decode.py)

## Performance Notes

- **Vocabulary Size**: The 202K vocabulary is unusually large for a draft model and may increase memory usage
- **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
- **Acceptance Rate**: Achieves ~2.5 accepted tokens per forward pass on text-only tasks, ~2.1 on multimodal tasks

## Model Weights

- **Format**: SafeTensors
- **Precision**: bfloat16
- **Size**: ~3.2 GB

## Citation

If you use this model, please cite both the original NVIDIA model and the Speculators library:

```bibtex
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
```

## License

This model is subject to the NVIDIA Open Model License.
Please review the license terms before use.

## Acknowledgments

- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on the Eagle3 architecture, with a Llama3 draft head targeting a Llama4 verifier
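The mean acceptance lengths reported in the benchmarking section can be sanity-checked from the per-position acceptance rates: each drafting step yields one guaranteed verifier token plus, in expectation, the sum of the per-position acceptance rates in accepted draft tokens. A minimal sketch (the small discrepancy on the multimodal figure comes from the rates being rounded to two decimals):

```python
# Sanity check: mean acceptance length is 1 (the verifier's guaranteed token)
# plus the sum of the per-position acceptance rates reported by spec_decode.py.
def mean_acceptance_length(rates):
    return round(1 + sum(rates), 2)

print(mean_acceptance_length([0.71, 0.48, 0.34]))  # text-only: 2.53, matching the reported value
print(mean_acceptance_length([0.60, 0.34, 0.19]))  # multimodal: 2.13 vs. reported 2.12 (rates are rounded)
```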