MeganEFlynn committed on
Commit c11c4cd · verified · 1 Parent(s): 0a5be04

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +209 -0
  2. config.json +70 -0
  3. metadata.json +15 -0
  4. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ license: other
+ license_name: nvidia-open-model-license
+ license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
+ base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
+ tags:
+ - speculative-decoding
+ - eagle3
+ - llama3
+ - llama4
+ - vllm
+ - speculators
+ ---
+
+ # Llama4-Maverick-Eagle3-Speculators
+
+ ## Model Description
+
+ **⚠️ Development Reference Model**: This model was converted as a reference for vLLM development. Once development is complete, it can be served with:
+ ```bash
+ vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
+ ```
+
+ This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.
+
+ ### Development Status
+ 🚧 **Reference Implementation for vLLM Development**
+ - This model serves as a reference implementation for vLLM Eagle3 support
+ - Contains non-standard features (auxiliary hidden states) that require vLLM extensions
+ - Once vLLM development is complete, it will support direct serving
+
+ ### Key Features
+ - **Architecture**: Eagle3 speculator with Llama3-based draft head
+ - **Target Verifier**: Llama4 Maverick 17B (quantized w4a16)
+ - **Vocabulary Size**: 202,048-token embedding vocabulary (unusually large for a draft model), with a 64,000-token draft output vocabulary per `config.json`
+ - **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]
+
+ ## Configuration Details
+
+ This model is a hybrid configuration:
+ - **Draft Model**: Llama3-based Eagle3 head (single transformer layer)
+ - **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
+ - **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier
+
+ ### Non-Standard Features
+
+ This model preserves several non-standard Eagle3 features from the NVIDIA checkpoint:
+ - Auxiliary hidden states taken from verifier layers [1, 23, 44] (see the sketch below)
+ - Custom layer normalization configurations
+ - Large embedding vocabulary matching the target model
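+
+ As a rough illustration, here is a minimal sketch of how states captured from several verifier layers can be fused into a single draft-head input. The `AuxHiddenStateCombiner` class is hypothetical; this is not the actual Speculators implementation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class AuxHiddenStateCombiner(nn.Module):
+     """Hypothetical fusion of per-layer verifier states for an Eagle3-style head."""
+
+     def __init__(self, hidden_size: int, num_aux_layers: int = 3):
+         super().__init__()
+         # Project the concatenated per-layer states back to the draft hidden size
+         self.fc = nn.Linear(num_aux_layers * hidden_size, hidden_size, bias=False)
+
+     def forward(self, aux_states: list[torch.Tensor]) -> torch.Tensor:
+         # aux_states: hidden states captured from verifier layers [1, 23, 44],
+         # each of shape (batch, seq_len, hidden_size)
+         return self.fc(torch.cat(aux_states, dim=-1))
+
+ combiner = AuxHiddenStateCombiner(hidden_size=5120)  # hidden_size from config.json
+ aux = [torch.randn(1, 8, 5120) for _ in range(3)]
+ print(combiner(aux).shape)  # torch.Size([1, 8, 5120])
+ ```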
+
+ ## Usage
+
+ ### With vLLM (After Development Complete)
+
+ ```bash
+ # Once vLLM development is complete, serve directly:
+ vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
+ ```
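+
+ Once a server is up, it exposes vLLM's OpenAI-compatible API. A minimal query sketch, assuming the default `localhost:8000` endpoint:
+
+ ```python
+ import requests
+
+ # Query a running `vllm serve` instance (assumes the default port 8000)
+ resp = requests.post(
+     "http://localhost:8000/v1/completions",
+     json={
+         "model": "nm-testing/Llama4-Maverick-Eagle3-Speculators",
+         "prompt": "Explain speculative decoding in one sentence.",
+         "max_tokens": 64,
+     },
+ )
+ print(resp.json()["choices"][0]["text"])
+ ```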
+
+ ### With Speculators Library
+
+ ```python
+ from speculators import SpeculatorModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the speculator
+ speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")
+
+ # Load and attach the verifier
+ verifier = AutoModelForCausalLM.from_pretrained(
+     "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
+     trust_remote_code=True,
+ )
+ speculator.attach_verifier(verifier)
+
+ # Tokenize a prompt and generate
+ tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16")
+ input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
+ outputs = speculator.generate(input_ids, max_length=100)
+ ```
+
+ ## Configuration Structure
+
+ The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features (excerpted from the shipped `config.json`):
+
+ ```json
+ {
+   "speculators_model_type": "eagle3",
+   "architectures": ["Eagle3Speculator"],
+   "draft_vocab_size": 64000,
+   "transformer_layer_config": {
+     "vocab_size": 202048,
+     "rope_scaling": {
+       "rope_type": "llama3"  // Confirms Llama3 architecture
+     }
+   },
+   "eagle_aux_hidden_state_layer_ids": [1, 23, 44]
+ }
+ ```
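+
+ To double-check these fields against the shipped checkpoint, the config can be pulled and inspected directly (a small sketch using `huggingface_hub`):
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Fetch the speculator config and confirm the vocabulary split and aux layers
+ path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "config.json")
+ with open(path) as f:
+     cfg = json.load(f)
+
+ print(cfg["draft_vocab_size"])                        # 64000
+ print(cfg["transformer_layer_config"]["vocab_size"])  # 202048
+ print(cfg["eagle_aux_hidden_state_layer_ids"])        # [1, 23, 44]
+ ```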
+
+ ## Benchmarking
+
+ ### Text-Only Inference
+
+ **Command:**
+ ```bash
+ python examples/offline_inference/spec_decode.py \
+     --method "eagle3" \
+     --tp 8 \
+     --print-output \
+     --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
+     --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
+     --dataset_name "hf" \
+     --dataset_path "philschmid/mt-bench" \
+     --num-spec-tokens 3
+ ```
+
+ **Results:**
+ - Mean acceptance length: 2.53
+ - Per-position acceptance rates: 0.71, 0.48, 0.34
+ - Auxiliary layers used: [1, 23, 44] (configured via the speculator config)
+
+ ```text
+ --------------------------------------------------
+ total_num_output_tokens: 227215
+ num_drafts: 90393
+ num_draft_tokens: 271179
+ num_accepted_tokens: 136677
+ mean acceptance length: 2.53
+ --------------------------------------------------
+ acceptance at token 0: 0.71
+ acceptance at token 1: 0.48
+ acceptance at token 2: 0.34
+ ```
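+
+ As a rough sanity check on these numbers (a sketch of the standard accounting; the script's exact definition may differ slightly):
+
+ ```python
+ # Expected accepted draft tokens per draft step, from the per-position rates
+ per_position = [0.71, 0.48, 0.34]
+ print(sum(per_position))      # 1.53
+
+ # The same quantity from the raw counters
+ print(136677 / 90393)         # ~1.51 (num_accepted_tokens / num_drafts)
+
+ # Adding the one verified token emitted every step gives the acceptance length
+ print(1 + sum(per_position))  # 2.53, matching the reported value
+ ```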
+
+ ### Multimodal Inference
+
+ **Command:**
+ ```bash
+ python examples/offline_inference/spec_decode.py \
+     --method "eagle3" \
+     --tp 8 \
+     --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
+     --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
+     --custom-mm-prompts \
+     --num-spec-tokens 3
+ ```
+
+ **Results:**
+ - Mean acceptance length: 2.12
+ - Per-position acceptance rates: 0.60, 0.34, 0.19
+ - Note: The acceptance rate is lower than for text-only inference. Multimodal support will be investigated and expanded in a future PR.
+
+ ```text
+ --------------------------------------------------
+ total_num_output_tokens: 181036
+ num_drafts: 85369
+ num_draft_tokens: 256107
+ num_accepted_tokens: 95711
+ mean acceptance length: 2.12
+ --------------------------------------------------
+ acceptance at token 0: 0.60
+ acceptance at token 1: 0.34
+ acceptance at token 2: 0.19
+ ```
+
+ **Benchmarking Script:** [vLLM spec_decode.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/spec_decode.py)
+
+ ## Performance Notes
+
+ - **Vocabulary Size**: The 202K embedding vocabulary is unusually large and may impact memory usage (see the rough estimate below)
+ - **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
+ - **Acceptance Rate**: Achieves ~2.5 tokens per forward pass on text-only tasks, ~2.1 on multimodal tasks
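+
+ A back-of-envelope sketch of the vocabulary's memory cost, assuming the head stores a full bf16 embedding table (the actual checkpoint may instead share the verifier's embeddings):
+
+ ```python
+ # Rough bf16 (2 bytes/param) sizes for the vocabulary-dependent tensors,
+ # using hidden_size=5120 from config.json
+ hidden = 5120
+ embed_gb = 202048 * hidden * 2 / 1e9  # full embedding table
+ head_gb = 64000 * hidden * 2 / 1e9    # 64K-entry draft output head
+ print(f"embedding ~{embed_gb:.2f} GB, draft lm_head ~{head_gb:.2f} GB")
+ # embedding ~2.07 GB, draft lm_head ~0.66 GB
+ ```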
+
+ ## Model Weights
+
+ - **Format**: SafeTensors
+ - **Precision**: bfloat16
+ - **Size**: ~2.0 GB (2,019,265,328 bytes)
+
+ ## Citation
+
+ If you use this model, please cite both the original NVIDIA model and the Speculators library:
+
+ ```bibtex
+ @misc{nvidia2025llama4maverick,
+   title={Llama 4 Maverick 17B Eagle3},
+   author={NVIDIA Corporation},
+   year={2025},
+   publisher={Hugging Face}
+ }
+
+ @misc{speculators2024,
+   title={Speculators: A Unified Library for Speculative Decoding},
+   author={Neural Magic},
+   year={2024},
+   url={https://github.com/neuralmagic/speculators}
+ }
+ ```
+
+ ## License
+
+ This model is subject to the NVIDIA Open Model License. Please review the license terms before use.
+
+ ## Acknowledgments
+
+ - Original model by NVIDIA Corporation
+ - Conversion and formatting for Speculators/vLLM compatibility
+ - Based on the Eagle3 architecture with a Llama3 draft head targeting a Llama4 verifier
config.json ADDED
@@ -0,0 +1,70 @@
+ {
+   "architectures": [
+     "Eagle3Speculator"
+   ],
+   "speculators_model_type": "eagle3",
+   "speculators_version": "0.1.0.dev42",
+   "draft_vocab_size": 64000,
+   "norm_before_residual": false,
+   "target_hidden_size": null,
+   "eagle_aux_hidden_state_layer_ids": [
+     1,
+     23,
+     44
+   ],
+   "transformer_layer_config": {
+     "model_type": "llama",
+     "vocab_size": 202048,
+     "hidden_size": 5120,
+     "intermediate_size": 32768,
+     "num_hidden_layers": 1,
+     "num_attention_heads": 40,
+     "num_key_value_heads": 8,
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "max_position_embeddings": 1048576,
+     "initializer_range": 0.02,
+     "rms_norm_eps": 1e-05,
+     "pretraining_tp": 1,
+     "use_cache": true,
+     "rope_theta": 500000.0,
+     "rope_scaling": {
+       "factor": 8.0,
+       "high_freq_factor": 4.0,
+       "low_freq_factor": 1.0,
+       "original_max_position_embeddings": 8192,
+       "rope_type": "llama3"
+     },
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "mlp_bias": false,
+     "tie_word_embeddings": false
+   },
+   "speculators_config": {
+     "algorithm": "eagle3",
+     "default_proposal_method": "greedy",
+     "proposal_methods": [
+       {
+         "proposal_type": "greedy",
+         "speculative_tokens": 3,
+         "verifier_accept_k": 1,
+         "accept_tolerance": 0.0
+       }
+     ],
+     "verifier": {
+       "name_or_path": "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
+       "architectures": [
+         "Llama4ForConditionalGeneration"
+       ]
+     }
+   },
+   "torch_dtype": "bfloat16",
+   "_comment": "Eagle3 head based on Llama3 architecture targeting Llama4 Maverick verifier",
+   "_conversion_notes": {
+     "source": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
+     "architecture_notes": "Eagle3 head uses Llama3 rope_type, targets Llama4 verifier",
+     "vocabulary_notes": "Large 202K vocabulary, same for draft and target",
+     "auxiliary_layers": "Uses hidden states from verifier layers 1, 23, 44",
+     "implementation_note": "May require Eagle3Speculator extensions for aux hidden states"
+   }
+ }
metadata.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "conversion_tool": "create_final_eagle3_config.py",
+   "source_checkpoint": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
+   "format": "speculators-eagle3",
+   "architecture": "Llama3-based Eagle3 head",
+   "verifier": "Llama4 Maverick",
+   "notes": [
+     "Eagle3 head based on Llama3 architecture (rope_type: llama3)",
+     "Targets Llama4 Maverick verifier (Llama4ForConditionalGeneration)",
+     "Large vocabulary of 202,048 tokens",
+     "Uses auxiliary hidden states from layers 1, 23, 44",
+     "NVIDIA-specific fields preserved as extra configuration",
+     "May require Eagle3Speculator implementation extensions"
+   ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a8f075fce4f9ad4a109167c703d40d4470b7318390864a8850b7de23cb99647b
+ size 2019265328