@grimjim on Hugging Face: "Speculative decoding only requires that the tokenizers for the two LLMs used…"

Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As a proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate a target model, a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between the two tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B
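
One way to check how closely two tokenizers line up is to diff their vocabularies. The following is a minimal sketch, not the script used for the test above: the function name `vocab_mismatches` and the toy vocabularies are hypothetical, and it assumes each vocabulary is available as a plain token-to-ID mapping.

```python
from typing import Dict, Optional, Tuple


def vocab_mismatches(
    draft_vocab: Dict[str, int], target_vocab: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return tokens whose IDs differ, or that exist in only one vocabulary.

    Each value is (draft_id, target_id); None marks a token missing on that side.
    """
    mismatches = {}
    for token in draft_vocab.keys() | target_vocab.keys():
        draft_id = draft_vocab.get(token)
        target_id = target_vocab.get(token)
        if draft_id != target_id:
            mismatches[token] = (draft_id, target_id)
    return mismatches


# Toy vocabularies (made up for illustration, not real Llama token IDs).
toy_draft = {"hello": 0, "world": 1, "<pad>": 2}
toy_target = {"hello": 0, "world": 1, "<eot>": 2}
diff = vocab_mismatches(toy_draft, toy_target)
print(len(diff), "mismatched tokens:", sorted(diff))
```

With Hugging Face `transformers`, the mappings could be obtained from `AutoTokenizer.from_pretrained(name).get_vocab()`. A small number of mismatches confined to special tokens would be consistent with a "minor" difference, though how much mismatch a given inference engine tolerates is not something this check can determine.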

The proof-of-concept Python script compared conventional and speculative decoding on a zero-shot creative task: writing a story limited to 500 tokens. Speculative decoding cut generation time by approximately one third (e.g., throughput rose from 31 tokens/sec to 46 tokens/sec, roughly a 1.5x speedup) relative to conventional decoding, and the result was consistent over a few runs. While a handful of runs is not statistically significant, this suggests that smaller models aimed at edge computing can serve effectively as draft models in the general case.
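
The example figures can be read two ways, as a throughput gain or as a reduction in generation time. The short calculation below makes the distinction explicit; the helper name `speedup_summary` is hypothetical, and the only inputs are the 31 and 46 tokens/sec figures and the 500-token limit quoted above.

```python
def speedup_summary(baseline_tps: float, speculative_tps: float, n_tokens: int = 500):
    """Return (relative throughput gain, fraction of generation time saved)."""
    throughput_gain = speculative_tps / baseline_tps - 1.0
    baseline_seconds = n_tokens / baseline_tps
    speculative_seconds = n_tokens / speculative_tps
    time_saved = 1.0 - speculative_seconds / baseline_seconds
    return throughput_gain, time_saved


gain, saved = speedup_summary(31.0, 46.0)
print(f"throughput gain: {gain:.1%}, time saved: {saved:.1%}")
```

For these numbers, throughput rises by a little under one half while wall-clock time for the 500-token story falls by about one third.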

A look at the literature affirms that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on par with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.
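
The "veto power" of the target model can be illustrated with a toy greedy variant of speculative decoding. This is a sketch under stated assumptions, not exllamav2's implementation: the two "models" are hypothetical stand-in functions over integer token IDs, the draft length `k` is arbitrary, and verification is done one position at a time where a real engine would score all drafted positions in a single batched pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function over token IDs


def speculative_greedy(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, count of accepted draft tokens)."""
    out = list(prompt)
    accepted = 0
    while len(out) - len(prompt) < max_new:
        # The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # The target checks each proposed token in order.
        for tok in proposal:
            if len(out) - len(prompt) >= max_new:
                break
            expected = target(out)
            out.append(expected)  # the emitted token is always the target's choice
            if tok != expected:
                break  # veto: the rest of this draft is discarded
            accepted += 1
    return out, accepted


# Hypothetical toy models: the target counts upward mod 10,
# and the draft agrees except right after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, n_accepted = speculative_greedy(toy_target, toy_draft, [0], max_new=8)
print(tokens, n_accepted)
```

In this strict greedy form the emitted tokens always come from the target, so the draft changes speed rather than content; any behavioral influence of the kind speculated above would need a more permissive acceptance rule, which this sketch does not model.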

Intuitively, a distilled model can be used as a draft model for the larger teacher model so long as the tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct
