Consider using mistral-7b-instruct-v0.3 as an exl2 draft model with this?

#3
by gghfez - opened

So Mistral-Large and mistral-7b-instruct-v0.3 share the same vocab, which allows us exl2 users to use the 7b as a 'draft model', increasing tokens/second from roughly 13 to 20 when the 7b predicts accurately.

But with the Lumimaid finetune, it obviously misses more often, and no doubt it'll be the same for this magnum.

It'd be nice to have a magnum finetune of the 7b mistral so that it predicts more accurately as a draft model for this one :)
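
For anyone who wants to try it, here's a minimal sketch of the setup, modeled on exllamav2's own speculative decoding example. The model paths and the num_draft_tokens value are placeholders, and the exact API can vary between exllamav2 versions:

```python
# Sketch: speculative decoding in exllamav2 with a small draft model.
# Paths below are placeholders; point them at your local exl2 quants.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
    return config, model, cache

# Target: the big quant. Draft: the 7b that shares its vocab.
config, model, cache = load("/models/magnum-123b-exl2")
_, draft_model, draft_cache = load("/models/mistral-7b-instruct-v0.3-exl2")
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    draft_model = draft_model,
    draft_cache = draft_cache,
    tokenizer = tokenizer,
    num_draft_tokens = 5,  # how many tokens the 7b speculates per step
)

print(generator.generate(prompt = "Once upon a time,", max_new_tokens = 128))
```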

Anthracite org

interesting use-case! I've added it internally for future reference, but in our testing draft models seemed to affect output quality too much, even if it was faster.

lucyknada changed discussion status to closed

interesting use-case! I've added it internally for future reference, but in our testing draft models seemed to affect output quality too much, even if it was faster.

It shouldn't actually affect the text generation though:

https://arxiv.org/abs/2211.17192

we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution.

The larger model only keeps the sequence generated by the smaller model up to the point where the max-probability token diverges from its own, so the output should be identical?
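
To make that concrete, here's a toy greedy-decoding sketch (the paper's full scheme uses rejection sampling over the two distributions, not just argmax matching). The committed token is always the target model's own pick, so the output is identical to running the big model alone; the speedup comes from verifying the draft's k tokens in a single parallel forward pass:

```python
import numpy as np

def speculative_greedy(target_logits, draft_logits, prompt, n_new, k=4):
    """Greedy speculative decoding: output is provably identical to
    plain greedy decoding with the target model alone."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # Draft model cheaply proposes k tokens autoregressively.
        proposal = list(seq)
        for _ in range(k):
            proposal.append(int(np.argmax(draft_logits(proposal))))
        # Target verifies them; a real implementation scores all k
        # positions in ONE batched forward pass (the source of the speedup).
        for tok in proposal[len(seq):]:
            expected = int(np.argmax(target_logits(seq)))
            seq.append(expected)   # always the target's own greedy pick
            if tok != expected:    # draft diverged: drop the rest
                break
    return seq[:len(prompt) + n_new]

# Tiny demo with stand-in "models" whose logits depend only on the last token.
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 10))
target = lambda s: table[s[-1]]
draft = lambda s: table[s[-1]] + rng.normal(scale=0.5, size=10)  # noisy copy
assert speculative_greedy(target, draft, [0], 8) == \
       speculative_greedy(target, target, [0], 8)  # same output, any draft
```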

That's right, it doesn't affect it (at least in the current exllamav2 version) provided the vocab matches. I've seen people run into perplexity issues when they use the wrong draft model.
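
If anyone wants to verify a candidate draft model first, a quick tokenizer comparison works; the repo ids below are just illustrative, substitute your actual target/draft pair:

```python
from transformers import AutoTokenizer

# Illustrative repo ids; swap in your actual target/draft pair.
target = AutoTokenizer.from_pretrained("mistralai/Mistral-Large-Instruct-2407")
draft = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Draft decoding needs the token -> id mapping to match, not just the size.
print("same vocab size:", len(target.get_vocab()) == len(draft.get_vocab()))
print("same mapping:   ", target.get_vocab() == draft.get_vocab())
```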

Anthracite org

no worries, the past testing won't affect whether we try it out (it's already in the queue); we're simply looking into other things first that had been on the backburner for a while.
