Consider using mistral-7b-instruct-v0.3 as an exl2 draft model with this?
So Mistral-Large and mistral-7b-instruct-v0.3 share the same vocab, which lets exl2 users run the 7b as a 'draft model', increasing tokens/second from roughly 13 -> 20 when the 7b predicts accurately.
But with the Lumimaid finetune, it obviously misses more often, and no doubt it'll be the same for this magnum.
It'd be nice to have a magnum finetune of the 7b mistral so that it predicts more accurately as a draft model for this one :)
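For anyone who wants to try it, here's a rough sketch of wiring up the 7b as a draft model in exllamav2. The `draft_model`/`draft_cache` parameter names follow exllamav2's speculative decoding example, and the model paths are just placeholders for your local exl2 quants:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    # Load an exl2 quant with a lazily-allocated cache, auto-split across GPUs
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache, progress=True)
    return model, cache, config

# Large target model and small draft model (they must share the same vocab)
model, cache, config = load("/models/Mistral-Large-exl2", 32768)
draft_model, draft_cache, _ = load("/models/mistral-7b-instruct-v0.3-exl2", 32768)

tokenizer = ExLlamaV2Tokenizer(config)

# The generator drafts tokens with the 7b and verifies them with the large model
generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache,
    draft_model=draft_model, draft_cache=draft_cache,
    tokenizer=tokenizer,
)

print(generator.generate(prompt="Hello", max_new_tokens=128))
```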
Interesting use-case! I've added it internally for future reference, but in our testing draft models seemed to affect text performance too much, even if generation was faster.
It shouldn't actually affect the text generation though:
https://arxiv.org/abs/2211.17192
> we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution.
The larger model only keeps the draft model's tokens up to the point where they diverge from its own max-probability choices, so the output should be identical?
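A toy sketch of that argument for the greedy case (not the paper's full rejection-sampling scheme): the target model checks the drafted tokens against its own argmax picks and keeps them only while they agree, so the accepted sequence is exactly what the target alone would have produced. The `draft_next`/`target_next` functions here are stand-ins for whatever next-token step you have:

```python
def speculative_greedy_step(draft_next, target_next, seq, k=4):
    # Draft model proposes k tokens continuing seq
    draft, s = [], list(seq)
    for _ in range(k):
        t = draft_next(s)
        draft.append(t)
        s.append(t)

    # Target model verifies each drafted token against its own greedy choice.
    # (In the real algorithm this verification happens in one parallel forward
    # pass over all k positions; the loop here just mirrors the logic.)
    accepted, s = [], list(seq)
    for t in draft:
        expected = target_next(s)
        if t != expected:
            accepted.append(expected)  # first mismatch: emit the target's token instead
            return accepted
        accepted.append(t)
        s.append(t)

    accepted.append(target_next(s))    # all k accepted: target adds one bonus token
    return accepted
```

With any deterministic draft/target functions, the tokens this returns match what greedy decoding with the target alone would emit; a worse draft model only lowers the acceptance rate (and the speedup), not the output.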
That's right, it doesn't affect it (at least in the current exllamav2 version) provided the vocab matches. I've seen people have issues with perplexity when they use the wrong draft model.
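A quick sanity check that a candidate draft model really shares the target's vocab (a sketch using the Hugging Face tokenizers rather than the exl2 quants directly; the repo IDs are just the base models, swap in whichever finetunes you actually run):

```python
from transformers import AutoTokenizer

target = AutoTokenizer.from_pretrained("mistralai/Mistral-Large-Instruct-2407")
draft = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

tv, dv = target.get_vocab(), draft.get_vocab()
print("vocab sizes:", len(tv), len(dv))
print("identical token -> id mapping:", tv == dv)
```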
No worries, the past testing won't affect whether we try it out (it's already in the queue); we're simply looking into other things first that have been on the backburner for a while.