[Word Vomit pretty early on]
I've been searching for SLMs in the 7B-20B range with longer context lengths, and I stumbled upon this one (8K).
According to the evals, this model is supposed to be better than Mistral's first 7B Instruct model. That is not the experience I'm having.
It suffers from the repetition problem (word vomit) & it's pretty bad. I did not see this problem with the original Mistral model or any of its fine-tunes.
Another reason we shouldn't blindly trust evals. Data contamination is a real issue & we need to figure out the best way to make sure eval datasets don't leak into pre-training, SFT, or preference datasets.
Hey, thanks for the comment @vikram0711
Two questions:
Did you use the correct prompt template/system messages?
Do you have a reproducible notebook so I can examine what you did? Specifically: did you assess each model with its appropriate prompt template, and did you use the same generation parameters for both? I'd like to see how you're reaching this conclusion.
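For context, here's the kind of minimal check I mean. This is only a sketch (the Hub model ids and the prompt are illustrative, not your setup): it pulls each model's own Jinja chat template from its tokenizer_config.json rather than hand-formatting prompts, so you can confirm both models are being prompted the way they expect.

```python
from transformers import AutoTokenizer

# Each model ships its own Jinja chat template in tokenizer_config.json;
# apply_chat_template formats the prompt with it, so nothing is hand-written.
messages = [{"role": "user", "content": "Summarize the paragraph below in two sentences."}]

for model_id in ["Deci/DeciLM-7B-instruct", "mistralai/Mistral-7B-Instruct-v0.1"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(f"--- {model_id} ---\n{prompt}\n")
```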
Here are a few resources you can read (with links to notebooks so you can reproduce) which compare DeciLM to Mistral:
- https://deci.ai/blog/llm-evaluation-and-how-decoding-strategies-impact-instruction-following/
- https://deci.ai/blog/decilm-7b-vs-mistral-7b-on-chain-of-thought-tasks/
Cheers
Yes, I used the correct chat template; the Jinja template in `tokenizer_config.json` is the one I used. Unfortunately, I cannot share my notebook since it contains proprietary company datasets. But I can assure you that I did assess each model with its appropriate template, and I used greedy sampling (`do_sample=False`) for both. The Mistral fine-tune I'm using performs significantly better in comparison.

Even as a standalone model for this use case, I'm seeing subpar performance at best from `DeciLM-7B-instruct`. I expected to see word vomit only as the size of the context passed in the prompt increases (see here), but I started observing the problem at smaller context sizes as well. It could be that the task itself is too complicated for the model, I'm not sure. Maybe the fact that this model wasn't aligned at all is an issue, or it could be a problem with the SFT too, because I'm not too happy with its instruction following either.

Maybe it's just that this model isn't well suited for my use case, or I might need to tune the prompt I'm using, but these are my observations off the bat, with the ChatGPT prompt that was being used for this use case.
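Since I can't share the notebook, here's roughly the setup, as a sketch only: a toy prompt stands in for the proprietary one, and the second model id is a placeholder for our internal fine-tune. Both models go through their own chat template and the exact same greedy decoding config.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_greedy(model_id: str, messages: list[dict]) -> str:
    """Format with the model's own chat template and decode greedily."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,  # DeciLM uses custom modeling code
    )
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Identical decoding settings for both models: greedy, no sampling knobs.
    output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)


# Toy prompt standing in for the proprietary one; the same messages go to both models.
messages = [{"role": "user", "content": "Summarize the key points of the text below.\n\n<long context here>"}]
for model_id in ["Deci/DeciLM-7B-instruct", "my-org/mistral-7b-finetune"]:  # second id is a placeholder
    print(f"=== {model_id} ===\n{generate_greedy(model_id, messages)}\n")
```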
@vikram0711, I'm happy to hop on a call with you if you'd like.