Chat template
The "</s>
" (EOS) in the current chat template seems to stop the model from generating tokens. Removing the "</s>
" works for me.
FWIW the upstream chat template in the config json is super weird and hard to read...
Huh, something does seem a bit odd. The model card suggests the normal-looking Mistral prompt format (including the goofy whitespace):
<s>[INST] {prompt}[/INST] </s>
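For a multi-turn exchange, that format is usually chained like the line below (my reading of the common Mistral convention, with made-up messages; exact spacing varies), which makes a useful comparison against the ChatML chat_example further down:
<s>[INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]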
The model loading debug log in llama.cpp seems to match the expected Mistral tokens:
tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
However, the model loading debug log shows what looks like a ChatML prompt format chat_example:
chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n"
In practice, when I use the Mistral format, the response always begins with that tell-tale extra whitespace. When I use a ChatML-format prompt, it doesn't start with the extra whitespace. Using ChatML also tends not to stop inference prematurely in my limited testing.
I'm not 100% sure which one to actually use, but it likely matters, especially for how you do the system prompt. I'm leaning towards ChatML, given that it does not produce the extra whitespace and tends to keep generating without ending prematurely.
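To make the system prompt difference concrete (this is just my reading of the two conventions, not something from the model card): ChatML gets a dedicated system turn, while Mistral-style [INST] templates typically fold the system text into the first user turn, something like:
ChatML: <|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n
Mistral-style: <s>[INST] You are a helpful assistant\n\nHello[/INST]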
Yeah, that's my experience as well; the ChatML prompt also works.
I can't really tell the difference between the outputs of the [INST] and ChatML prompts.
If you use llama.cpp, add --chat-template llama2 to your commands.
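For example, with the llama.cpp server (binary name and model path are placeholders and depend on your build, so treat this as a sketch); the built-in llama2 template produces the [INST]-style prompt:
./llama-server -m ./model.gguf --chat-template llama2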