Unexpected Behaviors
Using ChatML for instruct formatting.
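Roughly, the ChatML wrapping being sent looks like this (the exact system prompt and whitespace depend on the SillyTavern instruct template, so take it as a sketch):

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```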
I've observed some peculiar behaviors while using the ExL2 8.0bpw quant through Oobabooga's text generation web UI and SillyTavern for roleplay:
Formatting Inconsistencies
Name Formatting: The model frequently starts responses with a character name, but formats it inconsistently (e.g., sometimes wrapping the name in asterisks or other markdown, sometimes not).
Long Context Behavior: In longer contexts (approximately 8k tokens), the model occasionally abandons message formatting entirely, reverting to plain text for dialogue instead of using "quotation marks", for instance.
Context Retention Issues
Persona Shifts: At times, the model seems to disregard context completely, reverting to a default "assistant" persona and responding out of character (such as commenting on the narrative instead of maintaining character roleplay). This appears to occur randomly.
I suspect this might be related to the new tokenizer, but I'm honestly just guessing.
The tokenizer is the problem; I had to modify it to get it working properly. I deleted the entries referring to the EOS token "</s>" in the tokenizer file, just below where it says:
"post_processor": {
"type": "TemplateProcessing",
It was breaking twinbook, Oobabooga's notebook, the "Start reply with" option, and the continue button. The tokenizer is weirdly incompatible with anything that tries to make the model continue a generation.
I don't know if this would solve your problems but it did solve mine.
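In case it helps, here's a rough sketch of how that edit could be scripted instead of done by hand; the exact layout of tokenizer.json can vary between uploads, so back up the file and double-check the structure before overwriting anything:

```python
import json

# Rough sketch: strip the appended </s> (EOS) entries from the
# TemplateProcessing post_processor in tokenizer.json.
# Back up the original file first; the structure may differ per upload.
path = "tokenizer.json"  # adjust to your model folder

with open(path, "r", encoding="utf-8") as f:
    tok = json.load(f)

pp = tok.get("post_processor") or {}
if pp.get("type") == "TemplateProcessing":
    for key in ("single", "pair"):
        # Drop any SpecialToken entry that references </s>
        pp[key] = [
            entry for entry in pp.get(key, [])
            if entry.get("SpecialToken", {}).get("id") != "</s>"
        ]
    # Remove the </s> definition from the special_tokens map as well
    pp.get("special_tokens", {}).pop("</s>", None)

with open(path, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```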
The tokenizer was already fixed on the official upload by Mistral, but most quantizations still have the broken tokenizer. I opened a pull request to get it fixed here as well.
Damn, did not expect such a fast reply AND fix. Will probably use some fine-tune/merge instead of the basic model, but I appreciate the effort.
bless
Anyone manage to get this working? I tried 8bpw on Oobabooga's text generation web UI after this tokenizer patch.
It works fine with short context, but with longer context the answers become erratic and appear to be cut short.
What is the prompt format that you guys use? I use [INST] {your prompt} [/INST].
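For what it's worth, one way to check the exact wrapping is to let the tokenizer's own chat template build it; the model id below is just a placeholder, swap in whatever you're actually running:

```python
from transformers import AutoTokenizer

# Sketch: build the [INST] ... [/INST] prompt from the model's own chat
# template instead of hand-rolling it. The model id is only an example.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "your prompt"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # roughly: "<s>[INST] your prompt [/INST]"
```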
I use 8bpw in exui with Chat-RP and ChatML works almost flawlessly. 32k context uses ~20gb on 4090 with FP16 cache mode.
Are you aware of how to delete previous chat history from the exui API? The answers keep degrading and getting shorter as I prompt repeatedly. I suspect it's the history.
Sounds like the cache quantization is too aggressive or the context length isn't enough.
To clear the cache you can reload the model or call get_loaded_model().cache.reset()
somewhere in backend/sessions.py: https://pastebin.com/fGsFc9Wm
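Something along these lines, for example; the import path and function name below are my own guesses, only get_loaded_model().cache.reset() is the actual call mentioned above:

```python
# Rough sketch of a reset hook in exui's backend/sessions.py.
# The import path and the function name are assumptions on my part;
# get_loaded_model().cache.reset() is the call referred to above.
from backend.models import get_loaded_model  # adjust to the real module path

def clear_generation_cache():
    model = get_loaded_model()
    if model is not None and model.cache is not None:
        # Drops the accumulated KV cache so old chat history no longer
        # bleeds into new generations.
        model.cache.reset()
```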