Unexpected Behaviors
Using ChatML for instruct formatting.
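Roughly, the ChatML wrapping being sent looks like this (the exact system prompt and whitespace depend on the SillyTavern instruct template, so take it as a sketch):

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```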
I've observed some peculiar behaviors while using the ExL2 8.0bpw quant through Oobabooga's text generation web UI and SillyTavern for roleplay:
Formatting Inconsistencies
Name Formatting: The model frequently starts responses with a character name, but formats it inconsistently (e.g., sometimes wrapping the name in asterisks or other markdown, sometimes not).
Long Context Behavior: In longer contexts (approximately 8k tokens), the model occasionally abandons message formatting entirely, reverting to plain text for dialogue instead of using "quotation marks", for instance.
Context Retention Issues
Persona Shifts: At times, the model seems to disregard context completely, reverting to a default "assistant" persona and responding out of character (such as commenting on the narrative instead of maintaining character roleplay). This appears to occur randomly.
I suspect this might be related to the new tokenizer, but I'm honestly just guessing.
The tokenizer is the problem; I had to modify it to get it working properly. I deleted the entries referring to the EOS token "</s>" in the tokenizer file, just below where it says:
"post_processor": {
"type": "TemplateProcessing",
It was breaking twinbook, Oobabooga's notebook, the "Start reply with" option, and the continue button. The tokenizer is weirdly incompatible with anything that tries to make the model continue a generation.
I don't know if this would solve your problems but it did solve mine.
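In case it helps, here's a rough sketch of how that edit could be scripted instead of done by hand; the exact layout of tokenizer.json can vary between uploads, so back up the file and double-check the structure before overwriting anything:

```python
import json

# Rough sketch: strip the appended </s> (EOS) entries from the
# TemplateProcessing post_processor in tokenizer.json.
# Back up the original file first; the structure may differ per upload.
path = "tokenizer.json"  # adjust to your model folder

with open(path, "r", encoding="utf-8") as f:
    tok = json.load(f)

pp = tok.get("post_processor") or {}
if pp.get("type") == "TemplateProcessing":
    for key in ("single", "pair"):
        # Drop any SpecialToken entry that references </s>
        pp[key] = [
            entry for entry in pp.get(key, [])
            if entry.get("SpecialToken", {}).get("id") != "</s>"
        ]
    # Remove the </s> definition from the special_tokens map as well
    pp.get("special_tokens", {}).pop("</s>", None)

with open(path, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```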
The tokenizer was already fixed on the official upload by Mistral, but most quantizations still have the broken tokenizer. I opened a pull request to get it fixed here as well.
Damn, did not expect such a fast reply AND fix. Will probably use some fine-tune/merge instead of the basic model, but I appreciate the effort.
bless
Anyone manage to get this working? I tried 8bpw on Oobabooga's text generation web UI after this tokenizer patch.
It works fine with short context, but with longer context the answers become erratic and appear to be cut short.
What is the prompt format that you guys use? I use [INST] {your prompt} [/INST].
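For what it's worth, one way to check the exact wrapping is to let the tokenizer's own chat template build it; the model id below is just a placeholder, swap in whatever you're actually running:

```python
from transformers import AutoTokenizer

# Sketch: build the [INST] ... [/INST] prompt from the model's own chat
# template instead of hand-rolling it. The model id is only an example.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "your prompt"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # roughly: "<s>[INST] your prompt [/INST]"
```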
I use 8bpw in exui with Chat-RP and ChatML works almost flawlessly. 32k context uses ~20gb on 4090 with FP16 cache mode.
Are you aware of how to delete previous chat history from the exui API? The answers keep degrading and getting shorter as I prompt repeatedly. I suspect it's the history.
Sounds like the cache quantization is too aggressive or the context length isn't enough.
To clear the cache you can reload the model or call get_loaded_model().cache.reset()
somewhere in backend/sessions.py: https://pastebin.com/fGsFc9Wm
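Something along these lines, for example; the import path and function name below are my own guesses, only get_loaded_model().cache.reset() is the actual call mentioned above:

```python
# Rough sketch of a reset hook in exui's backend/sessions.py.
# The import path and the function name are assumptions on my part;
# get_loaded_model().cache.reset() is the call referred to above.
from backend.models import get_loaded_model  # adjust to the real module path

def clear_generation_cache():
    model = get_loaded_model()
    if model is not None and model.cache is not None:
        # Drops the accumulated KV cache so old chat history no longer
        # bleeds into new generations.
        model.cache.reset()
```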