No thinking tags when it runs?

#1
by Disdrix - opened

Getting an issue where thinking tags never appear, which results in no markdown formatting, making the model unusable. I've only tested it with Kimi-K2-Thinking-UD-Q3_K_XL. This works fine with DeepSeek R1, for example. I don't know if this is an issue with the latest llama.cpp or the GGUF.

I run with the following:
llama-server --model Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf -ts 99,0 -fa on --temp 1.0 --min-p 0.01 -c 131072 --threads 38 -ngl 99 --n-cpu-moe 54

image

I have this problem too with GLM-4.5 and GLM-4.6 quantized by Unsloth, but it's random. The </think> tag only appears about 80% of the time, and the longer the context, the lower the probability of success. Not sure if it's related.

Unsloth AI org

You have to pass --special; then the think tokens come up and you'll see them. This is normal, expected behavior.

CC: @Disdrix @AliceThirty
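For example, appended to the invocation from the original report (same model file and flags assumed):

```shell
llama-server --model Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf \
  -ts 99,0 -fa on --temp 1.0 --min-p 0.01 -c 131072 \
  --threads 38 -ngl 99 --n-cpu-moe 54 --special
```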

Thanks @danielhanchen - that worked. It would be good if this were added to the unsloth guide, as it doesn't seem documented anywhere and I've seen several people asking in various forums.

The only downside is that now it ends every answer with <|im_end|>. Maybe a template issue?

Also, it seems to have an identity crisis and thinks it's Claude. Probably an issue with the base model, but funny.
image

The only downside is that now it ends every answer with <|im_end|>.

Intended behavior when printing special tokens; <|im_end|> is a special token, after all. You can set <|im_end|> as a stop string in Open WebUI.

image

Most frontends (ST, etc.) should support that.
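If a frontend doesn't expose a stop-string setting, the trailing marker can also be stripped after the fact. A minimal sketch using sed on captured output (the example string is made up):

```shell
# Strip a trailing <|im_end|> marker from captured model output.
# In a basic regex, "<", "|", and the other marker characters are all
# literal, so this only removes the exact marker at the end of the line.
printf '%s\n' 'Once upon a time...<|im_end|>' | sed 's/<|im_end|>$//'
# prints: Once upon a time...
```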

Also, it seems to have an identity crisis and thinks it's Claude.

Makes sense

Unsloth AI org

Ok we'll add it to the guide @dsg22 thanks!

Unsloth AI org

Thanks @danielhanchen - that worked. It would be good if this were added to the unsloth guide, as it doesn't seem documented anywhere and I've seen several people asking in various forums.

The only downside is that now it ends every answer with <|im_end|>. Maybe a template issue?


We added it here: https://docs.unsloth.ai/models/kimi-k2-and-thinking-how-to-run-locally#thinking-tags

Here is how to get rid of the <|im_end|>. Add this to the custom JSON:

{"prompt": "...", "stop": ["<|im_end|>"]}

image

Another update: after some number of messages back and forth, markdown fails again. It seems that --special did not fix the issue entirely.

It seems to recur around 2000 tokens. I did a "write a story" initial prompt and then "continue the story" a couple of times.

Here, you can see it didn't even bother reasoning and went straight to normal text generation. It then did a weird thing where it added a think token and repeated this section of the story again exactly. Sometimes at this point it will actually reason, but won't produce markdown. After it finished, I did another "continue the story" and it did reason this time, but still no markdown.

image

image

image
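When debugging this failure mode, it can help to script a check over saved responses for whether the closing tag ever appeared. A minimal sketch, with a made-up response string:

```shell
# Hypothetical captured response; replace with real model output.
response='<think>some reasoning</think>The story continues.'

# grep -q exits 0 if the pattern is found, so this prints one line either way.
if printf '%s' "$response" | grep -q '</think>'; then
  echo "think block closed"
else
  echo "think block missing"
fi
# prints: think block closed
```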

Unsloth AI org

@Disdrix Could you try setting min_p = 0.01 and temperature = 0.8 to 1.0?

@Disdrix Could you try setting min_p = 0.01 and temperature = 0.8 to 1.0?

This is what I am doing already. The problem always starts after ~2k tokens; I've tried many new chats and it happens 100% of the time in that range.

I have not yet tried using something other than llama.cpp to rule that out as the issue.

Just updated again to the latest llama.cpp and it persists. I did find some interesting behavior, however: if I am writing a story and then suddenly prompt the AI with "hi", it will reason again.

Disdrix changed discussion status to closed
Disdrix changed discussion status to open

I'm not convinced we should be using it with only temp 1.0 and min-p 0.01 at these lower quants. The folk theory is that the harsher the quantization, the higher the effective temperature compared to the unquantized model.

I think I'm going to start at more like temp 0.7 and min-p 0.05 for UD-TQ1_0 next time I play with it.
