The quant breaks when the model goes above 19000-19400 context in chat.

#1
by Panzer333

Heya, I mostly use 12B models with EXL2 quants at 32k context, and as the title says I'm hitting a weird issue with this quant: the model stops generating output once the chat reaches a certain threshold, around 19124 context. This is also the only available EXL2 quant of v4.1, so I didn't have another quant of the same model to test against, but I tried Unslop-Nemo-12B-V3-exl2-6bpw and it works fine with 32k context. Since I'm pretty much a beginner, I don't have much of a grasp of what might have gone wrong with this quant.

Hmm, works fine on my end with exllama version 0.2.3:

13:31:41-325658 INFO Loading "Jellon_UnslopNemo-12B-v4.1-exl2-6bpw"
13:34:01-158781 INFO Loaded "Jellon_UnslopNemo-12B-v4.1-exl2-6bpw" in 139.83 seconds.
13:34:01-162773 INFO LOADER: "ExLlamav2_HF"
13:34:01-163769 INFO TRUNCATION LENGTH: 24576
13:34:01-165764 INFO INSTRUCTION TEMPLATE: "Alpaca"
Output generated in 26.37 seconds (26.69 tokens/s, 704 tokens, context 22836, seed 498578960)

How are you loading the model and which exllama version are you using? And are you getting any error message?
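One way to narrow it down might be to load the quant directly with the exllamav2 Python API, outside the webui, and push a long prompt through it. Below is a rough sketch of such a test, not something from this thread: the model path and the filler prompt are made up for illustration, and it assumes a recent exllamav2 (0.2.x) with its dynamic generator:

```python
# Minimal standalone check of the quant outside the webui.
# Assumptions: exllamav2 ~0.2.x installed and the quant downloaded
# locally; the path below is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/path/to/UnslopNemo-12B-v4.1-exl2-6bpw"  # placeholder

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Allocate the cache for the full 32k context the base model advertises.
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Crude filler prompt of very roughly 20k tokens, i.e. past the
# ~19k point where generation reportedly stalls.
prompt = "some text " * 10000
output = generator.generate(prompt=prompt, max_new_tokens=64, add_bos=True)
print(output[-300:])  # output includes the prompt; show just the tail
```

If that also stalls around the same token count, the quant or exllamav2 itself is the likely culprit; if it generates fine past 20k, I'd look at the webui settings (loader, truncation length) instead.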
