matatonic/Xwin-LM-70B-V0.1-exl2-4.800b

My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G of VRAM. It seems to have hit a sweet spot in evaluations.

Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
Fits in 48G (2x24G) of VRAM at 4k or 8k context, with or without the 8bit cache enabled.
Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5
An alpha_value at or over 1.75 seems to result in an occasional 'stutter', which is very obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
It seems to have hit a lucky quantization: the 4.800b scored better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b and even the 5.000b!
Experimentation has shown that an alpha_value of 1.6 instead of 1.75 seems better at 1.5x context (6144) and even at 1.5625x (6400).
It may be obvious to some, but using the 8bit cache has no perplexity impact.
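
For anyone wondering where the multipliers come from: they are simply the ratio to the 4096-token native context (6144/4096 = 1.5, 6400/4096 = 1.5625). The sketch below also shows how an alpha_value is commonly turned into a larger RoPE base under NTK-aware scaling. The rule base * alpha^(d/(d-2)), the base of 10000 and the head dimension of 128 are assumptions about the loader, not something measured here, and the function names are only illustrative.

```python
# Sketch: context ratios and NTK-aware RoPE base scaling.
# Assumptions (not taken from the measurements in this card):
#   - RoPE base of 10000 and head_dim of 128
#   - scaling rule: base * alpha ** (head_dim / (head_dim - 2))

NATIVE_CTX = 4096  # native context length of the original model


def context_ratio(ctx: int, native_ctx: int = NATIVE_CTX) -> float:
    """Return how many times larger ctx is than the native context."""
    return ctx / native_ctx


def scaled_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the RoPE base after applying an NTK-style alpha_value."""
    return base * alpha ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    # (context, alpha_value) pairs taken from the perplexity table
    for ctx, alpha in [(5120, 1.375), (6144, 1.75), (6400, 1.6), (8192, 2.5)]:
        print(
            f"ctx={ctx} ratio={context_ratio(ctx):.4f} "
            f"alpha={alpha} rope_base={scaled_rope_base(alpha):.0f}"
        )
```

Note that the recommended 1.6 is only slightly above the 1.5625x context ratio it is paired with, while the higher alpha values in the table overshoot their ratios by more.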

Made using exllamav2/convert.py with the following command:

python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
 -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 -o tmp/ \
 -c parquet/wikitext-test.parquet \
 -b 4.800
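
As a rough sanity check on why a 4.8bpw target suits 48G of VRAM, the arithmetic can be sketched as below. The parameter count (about 69 billion) and the attention shape (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the Llama-2-70B base architecture, and the estimate ignores activations and framework overhead, so treat the output as ballpark only.

```python
# Back-of-the-envelope VRAM estimate for a quantized 70B model.
# Assumptions (hypothetical, not measured): ~69e9 parameters, 80 layers,
# 8 KV heads, head_dim 128, and a cache of 2 bytes (fp16) or 1 byte (8bit)
# per element.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bpw / 8


def kv_cache_bytes(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate key/value cache size in bytes (keys and values per layer)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx


if __name__ == "__main__":
    weights = weight_bytes(69e9, 4.8)
    print(f"weights @ 4.8bpw: {weights / GIB:.1f} GiB")
    for ctx in (4096, 6400, 8192):
        fp16 = kv_cache_bytes(ctx)
        fp8 = kv_cache_bytes(ctx, bytes_per_elem=1)
        print(f"ctx={ctx}: cache fp16 {fp16 / GIB:.2f} GiB, 8bit {fp8 / GIB:.2f} GiB")
```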

Perplexity (wikitext) evaluated as:

Model	Perplexity	Comment (alpha_value)
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.21780776977539	4096 ctx
matatonic_Xwin-LM-70B-V0.1-exl2-4.900b	3.2188525199890137	4096 ctx (not released)
firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw	3.22019362449646	4096 ctx (8b cache)
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.239454746246338	5120 ctx (1.375)
LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2	3.2419090270996094	4096 ctx
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.2434027194976807	6400 ctx (1.6)
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.2434027194976807	6400 ctx (1.6, 8b cache)
xwin-lm-70b-v0.1.Q4_K_S.gguf	3.2480294704437256	4096 ctx
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.253002405166626	6144 ctx (1.75)
TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True	3.266364574432373	4096 ctx
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.278069496154785	6656 ctx (1.95)
TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True	3.2803425788879395	4096 ctx
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.304278612136841	7168 ctx (2.125)
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b	3.359946727752685	8192 ctx (2.5)

*) It should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far my attempts to evaluate its perplexity have not been successful.
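
To put the extended-context rows in perspective, the short script below expresses each 4.800b result as a percentage increase over its own 4096-context baseline. The perplexity values are copied from the table above; the helper name is only illustrative.

```python
# Relative perplexity cost of extended context for the 4.800b quant.
# All numbers are copied from the perplexity table in this card.

BASELINE = 3.21780776977539  # 4.800b at 4096 ctx, no alpha_value

# (context, alpha_value, perplexity)
RESULTS = [
    (5120, 1.375, 3.239454746246338),
    (6144, 1.75, 3.253002405166626),
    (6400, 1.6, 3.2434027194976807),
    (6656, 1.95, 3.278069496154785),
    (7168, 2.125, 3.304278612136841),
    (8192, 2.5, 3.359946727752685),
]


def pct_increase(ppl: float, baseline: float = BASELINE) -> float:
    """Return the perplexity increase over the baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


if __name__ == "__main__":
    for ctx, alpha, ppl in RESULTS:
        print(f"{ctx} ctx (alpha {alpha}): +{pct_increase(ppl):.2f}%")
```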