High-precision quant of 🚀Reflection-Llama-3.1-70B🚀

This quant keeps about 99.90% of the fp16 perplexity score (5.2416/5.2468) at a 50GB file size, whereas fp8 (not tested on this model) is known to be 97-98.8%.

I'm only posting one quant because these are really annoying to make and I haven't automated the process yet. It takes 30+ iterations of the model, since I have to recompile llama.cpp at every build/test step until I find the per-weight quantization config with the lowest perplexity loss. The end result saves 5GB of space vs. regular q6_k.

🐧 To download faster on Linux: sudo apt install -y aria2
🍎 On Mac: brew install aria2

These commands open 9 connections per file (-x 9), so they can download up to 9x faster. Feel free to paste them in all at once or one at a time.

aria2c -x 9 -o reflection-70b-precisequant-6bpw-00001-of-00002.gguf https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main/reflection-70b-precisequant-6bpw-00001-of-00002.gguf

aria2c -x 9 -o reflection-70b-precisequant-6bpw-00002-of-00002.gguf https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main/reflection-70b-precisequant-6bpw-00002-of-00002.gguf
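If you would rather not paste each command separately, the two downloads above can be wrapped in a small loop. This is only a sketch: shard_names and download_shards are made-up helper names, the -c flag (resume a partial download) is an addition that is not in the original commands, and the URL and file names are copied from the commands above.

```shell
# Sketch only: shard_names and download_shards are hypothetical helper names.
BASE_URL="https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main"
PREFIX="reflection-70b-precisequant-6bpw"
TOTAL=2

# Print one shard file name per line (00001-of-00002, 00002-of-00002).
shard_names() {
  i=1
  while [ "$i" -le "$TOTAL" ]; do
    printf '%s-%05d-of-%05d.gguf\n' "$PREFIX" "$i" "$TOTAL"
    i=$((i + 1))
  done
}

# Fetch each shard with 9 connections; -c (an addition) resumes partial files.
download_shards() {
  shard_names | while read -r name; do
    aria2c -c -x 9 -o "$name" "$BASE_URL/$name" </dev/null
  done
}
```

Calling download_shards from the directory where you want the files fetches both parts in order.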

Prompt file with correct template

🐧 Make a file called reflectionprompt.txt, paste this in, and change it as needed:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
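If you want to fill in the {prompt} placeholder from the command line instead of editing the file by hand, something like the following could work. It is a sketch under assumptions: fill_prompt is a hypothetical helper name, the template is assumed to be saved as reflectionprompt.txt, and a literal string replace in awk is just one way to do it.

```shell
# Sketch only: fill_prompt is a hypothetical helper name.
# Replace every literal "{prompt}" in a template file with the given text
# and print the result to stdout. The text is passed through the environment
# so that characters like & and \ are inserted unchanged.
fill_prompt() {
  QUESTION="$2" awk '
    {
      line = $0
      while ((pos = index(line, "{prompt}")) > 0) {
        printf "%s%s", substr(line, 1, pos - 1), ENVIRON["QUESTION"]
        line = substr(line, pos + length("{prompt}"))
      }
      print line
    }' "$1"
}

# Example (not run here; the output file name is a placeholder):
# fill_prompt reflectionprompt.txt "What is 2+2?" > myprompt.txt
```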

To run the model in a terminal with multiline input, find the location of the first (00001) gguf file, then run:

./llama-cli -ngl 99 -m reflection-70b-precisequant-6bpw-00001-of-00002.gguf -f reflectionprompt.txt --prompt-cache random.cache --keep -1 -fa -cnv -c 32000 -co -e -mli --temp 0
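If you are not sure where the first shard ended up, a search like the one below can locate it. This is a sketch: first_shard is a hypothetical helper name, and the pattern assumes the file keeps the -00001-of- naming used in the download commands.

```shell
# Sketch only: first_shard is a hypothetical helper, not part of llama.cpp.
# Print the path of the first split file under a directory (default: current).
first_shard() {
  find "${1:-.}" -type f -name '*-00001-of-*.gguf' 2>/dev/null | sort | head -n 1
}

# Example (not run here): pass the result to llama-cli's -m flag.
# ./llama-cli -m "$(first_shard "$HOME")" -f reflectionprompt.txt -cnv
```

Only the 00001 file is passed to -m, as in the command above.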

Perplexity benchmarks: as you can see, the accuracy of the quant is 5.2416/5.2468 ≈ 99.90% +-0.02%

Float16 - 143GB - perplexity: calculating perplexity over 64 chunks, n_ctx=512, batch_size=2048, n_seq=4
16.92 seconds per pass - ETA 4.50 minutes
[1]4.0486,[2]4.6471,[3]3.9394,[4]3.4698,[5]3.2290,[6]3.0391,[7]3.1640,[8]3.1819,[9]3.2073,[10]3.3374,[11]3.5247,[12]3.7371,[13]3.9944,[14]4.0065,[15]4.1234,[16]4.1503,[17]4.2893,[18]4.4968,[19]4.4347,[20]4.4439,[21]4.5403,[22]4.4419,[23]4.2888,[24]4.2224,[25]4.1259,[26]4.0495,[27]4.0324,[28]4.0221,[29]4.0838,[30]4.1170,[31]4.1588,[32]4.1664,[33]4.2095,[34]4.2723,[35]4.3194,[36]4.4006,[37]4.4192,[38]4.4598,[39]4.4861,[40]4.5294,[41]4.5674,[42]4.5571,[43]4.6098,[44]4.6025,[45]4.7148,[46]4.7590,[47]4.7303,[48]4.6854,[49]4.6778,[50]4.7118,[51]4.7762,[52]4.7682,[53]4.8604,[54]4.8778,[55]4.9023,[56]4.9398,[57]4.9594,[58]4.9813,[59]4.9653,[60]5.0095,[61]5.0626,[62]5.1179,[63]5.1774,[64]5.2416,
Final estimate: PPL = 5.2416 +/- 0.09238

6bpw - 50GB - perplexity: calculating perplexity over 64 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 23.59 seconds per pass - ETA 6.28 minutes
[1]4.0767,[2]4.6657,[3]3.9513,[4]3.4823,[5]3.2487,[6]3.0724,[7]3.1902,[8]3.2125,[9]3.2384,[10]3.3744,[11]3.5567,[12]3.7686,[13]4.0223,[14]4.0309,[15]4.1456,[16]4.1740,[17]4.3123,[18]4.5194,[19]4.4535,[20]4.4623,[21]4.5580,[22]4.4580,[23]4.3051,[24]4.2390,[25]4.1393,[26]4.0586,[27]4.0414,[28]4.0307,[29]4.0909,[30]4.1243,[31]4.1653,[32]4.1725,[33]4.2153,[34]4.2791,[35]4.3258,[36]4.4072,[37]4.4263,[38]4.4676,[39]4.4944,[40]4.5377,[41]4.5755,[42]4.5648,[43]4.6176,[44]4.6105,[45]4.7227,[46]4.7669,[47]4.7393,[48]4.6918,[49]4.6836,[50]4.7175,[51]4.7818,[52]4.7738,[53]4.8659,[54]4.8834,[55]4.9086,[56]4.9452,[57]4.9649,[58]4.9874,[59]4.9718,[60]5.0159,[61]5.0686,[62]5.1238,[63]5.1833,[64]5.2468,
Final estimate: PPL = 5.2468 +/- 0.09258
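
To redo the comparison on your own runs, the final perplexity can be pulled out of each log and divided. This sketch assumes each run's output was saved to a text file containing a "Final estimate: PPL = ..." line like the ones above; final_ppl and ppl_ratio are hypothetical helper names and the log file names are placeholders.

```shell
# Sketch only: final_ppl and ppl_ratio are hypothetical helper names.
# Extract the PPL value from a saved log's "Final estimate" line.
final_ppl() {
  sed -n 's/^Final estimate: PPL = \([0-9.]*\) .*/\1/p' "$1" | tail -n 1
}

# Baseline PPL divided by quantized PPL, as a percentage with two decimals.
ppl_ratio() {
  awk -v base="$1" -v quant="$2" 'BEGIN { printf "%.2f\n", 100 * base / quant }'
}

# Example (not run here; file names are placeholders):
# ppl_ratio "$(final_ppl fp16.log)" "$(final_ppl 6bpw.log)"
```

Lower perplexity is better, so a ratio just under 100% means the quant is only slightly behind the fp16 baseline.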