The difference between Q8 and F16 models

#2
by BigBoss0000 - opened

Is there any big difference between the Q8 and F16 models? Like, will the F16 model perform 2 times better than Q8?

No, the difference is so small it can barely be measured. I would say i1-Q5_K_M and larger show no meaningful difference from the unquantized model. Instead of Q8 I recommend using i1-Q6 from https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-i1-GGUF. Here are some plots I created a week ago for some other models. The small "i" prefix on a plot means that weighted/imatrix quants are used.

[Plots: KL divergence, perplexity, probability of the quant generating the same token, correct token probability, and eval scores (ARC, MMLU, Winogrande) per quant level, for dolphin-2.9.3-qwen2-0.5b, dolphin-2.9.3-qwen2-1.5b, Phi-3.5-mini-instruct_Uncensored, dolphin-2.9.3-mistral-nemo-12b-llamacppfixed, and dolphin-2.9.1-qwen-110b.]
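For anyone curious how those four metrics are computed, here is a minimal NumPy sketch. The function and array names are illustrative, and it assumes you have already dumped per-token logits from the full-precision and quantized model at the same positions on the same text:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def quant_quality(base_logits, quant_logits, next_tokens):
    """Compute the per-token metrics shown in the plots.

    base_logits, quant_logits: (n_tokens, vocab_size) logits from the
    full-precision and quantized model on the same text.
    next_tokens: (n_tokens,) ids of the actual next token at each position.
    """
    logp = log_softmax(base_logits)   # reference distribution
    logq = log_softmax(quant_logits)  # quantized distribution
    p = np.exp(logp)

    # Mean KL(base || quant): how far the quant's output distribution drifts.
    kl = (p * (logp - logq)).sum(axis=-1).mean()

    # Probability that greedy decoding picks the same token in both models.
    same_token = (base_logits.argmax(-1) == quant_logits.argmax(-1)).mean()

    # Mean probability the quant assigns to the correct (actual) next token.
    idx = np.arange(len(next_tokens))
    correct_prob = np.exp(logq[idx, next_tokens]).mean()

    # Perplexity of the quant on this text.
    ppl = np.exp(-logq[idx, next_tokens].mean())

    return kl, same_token, correct_prob, ppl
```

In practice, llama.cpp's llama-perplexity tool can produce these statistics directly for GGUF quants; as far as I know, you first save the base model's logits with --kl-divergence-base and then run each quant with --kl-divergence.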

Thanks for all those graphs. I'm not very familiar with LLMs, but I think I understood.

BigBoss0000 changed discussion status to closed
