Robert Sinclair

ZeroWw

AI & ML interests

LLM optimization (model quantization and back-end optimizations) so that LLMs can run on the computers of people who still have both kidneys. Discord: https://discord.com/channels/@robert_46007

Recent Activity

new activity 18 days ago

xai-org/grok-2:Possibility of Open Sourcing Grok-2 Mini?

new activity 18 days ago

xai-org/grok-2:Grok 3 release time?

updated a model 2 months ago

ZeroWw/Llama-3.3-8B-Instruct-GGUF


New activity in xai-org/grok-2 18 days ago

Possibility of Open Sourcing Grok-2 Mini?

➕ 33

#5 opened 6 months ago by

ConicCat

Grok 3 release time?

➕ 1

#44 opened 2 months ago by

ZeroWw

updated a model 2 months ago

ZeroWw/Llama-3.3-8B-Instruct-GGUF

Text Generation • 8B • Updated Dec 31, 2025 • 20

liked a model 2 months ago

shb777/Llama-3.3-8B-Instruct-128K

Text Generation • Updated Jan 3 • 970 • 47

published a model 2 months ago

ZeroWw/Llama-3.3-8B-Instruct-GGUF

Text Generation • 8B • Updated Dec 31, 2025 • 20

liked a model 2 months ago

unsloth/Qwen-Image-2512-GGUF

Text-to-Image • 20B • Updated Jan 6 • 46.3k • 309

New activity in zai-org/GLM-TTS 3 months ago

Voice cloning error.

#1 opened 3 months ago by

ZeroWw

liked a model 4 months ago

NexaAI/DeepSeek-OCR-GGUF

3B • Updated Nov 15, 2025 • 16.6k • 46

New activity in Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 4 months ago

I find qwen3 next exceptional, but too big.

#7 opened 4 months ago by

ZeroWw

updated a model 6 months ago

ZeroWw/Art-0-8B-GGUF

Text Generation • 8B • Updated Aug 30, 2025 • 23

New activity in xai-org/grok-2 6 months ago

In the face of google who didn't release old gemini! (Thanks)

👍 15

#21 opened 6 months ago by

ZeroWw

reacted to codelion's post with 👀❤️🔥 6 months ago

I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.

Typically, quantizing an LLM to INT4 for inference (unlike, say, INT8) can incur some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.
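The teacher-student setup described above needs a loss that pulls the quantized model's output distribution toward the FP16 model's. A minimal sketch follows, assuming a KL-divergence objective over next-token logits; the post does not state which loss was used, and the function names and the `temperature` parameter are illustrative only.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    teacher_logits: logits from the FP16 model (assumed teacher).
    student_logits: logits from the INT4 model plus adapter (assumed student).
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy logits over a 4-token vocabulary (made-up numbers, not real model output).
teacher = [2.0, 0.5, -1.0, 0.0]
student_far = [0.0, 2.0, 0.5, -1.0]
student_same = list(teacher)

loss_far = kl_distillation_loss(teacher, student_far)
loss_same = kl_distillation_loss(teacher, student_same)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge. In a real run it would be averaged over every token of the self-generated training samples.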

Last year, Apple's foundation models paper (https://arxiv.org/pdf/2407.21075) proposed a similar technique and found: "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).

We saw similar results on Qwen3-0.6B:

- Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
- Memory: only 0.28 GB vs 1.0 GB for FP16 (75% reduction)
- Speed: 3.0x faster inference than FP16
- Quality: generates correct, optimized code solutions
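The figures above can be cross-checked with a little arithmetic. The sketch below assumes that "5.7% degradation" means the recovered perplexity (2.09) sits 5.7% above the FP16 baseline; the baseline itself is not quoted in the post, so the implied value is a derived estimate, not a reported number.

```python
# Figures quoted in the post.
ppl_quantized = 2.40   # INT4 perplexity before the adapter
ppl_recovered = 2.09   # INT4 perplexity with the adapter
gap_to_fp16 = 0.057    # assumption: 2.09 is 5.7% above the FP16 baseline
mem_int4_gb = 0.28
mem_fp16_gb = 1.0

# Implied FP16 perplexity under that assumption (about 1.98).
ppl_fp16_implied = ppl_recovered / (1 + gap_to_fp16)

# Degradation of the plain INT4 model relative to that baseline (about 21%).
degradation_before = ppl_quantized / ppl_fp16_implied - 1

# Share of the INT4-induced perplexity gap that the adapter closes.
gap_closed = (ppl_quantized - ppl_recovered) / (ppl_quantized - ppl_fp16_implied)

# Memory saving computed from the rounded figures (0.72).
memory_reduction = 1 - mem_int4_gb / mem_fp16_gb
```

Under this reading, the adapter closes most of the perplexity gap that INT4 quantization opened. The memory saving computed from the rounded figures comes out near 72%, slightly below the quoted 75%, which matches the ideal ratio of 4-bit to 16-bit weights.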

- Pre-trained adapter: codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.
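To illustrate the claim that quantization error is learnable by a small adapter, here is a toy sketch in pure Python. It is not the ellora implementation: the 4x4 weight matrix, the per-tensor rounding scheme, the rank-1 adapter, and all hyperparameters are assumptions chosen to keep the example tiny.

```python
import random

def quantize_int4(weights):
    """Symmetric per-tensor fake quantization to 16 integer levels (-8..7)."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 7.0
    return [[max(-8, min(7, round(w / scale))) * scale for w in row] for row in weights]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def student_forward(w_q, lora_a, lora_b, x):
    """y = W_q x + B (A x); A has shape (rank, d_in), B has shape (d_out, rank)."""
    base = matvec(w_q, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + l for b, l in zip(base, low_rank)]

def mse(w_fp, w_q, lora_a, lora_b, inputs):
    """Mean squared output error of the student against the full-precision teacher."""
    total = 0.0
    for x in inputs:
        teacher = matvec(w_fp, x)
        student = student_forward(w_q, lora_a, lora_b, x)
        total += sum((t - s) ** 2 for t, s in zip(teacher, student))
    return total / len(inputs)

def train_adapter(w_fp, w_q, inputs, rank=1, lr=0.2, steps=300, seed=0):
    """Full-batch gradient descent on A and B; the quantized weights stay frozen."""
    rng = random.Random(seed)
    d_out, d_in = len(w_fp), len(w_fp[0])
    lora_a = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
    n = len(inputs)
    for _ in range(steps):
        grad_a = [[0.0] * d_in for _ in range(rank)]
        grad_b = [[0.0] * rank for _ in range(d_out)]
        for x in inputs:
            hidden = matvec(lora_a, x)
            teacher = matvec(w_fp, x)
            student = student_forward(w_q, lora_a, lora_b, x)
            err = [s - t for s, t in zip(student, teacher)]
            for i in range(d_out):
                for k in range(rank):
                    grad_b[i][k] += 2 * err[i] * hidden[k] / n
            for k in range(rank):
                back = sum(lora_b[i][k] * err[i] for i in range(d_out))
                for j in range(d_in):
                    grad_a[k][j] += 2 * back * x[j] / n
        for i in range(d_out):
            for k in range(rank):
                lora_b[i][k] -= lr * grad_b[i][k]
        for k in range(rank):
            for j in range(d_in):
                lora_a[k][j] -= lr * grad_a[k][j]
    return lora_a, lora_b

# Toy "teacher" weights and inputs (random stand-ins, not real model data).
rng = random.Random(42)
w_fp = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
w_q = quantize_int4(w_fp)
inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]

no_a, no_b = [[0.0] * 4], [[0.0] for _ in range(4)]
loss_before = mse(w_fp, w_q, no_a, no_b, inputs)
lora_a, lora_b = train_adapter(w_fp, w_q, inputs)
loss_after = mse(w_fp, w_q, lora_a, lora_b, inputs)
```

Comparing `loss_after` with `loss_before` shows how much of the output error the low-rank term absorbs while the quantized weights stay untouched. A rank-1 correction on a single matrix can only capture part of the error; the post's setting uses rank 16 across a full model.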

Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!