gemma-4-31B-it-oQ8

An oQ8 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.

The output is standard MLX safetensors, compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.

Key Facts

Property	Value
Base Model	google/gemma-4-31b-it (31B dense, BF16)
Quantization	oQ8 — sensitivity-driven mixed-precision
Effective bpw	8.6
Model Size	~31.4 GB (vs. 58.3 GB BF16)
Vision	✅ Preserved (vision weights kept in fp16)
Format	Standard MLX safetensors
Quantized with	oMLX v0.3.4+
Hardware	Apple M2 Ultra 128 GB
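The effective bits-per-weight figure can be sanity-checked from the two sizes in the table. This is a rough sketch that assumes the BF16 checkpoint stores every weight at 16 bits and that both sizes cover the same tensors; the function name is illustrative, not part of oMLX.

```python
# Sizes taken from the Key Facts table (GB).
BF16_SIZE_GB = 58.3
OQ8_SIZE_GB = 31.4
BF16_BITS = 16  # bits per weight in the BF16 baseline (assumption stated above)


def effective_bpw(quant_size_gb: float, base_size_gb: float, base_bits: int = BF16_BITS) -> float:
    """Estimate average bits per weight from the size ratio to the baseline."""
    return base_bits * quant_size_gb / base_size_gb


bpw = effective_bpw(OQ8_SIZE_GB, BF16_SIZE_GB)
print(f"effective bpw ~ {bpw:.1f}")
```

The result rounds to the 8.6 bpw listed in the table.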

Why oQ8?

oQ is not uniform quantization. Instead of applying the same bit depth to every layer, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. Critical layers (embeddings, LM head, sensitive transformer layers) receive higher precision while the bulk of weights are quantized to the target bit depth.

At 8-bit, this yields near-lossless quality at roughly half the size of BF16. Token generation is also faster (about 70% at batch size 1 in the benchmarks below), because decoding on Apple Silicon is limited by memory bandwidth and the quantized weights need less of it.
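The idea of sensitivity-driven allocation can be illustrated with a toy example. This is not oMLX's actual algorithm: the layer names, the sensitivity scores, the 16-bit "protected" precision, and the protect_fraction parameter are all assumptions made for illustration.

```python
def allocate_bits(sensitivity, base_bits=8, high_bits=16, protect_fraction=0.25):
    """Give the most sensitive layers `high_bits`; all others get `base_bits`."""
    # Rank layer names from most to least sensitive.
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_protected = max(1, round(len(ranked) * protect_fraction))
    protected = set(ranked[:n_protected])
    return {name: (high_bits if name in protected else base_bits) for name in sensitivity}


# Hypothetical scores, e.g. output error measured after quantizing one layer at a time.
scores = {
    "embed_tokens": 0.90,
    "layers.0.mlp": 0.10,
    "layers.1.mlp": 0.15,
    "lm_head": 0.80,
}
plan = allocate_bits(scores, protect_fraction=0.5)
print(plan)
```

In this sketch the two highest-scoring layers keep 16 bits and the rest drop to the 8-bit base. A real system would derive the scores from calibration inference rather than hard-coding them.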

Benchmarks

Tested on Apple M2 Ultra (128 GB) with oMLX. Generation length: 128 tokens.

oQ8 (this model, 31.8 GB peak memory)

Test	TTFT (ms)	TPOT (ms)	pp TPS	tg TPS	E2E (s)	Throughput	Peak Mem
pp1024/tg128	5,777	57.4	177.3 tok/s	17.5 tok/s	13.1s	88.1 tok/s	31.80 GB
pp4096/tg128	22,680	63.6	180.6 tok/s	15.8 tok/s	30.8s	137.3 tok/s	33.65 GB
pp8192/tg128	45,656	73.2	179.4 tok/s	13.8 tok/s	55.0s	151.4 tok/s	33.96 GB
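The columns in this table are related to each other. The sketch below assumes the usual definitions (prefill rate = prompt tokens / TTFT, generation rate = 1 / TPOT, end-to-end time = TTFT + generated tokens x TPOT); these formulas are assumptions, not taken from oMLX's benchmark code, but they reproduce the pp1024/tg128 row to within about 1%.

```python
def derived_metrics(pp_tokens, tg_tokens, ttft_ms, tpot_ms):
    """Derive rate and latency columns from TTFT and TPOT (assumed definitions)."""
    ttft_s = ttft_ms / 1000
    e2e_s = ttft_s + tg_tokens * tpot_ms / 1000
    return {
        "pp_tps": pp_tokens / ttft_s,          # prefill tokens per second
        "tg_tps": 1000 / tpot_ms,              # generated tokens per second
        "e2e_s": e2e_s,                        # total request time
        "throughput_tps": (pp_tokens + tg_tokens) / e2e_s,
    }


# pp1024/tg128 row of the oQ8 table.
m = derived_metrics(1024, 128, ttft_ms=5777, tpot_ms=57.4)
for name, value in m.items():
    print(f"{name}: {value:.1f}")
```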

Continuous Batching (pp1024/tg128)

Batch	tg TPS	Speedup	pp TPS	pp TPS/req	Avg TTFT (ms)	E2E (s)
1x (baseline)	17.5 tok/s	1.00x	177.3 tok/s	177.3 tok/s	5,777	13.1
2x	28.9 tok/s	1.65x	176.6 tok/s	88.3 tok/s	11,402	20.5
4x	37.0 tok/s	2.11x	176.1 tok/s	44.0 tok/s	22,529	37.1
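The Speedup and pp TPS/req columns follow directly from the other columns. The sketch below recomputes them and adds a per-request generation rate, which the table does not list; the function and key names are illustrative.

```python
def batch_metrics(batch, tg_tps, pp_tps, baseline_tg_tps):
    """Recompute the derived batching columns from aggregate rates."""
    return {
        "speedup": tg_tps / baseline_tg_tps,   # aggregate generation vs. batch 1
        "pp_tps_per_req": pp_tps / batch,      # prefill rate seen by each request
        "tg_tps_per_req": tg_tps / batch,      # generation rate seen by each request
    }


# (batch, tg TPS, pp TPS) from the oQ8 continuous batching table.
ROWS = [(1, 17.5, 177.3), (2, 28.9, 176.6), (4, 37.0, 176.1)]
baseline = ROWS[0][1]
results = {b: batch_metrics(b, tg, pp, baseline) for b, tg, pp in ROWS}

for b, r in results.items():
    print(f"{b}x: speedup {r['speedup']:.2f}x, "
          f"pp/req {r['pp_tps_per_req']:.1f} tok/s, "
          f"tg/req {r['tg_tps_per_req']:.2f} tok/s")
```

Aggregate generation throughput rises with batch size, while each individual request generates more slowly than it would alone.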

BF16 reference (58.5 GB peak memory)

Test	TTFT (ms)	TPOT (ms)	pp TPS	tg TPS	E2E (s)	Throughput	Peak Mem
pp1024/tg128	3,974	98.2	257.7 tok/s	10.3 tok/s	16.4s	70.1 tok/s	58.53 GB
pp4096/tg128	16,061	102.9	255.0 tok/s	9.8 tok/s	29.1s	145.0 tok/s	60.31 GB
pp8192/tg128	32,348	115.2	253.2 tok/s	8.8 tok/s	47.0s	177.1 tok/s	60.63 GB

Continuous Batching (pp1024/tg128)

Batch	tg TPS	Speedup	pp TPS	pp TPS/req	Avg TTFT (ms)	E2E (s)
1x (baseline)	10.3 tok/s	1.00x	257.7 tok/s	257.7 tok/s	3,974	16.4
2x	9.1 tok/s	0.88x	256.8 tok/s	128.4 tok/s	7,713	36.1
4x	16.7 tok/s	1.62x	253.1 tok/s	63.3 tok/s	15,215	46.8

Summary: oQ8 vs BF16

Metric	oQ8	BF16	Difference
Size	31.4 GB	58.3 GB	-46%
Token Generation	17.5 tok/s	10.3 tok/s	+70% faster
4x Batch Generation	37.0 tok/s	16.7 tok/s	+122% faster
Prefill	177 tok/s	258 tok/s	-31% (dequantization overhead)
Peak Memory	31.8 GB	58.5 GB	-46%

oQ8 is near-lossless at roughly half the memory, with about 70% faster single-request token generation. Prefill is about 31% slower because of dequantization overhead, but for interactive use generation speed usually matters more.

Usage

oMLX

Drop the model folder into your oMLX models directory. It is auto-detected on server start.

mlx-lm

from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("mpe74/gemma-4-31B-it-oQ8")

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)

LM Studio

Search for the model in LM Studio and download it. It runs on the MLX backend on Apple Silicon.

Quantization Details

Parameter	Value
oQ Level	oQ8
Base bits	8
Mode	Affine quantization
Group size	64
Sensitivity model	Source model (google/gemma-4-31b-it BF16)
Calibration data	Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning)
Vision weights	Preserved in fp16
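To make "affine quantization, group size 64" concrete, here is a minimal pure-Python sketch of textbook min/max affine quantization on one group. It is not necessarily bit-exact with the MLX kernels. The storage estimate assumes one 16-bit scale and one 16-bit bias per group, which is an assumption about the on-disk layout rather than something stated in this card.

```python
def quantize_group(weights, bits=8):
    """Affine-quantize one group: w ~ scale * q + bias, with q in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * q + bias for q in codes]


def bits_per_weight(bits=8, group_size=64, param_bits=16):
    """Storage cost per weight, counting one scale and one bias per group."""
    return bits + 2 * param_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]  # 64 evenly spaced toy weights
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(f"max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
print(f"bits per weight incl. scale/bias: {bits_per_weight()}")
```

Under these assumptions a plain 8-bit, group-size-64 layer costs 8.5 bits per weight. The gap to the reported 8.6 effective bpw would then come from tensors kept at higher precision, though that attribution is an inference rather than a measured breakdown.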

Bug Note

During quantization, oMLX failed with the error "Object of type set is not JSON serializable". The cause: _oq_non_quantizable (a Python set) was not removed from the output config before JSON serialization. Fix: add "_oq_non_quantizable" to the cleanup list in omlx/oq.py (around line 1300). The issue will be reported upstream.
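The failure and the fix can be reproduced in isolation. The config dict and the CLEANUP_KEYS list below are hypothetical stand-ins; only the key name "_oq_non_quantizable" comes from the note above, and the real code in omlx/oq.py may be structured differently.

```python
import json

# Hypothetical output config; the set value is what breaks serialization.
config = {
    "quantization": {"bits": 8, "group_size": 64},
    "_oq_non_quantizable": {"vision_tower", "lm_head"},
}

try:
    json.dumps(config)
    failed = False
except TypeError as exc:
    failed = True
    print(f"before cleanup: {exc}")

# Internal bookkeeping keys to strip before serialization (illustrative list).
CLEANUP_KEYS = ["_oq_non_quantizable"]
clean_config = {k: v for k, v in config.items() if k not in CLEANUP_KEYS}

serialized = json.dumps(clean_config, indent=2)
print(serialized)
```

Dropping the key works here because it is internal bookkeeping. If the information had to be kept in the config, converting the set to a sorted list would be the alternative.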

Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).
