gemma-4-31B-it-oQ8
An oQ8 mixed-precision quantization of google/gemma-4-31b-it, produced with oMLX, a data-driven, sensitivity-aware quantization system for Apple Silicon.
The output is standard MLX safetensors, compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.
Key Facts
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31b-it (31B dense, BF16) |
| Quantization | oQ8 — sensitivity-driven mixed-precision |
| Effective bpw | 8.6 |
| Model Size | ~31.4 GB (vs. 58.3 GB BF16) |
| Vision | ✅ Preserved (vision weights kept in fp16) |
| Format | Standard MLX safetensors |
| Quantized with | oMLX v0.3.4+ |
| Hardware | Apple M2 Ultra 128 GB |
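The effective bits-per-weight figure can be sanity-checked from the two sizes in the table. The sketch below assumes that both checkpoints contain the same tensors and that the BF16 checkpoint stores exactly 16 bits per weight; `effective_bpw` is an illustrative helper, not part of oMLX.

```python
def effective_bpw(quantized_gb: float, bf16_gb: float) -> float:
    """Estimate average bits per weight from two checkpoint sizes.

    Assumption: the BF16 checkpoint spends 16 bits on every weight and
    both checkpoints hold the same set of tensors.
    """
    return 16.0 * quantized_gb / bf16_gb


# Sizes taken from the Key Facts table.
print(f"{effective_bpw(31.4, 58.3):.1f} bpw")  # prints "8.6 bpw"
```

The estimate agrees with the reported 8.6 bpw, which suggests the size figures and the bpw figure were measured consistently.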
Why oQ8?
oQ is not uniform quantization. Instead of applying the same bit depth to every layer, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. Critical layers (embeddings, LM head, sensitive transformer layers) receive higher precision while the bulk of weights are quantized to the target bit depth.
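As a rough illustration of the idea, the sketch below ranks layers by a sensitivity score and gives the most sensitive ones more bits. This is a toy heuristic, not oMLX's actual algorithm: the layer names, the scores, the `top_fraction` parameter, and the 16-bit setting for protected layers are all made-up assumptions.

```python
def allocate_bits(sensitivity, base_bits=8, high_bits=16, top_fraction=0.25):
    """Toy allocation: the most sensitive layers keep high_bits, the rest get base_bits."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, round(len(ranked) * top_fraction))
    protected = set(ranked[:n_high])
    return {name: high_bits if name in protected else base_bits
            for name in sensitivity}


def average_bits(bits, n_params):
    """Parameter-weighted average bit width across layers."""
    total = sum(n_params.values())
    return sum(bits[name] * n_params[name] for name in bits) / total


# Hypothetical sensitivity scores (higher means more damage when quantized).
scores = {"embed_tokens": 0.9, "layers.0.mlp": 0.1,
          "layers.1.mlp": 0.2, "lm_head": 0.6}
print(allocate_bits(scores))
```

A real system would measure the scores through calibration inference, as described above, rather than hard-coding them.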
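At 8 bits, this gives near-lossless quality at roughly half the size of BF16 (31.4 GB vs. 58.3 GB). Token generation is also about 70% faster in the benchmarks below, because decoding on Apple Silicon is limited by memory bandwidth and the quantized weights take less bandwidth to read.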
Benchmarks
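Tested on an Apple M2 Ultra (128 GB) with oMLX. Each test is labeled ppN/tgM: N prompt tokens are prefilled (pp), then M tokens are generated (tg). Every run generates 128 tokens. TTFT is time to first token, TPOT is time per output token, and TPS is tokens per second.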
oQ8 (this model, 31.8 GB peak memory)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 5,777 | 57.4 | 177.3 tok/s | 17.5 tok/s | 13.1s | 88.1 tok/s | 31.80 GB |
| pp4096/tg128 | 22,680 | 63.6 | 180.6 tok/s | 15.8 tok/s | 30.8s | 137.3 tok/s | 33.65 GB |
| pp8192/tg128 | 45,656 | 73.2 | 179.4 tok/s | 13.8 tok/s | 55.0s | 151.4 tok/s | 33.96 GB |
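The columns in this table appear to be related to one another in a simple way. The sketch below recomputes the derived columns from TTFT and TPOT alone; the formulas are an assumption about how the benchmark tool derives them, so the results land within rounding distance of the table rather than matching it exactly.

```python
def derived_metrics(prompt_tokens, gen_tokens, ttft_ms, tpot_ms):
    """Recompute benchmark columns from TTFT and TPOT (assumed relationships)."""
    ttft_s = ttft_ms / 1000.0
    e2e_s = ttft_s + gen_tokens * tpot_ms / 1000.0
    return {
        "pp_tps": prompt_tokens / ttft_s,   # prefill speed
        "tg_tps": 1000.0 / tpot_ms,         # generation speed
        "e2e_s": e2e_s,                     # end-to-end latency
        "throughput_tps": (prompt_tokens + gen_tokens) / e2e_s,
    }


# pp1024/tg128 row of the oQ8 table.
metrics = derived_metrics(1024, 128, ttft_ms=5777, tpot_ms=57.4)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```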
Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 17.5 tok/s | 1.00x | 177.3 tok/s | 177.3 tok/s | 5,777 | 13.1 |
| 2x | 28.9 tok/s | 1.65x | 176.6 tok/s | 88.3 tok/s | 11,402 | 20.5 |
| 4x | 37.0 tok/s | 2.11x | 176.1 tok/s | 44.0 tok/s | 22,529 | 37.1 |
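The Speedup and pp TPS/req columns follow from the other columns. A minimal sketch, assuming speedup is measured against the 1x generation speed and that prefill throughput is shared evenly across concurrent requests (`batch_row` is an illustrative helper):

```python
def batch_row(batch, tg_tps, pp_tps, baseline_tg_tps):
    """Derive the Speedup and per-request prefill columns for one batch size."""
    return {
        "speedup": round(tg_tps / baseline_tg_tps, 2),
        "pp_tps_per_request": round(pp_tps / batch, 1),
    }


# oQ8 rows from the table above; the 1x baseline generates 17.5 tok/s.
print(batch_row(2, tg_tps=28.9, pp_tps=176.6, baseline_tg_tps=17.5))
print(batch_row(4, tg_tps=37.0, pp_tps=176.1, baseline_tg_tps=17.5))
```

Aggregate generation speed grows with batch size, while average TTFT grows roughly in proportion because total prefill throughput stays flat.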
BF16 reference (58.5 GB peak memory)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 3,974 | 98.2 | 257.7 tok/s | 10.3 tok/s | 16.4s | 70.1 tok/s | 58.53 GB |
| pp4096/tg128 | 16,061 | 102.9 | 255.0 tok/s | 9.8 tok/s | 29.1s | 145.0 tok/s | 60.31 GB |
| pp8192/tg128 | 32,348 | 115.2 | 253.2 tok/s | 8.8 tok/s | 47.0s | 177.1 tok/s | 60.63 GB |
Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 10.3 tok/s | 1.00x | 257.7 tok/s | 257.7 tok/s | 3,974 | 16.4 |
| 2x | 9.1 tok/s | 0.88x | 256.8 tok/s | 128.4 tok/s | 7,713 | 36.1 |
| 4x | 16.7 tok/s | 1.62x | 253.1 tok/s | 63.3 tok/s | 15,215 | 46.8 |
Summary: oQ8 vs BF16
| Metric | oQ8 | BF16 | Difference |
|---|---|---|---|
| Size | 31.4 GB | 58.3 GB | -46% |
| Token Generation | 17.5 tok/s | 10.3 tok/s | +70% faster |
| 4x Batch Generation | 37.0 tok/s | 16.7 tok/s | +122% faster |
| Prefill | 177 tok/s | 258 tok/s | -31% (dequantization overhead) |
| Peak Memory | 31.8 GB | 58.5 GB | -46% |
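The Difference column is the relative change of oQ8 against BF16. A short sketch that recomputes it from the values in the table (`percent_change` is an illustrative helper name):

```python
def percent_change(new, reference):
    """Relative change of `new` against `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# (oQ8, BF16) pairs copied from the summary table.
rows = {
    "Size (GB)": (31.4, 58.3),
    "Token generation (tok/s)": (17.5, 10.3),
    "4x batch generation (tok/s)": (37.0, 16.7),
    "Prefill (tok/s)": (177.0, 258.0),
    "Peak memory (GB)": (31.8, 58.5),
}

for name, (oq8, bf16) in rows.items():
    print(f"{name}: {percent_change(oq8, bf16):+.0f}%")
```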
oQ8 is near-lossless at roughly 54% of the BF16 memory footprint, with token generation about 70% faster (about 122% faster at 4x batching). Prefill is about 31% slower because of dequantization overhead, but for interactive use generation speed matters most.
Usage
oMLX
Drop the model folder into your oMLX models directory. It is auto-detected when the server starts.
mlx-lm
from mlx_lm import load, generate

# Download (if needed) and load the quantized weights and tokenizer
model, tokenizer = load("mpe74/gemma-4-31B-it-oQ8")

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
LM Studio
Search for the model in LM Studio and download it. It runs on the MLX backend on Apple Silicon.
Quantization Details
| Parameter | Value |
|---|---|
| oQ Level | oQ8 |
| Base bits | 8 |
| Mode | Affine quantization |
| Group size | 64 |
| Sensitivity model | Source model (google/gemma-4-31b-it BF16) |
| Calibration data | Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning) |
| Vision weights | Preserved in fp16 |
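Affine quantization with a group size of 64 stores one scale and one bias for every 64 weights, so an 8-bit base costs slightly more than 8 bits per weight. The plain-Python sketch below illustrates the scheme on a single group. It is not MLX's kernel, and the 16-bit width assumed for the scale and bias is an assumption.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group: value ~= q * scale + bias."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo


def dequantize_group(q, scale, bias):
    """Map integer codes back to approximate float values."""
    return [qi * scale + bias for qi in q]


def bits_per_weight(bits=8, group_size=64, scale_bits=16, bias_bits=16):
    """Storage cost per weight, including the per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


group = [i / 10 for i in range(-32, 32)]  # 64 example weights
q, scale, bias = quantize_group(group)
restored = dequantize_group(q, scale, bias)
print(max(abs(a - b) for a, b in zip(group, restored)) <= scale / 2 + 1e-9)
print(bits_per_weight())  # prints 8.5
```

Under these assumptions the uniform 8-bit layers cost 8.5 bits per weight, so the remaining gap to the reported 8.6 bpw would come from tensors kept at higher precision.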
Bug Note
During quantization, oMLX failed with the error "Object of type set is not JSON serializable". The cause is that _oq_non_quantizable (a Python set) is not removed from the output config before JSON serialization. Fix: add "_oq_non_quantizable" to the cleanup list in omlx/oq.py (around line 1300). The issue will be reported upstream.
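The failure mode is easy to reproduce outside oMLX. The sketch below uses a hypothetical config dictionary and a hypothetical `dump_config` helper; only the `_oq_non_quantizable` key name comes from the note above, and the real cleanup code in omlx/oq.py may look different.

```python
import json

# Hypothetical config; the set value is what triggers the error.
config = {
    "quantization": {"bits": 8, "group_size": 64},
    "_oq_non_quantizable": {"vision_tower"},
}


def dump_config(cfg, cleanup_keys=("_oq_non_quantizable",)):
    """Drop internal bookkeeping keys, then serialize the rest to JSON."""
    cleaned = {k: v for k, v in cfg.items() if k not in cleanup_keys}
    return json.dumps(cleaned, indent=2)


try:
    json.dumps(config)
except TypeError as err:
    print(f"Without cleanup: {err}")

print(dump_config(config))
```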
Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).