Granite-4.0-H-Tiny — MLX 5-bit (Apple Silicon)

Maintainer / Publisher: Susant Achary

This repository provides an Apple-Silicon MLX build of IBM Granite-4.0-H-Tiny quantized to 5-bit.
If you need more faithfulness than 3/4-bit but want lower RAM than 6-bit, 5-bit is a strong middle ground—especially for document parsing, structured extraction, and long-context assistants on Mac.


🔎 About Granite 4.0 (context)

  • Architecture: Hybrid Mamba-2 + softmax attention; H tiers add Mixture-of-Experts (MoE) layers with sparse per-token activation.
  • Tier: H-Tiny (~7B total params with ~1B active via MoE), designed for efficient long-context inference.
  • License: Apache-2.0 (permissive, enterprise-friendly).
  • Typical uses: Instruction following, long-context assistants, RAG pipelines, structured outputs.

This card documents the MLX 5-bit conversion. See the comparison table below for when to choose 3/4-bit (lower RAM) or 6-bit (highest fidelity).


📦 What’s in this repo (MLX format)

  • config.json (MLX), mlx_model*.safetensors (5-bit shards)
  • Tokenizer files: tokenizer.json, tokenizer_config.json
  • Model metadata (e.g., model_index.json)

Target platform: macOS on Apple Silicon (M-series) with Metal/MPS acceleration.


✅ Intended use

  • High-quality instruction following and summarization with long context
  • Document / form / table parsing and JSON extraction (schema-guided prompts; see the prompt sketch after this list)
  • On-device prototyping where accuracy matters but RAM is modest
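
For the schema-guided extraction use case, a minimal prompt-building sketch in Python (the schema, field names, and helper functions are illustrative placeholders, not something defined by this repo):

import json

# Illustrative schema; replace the field names with the ones your documents need.
SCHEMA = '{"invoice_number": "string", "issue_date": "YYYY-MM-DD", "total": "number"}'

def build_extraction_prompt(document_text: str) -> str:
    # Asking for "JSON only" keeps the reply parseable with json.loads.
    return (
        "Extract these fields from the document and answer with JSON only, "
        f"matching this schema:\n{SCHEMA}\n\nDocument:\n{document_text}"
    )

def parse_reply(reply: str) -> dict:
    return json.loads(reply)  # raises JSONDecodeError if the model drifted from JSON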

⚠️ Limitations

  • Still quantized: some regressions vs FP16 can surface on intricate math/code.
  • KV cache / context length can dominate RAM at very long windows; monitor memory budgets (see the snippet after this list).
  • Add your own guardrails and safety for production.
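
A minimal sketch for watching that budget during long-context experiments, using MLX's Metal memory counters (function names assume a recent MLX / mlx-lm release; newer MLX versions also expose the same counters as mx.get_active_memory() / mx.get_peak_memory()):

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-5bit-MLX")

mx.metal.reset_peak_memory()  # start the peak counter from zero for this run
_ = generate(model, tokenizer, prompt="<your long-context prompt>", max_tokens=128)

print(f"active: {mx.metal.get_active_memory() / 1e9:.2f} GB")
print(f"peak:   {mx.metal.get_peak_memory() / 1e9:.2f} GB")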

🔢 Choosing a quantization level (MLX on Apple Silicon)

Indicative ranges for a ~7B hybrid MoE LM (actual usage varies by context length and batch size).

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal-RAM devices; smoke tests |
| 3-bit | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise | Great default on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention vs 3-bit | If 3-bit misses small details |
| 5-bit (this repo) | ~8–9 GB | 🔥🔥☆ | Higher fidelity, fewer omissions | When you want stronger document/JSON faithfulness without 6-bit RAM |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Highest MLX fidelity | If RAM permits and you need maximum quality |
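
A quick back-of-envelope check on the 5-bit row (assuming the ~7B total parameters quoted above): the packed weights alone take roughly 7 × 10⁹ × 5 bits ÷ 8 ≈ 4.4 GB; quantization scales, activations, the KV cache, and runtime overhead account for the rest of the ~8–9 GB indicative peak.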

Rules of thumb

  • Start at 5-bit for document/structured tasks on 8–16 GB Macs.
  • Drop to 3/4-bit for tighter RAM / higher speed.
  • Move to 6-bit if you still see omissions or slight distortions in outputs.

🚀 Quickstart (CLI — MLX)

Deterministic generation (MLX runs on the Metal GPU automatically; no device flag is needed)

python -m mlx_lm.generate \
  --model mlx-community/granite-4.0-h-tiny-5bit-MLX \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
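
The same call from Python, via the mlx-lm API (a minimal sketch; assumes the mlx-lm package is installed, e.g. via pip install mlx-lm):

from mlx_lm import load, generate

# load() downloads the weights from the Hub on first use and caches them locally.
model, tokenizer = load("mlx-community/granite-4.0-h-tiny-5bit-MLX")

prompt = "Summarize the following in 5 bullet points:\n<your text>"

# The MLX tokenizer wraps the Hugging Face tokenizer, so the chat template can be applied first.
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)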