Instructions to use steampunque/gemma-4-31B-it-MP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use steampunque/gemma-4-31B-it-MP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="steampunque/gemma-4-31B-it-MP-GGUF",
    filename="gemma-4-31B-it.Q4_E_H.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello! Briefly introduce yourself."}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use steampunque/gemma-4-31B-it-MP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF

# Run inference directly in the terminal:
llama-cli -hf steampunque/gemma-4-31B-it-MP-GGUF
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF

# Run inference directly in the terminal:
llama-cli -hf steampunque/gemma-4-31B-it-MP-GGUF
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF

# Run inference directly in the terminal:
./llama-cli -hf steampunque/gemma-4-31B-it-MP-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF

# Run inference directly in the terminal:
./build/bin/llama-cli -hf steampunque/gemma-4-31B-it-MP-GGUF
Use Docker
docker model run hf.co/steampunque/gemma-4-31B-it-MP-GGUF
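Once llama-server is running (it listens on http://localhost:8080 by default), you can also query its OpenAI-compatible chat endpoint directly. A minimal sketch; for a single-model server the "model" field is largely informational:

# Query the local OpenAI-compatible endpoint started by llama-server:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-31B-it",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'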
- LM Studio
- Jan
- Ollama
How to use steampunque/gemma-4-31B-it-MP-GGUF with Ollama:
ollama run hf.co/steampunque/gemma-4-31B-it-MP-GGUF
- Unsloth Studio
How to use steampunque/gemma-4-31B-it-MP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for steampunque/gemma-4-31B-it-MP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for steampunque/gemma-4-31B-it-MP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for steampunque/gemma-4-31B-it-MP-GGUF to start chatting
- Pi
How to use steampunque/gemma-4-31B-it-MP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "steampunque/gemma-4-31B-it-MP-GGUF" }
      ]
    }
  }
}

Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use steampunque/gemma-4-31B-it-MP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf steampunque/gemma-4-31B-it-MP-GGUF
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default steampunque/gemma-4-31B-it-MP-GGUF
Run Hermes
hermes
- Docker Model Runner
How to use steampunque/gemma-4-31B-it-MP-GGUF with Docker Model Runner:
docker model run hf.co/steampunque/gemma-4-31B-it-MP-GGUF
- Lemonade
How to use steampunque/gemma-4-31B-it-MP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull steampunque/gemma-4-31B-it-MP-GGUF
Run and chat with the model
lemonade run user.gemma-4-31B-it-MP-GGUF-{{QUANT_TAG}}

List all available models
lemonade list
Mixed Precision GGUF layer quantization of gemma-4-31B-it by Google
Original model: https://huggingface.co/google/gemma-4-31B-it
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. The quants employed are all K quants, to avoid the slow CPU or older-GPU processing of IQ quants. The extended K layer quant definitions are as follows:
Extended K (QN_E_H) mixed precision layer quant nomenclature:
QN_K_VOD, Q8_0_VOD
N = {2,3,4,5,6}
VOD = attnV:attnO:ffnD
V,O,D = {0,2,3,4,5,6,8,f,F}
VOD MAP:
2:Q2_K, 3:Q3_K, 4:Q4_K, 5:Q5_K, 6:Q6_K, 8:Q8_0, f/F:F16, 0:F32; all other tensors in the layer use the default QN_K.
For example, Q4_K_645 assigns attn_v = Q6_K, attn_o = Q4_K, and ffn_down = Q5_K, with the remaining tensors in that layer at the default Q4_K.
LAYER_TYPES='[
[0 ,"Q5_K_656"],[1 ,"Q4_K_646"],[2 ,"Q4_K_554"],[3 ,"Q3_K_544"],[4 ,"Q4_K_544"],[5 ,"Q3_K_544"],[6 ,"Q4_K_544"],[7 ,"Q3_K_544"],
[8 ,"Q4_K_544"],[9 ,"Q3_K_544"],[10,"Q4_K_544"],[11,"Q3_K_544"],[12,"Q4_K_544"],[13,"Q3_K_544"],[14,"Q4_K_544"],[15,"Q3_K_544"],
[16,"Q4_K_544"],[17,"Q3_K_544"],[18,"Q4_K_544"],[19,"Q3_K_544"],[20,"Q4_K_544"],[21,"Q3_K_544"],[22,"Q4_K_544"],[23,"Q3_K_544"],
[24,"Q4_K_544"],[25,"Q3_K_544"],[26,"Q4_K_544"],[27,"Q3_K_544"],[28,"Q4_K_544"],[29,"Q3_K_544"],[30,"Q4_K_544"],[31,"Q3_K_544"],
[32,"Q4_K_544"],[33,"Q4_K_544"],[34,"Q4_K_544"],[35,"Q4_K_544"],[36,"Q4_K_544"],[37,"Q4_K_544"],[38,"Q4_K_544"],[39,"Q4_K_544"],
[40,"Q4_K_645"],[41,"Q4_K_544"],[42,"Q4_K_645"],[43,"Q4_K_544"],[44,"Q4_K_645"],[45,"Q4_K_544"],[46,"Q4_K_645"],[47,"Q4_K_544"],
[48,"Q4_K_645"],[49,"Q4_K_544"],[50,"Q4_K_645"],[51,"Q4_K_544"],[52,"Q4_K_645"],[53,"Q4_K_544"],[54,"Q4_K_645"],[55,"Q4_K_645"],
[56,"Q5_K_645"],[57,"Q5_K_656"],[58,"Q5_K_666"],[59,"Q6_K_886"]
]'
FLAGS="--token-embedding-type Q5_K --output-tensor-type Q6_K --layer-types-high"
The quant was tested over a small set of curated reasoning prompts, showed very strong performance, and was sized to be 1.1 GB smaller than Q4_K_M.
Comparison:
| Quant | Size (bytes) | PPL | Comment |
|---|---|---|---|
| Q4_K_M | 18.7e9 | 15.7 | modified PPL, see discussion below. |
| Q4_E_H | 17.6e9 | 15.3 | modified PPL, 1.1G smaller than Q4_K_M |
Usage:
gemma 4 31B it is a vision-capable dense RL model. It can be used together with its multimedia projector layers to process image and text inputs and generate text outputs. The mmproj file is made available in this repository.
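To use the vision path with llama.cpp, the projector file is passed alongside the main GGUF. A minimal sketch, assuming a recent llama.cpp build with multimodal (mtmd) support, local copies of both files from the download table below, and a placeholder image path:

# Serve the model with its multimedia projector (OpenAI-compatible endpoint):
llama-server -m gemma-4-31B-it.Q4_E_H.gguf --mmproj gemma-4-31B-it.mmproj.gguf

# Or run a one-shot image prompt from the terminal (bird.jpg is a placeholder):
llama-mtmd-cli -m gemma-4-31B-it.Q4_E_H.gguf --mmproj gemma-4-31B-it.mmproj.gguf \
    --image bird.jpg -p "Identify this bird."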
Thinking:
By default the model will not create an RL reasoning block and just outputs
<|channel>thought
<channel|>
at the start of generation. To get it to fill in the think block, use a system prompt with
<|think|>
as the first token. This is a special token in the model vocab and must be tokenized as such to work. No other text in the system prompt besides the think token is needed to get it to fill in the RL block, though other text can be added if desired.
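As an illustration, a request against a local llama-server OpenAI-compatible endpoint with the think token as the system prompt might look like the sketch below. Whether <|think|> is tokenized as a special token (rather than literal text) depends on the serving stack and its settings, so verify against your setup:

# Sketch: request a completion with the think block enabled
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "<|think|>"},
          {"role": "user", "content": "What is 17 * 23?"}
        ]
      }'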
Speculation:
Speculation can be used effectively with the model. A recommended low-overhead speculator is gemma-3-270m-it-256k. To use this speculator, the inference platform must support dynamic vocab translation between the draft and target models.
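For reference, upstream llama-server exposes speculative decoding through draft-model flags; a minimal sketch is below (flag names as of recent builds, draft filename hypothetical). Note that stock llama.cpp expects the draft and target vocabularies to be compatible, so the dynamic vocab translation mentioned above still requires a modified downstream server.

# Sketch only: upstream llama.cpp speculative decoding flags
llama-server -m gemma-4-31B-it.Q4_E_H.gguf \
    -md gemma-3-270m-it.gguf \
    --draft-max 3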
On a 2x 4070 setup (1 RPC), approximate performance sampled on the CUDA backend with a fixed speculation block size (ND), using a custom speculator with a downstream server, is:
| Quant | KV cache type | KV cache size | ND | gen t/s |
|---|---|---|---|---|
| Q4_E_H | F16 | 32k | 0 | 22 |
| " | " | " | 2 | 36 |
| " | " | " | 3 | 33 |
| " | Q8_0 | 64k | 0 | 17 |
| " | " | " | 2 | 27 |
| " | " | " | 3 | 27 |
The model was found to be highly capable on reasoning tasks even when skipping the think block. However, on hard or trick questions the model can just wing a bogus response in non-think mode; turning on the RL block will boot it out of the pull-answer-out-of-latent-space-QED mode.
Vision:
The model was tested in vision mode on a couple of pretty tough bird ID images and found to exhibit poor performance in both think and non-think mode, not even considering the correct answer in its responses. As a comparison, gemma3 27B went 1 for 2 and Qwen3 27B completely aced these tough ID tests (quite blurry images of a small bird). The model did a great job on some text-based image prompts though.
Code:
The model was tested across a small set of code gen prompts and found to be excellent in its ability to generate working code on all of the test prompts.
Llama.cpp inference/issues:
The minimum llama.cpp version to run gemma-4-31B-it should be b8648 or above, due to a correction of the Gemma 4 tokenizer.
The model cannot compute valid perplexity due to the instruct tune forcing it to generate
<|channel>thought
as assistant gen independent of the previous prompt contents. To work around this problem, a modified perplexity is computed by overwriting the beginning of the perplexity chunk contents with the forced assistant gen as follows:
# chunk is a string of text to eval perplexity on
injects='model\n<|channel>thought\n<channel|>'
chunk="${injects}${chunk:${#injects}}"
Logprobs are skipped over the beginning part of the perplexity prompt using a modified llama.cpp downstream server to compute perplexity. Discussion at: https://github.com/ggml-org/llama.cpp/issues/21388
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files from below:
| Link | Type | Size | Notes |
|---|---|---|---|
| gemma-4-31B-it.Q4_E_H.gguf | Q4_E_H | 17.6e9 B | 1.1 GB smaller than Q4_K_M |
| gemma-4-31B-it.mmproj.gguf | F16 | 1.2e9 B | multimedia projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: