Instructions for using dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with libraries and local apps.

Libraries

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

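from llama_cpp import Llama

# Downloads the GGUF file from the Hub on first use, then loads it.
llm = Llama.from_pretrained(
    repo_id="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF",
    filename="MiniMax-M2.7-REAP-139B-IQ4_NL-MoE.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
# Print only the assistant's reply text.
print(response["choices"][0]["message"]["content"])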

Local Apps

llama.cpp

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

vLLM

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
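
The same request can be sketched in Python with only the standard library. This is a minimal illustration, not part of the original instructions: it assumes the server from the curl example is listening at `http://localhost:8000/v1`, and `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed values: they mirror the curl example, adjust them to your server.
BASE_URL = "http://localhost:8000/v1"
MODEL = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With a server running:
# print(ask("What is the capital of France?"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.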

Ollama
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Ollama:
```
ollama run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
```

Unsloth Studio

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF to start chatting

Pi

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Docker Model Runner:
```
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M
```

Lemonade

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF-Q4_K_M

List all available models

lemonade list

m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF

GGUF quantizations of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B, the first publicly available REAP-40% pruned variant of MiniMax-M2.7.

Available quantizations

Sizes are approximate; the model card will refresh as each quant is uploaded to this repo.

| Variant | Approx. size | Target hardware | Notes |
| --- | --- | --- | --- |
| `Q4_K_M` | ~84 GB | 96 GB Apple Silicon (Mac Studio M4 Max) | Recommended sweet spot. Smoke-test verified 5/5. |
| `IQ4_XS` | ~74 GB | 96 GB Apple Silicon with extra headroom | Smaller than Q4_K_M, marginally lower quality. |
| `Q3_K_M` | ~66 GB | 64 GB Mac / 2×RTX 3090 | Budget option; expect some reasoning loss. |
| `Q6_K` | ~114 GB | 128 GB Mac Ultra | High-quality. |
| `Q8_0` | ~148 GB | 192+ GB systems | Near-lossless. |
| `IQ4_NL-MoE` | ~80 GB | 96 GB Mac / 2×RTX 3090 | MoE-aware: `attn=Q8_0`, `experts=IQ4_NL`, `embed/output=Q6_K`. Mirrors ubergarm's mainline-compatible recipe. |

Which should you pick?

96 GB Apple Silicon (Mac Studio M4 Max): Q4_K_M — ~84 GB leaves ~12 GB for KV cache at ~16K context.
64 GB Mac: Q3_K_M is the only variant that fits. Expect some reasoning-quality degradation.
128 GB Mac Ultra / 2× A6000: Q6_K for near-baseline quality.
Dual H100 (160 GB total) or 192+ GB systems: Q8_0 for minimal quality loss.
Alternative to Q4_K_M on 96 GB: IQ4_NL-MoE keeps attention at Q8_0 and quantizes only expert FFN tensors. Similar size, often better code/reasoning.
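
As a rough illustration of the selection logic, the sketch below picks the largest variant whose file fits in a memory budget after reserving room for the KV cache. The sizes are the approximate figures from the table above. The function name, the 12 GB default headroom (taken from the Q4_K_M note), and the use of file size as a stand-in for quality are assumptions made for this example.

```python
# Approximate file sizes in GB, copied from the quantization table.
VARIANT_SIZES_GB = {
    "Q3_K_M": 66,
    "IQ4_XS": 74,
    "IQ4_NL-MoE": 80,
    "Q4_K_M": 84,
    "Q6_K": 114,
    "Q8_0": 148,
}


def largest_variant_that_fits(memory_gb, kv_headroom_gb=12.0):
    """Return the largest variant whose file fits after reserving
    kv_headroom_gb for the KV cache, or None if nothing fits.

    Assumption: a larger file is treated as a proxy for higher quality.
    """
    budget = memory_gb - kv_headroom_gb
    fitting = [
        (size, name) for name, size in VARIANT_SIZES_GB.items() if size <= budget
    ]
    return max(fitting)[1] if fitting else None


print(largest_variant_that_fits(96))   # Q4_K_M
print(largest_variant_that_fits(128))  # Q6_K
print(largest_variant_that_fits(192))  # Q8_0
```

This size-only rule is stricter than the list above: for a 64 GB budget it returns `None`, because the ~66 GB Q3_K_M file alone exceeds 64 GB.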

Evaluation

HumanEval pass@1 on Q4_K_M (completed problems only): 83.3 % (90 / 108)

For problems where the model completed its <think> reasoning within a 32 K-token generation budget, the Q4_K_M quant solved 90 of 108 correctly.

Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %

56 of 164 problems exhausted the 32 K reasoning budget mid-<think> and are counted as fails under strict academic scoring. Allocate ≥64 K tokens to approach the 83 % ceiling.
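
The two scores follow directly from the counts reported above; a short check of the arithmetic:

```python
TOTAL_PROBLEMS = 164  # full HumanEval set
CAPPED_OUT = 56       # exhausted the 32 K budget mid-<think>
SOLVED = 90           # completed and passed

completed = TOTAL_PROBLEMS - CAPPED_OUT    # 108
pass_on_completed = SOLVED / completed     # solved / completed
strict_pass = SOLVED / TOTAL_PROBLEMS      # cap-outs counted as fails

print(f"completed problems:  {completed}")              # 108
print(f"pass@1 on completed: {pass_on_completed:.1%}")  # 83.3%
print(f"strict pass@1:       {strict_pass:.1%}")        # 54.9%
```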

Methodology: 2 × H100 80 GB, llama.cpp /v1/chat/completions, native <think> enabled, temperature=0.2, top_p=0.95, max_tokens=32000.
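
The card does not say how cap-outs were detected. One plausible approach, sketched here as an assumption, is to check the `finish_reason` field of the OpenAI-compatible response, which is `"length"` when generation stops at `max_tokens`. The helper names `hit_token_cap` and `strict_pass_at_1` are hypothetical.

```python
def hit_token_cap(response):
    """True if the first choice stopped because max_tokens was reached."""
    return response["choices"][0].get("finish_reason") == "length"


def strict_pass_at_1(results):
    """Strict pass@1 over (response, passed_tests) pairs.

    A capped-out response counts as a fail even if passed_tests is True.
    """
    if not results:
        return 0.0
    passed = sum(
        1 for response, ok in results if ok and not hit_token_cap(response)
    )
    return passed / len(results)


# Tiny hand-made example: one clean pass, one cap-out, one wrong answer.
example = [
    ({"choices": [{"finish_reason": "stop"}]}, True),
    ({"choices": [{"finish_reason": "length"}]}, False),
    ({"choices": [{"finish_reason": "stop"}]}, False),
]
print(round(strict_pass_at_1(example), 3))  # 0.333
```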

Prior methodology note: an earlier evaluation using raw /v1/completions with chat-prose stripping (non-canonical for reasoning models) reported 65.2 %. The numbers above use the canonical chat-completion path.

Smoke test (5 diverse pre-publish prompts): 5 / 5 PASS — trivial arithmetic, Python Fibonacci, Norwegian response, MoE semantic explanation, JSON tool-call echo.

Memory & context sizing for consumer hardware

96 GB Apple Silicon (primary target)

| Variant | File size | ctx 8K | ctx 32K | ctx 60K | ctx 131K |
| --- | --- | --- | --- | --- | --- |
| Q4_K_M | 84 GB | ✓ | ✓ w/ KV `q8_0` | ✓ w/ KV `q4_0` | requires KV `q4_0` |
| IQ4_XS | 74 GB | ✓ | ✓ | ✓ | ✓ w/ KV `q8_0` |
| Q3_K_M | 66 GB | ✓ | ✓ | ✓ | ✓ |
| IQ4_NL-MoE | 80 GB | ✓ | ✓ w/ KV `q8_0` | ✓ w/ KV `q4_0` | requires KV `q4_0` |
| Q6_K / Q8_0 | 114 / 148 GB | too large for 96 GB system | — | — | — |

The native FP16 KV cache costs ~0.25 GB per 1K tokens for this architecture (62 layers × 1024 KV dim × 2 tensors (K and V) × 2 bytes ≈ 254 KB per token). That is non-trivial at long context: Q4_K_M at ctx=60K needs ~15 GB of KV cache alone.
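
The per-token cost can be checked with a few lines of arithmetic. This sketch uses the layer count and KV dimension quoted above and counts both the K and the V tensor; `kv_cache_bytes` is an illustrative helper, and sizes are reported in decimal GB.

```python
N_LAYERS = 62    # from the text above
KV_DIM = 1024    # width of K (or V) per layer, from the text above
BYTES_FP16 = 2


def kv_cache_bytes(n_tokens, bytes_per_value=BYTES_FP16):
    """Bytes needed for the K and V caches of all layers at n_tokens."""
    per_token = N_LAYERS * KV_DIM * 2 * bytes_per_value  # 2 = one K + one V
    return n_tokens * per_token


per_1k_gb = kv_cache_bytes(1_000) / 1e9
at_60k_gb = kv_cache_bytes(60_000) / 1e9
print(f"{per_1k_gb:.3f} GB per 1K tokens")  # 0.254 GB per 1K tokens
print(f"{at_60k_gb:.1f} GB at 60K tokens")  # 15.2 GB at 60K tokens
```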

KV cache quantization — essential for long context on 96 GB

llama.cpp supports quantizing the KV cache with near-zero quality loss:

./llama-server -m MiniMax-M2.7-REAP-139B-A10B-Q4_K_M.gguf \
  -c 65536 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0

| KV type | Size @ ctx=60K | Quality impact |
| --- | --- | --- |
| FP16 (default) | 15 GB | baseline |
| `q8_0` | 7.5 GB | essentially lossless (recommended) |
| `q4_0` / `q4_1` | 3.8 GB | very small degradation, worth it for extreme context |
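
For a slightly finer estimate, the sketch below uses the ggml block layouts, in which each block of 32 values also stores a scale (and, for `q4_1`, a minimum). The table's figures correspond to an ideal 8 or 4 bits per value, so the results here come out a little higher. The layer count and KV dimension are the ones quoted earlier; `kv_cache_gb` is an illustrative helper, not a llama.cpp API.

```python
# Bytes used to store 32 values in each cache type.
# f16 is not blocked: 32 values x 2 bytes = 64 bytes.
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18, "q4_1": 20}

VALUES_PER_TOKEN = 62 * 1024 * 2  # layers x KV dim x (K and V)


def kv_cache_gb(n_tokens, cache_type):
    """Estimated KV cache size in decimal GB for the given cache type."""
    bytes_per_value = BYTES_PER_32_VALUES[cache_type] / 32
    return n_tokens * VALUES_PER_TOKEN * bytes_per_value / 1e9


for cache_type in BYTES_PER_32_VALUES:
    print(f"{cache_type}: {kv_cache_gb(60_000, cache_type):.1f} GB")
# f16: 15.2 GB
# q8_0: 8.1 GB
# q4_0: 4.3 GB
# q4_1: 4.8 GB
```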

Other systems

64 GB Mac / 2× RTX 3090: Q3_K_M with q8_0 KV fits at ctx=32K.
128 GB Mac Ultra: Q6_K comfortably at ctx=32K, tight at longer context.
Dual H100 (160 GB) / 192 GB+ systems: Q8_0 near-lossless, full context.

Known minor imperfection

During the integrity audit, one layer (layer 0) was found to have expert keep-indices that differ from the REAP-retained set in ~86 of 154 positions. The resulting bias-value mismatch is bounded by the natural variance of the layer-0 bias (max |Δ|=0.75 on values ∈ [8.06, 8.88]), so router behavior should be essentially unchanged, which is consistent with the 5/5 smoke test above. The other 61 layers are bit-perfect. Details are in the safetensors model card.

Citation

See the safetensors repo for full citation details. Core references:

Lasby et al., REAP the Experts (arXiv:2510.13999)
MiniMax-M2.7 base model (MiniMaxAI)

License

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.

Published by m51Lab — open-source LLM contributions from the M51 AI OS group.

GGUF

Model size: 139B params
Architecture: minimax-m2
Hardware compatibility: 3-bit, 4-bit, 6-bit, 8-bit

Model tree for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF

Base model: MiniMaxAI/MiniMax-M2.7
Finetuned: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
Quantized (6): this model

Paper for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (arXiv:2510.13999, published Oct 15, 2025)