m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF
GGUF quantizations of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B, the first publicly available REAP-40% pruned variant of MiniMax-M2.7.
Available quantizations
Sizes are approximate; the model card will refresh as each quant is uploaded to this repo.
| Variant | Approx. size | Target hardware | Notes |
|---|---|---|---|
| `Q4_K_M` | ~84 GB | 96 GB Apple Silicon (Mac Studio M4 Max) | Recommended sweet spot. Smoke-test verified 5/5. |
| `IQ4_XS` | ~74 GB | 96 GB Apple Silicon with extra headroom | Smaller than Q4_K_M, marginally lower quality. |
| `Q3_K_M` | ~66 GB | 64 GB Mac / 2× RTX 3090 | Budget option; expect some reasoning loss. |
| `Q6_K` | ~114 GB | 128 GB Mac Ultra | High-quality. |
| `Q8_0` | ~148 GB | 192+ GB systems | Near-lossless. |
| `IQ4_NL-MoE` | ~80 GB | 96 GB Mac / 2× RTX 3090 | MoE-aware: attn=Q8_0, experts=IQ4_NL, embed/output=Q6_K. Mirrors ubergarm's mainline-compatible recipe. |
Which should you pick?
- 96 GB Apple Silicon (Mac Studio M4 Max): Q4_K_M. Its ~84 GB leaves ~12 GB for KV cache at ~16K context.
- 64 GB Mac: Q3_K_M is the only variant that fits. Expect some reasoning-quality degradation.
- 128 GB Mac Ultra / 2× A6000: Q6_K for near-baseline quality.
- 192+ GB system (dual H100 / RTX 6000 Ada): Q8_0 for minimal quality loss.
- Alternative to Q4_K_M on 96 GB: `IQ4_NL-MoE` keeps attention at Q8_0 and quantizes only expert FFN tensors. Similar size, often better code/reasoning.
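The recommendations above amount to a simple lookup by available memory. A minimal sketch, assuming the thresholds are exactly the memory tiers named in the list; `pick_quant` and `RECOMMENDATIONS` are hypothetical names used only for illustration:

```python
from typing import Optional

# Hypothetical lookup table: (minimum memory in GB, suggested variant).
# The tiers are copied from the bullet list above; nothing else is implied.
RECOMMENDATIONS = [
    (192, "Q8_0"),
    (128, "Q6_K"),
    (96, "Q4_K_M"),
    (64, "Q3_K_M"),
]


def pick_quant(memory_gb: float) -> Optional[str]:
    """Return the variant suggested above for a given memory total, if any."""
    for min_gb, variant in RECOMMENDATIONS:
        if memory_gb >= min_gb:
            return variant
    return None  # below the smallest tier listed in this card


print(pick_quant(96))
```

On a 96 GB machine, `IQ4_NL-MoE` remains a valid alternative to the `Q4_K_M` value this returns.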
Evaluation
HumanEval pass@1 on Q4_K_M (completed generations only): 83.3 % (90 / 108)
For problems where the model completed its <think> reasoning within a 32 K-token generation budget, the Q4_K_M quant solved 90 of 108 correctly.
Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %
56 of 164 problems exhausted the 32 K reasoning budget mid-<think> and are counted as fails under strict academic scoring. Allocate ≥64 K tokens to approach the 83 % ceiling.
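The two scores differ only in their denominator. The arithmetic can be checked with a few lines, using only the counts reported above:

```python
# Counts taken from the evaluation figures above.
TOTAL_PROBLEMS = 164   # full HumanEval set
CAPPED_OUT = 56        # generations that exhausted the 32 K budget
SOLVED = 90            # correct solutions

completed = TOTAL_PROBLEMS - CAPPED_OUT       # problems with finished reasoning
pass_on_completed = SOLVED / completed        # lenient score
strict_pass = SOLVED / TOTAL_PROBLEMS         # cap-outs counted as fails

print(f"completed: {completed}")
print(f"pass@1 on completed: {pass_on_completed:.1%}")
print(f"strict pass@1: {strict_pass:.1%}")
```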
Methodology: 2 × H100 80 GB, llama.cpp /v1/chat/completions, native <think> enabled, temperature=0.2, top_p=0.95, max_tokens=32000.
Prior methodology note: an earlier evaluation using raw /v1/completions with chat-prose stripping (non-canonical for reasoning models) reported 65.2 %. The numbers above use the canonical chat-completion path.
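A request using the sampling settings above might look like the following sketch. It is not the evaluation harness itself. The base URL (llama-server's default port 8080), the helper names `build_request` and `complete`, and the response shape (the usual OpenAI-compatible `choices[0].message.content`) are assumptions; depending on server flags, the reasoning text may arrive in a separate field.

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Build a chat-completion request with the sampling settings listed above."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 32000,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the request to a running server and return the assistant message."""
    with urllib.request.urlopen(build_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

Calling `complete("Write a Python function that ...")` requires a running llama-server instance; `build_request` alone can be inspected offline.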
Smoke test (5 diverse pre-publish prompts): 5 / 5 PASS (trivial arithmetic, Python Fibonacci, Norwegian response, MoE semantic explanation, JSON tool-call echo).
Memory & context sizing for consumer hardware
96 GB Apple Silicon (primary target)
| Variant | File size | ctx 8K | ctx 32K | ctx 60K | ctx 131K |
|---|---|---|---|---|---|
| Q4_K_M | 84 GB | ✓ | with KV q8_0 | with KV q4_0 | requires KV q4_0 |
| IQ4_XS | 74 GB | ✓ | ✓ | ✓ | with KV q8_0 |
| Q3_K_M | 66 GB | ✓ | ✓ | ✓ | ✓ |
| IQ4_NL-MoE | 80 GB | ✓ | with KV q8_0 | with KV q4_0 | requires KV q4_0 |
| Q6_K / Q8_0 | 114 / 148 GB | too large for 96 GB system | n/a | n/a | n/a |
The native FP16 KV cache costs ~0.25 GB per 1K tokens for this architecture (62 layers × 1024 KV dim × 2 bytes). That is non-trivial at long context: Q4_K_M at ctx=60K needs ~15 GB of KV cache alone.
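The per-token cost can be reproduced with a short calculation. This sketch assumes that the 1024 KV dimension applies to K and V separately, so the product above is doubled; that extra factor is an assumption that makes the arithmetic match the stated ~0.25 GB per 1K tokens, and it uses decimal gigabytes.

```python
N_LAYERS = 62        # from the card
KV_DIM = 1024        # from the card
BYTES_FP16 = 2       # from the card
K_AND_V = 2          # assumption: one K tensor and one V tensor per layer


def kv_cache_gb(n_tokens: int, bytes_per_elem: float = BYTES_FP16) -> float:
    """Estimated KV cache size in decimal GB for a given context length."""
    bytes_per_token = N_LAYERS * KV_DIM * K_AND_V * bytes_per_elem
    return bytes_per_token * n_tokens / 1e9


for ctx in (1_000, 32_000, 60_000, 131_000):
    print(f"ctx={ctx:>7}: {kv_cache_gb(ctx):.2f} GB")
```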
KV cache quantization: essential for long context on 96 GB
llama.cpp supports quantizing the KV cache with near-zero quality loss:
./llama-server -m MiniMax-M2.7-REAP-139B-A10B-Q4_K_M.gguf -c 65536 -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0
| KV type | Size @ ctx=60K | Quality impact |
|---|---|---|
| FP16 (default) | 15 GB | baseline |
| `q8_0` | 7.5 GB | essentially lossless (recommended) |
| `q4_0` / `q4_1` | 3.8 GB | very small degradation, worth it for extreme context |
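Combining file size and KV cache size gives a rough fit check for a memory budget. The sketch below uses nominal bytes per element (2, 1, 0.5), which matches the halving in the table. Actual llama.cpp block formats carry a small per-block scale overhead, and compute buffers and OS usage are ignored here, so treat the result as optimistic. The helper names and the factor of 2 for K and V are assumptions.

```python
# Nominal storage cost per cached element; real block formats are slightly larger.
KV_BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

N_LAYERS = 62
KV_DIM = 1024
K_AND_V = 2  # assumption: K and V each store KV_DIM values per layer


def kv_gb(ctx: int, kv_type: str) -> float:
    """Estimated KV cache size in decimal GB."""
    per_token = N_LAYERS * KV_DIM * K_AND_V * KV_BYTES_PER_ELEM[kv_type]
    return per_token * ctx / 1e9


def fits(file_gb: float, ctx: int, kv_type: str, budget_gb: float = 96.0) -> bool:
    """True if weights plus KV cache stay within the budget (no other overhead)."""
    return file_gb + kv_gb(ctx, kv_type) <= budget_gb


for kv_type in KV_BYTES_PER_ELEM:
    size = kv_gb(60_000, kv_type)
    print(f"Q4_K_M @ 60K, {kv_type}: KV {size:.1f} GB, fits={fits(84, 60_000, kv_type)}")
```

Because the check leaves no room for runtime buffers, a configuration that passes with only a few GB to spare should be considered tight in practice.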
Other systems
- 64 GB Mac / 2× RTX 3090: Q3_K_M with `q8_0` KV fits at ctx=32K.
- 128 GB Mac Ultra: Q6_K comfortably at ctx=32K, tight at longer context.
- Dual H100 (160 GB) / 192 GB+ systems: Q8_0 near-lossless, full context.
Known minor imperfection
During the integrity audit, one layer (layer 0) had expert keep-indices that differed from the REAP-retained set in ~86 of 154 positions. The resulting bias-value mismatch is bounded by the natural variance of the layer-0 bias (max |Δ| = 0.75 on values ∈ [8.06, 8.88]), so router behavior is essentially unchanged, as confirmed by the 5/5 smoke test above. The other 61 layers are bit-perfect. Details are in the safetensors model card.
Citation
See the safetensors repo for full citation details. Core references:
- Lasby et al., REAP the Experts (arXiv:2510.13999)
- MiniMax-M2.7 base model (MiniMaxAI)
License
Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.
Published by m51Lab β open-source LLM contributions from the M51 AI OS group.