nan kld
Hey, just wanted to let you know that I am getting the nan kld, too.
I did my baseline from a custom q8 that is basically the full-precision model (q8 vs f8 for the experts).
Did you use the full BF16 precision as the baseline?
I am also getting nan for Unsloth's UD-Q4_K_XL, and I am actually using a quant that gives nan in llama-perplexity.
So it might not be quant related but something else?
Curious what you found out so far.
I used the BF16 for the baseline, yes.
The same thing happened with the Q4-ish quants of the Mistral Small 120B model: they produced nans during testing too.
I think something is wrong in lcpp, a numerical issue somewhere, but I don't have any further info at the moment. My GPUs are tied up with other testing today, so I can explore this more tomorrow.
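For readers unfamiliar with the metric: KLD here compares the token probabilities of the baseline against those of the quant. The following is a minimal Python sketch, not llama.cpp's implementation, of why a single nan logit from the quant is enough to make the KLD for that token nan. The function names and the sample logits are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for one token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline = [2.0, 1.0, 0.1]          # hypothetical baseline logits
healthy = [1.5, 1.2, 0.3]           # hypothetical quant logits
broken = [2.0, float("nan"), 0.1]   # one nan logit from the quant

print(kl_divergence(baseline, healthy))  # small positive number
print(kl_divergence(baseline, broken))   # nan
```

Because the nan ends up in the softmax normalizer, every probability for that token becomes nan, and any mean taken over tokens that includes it is nan as well.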
Hi @AesSedai
> I used the BF16 for the baseline, yes.
Can you link to where you obtained the BF16 weights? https://huggingface.co/MiniMaxAI/MiniMax-M2.7 only has about 230GB of tensors, which I assume must be fp8.
I found a (seemingly) functioning abliterated version of M2.7, Youssofal/MiniMax-M2.7-abliterated-BF16, with its respective GGUFs at Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF.
Q3_K_M is working, apart from a few sus things, like rare errors in names (e.g., outputting Amano instead of Amane). No idea if that is due to the Q3 quantization itself or something else.
It is not clear whether Q4_K_M has the issues you guys were talking about, and if it does, I'm not sure the author will attempt to address them.
The question is: if you have time in your schedule and it's not too burdensome, could you please have a look at it later? I'm not demanding new quants, of course.
Late edit: just minor stuff, never mind.
I don't really quant finetunes, if that's what you're asking, but I do plan on looking more into the Q4_K_M nan issue. I've got my rig crunching some lineage bench testing for someone else at the moment, but I should be able to look into it further tomorrow evening, I hope.
If you want to test whether Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF has the nan issue, you should be able to reproduce it by running perplexity on the model, e.g.:
./build/bin/llama-perplexity \
--file /path/to/wiki.test.raw \
--model /path/to/MiniMax-M2.7-abliterated-Q4_K_M.gguf-00001-of-00004.gguf
Any text file should work for testing, but wiki.test.raw is a pretty common one to use for PPL. If the quant has the issue, it'll show up as nan on some of the output rows.
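If the output is long, the nan rows can be picked out with a small script. This is a rough Python sketch that assumes the per-chunk estimates are printed in the form `[1]4.2100,[2]nan,...`; that format, the helper name `find_nan_chunks`, and the sample string are assumptions for illustration, so adjust the pattern to whatever your build actually prints.

```python
import re

# Assumed row format: "[chunk_index]value", e.g. "[1]4.2100,[2]nan".
CHUNK_RE = re.compile(r"\[(\d+)\]\s*([^,\s\[]+)")

def find_nan_chunks(output_text):
    """Return the chunk indices whose reported value is nan (or -nan)."""
    return [
        int(index)
        for index, value in CHUNK_RE.findall(output_text)
        if "nan" in value.lower()
    ]

# Made-up sample output, not a real run.
sample = "[1]4.2100,[2]nan,[3]5.0213,[4]-nan"
print(find_nan_chunks(sample))  # prints [2, 4]
```

An empty list means no chunk reported nan in the text that was scanned.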
Got it! Oh, and no, not really: I'm just being overly cautious, not asking for new GGUFs, like I mentioned.
I'm currently doing some initial conversational tests with the Q4_K_M of that specific M2.7, and it seems to be doing well so far.
Will attempt to check it for the NaN issue later (if my last functioning brain cell doesn't give up on me, lol).
Interesting comment by bartowski about CUDA and NaN: https://old.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/og8kk8x/
I bumped layer 61's ffn_down_exps up to Q6_K and am no longer getting nan in KLD / PPL, so that will be uploading shortly. The issue seems to be Q4_K or Q5_K being used for that layer's ffn_down_exps, maybe an activation overflow or something in lcpp. I'm not sure exactly, but swapping that one layer's quantization level did indeed resolve it.
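To illustrate the overflow guess only: the Python sketch below shows how an accumulator that saturates at the half-precision limit can turn into inf, and how combining two such values then yields nan. It does not model llama.cpp's actual kernels or the real layer 61 tensors; `to_fp16_range`, `dot_saturating`, and all the numbers are hypothetical.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE half-precision value

def to_fp16_range(x):
    """Crude stand-in for an fp16 store: out-of-range values become +/-inf."""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x

def dot_saturating(weights, activations):
    """Dot product whose running sum is clamped to the fp16 range each step."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = to_fp16_range(acc + w * a)
    return acc

# Made-up values: large weights times large activations.
pos = dot_saturating([300.0, 300.0], [200.0, 200.0])    # overflows to inf
neg = dot_saturating([-300.0, -300.0], [200.0, 200.0])  # overflows to -inf

print(pos, neg)   # inf -inf
print(pos + neg)  # nan, since inf + (-inf) is undefined
```

Once a single inf reaches a later step that subtracts or adds another inf (a softmax's max subtraction, for example), the result is nan, which would be consistent with nan appearing only on some rows.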
@blankreg I tried Bart's trick but it doesn't work. They edited the comment themselves and it still NaNs. The only solution that seems to have worked is the Q6_K trick, which Aes also applied to Q4_K_M, I think today.