nan kld

by krampenschiesser - opened Apr 13

Discussion

krampenschiesser

Apr 13 (edited Apr 13)

Hey, just wanted to let you know that I am getting the NaN KLD, too.
My baseline was a custom Q8 that is basically the full-precision model (Q8 vs. FP8 for the experts).
Did you use the full BF16 precision as the baseline?
I am also getting NaN for Unsloth's UD-Q4_K_XL, and the quant I actually use also gives NaN in llama-perplexity.
So it might not be quant-related but something else?
Curious what you have found out so far.
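
For context on why a single bad value shows up as "nan KLD": the KL divergence between a baseline and a quant is computed per token from the two models' logits, so one NaN logit from the quant poisons the whole result for that position. The following is a minimal pure-Python sketch of that idea. It is not llama.cpp's implementation, and the function names and example logits are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for one token position, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

base = [2.0, 1.0, 0.1]           # hypothetical baseline logits
quant = [1.9, 1.1, 0.1]          # hypothetical quant logits, slightly off
broken = [1.9, float("nan"), 0.1]  # same, with one NaN logit

print(kl_divergence(base, quant))   # a small positive number
print(kl_divergence(base, broken))  # nan
```

Because the NaN passes through the softmax normalizer, every probability for that position becomes NaN, which is why the aggregate statistic cannot recover from even one bad row.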

AesSedai

Owner Apr 13

I used the BF16 for the baseline, yes.

The same thing happened with the Q4-ish quants of the Mistral Small 120B model, which also produced NaNs in testing.

I think there's a numerical issue somewhere in llama.cpp, but I don't have any further info at the moment. My GPUs are tied up with other testing today, so I can explore this more tomorrow.

tarruda

Apr 13

Hi @AesSedai

> I used the BF16 for the baseline, yes.

Can you link to where you obtained the BF16 weights? https://huggingface.co/MiniMaxAI/MiniMax-M2.7 only has about 230GB of tensors, which I assume must be fp8.

AesSedai

Owner Apr 13

@tarruda I meant the BF16 GGUF produced by convert_hf_to_gguf, which I suppose means it was upcast; I hadn't noticed that the safetensors weights were in FP8. So I don't have BF16 safetensors, sorry :(

AutisticPancake

Apr 14 (edited Apr 14)

@AesSedai

I found a (seemingly) functioning abliterated version of M2.7, Youssofal/MiniMax-M2.7-abliterated-BF16, with its GGUFs at Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF.
Q3_K_M is working, apart from a few sus things like rare errors in names (e.g., outputting Amano instead of Amane); no idea whether that's due to the Q3 quantization itself or something else.
It is not clear whether Q4_K_M has the issues you guys were talking about, and if it does, I'm not sure the author will attempt to address them.
If you have time and it's not too burdensome, could you please have a look at it later? I'm not demanding new quants, of course.

late edit: just minor stuff, nevermind it

AesSedai

Owner Apr 14

I don't really quant finetunes if that's what you're asking, but I do plan on looking more into the Q4_K_M nan issue. I've got my rig crunching some lineage bench testing for someone else at the moment but should be able to look into it further tomorrow evening I hope.

If you want to test whether Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF has the NaN issue, you should be able to reproduce it by running perplexity on the model, e.g.:

./build/bin/llama-perplexity \
    --file /path/to/wiki.test.raw \
    --model /path/to/MiniMax-M2.7-abliterated-Q4_K_M.gguf-00001-of-00004.gguf

Any text file should work for testing, but wiki.test.raw is a pretty common one to use for PPL. If the quant has the issue, it'll show up as nan on some of the output rows.
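
Spotting the nan rows by eye gets tedious on a long run, so the check can be scripted. Below is a small sketch that scans captured llama-perplexity output for NaN chunks. It assumes the per-chunk estimates are printed in the form `[1]4.8123,[2]5.1044,...`; if your build formats them differently, the regular expression needs adjusting. The helper name and sample text are hypothetical.

```python
import re

# Assumed output shape: per-chunk estimates such as "[1]4.8123,[2]nan,".
CHUNK_RE = re.compile(r"\[(\d+)\]\s*(-?nan|[-+]?\d+(?:\.\d+)?)", re.IGNORECASE)

def find_nan_chunks(output: str) -> list[int]:
    """Return the chunk numbers whose reported value is NaN."""
    return [
        int(idx)
        for idx, val in CHUNK_RE.findall(output)
        if "nan" in val.lower()
    ]

# Made-up sample in the assumed format, not real model output.
sample = "[1]4.8123,[2]5.1044,[3]nan,[4]nan,"
print(find_nan_chunks(sample))  # [3, 4]
```

To use it, redirect the command's output to a file and pass the file contents to `find_nan_chunks`; an empty list means no NaN rows were found in the text that was scanned.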

AutisticPancake

Apr 14

Got it! Oh, and no, not really - I'm just being overly cautious rather than asking for new GGUFs, like I mentioned.
I'm currently doing some initial conversational tests with the Q4_K_M of that specific M2.7, and it seems to be doing well so far.
I'll try to check it for the NaN issue later (if my last functioning brain cell doesn't give up on me, lol).

blankreg

Apr 15

Interesting comment by bartowski about CUDA and NaN: https://old.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/og8kk8x/

AesSedai

Owner Apr 15

I bumped the layer 61 ffn_down_exps up to Q6_K and am no longer getting NaN in KLD / PPL, so that will be uploading shortly. The issue seems to be Q4_K or Q5_K being used for that layer's ffn_down_exps, maybe an activation overflow or something similar in llama.cpp. I'm not sure exactly, but swapping that one layer's quantization level did resolve it.
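
The exact recipe used for the re-upload is not given in this thread. As a rough illustration of what a one-tensor override could look like, the sketch below builds a llama-quantize command line. It assumes a llama.cpp build whose `--tensor-type` option accepts a `PATTERN=TYPE` pair matched against tensor names; the pattern, the file paths, and the helper name are all hypothetical, so check `llama-quantize --help` on your build before relying on it.

```python
import shlex

def build_quantize_argv(src_gguf, dst_gguf, base_type="Q4_K_M",
                        overrides=None,
                        binary="./build/bin/llama-quantize"):
    """Build (but do not run) a llama-quantize command with tensor overrides."""
    argv = [binary]
    # Each override becomes "--tensor-type PATTERN=TYPE" (assumed flag syntax).
    for pattern, ggml_type in (overrides or {}).items():
        argv += ["--tensor-type", f"{pattern}={ggml_type}"]
    # Positional arguments: input file, output file, base quantization type.
    argv += [src_gguf, dst_gguf, base_type]
    return argv

argv = build_quantize_argv(
    "/path/to/model-BF16.gguf",     # hypothetical input path
    "/path/to/model-Q4_K_M.gguf",   # hypothetical output path
    overrides={r"blk\.61\.ffn_down_exps": "q6_k"},  # hypothetical pattern
)
print(shlex.join(argv))
```

Building the argument list separately from running it makes it easy to inspect the command first, and to pass the same list to `subprocess.run` once it looks right.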

danielhanchen

Apr 16

@blankreg I tried Bart's trick but it doesn't work. They edited the comment themselves and it still NaNs. The only solution that seems to have worked is the Q6_K trick, which I think Aes also applied to Q4_K_M today.
