---
license: wtfpl
---

**NOTE**: These weights require a build of llama.cpp that includes commit `c2101a2`. See [pull request #5328](https://github.com/ggerganov/llama.cpp/pull/5328) for details.

From the author of the llama.cpp patch:

> I started working on this as an experiment and because I wanted to try Mamba models with llama.cpp (also, there have been quite a few finetunes already).
>
> Turns out that implementing support for a novel model architecture is quite fun (well, at least when it finally works).
>
> The most powerful machine on which I try LLMs is a low-power laptop with 8GB of ram and an Intel CPU (no discrete GPU), so I can't try Mamba-3B in its full f32 glory (the full weights take 11GB), but at least now it's possible to use it quantized.

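
To make the memory argument in the quote concrete, here is a rough back-of-the-envelope sketch. Only the 11 GB f32 figure comes from the quote. The bits-per-weight values are the nominal block sizes of the `q8_0` and `q4_0` formats, and applying them uniformly to every tensor is a simplifying assumption, since real GGUF files keep some tensors at higher precision. The function name `estimate_size_gb` is hypothetical.

```python
# Rough size estimate for model weights at different storage widths.
# The 11 GB f32 figure is from the quote; everything else is an assumption.

F32_BITS = 32

# Nominal bits per weight, including the per-block scale.
# Assumed to apply to every tensor, which real files do not do.
BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 weights plus one f16 scale per block
    "q4_0": 4.5,  # 32 4-bit weights plus one f16 scale per block
}


def estimate_size_gb(f32_size_gb: float, bits_per_weight: float) -> float:
    """Scale an f32 weight size down to another storage width."""
    return f32_size_gb * bits_per_weight / F32_BITS


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>5}: ~{estimate_size_gb(11.0, bits):.1f} GB")
```

Under these assumptions the quantized estimates land at roughly 3 GB or less, which is why an 8 GB machine can plausibly run them while the f32 weights do not fit. Actual file sizes will be somewhat larger than this estimate.
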
> Constant memory usage is a big advantage of Mamba models, but this also means that previous states are not all kept in memory (at least in the current implementation, only the last one is kept), which means there might be more prompt re-processing than necessary in the server example, especially if your client trims the end of the output (it's also problematic that the stop token(s) are not included in the server's responses). The main example has no such problem.

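
The re-processing issue described above can be illustrated with a toy model. This is not how the llama.cpp server is implemented; it is a minimal sketch, assuming a cache that remembers only the state after the most recent token. The class `LastStateCache` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class LastStateCache:
    """Toy cache that keeps only the state after the most recent token."""

    tokens: list[int] = field(default_factory=list)

    def append_generated(self, generated: list[int]) -> None:
        """Generated tokens advance the saved state too."""
        self.tokens.extend(generated)

    def tokens_to_process(self, prompt: list[int]) -> int:
        """Return how many tokens must be run through the model for `prompt`."""
        n = len(self.tokens)
        if prompt[:n] == self.tokens:
            # The prompt extends what the state has already seen,
            # so only the new suffix needs processing.
            cost = len(prompt) - n
        else:
            # Any divergence (for example a trimmed ending) invalidates the
            # single saved state, so processing restarts from an empty state.
            cost = len(prompt)
        self.tokens = list(prompt)
        return cost


if __name__ == "__main__":
    cache = LastStateCache()
    print(cache.tokens_to_process([1, 2, 3]))
    cache.append_generated([4, 5, 6])
    # The client sends back the full history plus two new tokens.
    print(cache.tokens_to_process([1, 2, 3, 4, 5, 6, 7, 8]))
    cache.append_generated([9, 10])
    # The client trimmed the last generated token before adding its own.
    print(cache.tokens_to_process([1, 2, 3, 4, 5, 6, 7, 8, 9, 11]))
```

In this sketch the third request costs a full re-processing pass even though nine of its ten tokens were already seen, because a single recurrent state cannot be rolled back by one token. A Transformer key/value cache can instead be truncated to the common prefix, which is why the same trimming is cheaper there.
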
> Currently, the initial text generation speed for Mamba is a bit slower than for Transformer-based models (with empty context), but unlike them, Mamba's speed does not degrade with the amount of tokens processed.
>
> Also note that quantization may make the state unstable (making the output gibberish), but this needs more testing to figure out how much this happens (because I only saw it happen with very small models (130M), and not yet with bigger ones (3B)).

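
The claim above that speed does not degrade with the number of tokens follows from the recurrent form of state-space models: each token updates a fixed-size state, so the work per token stays the same. The sketch below uses a heavily simplified diagonal linear recurrence, not Mamba's actual selective scan, and the names `ssm_step` and `run_sequence` as well as the parameters `a`, `b`, `c` are illustrative assumptions.

```python
def ssm_step(h, x, a, b, c):
    """One recurrence step: h' = a*h + b*x (elementwise), y = sum(c*h')."""
    h_new = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
    y = sum(ci * hi for ci, hi in zip(c, h_new))
    return h_new, y


def run_sequence(xs, a, b, c):
    """Process a whole input sequence while keeping only the latest state."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # Every step touches len(a) state entries, however long xs is.
        h, y = ssm_step(h, x, a, b, c)
        ys.append(y)
    return h, ys


if __name__ == "__main__":
    a, b, c = [0.5, 0.9], [1.0, 1.0], [1.0, 0.1]
    for n in (10, 1000):
        state, outputs = run_sequence([1.0] * n, a, b, c)
        print(n, len(state), len(outputs))
```

The state has the same length after 10 tokens as after 1000, which is also why memory usage stays constant. An attention layer, by contrast, has to read a cache that grows with every token processed.
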
> For testing, I recommend converting from https://huggingface.co/state-spaces/mamba-130m-hf since it's small, the config.json doesn't require modification, the tokenizer is already next to the model files, and the token_embd weight is shared with the output weight, so the download is smaller.