Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
This repository contains the code and quantized models for Quamba2, a robust and scalable post-training quantization framework for Selective State Space Models (SSMs). Quamba2 is compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms.
Quamba2 proposes an offline approach to quantize the inputs of a linear recurrence in 8-bit, sorting and clustering the input $x$ and applying per-state-group quantization to the input-dependent parameters $B$ and $C$. It delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop.
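To make the per-group idea concrete, here is a minimal, illustrative sketch of symmetric per-group 8-bit quantization in pure Python. This is not the Quamba2 implementation (which works on SSM states with sorting and clustering); the function names and the toy data are assumptions for illustration. The point it demonstrates is that giving each group its own scale keeps small-magnitude groups from being crushed by an outlier group's range.

```python
def quantize_per_group(x, group_ids, n_bits=8):
    """Symmetric per-group quantization: each group gets its own scale.

    Illustrative sketch only; Quamba2's actual scheme additionally sorts
    and clusters inputs offline and targets SSM parameters B and C.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8
    scales = {}
    for g in set(group_ids):
        amax = max(abs(v) for v, gid in zip(x, group_ids) if gid == g)
        scales[g] = amax / qmax if amax > 0 else 1.0
    # Round to the nearest integer level, clipped to the int8 range.
    q = [max(-qmax - 1, min(qmax, round(v / scales[g])))
         for v, g in zip(x, group_ids)]
    return q, scales

def dequantize(q, group_ids, scales):
    """Map integer codes back to floats using each group's scale."""
    return [qi * scales[g] for qi, g in zip(q, group_ids)]

# Toy data: group 0 has small values, group 1 has large ones.
x = [0.1, -0.05, 0.08, 120.0, -90.0, 60.0]
gids = [0, 0, 0, 1, 1, 1]
q, scales = quantize_per_group(x, gids)
x_hat = dequantize(q, gids, scales)
```

With a single shared scale, the 0.1-magnitude group would collapse to one or two quantization levels; per-group scales bound each element's error by half of its own group's step size.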
- Paper: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
- Project page: https://hychiang.info/projects/quamba2
- Code repository: https://github.com/enyac-group/Quamba
Key Features
- 🔧 Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2
- 💻 4× memory reduction
- 🚀 Achieves 13 tokens per second on Orin Nano 8G with Mamba2-8b
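The 4× memory-reduction figure follows directly from the bit widths: 4-bit weights occupy a quarter of the space of a 16-bit baseline. A quick back-of-the-envelope sketch (the 2.7e9 parameter count is an approximation for the 2.7b model, and this counts weights only, not activations or KV/state caches):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory footprint of the weights alone, in gigabytes (decimal GB)."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate parameter count for a 2.7b model.
fp16_gb = weight_memory_gb(2.7e9, 16)  # half-precision baseline: 5.40 GB
w4_gb = weight_memory_gb(2.7e9, 4)     # 4-bit quantized weights: 1.35 GB
print(f"FP16: {fp16_gb:.2f} GB, W4: {w4_gb:.2f} GB, "
      f"reduction: {fp16_gb / w4_gb:.0f}x")
```

This weights-only ratio is what makes an 8b-scale model fit on an 8 GB edge device like the Orin Nano.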
Usage
For detailed setup instructions and environment requirements, please refer to the GitHub repository.
To generate text with a quantized Quamba2 model, first download the model checkpoint. For example, to download the quamba2-2.7b-w4a8 model:
huggingface-cli download ut-enyac/quamba2-2.7b-w4a8 --local-dir pretrained_models/ut-enyac/quamba2-2.7b-w4a8
Then, you can use the generate.py script provided in the repository:
python generate.py ut-enyac/quamba2-2.7b-w4a8 --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2 --quantize --cache_graph --pretrained_dir pretrained_models
Citation
If you find this work helpful or inspiring, please cite our paper:
@inproceedings{chiang2025quamba2,
title = {Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models},
author = {Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
booktitle = {Forty-Second International Conference on Machine Learning (ICML)},
year = {2025}
}