
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

This repository contains the code and quantized models for Quamba2, a robust and scalable post-training quantization framework for Selective State Space Models (SSMs). Quamba2 is compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms.

Quamba2 quantizes the inputs to the linear recurrence in 8 bits using an offline sort-and-cluster approach for the input $x$, combined with per-state-group quantization for the input-dependent parameters $B$ and $C$. It delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering a 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop.
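The snippet below illustrates the idea in PyTorch. It is a minimal, hypothetical sketch of sort-and-cluster plus per-group symmetric 8-bit quantization, not the repository's implementation; all function names here are assumptions made for illustration.

import torch

def cluster_channels_by_max(calib_x: torch.Tensor, n_groups: int) -> torch.Tensor:
    # Offline "sort and cluster" (illustrative): order channels by their
    # calibration-time max magnitude, then split the sorted order into
    # contiguous groups so channels with similar ranges share one scale.
    order = calib_x.abs().amax(dim=0).argsort()
    group_ids = torch.empty_like(order)
    group_ids[order] = torch.arange(order.numel()) * n_groups // order.numel()
    return group_ids

def per_group_int8_quantize(t: torch.Tensor, group_ids: torch.Tensor):
    # Symmetric 8-bit quantization with one scale per channel group,
    # analogous to per-state-group scales for B and C.
    n_groups = int(group_ids.max()) + 1
    scales = torch.ones(n_groups)
    for g in range(n_groups):
        scales[g] = t[..., group_ids == g].abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(t / scales[group_ids]).clamp(-128, 127).to(torch.int8)
    return q, scales

# Usage: derive groups from calibration data, then quantize activations.
calib = torch.randn(1024, 64)          # stand-in calibration batch
gids = cluster_channels_by_max(calib, n_groups=8)
q, s = per_group_int8_quantize(calib, gids)
dequant = q.float() * s[gids]          # round-trip check against calib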

  • Paper: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
  • Project page: https://hychiang.info/projects/quamba2
  • Code repository: https://github.com/enyac-group/Quamba

Quamba2 Overview

Key Features

  • πŸ”§ Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2
  • πŸ”» 4Γ— memory reduction
  • πŸš€ Achieves 13 tokens per second on Orin Nano 8G with Mamba2-8b

Usage

For detailed setup instructions and environment requirements, please refer to the GitHub repository.

To generate text with a quantized Quamba2 model, first download the model checkpoint. For example, to download the quamba2-2.7b-w4a8 model:

huggingface-cli download ut-enyac/quamba2-2.7b-w4a8 --local-dir pretrained_models/ut-enyac/quamba2-2.7b-w4a8
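Equivalently, the checkpoint can be fetched from Python with huggingface_hub (a minimal sketch using the library's snapshot_download; the paths mirror the CLI example above):

from huggingface_hub import snapshot_download

# Download the quantized checkpoint into the same layout the CLI uses.
snapshot_download(
    repo_id="ut-enyac/quamba2-2.7b-w4a8",
    local_dir="pretrained_models/ut-enyac/quamba2-2.7b-w4a8",
)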

Then, you can use the generate.py script provided in the repository:

python generate.py ut-enyac/quamba2-2.7b-w4a8 --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2 --quantize --cache_graph --pretrained_dir pretrained_models

Citation

If you find this work helpful or inspiring, please cite our paper:

@inproceedings{chiang2025quamba2,
  title = {Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models},
  author = {Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
  booktitle = {Forty-Second International Conference on Machine Learning (ICML)},
  year = {2025}
}