~8 tok/sec with ~5k context on vLLM with Flash Attention and `kv_cache_dtype="fp8"` on a 3090 Ti (24 GB VRAM)
The AQLM format seems quite promising for fitting that sweet sweet Q8_0 performance onto a home desktop rig. Still experimenting, but I set up a demo repo that might help folks get something up and running quickly (possibly even on Windows with Docker, though that's still untested): https://github.com/ubergarm/vLLM-inference-AQLM/
Is there a `.json` file available to feed vLLM for `quantization_param_path="somefile.json"`? Not exactly sure if it would help, but experimenting with setting `kv_cache_dtype="fp8"` seems to fit a little more context before OOMing...
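For reference, here's roughly what I'm running (just a sketch of my own setup; the scaling-factor JSON path is the hypothetical file I'm asking about):

from vllm import LLM

llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16",
    enforce_eager=True,
    gpu_memory_utilization=0.99,
    max_model_len=5000,
    kv_cache_dtype="fp8",                       # seems to fit a bit more context before OOM
    # quantization_param_path="somefile.json",  # <-- the file I'm asking about
)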
It will be interesting to see how AQLM adoption goes and which models with larger base contexts get quantized in the near future, given the fairly hefty compute demands of the quantization process itself.
Exciting stuff! Thanks for sharing!
The quantization params are in `config.json` right here, and both `transformers` and `vLLM` support it out of the box without any additional configs.
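For example, loading it in `transformers` is just the usual `from_pretrained` call (a minimal sketch, assuming the `aqlm` pip package and `accelerate` are installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# the AQLM quantization config is picked up from config.json automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)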
We have put out a small demo for vLLM. The link is in the GitHub repo README. From my own tests, context up to 3000 tokens (without fp8) works with this model on an RTX 3090:
from vllm import LLM

llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16",
    enforce_eager=True,           # skip CUDA graph capture to save some VRAM
    gpu_memory_utilization=0.99,
    max_model_len=3000,
)
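Generation then works as usual, e.g. (just a quick sketch):

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is AQLM quantization?"], sampling_params)
print(outputs[0].outputs[0].text)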
Ahh, thanks for pointing that out. Yes, I got your small demo working locally with 5k context using `kv_cache_dtype="fp8"` just for testing, though I likely won't bother with it for actual use:
WARNING 05-06 22:48:50 model_runner.py:211] KV cache scaling factors provided, but the KV cache data type is not FP8. KV cache scaling factors will not be used.
Looking into the evaluation stuff now. Thanks for releasing this model in this interesting AQLM quant!