arxiv:2409.17066

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Published on Sep 25

· Submitted by

yangwang92 on Sep 30

Upvote

Authors:

Yifei Liu ,

Jicheng Wen ,

Yang Wang ,

Shengyu Ye ,

Li Lyna Zhang ,

Ting Cao ,

Abstract

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, 11-22% on LLaMA-3 on QA tasks on average. We only utilize 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8times increase in inference throughput compared to SOTA.

View arXiv page View PDF Add to collection

Community

yangwang92

Paper author Paper submitter Sep 30

•

edited about 1 month ago

VPTQ (Vector Post-Training Quantization) is an advanced compression technique that dramatically reduces the size of large language models such as the 70B and 405B Llama models. VPTQ efficiently compresses these models to 1-2 bits within just a few hours, enabling them to run effectively on GPUs with limited memory.

paper https://arxiv.org/abs/2409.17066
github https://github.com/microsoft/VPTQ
hugginface community https://huggingface.co/VPTQ-community
free huggingface onlie demo/space https://huggingface.co/spaces/VPTQ-community/VPTQ-Demo

Llama 3.1 70b chat on RTX4090 (24G @ 2bit)

Llama 3.1 70b prompt on RTX4090 (24G @ 2bit)

KT313

Sep 30

•

edited Sep 30

in the tables, for example table 2, you have highlighted the best values where VPTQ beats other quantization methods, but you did not highlight the highest values where other methods were better. It would be a lot better if you'd highlight the highest values everywhere instead of giving VPTQ preferential treatment by only highlighting the highest values if they are from your method :)

also just a small thing on the side for clarity, maybe changing unit descriptions from something like mem/GB, cost/h to mem (GB), cost (h) would help a bit with understandability. I was confused at first at mem/GB because i thought it meant "memory per gigabyte".

There are also some other text issues, like the duplicate sentence at the top of page 3: " Un-
der the guidance of the optimization problem, Under the guidance of the optimization problem".

content wise though, looks like super great work!

yangwang92

Paper author Sep 30

Thanks for your suggestion. our paper reviewer also points out the highlights and typos in the table. And we will fix this in our camera-ready version. : -)

The current tech report is an early version that introduces our methods and early results. Thanks for your kind suggestion!