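This is an HQQ all 4-bit (group-size=64) quantized Llama3.1-8B-Instruct model. We provide two versions: a calibration-free version (mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq) and a calibrated version (mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib).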

Model Size

| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
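
The bitrate row can be sanity-checked with a little arithmetic. The sketch below assumes each quantization group stores one 16-bit scale and one 16-bit zero-point on top of the 4-bit weights. The group size of 128 used for the AWQ and GPTQ columns is an assumption chosen because it reproduces 4.25; it is not stated in this card. `effective_bitrate` is a hypothetical helper, not part of any library.

```python
def effective_bitrate(nbits: int, group_size: int, meta_bits: int = 32) -> float:
    """Average bits per weight: the weight bits plus per-group metadata
    (assumed here: 16-bit scale + 16-bit zero-point) spread over the group."""
    return nbits + meta_bits / group_size


# 4-bit weights with group size 64, as used for the HQQ model
hqq_bitrate = effective_bitrate(4, 64)

# 4-bit weights with an ASSUMED group size of 128 (AWQ / GPTQ columns)
gs128_bitrate = effective_bitrate(4, 128)

print(hqq_bitrate)    # 4.5
print(gs128_bitrate)  # 4.25
```

Under these assumptions, the smaller group size costs an extra 0.25 bits per weight in metadata.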

Model Decoding Speed

| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|
| Decoding* - short seq (tokens/sec) | 53 | 125 | 67 | 3.7 |
| Decoding* - long seq (tokens/sec) | 50 | 97 | 65 | 21 |

*: Measured on an RTX 3090.

Performance

| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | AWQ 4-bit | GPTQ 4-bit |
|---|---|---|---|---|---|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | 69.02 | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | 99.3% | 97.35% | 98.16% |
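
The last two rows follow directly from the six benchmark scores. As a minimal sketch, the snippet below recomputes them; it assumes the relative performance is the ratio of the rounded averages to the rounded fp16 average, which is what matches the figures in the table. The variable names are illustrative only.

```python
# Benchmark scores in table order:
# ARC, HellaSwag, MMLU, TruthfulQA-MC2, Winogrande, GSM8K
scores = {
    "fp16": [60.49, 80.16, 68.98, 54.03, 77.98, 75.44],
    "HQQ 4-bit/gs-64 (no calib)": [60.32, 79.21, 67.07, 53.89, 76.24, 71.27],
    "HQQ 4-bit/gs-64 (calib)": [60.92, 79.52, 67.74, 54.11, 76.48, 75.36],
    "AWQ 4-bit": [57.85, 79.28, 67.14, 51.87, 76.4, 73.47],
    "GPTQ 4-bit": [61.18, 77.82, 67.93, 53.58, 76.64, 72.25],
}

# Mean over the six benchmarks, rounded to two decimals as in the table
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

# Relative performance: rounded average divided by the rounded fp16 average
baseline = averages["fp16"]
relative = {name: round(100 * avg / baseline, 2) for name, avg in averages.items()}

for name in scores:
    print(f"{name}: average={averages[name]:.2f}, relative={relative[name]:.2f}%")
```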

You can reproduce the results above with lm-eval version 0.4.3 (pip install lm-eval==0.4.3).

Usage

First, install the dependencies:

pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas #if you use the bitblas backend

Also, make sure you use torch 2.4.0 or later (or a nightly build), with CUDA 12.1 or later.
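
One way to verify these requirements before loading the model is sketched below. The helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical and not part of hqq or torch; the only torch attributes used are `torch.__version__` and `torch.version.cuda`.

```python
import re


def version_tuple(version, width: int = 3) -> tuple:
    """Leading numeric release components, zero-padded to `width`.
    Example: '2.4.0+cu121' -> (2, 4, 0); '12.1' -> (12, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version or "")
    parts = [int(p) for p in match.group(0).split(".")] if match else []
    return tuple((parts + [0] * width)[:width])


def meets_minimum(version, minimum: str) -> bool:
    """True if `version` is at least `minimum` (numeric components only)."""
    return version_tuple(version) >= version_tuple(minimum)


def check_environment(min_torch: str = "2.4.0", min_cuda: str = "12.1") -> None:
    """Raise RuntimeError if the installed torch or its CUDA build is too old."""
    import torch  # imported lazily so the helpers above work without torch

    if not meets_minimum(torch.__version__, min_torch):
        raise RuntimeError(f"torch {torch.__version__} found, need >= {min_torch}")
    # torch.version.cuda is None on CPU-only builds, which fails this check
    if not meets_minimum(torch.version.cuda, min_cuda):
        raise RuntimeError(f"CUDA {torch.version.cuda} found, need >= {min_cuda}")
```

Calling `check_environment()` at the top of a script fails early with a readable message instead of a kernel error later on.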

Then you can use the sample code below:

import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

#Settings
###################################################
backend       = "torchao_int4" # "torchao_int4" (4-bit only), "bitblas" (4-bit + 2-bit), or "gemlite" (8-bit, 4-bit, 2-bit, 1-bit)
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
device        = 'cuda:0'
cache_dir     = '.'

#Load the model
###################################################
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version

model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

#Use optimized inference kernels
###################################################
prepare_for_inference(model, backend=backend) 

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)