---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---

This is an HQQ all-4-bit (group-size=64) quantized Llama3.1-8B-Instruct model.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)

## Model Size
| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit |
|:-------------------:|:--------:|:----------------:|:----------------:|
| Bitrate (Linear layers) | 16 | 4.5 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 |

## Model Decoding Speed
| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit |
|:-------------------:|:--------:|:----------------:|:----------------:|
| Decoding* - short seq (tokens/sec) | 53 | 125 | 67 |
| Decoding* - long seq (tokens/sec) | 50 | 97 | 65 |

*: RTX 3090

## Performance

| Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit |
|:-------------------:|:--------:|:----------------:|:----------------:|
| ARC (25-shot) | 60.49 | 60.32 | 57.85 |
| HellaSwag (10-shot) | 80.16 | 79.21 | 79.28 |
| MMLU (5-shot) | 68.98 | | 67.14 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 51.87 |
| Winogrande (5-shot) | 77.98 | 76.24 | 76.4 |
| GSM8K (5-shot) | 75.44 | | 73.47 |
| Average | 69.51 | | 67.67 |

## Usage
First, install the dependencies:
```
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas
```

Also, make sure you use at least torch `2.4.0` or the nightly build.

Then you can use the sample code below:
```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq'

compute_dtype = torch.float16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

#Attach the quantization config to every linear layer
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
#prepare_for_inference(model, backend="torchao_int4")
prepare_for_inference(model, backend="bitblas") #takes a while to init...

#Generate
###################################################
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
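
As a rough check on the bitrates listed in the Model Size table, the sketch below amortizes per-group metadata over the weights in each group. It is a sketch under stated assumptions and not part of the HQQ API: the helper name `bits_per_weight` is hypothetical, each group is assumed to store one 16-bit scale and one 16-bit zero-point (which would be consistent with `quant_scale=False` and `quant_zero=False` in the config above), and the AWQ group size of 128 is an assumption, since the card does not state it.

```python
def bits_per_weight(nbits: int, group_size: int, meta_bits: int = 32) -> float:
    """Effective bits per weight: nbits plus per-group metadata spread over the group.

    meta_bits=32 assumes one 16-bit scale and one 16-bit zero-point per group.
    """
    return nbits + meta_bits / group_size


hqq_rate = bits_per_weight(nbits=4, group_size=64)   # 4 + 32/64
awq_rate = bits_per_weight(nbits=4, group_size=128)  # group size 128 is an assumption

print(f"HQQ 4-bit/gs-64: {hqq_rate} bits per weight")
print(f"AWQ 4-bit (assumed gs-128): {awq_rate} bits per weight")
```

Under these assumptions the results match the 4.5 and 4.25 figures in the table. Smaller groups cost more metadata per weight, which is why the gs-64 setting has the higher bitrate.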