---
base_model: OpenScholar/Llama-3.1_OpenScholar-8B
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-3.1
- autoawq
---

# Llama-3.1_OpenScholar-8B with AWQ Quantization

This is [Llama-3.1_OpenScholar-8B](https://huggingface.co/OpenScholar/Llama-3.1_OpenScholar-8B) quantized to 4-bit with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), using the code below.

```python
# Based on example: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Input and output paths
path = "OpenScholar/Llama-3.1_OpenScholar-8B"
output = "Llama-3.1_OpenScholar-8B-AWQ"

# Quantization config: 4-bit weights, group size 128, GEMM kernels
config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the full-precision model
model = AutoAWQForCausalLM.from_pretrained(
    model_path=path,
    low_cpu_mem_usage=True,
    use_cache=False,
    safetensors=False,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(path)

# Quantize
model.quantize(tokenizer, quant_config=config)

# Save quantized model
model.save_quantized(output)

# Save tokenizer
# Note: Transformers >= 4.45.0 doubles the size of tokenizer.json
# See https://github.com/huggingface/transformers/issues/34744
tokenizer.save_pretrained(output)

print(f'Model is quantized and saved to "{output}"')
```
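
## Usage

A minimal sketch of loading the quantized checkpoint with 🤗 Transformers (the `autoawq` package must be installed). The model id below is the local output directory produced by the script above; replace it with the Hub repository id if loading from the Hub. The prompt is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the quantized checkpoint saved above (or the Hub repo id once uploaded)
model_id = "Llama-3.1_OpenScholar-8B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run in fp16
    device_map="cuda",
)

# Illustrative prompt
prompt = "What are the main challenges in retrieval-augmented generation?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```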