---
base_model: OpenScholar/Llama-3.1_OpenScholar-8B
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-3.1
- autoawq
---
# Llama-3.1_OpenScholar-8B with AWQ Quantization

This is OpenScholar/Llama-3.1_OpenScholar-8B quantized to 4-bit with AutoAWQ, using the code below.
```python
# Based on example: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Input and output paths
path = "OpenScholar/Llama-3.1_OpenScholar-8B"
output = "Llama-3.1_OpenScholar-8B-AWQ"

# Quantization config: 4-bit weights, group size 128, zero-point quantization, GEMM kernel
config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Load the full-precision model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(
    model_path=path,
    low_cpu_mem_usage=True,
    use_cache=False,
    safetensors=False,
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path)

# Quantize using AutoAWQ's default calibration data
model.quantize(tokenizer, quant_config=config)

# Save quantized model
model.save_quantized(output)

# Save tokenizer
# Note: Transformers >= 4.45.0 doubles the size of tokenizer.json
# See https://github.com/huggingface/transformers/issues/34744
tokenizer.save_pretrained(output)

print(f'Model is quantized and saved to "{output}"')
```
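
To run inference with the quantized weights, a minimal sketch along the lines of AutoAWQ's `basic_generate` example is shown below. It assumes a CUDA device with the `autoawq` package installed, and that `quant_path` points at the output directory produced above (or at this model's Hub repo id); the prompt is purely illustrative.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed path: the local output directory from the quantization script,
# or the Hub repo id of the quantized model.
quant_path = "Llama-3.1_OpenScholar-8B-AWQ"

# Load the quantized model and its tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Illustrative prompt; tokenize and move the input ids to the GPU
prompt = "What is retrieval-augmented generation?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generate a short completion
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same checkpoint can also be loaded through `transformers` (`AutoModelForCausalLM.from_pretrained`) or vLLM, both of which support AWQ-quantized weights when the AWQ kernels are available.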