---
base_model: OpenScholar/Llama-3.1_OpenScholar-8B
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-3.1
- autoawq
---

# Llama-3.1_OpenScholar-8B with AWQ Quantization

This is [Llama-3.1_OpenScholar-8B](https://huggingface.co/OpenScholar/Llama-3.1_OpenScholar-8B) quantized to 4-bit weights with AWQ. The quantization was applied with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the following code.

```python
# Based on example: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
import torch

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Input and output path
path = "OpenScholar/Llama-3.1_OpenScholar-8B"
output = "Llama-3.1_OpenScholar-8B-AWQ"

# Quantization config
config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path=path,
    low_cpu_mem_usage=True,
    use_cache=False,
    safetensors=False,
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path)

# Quantize
model.quantize(tokenizer, quant_config=config)

# Save quantized model
model.save_quantized(output)

# Save tokenizer
# Note: Transformers >= 4.45.0 doubles size of tokenizer.json
# See https://github.com/huggingface/transformers/issues/34744
tokenizer.save_pretrained(output)

print(f'Model is quantized and saved to "{output}"')
```
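
To make the `w_bit`, `q_group_size`, and `zero_point` settings in the config above more concrete, here is a minimal pure-Python sketch of group-wise, zero-point quantization. It is an illustration under stated assumptions, not the AutoAWQ implementation: it only shows plain round-to-nearest quantization per group and leaves out the activation-aware scaling that AWQ adds on top. The helper names (`quantize_group`, `dequantize_group`, `quantize_row`) and the toy weights are hypothetical.

```python
# Illustrative sketch only: plain round-to-nearest, group-wise quantization.
# AWQ's activation-aware scale search is NOT reproduced here.


def quantize_group(weights, w_bit=4):
    """Quantize one group to unsigned w_bit integers with a zero point."""
    q_max = 2 ** w_bit - 1                      # 15 for w_bit=4
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / q_max    # guard against constant groups
    zero = min(max(round(-w_min / scale), 0), q_max)
    codes = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into groups; each gets its own scale and zero point."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Hypothetical toy weights (not taken from the model): 256 values in [-0.2, 0.8].
row = [((i * 37) % 101 - 20) / 100 for i in range(256)]
groups = quantize_row(row)

restored = [
    w
    for codes, scale, zero in groups
    for w in dequantize_group(codes, scale, zero)
]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

With `q_group_size` set to 128, each run of 128 weights shares one scale and one zero point, so a smaller group size would track the weight distribution more closely at the cost of storing more scales and zero points.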