base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
license: llama3.1
pipeline_tag: text-generation
tags:
- int8
- w8a8
- text-generation
Meta-Llama-3-8B-Instruct-quantized.w4a4
Model Overview
- Model Architecture: Meta-Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
- Weight and Activation Quantization: INT8 (W8A8)
- Intended Use Cases: Intended for commercial and research use across multiple languages, designed to function as an assistant-like chat model.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 9/2024
- Version: 1.0
- License(s): Llama3.1
- Model Developers: Mahesh Yaddanapudi
Quantized version of Meta-Llama-3-8B-Instruct. This model is optimized using weight and activation quantization to INT8, drastically reducing memory usage and enabling deployment on extremely resource-constrained environments.
Model Optimizations
This model was obtained by quantizing the weights and activations of Meta-Llama-3-8B-Instruct to INT8 (W8A8) data type. This optimization reduces the number of bits per parameter and activation from 16 to 8, significantly reducing disk size and memory requirements.
The weights and activations of the linear operators within transformers blocks are quantized using the GPTQ algorithm, which applies symmetric per-channel quantization with a 1% damping factor and 256 sequences of 8,192 random tokens.
Deployment
This model can be deployed efficiently using various backends compatible with INT8 models, as shown in the example below.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
low_cpu_mem_usage=True
)
prompt = "What are the benefits of model quantization in AI?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
Creation
This model was created by using the GPTQ quantization method as implemented in the AutoGPTQ library, as demonstrated in the code snippet below.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import random
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# Create random examples for quantization calibration
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [{"input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)], "attention_mask": max_seq_len * [1]} for _ in range(num_samples)]
# Define quantization configuration for W8A8
quantize_config = BaseQuantizeConfig(bits=8, group_size=-1, desc_act=True, model_file_base_name="model", damp_percent=0.01)
# Load and quantize the model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device_map="auto")
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
Future Work
Further evaluations are planned to compare this quantized model with its unquantized and higher-bit quantized counterparts, especially on benchmarks relevant to code generation and logical reasoning tasks.