Meta-Llama-3-8B-Instruct-quantized.w4a4

Model Overview

Model Architecture: Meta-Llama-3
- Input: Text
- Output: Text
Model Optimizations:
- Weight and Activation Quantization: INT4 (W4A4)
Intended Use Cases: Intended for commercial and research use across multiple languages, designed to function as an assistant-like chat model.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Release Date: 9/2024
Version: 1.0
License(s): Llama3.1
Model Developers: Mahesh Yaddanapudi

Quantized version of Meta-Llama-3-8B-Instruct. This model is optimized using weight and activation quantization to INT4, drastically reducing memory usage and enabling deployment on extremely resource-constrained environments.

Model Optimizations

This model was obtained by quantizing the weights and activations of Meta-Llama-3-8B-Instruct to INT4 (W4A4) data type. This optimization reduces the number of bits per parameter and activation from 16 to 4, significantly reducing disk size and memory requirements.

The weights and activations of the linear operators within transformers blocks are quantized using the GPTQ algorithm, which applies symmetric per-channel quantization with a 1% damping factor and 256 sequences of 8,192 random tokens.

Deployment

This model can be deployed efficiently using various backends compatible with INT4 models, as shown in the example below.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w4a4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True
)

prompt = "What are the benefits of model quantization in AI?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0]))

Creation

This model was created by using the GPTQ quantization method as implemented in the AutoGPTQ library, as demonstrated in the code snippet below.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import random

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Create random examples for quantization calibration
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [{"input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)], "attention_mask": max_seq_len * [1]} for _ in range(num_samples)]

# Define quantization configuration for W4A4
quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True, model_file_base_name="model", damp_percent=0.01)

# Load and quantize the model
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device_map="auto")
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w4a4")

Future Work

Further evaluations are planned to compare this quantized model with its unquantized and higher-bit quantized counterparts, especially on benchmarks relevant to code generation and logical reasoning tasks.