---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
license: llama3
pipeline_tag: text-generation
tags:
- int8
- w8a8
- text-generation
---

# Meta-Llama-3-8B-Instruct-quantized.w8a8

## Model Overview

- **Model Architecture:** Meta-Llama-3
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight and Activation Quantization:** INT8 (W8A8)
- **Intended Use Cases:** Intended for commercial and research use in English, designed to function as an assistant-like chat model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 9/2024
- **Version:** 1.0
- **License(s):** Llama3
- **Model Developers:** Mahesh Yaddanapudi

Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Quantizing both weights and activations to INT8 roughly halves disk size and GPU memory requirements relative to the FP16 original, enabling deployment in more resource-constrained environments.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to the INT8 (W8A8) data type. This optimization reduces the number of bits per parameter and activation from 16 to 8, significantly reducing disk size and memory requirements.

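
As a rough back-of-the-envelope illustration (ignoring quantization scales, any layers kept at higher precision, and runtime activation/KV-cache memory), halving the bits per weight roughly halves the weight footprint of an ~8B-parameter model:

```python
# Approximate weight-memory footprint of an ~8B-parameter model at different precisions.
# Ignores quantization scales, higher-precision layers, and runtime activation/KV-cache memory.
NUM_PARAMS = 8_000_000_000  # approximate parameter count of Meta-Llama-3-8B

def weight_memory_gib(bits_per_param: int) -> float:
    return NUM_PARAMS * bits_per_param / 8 / 2**30

print(f"FP16 weights: ~{weight_memory_gib(16):.1f} GiB")  # ~14.9 GiB
print(f"INT8 weights: ~{weight_memory_gib(8):.1f} GiB")   # ~7.5 GiB
```
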
The weights and activations of the linear operators within transformer blocks are quantized with the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm, using symmetric per-channel quantization, a 1% damping factor, and a calibration set of 256 sequences of 8,192 random tokens each.

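
For intuition, the sketch below shows what symmetric per-channel INT8 quantization of a weight matrix means: one scale per output channel, chosen so the largest magnitude in that channel maps to 127. It illustrates only the numeric format; GPTQ itself additionally compensates for quantization error using second-order information from the calibration data.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix."""
    # One positive scale per row (output channel); the largest |value| in the row maps to 127.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

weight = torch.randn(4096, 4096)  # illustrative weight matrix
q, scales = quantize_per_channel_int8(weight)
reconstructed = q.float() * scales
print("max abs quantization error:", (weight - reconstructed).abs().max().item())
```
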
## Deployment

This model can be deployed efficiently with backends that support INT8 GPTQ checkpoints. The example below loads it with Hugging Face Transformers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8"

# Load the quantized checkpoint; device_map="auto" places the layers on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# Tokenize a prompt and generate a response.
prompt = "What are the benefits of model quantization in AI?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

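
The checkpoint can also be served with a dedicated inference engine. The snippet below is a minimal sketch, assuming a vLLM version whose GPTQ support covers this 8-bit configuration:

```python
from vllm import LLM, SamplingParams

# Assumes vLLM can load this GPTQ INT8 checkpoint directly from the Hub.
llm = LLM(model="zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What are the benefits of model quantization in AI?"], sampling_params)
print(outputs[0].outputs[0].text)
```
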
## Creation

This model was created with the GPTQ quantization algorithm as implemented in the AutoGPTQ library, as demonstrated in the code snippet below.

```python
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Create random calibration examples for quantization.
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [
    {
        "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
        "attention_mask": max_seq_len * [1],
    }
    for _ in range(num_samples)
]

# Quantization configuration: 8-bit weights, per-channel scales (group_size=-1),
# activation ordering (desc_act=True), and a 1% damping factor.
quantize_config = BaseQuantizeConfig(
    bits=8,
    group_size=-1,
    desc_act=True,
    model_file_base_name="model",
    damp_percent=0.01,
)

# Load the model, run GPTQ calibration, and save the quantized checkpoint.
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device_map="auto")
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
```

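
As a quick sanity check on the size reduction described above, the on-disk footprint of the saved directory can be compared against a local copy of the FP16 original. The path below is the local output directory from the snippet above; adjust as needed:

```python
from pathlib import Path

def dir_size_gib(path: str) -> float:
    """Total size of all files under `path`, in GiB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 2**30

print(f"INT8 checkpoint: ~{dir_size_gib('Meta-Llama-3-8B-Instruct-quantized.w8a8'):.1f} GiB")
```
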
## Future Work

Further evaluations are planned to compare this quantized model against its unquantized counterpart and other quantized variants, with particular focus on benchmarks for code generation and logical reasoning.
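
Such comparisons could be run, for example, with EleutherAI's lm-evaluation-harness. The sketch below is illustrative only; the benchmark selection is an assumption rather than a committed evaluation plan, and task names or API details may differ between harness versions:

```python
import lm_eval

# Evaluate the quantized checkpoint on a couple of reasoning-oriented benchmarks;
# swapping in the unquantized model id produces the baseline numbers for comparison.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8",
    tasks=["arc_challenge", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```
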