license: llama2
tags:
- llama
- text-generation
- causal-lm
- instruct
- quantization
- gptq
- 4-bit
- autoregressive
datasets:
- meta-llama/Llama-3.3-70B-Instruct
library_name: transformers
base_model:
- meta-llama/Llama-3.3-70B-Instruct
Llama 3.3 70B Instruct (AutoRound GPTQ 4-bit)
This repository provides a 4-bit quantized version of the Llama 3.3 70B Instruct model, produced with the AutoRound method and GPTQ quantization. The quantized model is substantially smaller than the original and shows no measurable degradation on zero-shot MMLU.
Model Description
Base Model: meta-llama/Llama-3.3-70B-Instruct
Quantization: 4-bit GPTQ with AutoRound
Group Size: 128
Symmetry: Enabled (sym=True)
This quantized model aims to preserve the capabilities and accuracy of the original Llama 3.3 70B Instruct model while sharply reducing its size and memory footprint. Weights are stored in a 4-bit representation, with one set of quantization parameters shared by each group of 128 weights and a symmetric value range (sym=True). With these settings the model stays at near-original accuracy on the MMLU benchmark.
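To make the storage format concrete, the following is a minimal sketch of group-wise symmetric 4-bit quantization with a group size of 128. It uses plain round-to-nearest, which is an assumption made for illustration: AutoRound tunes the rounding decisions rather than rounding naively, so this shows the shape of the stored data, not the method used to produce this checkpoint. The function names are hypothetical.

```python
import math
from typing import List, Tuple

def quantize_group_sym4(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group to signed 4-bit codes that share a single scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so every code fits in the range [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from the codes and the shared scale."""
    return [c * scale for c in codes]

def quantize_row(row: List[float], group_size: int = 128):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group_sym4(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]

# Illustrative data only (not real model weights): 256 values, i.e. two groups.
row = [math.sin(0.1 * i) * (1 + i / 256) for i in range(256)]
groups = quantize_row(row)
restored = [v for codes, scale in groups for v in dequantize_group(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_error:.4f}")
```

With round-to-nearest, the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups (more scales) generally track the original weights more closely at the cost of extra storage.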
Performance and Results
MMLU Zero-Shot Performance
- Original Model (FP16): ~81.82%
- 4-bit Quantized Model: ~81.93%
The 4-bit quantized model scores 81.93% on zero-shot MMLU, 0.11 percentage points above the original FP16 model’s 81.82%. A gap this small is best read as parity rather than as an improvement. On this metric, quantization caused no measurable degradation.
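As a rough check on how to read a 0.11-point gap, the sketch below compares it with the sampling error of a single accuracy measurement. It assumes the evaluation used the full MMLU test split of 14,042 questions, which this card does not state, and it treats each question as an independent trial.

```python
import math

# Assumption: the scores come from the full MMLU test split (14,042 questions).
MMLU_TEST_QUESTIONS = 14042

def standard_error_pp(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_questions)

fp16_acc = 81.82   # reported FP16 accuracy (%)
quant_acc = 81.93  # reported 4-bit accuracy (%)

gap_pp = quant_acc - fp16_acc
se_pp = standard_error_pp(fp16_acc, MMLU_TEST_QUESTIONS)
print(f"gap: {gap_pp:.2f} pp, standard error of one score: {se_pp:.2f} pp")
```

Under these assumptions the gap is smaller than the standard error of either score, which is consistent with treating the two models as equivalent on this benchmark.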
Model Size Reduction
- Original FP16 Size: ~141.06 GB
- 4-bit Quantized Size: ~39.77 GB
The quantized model is approximately 3.5x smaller than the original (141.06 GB / 39.77 GB ≈ 3.55), a reduction of about 72% in storage. This significantly lowers storage and memory requirements and can make inference feasible on more modest hardware.
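The arithmetic behind these figures can be reproduced in a few lines. The effective bits-per-weight estimate assumes that both reported sizes cover the same set of parameters and that the original stores every weight in 16 bits.

```python
fp16_gb = 141.06   # reported size of the original FP16 model
quant_gb = 39.77   # reported size of the 4-bit quantized model

compression_ratio = fp16_gb / quant_gb
storage_saved = 1 - quant_gb / fp16_gb

# FP16 uses 16 bits per weight, so scaling by the size ratio gives the
# average number of bits the quantized checkpoint spends per weight.
effective_bits = 16 * quant_gb / fp16_gb

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"storage saved:     {storage_saved:.1%}")
print(f"effective bits:    {effective_bits:.2f} per weight")
```

The average lands somewhat above 4 bits per weight. A plausible explanation is the per-group quantization parameters plus any tensors kept at higher precision, though this card does not break the total down.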
Intended Use
Primary Use Cases:
- Instruction following and content generation.
- Conversational AI interfaces, virtual assistants, and chatbots.
- Research and experimentation on large language models with reduced resource requirements.
Out-of-Scope Use Cases:
- High-stakes decision-making without human review.
- Scenarios requiring guaranteed factual correctness (e.g., medical or legal advice).
- Generation of malicious or harmful content.
Limitations and Biases
Like the original Llama models, this quantized variant may exhibit:
- Hallucinations: The model can produce factually incorrect or nonsensical outputs.
- Biases: The model may reflect cultural, social, or other biases present in its training data.
Users should ensure proper oversight and consider the model’s responses critically. It’s not suitable for authoritative or mission-critical applications without additional safeguards.
How to Use
You can load the model using transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit"

# Load the tokenizer and the quantized weights.
# device_map="auto" places layers across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16 if supported
    device_map="auto",
)

# Tokenize the prompt and move the tensors to the model's device.
prompt = "Explain the concept of gravity to a 10-year-old."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# max_length counts the prompt tokens plus the generated tokens.
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
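When planning hardware for the snippet above, the weights are only part of the memory budget, because the key/value cache grows with sequence length. The sketch below gives a rough estimate. The architecture values (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the base model's published configuration rather than from this card, and the estimate ignores activations and framework overhead.

```python
# Assumed architecture values for Llama 3.3 70B (not stated in this card).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # FP16 cache entries

QUANTIZED_WEIGHTS_GB = 39.77  # reported size of the 4-bit checkpoint

# Each token stores one key and one value vector per KV head in every layer.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def kv_cache_gb(num_tokens: int, batch_size: int = 1) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    return kv_bytes_per_token * num_tokens * batch_size / 1e9

for tokens in (200, 8192):
    total = QUANTIZED_WEIGHTS_GB + kv_cache_gb(tokens)
    print(f"{tokens:>5} tokens: cache {kv_cache_gb(tokens):.2f} GB, "
          f"weights + cache {total:.2f} GB")
```

For short generations such as the 200-token example, the cache is small compared with the weights. It becomes a noticeable share of memory only at long context lengths or larger batch sizes.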