File size: 3,532 Bytes
ca0d26a
 
 
 
 
 
 
 
 
 
 
 
 
 
c498661
 
ca0d26a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c498661
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
---
license: llama2
tags:
- llama
- text-generation
- causal-lm
- instruct
- quantization
- gptq
- 4-bit
- autoregressive
datasets:
- meta-llama/Llama-3.3-70B-Instruct
library_name: transformers
base_model:
- meta-llama/Llama-3.3-70B-Instruct
---

# Llama 3.3 70B Instruct (AutoRound GPTQ 4-bit)

This repository provides a 4-bit quantized version of the **Llama 3.3 70B Instruct** model using the [AutoRound](https://github.com/jllllll/auto-round) method and GPTQ quantization. This process results in a significantly smaller model footprint with negligible degradation in performance (as measured by MMLU zero-shot evaluations).

## Model Description

**Base Model:** [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

**Quantization:** 4-bit GPTQ with AutoRound

**Group Size:** 128  
**Symmetry:** Enabled (`sym=True`)

This quantized model aims to preserve the capabilities and accuracy of the original Llama 3.3 70B Instruct model while drastically reducing the model size and computational overhead. By converting weights into a 4-bit representation with carefully selected quantization parameters, the model maintains near-original performance levels on challenging benchmarks.

## Performance and Results

### MMLU Zero-Shot Performance

- **Original Model (FP16):** ~81.82%  
- **4-bit Quantized Model:** ~81.93%

As shown above, the 4-bit quantized model achieved an MMLU zero-shot accuracy of **81.93%**, which is effectively on par with the original FP16 model’s **81.82%**. Thus, the quantization process did not cause performance degradation based on this evaluation metric.

### Model Size Reduction

- **Original FP16 Size:** ~141.06 GB  
- **4-bit Quantized Size:** ~39.77 GB

The quantized model is approximately **3.5x smaller** than the original. This reduction significantly lowers storage requirements and can enable faster inference on more modest hardware.

## Intended Use

**Primary Use Cases:**
- Instruction following and content generation.
- Conversational AI interfaces, virtual assistants, and chatbots.
- Research and experimentation on large language models with reduced resource requirements.

**Out-of-Scope Use Cases:**
- High-stakes decision-making without human review.
- Scenarios requiring guaranteed factual correctness (e.g., medical or legal advice).
- Generation of malicious or harmful content.

## Limitations and Biases

Like the original Llama models, this quantized variant may exhibit:
- Hallucinations: The model can produce factually incorrect or nonsensical outputs.
- Biases: The model may reflect cultural, social, or other biases present in its training data.

Users should ensure proper oversight and consider the model’s responses critically. It’s not suitable for authoritative or mission-critical applications without additional safeguards.

## How to Use

You can load the model using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16 if supported
    device_map="auto"
)

prompt = "Explain the concept of gravity to a 10-year-old."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

```