IDEFICS3-8B-Llama3 AWQ 4-bit (Text-Only Quantization)
This is a 4-bit AWQ-quantized version of HuggingFaceM4/Idefics3-8B-Llama3, produced with LLM Compressor.
Key Features
- ✅ Text model quantized (4-bit AWQ) - ~61% size reduction
- ✅ Vision encoder preserved (FP16) - maintains image quality
- ✅ Selective quantization - only the language-model layers are quantized; vision components are untouched
- ✅ vLLM compatible - Fast inference with vLLM
Model Details
- Base Model: HuggingFaceM4/Idefics3-8B-Llama3 (8B parameters)
- Architecture: IDEFICS3 (Llama3-based decoder + Vision encoder)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: Flickr30k (128 samples)
Size Comparison
| Metric | Value |
|---|---|
| Original (FP16) | ~16.0 GB |
| Quantized (W4A16) | ~6.18 GB |
| Reduction | ~61.4% |
| Memory Saved | ~9.8 GB |
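The reduction figure follows directly from the two sizes:

```python
# Sanity check of the numbers in the table above
original_gb, quantized_gb = 16.0, 6.18
saved_gb = original_gb - quantized_gb
reduction = (1 - quantized_gb / original_gb) * 100
print(f"Saved: {saved_gb:.1f} GB ({reduction:.1f}% reduction)")  # Saved: 9.8 GB (61.4% reduction)
```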
What Was Quantized
Quantized (4-bit):
- LlamaDecoderLayer (text/language model)
- Text processing linear layers
Preserved (FP16):
- Vision encoder (maintains image understanding quality)
- Vision-text connector
- Embeddings
- Language model head
This selective quantization ensures that vision understanding quality remains nearly identical to the original model while significantly reducing size.
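You can confirm this split from the checkpoint itself: the quantization config saved in `config.json` lists the modules that were skipped. A quick check (the exact field layout depends on the compressed-tensors version used to save the model):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
)
# The ignore list should cover the vision encoder, connector, embeddings and
# lm_head, while the Llama decoder Linear layers fall under the W4A16 group.
print(cfg.quantization_config)
```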
Usage
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import requests

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
)

# Prepare inputs
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to("cuda")

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
vLLM Inference (Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
    max_model_len=2048,
)
# vLLM will automatically detect and use the AWQ-quantized weights
# for faster inference.
```
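A minimal generation sketch that builds on the `llm` object above. It assumes a vLLM release with multi-modal input support and reuses the HF processor's chat template to build the prompt; the exact placeholder handling can differ between vLLM versions, so treat this as a starting point rather than the canonical recipe:

```python
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import SamplingParams

processor = AutoProcessor.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16", trust_remote_code=True
)
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    stream=True,
).raw)

# Chat-formatted prompt containing the image placeholder token(s)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# `llm` is the LLM instance created above; the raw PIL image is passed
# alongside the prompt via multi_modal_data
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```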
Performance
- Memory Usage: ~6-8 GB GPU VRAM (vs ~16 GB for FP16)
- Inference Speed: Similar to FP16 on compatible hardware
- Quality: Vision understanding effectively unchanged (encoder kept in FP16); text generation roughly 95-98% of FP16 quality on typical prompts
- Recommended GPU: 16 GB+ VRAM for comfortable headroom (KV cache, longer prompts)
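To verify the VRAM figure on your own hardware, a rough check after running the transformers example above (numbers vary with context length, batch size, and allocator behaviour):

```python
import torch

# Peak GPU memory allocated to tensors during the run; CUDA context overhead
# is not included, so nvidia-smi will report a somewhat higher total.
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```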
Quantization Details
- Method: AWQ (Activation-aware Weight Quantization)
- Sequential Pipeline: layers are calibrated and quantized one at a time to keep peak memory low
- Calibration: 128 Flickr30k image-text pairs
- Max Sequence Length: 2048 tokens
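For reference, a minimal sketch of how such a run can be set up with LLM Compressor. This assumes a recent llm-compressor release that ships `AWQModifier`; the ignore patterns, the Flickr30k dataset id, and the calibration preprocessing are illustrative assumptions, not the exact script used to produce this checkpoint:

```python
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "HuggingFaceM4/Idefics3-8B-Llama3"
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 128 Flickr30k image-caption pairs as calibration data (dataset id assumed);
# a real multimodal run also needs a preprocessing step / data collator that
# turns each pair into model inputs, omitted here for brevity.
ds = load_dataset("nlphuji/flickr30k", split="test[:128]")

# Quantize only the Linear layers of the Llama decoder; the regex patterns
# skipping the vision tower, connector and lm_head are assumptions and must
# match the actual Idefics3 module names.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["re:.*lm_head", "re:.*vision_model.*", "re:.*connector.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)

model.save_pretrained("idefics3-8b-llama3-awq-w4a16", save_compressed=True)
processor.save_pretrained("idefics3-8b-llama3-awq-w4a16")
```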
Limitations
- May have slight quality degradation in complex text generation compared to FP16
- Vision encoder is NOT quantized (intentional for quality)
- Requires vLLM or transformers with AWQ support
License
Apache 2.0 (same as base model)
Citation
```bibtex
@misc{idefics3-awq,
  title={IDEFICS3-8B-Llama3 AWQ 4-bit},
  author={ronantakizawa},
  note={4-bit AWQ quantization of HuggingFaceM4/Idefics3-8B-Llama3},
  year={2025},
  url={https://huggingface.co/ronantakizawa/idefics3-8b-llama3-awq-w4a16}
}
```
Acknowledgements
- Base model by HuggingFace M4
- Quantization using LLM Compressor
- Meta tensor fix by @ronantakizawa
🤖 Generated with LLM Compressor