IDEFICS3-8B-Llama3 AWQ 4-bit (Text-Only Quantization)

This is a 4-bit AWQ quantized version of HuggingFaceM4/Idefics3-8B-Llama3 using LLM Compressor.

Key Features

  • Text model quantized (4-bit AWQ) - ~61% size reduction
  • Vision encoder preserved (FP16) - maintains image quality
  • Smart quantization - Only LLM layers quantized, vision parts untouched
  • vLLM compatible - Fast inference with vLLM

Model Details

  • Base Model: HuggingFaceM4/Idefics3-8B-Llama3 (8B parameters)
  • Architecture: IDEFICS3 (Llama 3.1 based language decoder + SigLIP vision encoder)
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Calibration Dataset: Flickr30k (128 samples)
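
The scheme and the modules kept in higher precision are recorded in the checkpoint's quantization config, so you can check them from config.json alone. A quick sketch, assuming the metadata is stored the way LLM Compressor / compressed-tensors exports normally store it:

from transformers import AutoConfig

# Fetch only config.json and print the embedded quantization metadata
# (scheme, group size, ignored modules). Exact field names depend on the
# compressed-tensors version used at export time.
config = AutoConfig.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True
)
print(config.quantization_config)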

Size Comparison

Metric               Value
Original (FP16)      ~16.0 GB
Quantized (W4A16)    ~6.18 GB
Reduction            ~61.4%
Memory Saved         ~9.8 GB

What Was Quantized

Quantized (4-bit):

  • LlamaDecoderLayer (text/language model)
  • Text processing linear layers

Preserved (FP16):

  • Vision encoder (maintains image understanding quality)
  • Vision-text connector
  • Embeddings
  • Language model head

This selective quantization ensures that vision understanding quality remains nearly identical to the original model while significantly reducing size.
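
You can verify this split after loading the model (see the Usage section below) by tallying tensor dtypes per top-level submodule: quantized text layers show up as packed integer tensors, while the vision encoder stays in 16-bit. A minimal sketch; the module prefixes (model.vision_model, model.connector, model.text_model, lm_head) follow the transformers Idefics3 implementation and may differ between versions:

from collections import Counter, defaultdict

# Count tensor elements by dtype under each top-level submodule.
# `model` is the object loaded in the Usage section below.
dtype_counts = defaultdict(Counter)
for name, tensor in model.state_dict().items():
    prefix = ".".join(name.split(".")[:2])   # e.g. "model.vision_model"
    dtype_counts[prefix][str(tensor.dtype)] += tensor.numel()

for prefix, counts in sorted(dtype_counts.items()):
    print(prefix, dict(counts))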

Usage

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import requests

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True
)

# Prepare inputs
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]
}]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to("cuda")

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))

vLLM Inference (Recommended for Production)

from vllm import LLM, SamplingParams

llm = LLM(
    model="ronantakizawa/idefics3-8b-llama3-awq-w4a16",
    trust_remote_code=True,
    max_model_len=2048
)

# vLLM will automatically use AWQ quantization for faster inference
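
For an end-to-end multimodal request, one option is to render the prompt with the Hugging Face processor and hand the image to vLLM via multi_modal_data. A minimal single-image sketch; limit_mm_per_prompt and the dict-style generate input follow vLLM's standard multimodal usage, but details can vary across vLLM versions:

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image
import requests

model_id = "ronantakizawa/idefics3-8b-llama3-awq-w4a16"

llm = LLM(
    model=model_id,
    trust_remote_code=True,
    max_model_len=2048,
    limit_mm_per_prompt={"image": 1}   # one image per prompt in this sketch
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass the rendered prompt plus the raw PIL image to vLLM.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256)
)
print(outputs[0].outputs[0].text)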

Performance

  • Memory Usage: ~6-8 GB GPU VRAM (vs ~16 GB for FP16)
  • Inference Speed: Similar to FP16 on compatible hardware
  • Quality: Vision understanding effectively unchanged (the vision encoder is not quantized); text generation roughly 95-98% of FP16 quality
  • Recommended GPU: 16GB+ VRAM for optimal performance
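
To confirm the memory numbers on your own hardware, peak allocated VRAM around a generation call is easy to measure with PyTorch. This assumes the model, processor, and inputs from the Usage section above and a single CUDA device:

import torch

# Reset the peak-memory counter, run one generation, then read the peak back.
torch.cuda.reset_peak_memory_stats()
outputs = model.generate(**inputs, max_new_tokens=256)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during generation: {peak_gb:.2f} GB")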

Quantization Details

  • Method: AWQ (Activation-aware Weight Quantization)
  • Sequential Pipeline: Used for layer-by-layer quantization
  • Calibration: 128 Flickr30k image-text pairs
  • Max Sequence Length: 2048 tokens
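
For reference, text-only AWQ quantization with LLM Compressor is usually expressed as a recipe whose ignore list keeps the vision encoder, connector, and LM head out of quantization. The sketch below only shows that shape: the AWQModifier import path and the Idefics3 ignore patterns are assumptions that may need adjusting for your llm-compressor version, and the calibration plumbing (128 Flickr30k image-text pairs at 2048 max sequence length, fed through oneshot with a data collator) is omitted; see llm-compressor's multimodal examples for that part.

from llmcompressor.modifiers.awq import AWQModifier  # assumed import path

# Quantize only the Linear layers of the language model to W4A16;
# everything matched by `ignore` stays in 16-bit.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "lm_head",                 # language model head kept in FP16
        "re:.*vision_model.*",     # vision encoder kept in FP16
        "re:.*connector.*"         # vision-text connector kept in FP16
    ]
)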

Limitations

  • May have slight quality degradation in complex text generation compared to FP16
  • Vision encoder is NOT quantized (intentional for quality)
  • Requires vLLM or transformers with AWQ support

License

Apache 2.0 (same as base model)

Citation

@misc{idefics3-awq,
  title={IDEFICS3-8B-Llama3 AWQ 4-bit (Text-Only Quantization)},
  author={ronantakizawa},
  year={2025},
  note={4-bit AWQ quantization of HuggingFaceM4/Idefics3-8B-Llama3},
  url={https://huggingface.co/ronantakizawa/idefics3-8b-llama3-awq-w4a16}
}

Acknowledgements

This model was quantized with LLM Compressor.
