Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?

#3 · opened by Franck-Dernoncourt

I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:

File Name          Size
model.onnx         654 MB
model_fp16.onnx    327 MB
model_q4.onnx      200 MB
model_q4f16.onnx   134 MB
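For reference, a quick back-of-the-envelope check of the actual compression ratios, using nothing beyond the file sizes listed above:

```python
# File sizes in MB, as listed on the Hub
sizes = {
    "model.onnx": 654,
    "model_fp16.onnx": 327,
    "model_q4.onnx": 200,
    "model_q4f16.onnx": 134,
}

fp32 = sizes["model.onnx"]
for name, mb in sizes.items():
    print(f"{name:<18} {mb:>4} MB  -> {fp32 / mb:.2f}x smaller than fp32")
```

This prints roughly 3.27x for model_q4.onnx and 4.88x for model_q4f16.onnx, rather than the 4x a pure 4-bit representation would suggest.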

I understand that:

  • model.onnx is the fp32 model,
  • model_fp16.onnx is the model whose weights are converted to fp16 (half precision).

I don't understand the sizes of model_q4.onnx and model_q4f16.onnx:

  1. Why is model_q4.onnx 200 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4.onnx meant that the weights are quantized to 4 bits.

  2. Why is model_q4f16.onnx 134 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4f16.onnx meant that the weights are quantized to 4 bits and activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

    qAfB(_id), where A represents the number of bits for storing weights and B represents the number of bits for storing activations.

and "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably). A sketch for checking which tensors are actually stored in 4 bits follows below.
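Not an answer, but a minimal sketch of how one could check where the extra bytes go, assuming the onnx Python package is installed and the files are downloaded locally (the file path is a placeholder). It tallies initializer bytes by data type, so you can see how much of the file is 4-bit weight data (often packed into uint8 tensors) versus tensors kept in other precisions, such as quantization scales or embeddings:

```python
from collections import defaultdict

import onnx
from onnx import TensorProto

# Placeholder path; point it at the downloaded file.
model = onnx.load("onnx/model_q4.onnx")

bytes_by_dtype = defaultdict(int)
for init in model.graph.initializer:
    dtype = TensorProto.DataType.Name(init.data_type)
    # Most exporters store weights in raw_data; tensors kept in the
    # typed fields (float_data, int32_data, ...) would show 0 here.
    bytes_by_dtype[dtype] += len(init.raw_data)

for dtype, n in sorted(bytes_by_dtype.items(), key=lambda kv: -kv[1]):
    print(f"{dtype:>8}: {n / 1e6:8.1f} MB")
```

If the totals show a sizeable fp32/fp16 share alongside the packed 4-bit weight tensors, that alone would account for part of the gap between 200 MB and the naive 654 MB / 4 estimate.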
