This model is quantized from meta-llama/Llama-3.2-1B.

Quantized Llama 3.2-1B

This repository contains a quantized version of the Llama 3.2-1B model, optimized for reduced memory footprint and faster inference.

Quantization Details

The model was quantized with GPTQ, a post-training quantization method for generative pre-trained transformers, using the following parameters:

Quantization method: GPTQ
Number of bits: 4
Dataset used for calibration: c4
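
The settings above could be reproduced with Optimum's GPTQQuantizer roughly as sketched below. Only bits=4 and dataset="c4" come from this card. The base model id, the float16 dtype, model_seqlen=2048, and the save_folder name are assumptions made for illustration, and the exact script used for this repository is not documented here. Running the quantization itself typically needs a GPU.

```python
def quantize_llama_gptq(model_id="meta-llama/Llama-3.2-1B",
                        save_folder="llama-3.2-1b-gptq",  # hypothetical output path
                        bits=4, dataset="c4"):
    """Quantize a causal LM with GPTQ and save the result to save_folder."""
    # Heavy imports live inside the function so defining it stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.gptq import GPTQQuantizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    # bits and dataset follow the card; model_seqlen is an assumed value.
    quantizer = GPTQQuantizer(bits=bits, dataset=dataset, model_seqlen=2048)
    quantized_model = quantizer.quantize_model(model, tokenizer)

    # Writes the quantized weights and the quantization config to disk.
    quantizer.save(quantized_model, save_folder)
    return quantized_model
```

Parameters not listed in this card, such as the group size, are left at the library defaults in this sketch and may differ from what was actually used.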

Usage

To use the quantized model, load it with the load_quantized_model function from the optimum.gptq module. Replace save_folder with the path to the directory where the quantized model is saved.
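
A minimal loading sketch, following the pattern in the Optimum documentation, might look like this. The helper name load_llama_gptq, the base model id used to build the empty architecture, and the float16 dtype are assumptions, not details confirmed by this card.

```python
def load_llama_gptq(save_folder, base_model_id="meta-llama/Llama-3.2-1B"):
    """Load GPTQ-quantized weights from save_folder into an empty model."""
    # Heavy imports live inside the function so defining it stays cheap.
    import torch
    from accelerate import init_empty_weights
    from transformers import AutoModelForCausalLM
    from optimum.gptq import load_quantized_model

    # Build the model skeleton without allocating real weight tensors.
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_pretrained(
            base_model_id, torch_dtype=torch.float16
        )
    empty_model.tie_weights()

    # Fill the skeleton with the quantized weights stored in save_folder.
    return load_quantized_model(empty_model, save_folder=save_folder, device_map="auto")


# Example call (the path is a placeholder):
# model = load_llama_gptq("path/to/save_folder")
```

The returned object is a regular Transformers causal language model, so it can be paired with the base model's tokenizer and used with generate as usual.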

Requirements

Python 3.8 or higher
PyTorch 2.0 or higher
Transformers
Optimum
Accelerate
Bitsandbytes
Auto-GPTQ

You can install the Python packages above with pip, for example: pip install torch transformers optimum accelerate bitsandbytes auto-gptq
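
To check an environment against the list above, a small script such as the following can report what is missing. The PyPI distribution names in REQUIRED are assumed to correspond to the listed libraries, and the sketch checks only for presence, so the PyTorch 2.0 minimum is not verified.

```python
import sys
from importlib import metadata

# Assumed PyPI distribution names for the libraries listed above.
REQUIRED = ["torch", "transformers", "optimum", "accelerate", "bitsandbytes", "auto-gptq"]


def python_ok(min_version=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def missing_packages(names=REQUIRED):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages())
```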

Disclaimer

This quantized model is provided for research and experimentation purposes. While quantization can significantly reduce model size and improve inference speed, it may also result in a slight decrease in accuracy compared to the original model.
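
For a rough sense of the size reduction, the back-of-the-envelope estimate below compares 16-bit and 4-bit weight storage. The parameter count of about 1.23 billion is an approximate figure assumed here. The estimate ignores GPTQ's per-group scales and zero points, as well as layers that usually stay in higher precision (such as the embeddings), so the real files will be larger than the 4-bit number suggests.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


APPROX_PARAMS = 1.23e9  # assumed approximate parameter count

fp16_gb = weight_bytes(APPROX_PARAMS, 16) / 1e9
int4_gb = weight_bytes(APPROX_PARAMS, 4) / 1e9

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:  ~{int4_gb:.2f} GB (idealized lower bound)")
```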

Acknowledgements

Meta AI for releasing the Llama 3.2-1B model.
The authors of the GPTQ quantization method.
The Hugging Face team for providing the tools and resources for model sharing and deployment.

spedrox-sac/Llama-3.2-1B_quantized