Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16
Model Overview
- Model Architecture: Llama-3.1-8B
- Input: Text
- Output: Text
- Model Optimizations:
- Sparsity: 2:4
- Release Date: 11/21/2024
- Version: 1.0
- License(s): llama3.1
- Model Developers: Neural Magic
This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset, followed by quantization. On the AlpacaEval benchmark (version 1), it achieves a score of 61.6, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 99.4% accuracy recovery.
Model Optimizations
This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-ultrachat_200k-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-ultrachat_200k-2of4.
Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
Deployment with vLLM
This model can be deployed efficiently using the vLLM backend. vLLM aslo supports OpenAI-compatible serving. See the documentation for more details.
Evaluation
This model was evaluated on Neural Magic's fork of AlpacaEval benchmark. We adopt the same setup as in Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, using version 1 of the benchmark and Llama-2-70b-chat as the annotator.
Accuracy
AlpacaEval Benchmark
Metric | Llama-3.1-8B-ultrachat_200k | Sparse-Llama-3.1-8B-ultrachat_200k-2of4 | Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16 |
Win rate | 62.0 | 61.1 | 61.6 |
- Downloads last month
- 32
Model tree for neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16
Base model
meta-llama/Llama-3.1-8B