
SparseLlama-3-8B-pruned_50.2of4-FP8

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model pruned in one shot with SparseGPT and then further retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask. It was subsequently quantized with AutoFP8 to FP8 weights and activations with per-tensor scales, calibrated on UltraChat2k.

Note: The unquantized SparseLlama-3-8B-pruned_50.2of4 is still a work in progress and subject to change. This FP8 model will be updated whenever the unquantized model is updated.
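
As a quick way to try the checkpoint, the sketch below loads it with vLLM, which can read FP8 weight/activation scales stored in the checkpoint on FP8-capable GPUs. This is a minimal sketch, not an official usage snippet from this card; the Hugging Face repo id is an assumption.

```python
# Minimal usage sketch (assumptions: the repo id below and that your GPU/vLLM
# build supports FP8). vLLM applies the per-tensor FP8 scales from the checkpoint.
from vllm import LLM, SamplingParams

MODEL_ID = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4-FP8"  # hypothetical repo id

llm = LLM(model=MODEL_ID)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Explain 2:4 structured sparsity in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```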

Evaluation Benchmark Results

Model evaluation results obtained via lm-evaluation-harness following the configuration of Open LLM Leaderboard.

| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 | SparseLlama-3-8B-pruned_50.2of4-FP8 (this model) |
| --- | --- | --- | --- |
| ARC-c (25-shot) | 59.47% | 57.76% | 58.02% |
| MMLU (5-shot) | 65.29% | 60.44% | 60.71% |
| HellaSwag (10-shot) | 82.14% | 79.97% | 79.61% |
| WinoGrande (5-shot) | 77.27% | 77.19% | 76.32% |
| GSM8K (5-shot) | 44.81% | 47.92% | 49.36% |
| TruthfulQA (0-shot) | 43.96% | 41.02% | 40.82% |
| Average Accuracy | 62.16% | 60.72% | 60.81% |
| Recovery | 100% | 97.68% | 97.83% |
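
The Recovery row is simply each model's average accuracy as a fraction of the dense Meta-Llama-3-8B average. A minimal sketch of the arithmetic, using the values from the table above:

```python
# Recovery = (average accuracy of the compressed model) / (dense baseline average) * 100
dense_avg = 62.16   # Meta-Llama-3-8B
sparse_avg = 60.72  # SparseLlama-3-8B-pruned_50.2of4
fp8_avg = 60.81     # this model

print(f"{100 * sparse_avg / dense_avg:.2f}%")  # -> 97.68%
print(f"{100 * fp8_avg / dense_avg:.2f}%")     # -> 97.83%
```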
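
To reproduce a single entry, a hedged sketch of the lm-evaluation-harness Python API is shown below. The harness version, backend, batch size, and repo id used for this card are not stated here, so treat them as assumptions; each benchmark also uses its own few-shot setting (e.g. 25-shot for ARC-c).

```python
# Sketch only: scoring ARC-c (25-shot) with lm-evaluation-harness' Python API.
# Backend, repo id, and batch size are assumptions, not taken from this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # or "hf"; FP8 checkpoints are typically served through vLLM
    model_args="pretrained=neuralmagic/SparseLlama-3-8B-pruned_50.2of4-FP8",  # hypothetical id
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```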

Help

For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
