
SparseLlama-3-8B-pruned_50.2of4-FP8

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model pruned in one shot with SparseGPT and then further retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask. It was subsequently quantized with AutoFP8 to FP8 weights and activations with per-tensor scales, calibrated on UltraChat2k.

Note: The unquantized SparseLlama-3-8B-pruned_50.2of4 is still a work in progress and subject to change. This FP8 model will be updated whenever the unquantized model is updated.
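
As a quick way to try the checkpoint, the sketch below loads it with vLLM, which can read FP8 weight/activation scales stored in the checkpoint on FP8-capable GPUs. This is a minimal sketch, not an official usage snippet from this card; the Hugging Face repo id is an assumption.

```python
# Minimal usage sketch (assumptions: the repo id below and that your GPU/vLLM
# build supports FP8). vLLM applies the per-tensor FP8 scales from the checkpoint.
from vllm import LLM, SamplingParams

MODEL_ID = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4-FP8"  # hypothetical repo id

llm = LLM(model=MODEL_ID)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Explain 2:4 structured sparsity in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```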

Evaluation Benchmark Results

Model evaluation results obtained via lm-evaluation-harness following the configuration of Open LLM Leaderboard.

| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 | SparseLlama-3-8B-pruned_50.2of4-FP8 (this model) |
| --- | --- | --- | --- |
| ARC-c (25-shot) | 59.47% | 57.76% | 58.02% |
| MMLU (5-shot) | 65.29% | 60.44% | 60.71% |
| HellaSwag (10-shot) | 82.14% | 79.97% | 79.61% |
| WinoGrande (5-shot) | 77.27% | 77.19% | 76.32% |
| GSM8K (5-shot) | 44.81% | 47.92% | 49.36% |
| TruthfulQA (0-shot) | 43.96% | 41.02% | 40.82% |
| Average Accuracy | 62.16% | 60.72% | 60.81% |
| Recovery | 100% | 97.68% | 97.83% |
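
The Recovery row is simply each model's average accuracy as a fraction of the dense Meta-Llama-3-8B average. A minimal sketch of the arithmetic, using the values from the table above:

```python
# Recovery = (average accuracy of the compressed model) / (dense baseline average) * 100
dense_avg = 62.16   # Meta-Llama-3-8B
sparse_avg = 60.72  # SparseLlama-3-8B-pruned_50.2of4
fp8_avg = 60.81     # this model

print(f"{100 * sparse_avg / dense_avg:.2f}%")  # -> 97.68%
print(f"{100 * fp8_avg / dense_avg:.2f}%")     # -> 97.83%
```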
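
To reproduce a single entry, a hedged sketch of the lm-evaluation-harness Python API is shown below. The harness version, backend, batch size, and repo id used for this card are not stated here, so treat them as assumptions; each benchmark also uses its own few-shot setting (e.g. 25-shot for ARC-c).

```python
# Sketch only: scoring ARC-c (25-shot) with lm-evaluation-harness' Python API.
# Backend, repo id, and batch size are assumptions, not taken from this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # or "hf"; FP8 checkpoints are typically served through vLLM
    model_args="pretrained=neuralmagic/SparseLlama-3-8B-pruned_50.2of4-FP8",  # hypothetical id
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```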

Help

For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
