---
base_model: meta-llama/Llama-2-7b-hf
inference: true
model_type: llama
datasets:
  - cerebras/SlimPajama-627B
tags:
  - sparse
---

# Llama-2-7b-pruned50-retrained

This repo contains model files for a Llama 2 7B model that has had 50% of the parameters pruned in one-shot with SparseGPT, then retrained by Cerebras with 45B tokens from SlimPajama while maintaining sparsity.

Authors: Neural Magic, Cerebras
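As a quick sanity check, the sparsity of the released checkpoint can be inspected by counting zero-valued weights in the model's linear layers. This is a minimal sketch, assuming the ~50% figure refers to unstructured sparsity over the Transformer linear projections (which layers are covered is an assumption of this sketch, not stated above):

```python
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "neuralmagic/Llama-2-7b-pruned50-retrained", torch_dtype=torch.float16
)

zero, total = 0, 0
for name, module in model.named_modules():
    # Count zeros only in nn.Linear weights; embeddings and norms are
    # assumed to be left dense (an assumption of this sketch).
    if isinstance(module, torch.nn.Linear):
        w = module.weight
        zero += (w == 0).sum().item()
        total += w.numel()

print(f"Unstructured sparsity over nn.Linear weights: {zero / total:.1%}")
```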

## Usage

Below are some code snippets to help you quickly get started running the model.

### Fine-tuning examples

Coming soon.

### Running the model

```python
# pip install transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("neuralmagic/Llama-2-7b-pruned50-retrained")
model = AutoModelForCausalLM.from_pretrained("neuralmagic/Llama-2-7b-pruned50-retrained", device_map="auto")

input_text = "Write me a poem about Machine Learning."
# Move the tokenized inputs to whatever device the model was placed on
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Evaluation Benchmark Results

Model evaluation metrics and results.

| Benchmark   | Metric        | Llama-2-7b | Llama-2-7b-pruned50-retrained |
| ----------- | ------------- | ---------- | ----------------------------- |
| MMLU        | 5-shot, top-1 | xxxx       | xxxx                          |
| HellaSwag   | 0-shot        | xxxx       | xxxx                          |
| WinoGrande  | partial score | xxxx       | xxxx                          |
| ARC-c       |               | xxxx       | xxxx                          |
| TruthfulQA  | 5-shot        | xxxx       | xxxx                          |
| HumanEval   | pass@1        | xxxx       | xxxx                          |
| GSM8K       | maj@1         | xxxx       | xxxx                          |
| **Average** |               | xxxx       | xxxx                          |
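The table values above are placeholders. As a hedged sketch of how such metrics are commonly collected (the specific harness and settings used for this card are not stated here), EleutherAI's lm-evaluation-harness can score the model directly from the Hub:

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness; an assumed choice of harness,
# not necessarily the one used to produce the numbers above)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=neuralmagic/Llama-2-7b-pruned50-retrained,dtype=float16",
    tasks=["hellaswag"],  # 0-shot HellaSwag, as listed in the table
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```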

## Model Training Data

Coming soon.

## Sparsification

This model was pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).
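For reference, the sketch below illustrates what one-shot 50% unstructured pruning looks like at the tensor level. It uses simple per-layer magnitude pruning as a stand-in for SparseGPT (which additionally applies second-order, Hessian-based weight updates to compensate for the removed weights) and is not the SparseML recipe used to produce this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM

# Start from the dense base model named in the card metadata
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

sparsity = 0.5  # fraction of weights to zero out in each linear layer

with torch.no_grad():
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight
            k = int(w.numel() * sparsity)
            # Zero the k smallest-magnitude weights in this layer
            # (SparseGPT instead solves a layer-wise reconstruction problem).
            threshold = w.abs().float().flatten().kthvalue(k).values
            w.mul_((w.abs().float() > threshold).to(w.dtype))
```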