Zero-shot results using Llama-3.1-70B-Instruct as the teacher model and Llama-3.1-8B-Instruct as the initialization:

| Task | Llama-3.1-8B-Instruct | Llama3.1-Mamba-8B-distill | Llama3.1-Mamba-8B-dpo | Llama3.1-Mamba2-8B-distill | Llama3.1-Mamba2-8B-dpo |
|---|---|---|---|---|---|
| arc_challenge | 0.552 | 0.5384 | 0.5657 | 0.5265 | 0.5973 |
| arc_easy | 0.8178 | 0.8224 | 0.8401 | 0.822 | 0.8481 |
| hellaswag | 0.7921 | 0.7591 | 0.7736 | 0.7536 | 0.7969 |
| mmlu (0-shot) | 0.6812 | 0.6213 | 0.636 | 0.6101 | 0.5974 |
| openbookqa | 0.432 | 0.428 | 0.442 | 0.416 | 0.44 |
| piqa | 0.8079 | 0.7933 | 0.8041 | 0.7889 | 0.8003 |
| pubmedqa | 0.752 | 0.72 | 0.744 | 0.726 | 0.746 |
| race | 0.4478 | 0.4211 | 0.4344 | 0.4211 | 0.4612 |
| winogrande | 0.7388 | 0.7277 | 0.738 | 0.7174 | 0.7411 |
| truthfulqa | 0.4267 | 0.4002 | 0.4607 | 0.4031 | 0.5022 |
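
The task names above match those used by EleutherAI's lm-evaluation-harness, so a run like the sketch below should produce comparable numbers. This is a minimal illustration, not the authors' exact evaluation command: the harness version, the `truthfulqa` task grouping, and in particular the plain `hf` loader are assumptions, and the hybrid Mamba checkpoints may instead require the custom model code from the MambaInLlama repository.

```python
# Hedged sketch: zero-shot evaluation with lm-evaluation-harness
# (pip install lm-eval). Assumption: this checkpoint loads through
# the standard "hf" backend; if it needs the MambaInLlama custom
# model code, swap in that loader instead.
import lm_eval

TASKS = [
    "arc_challenge", "arc_easy", "hellaswag", "mmlu", "openbookqa",
    "piqa", "pubmedqa", "race", "winogrande", "truthfulqa",
]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JunxiongWang/Llama3.1-Mamba2-8B-distill",
    tasks=TASKS,
    num_fewshot=0,  # all scores in the table are zero-shot
    batch_size=8,
)

# Per-task metrics live under results["results"]
for task, metrics in results["results"].items():
    print(task, metrics)
```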
```bibtex
@article{junxiongdaniele2024mambainllama,
  title   = {The Mamba in the Llama: Distilling and Accelerating Hybrid Models},
  author  = {Junxiong Wang and Daniele Paliotta and Avner May and Alexander M. Rush and Tri Dao},
  journal = {arXiv preprint arXiv:2408.15237},
  year    = {2024}
}
```