---
license: llama3.2
datasets:
  - BAAI/Infinity-Instruct
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
---

Model Overview

This model is a fine-tuned version of Llama-3.2-1B-Instruct trained with the LLM-Neo method. Usage is identical to the original Llama-3.2-1B-Instruct model.
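
Since usage mirrors the base model, a minimal loading and generation sketch with the Hugging Face transformers library is shown below; the repository id is a placeholder and the generation settings are illustrative.

```python
# Minimal usage sketch with transformers; the model id below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/Llama-3.2-1B-Instruct-Neo"  # placeholder: replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Prompt with the same chat template as the original Llama-3.2-1B-Instruct.
messages = [{"role": "user", "content": "Explain knowledge distillation in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```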

Training Details

The training process uses the LLM-Neo method. The dataset is a mixed sample drawn from BAAI/Infinity-Instruct, specifically the 0625 and 7M subsets, totaling 10k instruction samples. The teacher model for knowledge distillation (KD) is Llama-3.1-8B-Instruct, and training uses the following hyperparameters (an illustrative loss sketch follows the list):

  • Learning Rate: 1e-4
  • Epochs: 1
  • KD Ratio: 0.9
  • Rank: 128

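To make the role of the KD ratio and LoRA rank concrete, here is a minimal sketch of an LLM-Neo-style training objective: a KL distillation term against the teacher blended with the standard cross-entropy term, trained through a rank-128 LoRA adapter. The temperature, LoRA target modules, and function names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: blend a KL distillation loss against the teacher
# (Llama-3.1-8B-Instruct) with the usual cross-entropy loss, trained through
# a rank-128 LoRA adapter. The KD ratio and rank mirror the list above;
# everything else (temperature, alpha, target modules) is assumed.
import torch.nn.functional as F
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=128, lora_alpha=256, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)

def neo_style_loss(student_logits, teacher_logits, labels, kd_ratio=0.9, temperature=1.0):
    # Supervised fine-tuning term on the instruction labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Distillation term: match the teacher's token distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # KD ratio 0.9 weights distillation against supervised fine-tuning.
    return kd_ratio * kd + (1.0 - kd_ratio) * ce
```
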
Model Performance Evaluation

(Figure: Neo_radar — radar chart of benchmark performance)

The evaluation is divided into two parts: results from the lm-evaluation-harness framework and results from the math-evaluation-harness framework.

Note: The results are influenced by the specific benchmark versions and testing hardware/software configurations. Therefore, the reported metrics should be interpreted as relative performance within a given setup.

Part 1: lm-evaluation-harness results

In this part, the model was evaluated on several widely-used benchmark datasets, covering reasoning, commonsense, mathematics, and language understanding tasks. Below is a detailed comparison of the performance metrics between Llama-3.2-1B-Instruct and the current model:

| Dataset       | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------|-----------------------|---------------------------|
| ARC Challenge | 36.09                 | 36.43                     |
| ARC Easy      | 68.52                 | 67.51                     |
| CEval         | 39.45                 | 39.67                     |
| CMMLU         | 35.62                 | 36.48                     |
| MMLU          | 45.91                 | 46.27                     |
| HellaSwag     | 45.07                 | 45.84                     |
| OpenBookQA    | 24.40                 | 25.40                     |
| PIQA          | 73.88                 | 74.32                     |
| Winogrande    | 59.27                 | 61.17                     |

The results show that the current model outperforms Llama-3.2-1B-Instruct on most of these tasks, with the clearest gains on reasoning benchmarks (e.g., Winogrande) and commonsense benchmarks (e.g., PIQA, OpenBookQA).
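
These scores can in principle be reproduced with lm-evaluation-harness; the sketch below uses its Python entry point. The model path and task selection are illustrative, and exact numbers will vary with harness version and hardware, as noted above.

```python
# Illustrative evaluation sketch with the lm-evaluation-harness Python API.
# The model path is a placeholder; task names cover a subset of the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/Llama-3.2-1B-Instruct-Neo,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "piqa", "winogrande", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```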


Part 2: math-evaluation-harness results

In this part, the model was evaluated specifically on mathematical reasoning and related tasks, focusing on its ability to handle complex mathematical problems.

| Dataset      | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|--------------|-----------------------|---------------------------|
| GSM8K        | 35.00                 | 39.30                     |
| Minerva Math | 14.80                 | 22.80                     |
| SVAMP        | 50.40                 | 54.50                     |
| ASDiv        | 67.40                 | 71.20                     |
| MAWPS        | 83.50                 | 85.60                     |
| TabMWP       | 41.90                 | 35.40                     |
| MathQA       | 44.20                 | 48.30                     |
| MMLU-STEM    | 37.90                 | 38.90                     |

The mathematical evaluation shows clear improvements on complex problems, with notable gains on Minerva Math (+8.0 points) and GSM8K (+4.3 points).


Summary

  • Strengths: The current model demonstrates notable improvements over Llama-3.2-1B-Instruct across multiple benchmark tasks, particularly in reasoning and mathematical problem-solving.
  • Future Directions: Address the regression on tabular math reasoning (TabMWP) and continue improving general language understanding and mathematical adaptability.