---
license: llama3.2
datasets:
- BAAI/Infinity-Instruct
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

## Model Overview

This model is a fine-tuned version of **[Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** trained with the **[LLM-Neo](https://arxiv.org/abs/2411.06839)** method. Usage is identical to the original Llama-3.2-1B-Instruct model (an illustrative inference sketch is appended at the end of this card).

## Training Details

Training follows the **LLM-Neo** method, which combines LoRA with knowledge distillation (KD). The training data is a mixed sample drawn from **[BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)**, specifically the `0625` and `7M` subsets, totaling 10k instruction samples. The teacher model for distillation is **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)**. Hyperparameters (an illustrative sketch of the blended loss is also appended at the end of this card):

- **Learning Rate**: 1e-4
- **Epochs**: 1
- **KD Ratio**: 0.9
- **LoRA Rank**: 128

## Model Performance Evaluation

*(Figure: radar chart comparing Llama-3.2-1B-Instruct and the Neo fine-tune across benchmarks.)*

The evaluation is divided into two parts: results from the **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** framework and results from the **[math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness)** framework.

> **Note**: The results are influenced by the specific benchmark versions and the testing hardware/software configuration.
> The reported metrics should therefore be interpreted as relative performance within a given setup.

### Part 1: lm-evaluation-harness results

In this part, the model was evaluated on several widely used benchmarks covering reasoning, commonsense, mathematics, and language understanding. Below is a comparison between **Llama-3.2-1B-Instruct** and the current model:

| Dataset             | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------------------|
| ARC Challenge       | 36.09                  | 36.43                     |
| ARC Easy            | 68.52                  | 67.51                     |
| CEval               | 39.45                  | 39.67                     |
| CMMLU               | 35.62                  | 36.48                     |
| MMLU                | 45.91                  | 46.27                     |
| HellaSwag           | 45.07                  | 45.84                     |
| OpenBookQA          | 24.40                  | 25.40                     |
| PIQA                | 73.88                  | 74.32                     |
| Winogrande          | 59.27                  | 61.17                     |

The current model matches or outperforms **Llama-3.2-1B-Instruct** on most tasks, with the largest gain on **Winogrande** (+1.90) and smaller gains on commonsense tasks such as **PIQA** (+0.44); **ARC Easy** is the only regression (-1.01).

---

### Part 2: math-evaluation-harness results

In this part, the model was evaluated specifically on mathematical reasoning, focusing on its ability to handle multi-step math problems.

| Dataset             | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------------------|
| GSM8K               | 35.00                  | 39.30                     |
| Minerva Math        | 14.80                  | 22.80                     |
| SVAMP               | 50.40                  | 54.50                     |
| ASDiv               | 67.40                  | 71.20                     |
| MAWPS               | 83.50                  | 85.60                     |
| TabMWP              | 41.90                  | 35.40                     |
| MathQA              | 44.20                  | 48.30                     |
| MMLU-STEM           | 37.90                  | 38.90                     |

The mathematical evaluation shows clear gains, most notably on **Minerva Math** (+8.00) and **GSM8K** (+4.30), although performance drops on **TabMWP** (-6.50).

---

### Summary

- **Strengths**: The current model improves on **Llama-3.2-1B-Instruct** across most benchmarks, particularly in commonsense reasoning and mathematical problem solving.
- **Future Directions**: Recover the regression on tabular math reasoning (e.g., **TabMWP**) and continue improving general language understanding and mathematical ability.
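
## Usage Example (Illustrative)

Since usage matches the base model, a standard 🤗 `transformers` chat-template call works. This is a minimal sketch; the repo id below is a placeholder, as this card does not state the final repository name.

```python
# Minimal chat-style inference sketch; "your-org/Llama-3.2-1B-Instruct-Neo"
# is a hypothetical repo id -- replace it with this checkpoint's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Llama-3.2-1B-Instruct-Neo"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain knowledge distillation in one sentence."},
]
# Build the Llama 3.2 chat prompt and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```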
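
## Appendix: KD Loss Sketch (Illustrative)

For reference, a minimal sketch of how a KD ratio of 0.9 can blend a distillation term with the supervised cross-entropy term. This assumes the common forward-KL, logit-level KD formulation; the function and variable names are illustrative and are not taken from the LLM-Neo codebase.

```python
import torch.nn.functional as F

def blended_kd_loss(student_logits, teacher_logits, labels,
                    kd_ratio=0.9, temperature=1.0):
    """Illustrative blend of KL distillation and cross-entropy losses.

    The exact loss used by LLM-Neo may differ; this only shows how a
    kd_ratio of 0.9 weights the two terms.
    """
    vocab = student_logits.size(-1)
    # Supervised term: cross-entropy against ground-truth instruction targets.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )
    # Distillation term: forward KL from teacher to student token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # KD ratio 0.9 => 90% distillation signal, 10% supervised signal.
    return kd_ratio * kl + (1.0 - kd_ratio) * ce
```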