---
license: apache-2.0
---
# LLM360 Research Suite: K2 Loss Spike 1
We encountered two major loss spikes while [training K2](https://huggingface.co/LLM360/K2). 
* The first loss spike occurred after checkpoint 160 and lasted for ~34 checkpoints. We restarted training from checkpoint 160, and training returned to normal.
* The [second loss spike](https://huggingface.co/LLM360/K2-Spike-2/) occurred at checkpoint 186, after we restarted training to fix the first loss spike, and lasted for ~8 checkpoints.
* For every spike checkpoint, we also uploaded the corresponding normal checkpoint for easy comparison. Each checkpoint lives in its own branch (see the sketch below the figure).

We are releasing these checkpoints so others can study this interesting phenomenon in large-model training.

<img src="loss_spike.png" alt="k2 loss spikes"/>
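
A minimal sketch for listing the checkpoint branches programmatically, assuming the `huggingface_hub` client library is installed (this is an illustration, not part of the released artifacts):

```python
# List all branches of the spike repository via the Hugging Face Hub API.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("LLM360/K2-Spike-1")
for branch in refs.branches:
    print(branch.name)  # e.g. spike_ckpt_160, spike_ckpt_162, ...
```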

## Purpose
Loss spikes are still a relatively poorly understood phenomenon. By making these spikes and the associated training details available, we hope others will use these artifacts to further the world's knowledge on this topic.

## First 10 Checkpoints
| Checkpoints | Checkpoints |
| ----------- | ----------- |
| [Checkpoint 160](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_160)     | [Checkpoint 170](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_170)       |
| [Checkpoint 162](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_162)   | [Checkpoint 172](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_172)        |
| [Checkpoint 164](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_164)   | [Checkpoint 174](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_174)        |
| [Checkpoint 166](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_166)   | [Checkpoint 176](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_176)        |
| [Checkpoint 168](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_168)   | [Checkpoint 178](https://huggingface.co/LLM360/K2-Spike-1/tree/spike_ckpt_178)        |

To list all branches after cloning the repository, run `git branch -a`.
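
Each branch can also be loaded directly from Python by passing the branch name as the `revision` argument; a minimal sketch, assuming each branch contains a complete set of model and tokenizer files (note that the 65B checkpoints require substantial memory):

```python
# Illustrative sketch: load one spike checkpoint by branch name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LLM360/K2-Spike-1"
branch = "spike_ckpt_160"  # any branch from the table above

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=branch)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=branch)
```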

## Loss Spikes on the LLM360 Evaluation Suite

View all the evaluations on our [Weights & Biases dashboard](https://wandb.ai/llm360/K2?nw=7bxe4sz0vv).


## About the LLM360 Research Suite
The LLM360 Research Suite is a comprehensive set of large language model (LLM) artifacts from Amber, CrystalCoder, and K2 for academic and industry researchers to explore LLM training dynamics. Additional resources can be found at [llm360.ai](https://www.llm360.ai).

## Citation

**BibTeX:**

```bibtex
@misc{llm360k2,
      title={LLM360-K2-65B: Scaling Up Open and Transparent Language Models}, 
      author={The LLM360 Team},
      year={2024},
}
```