We hypothesize this is primarily due to the particularly narrow, low-variance weight distributions typical of Mistral-derived models, regardless of their scale.

In the end, given the cost of training another full 2-epoch run ($600) at an even lower rate, we settled on our third attempt: a 2e-6 learning rate with an effective batch size of 64, stopped earlier than the target 2 epochs.
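For readers unfamiliar with how an "effective batch size" is composed, a minimal sketch: trainers typically multiply the per-device batch size by the gradient accumulation steps and the number of data-parallel workers. The specific values below are illustrative assumptions, not our actual hardware setup.

```python
# Illustrative only: one way to reach an effective batch size of 64.
# These particular values (4 x 8 x 2) are assumptions for the example.
micro_batch_size = 4             # samples per GPU per forward/backward pass
gradient_accumulation_steps = 8  # micro-batches accumulated per optimizer step
num_gpus = 2                     # data-parallel workers

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64
```

Any combination with the same product yields the same effective batch size, which is why configs usually expose these three knobs separately.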
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/d9_cBy-DuWrdnoVBbAvRV.png)

We notice a correlation between the size of the 2nd-epoch loss drop and the strength of the learning rate, implying that 4e-6 leads to more catastrophic forgetting.
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)