We hypothesize this is primarily due to the particularly narrow, low-variance weight distributions typical of Mistral-derived models, regardless of their scale.

In the end, given the cost of training another full 2-epoch run ($600) at an even lower rate, we settled on our third attempt: a 2e-6 learning rate with an effective batch size of 64, stopped earlier than the target 2 epochs.
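For readers unfamiliar with how an "effective batch size" is composed, a minimal sketch: trainers typically multiply the per-device batch size by the gradient accumulation steps and the number of data-parallel workers. The specific values below are illustrative assumptions, not our actual hardware setup.

```python
# Illustrative only: one way to reach an effective batch size of 64.
# These particular values (4 x 8 x 2) are assumptions for the example.
micro_batch_size = 4             # samples per GPU per forward/backward pass
gradient_accumulation_steps = 8  # micro-batches accumulated per optimizer step
num_gpus = 2                     # data-parallel workers

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64
```

Any combination with the same product yields the same effective batch size, which is why configs usually expose these three knobs separately.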
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/d9_cBy-DuWrdnoVBbAvRV.png)

We notice a correlation between the size of the 2nd-epoch loss drop and the strength of the learning rate, implying that 4e-6 leads to more catastrophic forgetting.
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)