kalomaze committed on
Commit 8cc08c5
1 Parent(s): bae73c3

Update README.md

Files changed (1)
README.md +4 -1
README.md CHANGED
@@ -46,7 +46,10 @@ In addition to this, we noticed that Mistral Large models seemed much more sensi
 
 We hypothesize this is primarily due to the particularly narrow and low variance weight distributions typical of Mistral derived models regardless of their scale.
 
-In the end, due to the costs that would be involved in training another full 2 epochs run ($600), we settled on our third attempt: 2e-6 with an effective batch size of 64, stopped earlier than the target 2 epochs.
+In the end, due to the costs that would be involved in training another full 2 epochs run ($600) on an even lower rate, we settled on our third attempt: 2e-6 with an effective batch size of 64, stopped earlier than the target 2 epochs.
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/d9_cBy-DuWrdnoVBbAvRV.png)
+We notice a correlation between the significance of the 2nd epoch loss drop and the strength of the learning rate, implying 4e-6 leads to more catastrophic forgetting.
 
 [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
 
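
For readers unfamiliar with how an "effective batch size of 64" is usually reached, the sketch below shows the standard arithmetic (per-GPU micro-batch × gradient-accumulation steps × GPU count). The micro-batch, accumulation, and GPU figures are illustrative assumptions only; just the effective batch size of 64 and the 2e-6 learning rate come from the README text above.

```python
# Hypothetical breakdown of the "effective batch size of 64" mentioned in the diff.
# The per-GPU micro-batch, accumulation steps, and GPU count are assumed values;
# only their product (64) and the 2e-6 learning rate are stated in the README.
micro_batch_size = 2              # sequences per GPU per forward/backward pass (assumed)
gradient_accumulation_steps = 4   # optimizer steps accumulate this many micro-batches (assumed)
num_gpus = 8                      # data-parallel workers (assumed)

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
assert effective_batch_size == 64

learning_rate = 2e-6  # the settled-on rate; the higher 4e-6 attempt showed a larger 2nd-epoch loss drop
```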