---
license: apache-2.0
language:
- en
- fr
- de
- es
- it
- pt
- ru
- zh
- ja
pipeline_tag: text-generation
tags:
- chat
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/hkPzhL-xYPeGGKCyAf3Qd.png)

This is the sixth in a series of models designed to replicate the prose quality of the Claude 3 models, specifically Sonnet and Opus. This model is fine-tuned on top of [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

## Prompting

The model has been instruct-tuned with the Mistral formatting. A typical input would look like this:

```py
<s>[INST] SYSTEM MESSAGE\nUSER MESSAGE[/INST] ASSISTANT MESSAGE</s>[INST] USER MESSAGE[/INST]
```
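The template above can be assembled programmatically. The following is a minimal sketch, not an official utility: the function name `build_mistral_prompt` and its arguments are hypothetical, and it assumes the `\n` in the template denotes a real newline between the system and user message. Depending on your tokenizer settings, the leading `<s>` may already be added for you, in which case it should be dropped here.

```python
from typing import List, Optional, Tuple


def build_mistral_prompt(
    turns: List[Tuple[str, Optional[str]]],
    system: Optional[str] = None,
) -> str:
    """Assemble a prompt string following the template shown above.

    `turns` is a list of (user, assistant) pairs. Use None as the
    assistant reply for the final turn the model should complete.
    The system message (if any) is prepended to the first user message.
    """
    parts = ["<s>"]
    for index, (user, assistant) in enumerate(turns):
        # Only the first user turn carries the system message.
        content = f"{system}\n{user}" if (index == 0 and system) else user
        parts.append(f"[INST] {content}[/INST]")
        if assistant is not None:
            # Completed assistant turns are closed with the EOS token.
            parts.append(f" {assistant}</s>")
    return "".join(parts)


prompt = build_mistral_prompt(
    [("USER MESSAGE", "ASSISTANT MESSAGE"), ("USER MESSAGE", None)],
    system="SYSTEM MESSAGE",
)
print(prompt)
```

Passing `None` as the last assistant reply leaves the prompt open after the final `[/INST]`, which is where generation should begin.
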

We also provide SillyTavern presets for [Context](https://huggingface.co/anthracite-org/Magnum-123b-v1/resolve/main/Magnum-Mistral-Context.json) and [Instruct](https://huggingface.co/anthracite-org/Magnum-123b-v1/raw/main/Magnum-Mistral-Instruct.json) respectively.

The Mistral preset included in SillyTavern seems to be misconfigured by default, so we recommend using these as a replacement.

## Credits

- [anthracite-org/Stheno-Data-Filtered](https://huggingface.co/datasets/anthracite-org/Stheno-Data-Filtered)
- [anthracite-org/kalo-opus-instruct-22k-no-refusal](https://huggingface.co/datasets/anthracite-org/kalo-opus-instruct-22k-no-refusal)
- [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed)

This model has been a team effort, and credit goes to all members of Anthracite.

## Training

Training ran for 1.5 epochs. We used 8x [AMD Instinct™ MI300X Accelerators](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) for the full-parameter fine-tuning of the model.

In addition to this, we noticed that Mistral Large models seemed much more sensitive to learning rate adjustments than other models:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/xCK3ISKF6pWcMyO7MEzTA.png)

We hypothesize this is primarily due to the particularly narrow, low-variance weight distributions typical of Mistral-derived models regardless of their scale.

In the end, due to the cost of training another full 2-epoch run ($600) at an even lower learning rate, we settled on our third attempt: 2e-6 with an effective batch size of 64, stopped earlier than the target 2 epochs.

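To make the effective batch size of 64 concrete, here is a small sketch of how it could decompose across 8 accelerators. The per-device micro-batch size and gradient accumulation steps are not stated in this card, so the values below are purely hypothetical, and the helper `grad_accum_steps` is an illustrative name. It also assumes every device processes its own micro-batch (data-parallel style batching).

```python
def grad_accum_steps(effective_batch: int, num_devices: int, micro_batch: int) -> int:
    """Return the gradient accumulation steps needed so that
    num_devices * micro_batch * steps == effective_batch."""
    per_step = num_devices * micro_batch
    if effective_batch % per_step != 0:
        raise ValueError(
            f"effective batch {effective_batch} is not divisible by {per_step}"
        )
    return effective_batch // per_step


# Hypothetical micro-batch sizes; only the totals (64 and 8 devices) come from the card.
for micro_batch in (1, 2, 4, 8):
    steps = grad_accum_steps(effective_batch=64, num_devices=8, micro_batch=micro_batch)
    print(f"micro_batch={micro_batch} -> gradient_accumulation_steps={steps}")
```

Any of these combinations yields the same effective batch size; which one is practical depends on available memory per device.
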
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/d9_cBy-DuWrdnoVBbAvRV.png)

We notice a correlation between the size of the second-epoch loss drop and the magnitude of the learning rate, suggesting that 4e-6 leads to more catastrophic forgetting.

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)

## Safety

...