---
license: llama3
language:
- gsw
datasets:
- cis-lmu/Glot500
- cis-lmu/GlotCC-V1
pipeline_tag: text-generation
base_model: NousResearch/Hermes-2-Pro-Llama-3-8B
model_type: LlamaForCausalLM
tags:
- Llama-3
- instruct
- finetune
- qlora
- chatml
- synthetic data
- axolotl
---
# Alpesteibock-Llama-3-8B-Alpha
**Alpesteibock-Llama-3-8B-Alpha** is an experimental QLoRA fine-tune of [NousResearch/Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B), trained for two epochs on a dataset of 34.7 million tokens of Swiss German text from multiple sources.
## License
This model is released under the [Llama 3 Community License](https://llama.meta.com/llama3/license/).
## Usage
The model uses ChatML as an instruction template and was trained using "You are Alpesteibock, a helpful assistant who speaks Swiss German." as a system message:
```
<|im_start|>system
You are Alpesteibock, a helpful assistant who speaks Swiss German.<|im_end|>
<|im_start|>user
Hoi. Wie heissisch du?<|im_end|>
<|im_start|>assistant
Ich bi de Alpesteibock und ich freu mi uf di.<|im_end|>
```
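As a minimal sketch of how the template above could be assembled in code, the following pure-Python helper renders a list of messages as a ChatML string. The names `build_chatml_prompt` and `SYSTEM_MESSAGE` are hypothetical and introduced only for illustration; the trailing open `assistant` turn is an assumption about how one would prompt the model for a reply.

```python
# Hypothetical helper: renders chat messages in the ChatML layout shown above.
SYSTEM_MESSAGE = "You are Alpesteibock, a helpful assistant who speaks Swiss German."


def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": SYSTEM_MESSAGE},
    {"role": "user", "content": "Hoi. Wie heissisch du?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

If the model repository ships a chat template, the tokenizer's `apply_chat_template` method in the transformers library would normally take the place of a hand-written helper like this one.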
## Dataset
The dataset used for training consists of the following sources:
| Dataset | File Size | Description | Phase |
|---------|-----------|-------------|-------|
| [Glot500 Corpus](https://huggingface.co/datasets/cis-lmu/Glot500) (gsw_Latn, Leipzig_web) | 21.7 MB | Text, usually sentences, crawled from the web | 1 |
| [Alemannic Wikipedia](https://dumps.wikimedia.org/alswiki/) (Subset) | 50.5 MB | Articles in the Alemannic Wikipedia with most of those written in Alsatian filtered out | 2 |
| [Schweizerdeutscher Mundartkorpus](https://chmk.ch/) (Copyright-Free Subset) | 28.4 MB | Copyright-free books written in Swiss German | 2 |
| [GlotCC-V1.0](https://huggingface.co/datasets/cis-lmu/GlotCC-V1) (gsw-Latn) | 7.5 MB | Document-level general domain monolingual dataset derived from CommonCrawl | 2 |
| Synthetic Instruction Data | 1.7 MB | Different datasets of synthetically generated Swiss German text | 2 |
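To make the split between the two phases easier to see, the file sizes in the table can be summed per phase. This is only a sketch: the values are copied from the table, and the names `SOURCES` and `size_by_phase` are hypothetical.

```python
# (source, file size in MB, phase), copied from the table above.
SOURCES = [
    ("Glot500 Corpus", 21.7, 1),
    ("Alemannic Wikipedia", 50.5, 2),
    ("Schweizerdeutscher Mundartkorpus", 28.4, 2),
    ("GlotCC-V1.0", 7.5, 2),
    ("Synthetic Instruction Data", 1.7, 2),
]


def size_by_phase(sources):
    """Sum file sizes (MB) per training phase, rounded to one decimal."""
    totals = {}
    for _name, size_mb, phase in sources:
        totals[phase] = totals.get(phase, 0.0) + size_mb
    return {phase: round(total, 1) for phase, total in totals.items()}


phase_totals = size_by_phase(SOURCES)
total_mb = round(sum(phase_totals.values()), 1)
print(phase_totals, total_mb)
```

By file size, the first phase covers 21.7 MB and the second phase 88.1 MB, for 109.8 MB in total. File size is only a rough proxy for token count.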
## Training Details
Hardware: 1x RTX 4090
Duration: 40 hours in total (2 hours for the first phase and 38 hours for the second phase)
### Hyperparameters
Adapter: QLoRA
Precision: 4-bit
Optimizer: adamw_bnb_8bit
LoRA Rank: 256
LoRA Alpha: 256
Learning Rate: 1e-5
Scheduler: Cosine
Context Length: 4096
Batch Size: 1
Gradient Accumulation Steps: 1
Sample Packing: On for first phase, Off for second phase
Epochs: 2
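
A few quantities follow directly from the hyperparameters above. The sketch below derives them under two assumptions that the card does not state explicitly: that the standard LoRA scaling of alpha divided by rank was used (not a rank-stabilized variant), and that the batch size is per device on the single GPU listed. The dictionary keys and the function name are hypothetical.

```python
# Values copied from the hyperparameter list; key names are illustrative only.
HYPERPARAMETERS = {
    "lora_rank": 256,
    "lora_alpha": 256,
    "context_length": 4096,
    "batch_size": 1,
    "gradient_accumulation_steps": 1,
    "num_gpus": 1,  # from "Hardware: 1x RTX 4090"
}


def derived_quantities(hp):
    """Compute LoRA scaling and per-step batch figures from the settings."""
    # Assumption: standard LoRA scaling, alpha / rank.
    lora_scaling = hp["lora_alpha"] / hp["lora_rank"]
    # Sequences contributing to each optimizer step.
    effective_batch = (
        hp["batch_size"] * hp["gradient_accumulation_steps"] * hp["num_gpus"]
    )
    # Upper bound on tokens per optimizer step (reached only with full sequences).
    max_tokens_per_step = effective_batch * hp["context_length"]
    return {
        "lora_scaling": lora_scaling,
        "effective_batch_size": effective_batch,
        "max_tokens_per_step": max_tokens_per_step,
    }


derived = derived_quantities(HYPERPARAMETERS)
print(derived)
```

Under these assumptions the adapter update is applied with a scaling factor of 1.0, and each optimizer step sees a single sequence of at most 4096 tokens. With sample packing on, as in the first phase, sequences are typically close to that limit.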