---
language:
- fi
license: apache-2.0
tags:
- finnish
- llama
inference: false
pipeline_tag: text-generation
base_model: Finnish-NLP/Ahma-3B
---

# Ahma-3B-Instruct for Finnish

Ahma-3B-Instruct is an instruct/chat-tuned version of [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B), trained to follow instructions in Finnish. The base Ahma 3B model is a decoder-only transformer based on Meta's Llama (v1) architecture, pretrained from scratch on the Finnish language. The original Llama model architecture was introduced in [this paper](https://arxiv.org/abs/2302.13971) and first released at [this page](https://github.com/facebookresearch/llama).

What does Ahma mean? Ahma is the Finnish word for wolverine! In Finnish Lapland, wolverines are the biggest cause of reindeer damage.

There are two different sized base Ahma models, both pretrained from scratch for 139B tokens:

| Model                                                 | Context length | Layers | Dim  | Heads | Params |
|:------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
| [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048           | 26     | 3200 | 32    | 3.6B   |
| [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048           | 32     | 4096 | 32    | 7.0B   |

And two instruct-tuned versions:

| Model                                                                   | Context length | Layers | Dim  | Heads | Params |
|:------------------------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
| [Ahma-3B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-3B-Instruct) | 2048           | 26     | 3200 | 32    | 3.6B   |
| [Ahma-7B-Instruct](https://huggingface.co/Finnish-NLP/Ahma-7B-Instruct) | 2048           | 32     | 4096 | 32    | 7.0B   |
 
## Intended uses & limitations

This model was fine-tuned for instruction following. Instruction-tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

### How to use

If you want to use this model for instruction following, you need to use the same prompt format we used in the fine-tuning process (essentially the same format that Meta used in their Llama2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use the AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa."


def format_prompt(prompt: str) -> str:
    prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
    return prompt


tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B-Instruct")
model = model.to("cuda")

# use the custom prompt format function or the chat template feature in the tokenizer to format your inputs

# prompt = format_prompt("Kerro kolme hyötyä, joita pienet avoimen lähdekoodin kielimallit tuovat?")
# inputs = tokenizer(prompt, return_tensors="pt")

messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "Kerro kolme hyötyä, joita pienet avoimen lähdekoodin kielimallit tuovat?"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

generated_ids = model.generate(
    inputs,
    temperature=0.6,
    penalty_alpha=0.6,
    top_k=4,
    do_sample=True,
    repetition_penalty=1.2,
    min_length=5,
    max_length=2048,
)
generated_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=False
)[0]

'''
1) Parantuneet keskustelutaidot: Pienet, hyvin koulutetut kielimallit voidaan kouluttaa ymmärtämään ja tuottamaan ihmisen kaltaista kieltä, mikä johtaa luonnollisempaan keskusteluun. Tämä voi olla erityisen hyödyllistä sovelluksissa, kuten chat-roboteissa, virtuaaliavustajissa ja kielenkääntämisessä.

2) Lisääntynyt luovuus kirjoittamisessa: Kielimallit voivat auttaa kirjoittajia tuottamalla ideoita, lauseita ja virkkeitä, jotka ovat hiottuja ja merkityksellisiä. Tämä voi johtaa parempaan kirjoituslaatuun, parempaan organisointiin ja tehokkaampaan viestintään.

3) Parempi tietojenkäsittely ja -tallennus: Pienemmät ja edullisemmat kielimallit voivat mullistaa tietojenkäsittelyn ja tallennuksen. Ne voivat säästää tilaa ja resursseja, koska ne pystyvät suorittamaan tiettyjä tehtäviä tehokkaammin kuin perinteiset koneoppimisalgoritmit. Lisäksi kielimallien avoimen lähdekoodin luonne mahdollistaa sen, että tutkijat, kehittäjät ja yritykset voivat tehdä niihin parannuksia ja lisäyksiä, mikä voi johtaa entistä kehittyneempiin ja monipuolisempiin ratkaisuihin.
'''
```

You may experiment with different system prompt instructions too if you like.

### Limitations and bias

The training data used for this model contains a lot of content from the internet, which is far from neutral. Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

To better reflect the data distribution of the training set and to balance common and rare samples during training, we implemented the "ClusterClip Sampling" method by [Shao et al. (2024)](https://arxiv.org/abs/2402.14526), using [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) embeddings and KMeans clustering with 30 clusters. The training datasets mentioned below were created using this sampling method.
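
The clip-and-upsample idea behind this can be sketched in miniature. This is a toy illustration only: `clusterclip_weights` and the `max_repeat` cap are our own illustrative names, and the real method operates on KMeans clusters of bge-m3 embeddings as described above, not on toy labels:

```python
from collections import Counter


def clusterclip_weights(labels, max_repeat=4.0):
    # Toy sketch of ClusterClip-style sampling (not the paper's exact
    # algorithm): weight each example inversely to its cluster's size so
    # rare clusters are upsampled, then clip the repetition factor so
    # rare examples are not repeated excessively.
    counts = Counter(labels)
    mean_size = len(labels) / len(counts)
    return [min(mean_size / counts[c], max_repeat) for c in labels]


# Cluster 0 is common (4 examples), cluster 1 is rare (1 example):
weights = clusterclip_weights([0, 0, 0, 0, 1])
```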

There has also been some indication that gradually increasing the training example lengths during training can be beneficial. Thus, the training dataset was split into 4 bins based on example length, and examples were then sampled from the bins so that example lengths gradually increase towards the end of the training, while a small number of shorter examples remain present throughout.
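
The binning step can be sketched roughly as follows; `length_bins` is an illustrative helper, not the actual data pipeline:

```python
def length_bins(examples, num_bins=4):
    # Illustrative sketch: sort examples by length and split them into
    # num_bins bins. Training then draws mostly from the early (short)
    # bins first and from later (longer) bins towards the end, while a
    # small share of short examples is kept throughout.
    ordered = sorted(examples, key=len)
    size = -(-len(ordered) // num_bins)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


bins = length_bins(["aa", "a", "aaaa", "aaa"], num_bins=2)
```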

This model was first supervised fine-tuned (SFT) on a combination of the following datasets:

| Dataset                                          | Dataset type           | Upsampling | Words      | Ratio    | Average words per example |
|:-------------------------------------------------|:-----------------------|:-----------|:-----------|:---------|:--------------------------|
| Aya Finnish                                      | Finnish single-turn    | 2.9X       | 55K        | 0.54%    | 83                        |
| OASST                                            | Translated single-turn | 2.9X       | 507K       | 5.01%    | 139                       |
| ai2_arc                                          | Translated single-turn | 2.9X       | 12K        | 0.12%    | 39                        |
| chatbot_arena                                    | Translated single-turn | 2.8X       | 554K       | 5.48%    | 147                       |
| dibt10k                                          | Translated single-turn | 2.9X       | 363K       | 3.58%    | 262                       |
| dolly                                            | Translated single-turn | 2.9X       | 221K       | 2.19%    | 71                        |
| Aya Dutch                                        | Translated single-turn | 2.9X       | 13K        | 0.12%    | 36                        |
| Aya English                                      | Translated single-turn | 2.9X       | 97K        | 0.96%    | 61                        |
| Aya French                                       | Translated single-turn | 3.7X       | 75K        | 0.74%    | 58                        |
| intel_dpo                                        | Translated single-turn | 2.9X       | 539K       | 5.33%    | 163                       |
| lmsys_1m                                         | Translated single-turn | 2.8X       | 2187K      | 21.61%   | 246                       |
| news_qa                                          | Translated single-turn | 2.9X       | 297K       | 2.94%    | 152                       |
| orca_math                                        | Translated single-turn | 2.9X       | 1165K      | 11.51%   | 196                       |
| Aya Portuguese                                   | Translated single-turn | 2.9X       | 97K        | 0.96%    | 27                        |
| Aya Spanish                                      | Translated single-turn | 2.8X       | 52K        | 0.51%    | 54                        |
| Aya Swedish                                      | Translated single-turn | 2.9X       | 5K         | 0.05%    | 41                        |
| ultrachat                                        | Translated single-turn | 2.8X       | 2199K      | 21.73%   | 221                       |
| lmsys_multiturn                                  | Translated multi-turn  | 2.9X       | 490K       | 4.84%    | 379                       |
| oaast2_multiturn                                 | Translated multi-turn  | 2.8X       | 593K       | 5.86%    | 307                       |
| suomitrivia_synthetic                            | Synthetic single-turn  | 1.0X       | 4K         | 0.04%    | 16                        |
| wikipedia_multitask_synthetic_qa                 | Synthetic single-turn  | 1.0X       | 206K       | 2.03%    | 499                       |
| wikipedia_synthetic_qa_reasoning                 | Synthetic single-turn  | 1.0X       | 201K       | 1.98%    | 477                       |
| wikipedia_synthetic_person_discussions_multiturn | Synthetic multi-turn   | 1.0X       | 188K       | 1.85%    | 194                       |
| **TOTAL**                                        |                        |            | **10121K** | **100%** | **168**                   |

After tokenization, the SFT training dataset had 23 million tokens, and 5% of the dataset was split off for evaluation during training.

The SFT model was then further fine-tuned with Direct Preference Optimization (DPO) on a combination of the following datasets:

| Dataset         | Dataset type           | Upsampling | Words     | Ratio    | Average words per example |
|:----------------|:-----------------------|:-----------|:----------|:---------|:--------------------------|
| intel_dpo       | Translated single-turn | 1.3X       | 467K      | 39.75%   | 153                       |
| ultrachat       | Translated single-turn | 1.2X       | 1017K     | 57.24%   | 220                       |
| suomitrivia_dpo | Synthetic single-turn  | 1.0X       | 5K        | 3.01%    | 16                        |
| **TOTAL**       |                        |            | **1489K** | **100%** | **130**                   |

After tokenization, the DPO training dataset had 3 million tokens, and 5% of the dataset was split off for evaluation during training.

## Training procedure

### Preprocessing

Texts are tokenized using Byte Pair Encoding (BPE), using the implementation from SentencePiece, splitting all numbers into individual digits and using bytes to decompose unknown UTF-8 characters. The total vocabulary size is 64k tokens. Inputs are sequences of 2048 consecutive tokens. Texts are not lowercased, so this model is case-sensitive: it makes a difference between finnish and Finnish. Both BOS and EOS tokens were used in the fine-tuning.
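
As a rough illustration of the digit-splitting behaviour (the real splitting happens inside the SentencePiece tokenizer, not via a regex like this):

```python
import re


def split_digits(text: str) -> str:
    # Insert a space between consecutive digits so every digit becomes
    # its own symbol, e.g. "2024" -> "2 0 2 4". This only mimics, at a
    # high level, what the tokenizer's digit splitting achieves.
    return re.sub(r"(?<=\d)(?=\d)", " ", text)


out = split_digits("Ahma esikoulutettiin 139B tokenilla")
```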

### Supervised fine-tuning (SFT)

This model was first supervised fine-tuned (SFT) using the [unsloth](https://github.com/unslothai/unsloth) framework on a single NVIDIA GeForce RTX 4080 GPU. The model was fine-tuned for 1 epoch with a learning rate of 5e-05, weight decay of 5e-03, learning rate warmup ratio of 0.1 with cosine decay, batch size of 4 and gradient accumulation of 8 for an effective batch size of 32, max sequence length of 2048, and NEFTune noise alpha of 5. The optimizer was "paged_adamw_8bit" and the model was loaded with 4-bit quantization. Training used Rank-Stabilized LoRA (RSLoRA) with a rank of 256 and alpha of 128, LoRA dropout of 0.02, and target modules "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj".
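
For reference, the SFT hyperparameters above can be collected in one place. This is a hedged sketch, not the authors' actual training script; the dict keys are illustrative names:

```python
# SFT hyperparameters as reported above (illustrative summary only).
sft_hparams = {
    "epochs": 1,
    "learning_rate": 5e-05,
    "weight_decay": 5e-03,
    "warmup_ratio": 0.1,
    "lr_schedule": "cosine",
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "neftune_noise_alpha": 5,
    "optimizer": "paged_adamw_8bit",
    "lora": {"type": "rslora", "r": 256, "alpha": 128, "dropout": 0.02},
}

# The effective batch size is per-device batch size x gradient accumulation.
effective_batch = (
    sft_hparams["per_device_batch_size"]
    * sft_hparams["gradient_accumulation_steps"]
)  # 32
```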

### Direct Preference Optimization (DPO) fine-tuning

The SFT model was then further fine-tuned with Direct Preference Optimization (DPO) using the [unsloth](https://github.com/unslothai/unsloth) framework on a single NVIDIA GeForce RTX 4080 GPU. The model was fine-tuned for 1 epoch with a learning rate of 2e-05, weight decay of 0.0, learning rate warmup ratio of 0.1 with cosine decay, batch size of 2 and gradient accumulation of 8 for an effective batch size of 16, and max sequence length of 2048. The optimizer was "paged_adamw_8bit". Training used Rank-Stabilized LoRA (RSLoRA) with a rank of 64 and alpha of 32, LoRA dropout of 0.05, and target modules "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj".
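
For readers unfamiliar with DPO, it trains on preference pairs rather than plain completions. A single training example, in the format commonly used by DPO trainers, might look like the following (the content is made up for illustration and is not from the actual dataset):

```python
# Illustrative DPO preference pair: the model is pushed towards the
# "chosen" response and away from the "rejected" one for the same prompt.
dpo_example = {
    "prompt": "Mikä on Suomen pääkaupunki?",
    "chosen": "Suomen pääkaupunki on Helsinki.",
    "rejected": "Suomen pääkaupunki on Tukholma.",
}
```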

## Evaluation results

### FIN-bench

This Ahma-3B-Instruct model was evaluated using [FIN-bench by TurkuNLP](https://github.com/TurkuNLP/FIN-bench), and the same evaluation was carried out for other relevant Finnish models for comparison: [FinGPT 8B by TurkuNLP](https://huggingface.co/TurkuNLP/gpt3-finnish-8B), [Viking 7B by TurkuNLP, SiloGen and HPLT](https://huggingface.co/LumiOpen/Viking-7B), and [Poro 34B by SiloGen, TurkuNLP and HPLT](https://huggingface.co/LumiOpen/Poro-34B). Below are the results with 0-shot and 3-shot settings in FIN-bench.

0-shot results:

| Benchmark                  | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies                  | 50.77                                 | 48.46                                     | TBA                                   | TBA                                       | 49.23     | 40.00     | 54.62                 |
| Arithmetic                 | 27.64                                 | 22.14                                     | TBA                                   | TBA                                       | 33.15     | 30.16     | 30.34                 |
| Cause and Effect           | 59.48                                 | 58.82                                     | TBA                                   | TBA                                       | 66.01     | 58.82     | 62.74                 |
| Emotions                   | 36.25                                 | 28.12                                     | TBA                                   | TBA                                       | 22.50     | 26.25     | 35.63                 |
| Empirical Judgements       | 33.33                                 | 35.35                                     | TBA                                   | TBA                                       | 27.27     | 33.33     | 49.49                 |
| General Knowledge          | 44.29                                 | 48.57                                     | TBA                                   | TBA                                       | 40.00     | 24.29     | 51.43                 |
| HHH Alignment              | 42.09                                 | 41.66                                     | TBA                                   | TBA                                       | 41.81     | 42.51     | 42.92                 |
| Intent Recognition         | 24.42                                 | 26.16                                     | TBA                                   | TBA                                       | 17.49     | 22.40     | 68.35                 |
| Misconceptions             | 46.27                                 | 47.01                                     | TBA                                   | TBA                                       | 53.73     | 53.73     | 52.24                 |
| Paraphrase                 | 59.50                                 | 73.00                                     | TBA                                   | TBA                                       | 51.00     | 50.00     | 51.00                 |
| Sentence Ambiguity         | 53.33                                 | 65.00                                     | TBA                                   | TBA                                       | 51.67     | 48.33     | 50.00                 |
| Similarities Abstraction   | 65.79                                 | 68.42                                     | TBA                                   | TBA                                       | 60.53     | 65.79     | 60.53                 |
| **Non-Arithmetic Average** | **47.55**                             | **48.95**                                 | TBA                                   | TBA                                       | **46.17** | **44.42** | **52.08**             |
| **Overall Average**        | **36.49**                             | **34.06**                                 | TBA                                   | TBA                                       | **38.93** | **36.50** | **40.00**             |

3-shot results:

| Benchmark                  | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies                  | 50.77                                 | 49.23                                     | TBA                                   | TBA                                       | 40.77     | 54.62     | 76.92                 |
| Arithmetic                 | 38.38                                 | 43.89                                     | TBA                                   | TBA                                       | 43.63     | 45.78     | 53.68                 |
| Cause and Effect           | 60.78                                 | 64.71                                     | TBA                                   | TBA                                       | 64.05     | 58.17     | 67.32                 |
| Emotions                   | 30.00                                 | 41.25                                     | TBA                                   | TBA                                       | 44.37     | 48.13     | 56.87                 |
| Empirical Judgements       | 46.46                                 | 44.44                                     | TBA                                   | TBA                                       | 32.32     | 43.43     | 63.64                 |
| General Knowledge          | 47.14                                 | 40.00                                     | TBA                                   | TBA                                       | 54.29     | 28.57     | 74.29                 |
| HHH Alignment              | 43.53                                 | 44.80                                     | TBA                                   | TBA                                       | 45.39     | 44.80     | 46.07                 |
| Intent Recognition         | 20.52                                 | 44.22                                     | TBA                                   | TBA                                       | 51.45     | 58.82     | 83.67                 |
| Misconceptions             | 50.75                                 | 52.24                                     | TBA                                   | TBA                                       | 52.99     | 46.27     | 52.99                 |
| Paraphrase                 | 50.50                                 | 58.50                                     | TBA                                   | TBA                                       | 53.00     | 54.50     | 55.00                 |
| Sentence Ambiguity         | 53.33                                 | 48.33                                     | TBA                                   | TBA                                       | 51.67     | 53.33     | 66.67                 |
| Similarities Abstraction   | 69.74                                 | 72.37                                     | TBA                                   | TBA                                       | 64.47     | 73.68     | 75.00                 |
| **Non-Arithmetic Average** | **48.48**                             | **51.49**                                 | TBA                                   | TBA                                       | **51.19** | **50.94** | **61.96**             |
| **Overall Average**        | **42.87**                             | **47.27**                                 | TBA                                   | TBA                                       | **46.99** | **48.07** | **57.36**             |

As we can see, the Ahma-3B-Instruct model outperforms 2X larger models such as FinGPT 8B and Viking 7B, especially in non-arithmetic tasks in 0-shot usage. Even the 10X larger Poro 34B model, which is generally better, doesn't show a huge performance difference considering its size, and Ahma-3B-Instruct actually surpasses it in some tasks.

In the 3-shot setting, the Ahma-3B-Instruct model follows few-shot examples better than the base Ahma 3B model. This could be due to the inclusion of multi-turn examples in the fine-tuning dataset.

### MTBench Finnish

This Ahma-3B-Instruct model was primarily evaluated using [MTBench Finnish by LumiOpen](https://github.com/LumiOpen/FastChat/tree/main/fastchat/llm_judge), since this model is fine-tuned for chat and instruction following. Because MTBench also evaluates multi-turn chats while the Ahma base models were pretrained only with single-turn instruction-following examples, we report MTBench Finnish results separately for its single-turn and multi-turn evaluation examples. This lets us evaluate how much the Ahma-3B-Instruct model improves on multi-turn chats, since its fine-tuning dataset included some multi-turn examples too. The presumably multi-turn results of the [Poro 34B Chat by SiloGen, TurkuNLP and HPLT](https://huggingface.co/LumiOpen/Poro-34B-chat) model are copied from its model card for comparison.

Single-turn results:

| Benchmark           | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
| Coding              | 1.00                                  | 1.00             | TBA                                   | TBA              |
| Extraction          | 2.00                                  | 1.30             | TBA                                   | TBA              |
| Humanities          | 4.05                                  | 6.20             | TBA                                   | TBA              |
| Math                | 3.00                                  | 3.20             | TBA                                   | TBA              |
| Reasoning           | 2.90                                  | 4.60             | TBA                                   | TBA              |
| Roleplay            | 4.80                                  | 6.50             | TBA                                   | TBA              |
| STEM                | 5.10                                  | 5.95             | TBA                                   | TBA              |
| Writing             | 6.60                                  | 9.00             | TBA                                   | TBA              |
| **Overall Average** | **3.68**                              | **4.72**         | TBA                                   | TBA              |

Multi-turn results:

| Benchmark           | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
| Coding              | 1.00                                  | 1.00             | TBA                                   | TBA              | 3.70          |
| Extraction          | 1.55                                  | 1.15             | TBA                                   | TBA              | 6.37          |
| Humanities          | 3.25                                  | 6.20             | TBA                                   | TBA              | 9.25          |
| Math                | 2.20                                  | 2.70             | TBA                                   | TBA              | 1.20          |
| Reasoning           | 2.45                                  | 3.50             | TBA                                   | TBA              | 4.35          |
| Roleplay            | 4.90                                  | 6.40             | TBA                                   | TBA              | 7.35          |
| STEM                | 4.20                                  | 4.78             | TBA                                   | TBA              | 7.80          |
| Writing             | 3.80                                  | 6.65             | TBA                                   | TBA              | 8.50          |
| **Overall Average** | **2.92**                              | **4.05**         | TBA                                   | TBA              | **6.06**      |

As we can see, the Ahma-3B-Instruct model significantly improves upon the base Ahma-3B model, especially in tasks like writing. It is also worth noting that the Ahma-3B-Instruct model performs better in multi-turn tasks than the base model, which highlights the value of the multi-turn training examples used in the fine-tuning process. The Ahma-3B-Instruct model lost 14% of its single-turn overall score in the multi-turn setting, while the base Ahma-3B model lost 21%. This instruct model may therefore be better suited for chat use cases as well. As expected, coding performance was poor, since the Ahma models are not trained on code data.
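
The 14% and 21% relative drops quoted here follow directly from the overall averages in the single-turn and multi-turn tables:

```python
# Relative drop from single-turn to multi-turn overall average, as a
# percentage of the single-turn score (values taken from the tables).
instruct_drop = (4.72 - 4.05) / 4.72 * 100  # Ahma-3B-Instruct, ~14%
base_drop = (3.68 - 2.92) / 3.68 * 100      # Ahma-3B base, ~21%
```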

The Ahma models also tended to repeat the generated text endlessly in some evaluation examples, which affected the scoring. Adding a repetition penalty to the evaluation script's generation method already improved the scores significantly, so in real-world use the Ahma models should be run with better generation settings than those used in this benchmark.

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the [TPU Research Cloud](https://sites.research.google/trc/).

## Team Members

- Aapo Tanskanen, [Hugging Face profile](https://huggingface.co/aapot), [LinkedIn profile](https://www.linkedin.com/in/aapotanskanen/)
- Rasmus Toivanen, [Hugging Face profile](https://huggingface.co/RASMUS), [LinkedIn profile](https://www.linkedin.com/in/rasmustoivanen/)

Feel free to contact us for more details 🤗

![Ahma](ahma.jpg)