---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- chat
- trl
- sft
- math
library_name: transformers
model-index:
- name: Qwen2.5-1.5B-Instruct-QwQ
  results:
  - task:
      type: text-generation
    dataset:
      name: GSM8k
      type: gsm8k
    metrics:
    - name: pass@4
      type: pass@4
      value: 89.6
      verified: false
---
# Qwen2.5-1.5B-Instruct-QwQ

## Introduction

Qwen2.5-1.5B-Instruct-QwQ is a fine-tuned version of Qwen2.5-1.5B-Instruct. It was fine-tuned on roughly 20k samples from QwQ-32B-Preview. Compared to Qwen2.5-1.5B-Instruct, the fine-tuned model appears to perform better on mathematics and general reasoning tasks. It also shows some capacity for self-correction, although this is limited. Larger models seem to learn self-correction better: in my experiments, the 3B and 7B versions show much stronger self-correction.

For data generation, math problems from the train sets of the GSM8k and MATH datasets were used.


**This repo contains the instruction-tuned 1.5B Qwen2.5 model fine-tuned on QwQ reasoning chains**, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: 32,768 tokens in full, with up to 8,192 tokens of generation
  

## Quickstart

The following code snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "micaebe/Qwen2.5-1.5B-Instruct-QwQ"

# Load the model and tokenizer; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the chat template and append the assistant turn header.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Here is an example for the question "Is 9.9 or 9.10 larger?", sampled greedily. The model gets the answer right, but the reasoning is somewhat erratic: it starts out strong, then begins to drift once it tries to approach the problem from another angle.
```
<|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Is 9.9 or 9.10 larger?<|im_end|>
<|im_start|>assistant
So I have this problem here: "Is 9.9 or 9.10 larger?" Okay, let's see what it's asking.

First off, both numbers are decimals with the same whole number part, which is 9 in this case. So that means they're both nine units long, right?

Now, when comparing two decimal numbers where the whole parts are equal, we just need to compare the digits after the decimal point. That makes sense because if the whole parts are the same, then the difference must be in the tenths place.

Let me write down the numbers again:

- 9.9

- 9.10

Okay, so the whole part is 9 for both, but the tenths places differ. In 9.9, the tenth is 9, and in 9.10, the tenth is 1.

Since 9 is greater than 1, it follows that 9.9 is greater than 9.10.

Wait a minute, does that make sense? Let's think about it another way.

Imagine you have 9 apples, and someone gives you an additional 0.9 of an apple. Now, do you have more apples or fewer apples compared to having 9.10 apples?

Well, 0.9 is less than 1, so adding it to 9 would give you 9.9, which is still less than 9.10.

But wait, maybe there's something else to consider. What if I were to add these numbers together?

Let's try adding them up:

9.9 + 9.10 = ?

Hmm, that might help clarify things.

Adding 9.9 and 9.10:

First, align the decimal points:

  9.90
+ 9.10
------

Now, add the hundredths first:

0 + 0 = 0

Then, tens:

9 + 1 = 10

Bring down the zero from the hundredths place:

10.0

Finally, ones:

9 + 9 = 18

So, 9.9 + 9.10 = 18.00

That seems straightforward. But earlier, I thought that since the whole parts are the same, the comparison should be based on the tenths place only. But now, adding them shows that 9.9 is indeed smaller than 9.10.

I'm getting confused here. Maybe I need to rethink my approach.

Alternatively, perhaps I can look at the numbers as fractions to get a better understanding.

Both 9.9 and 9.10 can be written as fractions:

9.9 = 99/10

9.10 = 91/10

Now, comparing 99/10 and 91/10.

Clearly, 99 is greater than 91, so 99/10 > 91/10.

Therefore, 9.9 > 9.10.

But wait, that doesn't seem right either. Earlier, when I added them, 9.9 + 9.10 = 18.00, which suggests that 9.9 is less than 9.10.

Maybe I'm missing something fundamental here.

Let me think differently.

Consider the numbers as lengths on a number line.

The number line has negative numbers to the left and positive numbers to the right.

If I plot -9.9 and -9.10 on the number line, which one is farther to the left?

Well, -9.9 is further to the left than -9.10, because -9.9 is closer to zero than -9.10.

Similarly, if I plot 9.9 and 9.10 on the number line, which one is farther to the right?

Again, 9.9 is farther to the right than 9.10, because 9.9 is closer to ten than 9.10.

This reinforces that 9.9 is greater than 9.10.

Wait, but earlier, when I was trying to add them, I got 18.00, which is greater than any other possible sum involving 9.9 and 9.10.

So, perhaps the confusion lies in how I interpreted the addition.

Let me re-examine the addition:

9.9 + 9.10 = 18.00

Yes, that's correct.

And since 18.00 is greater than any other possible sum involving 9.9 and 9.10, it confirms that 9.9 is greater than 9.10.

But wait, maybe I should consider that 9.9 is actually 9.90, not 9.9.

Because sometimes people write decimals without trailing zeros, like 9.9 instead of 9.90.

In that case, 9.90 is exactly the same as 9.9, so the comparison remains the same.

Therefore, 9.90 is greater than 9.10.

But according to standard notation, 9.9 is typically used, even though 9.90 is mathematically equivalent.

So, sticking with 9.9, which is 9.90, it's clear that 9.9 is larger than 9.10.

Hence, the answer is 9.9 is larger.

**Final Answer**

\[ \boxed{9.9} \]<|im_end|>
```
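
The sample above ends with its final answer wrapped in `\boxed{...}`. For automatic scoring, that answer can be pulled out of the generated text. The following is a minimal sketch, assuming the last `\boxed{...}` in the output holds the final answer. `extract_boxed` is a hypothetical helper written for illustration; it is not part of this repo or of `transformers`.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced, e.g. generation was cut off


# Tail of the sample output shown above.
sample_tail = r"**Final Answer**" + "\n\n" + r"\[ \boxed{9.9} \]<|im_end|>"
answer = extract_boxed(sample_tail)
print(answer)
```

Tracking brace depth instead of using a simple regular expression keeps nested LaTeX such as `\boxed{\frac{1}{2}}` intact. A truncated generation (for example, one that hits `max_new_tokens` before the box is closed) yields `None`.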




For comparison on GSM8k: the base instruct model scores 73.2% on the test set.

Disclaimer: GSM8k scores are currently computed on only the first 27% of the test set. They will be updated once the full test set has been evaluated.
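
The model card metadata reports the GSM8k score as pass@4. This card does not describe how that number was computed, so the following is only a sketch of the commonly used unbiased pass@k estimator, under the assumption that `n` answers are sampled per problem and `c` of them are correct. The function name `pass_at_k` and the example counts are illustrative, not taken from the evaluation behind this card.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n drawn samples were correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    # 1 - (number of all-incorrect k-subsets) / (number of k-subsets)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples correct).
results = [(4, 4), (4, 1), (4, 0), (4, 2)]
scores = [pass_at_k(n, c, k=4) for n, c in results]
benchmark_score = 100.0 * sum(scores) / len(scores)
print(benchmark_score)
```

With `n == k == 4`, the estimator reduces to a simple check: a problem counts as solved if at least one of its four samples is correct. The benchmark score is then the percentage of solved problems.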