add docs for `input_output` format (#1367) [skip ci]
2 files changed:
- README.md (+9, -0)
- docs/input_output.md (+260, -0)
README.md
CHANGED
````diff
@@ -385,6 +385,15 @@ pretraining_dataset: # hf path only
 
 </details>
 
+##### Template-Free
+
+- `input_output`: template-free prompt construction
+  ```json
+  {"segments": [{"label": true|false, "text": "..."}]}
+  ```
+
+This is a special format that allows you to construct prompts without using templates. This is for advanced users who want more freedom with prompt construction. See [these docs](docs/input_output.md) for more details.
+
 ##### Conversation
 
 - `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
````
docs/input_output.md
ADDED
@@ -0,0 +1,260 @@

# Template-free prompt construction with the `input_output` format

<!-- TOC -->

- [Background](#background)
    - [Masking Inputs](#masking-inputs)
    - [You may not want prompt templates](#you-may-not-want-prompt-templates)
    - [The `input_output` format](#the-inputoutput-format)
- [Usage](#usage)
    - [1. Prepare Data](#1-prepare-data)
    - [2. Use `type: input_output`](#2-use-type-inputoutput)
    - [3. Check the prompts](#3-check-the-prompts)

<!-- /TOC -->
<a id="markdown-background" name="background"></a>
|
17 |
+
|
18 |
+
## Background
|
19 |
+
|
20 |
+
<a id="markdown-masking-inputs" name="masking-inputs"></a>
|
21 |
+
|
22 |
+
### Masking Inputs
|
23 |
+
|
24 |
+
One of the most popular features of
|
25 |
+
[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
|
26 |
+
setting the following configuration value:
|
27 |
+
|
28 |
+
|
29 |
+
```yaml
|
30 |
+
train_on_inputs: false
|
31 |
+
```
|
32 |
+
|
33 |
+
If you declare a [dataset formats](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
|
34 |
+
such as `alpaca` or `chatml`, axolotl knows what is an input
|
35 |
+
(i.e. human) vs. an output (i.e. the assistant) and masks the input
|
36 |
+
labels so that your model can focus on predicting the outputs only.
|
37 |
+
|
38 |
+
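
Under the hood, masking simply means that the labels for input tokens are set to `-100`, which the cross-entropy loss ignores. Here is a minimal sketch of the idea (not axolotl's actual implementation; the token ids are borrowed from the worked example later in this doc, and the input/output split is hypothetical):

```python
# Tokens belonging to the "input" get label -100, so the loss function
# (ignore_index=-100 in PyTorch's cross-entropy) skips them entirely.
input_ids = [1, 22557, 13, 12014]      # <s> Hello \n hi
is_input  = [True, True, True, False]  # hypothetical input/output split
labels = [-100 if masked else tid for tid, masked in zip(input_ids, is_input)]
print(labels)  # [-100, -100, -100, 12014]
```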
<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>
|
39 |
+
|
40 |
+
### You may not want prompt templates
|
41 |
+
|
42 |
+
However, there are many situations where you don't want to use one of
|
43 |
+
these formats or templates (I usually don't!). This is because they can:
|
44 |
+
|
45 |
+
- Add unnecessary boilerplate to your prompts.
|
46 |
+
- Create artifacts like special delimiters `<|im_start|>` that can
|
47 |
+
quickly become footguns if you don't include them correctly at
|
48 |
+
inference time.
|
49 |
+
- Enforce a *chat* interface when you do not want one. Sometimes you
|
50 |
+
just want to fine-tune a model to a very specific task and do NOT
|
51 |
+
want multi-turn conversations, roles, etc.
|
52 |
+
- Limit you to only certain roles that the template allows.
|
53 |
+
|
54 |
+
<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>
|
55 |
+
|
56 |
+
### The `input_output` format
|
57 |
+
|
58 |
+
You can construct your prompts without a template by using the
|
59 |
+
`input_output` format, by setting `type: input_output` in your
|
60 |
+
configuration file like this:
|
61 |
+
|
62 |
+
**config.yml**
|
63 |
+
|
64 |
+
```yaml
|
65 |
+
train_on_inputs: false # Mask segments of your data
|
66 |
+
datasets:
|
67 |
+
- path: output.jsonl
|
68 |
+
type: input_output # use template free prompt construction
|
69 |
+
```
|
70 |
+
|
71 |
+
Unlike `type: completion`, which is also template-free,
|
72 |
+
`type: input_output` allows you to mask segments of your text. More
|
73 |
+
details on how this works are described below.
|
74 |
+
|
75 |
+
<a id="markdown-usage" name="usage"></a>
|
76 |
+
|
77 |
+
## Usage
|
78 |
+
|
79 |
+
This is how you can use the `input_output` format:
|
80 |
+
|
81 |
+
<a id="markdown-1-prepare-data" name="1-prepare-data"></a>
|
82 |
+
|
83 |
+
### 1. Prepare Data
|
84 |
+
|
85 |
+
To use the `input_output` format, collect your data in the following
|
86 |
+
format into a jsonl file (below is the first row from the file
|
87 |
+
`output`.jsonl` pretty printed):
|
88 |
+
|
89 |
+

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```

Set `label: false` when you want to mask a segment of text so that the
model isn't trained on it. Some things to keep in mind:

> [!IMPORTANT]
> 1. **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
>    concatenates all the segments as-is.** The tokenizer doesn't add
>    anything additional. Notice how I added spaces, newlines, `<s>`
>    (BOS), and `</s>` (EOS) myself.
> 2. Make sure you check the materialized output to validate that the
>    prompt is getting assembled how you like.
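
If it helps, here is one way you could generate such a file; a minimal sketch using only the standard library, with the same segments shown above:

```python
import json

# Each line of output.jsonl is one training example built from segments;
# "label": false marks a segment the model should NOT be trained on.
row = {
    "segments": [
        {"label": True, "text": "<s>Hello\n"},
        {"label": True, "text": "hi there!. "},
        {"label": False, "text": "goodbye "},
        {"label": True, "text": "farewell</s>"},
    ]
}

with open("output.jsonl", "w") as f:
    f.write(json.dumps(row) + "\n")
```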
<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>
|
127 |
+
|
128 |
+
### 2. Use `type: input_output`
|
129 |
+
|
130 |
+
Let's materialize data with our `output.jsonl` file by setting
|
131 |
+
`type: input_output` in our axolotl config:
|
132 |
+
|
133 |
+
```yaml
|
134 |
+
# training_config.yaml
|
135 |
+
base_model: mistralai/Mistral-7B-v0.1
|
136 |
+
data_seed: 49
|
137 |
+
seed: 49
|
138 |
+
|
139 |
+
datasets:
|
140 |
+
- path: output.jsonl
|
141 |
+
type: input_output
|
142 |
+
val_set_size: 0.1
|
143 |
+
|
144 |
+
sequence_len: 896
|
145 |
+
sample_packing: false
|
146 |
+
|
147 |
+
micro_batch_size: 2
|
148 |
+
gradient_accumulation_steps: 3
|
149 |
+
eval_batch_size: 2
|
150 |
+
num_epochs: 1
|
151 |
+
learning_rate: 0.0002
|
152 |
+
|
153 |
+
train_on_inputs: false
|
154 |
+
special_tokens:
|
155 |
+
bos_token: "<s>"
|
156 |
+
eos_token: "</s>"
|
157 |
+
unk_token: "<unk>"
|
158 |
+
```
|
159 |
+
|
160 |
+

You can use the following command to materialize your data. The
`--debug` flag will print the tokens along with their labels, so you can
verify that the correct items are being ignored:

```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug

...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
(13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723)  (28705, 28705) good(-100, 1179) bye(-100, 17664)  (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
```

The format is `decoded_token`(`label`, `token_id`). For example,
`<s>(1, 1)` means that the token is `<s>`, the label is `1`, and the
token_id is `1`. When the label is `-100`, that token is ignored during
training.
<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>
|
179 |
+
|
180 |
+
### 3. Check the prompts
|
181 |
+
|
182 |
+
Here is another way to check the materialized output:
|
183 |
+
|
184 |
+

```python
import os

from transformers import AutoTokenizer
from datasets import load_from_disk
import yaml

# replaces the notebook-only `!ls` so this runs as a plain script
directory = os.listdir('last_run_prepared/')
with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
```

```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello
hi there!. goodbye farewell</s>
```

We can check that the right tokens are ignored by comparing the labels
to each token:

```python
import pandas as pd

pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
              for i, l in zip(row['input_ids'], row['labels'])])
```

|    | token  | label | id    |
|----|--------|-------|-------|
| 0  | \<s\>  | 1     | 1     |
| 1  | Hello  | 22557 | 22557 |
| 2  | \n     | 13    | 13    |
| 3  | hi     | 12014 | 12014 |
| 4  | there  | 736   | 736   |
| 5  | !      | 28808 | 28808 |
| 6  | .      | 28723 | 28723 |
| 7  |        | 28705 | 28705 |
| 8  | good   | -100  | 1179  |
| 9  | bye    | -100  | 17664 |
| 10 |        | -100  | 28705 |
| 11 | fare   | 19111 | 19111 |
| 12 | well   | 5458  | 5458  |
| 13 | \</s\> | 2     | 2     |
If we look at the input data, the above table seems correct! (The jsonl
version is repeated below for reference):

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
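
As one last sanity check, you can decode only the unmasked tokens (reusing `row` and `tok` from step 3); everything except the masked `"goodbye "` segment should be printed:

```python
# keep only the tokens the model is trained on (label != -100)
trained_ids = [tid for tid, label in zip(row['input_ids'], row['labels'])
               if label != -100]
print(tok.decode(trained_ids))  # the "goodbye " segment is gone
```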