# Template-free prompt construction with the `input_output` format

<!-- TOC -->

- [Background](#background)
    - [Masking Inputs](#masking-inputs)
    - [You may not want prompt templates](#you-may-not-want-prompt-templates)
    - [The `input_output` format](#the-input_output-format)
- [Usage](#usage)
    - [1. Prepare Data](#1-prepare-data)
    - [2. Use `type: input_output`](#2-use-type-input_output)
    - [3. Check the prompts](#3-check-the-prompts)

<!-- /TOC -->

<a id="markdown-background" name="background"></a>

## Background

<a id="markdown-masking-inputs" name="masking-inputs"></a>

### Masking Inputs

One of the most popular features of
[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
setting the following configuration value:

```yaml
train_on_inputs: false
```

If you declare one of the [dataset formats](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset),
such as `alpaca` or `chatml`, axolotl knows what is an input
(i.e. the human) vs. an output (i.e. the assistant) and masks the input
labels so that your model can focus on predicting the outputs only.

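To make the effect of masking concrete, here is a minimal sketch (not axolotl's implementation) of a loss that skips masked positions. It assumes masked labels are marked with `-100`, which is the value axolotl's debug output reports for ignored tokens and the default `ignore_index` of PyTorch's cross-entropy loss. The helper `masked_nll` and the toy numbers are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel for masked (input) positions


def masked_nll(token_log_probs, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions whose label is not masked."""
    kept = [-lp for lp, label in zip(token_log_probs, labels) if label != ignore_index]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Toy example: the first two tokens are the "input" and are masked,
# the last two are the "output" and are the only ones that contribute.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 7]

loss = masked_nll(log_probs, labels)
print(loss)
```

However badly the model predicts the masked input tokens, the loss here depends only on the two output tokens.
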
<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

### You may not want prompt templates

However, there are many situations where you don't want to use one of
these formats or templates (I usually don't!). This is because they can:

- Add unnecessary boilerplate to your prompts.
- Create artifacts like special delimiters `<|im_start|>` that can
  quickly become footguns if you don't include them correctly at
  inference time.
- Enforce a *chat* interface when you do not want one. Sometimes you
  just want to fine-tune a model to a very specific task and do NOT
  want multi-turn conversations, roles, etc.
- Limit you to only certain roles that the template allows.

<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

### The `input_output` format

You can construct your prompts without a template by setting
`type: input_output` in your configuration file, like this:

**config.yml**

```yaml
train_on_inputs: false # Mask segments of your data
datasets:
  - path: output.jsonl
    type: input_output  # use template free prompt construction
```

Unlike `type: completion`, which is also template-free,
`type: input_output` allows you to mask segments of your text. More
details on how this works are described below.

<a id="markdown-usage" name="usage"></a>

## Usage

This is how you can use the `input_output` format:

<a id="markdown-1-prepare-data" name="1-prepare-data"></a>

### 1. Prepare Data

To use the `input_output` format, collect your data in the following
format into a jsonl file (below is the first row from the file
`output.jsonl`, pretty printed):

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```

Set `label: false` when you want to mask a segment of text so that the
model isn't trained on it. Some things to keep in mind:

> [!IMPORTANT]
> 1. **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
>    concatenates all the segments as-is.** The tokenizer doesn't add
>    anything additional. Notice how I added spaces, newlines, `<s>`
>    (BOS), and `</s>` (EOS) myself.
> 2. Make sure you check the materialized output to validate that the
>    prompt is getting assembled how you like.

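If your data starts out in Python, a file in this shape can be produced with the standard `json` module. The following is a minimal sketch that writes the example row shown above; the helper name `write_jsonl` is hypothetical, and the segment texts are copied from the example rather than from any real dataset.

```python
import json

# The example row from above. BOS/EOS, spaces, and newlines are written by hand,
# because the segments are concatenated as-is.
rows = [
    {
        "segments": [
            {"label": True, "text": "<s>Hello\n"},
            {"label": True, "text": "hi there!. "},
            {"label": False, "text": "goodbye "},
            {"label": True, "text": "farewell</s>"},
        ]
    }
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the jsonl layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


write_jsonl("output.jsonl", rows)

# Joining the segment texts shows the exact string the model will see.
full_text = "".join(seg["text"] for seg in rows[0]["segments"])
print(repr(full_text))
```

Printing the joined text with `repr` is a quick way to spot a missing space or newline between segments before you run any preprocessing.
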
<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

### 2. Use `type: input_output`

Let's materialize data with our `output.jsonl` file by setting
`type: input_output` in our axolotl config:

```yaml
# training_config.yaml
base_model: mistralai/Mistral-7B-v0.1
data_seed: 49
seed: 49

datasets:
  - path: output.jsonl
    type: input_output
val_set_size: 0.1

sequence_len: 896
sample_packing: false

micro_batch_size: 2
gradient_accumulation_steps: 3
eval_batch_size: 2
num_epochs: 1
learning_rate: 0.0002

train_on_inputs: false
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```

You can use the following command to materialize your data. The
`--debug` flag will print the tokens, along with the labels, so you can
verify that the correct items are being ignored:

```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug

...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
(13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
```

The format is `decoded_token`(`label`, `token_id`). For example,
`<s>(1, 1)` means that the token is `<s>`, the label is `1`, and the
token_id is `1`. When the label is `-100`, that token is ignored for
training.

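As a rough mental model of what happens to each row, the sketch below concatenates segments and masks the ones marked `"label": false`. It is an illustration under stated assumptions, not axolotl's actual code: `toy_tokenize` is a hypothetical character-level stand-in for the real tokenizer, `build_example` is a made-up helper, and the `train_on_inputs` switch is assumed to disable masking when set to `true`.

```python
IGNORE_INDEX = -100  # label value the debug output shows for ignored tokens


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_example(segments, train_on_inputs=False, tokenize=toy_tokenize):
    """Concatenate segments as-is and mask the ones whose label is false."""
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenize(seg["text"])
        input_ids.extend(ids)
        if seg["label"] or train_on_inputs:
            # Trained on: the label is the token id itself.
            labels.extend(ids)
        else:
            # Masked: every token of this segment gets the ignore value.
            labels.extend([IGNORE_INDEX] * len(ids))
    return {"input_ids": input_ids, "labels": labels}


segments = [
    {"label": True, "text": "ab"},
    {"label": False, "text": "cd"},
    {"label": True, "text": "e"},
]
example = build_example(segments)
print(example["input_ids"])  # [97, 98, 99, 100, 101]
print(example["labels"])     # [97, 98, -100, -100, 101]
```

The two lists always have the same length, which is why the debug output can pair every token with a label.
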
<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

### 3. Check the prompts

Here is another way to check the materialized output:

```python
import os

import yaml
from datasets import load_from_disk
from transformers import AutoTokenizer

# Plain-Python equivalent of the notebook shortcut `!ls last_run_prepared/`
directory = sorted(os.listdir('last_run_prepared'))
with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
```

```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello
hi there!. goodbye farewell</s>
```

We can check that the right tokens are ignored by comparing the labels
to each token:

```python
import pandas as pd
pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i} for i, l in
              zip(row['input_ids'], row['labels'])])
```

|    | token  | label | id    |
|----|--------|-------|-------|
| 0  | \<s\>  | 1     | 1     |
| 1  | Hello  | 22557 | 22557 |
| 2  | \\n    | 13    | 13    |
| 3  | hi     | 12014 | 12014 |
| 4  | there  | 736   | 736   |
| 5  | !      | 28808 | 28808 |
| 6  | .      | 28723 | 28723 |
| 7  |        | 28705 | 28705 |
| 8  | good   | -100  | 1179  |
| 9  | bye    | -100  | 17664 |
| 10 |        | -100  | 28705 |
| 11 | fare   | 19111 | 19111 |
| 12 | well   | 5458  | 5458  |
| 13 | \</s\> | 2     | 2     |

If we look at the input data, the above table seems correct! (The jsonl
version is repeated below for reference):

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```

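The visual comparison can also be automated. Here is a small sketch that finds the runs of masked positions in a `labels` list, assuming masked tokens carry the label `-100` as in the debug output from step 2. The helper `masked_spans` is hypothetical, and the label values are copied from that debug output.

```python
IGNORE_INDEX = -100


def masked_spans(labels, ignore_index=IGNORE_INDEX):
    """Return (start, end) index pairs, end exclusive, for each run of masked labels."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == ignore_index and start is None:
            start = i  # a masked run begins here
        elif label != ignore_index and start is not None:
            spans.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # run extends to the end
    return spans


# Labels copied from the debug output in step 2.
labels = [1, 22557, 13, 12014, 736, 28808, 28723, 28705,
          -100, -100, -100, 19111, 5458, 2]
print(masked_spans(labels))  # [(8, 11)]
```

With the real dataset row, decoding `row['input_ids'][start:end]` for each span should give back the text of the `"label": false` segment, up to however the tokenizer handles surrounding whitespace.
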