Spaces:

Dovakiins
/

qwerrwe

Build error

App Files Files Community

qwerrwe / docs /input_output.qmd

Nanobit

Feat: update doc (#1475) [skip ci]

c2b64e4 unverified 8 months ago

raw

history blame

7.26 kB

	---
	title: Template-free prompt construction
	description: "Template-free prompt construction with the `input_output` format"
	---

	<!-- TOC -->

	- [Background](#background)
	- [Masking Inputs](#masking-inputs)
	- [You may not want prompt templates](#you-may-not-want-prompt-templates)
	- [The `input_output` format](#the-input_output-format)
	- [Usage](#usage)
	- [1. Prepare Data](#1-prepare-data)
	- [2. Use `type: input_output`](#2-use-type-input_output)
	- [3. Check the prompts](#3-check-the-prompts)

	<!-- /TOC -->

	<a id="markdown-background" name="background"></a>

	## Background

	<a id="markdown-masking-inputs" name="masking-inputs"></a>

	### Masking Inputs

	One of the most popular features of
	[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
	setting the following configuration value:


	```yaml
	train_on_inputs: false
	```

	If you declare a [dataset formats](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
	such as `alpaca` or `chatml`, axolotl knows what is an input
	(i.e. human) vs. an output (i.e. the assistant) and masks the input
	labels so that your model can focus on predicting the outputs only.

	<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

	### You may not want prompt templates

	However, there are many situations where you don't want to use one of
	these formats or templates. This is because they can:

	- Add unnecessary boilerplate to your prompts.
	- Create artifacts like special delimiters `<\|im_start\|>` that can
	quickly become footguns if you don't include them correctly at
	inference time.
	- Enforce a chat interface when you do not want one. Sometimes you
	just want to fine-tune a model to a very specific task and do NOT
	want multi-turn conversations, roles, etc.
	- Limit you to only certain roles that the template allows.

	<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

	### The `input_output` format

	You can construct your prompts without a template by using the
	`input_output` format, by setting `type: input_output` in your
	configuration file like this:

	config.yml

	```yaml
	train_on_inputs: false # Mask segments of your data
	datasets:
	- path: output.jsonl
	type: input_output # use template free prompt construction
	```

	Unlike `type: completion`, which is also template-free,
	`type: input_output` allows you to mask segments of your text. More
	details on how this works are described below.

	<a id="markdown-usage" name="usage"></a>

	## Usage

	This is how you can use the `input_output` format:

	<a id="markdown-1-prepare-data" name="1-prepare-data"></a>

	### 1. Prepare Data

	To use the `input_output` format, collect your data in the following
	format into a jsonl file (below is the first row from the file
	`output`.jsonl` pretty printed):

	```bash
	$ head -n1 output.jsonl \| python -m json.tool
	```

	:::{.cell-output .cell-output-stdout}
	{
	"segments": [
	{
	"label": true,
	"text": "<s>Hello\n"
	},
	{
	"label": true,
	"text": "hi there!. "
	},
	{
	"label": false,
	"text": "goodbye "
	},
	{
	"label": true,
	"text": "farewell</s>"
	}
	]
	}
	:::

	Set `label:false` when you want to mask a segment of text so that the
	model isn't trained on it. Some things to keep in mind:

	> [!IMPORTANT]
	> 1. **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
	concatenates all the segments as-is.** The tokenizer doesn't add
	anything additional. Notice how I added spaces, newlines, `<s>`
	(BOS), and `</s>` (EOS) myself.
	> 2. Make sure you check the materialized output to validate that the
	prompt is getting assembled how you like.

	<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

	### 2. Use `type: input_output`

	Let's materialize data with our `output.jsonl` file by setting
	`type: input_output` in our axolotl config:

	```yaml
	# training_config.yaml
	base_model: mistralai/Mistral-7B-v0.1
	data_seed: 49
	seed: 49

	datasets:
	- path: output.jsonl
	type: input_output
	val_set_size: 0.1

	sequence_len: 896
	sample_packing: false

	micro_batch_size: 2
	gradient_accumulation_steps: 3
	eval_batch_size: 2
	num_epochs: 1
	learning_rate: 0.0002

	train_on_inputs: false
	special_tokens:
	bos_token: "<s>"
	eos_token: "</s>"
	unk_token: "<unk>"
	```

	You can use the following command to materialize your data. The
	`--debug` flag will print the tokens, along with the labels so you can
	verify that the correct items are being ignored:

	```bash
	$ python -m axolotl.cli.preprocess training_config.yaml --debug

	...
	[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
	(13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)

	```

	The format is `decoded_token`(`label`, `token_id`), for example,
	`<s>(1, 1)` means that the token is `<s>`, the label is `1` and the
	token_id is `1`. When the label is `-100` then that token is ignored for
	training.

	<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

	### 3. Check the prompts

	Here is another way to check the materialized output:

	```python
	from transformers import AutoTokenizer
	from datasets import load_from_disk
	import yaml

	directory = !ls last_run_prepared/
	with open('training_config.yaml', 'r') as f:
	cfg = yaml.safe_load(f)
	model_id = cfg['base_model']
	tok = AutoTokenizer.from_pretrained(model_id)
	ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
	```

	```python
	>>> row = ds[0]
	>>> print(tok.decode(row['input_ids']))
	<s> Hello
	hi there!. goodbye farewell</s>
	```

	We can check that the right tokens are ingored by comparing the labels
	to each token:

	```python
	import pandas as pd
	pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id':i} for i,l in
	zip(row['input_ids'], row['labels'])])
	```

	\| token \| label \| id \|
	\|-------\|-------\|-------\|
	\| 0 \| \<s\> \| 1 \|
	\| 1 \| Hello \| 22557 \|
	\| 2 \| \\n \| 13 \|
	\| 3 \| hi \| 12014 \|
	\| 4 \| there \| 736 \|
	\| 5 \| ! \| 28808 \|
	\| 6 \| . \| 28723 \|
	\| 7 \| \| 28705 \|
	\| 8 \| good \| -100 \|
	\| 9 \| bye \| -100 \|
	\| 10 \| \| -100 \|
	\| 11 \| fare \| 19111 \|
	\| 12 \| well \| 5458 \|
	\| 13 \| \</s\>\| 2 \|



	If we look at the input data, the above table seems correct! (The jsonl
	version is repeated below for reference):


	```bash
	$ head -n1 output.jsonl \| python -m json.tool
	```

	:::{.cell-output .cell-output-stdout}
	{
	"segments": [
	{
	"label": true,
	"text": "<s>Hello\n"
	},
	{
	"label": true,
	"text": "hi there!. "
	},
	{
	"label": false,
	"text": "goodbye "
	},
	{
	"label": true,
	"text": "farewell</s>"
	}
	]
	}
	:::