add docs around pre-processing (#1529)
Files changed:
- README.md +1 -0
- docs/dataset_preprocessing.qmd +35 -0
README.md
CHANGED
@@ -44,6 +44,7 @@ Features:
 - Advanced Topics
   - [Multipack](./docs/multipack.qmd)
   - [RLHF & DPO](./docs/rlhf.qmd)
+  - [Dataset Pre-Processing](./docs/dataset_preprocessing.qmd)
 - [Common Errors](#common-errors-)
 - [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
 - [Debugging Axolotl](#debugging-axolotl)

docs/dataset_preprocessing.qmd
ADDED
@@ -0,0 +1,35 @@
---
title: Dataset Preprocessing
description: How datasets are processed
---

Dataset pre-processing is the step where Axolotl takes each dataset you've configured, along with
the [dataset format](../dataset-formats/) and prompt strategies, to:

- parse the dataset based on the *dataset format*
- transform the dataset into how you would interact with the model, based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
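
For reference, the dataset and its format are declared in your config YAML, roughly like this (a minimal sketch; the `path` value is a placeholder):

```yaml
datasets:
  # the dataset format / prompt strategy tells Axolotl how to parse and
  # transform each row before tokenization
  - path: your-org/your-dataset  # placeholder: an HF hub dataset or local file
    type: alpaca
```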

Dataset processing can happen in one of two ways:

1. Before kicking off training, by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
2. Automatically when training is started

What are the benefits of pre-processing? When training interactively or running sweeps
(i.e. restarting the trainer often), processing the datasets each time can be frustratingly
slow. Pre-processing caches the tokenized/formatted datasets according to a hash of the
dependent training parameters, so that it can intelligently pull from its cache when possible.

The cache path is controlled by `dataset_prepared_path:`. It is often left blank in example
YAMLs, as this is the more robust choice: it prevents unexpectedly reusing stale cached data.

If `dataset_prepared_path:` is left empty, the processed dataset will be cached at the default
path `./last_run_prepared/` during training, but anything already cached there will be ignored.
By explicitly setting `dataset_prepared_path: ./last_run_prepared`, the trainer will reuse
whatever pre-processed data is already in the cache.
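
For example, to opt in to reusing the cache between runs (a sketch):

```yaml
# reuse pre-processed data from earlier runs when the config hash matches
dataset_prepared_path: ./last_run_prepared
```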

What are the edge cases? Suppose you are writing a custom prompt strategy or using a user-defined
prompt template. Because the trainer cannot readily detect these changes, the calculated hash for
the pre-processed dataset will not change to reflect them. If you have `dataset_prepared_path: ...`
set and then change your prompt templating logic, the trainer may not pick up the changes and you
will be training on the old prompts.
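
In that situation, one workaround is to point the cache at a fresh location (or delete the old one) so the data is re-processed, e.g. (a sketch; `./prepared_v2` is an arbitrary path):

```yaml
# force re-processing by using a cache path with no stale entries
dataset_prepared_path: ./prepared_v2
```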