winglian committed on
Commit
e07dcb2
1 Parent(s): 6319da1

add docs around pre-processing (#1529)

Files changed (2)
  1. README.md +1 -0
  2. docs/dataset_preprocessing.qmd +35 -0
README.md CHANGED
@@ -44,6 +44,7 @@ Features:
 - Advanced Topics
 - [Multipack](./docs/multipack.qmd)
 - [RLHF & DPO](./docs/rlhf.qmd)
+ - [Dataset Pre-Processing](./docs/dataset_preprocessing.qmd)
 - [Common Errors](#common-errors-)
 - [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
 - [Debugging Axolotl](#debugging-axolotl)
docs/dataset_preprocessing.qmd ADDED
@@ -0,0 +1,35 @@
+ ---
+ title: Dataset Preprocessing
+ description: How datasets are processed
+ ---
+
+ Dataset pre-processing is the step where Axolotl takes each dataset you've configured, alongside
+ the [dataset format](../dataset-formats/) and prompt strategies, to:
+ - parse the dataset based on the *dataset format*
+ - transform the dataset to match how you would interact with the model, based on the *prompt strategy*
+ - tokenize the dataset based on the configured model & tokenizer
+ - shuffle and merge multiple datasets together if using more than one
+
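+ As a concrete sketch, a config with two datasets might look like the following (the dataset
+ paths and `type` values here are illustrative; your own datasets and formats will differ):
+
+ ```yaml
+ datasets:
+   # parsed & tokenized using the alpaca dataset format
+   - path: teknium/GPT4-LLM-Cleaned
+     type: alpaca
+   # shuffled and merged together with the dataset above
+   - path: mhenrichsen/alpaca_2k_test
+     type: alpaca
+ ```
+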
+ The processing of the datasets can happen in one of two ways:
+
+ 1. before kicking off training, by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+ 2. automatically, when training is started
+
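+ For example, using option 1, a typical flow looks like the sketch below (the YAML path is a
+ placeholder, and note that, as described further down, `dataset_prepared_path:` must be set in
+ your config for the training run to actually reuse the cache):
+
+ ```bash
+ # pre-process and cache the tokenized datasets ahead of time
+ python -m axolotl.cli.preprocess /path/to/your.yaml --debug
+
+ # then kick off training; the pre-processed datasets are pulled from the cache
+ accelerate launch -m axolotl.cli.train /path/to/your.yaml
+ ```
+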
+ What are the benefits of pre-processing? When training interactively or running sweeps
+ (e.g. you are restarting the trainer often), processing the datasets can often be frustratingly
+ slow. Pre-processing caches the tokenized/formatted datasets keyed by a hash of the training
+ parameters they depend on, so subsequent runs can intelligently pull from the cache when possible.
+
+ The location of the cache is controlled by `dataset_prepared_path:` and is often left blank in the
+ example YAMLs, as this is the more robust option: it prevents stale cached data from being reused
+ unexpectedly.
+
+ If `dataset_prepared_path:` is left empty, the processed dataset will be cached under the default
+ path `./last_run_prepared/` during training, but anything already cached there will be ignored. By
+ explicitly setting `dataset_prepared_path: ./last_run_prepared`, the trainer will reuse whatever
+ pre-processed data is already in the cache.
+
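+ In YAML terms, the two behaviors look like this sketch:
+
+ ```yaml
+ # left unset: the cache is written to ./last_run_prepared/ but not reused
+ # dataset_prepared_path:
+
+ # set explicitly: the cache at this path is both written to and reused
+ dataset_prepared_path: ./last_run_prepared
+ ```
+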
+ What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+ prompt template. Because the trainer cannot readily detect these changes, the calculated hash value
+ for the pre-processed dataset will not change. If you have `dataset_prepared_path: ...` set and then
+ change your prompt templating logic, the trainer may not pick up the changes and you will end up
+ training on the old prompts.
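+
+ In that situation, one workaround (assuming the default cache location) is to clear the stale cache
+ so the dataset is rebuilt with your updated prompt logic:
+
+ ```bash
+ # remove the stale pre-processed dataset, then re-run pre-processing
+ rm -rf ./last_run_prepared
+ python -m axolotl.cli.preprocess /path/to/your.yaml --debug
+ ```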