Commit f30d062 (unverified) · 1 parent: 269c543 · committed by ncoop57

Add StableLM 2 Example Scripts (#1327) [skip ci]

* Add StableLM examples and configurations

* Add FFT and LoRA configuration files and update the README with usage instructions

examples/stablelm-2/1.6b/fft.yml ADDED
@@ -0,0 +1,69 @@
```yaml
base_model: stabilityai/stablelm-2-1_6b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
flash_attn_cross_entropy: false
flash_attn_rms_norm: true
flash_attn_fuse_qkv: false
flash_attn_fuse_mlp: true

warmup_steps: 100
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: #deepspeed_configs/zero2.json # multi-gpu only
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
```

examples/stablelm-2/1.6b/lora.yml ADDED
@@ -0,0 +1,66 @@
```yaml
base_model: stabilityai/stablelm-2-1_6b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
flash_attn_cross_entropy: false
flash_attn_rms_norm: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
```

examples/stablelm-2/README.md ADDED
@@ -0,0 +1,36 @@
# StableLM 2

This folder contains example configs for training and dataset preprocessing with StableLM 2. It also includes a section to help you estimate the GPU requirements for your specific use case.

## Estimating GPU Requirements

| type | deepspeed | batch size | context length | GPU vRAM |
|---------------|-----------|------------|----------------|----------|
| full finetune | N/A | 1 | 4096 | ~21.5 GB |
| full finetune | zero2 | 1 | 4096 | ~20 GB |
| LoRA | N/A | 1 | 4096 | ~16.6 GB |

The above are estimates and may differ slightly depending on your setup, for example whether or not you pack your sequences (the table assumes packing to a length of 4096).

This blog post from Hamel Husain was a great resource for estimating these numbers: https://hamel.dev/notes/llm/03_estimating_vram.html
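
For intuition on the full-finetune number, here is a rough back-of-the-envelope sketch. It assumes pure bf16 weights and gradients plus `adamw_bnb_8bit` optimizer states (roughly 2 bytes per parameter each); activations for packed 4096-token sequences and framework overhead account for the rest of the measured total.

```shell
# Rough static-memory estimate for the 1.6B full finetune (a sketch, not a measurement).
# Activations and CUDA/framework overhead are not included here.
PARAMS=1600000000
echo "bf16 weights:      ~$((PARAMS * 2 / 1000000000)) GB"  # 2 bytes per parameter
echo "bf16 gradients:    ~$((PARAMS * 2 / 1000000000)) GB"  # 2 bytes per parameter
echo "8-bit Adam states: ~$((PARAMS * 2 / 1000000000)) GB"  # ~2 bytes per parameter (two 1-byte moments)
```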

## Training
We have example configs here for both full finetuning and LoRA using the popular Alpaca dataset:

```shell
# preprocess the dataset
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/stablelm-2/1.6b/lora.yml
```

Single GPU Training:
```shell
python -m axolotl.cli.train examples/stablelm-2/1.6b/fft.yml --deepspeed deepspeed_configs/zero2.json
# OR
python -m axolotl.cli.train examples/stablelm-2/1.6b/lora.yml
```

Multinode GPU Training with `accelerate`:
```shell
# make sure you've configured accelerate properly
accelerate launch -m axolotl.cli.train examples/stablelm-2/1.6b/fft.yml --deepspeed deepspeed_configs/zero2.json
```
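
Once LoRA training finishes, the adapter weights are written to `./lora-out` (the `output_dir` from `lora.yml`). If you want a single merged checkpoint, axolotl's LoRA merge entry point can be used; a minimal sketch with the paths from this example:

```shell
# merge the trained LoRA adapter back into the base model
# (paths assume the lora.yml example above; adjust if you changed output_dir)
python -m axolotl.cli.merge_lora examples/stablelm-2/1.6b/lora.yml --lora_model_dir="./lora-out"
```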