seungduk committed on
Commit 77c84e0 · unverified · 1 Parent(s): f91db19

Update README with some explanations (#700)

* Update README with some explanations

* revert commit-hook change

* add more explanation about batch size and gradient accum

* not use latex format

* decorate

* git hook again

* Attach a link that explains about LoRA hyperparameters

* update table of content

* Explanation about lora_modules_to_save

Files changed (1)
  1. README.md +167 -83
README.md CHANGED
@@ -23,9 +23,10 @@ Features:
23
  - [Supported Features](#axolotl-supports)
24
  - [Quickstart](#quickstart-)
25
  - [Installation](#installation)
26
- - [Docker Installation](#environment)
27
- - [Conda/Pip venv Installation](#condapip-venv)
28
- - [LambdaLabs Installation](#lambdalabs)
 
29
  - [Dataset](#dataset)
30
  - [How to Add Custom Prompts](#how-to-add-custom-prompts)
31
  - [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
@@ -50,7 +51,7 @@ Features:
50
  <b>Axolotl provides a unified repository for fine-tuning <br />a variety of AI models with ease</b>
51
  </p>
52
  <p>
53
- Go ahead and axolotl questions!!
54
  </p>
55
  <img src="https://github.com/OpenAccess-AI-Collective/axolotl/actions/workflows/pre-commit.yml/badge.svg?branch=main" alt="pre-commit">
56
  <img alt="PyTest Status" src="https://github.com/OpenAccess-AI-Collective/axolotl/actions/workflows/tests.yml/badge.svg?branch=main">
@@ -102,7 +103,7 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
102
 
103
  ### Environment
104
 
105
- - Docker
106
  ```bash
107
  docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
108
  ```
@@ -114,12 +115,12 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
114
  docker compose up -d
115
  ```
116
 
117
- - Conda/Pip venv
118
  1. Install python >=**3.9**
119
 
120
  2. Install pytorch stable https://pytorch.org/get-started/locally/
121
 
122
- 3. Install axolotl along with python dependencies
123
  ```bash
124
  pip3 install packaging
125
  pip3 install -e '.[flash-attn,deepspeed]'
@@ -130,7 +131,7 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
130
  ```
131
  Get the token at huggingface.co/settings/tokens
132
 
133
- - LambdaLabs
134
  <details>
135
 
136
  <summary>Click to Expand</summary>
@@ -174,7 +175,8 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
174
  ```
175
  </details>
176
 
177
- - Windows: Please use WSL or Docker!
 
178
 
179
  ### Dataset
180
 
@@ -396,15 +398,15 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
396
  <summary>All yaml options</summary>
397
 
398
  ```yaml
399
- # this is the huggingface model that contains *.pt, *.safetensors, or *.bin files
400
- # this can also be a relative path to a model on disk
401
  base_model: ./llama-7b-hf
402
- # you can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
403
  base_model_ignore_patterns:
404
- # if the base_model repo on hf hub doesn't include configuration .json files,
405
- # you can set that here, or leave this empty to default to base_model
406
  base_model_config: ./llama-7b-hf
407
- # you can specify to choose a specific model revision from huggingface hub
408
  model_revision:
409
  # Optional tokenizer configuration override in case you want to use a different tokenizer
410
  # than the one defined in the base model
@@ -419,23 +421,24 @@ trust_remote_code:
419
  tokenizer_use_fast:
420
  # Whether to use the legacy tokenizer setting, defaults to True
421
  tokenizer_legacy:
422
- # resize the model embeddings when new tokens are added to multiples of 32
423
- # this is reported to improve training speed on some models
424
  resize_token_embeddings_to_32x:
425
 
426
- # used to identify which the model is based on
427
  is_falcon_derived_model:
428
  is_llama_derived_model:
 
429
  is_mistral_derived_model:
430
 
431
- # whether you are training a 4-bit GPTQ quantized model
432
  gptq: true
433
  gptq_groupsize: 128 # group size
434
  gptq_model_v1: false # v1 or v2
435
 
436
- # this will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
437
  load_in_8bit: true
438
- # use bitsandbytes 4 bit
439
  load_in_4bit:
440
 
441
  # Use CUDA bf16
@@ -449,9 +452,9 @@ tf32: true # require >=ampere
449
  bfloat16: true # require >=ampere
450
  float16: true
451
 
452
- # a list of one or more datasets to finetune the model with
453
  datasets:
454
- # hf dataset repo | "json" for local dataset, make sure to fill data_files
455
  - path: vicgalle/alpaca-gpt4
456
  # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
457
  type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
@@ -461,16 +464,16 @@ datasets:
461
  name: # Optional[str] name of dataset configuration to load
462
  conversation: # Optional[str] fastchat conversation type, only used with type: sharegpt
463
 
464
- # custom user prompt
465
  - path: repo
466
  type:
467
- # the below are defaults. only set what's needed.
468
  system_prompt: ""
469
  field_system: system
470
  field_instruction: instruction
471
  field_output: input
472
 
473
- # customizable to be single line or multi-line
474
  system_format: "{system}"
475
  # 'format' can include {input}
476
  format: |-
@@ -479,13 +482,13 @@ datasets:
479
  # 'no_input_format' cannot include {input}
480
  no_input_format: "{instruction} "
481
 
482
- # for completions datsets, uses the provided field if not `text`
483
  field:
484
 
485
- # axolotl attempts to save the dataset as an arrow after packing the data together so
486
  # subsequent training attempts load faster, relative path
487
  dataset_prepared_path: data/last_run_prepared
488
- # push prepared dataset to hub
489
  push_dataset_to_hub: # repo path
490
  # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
491
  # if not set.
@@ -495,8 +498,8 @@ hub_model_id: # repo path to push finetuned model
495
  # how to push checkpoints to hub
496
  # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
497
  hub_strategy:
498
- # whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
499
- # required to be true when used in combination with `push_dataset_to_hub`
500
  hf_use_auth_token: # boolean
501
  # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
502
  val_set_size: 0.04
@@ -505,30 +508,34 @@ dataset_shard_num:
505
  # Index of shard to use for whole dataset
506
  dataset_shard_idx:
507
 
508
- # the maximum length of an input to train with, this should typically be less than 2048
509
  # as most models have a token/context limit of 2048
510
  sequence_len: 2048
511
- # pad inputs so each step uses constant sized buffers
512
- # this will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
513
  pad_to_sequence_len:
514
- # max sequence length to concatenate training samples together up to
515
- # inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
516
  # FutureWarning: This will soon be DEPRECATED
517
  max_packed_sequence_len: 1024
518
- # use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
519
  sample_packing:
520
- # set to 'false' if getting errors during eval with sample_packing on.
521
  eval_sample_packing:
522
- # you can set these packing optimizations AFTER starting a training at least once.
523
  # The trainer will provide recommended values for these values.
524
  sample_packing_eff_est:
525
  total_num_tokens:
526
 
527
- # if you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
528
  adapter: lora
529
- # if you already have a lora model trained that you want to load, put that here
530
- # lora hyperparameters
531
  lora_model_dir:
 
 
 
 
532
  lora_r: 8
533
  lora_alpha: 16
534
  lora_dropout: 0.05
@@ -540,36 +547,48 @@ lora_target_modules:
540
  # - gate_proj
541
  # - down_proj
542
  # - up_proj
543
- lora_target_linear: # if true, will target all linear layers
 
 
 
 
 
544
  lora_modules_to_save:
545
  # - embed_tokens
546
  # - lm_head
 
 
 
 
547
  lora_out_dir:
548
  lora_fan_in_fan_out: false
549
 
550
  # ReLoRA configuration
551
- # must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
552
- relora_steps: # number of steps per ReLoRA restart
553
- relora_warmup_steps: # number of per-restart warmup steps
554
- relora_cpu_offload: # true to perform lora weight merges on cpu during restarts, for modest gpu memory savings
555
 
556
  # wandb configuration if you're using it
557
  wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
558
- wandb_project: # your wandb project name
559
- wandb_entity: # a wandb Team name if using a Team
560
  wandb_watch:
561
- wandb_run_id: # set the name of your wandb run
562
  wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
563
 
564
- # where to save the finished model to
565
  output_dir: ./completed-model
566
 
567
- # whether to use torch.compile and which backend to use
568
  torch_compile: # bool
569
  torch_compile_backend: # Optional[str]
570
 
571
- # training hyperparameters
 
 
572
  gradient_accumulation_steps: 1
 
573
  micro_batch_size: 2
574
  eval_batch_size:
575
  num_epochs: 3
@@ -577,44 +596,47 @@ warmup_steps: 100
577
  learning_rate: 0.00003
578
  lr_quadratic_warmup:
579
  logging_steps:
580
- save_strategy: # set to `no` to skip checkpoint saves
581
- save_steps: # leave empty to save at each epoch
582
- eval_steps: # leave empty to eval at each epoch
583
- save_total_limit: # checkpoints saved at a time
 
 
 
584
  max_steps:
585
 
586
- eval_table_size: # approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
587
- eval_table_max_new_tokens: # total number of tokens generated for predictions sent to wandb. Default is 128
588
 
589
- # save model as safetensors (require safetensors package)
590
  save_safetensors:
591
 
592
- # whether to mask out or include the human's prompt from the training labels
593
  train_on_inputs: false
594
- # group similarly sized data to minimize padding
595
- # may be slower to start, as it must download and sort the entire dataset
596
- # note that training loss may have an oscillating pattern with this enabled
597
  group_by_length: false
598
 
599
  # Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
600
  gradient_checkpointing: false
601
 
602
- # stop training after this many evaluation losses have increased in a row
603
  # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
604
  early_stopping_patience: 3
605
 
606
- # specify a scheduler and kwargs to use with the optimizer
607
  lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
608
  lr_scheduler_kwargs:
609
 
610
- # for one_cycle optim
611
- lr_div_factor: # learning rate div factor
612
 
613
- # for log_sweep optim
614
  log_sweep_min_lr:
615
  log_sweep_max_lr:
616
 
617
- # specify optimizer
618
  # Valid values are driven by the Transformers OptimizerNames class, see:
619
  # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
620
  #
@@ -640,7 +662,7 @@ log_sweep_max_lr:
640
  # - paged_lion_32bit
641
  # - paged_lion_8bit
642
  optimizer:
643
- # specify weight decay
644
  weight_decay:
645
  # adamw hyperparams
646
  adam_beta1:
@@ -649,49 +671,51 @@ adam_epsilon:
649
  # Gradient clipping max norm
650
  max_grad_norm:
651
 
652
- # whether to bettertransformers
653
  flash_optimum:
654
- # whether to use xformers attention patch https://github.com/facebookresearch/xformers:
655
  xformers_attention:
656
- # whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
657
  flash_attention:
658
  flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
659
  flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
660
- # whether to use scaled-dot-product attention
661
  # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
662
  sdp_attention:
663
  # Landmark attention (only llama)
664
  landmark_attention:
665
  # xpos RoPE see https://github.com/kaiokendev/cutoff-len-is-context-len/blob/main/util/xpos_rope_llama_monkey_patch.py
666
- # llama only
667
  xpos_rope:
668
  # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
669
  rope_scaling:
670
  type: # linear | dynamic
671
  factor: # float
672
 
673
- # resume from a specific checkpoint dir
674
  resume_from_checkpoint:
675
- # if resume_from_checkpoint isn't set and you simply want it to start where it left off
676
- # be careful with this being turned on between different models
677
  auto_resume_from_checkpoints: false
678
 
679
- # don't mess with this, it's here for accelerate and torchrun
680
  local_rank:
681
 
682
- # add or change special tokens
 
683
  special_tokens:
684
  # bos_token: "<s>"
685
  # eos_token: "</s>"
686
  # unk_token: "<unk>"
687
- # add extra tokens
 
688
  tokens:
689
 
690
  # FSDP
691
  fsdp:
692
  fsdp_config:
693
 
694
- # Deepspeed config path
695
  deepspeed:
696
 
697
  # Advanced DDP Arguments
@@ -717,6 +741,66 @@ strict:
717
 
718
  </details>
719
 
 
720
  ### Train
721
 
722
  Run
 
23
  - [Supported Features](#axolotl-supports)
24
  - [Quickstart](#quickstart-)
25
  - [Installation](#installation)
26
+ - [Docker](#docker)
27
+ - [Conda/Pip venv](#condapip-venv)
28
+ - [LambdaLabs](#lambdalabs)
29
+ - [Windows](#windows)
30
  - [Dataset](#dataset)
31
  - [How to Add Custom Prompts](#how-to-add-custom-prompts)
32
  - [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
 
51
  <b>Axolotl provides a unified repository for fine-tuning <br />a variety of AI models with ease</b>
52
  </p>
53
  <p>
54
+ Go ahead and Axolotl questions!!
55
  </p>
56
  <img src="https://github.com/OpenAccess-AI-Collective/axolotl/actions/workflows/pre-commit.yml/badge.svg?branch=main" alt="pre-commit">
57
  <img alt="PyTest Status" src="https://github.com/OpenAccess-AI-Collective/axolotl/actions/workflows/tests.yml/badge.svg?branch=main">
 
103
 
104
  ### Environment
105
 
106
+ #### Docker
107
  ```bash
108
  docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
109
  ```
 
115
  docker compose up -d
116
  ```
117
 
118
+ #### Conda/Pip venv
119
  1. Install python >=**3.9**
120
 
121
  2. Install pytorch stable https://pytorch.org/get-started/locally/
122
 
123
+ 3. Install Axolotl along with python dependencies
124
  ```bash
125
  pip3 install packaging
126
  pip3 install -e '.[flash-attn,deepspeed]'
 
131
  ```
132
  Get the token at huggingface.co/settings/tokens
133
 
134
+ #### LambdaLabs
135
  <details>
136
 
137
  <summary>Click to Expand</summary>
 
175
  ```
176
  </details>
177
 
178
+ #### Windows
179
+ Please use WSL or Docker!
180
 
181
  ### Dataset
182
 
 
398
  <summary>All yaml options</summary>
399
 
400
  ```yaml
401
+ # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
402
+ # This can also be a relative path to a model on disk
403
  base_model: ./llama-7b-hf
404
+ # You can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
405
  base_model_ignore_patterns:
406
+ # If the base_model repo on hf hub doesn't include configuration .json files,
407
+ # You can set that here, or leave this empty to default to base_model
408
  base_model_config: ./llama-7b-hf
409
+ # You can specify to choose a specific model revision from huggingface hub
410
  model_revision:
411
  # Optional tokenizer configuration override in case you want to use a different tokenizer
412
  # than the one defined in the base model
 
421
  tokenizer_use_fast:
422
  # Whether to use the legacy tokenizer setting, defaults to True
423
  tokenizer_legacy:
424
+ # Resize the model embeddings when new tokens are added to multiples of 32
425
+ # This is reported to improve training speed on some models
426
  resize_token_embeddings_to_32x:
427
 
428
+ # Used to identify which architecture the model is derived from
429
  is_falcon_derived_model:
430
  is_llama_derived_model:
431
+ # Please note that if you set this to true, `padding_side` will be set to "left" by default
432
  is_mistral_derived_model:
433
 
434
+ # Whether you are training a 4-bit GPTQ quantized model
435
  gptq: true
436
  gptq_groupsize: 128 # group size
437
  gptq_model_v1: false # v1 or v2
438
 
439
+ # This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
440
  load_in_8bit: true
441
+ # Use bitsandbytes 4 bit
442
  load_in_4bit:
443
 
444
  # Use CUDA bf16
 
452
  bfloat16: true # require >=ampere
453
  float16: true
454
 
455
+ # A list of one or more datasets to finetune the model with
456
  datasets:
457
+ # HuggingFace dataset repo | "json" for local dataset, make sure to fill data_files
458
  - path: vicgalle/alpaca-gpt4
459
  # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
460
  type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
 
464
  name: # Optional[str] name of dataset configuration to load
465
  conversation: # Optional[str] fastchat conversation type, only used with type: sharegpt
466
 
467
+ # Custom user prompt
468
  - path: repo
469
  type:
470
+ # The below are defaults. only set what's needed.
471
  system_prompt: ""
472
  field_system: system
473
  field_instruction: instruction
474
  field_output: input
475
 
476
+ # Customizable to be single line or multi-line
477
  system_format: "{system}"
478
  # 'format' can include {input}
479
  format: |-
 
482
  # 'no_input_format' cannot include {input}
483
  no_input_format: "{instruction} "
484
 
485
+ # For completion datasets, uses the provided field if not `text`
486
  field:
487
 
488
+ # Axolotl attempts to save the dataset as an arrow after packing the data together so
489
  # subsequent training attempts load faster, relative path
490
  dataset_prepared_path: data/last_run_prepared
491
+ # Push prepared dataset to hub
492
  push_dataset_to_hub: # repo path
493
  # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
494
  # if not set.
 
498
  # how to push checkpoints to hub
499
  # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
500
  hub_strategy:
501
+ # Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
502
+ # Required to be true when used in combination with `push_dataset_to_hub`
503
  hf_use_auth_token: # boolean
504
  # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
505
  val_set_size: 0.04
 
508
  # Index of shard to use for whole dataset
509
  dataset_shard_idx:
510
 
511
+ # The maximum length of an input to train with, this should typically be less than 2048
512
  # as most models have a token/context limit of 2048
513
  sequence_len: 2048
514
+ # Pad inputs so each step uses constant sized buffers
515
+ # This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
516
  pad_to_sequence_len:
517
+ # Max sequence length to concatenate training samples together up to
518
+ # Inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
519
  # FutureWarning: This will soon be DEPRECATED
520
  max_packed_sequence_len: 1024
521
+ # Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
522
  sample_packing:
523
+ # Set to 'false' if getting errors during eval with sample_packing on.
524
  eval_sample_packing:
525
+ # You can set these packing optimizations AFTER starting a training at least once.
526
  # The trainer will provide recommended values for these values.
527
  sample_packing_eff_est:
528
  total_num_tokens:
529
 
530
+ # If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
531
  adapter: lora
532
+ # If you already have a lora model trained that you want to load, put that here.
533
+ # This means after training, if you want to test the model, you should set this to the value of `lora_out_dir`.
534
  lora_model_dir:
535
+
536
+ # LoRA hyperparameters
537
+ # For more details about the following options, see:
538
+ # https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
539
  lora_r: 8
540
  lora_alpha: 16
541
  lora_dropout: 0.05
 
547
  # - gate_proj
548
  # - down_proj
549
  # - up_proj
550
+ lora_target_linear: # If true, will target all linear layers
551
+
552
+ # If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
553
+ # For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
554
+ # `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
555
+ # https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
556
  lora_modules_to_save:
557
  # - embed_tokens
558
  # - lm_head
559
+
560
+ # Once you complete training, the model will be saved to the following directory.
561
+ # If you merge the adapter to the base model, a subdirectory `merged` will be created under this directory.
562
+ # Make sure `lora_model_dir` points to this directory if you want to use the trained model.
563
  lora_out_dir:
564
  lora_fan_in_fan_out: false
565
 
566
  # ReLoRA configuration
567
+ # Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
568
+ relora_steps: # Number of steps per ReLoRA restart
569
+ relora_warmup_steps: # Number of per-restart warmup steps
570
+ relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
571
 
572
  # wandb configuration if you're using it
573
  wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
574
+ wandb_project: # Your wandb project name
575
+ wandb_entity: # A wandb Team name if using a Team
576
  wandb_watch:
577
+ wandb_run_id: # Set the name of your wandb run
578
  wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
579
 
580
+ # Where to save the full-finetuned model to
581
  output_dir: ./completed-model
582
 
583
+ # Whether to use torch.compile and which backend to use
584
  torch_compile: # bool
585
  torch_compile_backend: # Optional[str]
586
 
587
+ # Training hyperparameters
588
+
589
+ # If greater than 1, the weight update (optimizer step) is deferred and gradients are accumulated over the given number of steps.
590
  gradient_accumulation_steps: 1
591
+ # The number of samples to include in each batch. This is the number of samples sent to each GPU.
592
  micro_batch_size: 2
593
  eval_batch_size:
594
  num_epochs: 3
 
596
  learning_rate: 0.00003
597
  lr_quadratic_warmup:
598
  logging_steps:
599
+ save_strategy: # Set to `no` to skip checkpoint saves
600
+ save_steps: # Leave empty to save at each epoch
601
+ eval_steps: # Leave empty to eval at each epoch
602
+ save_total_limit: # Maximum number of checkpoints kept at a time
603
+ # Maximum number of iterations to train for. It takes precedence over num_epochs, which means that
604
+ # if both are set, num_epochs will not be guaranteed.
605
+ # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
606
  max_steps:
607
 
608
+ eval_table_size: # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
609
+ eval_table_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
610
 
611
+ # Save model as safetensors (require safetensors package)
612
  save_safetensors:
613
 
614
+ # Whether to mask out or include the human's prompt from the training labels
615
  train_on_inputs: false
616
+ # Group similarly sized data to minimize padding.
617
+ # May be slower to start, as it must download and sort the entire dataset.
618
+ # Note that training loss may have an oscillating pattern with this enabled.
619
  group_by_length: false
620
 
621
  # Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
622
  gradient_checkpointing: false
623
 
624
+ # Stop training after this many evaluation losses have increased in a row
625
  # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
626
  early_stopping_patience: 3
627
 
628
+ # Specify a scheduler and kwargs to use with the optimizer
629
  lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
630
  lr_scheduler_kwargs:
631
 
632
+ # For one_cycle optim
633
+ lr_div_factor: # Learning rate div factor
634
 
635
+ # For log_sweep optim
636
  log_sweep_min_lr:
637
  log_sweep_max_lr:
638
 
639
+ # Specify optimizer
640
  # Valid values are driven by the Transformers OptimizerNames class, see:
641
  # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
642
  #
 
662
  # - paged_lion_32bit
663
  # - paged_lion_8bit
664
  optimizer:
665
+ # Specify weight decay
666
  weight_decay:
667
  # adamw hyperparams
668
  adam_beta1:
 
671
  # Gradient clipping max norm
672
  max_grad_norm:
673
 
674
+ # Whether to use BetterTransformer
675
  flash_optimum:
676
+ # Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
677
  xformers_attention:
678
+ # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
679
  flash_attention:
680
  flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
681
  flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
682
+ # Whether to use scaled-dot-product attention
683
  # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
684
  sdp_attention:
685
  # Landmark attention (only llama)
686
  landmark_attention:
687
  # xpos RoPE see https://github.com/kaiokendev/cutoff-len-is-context-len/blob/main/util/xpos_rope_llama_monkey_patch.py
688
+ # LLaMA only
689
  xpos_rope:
690
  # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
691
  rope_scaling:
692
  type: # linear | dynamic
693
  factor: # float
694
 
695
+ # Resume from a specific checkpoint dir
696
  resume_from_checkpoint:
697
+ # If resume_from_checkpoint isn't set and you simply want it to start where it left off.
698
+ # Be careful with this being turned on between different models.
699
  auto_resume_from_checkpoints: false
700
 
701
+ # Don't mess with this, it's here for accelerate and torchrun
702
  local_rank:
703
 
704
+ # Add or change special tokens.
705
+ # If you add tokens here, you don't need to add them to the `tokens` list.
706
  special_tokens:
707
  # bos_token: "<s>"
708
  # eos_token: "</s>"
709
  # unk_token: "<unk>"
710
+
711
+ # Add extra tokens.
712
  tokens:
713
 
714
  # FSDP
715
  fsdp:
716
  fsdp_config:
717
 
718
+ # Deepspeed config path. e.g., deepspeed/zero3.json
719
  deepspeed:
720
 
721
  # Advanced DDP Arguments
 
741
 
742
  </details>
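The interaction between `max_steps` and `num_epochs` noted in the options above can be sketched in a few lines. This is an illustrative helper rather than Axolotl code: the function name `planned_training_steps` and its parameters are hypothetical, and it only models the documented rule that a set `max_steps` takes precedence over `num_epochs`.

```python
from typing import Optional


def planned_training_steps(
    steps_per_epoch: int,
    num_epochs: int,
    max_steps: Optional[int] = None,
) -> int:
    """Return how many optimizer steps a run is expected to take.

    Hypothetical helper: when max_steps is a positive number it wins,
    otherwise the run length comes from num_epochs.
    """
    if max_steps is not None and max_steps > 0:
        return max_steps
    return steps_per_epoch * num_epochs


# The case from the config comment: 1 epoch is 1000 steps.
steps_with_cap = planned_training_steps(1000, num_epochs=2, max_steps=100)
steps_without_cap = planned_training_steps(1000, num_epochs=2)
print(steps_with_cap, steps_without_cap)
```

With `max_steps: 100` the run stops after 100 steps even though two epochs would take 2000 steps, so `num_epochs` is effectively ignored whenever `max_steps` is set.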
743
 
744
+ <details>
745
+ <summary> Understanding of batch size and gradient accumulation steps </summary>
746
+ <br/>
747
+ Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn't significantly impact learning.
748
+
749
+ This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here's why:
750
+
751
+ 1. **Memory Consumption with Batch Size**: Increasing the batch size raises memory use mainly because of the storage needed for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.
752
+
753
+ 2. **Gradient Accumulation**: With gradient accumulation, you're effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you're only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch.
754
+
755
+ **Example 1:**
756
+ Micro batch size: 3
757
+ Gradient accumulation steps: 2
758
+ Number of GPUs: 3
759
+ Total batch size = 3 * 2 * 3 = 18
760
+
761
+ ```
762
+ | GPU 1 | GPU 2 | GPU 3 |
763
+ |----------------|----------------|----------------|
764
+ | S1, S2, S3 | S4, S5, S6 | S7, S8, S9 |
765
+ | e1, e2, e3 | e4, e5, e6 | e7, e8, e9 |
766
+ |----------------|----------------|----------------|
767
+ | → (accumulate) | → (accumulate) | → (accumulate) |
768
+ |----------------|----------------|----------------|
769
+ | S10, S11, S12 | S13, S14, S15 | S16, S17, S18 |
770
+ | e10, e11, e12 | e13, e14, e15 | e16, e17, e18 |
771
+ |----------------|----------------|----------------|
772
+ | → (apply) | → (apply) | → (apply) |
773
+
774
+ Accumulated gradient for the weight w1 after the second iteration (considering all GPUs):
775
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18
776
+
777
+ Weight update for w1:
778
+ w1_new = w1_old - learning rate × (Total gradient for w1 / 18)
779
+ ```
780
+
781
+ **Example 2:**
782
+ Micro batch size: 2
783
+ Gradient accumulation steps: 1
784
+ Number of GPUs: 3
785
+ Total batch size = 2 * 1 * 3 = 6
786
+
787
+ ```
788
+ | GPU 1 | GPU 2 | GPU 3 |
789
+ |-----------|-----------|-----------|
790
+ | S1, S2 | S3, S4 | S5, S6 |
791
+ | e1, e2 | e3, e4 | e5, e6 |
792
+ |-----------|-----------|-----------|
793
+ | → (apply) | → (apply) | → (apply) |
794
+
795
+ Accumulated gradient for the weight w1 (considering all GPUs):
796
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6
797
+
798
+ Weight update for w1:
799
+ w1_new = w1_old - learning rate × (Total gradient for w1 / 6)
800
+ ```
801
+
802
+ </details>
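The arithmetic in Example 1 can be checked with a small sketch. It assumes a toy one-parameter model `y = w * x` with squared-error loss, which is not part of the README, and the helper names (`sample_gradient`, `full_batch_gradient`, `accumulated_gradient`) are hypothetical. It only illustrates that summing per-sample gradients over micro-batches, GPUs, and accumulation steps, then dividing by the total batch size, gives the same gradient as a single batch of 18 samples.

```python
def sample_gradient(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def full_batch_gradient(w, samples):
    """Mean gradient over all samples processed as one large batch."""
    return sum(sample_gradient(w, x, y) for x, y in samples) / len(samples)


def accumulated_gradient(w, samples, micro_batch_size, num_gpus, accumulation_steps):
    """Mean gradient built up micro-batch by micro-batch.

    Only one micro-batch is handled at a time, mirroring the table in
    Example 1: each accumulation step gives every GPU its own micro-batch.
    """
    total_batch_size = micro_batch_size * num_gpus * accumulation_steps
    assert len(samples) == total_batch_size
    accumulated = 0.0
    start = 0
    for _step in range(accumulation_steps):
        for _gpu in range(num_gpus):
            micro_batch = samples[start:start + micro_batch_size]
            start += micro_batch_size
            accumulated += sum(sample_gradient(w, x, y) for x, y in micro_batch)
    return accumulated / total_batch_size


# 18 toy samples, matching S1..S18 in Example 1 (values are made up).
samples = [(float(i), 2.0 * i + 1.0) for i in range(1, 19)]
w_old = 0.5
learning_rate = 0.001

g_full = full_batch_gradient(w_old, samples)
g_accumulated = accumulated_gradient(
    w_old, samples, micro_batch_size=3, num_gpus=3, accumulation_steps=2
)
w_new = w_old - learning_rate * g_accumulated
print(g_full, g_accumulated, w_new)
```

Both gradients agree, while the accumulated version never holds more than 3 samples per GPU at once, which is where the memory saving comes from.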
803
+
804
  ### Train
805
 
806
  Run