hamel committed on
Commit
86b7d22
1 Parent(s): 0b10377

Reorganize Docs (#1468)

README.md CHANGED
@@ -35,13 +35,12 @@ Features:
35
  - [Google Colab](#google-colab)
36
  - [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
37
  - [Dataset](#dataset)
38
- - [How to Add Custom Prompts](#how-to-add-custom-prompts)
39
- - [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
40
  - [Config](#config)
41
  - [Train](#train)
42
  - [Inference](#inference-playground)
43
  - [Merge LORA to Base](#merge-lora-to-base)
44
  - [Special Tokens](#special-tokens)
 
45
  - Advanced Topics
46
  - [Multipack](./docs/multipack.qmd)
47
  - [RLHF & DPO](./docs/rlhf.qmd)
@@ -299,186 +298,9 @@ HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKE
299
 
300
  ### Dataset
301
 
302
- Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
303
- Have dataset(s) in one of the following format (JSONL recommended):
304
 
305
- #### Pretraining
306
-
307
- - `completion`: raw corpus
308
- ```json
309
- {"text": "..."}
310
- ```
311
-
312
- Note: Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
313
-
314
- ```yaml
315
- pretraining_dataset: # hf path only
316
- ```
317
-
318
- #### Supervised finetuning
319
-
320
- ##### Instruction
321
-
322
- - `alpaca`: instruction; input(optional)
323
- ```json
324
- {"instruction": "...", "input": "...", "output": "..."}
325
- ```
326
-
327
- <details>
328
-
329
- <summary>See other formats</summary>
330
-
331
- - `jeopardy`: question and answer
332
- ```json
333
- {"question": "...", "category": "...", "answer": "..."}
334
- ```
335
- - `oasst`: instruction
336
- ```json
337
- {"INSTRUCTION": "...", "RESPONSE": "..."}
338
- ```
339
- - `gpteacher`: instruction; input(optional)
340
- ```json
341
- {"instruction": "...", "input": "...", "response": "..."}
342
- ```
343
- - `reflection`: instruction with reflect; input(optional)
344
- ```json
345
- {"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
346
- ```
347
- - `explainchoice`: question, choices, (solution OR explanation)
348
- ```json
349
- {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
350
- ```
351
- - `concisechoice`: question, choices, (solution OR explanation)
352
- ```json
353
- {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
354
- ```
355
- - `summarizetldr`: article and summary
356
- ```json
357
- {"article": "...", "summary": "..."}
358
- ```
359
- - `alpaca_chat`: basic instruct for alpaca chat
360
- ```json
361
- {"instruction": "...", "input": "...", "response": "..."}
362
- ```
363
- - `alpaca_chat.load_qa`: question and answer for alpaca chat
364
- ```json
365
- {"question": "...", "answer": "..."}
366
- ```
367
- - `alpaca_chat.load_concise`: question and answer for alpaca chat, for concise answers
368
- ```json
369
- {"instruction": "...", "input": "...", "response": "..."}
370
- ```
371
- - `alpaca_chat.load_camel_ai`: question and answer for alpaca chat, for load_camel_ai
372
- ```json
373
- {"message_1": "...", "message_2": "..."}
374
- ```
375
- - `alpaca_w_system.load_open_orca`: support for open orca datasets with included system prompts, instruct
376
- ```json
377
- {"system_prompt": "...", "question": "...", "response": "..."}
378
- ```
379
- - `context_qa`: in context question answering from an article
380
- ```json
381
- {"article": "...", "question": "...", "answer": "..."}
382
- ```
383
- - `context_qa.load_v2`: in context question answering (alternate)
384
- ```json
385
- {"context": "...", "question": "...", "answer": "..."}
386
- ```
387
- - `context_qa.load_404`: in context question answering from an article, with default response for no answer from context
388
- ```json
389
- {"article": "...", "unanswerable_question": "..."}
390
- ```
391
- - `creative_acr.load_answer`: instruction and revision
392
- ```json
393
- {"instruction": "...", "revision": "..."}
394
- ```
395
- - `creative_acr.load_critique`: critique
396
- ```json
397
- {"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
398
- ```
399
- - `creative_acr.load_revise`: critique and revise
400
- ```json
401
- {"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
402
- ```
403
- - `metharme`: instruction, adds additional eos tokens
404
- ```json
405
- {"prompt": "...", "generation": "..."}
406
- ```
407
-
408
- </details>
409
-
410
- ##### Template-Free
411
-
412
- - `input_output`: template-free prompt construction
413
- ```json
414
- {"segments": [{"label": true|false, "text": "..."}]}
415
- ```
416
-
417
- This is a special format that allows you to construct prompts without using templates. This is for advanced users who want more freedom with prompt construction. See [these docs](docs/input_output.qmd) for more details.
418
-
419
- ##### Conversation
420
-
421
- - `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
422
- ```json
423
- {"conversations": [{"from": "...", "value": "..."}]}
424
- ```
425
-
426
- <details>
427
-
428
- <summary>See other formats</summary>
429
-
430
- - `pygmalion`: pygmalion
431
- ```json
432
- {"conversations": [{"role": "...", "value": "..."}]}
433
- ```
434
- - `sharegpt.load_role`: conversations where `role` is used instead of `from`
435
- ```json
436
- {"conversations": [{"role": "...", "value": "..."}]}
437
- ```
438
- - `sharegpt.load_guanaco`: conversations where `from` is `prompter`/`assistant` instead of default sharegpt
439
- ```json
440
- {"conversations": [{"from": "...", "value": "..."}]}
441
- ```
442
- - `sharegpt_jokes`: creates a chat where bot is asked to tell a joke, then explain why the joke is funny
443
- ```json
444
- {"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
445
- ```
446
-
447
- </details>
448
-
449
- Note: `type: sharegpt` opens a special config `conversation:` that enables conversions to many Conversation types. See dataset section under [all yaml options](#all-yaml-options).
450
-
451
- #### How to add custom prompts
452
-
453
- For a dataset that is preprocessed for instruction purposes:
454
-
455
- ```json
456
- {"input": "...", "output": "..."}
457
- ```
458
-
459
- You can use this example in your YAML config:
460
-
461
- ```yaml
462
- datasets:
463
- - path: repo
464
- type:
465
- system_prompt: ""
466
- field_system: system
467
- field_instruction: input
468
- field_output: output
469
- format: "[INST] {instruction} [/INST]"
470
- no_input_format: "[INST] {instruction} [/INST]"
471
- ```
472
- See full config options under [all yaml options](#all-yaml-options).
473
-
474
- #### How to use your custom pretokenized dataset
475
-
476
- - Do not pass a `type:`
477
- - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
478
-
479
- ```yaml
480
- - path: ...
481
- ```
482
 
483
  ### Config
484
 
@@ -563,452 +385,9 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
563
  - v_proj
564
  ```
565
 
566
- <details id="all-yaml-options">
567
 
568
- <summary>All yaml options (click to expand)</summary>
569
-
570
- ```yaml
571
- # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
572
- # This can also be a relative path to a model on disk
573
- base_model: ./llama-7b-hf
574
- # You can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
575
- base_model_ignore_patterns:
576
- # If the base_model repo on hf hub doesn't include configuration .json files,
577
- # You can set that here, or leave this empty to default to base_model
578
- base_model_config: ./llama-7b-hf
579
- # You can specify to choose a specific model revision from huggingface hub
580
- revision_of_model:
581
- # Optional tokenizer configuration path in case you want to use a different tokenizer
582
- # than the one defined in the base model
583
- tokenizer_config:
584
- # If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
585
- model_type: AutoModelForCausalLM
586
- # Corresponding tokenizer for the model AutoTokenizer is a good choice
587
- tokenizer_type: AutoTokenizer
588
- # Trust remote code for untrusted source
589
- trust_remote_code:
590
- # use_fast option for tokenizer loading from_pretrained, default to True
591
- tokenizer_use_fast:
592
- # Whether to use the legacy tokenizer setting, defaults to True
593
- tokenizer_legacy:
594
- # Resize the model embeddings when new tokens are added to multiples of 32
595
- # This is reported to improve training speed on some models
596
- resize_token_embeddings_to_32x:
597
-
598
- # (Internal use only)
599
- # Used to identify which the model is based on
600
- is_falcon_derived_model:
601
- is_llama_derived_model:
602
- is_qwen_derived_model:
603
- # Please note that if you set this to true, `padding_side` will be set to "left" by default
604
- is_mistral_derived_model:
605
-
606
- # optional overrides to the base model configuration
607
- overrides_of_model_config:
608
- # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
609
- rope_scaling:
610
- type: # linear | dynamic
611
- factor: # float
612
-
613
- # optional overrides to the bnb 4bit quantization configuration
614
- # https://huggingface.co/docs/transformers/main/main_classes/quantization#transformers.BitsAndBytesConfig
615
- bnb_config_kwargs:
616
- # These are default values
617
- llm_int8_has_fp16_weight: false
618
- bnb_4bit_quant_type: nf4
619
- bnb_4bit_use_double_quant: true
620
-
621
-
622
- # Whether you are training a 4-bit GPTQ quantized model
623
- gptq: true
624
-
625
- # This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
626
- load_in_8bit: true
627
- # Use bitsandbytes 4 bit
628
- load_in_4bit:
629
-
630
- # Use CUDA bf16
631
- bf16: true # bool or 'full' for `bf16_full_eval`. require >=ampere
632
- # Use CUDA fp16
633
- fp16: true
634
- # Use CUDA tf32
635
- tf32: true # require >=ampere
636
-
637
- # No AMP (automatic mixed precision)
638
- bfloat16: true # require >=ampere
639
- float16: true
640
-
641
- # Limit the memory for all available GPUs to this amount (if an integer, expressed in gigabytes); default: unset
642
- gpu_memory_limit: 20GiB
643
- # Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
644
- lora_on_cpu: true
645
-
646
- # A list of one or more datasets to finetune the model with
647
- datasets:
648
- # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
649
- - path: vicgalle/alpaca-gpt4
650
- # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
651
- type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
652
- ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
653
- data_files: # Optional[str] path to source data files
654
- shards: # Optional[int] number of shards to split data into
655
- name: # Optional[str] name of dataset configuration to load
656
- train_on_split: train # Optional[str] name of dataset split to load from
657
-
658
- # Optional[str] fastchat conversation type, only used with type: sharegpt
659
- conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
660
- field_human: # Optional[str]. Human key to use for conversation.
661
- field_model: # Optional[str]. Assistant key to use for conversation.
662
- # Add additional keys from your dataset as input or output roles
663
- roles:
664
- input: # Optional[List[str]]. These will be masked based on train_on_input
665
- output: # Optional[List[str]].
666
-
667
- # Custom user instruction prompt
668
- - path: repo
669
- type:
670
- # The below are defaults. only set what's needed if you use a different column name.
671
- system_prompt: ""
672
- system_format: "{system}"
673
- field_system: system
674
- field_instruction: instruction
675
- field_input: input
676
- field_output: output
677
-
678
- # Customizable to be single line or multi-line
679
- # Use {instruction}/{input} as key to be replaced
680
- # 'format' can include {input}
681
- format: |-
682
- User: {instruction} {input}
683
- Assistant:
684
- # 'no_input_format' cannot include {input}
685
- no_input_format: "{instruction} "
686
-
687
- # For `completion` datasets only, uses the provided field instead of `text` column
688
- field:
689
-
690
- # If false, the datasets will not be shuffled and will keep their original order in `datasets`.
691
- # The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
692
- shuffle_merged_datasets: true
693
-
694
- # A list of one or more datasets to eval the model with.
695
- # You can use either test_datasets, or val_set_size, but not both.
696
- test_datasets:
697
- - path: /workspace/data/eval.jsonl
698
- ds_type: json
699
- # You need to specify a split. For "json" datasets the default split is called "train".
700
- split: train
701
- type: completion
702
- data_files:
703
- - /workspace/data/eval.jsonl
704
-
705
- # use RL training: 'dpo', 'ipo', 'kto_pair'
706
- rl:
707
-
708
- # Saves the desired chat template to the tokenizer_config.json for easier inferencing
709
- # Currently supports chatml and inst (mistral/mixtral)
710
- chat_template: chatml
711
- # Changes the default system message
712
- default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
713
- # Axolotl attempts to save the dataset as an arrow after packing the data together so
714
- # subsequent training attempts load faster, relative path
715
- dataset_prepared_path: data/last_run_prepared
716
- # Push prepared dataset to hub
717
- push_dataset_to_hub: # repo path
718
- # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
719
- # if not set.
720
- dataset_processes: # defaults to os.cpu_count() if not set
721
- # Keep dataset in memory while preprocessing
722
- # Only needed if cached dataset is taking too much storage
723
- dataset_keep_in_memory:
724
- # push checkpoints to hub
725
- hub_model_id: # private repo path to push finetuned model
726
- # how to push checkpoints to hub
727
- # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
728
- hub_strategy:
729
- # Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
730
- # Required to be true when used in combination with `push_dataset_to_hub`
731
- hf_use_auth_token: # boolean
732
- # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
733
- val_set_size: 0.04
734
- # Num shards for whole dataset
735
- dataset_shard_num:
736
- # Index of shard to use for whole dataset
737
- dataset_shard_idx:
738
-
739
- # The maximum length of an input to train with, this should typically be less than 2048
740
- # as most models have a token/context limit of 2048
741
- sequence_len: 2048
742
- # Pad inputs so each step uses constant sized buffers
743
- # This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
744
- pad_to_sequence_len:
745
- # Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
746
- sample_packing:
747
- # Set to 'false' if getting errors during eval with sample_packing on.
748
- eval_sample_packing:
749
- # You can set these packing optimizations AFTER starting a training at least once.
750
- # The trainer will provide recommended values for these values.
751
- sample_packing_eff_est:
752
- total_num_tokens:
753
-
754
- # Passed through to transformers when loading the model when launched without accelerate
755
- # Use `sequential` when training w/ model parallelism to limit memory
756
- device_map:
757
- # Defines the max memory usage per gpu on the system. Passed through to transformers when loading the model.
758
- max_memory:
759
-
760
- # If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
761
- adapter: lora
762
- # If you already have a lora model trained that you want to load, put that here.
763
- # This means after training, if you want to test the model, you should set this to the value of `output_dir`.
764
- # Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`.
765
- lora_model_dir:
766
-
767
- # LoRA hyperparameters
768
- # For more details about the following options, see:
769
- # https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
770
- lora_r: 8
771
- lora_alpha: 16
772
- lora_dropout: 0.05
773
- lora_target_modules:
774
- - q_proj
775
- - v_proj
776
- # - k_proj
777
- # - o_proj
778
- # - gate_proj
779
- # - down_proj
780
- # - up_proj
781
- lora_target_linear: # If true, will target all linear modules
782
- peft_layers_to_transform: # The layer indices to transform, otherwise, apply to all layers
783
-
784
- # If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
785
- # For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
786
- # `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
787
- # https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
788
- lora_modules_to_save:
789
- # - embed_tokens
790
- # - lm_head
791
-
792
- lora_fan_in_fan_out: false
793
-
794
- peft:
795
- # Configuration options for loftq initialization for LoRA
796
- # https://huggingface.co/docs/peft/developer_guides/quantization#loftq-initialization
797
- loftq_config:
798
- loftq_bits: # typically 4 bits
799
-
800
- # ReLoRA configuration
801
- # Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
802
- relora_steps: # Number of steps per ReLoRA restart
803
- relora_warmup_steps: # Number of per-restart warmup steps
804
- relora_anneal_steps: # Number of anneal steps for each relora cycle
805
- relora_prune_ratio: # threshold for optimizer magnitude when pruning
806
- relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
807
-
808
- # wandb configuration if you're using it
809
- # Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
810
- wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
811
- wandb_project: # Your wandb project name
812
- wandb_entity: # A wandb Team name if using a Team
813
- wandb_watch:
814
- wandb_name: # Set the name of your wandb run
815
- wandb_run_id: # Set the ID of your wandb run
816
- wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
817
-
818
- # mlflow configuration if you're using it
819
- mlflow_tracking_uri: # URI to mlflow
820
- mlflow_experiment_name: # Your experiment name
821
- hf_mlflow_log_artifacts: # set to true to copy each saved checkpoint on each save to mlflow artifact registry
822
-
823
- # Where to save the full-finetuned model to
824
- output_dir: ./completed-model
825
-
826
- # Whether to use torch.compile and which backend to use
827
- torch_compile: # bool
828
- torch_compile_backend: # Optional[str]
829
-
830
- # Training hyperparameters
831
-
832
- # If greater than 1, backpropagation will be skipped and the gradients will be accumulated for the given number of steps.
833
- gradient_accumulation_steps: 1
834
- # The number of samples to include in each batch. This is the number of samples sent to each GPU.
835
- micro_batch_size: 2
836
- eval_batch_size:
837
- num_epochs: 4
838
- warmup_steps: 100 # cannot use with warmup_ratio
839
- warmup_ratio: 0.05 # cannot use with warmup_steps
840
- learning_rate: 0.00003
841
- lr_quadratic_warmup:
842
- logging_steps:
843
- eval_steps: # Leave empty to eval at each epoch, integers for every N steps. decimal for fraction of total steps
844
- evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
845
- save_strategy: # Set to `no` to skip checkpoint saves
846
- save_steps: # Leave empty to save at each epoch
847
- saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
848
- save_total_limit: # Checkpoints saved at a time
849
- # Maximum number of iterations to train for. It precedes num_epochs which means that
850
- # if both are set, num_epochs will not be guaranteed.
851
- # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
852
- max_steps:
853
-
854
- eval_table_size: # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
855
- eval_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
856
- eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", chrf]
857
-
858
- loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
859
- loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
860
-
861
- # Save model as safetensors (require safetensors package)
862
- save_safetensors:
863
-
864
- # Whether to mask out or include the human's prompt from the training labels
865
- train_on_inputs: false
866
- # Group similarly sized data to minimize padding.
867
- # May be slower to start, as it must download and sort the entire dataset.
868
- # Note that training loss may have an oscillating pattern with this enabled.
869
- group_by_length: false
870
-
871
- # Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
872
- gradient_checkpointing: false
873
- # additional kwargs to pass to the trainer for gradient checkpointing
874
- # gradient_checkpointing_kwargs:
875
- # use_reentrant: true
876
-
877
- # Stop training after this many evaluation losses have increased in a row
878
- # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
879
- early_stopping_patience: 3
880
-
881
- # Specify a scheduler and kwargs to use with the optimizer
882
- lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
883
- lr_scheduler_kwargs:
884
- cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
885
- cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
886
-
887
- # For one_cycle optim
888
- lr_div_factor: # Learning rate div factor
889
-
890
- # Specify optimizer
891
- # Valid values are driven by the Transformers OptimizerNames class, see:
892
- # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
893
- #
894
- # Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
895
- # torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
896
- # in the examples/ for your model and fine-tuning use case.
897
- #
898
- # Valid values for 'optimizer' include:
899
- # - adamw_hf
900
- # - adamw_torch
901
- # - adamw_torch_fused
902
- # - adamw_torch_xla
903
- # - adamw_apex_fused
904
- # - adafactor
905
- # - adamw_anyprecision
906
- # - sgd
907
- # - adagrad
908
- # - adamw_bnb_8bit
909
- # - lion_8bit
910
- # - lion_32bit
911
- # - paged_adamw_32bit
912
- # - paged_adamw_8bit
913
- # - paged_lion_32bit
914
- # - paged_lion_8bit
915
- # - galore_adamw
916
- # - galore_adamw_8bit
917
- # - galore_adafactor
918
- # - galore_adamw_layerwise
919
- # - galore_adamw_8bit_layerwise
920
- # - galore_adafactor_layerwise
921
- optimizer:
922
- # Dictionary of arguments to pass to the optimizer
923
- optim_args:
924
- # For Galore Optimizers the following optim_args are available
925
- # rank: # type: int
926
- # update_proj_gap # type: int
927
- # scale # type: float
928
- # proj_type: # type: str, default = std
929
-
930
- # The target modules to optimize, i.e. the module names that you would like to train, right now this is used only for GaLore algorithm
931
- optim_target_modules:
932
- # - self_attn # for llama
933
- # - mlp
934
-
935
- # Specify weight decay
936
- weight_decay:
937
- # adamw hyperparams
938
- adam_beta1:
939
- adam_beta2:
940
- adam_epsilon:
941
- # Gradient clipping max norm
942
- max_grad_norm:
943
-
944
- # Augmentation techniques
945
- # NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
946
- # currently only supported on Llama and Mistral
947
- neftune_noise_alpha:
948
-
949
- # Whether to bettertransformers
950
- flash_optimum:
951
- # Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
952
- xformers_attention:
953
- # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
954
- flash_attention:
955
- flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
956
- flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
957
- flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
958
- flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
959
- # Whether to use scaled-dot-product attention
960
- # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
961
- sdp_attention:
962
- # Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
963
- s2_attention:
964
- # Resume from a specific checkpoint dir
965
- resume_from_checkpoint:
966
- # If resume_from_checkpoint isn't set and you simply want it to start where it left off.
967
- # Be careful with this being turned on between different models.
968
- auto_resume_from_checkpoints: false
969
-
970
- # Don't mess with this, it's here for accelerate and torchrun
971
- local_rank:
972
-
973
- # Add or change special tokens.
974
- # If you add tokens here, you don't need to add them to the `tokens` list.
975
- special_tokens:
976
- # bos_token: "<s>"
977
- # eos_token: "</s>"
978
- # unk_token: "<unk>"
979
-
980
- # Add extra tokens.
981
- tokens:
982
-
983
- # FSDP
984
- fsdp:
985
- fsdp_config:
986
-
987
- # Deepspeed config path. e.g., deepspeed_configs/zero3.json
988
- deepspeed:
989
-
990
- # Advanced DDP Arguments
991
- ddp_timeout:
992
- ddp_bucket_cap_mb:
993
- ddp_broadcast_buffers:
994
-
995
- # Path to torch distx for optim 'adamw_anyprecision'
996
- torchdistx_path:
997
-
998
- # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
999
- pretraining_dataset:
1000
-
1001
- # Debug mode
1002
- debug:
1003
-
1004
- # Seed
1005
- seed:
1006
-
1007
- # Allow overwrite yml config using from cli
1008
- strict:
1009
- ```
1010
-
1011
- </details>
1012
 
1013
  <details>
1014
  <summary> Understanding of batch size and gradient accumulation steps </summary>
 
35
  - [Google Colab](#google-colab)
36
  - [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
37
  - [Dataset](#dataset)
 
 
38
  - [Config](#config)
39
  - [Train](#train)
40
  - [Inference](#inference-playground)
41
  - [Merge LORA to Base](#merge-lora-to-base)
42
  - [Special Tokens](#special-tokens)
43
+ - [All Config Options](#all-config-options)
44
  - Advanced Topics
45
  - [Multipack](./docs/multipack.qmd)
46
  - [RLHF & DPO](./docs/rlhf.qmd)
 
298
 
299
  ### Dataset
300
 
301
+ Axolotl supports a variety of dataset formats. It is recommended to use a JSONL format. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
 
302
 
303
+ See [these docs](https://openaccess-ai-collective.github.io/axolotl/docs/dataset-formats/) for more information on how to use different dataset formats.
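For example, a minimal `datasets` entry pointing at a local JSONL file might look like the sketch below (the file name `data.jsonl` and the `alpaca` type are illustrative assumptions; see the linked docs for the schema each format expects):

```yaml
datasets:
  - path: data.jsonl   # hypothetical local file; can also be a HuggingFace dataset repo
    ds_type: json      # datatype when path points at a file
    type: alpaca       # prompt format of the rows, e.g. instruction/input/output
```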
 
304
 
305
  ### Config
306
 
 
385
  - v_proj
386
  ```
387
 
388
+ #### All Config Options
389
 
390
+ See [these docs](docs/config.qmd) for all config options.
 
391
 
392
  <details>
393
  <summary> Understanding of batch size and gradient accumulation steps </summary>
_quarto.yml CHANGED
@@ -30,20 +30,20 @@ website:
30
  # TODO Edit folder structure after we have more docs.
31
  - docs/debugging.qmd
32
  - docs/multipack.qmd
33
- - docs/fdsp_qlora.qmd
34
  - docs/input_output.qmd
35
  - docs/rlhf.qmd
36
  - docs/nccl.qmd
37
  - docs/mac.qmd
38
  - docs/multi-node.qmd
 
 
39
  - section: "Reference"
40
  contents:
41
  - docs/config.qmd
42
  - docs/faq.qmd
43
 
44
 
45
-
46
-
47
  format:
48
  html:
49
  theme: materia
 
30
  # TODO Edit folder structure after we have more docs.
31
  - docs/debugging.qmd
32
  - docs/multipack.qmd
33
+ - docs/fsdp_qlora.qmd
34
  - docs/input_output.qmd
35
  - docs/rlhf.qmd
36
  - docs/nccl.qmd
37
  - docs/mac.qmd
38
  - docs/multi-node.qmd
39
+ - section: "Dataset Formats"
40
+ contents: docs/dataset-formats/*
41
  - section: "Reference"
42
  contents:
43
  - docs/config.qmd
44
  - docs/faq.qmd
45
 
46
 
 
 
47
  format:
48
  html:
49
  theme: materia
docs/config.qmd CHANGED
@@ -3,15 +3,443 @@ title: Config options
3
  description: A complete list of all configuration options.
4
  ---
5
 
6
- ```{python}
7
- #|echo: false
8
- #|output: asis
9
- import re
10
- # Regex pattern to match the YAML block including its code fence
11
- pattern = r'<details[^>]*id="all-yaml-options"[^>]*>.*?<summary>All yaml options.*?```yaml(.*?)```.*?</details>'
12
-
13
- with open('../README.md', 'r') as f:
14
- doc = f.read()
15
- match = re.search(pattern, doc, re.DOTALL)
16
- print("```yaml", match.group(1).strip(), "```", sep="\n")
 
17
  ```
 
3
  description: A complete list of all configuration options.
4
  ---
5
 
6
+ ```yaml
7
+ # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
8
+ # This can also be a relative path to a model on disk
9
+ base_model: ./llama-7b-hf
10
+ # You can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
11
+ base_model_ignore_patterns:
12
+ # If the base_model repo on hf hub doesn't include configuration .json files,
13
+ # You can set that here, or leave this empty to default to base_model
14
+ base_model_config: ./llama-7b-hf
15
+ # You can specify to choose a specific model revision from huggingface hub
16
+ revision_of_model:
17
+ # Optional tokenizer configuration path in case you want to use a different tokenizer
18
+ # than the one defined in the base model
19
+ tokenizer_config:
20
+ # If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
21
+ model_type: AutoModelForCausalLM
22
+ # Corresponding tokenizer for the model AutoTokenizer is a good choice
23
+ tokenizer_type: AutoTokenizer
24
+ # Trust remote code for untrusted source
25
+ trust_remote_code:
26
+ # use_fast option for tokenizer loading from_pretrained, default to True
27
+ tokenizer_use_fast:
28
+ # Whether to use the legacy tokenizer setting, defaults to True
29
+ tokenizer_legacy:
30
+ # Resize the model embeddings when new tokens are added to multiples of 32
31
+ # This is reported to improve training speed on some models
32
+ resize_token_embeddings_to_32x:
33
+
34
+ # (Internal use only)
35
+ # Used to identify which model family the model is based on
36
+ is_falcon_derived_model:
37
+ is_llama_derived_model:
38
+ is_qwen_derived_model:
39
+ # Please note that if you set this to true, `padding_side` will be set to "left" by default
40
+ is_mistral_derived_model:
41
+
42
+ # optional overrides to the base model configuration
43
+ overrides_of_model_config:
44
+ # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
45
+ rope_scaling:
46
+ type: # linear | dynamic
47
+ factor: # float
48
+
49
+ # optional overrides to the bnb 4bit quantization configuration
50
+ # https://huggingface.co/docs/transformers/main/main_classes/quantization#transformers.BitsAndBytesConfig
51
+ bnb_config_kwargs:
52
+ # These are default values
53
+ llm_int8_has_fp16_weight: false
54
+ bnb_4bit_quant_type: nf4
55
+ bnb_4bit_use_double_quant: true
56
+
57
+
58
+ # Whether you are training a 4-bit GPTQ quantized model
59
+ gptq: true
60
+
61
+ # This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
62
+ load_in_8bit: true
63
+ # Use bitsandbytes 4 bit
64
+ load_in_4bit:
65
+
66
+ # Use CUDA bf16
67
+ bf16: true # bool or 'full' for `bf16_full_eval`. require >=ampere
68
+ # Use CUDA fp16
69
+ fp16: true
70
+ # Use CUDA tf32
71
+ tf32: true # require >=ampere
72
+
73
+ # No AMP (automatic mixed precision)
74
+ bfloat16: true # require >=ampere
75
+ float16: true
76
+
77
+ # Limit the memory for all available GPUs to this amount (if an integer, expressed in gigabytes); default: unset
78
+ gpu_memory_limit: 20GiB
79
+ # Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
80
+ lora_on_cpu: true
81
+
82
+ # A list of one or more datasets to finetune the model with
83
+ datasets:
84
+ # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
85
+ - path: vicgalle/alpaca-gpt4
86
+ # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
87
+ type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
88
+ ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
89
+ data_files: # Optional[str] path to source data files
90
+ shards: # Optional[int] number of shards to split data into
91
+ name: # Optional[str] name of dataset configuration to load
92
+ train_on_split: train # Optional[str] name of dataset split to load from
93
+
94
+ # Optional[str] fastchat conversation type, only used with type: sharegpt
95
+ conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
96
+ field_human: # Optional[str]. Human key to use for conversation.
97
+ field_model: # Optional[str]. Assistant key to use for conversation.
98
+ # Add additional keys from your dataset as input or output roles
99
+ roles:
100
+ input: # Optional[List[str]]. These will be masked based on train_on_input
101
+ output: # Optional[List[str]].
102
+
103
+ # Custom user instruction prompt
104
+ - path: repo
105
+ type:
106
+ # The below are defaults. Only set what's needed if you use a different column name.
107
+ system_prompt: ""
108
+ system_format: "{system}"
109
+ field_system: system
110
+ field_instruction: instruction
111
+ field_input: input
112
+ field_output: output
113
+
114
+ # Customizable to be single line or multi-line
115
+ # Use {instruction}/{input} as key to be replaced
116
+ # 'format' can include {input}
117
+ format: |-
118
+ User: {instruction} {input}
119
+ Assistant:
120
+ # 'no_input_format' cannot include {input}
121
+ no_input_format: "{instruction} "
122
+
123
+ # For `completion` datasets only, uses the provided field instead of `text` column
124
+ field:
125
+
126
+ # If false, the datasets will not be shuffled and will keep their original order in `datasets`.
127
+ # The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
128
+ shuffle_merged_datasets: true
129
+
130
+ # A list of one or more datasets to eval the model with.
131
+ # You can use either test_datasets, or val_set_size, but not both.
132
+ test_datasets:
133
+ - path: /workspace/data/eval.jsonl
134
+ ds_type: json
135
+ # You need to specify a split. For "json" datasets the default split is called "train".
136
+ split: train
137
+ type: completion
138
+ data_files:
139
+ - /workspace/data/eval.jsonl
140
+
141
+ # use RL training: 'dpo', 'ipo', 'kto_pair'
142
+ rl:
143
+
144
+ # Saves the desired chat template to the tokenizer_config.json for easier inferencing
145
+ # Currently supports chatml and inst (mistral/mixtral)
146
+ chat_template: chatml
147
+ # Changes the default system message
148
+ default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
149
+ # Axolotl attempts to save the dataset as an arrow after packing the data together so
150
+ # subsequent training attempts load faster, relative path
151
+ dataset_prepared_path: data/last_run_prepared
152
+ # Push prepared dataset to hub
153
+ push_dataset_to_hub: # repo path
154
+ # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
155
+ # if not set.
156
+ dataset_processes: # defaults to os.cpu_count() if not set
157
+ # Keep dataset in memory while preprocessing
158
+ # Only needed if cached dataset is taking too much storage
159
+ dataset_keep_in_memory:
160
+ # push checkpoints to hub
161
+ hub_model_id: # private repo path to push finetuned model
162
+ # how to push checkpoints to hub
163
+ # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
164
+ hub_strategy:
165
+ # Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
166
+ # Required to be true when used in combination with `push_dataset_to_hub`
167
+ hf_use_auth_token: # boolean
168
+ # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
169
+ val_set_size: 0.04
170
+ # Num shards for whole dataset
171
+ dataset_shard_num:
172
+ # Index of shard to use for whole dataset
173
+ dataset_shard_idx:
174
+
175
+ # The maximum length of an input to train with, this should typically be less than 2048
176
+ # as most models have a token/context limit of 2048
177
+ sequence_len: 2048
178
+ # Pad inputs so each step uses constant sized buffers
179
+ # This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
180
+ pad_to_sequence_len:
181
+ # Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
182
+ sample_packing:
183
+ # Set to 'false' if getting errors during eval with sample_packing on.
184
+ eval_sample_packing:
185
+ # You can set these packing optimizations AFTER starting a training at least once.
186
+ # The trainer will provide recommended values for these parameters.
187
+ sample_packing_eff_est:
188
+ total_num_tokens:
189
+
190
+ # Passed through to transformers when loading the model when launched without accelerate
191
+ # Use `sequential` when training w/ model parallelism to limit memory
192
+ device_map:
193
+ # Defines the max memory usage per gpu on the system. Passed through to transformers when loading the model.
194
+ max_memory:
195
+
196
+ # If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
197
+ adapter: lora
198
+ # If you already have a lora model trained that you want to load, put that here.
199
+ # This means after training, if you want to test the model, you should set this to the value of `output_dir`.
200
+ # Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`.
201
+ lora_model_dir:
202
+
203
+ # LoRA hyperparameters
204
+ # For more details about the following options, see:
205
+ # https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
206
+ lora_r: 8
207
+ lora_alpha: 16
208
+ lora_dropout: 0.05
209
+ lora_target_modules:
210
+ - q_proj
211
+ - v_proj
212
+ # - k_proj
213
+ # - o_proj
214
+ # - gate_proj
215
+ # - down_proj
216
+ # - up_proj
217
+ lora_target_linear: # If true, will target all linear modules
218
+ peft_layers_to_transform: # The layer indices to transform, otherwise, apply to all layers
219
+
220
+ # If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
221
+ # For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
222
+ # `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
223
+ # https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
224
+ lora_modules_to_save:
225
+ # - embed_tokens
226
+ # - lm_head
227
+
228
+ lora_fan_in_fan_out: false
229
+
230
+ peft:
231
+ # Configuration options for loftq initialization for LoRA
232
+ # https://huggingface.co/docs/peft/developer_guides/quantization#loftq-initialization
233
+ loftq_config:
234
+ loftq_bits: # typically 4 bits
235
+
236
+ # ReLoRA configuration
237
+ # Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
238
+ relora_steps: # Number of steps per ReLoRA restart
239
+ relora_warmup_steps: # Number of per-restart warmup steps
240
+ relora_anneal_steps: # Number of anneal steps for each relora cycle
241
+ relora_prune_ratio: # threshold for optimizer magnitude when pruning
242
+ relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
243
+
244
+ # wandb configuration if you're using it
245
+ # Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
246
+ wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
247
+ wandb_project: # Your wandb project name
248
+ wandb_entity: # A wandb Team name if using a Team
249
+ wandb_watch:
250
+ wandb_name: # Set the name of your wandb run
251
+ wandb_run_id: # Set the ID of your wandb run
252
+ wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
253
+
254
+ # mlflow configuration if you're using it
255
+ mlflow_tracking_uri: # URI to mlflow
256
+ mlflow_experiment_name: # Your experiment name
257
+ hf_mlflow_log_artifacts: # set to true to copy each saved checkpoint on each save to mlflow artifact registry
258
+
259
+ # Where to save the full-finetuned model to
260
+ output_dir: ./completed-model
261
+
262
+ # Whether to use torch.compile and which backend to use
263
+ torch_compile: # bool
264
+ torch_compile_backend: # Optional[str]
265
+
266
+ # Training hyperparameters
267
+
268
+ # If greater than 1, backpropagation will be skipped and the gradients will be accumulated for the given number of steps.
269
+ gradient_accumulation_steps: 1
270
+ # The number of samples to include in each batch. This is the number of samples sent to each GPU.
271
+ micro_batch_size: 2
272
+ eval_batch_size:
273
+ num_epochs: 4
274
+ warmup_steps: 100 # cannot use with warmup_ratio
275
+ warmup_ratio: 0.05 # cannot use with warmup_steps
276
+ learning_rate: 0.00003
277
+ lr_quadratic_warmup:
278
+ logging_steps:
279
+ eval_steps: # Leave empty to eval at each epoch, integers for every N steps. decimal for fraction of total steps
280
+ evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
281
+ save_strategy: # Set to `no` to skip checkpoint saves
282
+ save_steps: # Leave empty to save at each epoch
283
+ saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
284
+ save_total_limit: # Maximum number of checkpoints to keep at a time
285
+ # Maximum number of iterations to train for. It takes precedence over num_epochs, which means that
286
+ # if both are set, num_epochs will not be guaranteed.
287
+ # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
288
+ max_steps:
289
+
290
+ eval_table_size: # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
291
+ eval_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
292
+ eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", "chrf"]
293
+
294
+ loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
295
+ loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
296
+
297
+ # Save model as safetensors (require safetensors package)
298
+ save_safetensors:
299
+
300
+ # Whether to mask out or include the human's prompt from the training labels
301
+ train_on_inputs: false
302
+ # Group similarly sized data to minimize padding.
303
+ # May be slower to start, as it must download and sort the entire dataset.
304
+ # Note that training loss may have an oscillating pattern with this enabled.
305
+ group_by_length: false
306
+
307
+ # Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
308
+ gradient_checkpointing: false
309
+ # additional kwargs to pass to the trainer for gradient checkpointing
310
+ # gradient_checkpointing_kwargs:
311
+ # use_reentrant: true
312
+
313
+ # Stop training after this many evaluation losses have increased in a row
314
+ # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
315
+ early_stopping_patience: 3
316
+
317
+ # Specify a scheduler and kwargs to use with the optimizer
318
+ lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
319
+ lr_scheduler_kwargs:
320
+ cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
321
+ cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
322
+
323
+ # For one_cycle optim
324
+ lr_div_factor: # Learning rate div factor
325
+
326
+ # Specify optimizer
327
+ # Valid values are driven by the Transformers OptimizerNames class, see:
328
+ # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
329
+ #
330
+ # Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
331
+ # torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
332
+ # in the examples/ for your model and fine-tuning use case.
333
+ #
334
+ # Valid values for 'optimizer' include:
335
+ # - adamw_hf
336
+ # - adamw_torch
337
+ # - adamw_torch_fused
338
+ # - adamw_torch_xla
339
+ # - adamw_apex_fused
340
+ # - adafactor
341
+ # - adamw_anyprecision
342
+ # - sgd
343
+ # - adagrad
344
+ # - adamw_bnb_8bit
345
+ # - lion_8bit
346
+ # - lion_32bit
347
+ # - paged_adamw_32bit
348
+ # - paged_adamw_8bit
349
+ # - paged_lion_32bit
350
+ # - paged_lion_8bit
351
+ # - galore_adamw
352
+ # - galore_adamw_8bit
353
+ # - galore_adafactor
354
+ # - galore_adamw_layerwise
355
+ # - galore_adamw_8bit_layerwise
356
+ # - galore_adafactor_layerwise
357
+ optimizer:
358
+ # Dictionary of arguments to pass to the optimizer
359
+ optim_args:
360
+ # For Galore Optimizers the following optim_args are available
361
+ # rank: # type: int
362
+ # update_proj_gap # type: int
363
+ # scale # type: float
364
+ # proj_type: # type: str, default = std
365
+
366
+ # The target modules to optimize, i.e. the module names that you would like to train. Currently this is only used for the GaLore algorithm.
367
+ optim_target_modules:
368
+ # - self_attn # for llama
369
+ # - mlp
370
+
371
+ # Specify weight decay
372
+ weight_decay:
373
+ # adamw hyperparams
374
+ adam_beta1:
375
+ adam_beta2:
376
+ adam_epsilon:
377
+ # Gradient clipping max norm
378
+ max_grad_norm:
379
+
380
+ # Augmentation techniques
381
+ # NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
382
+ # currently only supported on Llama and Mistral
383
+ neftune_noise_alpha:
384
+
385
+ # Whether to use BetterTransformer
386
+ flash_optimum:
387
+ # Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
388
+ xformers_attention:
389
+ # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
390
+ flash_attention:
391
+ flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
392
+ flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
393
+ flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
394
+ flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
395
+ # Whether to use scaled-dot-product attention
396
+ # https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
397
+ sdp_attention:
398
+ # Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
399
+ s2_attention:
400
+ # Resume from a specific checkpoint dir
401
+ resume_from_checkpoint:
402
+ # Set to true if resume_from_checkpoint isn't set and you simply want training to resume where it left off.
403
+ # Be careful with this being turned on between different models.
404
+ auto_resume_from_checkpoints: false
405
+
406
+ # Don't mess with this, it's here for accelerate and torchrun
407
+ local_rank:
408
+
409
+ # Add or change special tokens.
410
+ # If you add tokens here, you don't need to add them to the `tokens` list.
411
+ special_tokens:
412
+ # bos_token: "<s>"
413
+ # eos_token: "</s>"
414
+ # unk_token: "<unk>"
415
+
416
+ # Add extra tokens.
417
+ tokens:
418
+
419
+ # FSDP
420
+ fsdp:
421
+ fsdp_config:
422
+
423
+ # Deepspeed config path. e.g., deepspeed_configs/zero3.json
424
+ deepspeed:
425
+
426
+ # Advanced DDP Arguments
427
+ ddp_timeout:
428
+ ddp_bucket_cap_mb:
429
+ ddp_broadcast_buffers:
430
+
431
+ # Path to torch distx for optim 'adamw_anyprecision'
432
+ torchdistx_path:
433
+
434
+ # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
435
+ pretraining_dataset:
436
+
437
+ # Debug mode
438
+ debug:
439
+
440
+ # Seed
441
+ seed:
442
+
443
+ # Allow overwriting the yml config using the cli
444
+ strict:
445
  ```
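As a rough illustration of how these options fit together, below is a minimal sketch of a LoRA fine-tuning config assembled only from values that already appear in the listing above (the values are illustrative, not recommendations):

```yaml
base_model: ./llama-7b-hf
load_in_8bit: true

adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: vicgalle/alpaca-gpt4
    type: alpaca
val_set_size: 0.04

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 1
num_epochs: 4
learning_rate: 0.00003
output_dir: ./completed-model
```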
docs/dataset-formats/conversation.qmd ADDED
@@ -0,0 +1,71 @@
1
+ ---
2
+ title: Conversation
3
+ description: Conversation format for supervised fine-tuning.
4
+ order: 1
5
+ ---
6
+
7
+ ## Formats
8
+
9
+ ### sharegpt
10
+
11
+ conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
12
+
13
+ ```{.json filename="data.jsonl"}
14
+ {"conversations": [{"from": "...", "value": "..."}]}
15
+ ```
16
+
17
+ Note: `type: sharegpt` opens a special config `conversation:` that enables conversions to many Conversation types. See [the docs](../docs/config.qmd) for all config options.
18
+
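For instance, a dataset entry that selects a specific FastChat conversation template might look like the following sketch (the file path and the `chatml` template name are illustrative assumptions):

```{.yaml filename="config.yaml"}
datasets:
  - path: sharegpt_data.jsonl  # hypothetical local file
    ds_type: json
    type: sharegpt
    conversation: chatml       # any FastChat Conversation 'name'
```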
19
+ ### pygmalion
20
+
21
+ ```{.json filename="data.jsonl"}
22
+ {"conversations": [{"role": "...", "value": "..."}]}
23
+ ```
24
+
25
+ ### sharegpt.load_role
26
+
27
+ conversations where `role` is used instead of `from`
28
+
29
+ ```{.json filename="data.jsonl"}
30
+ {"conversations": [{"role": "...", "value": "..."}]}
31
+ ```
32
+
33
+ ### sharegpt.load_guanaco
34
+
35
+ conversations where `from` is `prompter`/`assistant` instead of the default sharegpt roles
36
+
37
+ ```{.json filename="data.jsonl"}
38
+ {"conversations": [{"from": "...", "value": "..."}]}
39
+ ```
40
+
41
+ ### sharegpt_jokes
42
+
43
+ creates a chat where bot is asked to tell a joke, then explain why the joke is funny
44
+
45
+ ```{.json filename="data.jsonl"}
46
+ {"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
47
+ ```
48
+
49
+ ## How to add custom prompts for instruction-tuning
50
+
51
+ For a dataset that is preprocessed for instruction purposes:
52
+
53
+ ```{.json filename="data.jsonl"}
54
+ {"input": "...", "output": "..."}
55
+ ```
56
+
57
+ You can use this example in your YAML config:
58
+
59
+ ```{.yaml filename="config.yaml"}
60
+ datasets:
61
+ - path: repo
62
+ type:
63
+ system_prompt: ""
64
+ field_system: system
65
+ field_instruction: input
66
+ field_output: output
67
+ format: "[INST] {instruction} [/INST]"
68
+ no_input_format: "[INST] {instruction} [/INST]"
69
+ ```
70
+
71
+ See the full config options [here](../docs/config.qmd).
docs/dataset-formats/index.qmd ADDED
@@ -0,0 +1,14 @@
1
+ ---
2
+ title: Dataset Formats
3
+ description: Supported dataset formats.
4
+ listing:
5
+ fields: [title, description]
6
+ type: table
7
+ sort-ui: false
8
+ filter-ui: false
9
+ max-description-length: 250
10
+ ---
11
+
12
+ Axolotl supports a variety of dataset formats. It is recommended to use a JSONL format. The schema of the JSONL depends upon the task and the prompt template you wish to use. Instead of a JSONL, you can also use a HuggingFace dataset with columns for each JSONL field.
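+
+ For example, a dataset entry can point at either source (the paths below are placeholders, and `alpaca` stands in for whichever format you pick):
+
+ ```{.yaml filename="config.yaml"}
+ datasets:
+   # a local JSONL file
+   - path: ./data/train.jsonl
+     type: alpaca
+   # or a HuggingFace dataset repo with the same columns
+   - path: my-org/my-dataset
+     type: alpaca
+ ```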
13
+
14
+ Below are these various formats organized by task:
docs/dataset-formats/inst_tune.qmd ADDED
@@ -0,0 +1,165 @@
1
+ ---
2
+ title: Instruction Tuning
3
+ description: Instruction tuning formats for supervised fine-tuning.
4
+ order: 2
5
+ ---
6
+
7
+ ## alpaca
8
+
9
+ instruction; input(optional)
10
+
11
+ ```{.json filename="data.jsonl"}
12
+ {"instruction": "...", "input": "...", "output": "..."}
13
+ ```
14
+
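+ Each heading on this page names a `type:` you can set on a dataset entry. For instance, a minimal sketch for the alpaca format above (the dataset path is a placeholder):
+
+ ```{.yaml filename="config.yaml"}
+ datasets:
+   - path: my-org/my-alpaca-dataset   # placeholder HF repo or local JSONL with these columns
+     type: alpaca
+ ```
+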
15
+ ## jeopardy
16
+
17
+ question and answer
18
+
19
+ ```{.json filename="data.jsonl"}
20
+ {"question": "...", "category": "...", "answer": "..."}
21
+ ```
22
+
23
+ ## oasst
24
+
25
+ instruction
26
+
27
+ ```{.json filename="data.jsonl"}
28
+ {"INSTRUCTION": "...", "RESPONSE": "..."}
29
+ ```
30
+
31
+ ## gpteacher
32
+
33
+ instruction; input(optional)
34
+
35
+ ```{.json filename="data.jsonl"}
36
+ {"instruction": "...", "input": "...", "response": "..."}
37
+ ```
38
+
39
+ ## reflection
40
+
41
+ instruction with reflect; input(optional)
42
+
43
+ ```{.json filename="data.jsonl"}
44
+ {"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
45
+ ```
46
+
47
+ ## explainchoice
48
+
49
+ question, choices, (solution OR explanation)
50
+
51
+ ```{.json filename="data.jsonl"}
52
+ {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
53
+ ```
54
+
55
+ ## concisechoice
56
+
57
+ question, choices, (solution OR explanation)
58
+
59
+ ```{.json filename="data.jsonl"}
60
+ {"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
61
+ ```
62
+
63
+ ## summarizetldr
64
+
65
+ article and summary
66
+
67
+ ```{.json filename="data.jsonl"}
68
+ {"article": "...", "summary": "..."}
69
+ ```
70
+
71
+ ## alpaca_chat
72
+
73
+ basic instruct for alpaca chat
74
+
75
+ ```{.json filename="data.jsonl"}
76
+ {"instruction": "...", "input": "...", "response": "..."}
77
+ ```
78
+
79
+ ## alpaca_chat.load_qa
80
+
81
+ question and answer for alpaca chat
82
+
83
+ ```{.json filename="data.jsonl"}
84
+ {"question": "...", "answer": "..."}
85
+ ```
86
+
87
+ ## alpaca_chat.load_concise
88
+
89
+ question and answer for alpaca chat, for concise answers
90
+
91
+ ```{.json filename="data.jsonl"}
92
+ {"instruction": "...", "input": "...", "response": "..."}
93
+ ```
94
+
95
+ ## alpaca_chat.load_camel_ai
96
+
97
+ question and answer for alpaca chat, for camel-ai style datasets
98
+
99
+ ```{.json filename="data.jsonl"}
100
+ {"message_1": "...", "message_2": "..."}
101
+ ```
102
+
103
+ ## alpaca_w_system.load_open_orca
104
+
105
+ instruct format for Open Orca style datasets that include system prompts
106
+
107
+ ```{.json filename="data.jsonl"}
108
+ {"system_prompt": "...", "question": "...", "response": "..."}
109
+ ```
110
+
111
+ ## context_qa
112
+
113
+ in context question answering from an article
114
+
115
+ ```{.json filename="data.jsonl"}
116
+ {"article": "...", "question": "...", "answer": "..."}
117
+ ```
118
+
119
+ ## context_qa.load_v2
120
+
121
+ in context question answering (alternate)
122
+
123
+ ```{.json filename="data.jsonl"}
124
+ {"context": "...", "question": "...", "answer": "..."}
125
+ ```
126
+
127
+ ## context_qa.load_404
128
+
129
+ in context question answering from an article, with default response for no answer from context
130
+
131
+ ```{.json filename="data.jsonl"}
132
+ {"article": "...", "unanswerable_question": "..."}
133
+ ```
134
+
135
+ ## creative_acr.load_answer
136
+
137
+ instruction and revision
138
+
139
+ ```{.json filename="data.jsonl"}
140
+ {"instruction": "...", "revision": "..."}
141
+ ```
142
+
143
+ ## creative_acr.load_critique
144
+
145
+ critique
146
+
147
+ ```{.json filename="data.jsonl"}
148
+ {"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
149
+ ```
150
+
151
+ ## creative_acr.load_revise
152
+
153
+ critique and revise
154
+
155
+ ```{.json filename="data.jsonl"}
156
+ {"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
157
+ ```
158
+
159
+ ## metharme
160
+
161
+ instruction, adds additional eos tokens
162
+
163
+ ```{.json filename="data.jsonl"}
164
+ {"prompt": "...", "generation": "..."}
165
+ ```
docs/dataset-formats/pretraining.qmd ADDED
@@ -0,0 +1,26 @@
1
+ ---
2
+ title: Pre-training
3
+ description: Data format for a pre-training completion task.
4
+ order: 3
5
+ ---
6
+
7
+ For pretraining, there is no prompt template or roles. The only required field is `text`:
8
+
9
+ ```{.json filename="data.jsonl"}
10
+ {"text": "first row"}
11
+ {"text": "second row"}
12
+ ...
13
+ ```
14
+
15
+ :::{.callout-note}
16
+
17
+ ### Streaming is recommended for large datasets
18
+
19
+ Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
20
+
21
+ ```{.yaml filename="config.yaml"}
22
+ pretraining_dataset: # hf path only
23
+ ...
24
+ ```
25
+
26
+ :::
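+
+ As a concrete sketch (the dataset path and step count are placeholders), a streaming pre-training setup might look like:
+
+ ```{.yaml filename="config.yaml"}
+ pretraining_dataset: my-org/my-text-corpus   # HF dataset with a `text` column
+ max_steps: 10000                             # streamed datasets have no length, so bound training by steps
+ ```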
docs/dataset-formats/template_free.qmd ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ title: Template-Free
3
+ description: Construct prompts without a template.
4
+ order: 4
5
+ ---
6
+
7
+ See [these docs](../input_output.qmd).
docs/dataset-formats/tokenized.qmd ADDED
@@ -0,0 +1,12 @@
1
+ ---
2
+ title: Custom Pre-Tokenized Dataset
3
+ description: How to use a custom pre-tokenized dataset.
4
+ order: 5
5
+ ---
6
+
7
+ - Do not pass a `type:` in your axolotl config.
8
+ - Columns in the dataset must be exactly `input_ids`, `attention_mask`, and `labels`
9
+
10
+ ```{.yaml filename="config.yml"}
11
+ datasets:
+   - path: ...
12
+ ```
docs/fsdp_qlora.qmd CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: FDSP + QLoRA
3
  description: Use FSDP with QLoRA to fine-tune large LLMs on consumer GPUs.
4
  format:
5
  html:
 
1
  ---
2
+ title: "FDSP + QLoRA"
3
  description: Use FSDP with QLoRA to fine-tune large LLMs on consumer GPUs.
4
  format:
5
  html:
docs/input_output.qmd CHANGED
@@ -91,8 +91,9 @@ format into a jsonl file (below is the first row from the file
91
 
92
  ```bash
93
  $ head -n1 output.jsonl | python -m json.tool
 
94
 
95
- {.cell-output .cell-output-stdout}
96
  {
97
  "segments": [
98
  {
@@ -113,7 +114,7 @@ $ head -n1 output.jsonl | python -m json.tool
113
  }
114
  ]
115
  }
116
- ```
117
 
118
  Set `label:false` when you want to mask a segment of text so that the
119
  model isn't trained on it. Some things to keep in mind:
@@ -238,8 +239,9 @@ version is repeated below for reference):
238
 
239
  ```bash
240
  $ head -n1 output.jsonl | python -m json.tool
 
241
 
242
- {.cell-output .cell-output-stdout}
243
  {
244
  "segments": [
245
  {
@@ -260,4 +262,4 @@ $ head -n1 output.jsonl | python -m json.tool
260
  }
261
  ]
262
  }
263
- ```
 
91
 
92
  ```bash
93
  $ head -n1 output.jsonl | python -m json.tool
94
+ ```
95
 
96
+ :::{.cell-output .cell-output-stdout}
97
  {
98
  "segments": [
99
  {
 
114
  }
115
  ]
116
  }
117
+ :::
118
 
119
  Set `label:false` when you want to mask a segment of text so that the
120
  model isn't trained on it. Some things to keep in mind:
 
239
 
240
  ```bash
241
  $ head -n1 output.jsonl | python -m json.tool
242
+ ```
243
 
244
+ :::{.cell-output .cell-output-stdout}
245
  {
246
  "segments": [
247
  {
 
262
  }
263
  ]
264
  }
265
+ :::