Run `accelerate config` and answer the questionnaire accordingly. Below is an example YAML config for BF16 mixed-precision training using PyTorch FSDP with CPU offloading on 8 GPUs.
```
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
A minimal training script adapted for Accelerate looks like the following (creation of the model, optimizer, dataloader, and scheduler is omitted):

```
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    # Prepare the model first; with FSDP this must happen before the optimizer is prepared.
    model = accelerator.prepare(model)

    optimizer, training_dataloader, scheduler = accelerator.prepare(
        optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
    ...

if __name__ == "__main__":
    main()
```

Launching a script using the default accelerate config file looks like the following:

```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

Alternatively, you can pass the relevant config parameters directly to `accelerate launch` for multi-GPU training, as shown below:

```
accelerate launch \
    --use_fsdp \
    --num_processes=8 \
    --mixed_precision=bf16 \
    --fsdp_sharding_strategy=1 \
    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap=T5Block \
    --fsdp_offload_params=true \
    {script_name.py} {--arg1} {--arg2} ...
```

For PyTorch FSDP, you need to prepare the model first, before preparing the optimizer, since FSDP shards parameters in-place and this breaks any previously initialized optimizer. This ordering is shown in the code snippet above, and a self-contained sketch is included at the end of this section. For transformer models, please use the `TRANSFORMER_BASED_WRAP` auto wrap policy, as shown in the config above.

To learn more, check out the related documentation:

- How to use FSDP
- Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel
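To make the ordering requirement concrete, here is a minimal, self-contained sketch; the toy model, dataset, and hyperparameters are illustrative placeholders and not part of the original example.

```
import torch
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    # Placeholder model and data; substitute your own transformer model and dataset.
    model = torch.nn.Linear(128, 2)
    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 128), torch.randint(0, 2, (64,))
    )
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

    # 1. Prepare (FSDP-wrap) the model first ...
    model = accelerator.prepare(model)

    # 2. ... then build the optimizer from the wrapped model's parameters,
    #    so it references the sharded parameters rather than the originals.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

    # 3. Prepare the remaining objects.
    optimizer, dataloader, scheduler = accelerator.prepare(optimizer, dataloader, scheduler)

    loss_function = torch.nn.CrossEntropyLoss()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = loss_function(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
```

If the optimizer were instead built from the unwrapped model's parameters, FSDP's in-place sharding would leave it pointing at stale tensors, which is exactly the failure mode described above.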