## Configuration
Run `accelerate config` and answer the questionnaire accordingly.
Below is an example YAML config for BF16 mixed-precision training using PyTorch FSDP with CPU offloading on 8 GPUs.
```
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
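By default, `accelerate config` writes this file to Accelerate's cache directory and `accelerate launch` picks it up automatically. If you instead keep a copy alongside your project (the filename `fsdp_config.yaml` below is only an example), you can point the launcher at it explicitly; launching is covered in more detail below:
```
accelerate launch --config_file fsdp_config.yaml {script_name.py} {--arg1} {--arg2} ...
```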
## Code changes and launching
```
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    # With FSDP, prepare the model first so that the optimizer is created
    # over the sharded parameters (see the caveats below).
    model = accelerator.prepare(model)

    optimizer, training_dataloader, scheduler = accelerator.prepare(
        optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
    ...

if __name__ == "__main__":
    main()
```
Launching a script using the default Accelerate config file looks like the following:
```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```
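If you are not sure which configuration will be picked up by default, `accelerate env` prints your environment details along with the contents of the default config file:
```
accelerate env
```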
Alternatively, you can pass the relevant FSDP config parameters to `accelerate launch` directly for multi-GPU training, as shown below:
```
accelerate launch \
    --use_fsdp \
    --num_processes=8 \
    --mixed_precision=bf16 \
    --fsdp_sharding_strategy=1 \
    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap=T5Block \
    --fsdp_offload_params=true \
    {script_name.py} {--arg1} {--arg2} ...
```
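The FSDP options can also be set programmatically by passing a plugin to the `Accelerator`. Below is a minimal sketch that only enables CPU offloading (mirroring `fsdp_offload_params: true` above); adapt the remaining options to your setup:
```
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload

# Offload parameters to CPU, equivalent to `fsdp_offload_params: true` in the yaml config.
fsdp_plugin = FullyShardedDataParallelPlugin(cpu_offload=CPUOffload(offload_params=True))
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```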
## Caveats
For PyTorch FSDP, you need to prepare the model before preparing the optimizer, since FSDP shards the parameters in-place and this breaks any previously initialized optimizer. This ordering is reflected in the code snippet above. For transformer models, use the `TRANSFORMER_BASED_WRAP` auto wrap policy as shown in the config above.
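As a concrete illustration of this ordering, here is a minimal sketch (the model name, optimizer, and learning rate are placeholders, and the script must be launched with an FSDP config for sharding to actually happen):
```
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM

accelerator = Accelerator()
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# 1) Prepare (shard) the model first; FSDP flattens and shards the parameters in-place.
model = accelerator.prepare(model)

# 2) Create and prepare the optimizer only now, so it references the sharded parameters.
optimizer = AdamW(model.parameters(), lr=1e-4)
optimizer = accelerator.prepare(optimizer)
```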
## Resources
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/usage_guides/fsdp" target="_blank">How to use FSDP</a>
- <a href="https://huggingface.co/blog/pytorch-fsdp" target="_blank">Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel</a>