## Below are example YAMLs for multi-GPU training across two machines (nodes) with 8 GPUs in total (4 per machine):

On machine 1 (host):
```
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.20.1
main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```

On machine 2 (identical except for `machine_rank`):
```
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.20.1
main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```

## To launch a script, on each machine run one of the following variations:

If the YAML was generated through the `accelerate config` command:

```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

If the YAML is saved to a `~/config.yaml` file:

```
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
```

Or you can use `accelerate launch` with the right configuration parameters and skip the `config.yaml` file entirely. Replace `{node_number}` with the appropriate machine number (0 for the host, 1+ otherwise):

```
accelerate launch --multi_gpu --num_machines=2 --num_processes=8 --main_process_ip="192.168.20.1" --main_process_port=8080 --machine_rank={node_number} {script_name.py} {--arg1} {--arg2} ...
```

## When utilizing multiple machines (nodes) for training, the config file needs to know how each machine can communicate (the IP address and port), how many *total* GPUs there are, and whether the current machine is the host or a client.

**Remember that you can always use the `accelerate launch` functionality, even if the code in your script does not use the `Accelerator`.**

## To learn more, check out the related documentation:

- Launching distributed code
- The Command Line
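For reference, below is a minimal sketch of the kind of training script that could stand in for `{script_name.py}` above. The file name, toy model, and random data are purely illustrative assumptions; the `Accelerator` calls (`prepare`, `backward`, `print`) are the standard Accelerate API, and the same file is launched unmodified on both machines:

```
# minimal_train.py -- hypothetical file name, standing in for {script_name.py}
import torch
from accelerate import Accelerator


def main():
    # Accelerator picks up the rank, world size, and communication settings
    # injected by `accelerate launch` on each machine.
    accelerator = Accelerator()

    # Toy model and data for illustration only; swap in your own model and DataLoader.
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 10), torch.randint(0, 2, (64,))
    )
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

    # prepare() wraps everything for whichever distributed setup was configured,
    # so the script itself stays identical on every node.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)  # used in place of loss.backward()
        optimizer.step()

    accelerator.print("Finished one epoch")  # prints on the main process only


if __name__ == "__main__":
    main()
```

Launched with any of the commands above, Accelerate handles the process-group setup, shards the dataloader across processes, and synchronizes gradients across all 8 GPUs.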