accelerate_examples / code_samples / training_configuration / multi_node_multi_gpu

muellerzr HF staff

Refactor

06a60a3 almost 2 years ago

	##
	Below are example YAML files for multi-GPU training with 8 GPUs across two machines (nodes), where each machine has four GPUs:

	On machine 1 (host):
	<pre>
	compute_environment: LOCAL_MACHINE
	deepspeed_config: {}
	+distributed_type: MULTI_GPU
	downcast_bf16: 'no'
	dynamo_backend: 'NO'
	fsdp_config: {}
	gpu_ids: all
	+machine_rank: 0
	+main_process_ip: 192.168.20.1
	+main_process_port: 8080
	main_training_function: main
	megatron_lm_config: {}
	mixed_precision: 'no'
	+num_machines: 2
	+num_processes: 8
	+rdzv_backend: static
	+same_network: true
	use_cpu: false
	</pre>

	On machine 2:
	<pre>
	compute_environment: LOCAL_MACHINE
	deepspeed_config: {}
	+distributed_type: MULTI_GPU
	downcast_bf16: 'no'
	dynamo_backend: 'NO'
	fsdp_config: {}
	gpu_ids: all
	-machine_rank: 0
	+machine_rank: 1
	+main_process_ip: 192.168.20.1
	+main_process_port: 8080
	main_training_function: main
	megatron_lm_config: {}
	mixed_precision: 'no'
	+num_machines: 2
	+num_processes: 8
	+rdzv_backend: static
	+same_network: true
	use_cpu: false
	</pre>
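
The two files above are identical except for `machine_rank`. A minimal Python sketch of that idea follows; `node_config` is a hypothetical helper written for illustration, not part of Accelerate, and it only reproduces the highlighted fields from the YAML above:

```python
# Hypothetical helper (not an Accelerate API): returns the multi-node
# fields from the YAML above for one machine.
def node_config(machine_rank, num_machines=2, num_processes=8,
                main_process_ip="192.168.20.1", main_process_port=8080):
    return {
        "distributed_type": "MULTI_GPU",
        # 0 on the host, 1 on the second machine
        "machine_rank": machine_rank,
        # Every machine points at the host's address and port
        "main_process_ip": main_process_ip,
        "main_process_port": main_process_port,
        "num_machines": num_machines,
        # Total number of processes across all machines, not per machine
        "num_processes": num_processes,
        "rdzv_backend": "static",
        "same_network": True,
    }

configs = [node_config(rank) for rank in range(2)]

# Collect the keys whose values differ between the two machines
differing = {key for key in configs[0] if configs[0][key] != configs[1][key]}
print(sorted(differing))  # ['machine_rank']
```

Keeping every other field the same on all machines is what lets the processes find the host and agree on the total process count.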
	##
	None
	##
	To launch a script, on each machine run one of the following variations:

	If the YAML was generated through the `accelerate config` command:
	```
	accelerate launch {script_name.py} {--arg1} {--arg2} ...
	```

	If the YAML is saved to a `~/config.yaml` file:
	```
	accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
	```

	Or you can pass the right configuration parameters directly to `accelerate launch` and skip the `config.yaml` file entirely:

	Replace `{node_number}` with the appropriate machine rank (0 for the host, 1 or higher for every other machine).
	```
	accelerate launch --multi_gpu --num_machines=2 --num_processes=8 --main_process_ip="192.168.20.1" --main_process_port=8080 \
	  --machine_rank={node_number} {script_name.py} {--arg1} {--arg2} ...
	```
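
If the same command has to be issued on several machines, it can help to build it programmatically so that only the rank changes. The sketch below is one possible way to do that; `launch_command` and the script name `train.py` are hypothetical, and the flags are exactly the ones shown in the command above:

```python
import shlex

# Hypothetical helper: assembles the launch command shown above for one machine.
# "train.py" is a placeholder script name used only for illustration.
def launch_command(machine_rank, script="train.py", script_args=()):
    return [
        "accelerate", "launch", "--multi_gpu",
        "--num_machines=2",
        "--num_processes=8",
        "--main_process_ip=192.168.20.1",
        "--main_process_port=8080",
        f"--machine_rank={machine_rank}",  # the only value that differs per machine
        script,
        *script_args,
    ]

# Print the command for the host (rank 0) and the second machine (rank 1)
for rank in range(2):
    print(shlex.join(launch_command(rank)))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without any shell quoting concerns.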

	##
	When training on multiple machines (nodes), the config file needs to specify how the machines reach each other (`main_process_ip` and `main_process_port`), how many GPUs there are in total (`num_processes`), and whether the current machine is the host or a client (`machine_rank`).

	Remember that you can always use `accelerate launch`, even if the code in your script does not use the `Accelerator`.
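
As a rough illustration of a script that does not use the `Accelerator`, the sketch below reads the process layout from environment variables. It assumes the launcher exports the usual `torch.distributed` variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; that is an assumption here rather than something this page states, so the function falls back to single-process defaults when they are missing. `distributed_env` is a hypothetical helper name:

```python
import os

# Hypothetical helper: reads the process layout from the environment.
# Assumes RANK, LOCAL_RANK and WORLD_SIZE are set by the launcher;
# falls back to a single-process layout when they are absent.
def distributed_env(environ=None):
    environ = os.environ if environ is None else environ
    return {
        "rank": int(environ.get("RANK", 0)),              # global index of this process
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index on this machine
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }

info = distributed_env()
print(f"process {info['rank']} of {info['world_size']}, local index {info['local_rank']}")
```

Run on its own, the script reports a single process; under a distributed launcher that sets these variables, each process would report its own rank.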
	##
	To learn more, check out the related documentation:
	- <a href="https://huggingface.co/docs/accelerate/main/en/basic_tutorials/launch" target="_blank">Launching distributed code</a>
	- <a href="https://huggingface.co/docs/accelerate/main/en/package_reference/cli" target="_blank">The Command Line</a>