##
Below are example YAML files for multi-GPU training with 8 GPUs across two machines (nodes), where each machine has four GPUs:

On machine 1 (host):
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
+distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
+machine_rank: 0
+main_process_ip: 192.168.20.1
+main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
+num_machines: 2
+num_processes: 8
+rdzv_backend: static
+same_network: true
use_cpu: false
</pre>

On machine 2:
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
+distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
-machine_rank: 0
+machine_rank: 1
+main_process_ip: 192.168.20.1
+main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
+num_machines: 2
+num_processes: 8
+rdzv_backend: static
+same_network: true
use_cpu: false
</pre>
##
None
##
To launch a script, on each machine run one of the following variations:

If the YAML was generated through the `accelerate config` command:
```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

If the YAML is saved to a `~/config.yaml` file:
```
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
```

Or you can pass the right configuration parameters directly to `accelerate launch` and skip the `config.yaml` file:

Replace `{node_number}` with the appropriate machine rank (0 for the host, 1 or higher for every other machine).
```
accelerate launch --multi_gpu --num_machines=2 --num_processes=8 --main_process_ip="192.168.20.1" --main_process_port=8080 \
  --machine_rank={node_number} {script_name.py} {--arg1} {--arg2} ...
```
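
If you start the nodes from a helper script, the per-machine command can be assembled programmatically. The following is a minimal sketch, not part of Accelerate: `build_launch_command` is a hypothetical helper, and `train.py` with its `--epochs` argument is a placeholder script name. Only the flags shown in the command above are used.

```python
import shlex


def build_launch_command(machine_rank, script, script_args=(),
                         num_machines=2, num_processes=8,
                         main_process_ip="192.168.20.1",
                         main_process_port=8080):
    """Return the `accelerate launch` argv for one machine (hypothetical helper)."""
    if not 0 <= machine_rank < num_machines:
        raise ValueError(f"machine_rank must be in [0, {num_machines})")
    return [
        "accelerate", "launch", "--multi_gpu",
        f"--num_machines={num_machines}",
        f"--num_processes={num_processes}",      # total across all machines
        f"--main_process_ip={main_process_ip}",  # address of the host (rank 0)
        f"--main_process_port={main_process_port}",
        f"--machine_rank={machine_rank}",        # the only value that differs per machine
        script, *script_args,
    ]


# Print the command to run on each of the two machines.
# `train.py` and `--epochs 1` are placeholders for your own script and arguments.
for rank in range(2):
    print(shlex.join(build_launch_command(rank, "train.py", ["--epochs", "1"])))
```

Every argument except `--machine_rank` is identical on both machines, which mirrors the two YAML files above.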

##
When using multiple machines (nodes) for training, the config file needs to know how the machines will communicate (the IP address and port), how many GPUs there are in *total*, and whether the current machine is the host or a client.
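
A quick sanity check can catch mismatched settings before a launch hangs waiting for a missing node. The sketch below is illustrative only: `check_node_configs` is a hypothetical helper that works on plain dictionaries holding the same keys as the YAML files, and it assumes the processes are split evenly across machines.

```python
def check_node_configs(configs):
    """Validate per-machine settings and return the process count per machine."""
    shared_keys = ("main_process_ip", "main_process_port",
                   "num_machines", "num_processes")
    first = configs[0]
    # Communication settings and totals must match on every machine.
    for cfg in configs[1:]:
        for key in shared_keys:
            if cfg[key] != first[key]:
                raise ValueError(f"{key} differs between machines")
    num_machines = first["num_machines"]
    # Exactly one machine per rank: 0 is the host, the rest are clients.
    ranks = sorted(cfg["machine_rank"] for cfg in configs)
    if ranks != list(range(num_machines)):
        raise ValueError(f"expected ranks 0..{num_machines - 1}, got {ranks}")
    # Assumption: the total is divided evenly among the machines.
    if first["num_processes"] % num_machines != 0:
        raise ValueError("num_processes is not divisible by num_machines")
    return first["num_processes"] // num_machines


shared = {"main_process_ip": "192.168.20.1", "main_process_port": 8080,
          "num_machines": 2, "num_processes": 8}
host = {**shared, "machine_rank": 0}
client = {**shared, "machine_rank": 1}
print(check_node_configs([host, client]))  # 4 processes per machine
```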

**Remember that you can always use the `accelerate launch` functionality, even if the code in your script does not use the `Accelerator`**
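
As a rough illustration, a script that does not use the `Accelerator` can still find out where it is running. The sketch below assumes the launcher exports the standard `torch.distributed` environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`); check this against your installed version. `read_dist_env` is a hypothetical helper, and the defaults describe a single-process run.

```python
import os


def read_dist_env(environ=None):
    """Read the process layout from the environment, defaulting to one process."""
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this machine
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


info = read_dist_env()
print(f"process {info['rank']} of {info['world_size']} "
      f"(local rank {info['local_rank']})")
```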
##
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/main/en/basic_tutorials/launch" target="_blank">Launching distributed code</a>
- <a href="https://huggingface.co/docs/accelerate/main/en/package_reference/cli" target="_blank">The Command Line</a>