Accelerating Training
Gaudi offers several possibilities to make training faster. They are all compatible with each other and can be coupled with distributed training.
Execution Modes
The following execution modes are supported:
- Lazy mode, where operations are accumulated in a graph whose execution is triggered in a lazy manner. This allows the graph compiler to optimize the device execution for these operations.
- Eager mode, where one operation at a time is executed.
- Eager mode with torch.compile, where a model (or part of a model) is enclosed into a graph.
Not all models are yet supported with Eager mode and Eager mode with torch.compile (still in development). Lazy mode is the default mode.
In lazy mode, the graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, as well as graph-level optimizations.
To execute your training in lazy mode, you must provide the following training arguments:
args = GaudiTrainingArguments(
# same arguments as in Transformers,
use_habana=True,
use_lazy_mode=True,
gaudi_config_name=path_to_my_gaudi_config
)
In lazy mode, the last batch is filled with extra samples by default so that it has the same dimensions as previous batches.
This enables to avoid extra graph compilations during training.
You can also discard the last batch with dataloader_drop_last=True
.
In lazy mode, the first two or three training iterations may be slower due to graph compilations.
To not take them into account in the computation of the throughput at the end of the training, you can add the following training argument: throughput_warmup_steps=3
.
Mixed-Precision Training
Mixed-precision training enables to compute some operations using lighter data types to accelerate training. Optimum for Intel Gaudi enables mixed precision training in a similar fashion as 🤗 Transformers:
- argument
--bf16
enables usage of PyTorch autocast - argument
--half_precision_backend [hpu_amp, cpu_amp]
is used to specify a device on which mixed precision operations should be performed
Please refer to the advanced autocast usage on Gaudi for more informations regarding:
- default autocast operations
- default autocast operations override
HPU Graphs
The flexibility of PyTorch comes at a price - usually the same pythonic logic is processed every training step over and over. This may lead to a situation where it takes longer for CPU to schedule the work on Gaudi than it is effectively computed by it. To cope with such host-bound workloads, you may want to try enabling the HPU Graphs feature, which records the computational graph once, then only triggers it for execution much faster multiple times.
To do so, specify --use_hpu_graphs_for_training True
.
This option will wrap the model in habana_frameworks.torch.hpu.ModuleCacher
, which automatically records HPU Graphs on the model’s usage.
For multi-worker distributed training, you also need to specify --distribution_strategy fast_ddp
.
This option replaces the usage of torch.nn.parallel.DistributedDataParallel
with much simpler and usually faster optimum.habana.distributed.all_reduce_gradients
.
Use with caution: currently using HPU Graphs for training may not support all the possible cases. However, the potential performance gain could be dramatic!
Fast DDP
For distributed training on several devices, you can also specify --distribution_strategy fast_ddp
.
This option replaces the usage of torch.nn.parallel.DistributedDataParallel
with much simpler and usually faster optimum.habana.distributed.all_reduce_gradients
.
Pipelining Forward and Backward Passes
There are two stages when running models on Intel Gaudi HPU: python code interpretation on CPU and HPU recipe computation.
The HPU computation stage can be triggered manually or when a copy to the CPU is requested, and generally HPU computation is triggered after loss.backward()
to make the CPU code interpretation and HPU recipe computation overlap as shown in the following illustration:
CPU:...forward + backward ...optimizer ...forward + backward ...optimizer ...
HPU:........................forward + backward...optimizer......forward + backward...optimizer
However, when CPU code interpretation takes longer than HPU computation, it becomes the bottleneck and HPU computation can not be triggered until CPU code interpretation is done. So one potential optimization for such cases is to trigger the HPU forward computation right after the CPU forward interpretation and before the CPU backward interpretation. You can see an example below where the CPU backward interpretation overlaps with the HPU forward computation:
CPU:...forward ...backward ...optimizer ...forward ...backward ...optimizer ...
HPU:.............forward.......backward......optimizer......forward.....backward.......optimizer
To enable this optimization, you can set the following training argument --pipelining_fwd_bwd True
.
We recommend using it on Gaudi2 as the host will often be the bottleneck. You should be able to see a speedup on first-generation Gaudi too, but it will be less significant than on Gaudi2 because your run is more likely to be HPU-bound.
Furthermore, when training models that require large device memory, we suggest disabling this optimization because it will increase the HPU memory usage.
Use More Workers for Data Loading
If the workload of the data loader is heavy, you can increase the number of workers to make your run faster.
You can enable this with the training argument --dataloader_num_workers N
with N
being the number of workers to use.
We recommend using it with datasets containing images.
Besides, using --dataloader_num_workers 1
should help in most cases as it enables data loading in a thread different from the main one.
Non-Blocking Data Copy
This optimization is well-suited for models with a high cost of copying data from the host to the device (e.g. vision models like ViT or Swin).
You can enable it with the training argument --non_blocking_data_copy True
.
We recommend using it on Gaudi2 where the host can continue to execute other tasks (e.g. graph building) to get a better pipelining between the host and the device. On first-generation Gaudi, the device executing time is longer so one should not expect to get any speedup.
Custom Operators
Intel Gaudi provides a few custom operators that achieve better performance than their PyTorch counterparts on Gaudi. You can also define your own custom operator for Gaudi as described here.
Fused ADAM
Intel Gaudi offers a custom fused ADAM implementation.
It can be used by specifying "use_fused_adam": true
in the Gaudi configuration file.
The default value of epsilon is 1e-6
for the Intel Gaudi fused ADAM optimizer, while it is 1e-8
for torch.optim.AdamW
.
Fused Gradient Norm Clipping
Intel Gaudi provides a custom gradient norm clipping implementation.
It can be used by specifying "use_fused_clip_norm": true
in the Gaudi configuration file.
Gaudi Optimized Flash Attention
Flash attention algorithm with additional Intel® Gaudi® AI Accelerator optimizetions is supported for both Lazy and Eager mode. See Using Fused Scaled Dot Product Attention (FusedSDPA).
Tracking Memory Usage
Live memory statistics are displayed every logging_steps
(default is 500) steps:
memory_allocated (GB)
refers to the current memory consumption in GB,max_memory_allocated (GB)
refers to the maximum memory consumption reached during the run in GB,total_memory_available (GB)
refers to the total memory available on the device in GB.
These metrics can help you to adjust the batch size of your runs.
In distributed mode, memory stats are communicated only by the main process.
You can take a look at Intel Gaudi AI Accelerator’s official documentation for more information about the memory stats API.
< > Update on GitHub