##
Run `accelerate config` and answer the questionnaire accordingly.
Below is an example yaml for BF16 mixed-precision training using Megatron-LM with DP x TP x PP = 2 x 2 x 2 degrees on 8 GPUs (DP: Data Parallelism, TP: Tensor Parallelism, PP: Pipeline Parallelism). It also uses Sequence Parallelism and selective activation checkpointing, along with the sharded (distributed) optimizer.
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_num_micro_batches: 2
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: true
  megatron_lm_tp_degree: 2
  megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
</pre>
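If you prefer configuring things in code rather than through the questionnaire, roughly the same setup can be expressed with a `MegatronLMPlugin` passed to the `Accelerator`. The sketch below is illustrative only: the keyword names mirror the yaml keys above and should be verified against `accelerate.utils.MegatronLMPlugin` for the Accelerate version you have installed.
```
# Illustrative sketch only, not the canonical API: the keyword names mirror the
# yaml keys above; verify them against accelerate.utils.MegatronLMPlugin for
# the Accelerate version you have installed.
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,                     # tensor parallel degree
    pp_degree=2,                     # pipeline parallel degree
    num_micro_batches=2,             # micro-batches per global batch for pipelining
    gradient_clipping=1.0,
    sequence_parallelism=True,
    use_distributed_optimizer=True,  # shard optimizer state across data parallel ranks
)

# Depending on the Accelerate version, you may still need to launch the script with
# `accelerate launch --use_megatron_lm ...` for the plugin to take effect.
accelerator = Accelerator(mixed_precision="bf16", megatron_lm_plugin=megatron_lm_plugin)
```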
## | |
<pre>
from accelerate import Accelerator

+ def main():
      accelerator = Accelerator()

  ...

- lr_scheduler = get_scheduler(
-     name=args.lr_scheduler_type,
+ lr_scheduler = accelerate.utils.MegatronLMDummyScheduler(
      optimizer=optimizer,
      num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps,
      num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
  )

  model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
      model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
  )

  total_batch_size = (
-     args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+     accelerator.state.megatron_lm_plugin.global_batch_size
  )

  for batch in training_dataloader:
      optimizer.zero_grad()
      inputs, targets = batch
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
      accelerator.backward(loss)
      optimizer.step()
      lr_scheduler.step()

  ...

  # in eval loop
  for step, batch in enumerate(eval_dataloader):
      with torch.no_grad():
          outputs = model(**batch)
      loss = outputs.loss
-     losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+     losses.append(loss)  # For Megatron-LM, the losses are already averaged across the data parallel group
- losses = torch.cat(losses)
+ losses = torch.tensor(losses)

  eval_loss = torch.mean(losses)
  perplexity = math.exp(eval_loss)
  logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}")

+ accelerator.save_state(output_dir)

+ if __name__ == "__main__":
+     main()
</pre>
Launching a script using the default accelerate config file looks like the following:
``` | |
accelerate launch {script_name.py} {--arg1} {--arg2} ... | |
``` | |
Alternatively, you can pass the config parameters directly to `accelerate launch` for multi-GPU training, as shown below:
``` | |
accelerate launch \ | |
--use_megatron_lm \ | |
--num_processes=8 \ | |
--mixed_precision=bf16 \ | |
--megatron_lm_tp_degree=2 \ | |
--megatron_lm_pp_degree=2 \ | |
--megatron_lm_num_micro_batches=2 \ | |
--megatron_lm_sequence_parallelism=true \ | |
--megatron_lm_recompute_activations=true \ | |
--megatron_lm_use_distributed_optimizer=true \ | |
{script_name.py} {--arg1} {--arg2} ... | |
``` | |
## | |
For Megatron-LM, the supported models are the Transformers GPT2, Megatron-BERT and T5 models, covering the decoder-only, encoder-only and encoder-decoder model classes. Given the complexity of Megatron-LM's features, four changes are required to get started:
1. Use `accelerate.utils.MegatronLMDummyScheduler`: since Megatron-LM uses its own optimizer implementation, the corresponding scheduler compatible with it needs to be used.
2. Computing the effective total batch size now needs to account for the tensor and pipeline parallel degrees.
3. Losses are already averaged across the data parallel group, so there is no need to gather them across processes.
4. Save the model using `accelerator.save_state` instead of the usual transformers `save_pretrained`.
These changes have been highlighted in the code snippet above.
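As a concrete illustration of the last point, checkpointing through the Accelerate state utilities might look like the minimal sketch below; `output_dir` is a placeholder path, and with the Megatron-LM integration these calls are expected to route through Megatron-LM's own sharded checkpointing.
```
# Minimal checkpointing sketch; `output_dir` is a placeholder path.
output_dir = "checkpoints/epoch_0"

# Save the state of everything prepared by the Accelerator (model, optimizer, scheduler).
accelerator.save_state(output_dir)

# ...later, to resume training from that checkpoint:
accelerator.load_state(output_dir)
```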
The Megatron-LM integration supports many advanced features, such as the ability to leverage a custom train step, Megatron-LM indexed datasets, checkpoint reshaping and interoperability utilities, the `megatron_generate` function for text generation using tensor and pipeline parallelism, and support for RoPE/ALiBi positional embeddings and Multi-Query Attention. However, these require more changes owing to their complexity, which is worth it for getting the highest performance.
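Purely as a hedged sketch, some of these features amount to extra Megatron-LM arguments that can be forwarded through the plugin; the `other_megatron_args` field and the Megatron-LM argument name used below are assumptions that depend on your Accelerate and Megatron-LM versions, so check the linked documentation before relying on them.
```
# Illustrative only: the field and argument names below are assumptions and
# depend on the installed Accelerate / Megatron-LM versions.
from accelerate.utils import MegatronLMPlugin

megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,
    pp_degree=2,
    num_micro_batches=2,
    # Extra flags forwarded to Megatron-LM's own argument parser,
    # e.g. to switch the positional embedding scheme (assumed argument name).
    other_megatron_args={"position_embedding_type": "rope"},
)
```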
## | |
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/usage_guides/megatron_lm" target="_blank">How to use Megatron-LM</a> | |
- <a href="https://github.com/pacman100/accelerate-megatron-test" target="_blank">Examples showcasing the Megatron-LM integration of Accelerate</a> |