
Elastic GPU Service:Use AIACC-Training for PyTorch

Last Updated: Aug 16, 2023

Since the release of PyTorch 1.x, the PyTorch built-in DistributedDataParallel (DDP) module has become the main choice for distributed training. This topic describes how to use AIACC-Training to accelerate distributed training of models that are built based on PyTorch. This topic also provides answers to some commonly asked questions about AIACC-Training for PyTorch.

(Recommended) Adapt to the API operations of PyTorch DDP

Background information

For more information about PyTorch DDP, see Getting started with DDP.

Adapt and run training code

  1. Adapt training code.

    You only need to add the code that imports AIACC-Training to the file that contains the main() function. Make sure that Perseus is imported before Torch. Sample code:

    import perseus    # You must import Perseus before Torch.
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    ......
  2. Start the training script.

    Use torch.distributed.launch to perform distributed training with PyTorch DDP. In this example, two machines are used and each machine is configured with eight GPUs. Sample code:

    ### Run the following command on Machine 1.
    python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=${machine1_ip} --master_port=6007 --use_env ${TRAIN_SCRIPT}
    
    ### Run the following command on Machine 2.
    python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=${machine1_ip} --master_port=6007 --use_env ${TRAIN_SCRIPT}                          

    Parameters:

    • ${machine1_ip}: the IP address of the internal network that is used by Machine 1. Example: 192.168.2.211. You can run the ifconfig command to query the IP address.

    • ${TRAIN_SCRIPT}: the training script that you use.

Example

AIACC-Training provides a sample of adapted PyTorch DDP code in the directory of its software package. You can perform the following operations to start training.

  1. Go to the directory in which the sample code is stored.

    cd `echo $(python -c "import perseus; print(perseus)") | cut -d\' -f 4 | sed "s/__init__\.py//"`examples/
  2. Start training.

    Use native PyTorch DDP to start and run the pytorch_ddp_benchmark.py script. In this example, a machine that is configured with eight GPUs is used. Sample code:

    NP=8
    ADDR=localhost
    PORT=6006
    python -m torch.distributed.launch --nproc_per_node=$NP --nnodes=1 --node_rank=0 --master_addr=$ADDR --master_port=$PORT \
      pytorch_ddp_benchmark.py

Adapt to the API operations of Horovod

Prerequisites

Perseus is upgraded to v1.3.2 or later. Otherwise, AIACC-Training for PyTorch cannot accelerate training as expected.

Procedure

AIACC-Training for PyTorch supports Horovod API operations. If you have used Horovod to perform distributed training, you only need to change the import statement to the following code:

import perseus.torch.horovod as hvd

If your training code is not written to perform distributed training, you can perform the following operations to upgrade your code to distributed training code that is compatible with Horovod API operations. A consolidated sketch that combines these steps is provided after the list.

  1. In the first part of the main() function, add the following code to initialize the Perseus Horovod module.

    Note

    Make sure that this initialization is performed before you call any other Perseus API operations.

    hvd.init()
  2. Bind the current process to a GPU.

    torch.cuda.set_device(hvd.local_rank())
  3. In most cases, the number of training steps and warmup steps must be divided by hvd.size(), and the learning rate must be multiplied by hvd.size(). hvd.size() is the size of the process set.

    Note

    You do not need to increase the learning rate of specific models, such as the Bidirectional Encoder Representations from Transformers (BERT) model. You can determine whether to increase the learning rate based on the training convergence results.

    step = step // hvd.size()
    learning_rate = learning_rate * hvd.size()
  4. Reload the optimizer.

    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    If you use multiple models and multiple named_parameters parameters exist, you must merge the models and the parameters. Sample code:

    all_named_parameters = []
    for name, value in model1.named_parameters():
        all_named_parameters.append((name, value))
    for name, value in model2.named_parameters():
        all_named_parameters.append((name, value))
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=all_named_parameters)
  5. Broadcast global variable parameters to all machines.

    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    If you use multiple models and multiple state_dict parameters exist, you must merge the models and the parameters. Sample code:

    all_state_dict={}
    all_state_dict.update(model1.state_dict())
    all_state_dict.update(model2.state_dict())
    hvd.broadcast_parameters(all_state_dict, root_rank=0)
  6. Split a dataset into shards.

    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler, **kwargs)
  7. Configure the single-machine single-GPU model for the program.

    Perseus launches the single-machine single-GPU program on each GPU to perform multi-GPU training. Therefore, you must configure the program as a single-machine single-GPU program.

    Original program code:

    model = nn.DataParallel(model.cuda())

    Use one of the following methods to change the original code:

    • Method 1

      model = nn.DataParallel(model.cuda(), device_ids=[hvd.local_rank()])
    • Method 2

      model = model.cuda()

      If you use Method 2, cuda() uses the GPU that you bound to the process in Step 2.

  8. Save the checkpoint.

    save_checkpoint = True if hvd.rank() == 0 else False
    verbose = 1 if hvd.rank() == 0 else 0
    log_writer = tensorboardX.SummaryWriter(log_dir) if hvd.rank() == 0 else None
  9. Load the checkpoint.

    if hvd.rank() == 0:
        checkpoint = torch.load(filepath)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
  10. Start distributed training.

    • In this example, one machine that is configured with eight GPUs is used.

      mpirun -allow-run-as-root -np 8 -npernode 8 -x NCCL_DEBUG=INFO ./train.sh
    • In this example, four machines are used and each machine is configured with eight GPUs.

      mpirun -allow-run-as-root -bind-to none -np 32 -npernode 8 \
                         -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH \
                         -x PERSEUS_ALLREDUCE_STREAMS=8 \
                         -hostfile mpi_host.txt ./train.sh
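
The following minimal sketch puts the steps above together. The perseus.torch.horovod import is the one described in this topic; the linear model, random dataset, learning rate, and checkpoint path are placeholders that are used only for illustration.

    import perseus.torch.horovod as hvd  # AIACC-Training Horovod-compatible API
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import TensorDataset

    def main():
        hvd.init()                                   # Step 1: initialize Perseus first
        torch.cuda.set_device(hvd.local_rank())      # Step 2: bind this process to one GPU

        # Placeholder model and data. Replace them with your own model and dataset.
        model = nn.Linear(32, 10).cuda()             # Step 7: single-machine single-GPU model
        train_dataset = TensorDataset(torch.randn(1024, 32),
                                      torch.randint(0, 10, (1024,)))

        base_lr = 0.01                               # placeholder learning rate
        optimizer = torch.optim.SGD(model.parameters(),
                                    lr=base_lr * hvd.size())  # Step 3: scale the learning rate

        optimizer = hvd.DistributedOptimizer(        # Step 4: reload the optimizer
            optimizer, named_parameters=model.named_parameters())

        hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # Step 5: broadcast parameters
        hvd.broadcast_optimizer_state(optimizer, root_rank=0)

        train_sampler = torch.utils.data.distributed.DistributedSampler(  # Step 6: shard the dataset
            train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
        loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=64, sampler=train_sampler)

        for epoch in range(2):
            train_sampler.set_epoch(epoch)
            for data, target in loader:
                data, target = data.cuda(), target.cuda()
                optimizer.zero_grad()
                F.cross_entropy(model(data), target).backward()
                optimizer.step()
            if hvd.rank() == 0:                      # Step 8: save the checkpoint on rank 0 only
                torch.save({'model': model.state_dict(),
                            'optimizer': optimizer.state_dict()}, 'checkpoint.pt')

    if __name__ == '__main__':
        main()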

FAQ

What do I do if the system reports the Input type (CUDAFloatTensor) and weight type (CPUFloatTensor) should be the same error?

In most cases, the system reports this error because your input data is on the GPU but the model has not been moved to the GPU. You can add model.cuda() to your code to transfer the model to the GPU.
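
The following minimal sketch shows the fix. The linear model and the random input tensor are placeholders used only for illustration.

    import torch
    import torch.nn as nn

    model = nn.Linear(32, 10)              # the model is created on the CPU
    data = torch.randn(4, 32).cuda()       # the input data is already on the GPU
    model = model.cuda()                   # move the model to the GPU to avoid the type mismatch
    output = model(data)                   # the input and the weights are now both CUDA tensors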

What do I do if the system reports the RuntimeError: all tensors must be on devices[0] error?

You can perform the following operations to troubleshoot the error:

  • Check whether you specify the GPU when the program is initialized. For example, check whether you run the torch.cuda.set_device(hvd.local_rank()) command.

  • Check whether you specify device_ids when you use DataParallel. For example, check whether you run the nn.DataParallel(model.cuda(), device_ids=[hvd.local_rank()]) command.
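
The following sketch shows the correct initialization order, assuming the Perseus Horovod API described in this topic. The linear model is a placeholder used only for illustration.

    import perseus.torch.horovod as hvd
    import torch
    import torch.nn as nn

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())        # bind the process to its GPU first
    model = nn.Linear(32, 10)                      # placeholder model
    model = nn.DataParallel(model.cuda(),
                            device_ids=[hvd.local_rank()])  # pass device_ids as a list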

How do I fix GPU out of memory errors that occur when the system loads a model?

In most cases, out of memory errors occur because the data size of the model is too large. To fix this error, load your model onto the CPU first. For example, you can run the torch.load(pretrain, map_location='cpu') command.
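
The following minimal sketch shows this approach. The checkpoint path pretrain.pt, the 'model' key, and the linear model are placeholders used only for illustration.

    import torch
    import torch.nn as nn

    model = nn.Linear(32, 10)                                   # placeholder model
    checkpoint = torch.load('pretrain.pt', map_location='cpu')  # load the weights into host memory first
    model.load_state_dict(checkpoint['model'])                  # populate the model on the CPU
    model = model.cuda()                                        # then move the model to the GPU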

What do I do if the system prompts an exit exception?

Before you handle the exception, make sure that your code is adapted as expected. For more information, see Adapt and run training code.

The exception may occur because multiple models are not merged correctly. If you use multiple models, you must merge the named_parameters parameters of the models and pass the merged parameters to Perseus to encapsulate the optimizer. You must also merge all state_dict parameters and broadcast all parameters.

What do I do if the system prompts that a port is being used?

Check whether the port is being used by another process. If it is, you can run the pkill python command to terminate that process.

Starting from PyTorch 1.9, torch.distributed.run is used by default to start training. This launcher uses rdzv_backend by default and creates a rendezvous process before the training starts. If your training code also calls dist.init_process_group(), the system tries to create another process group on a port that is already in use. As a result, a conflict occurs. If you use a machine that is configured with multiple GPUs, you can add --standalone to the startup command to fix this error.
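
For example, on a machine that is configured with eight GPUs, the startup command may look like the following. ${TRAIN_SCRIPT} is the placeholder for your training script that is used earlier in this topic.

    python -m torch.distributed.run --standalone --nproc_per_node=8 ${TRAIN_SCRIPT}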