MXNet allows you to perform distributed training by using KVStore and Horovod. AIACC-Training 1.5 can accelerate distributed training in MXNet by using KVStore and by calling Horovod APIs of all versions.
The Horovod APIs provided by AIACC-Training are used in the same way as the corresponding operations in Horovod. If you have used Horovod to perform distributed training, you only need to change the import statement in your training code to the following code:
import perseus.mxnet as hvd
AIACC-Training provides a sample of the adapted code in its software package. You can perform the following operations to start distributed training.
1. Go to the directory in which the sample code is stored.
cd `echo $(python -c "import perseus; print(perseus)") | cut -d\' -f 4 | sed "s/__init__\.py//"`examples/
2. Start distributed training.
The following sample code shows how to start distributed training for a Modified National Institute of Standards and Technology (MNIST) model. In this example, one machine that is configured with eight GPUs is used.
perseusrun -np 8 -H localhost:8 python $examples_path/mxnet_mnist.py
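The shell pipeline in step 1 finds the package directory by parsing the text that Python prints for the perseus module. A minimal Python sketch of the same lookup follows. It assumes that the examples directory sits next to the package's __init__.py, as the shell command implies. The helper name examples_dir is hypothetical and is not part of AIACC-Training.

```python
import os


def examples_dir(module):
    """Return the path of the 'examples' folder next to a package's __init__.py.

    `examples_dir` is a hypothetical helper, not an AIACC-Training API.
    """
    package_dir = os.path.dirname(os.path.abspath(module.__file__))
    return os.path.join(package_dir, "examples")


if __name__ == "__main__":
    try:
        import perseus  # available only where AIACC-Training is installed
    except ImportError:
        print("perseus is not installed; nothing to locate")
    else:
        print(examples_dir(perseus))
```

The printed path could, for example, be exported as the examples_path variable that the perseusrun command references.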
The following APIs are added to Perseus KVStore to support the data and model parallelism in InsightFace:
local_rank: returns the index of the GPU that is assigned to the current process on the local machine. In Python, you can directly use the value of local_rank as a GPU ID to create a GPU context. This method is more convenient than the typical method described in the Example section, in which the GPU ID is obtained in the launch shell script and passed to the Python code.
init(key_name, ndarray, param_only = false): adds the param_only parameter to init(). The default value is false. To change it, pass the param_only parameter to init() and set param_only to true.
push(key_name, ndarray, op = PerseusOp.Sum): adds the op parameter to push(). The op parameter is used to synchronize softmax layers. Valid values: Sum, Max, and Min. Default value: Sum.
1. Use Perseus KVStore.
The following sample code provides an example that you can use to adapt your training code. To adapt the training code, add the import perseus.mxnet statement shown on the line that starts with the first plus sign (+), and replace the line that starts with the minus sign (-) with the KVStore creation code on the line that starts with the second plus sign (+).
diff --git a/example/image-classification/common/fit.py b/example/image-classification/common/fit.py
index 9412b6f..3a6e9a0 100755
--- a/example/image-classification/common/fit.py
+++ b/example/image-classification/common/fit.py
@@ -22,6 +22,7 @@ import time
import re
import math
import mxnet as mx
+import perseus.mxnet as perseus_kv
def _get_lr_scheduler(args, kv):
@@ -146,7 +147,8 @@ def fit(args, network, data_loader, **kwargs):
     # kvstore
-    kv = mx.kvstore.create(args.kv_store)
+    kv = perseus_kv.create(args.kv_store) if args.kv_store == 'dist_sync_perseus' else mx.kvstore.create(args.kv_store)
     if args.gc_type != 'none':
         kv.set_gradient_compression({'type': args.gc_type,
                                      'threshold': args.gc_threshold})
2. Bind the current process to a GPU.
AIACC-Training is compatible with KVStore APIs and overloads KVStore to support distributed training in MXNet. After you switch to AIACC-Training, you only need to modify the context settings in the model code to bind each process to a GPU.
The following sample code shows how to use local_rank to bind the current process to the GPU that corresponds to the value of kv.local_rank. local_rank is a new API that is added to Perseus KVStore.
ctx = []
cvd = os.environ['DEVICES'].strip()
if 'perseus' in args.kv_store:
    import perseus.mxnet as perseus
    # Bind this process to the GPU whose index equals kv.local_rank.
    ctx.append(mx.gpu(kv.local_rank))
3. Start distributed training.
Message Passing Interface (MPI) is the preferred choice for starting distributed training in frameworks. In Perseus, single-machine multi-GPU training and multi-machine multi-GPU training are started in a similar way by using MPI. Perseus no longer supports single-machine multi-GPU training performed in a single process in MXNet. The following sample code shows how to start distributed training. In this example, four machines are used and each machine is configured with eight GPUs.
a) Prepare the config.sh training script.
The script is run by the mpirun command. Therefore, the script must use the MPI environment variables to infer the GPU ID of the process and pass the ID to the model code to create the GPU context. Sample script:
#!/bin/sh
# Derive the local GPU ID from the MPI rank of this process.
GPU=$((OMPI_COMM_WORLD_RANK % OMPI_COMM_WORLD_LOCAL_SIZE))
export OMP_NUM_THREADS=4
MXNET_VISIBLE_DEVICE=$GPU python train_imagenet.py \
    --network resnet \
    --num-layers 50 \
    --kv-store dist_sync_perseus \
    --gpus $GPU …
b) Run the following command to start the training script:
mpirun -np 32 -npernode 8 -hostfile mpi_host.txt ./config.sh
In the preceding command, mpi_host.txt is a regular MPI machine file and is similar to the host file of MXNet SSHLauncher. Sample code of mpi_host.txt:
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
After you start training, each GPU works as a process and returns its independent output. The overall training performance is evaluated by the output of each machine.
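The modulo operation in config.sh can be checked with a few lines of Python. The following sketch assumes that mpirun places ranks on the hosts in consecutive blocks of eight, which is what makes the rank modulo the local size equal to the local GPU index. The function names gpu_for_rank and host_for_rank are hypothetical.

```python
def gpu_for_rank(world_rank, local_size):
    """Mirror config.sh: GPU index = world rank modulo processes per host."""
    if local_size <= 0:
        raise ValueError("local_size must be positive")
    return world_rank % local_size


def host_for_rank(world_rank, local_size):
    """Index of the host (line in mpi_host.txt), assuming block placement of ranks."""
    if local_size <= 0:
        raise ValueError("local_size must be positive")
    return world_rank // local_size


if __name__ == "__main__":
    hosts = ["192.168.0.1", "192.168.0.2", "192.168.0.3", "192.168.0.4"]
    local_size = 8  # GPUs, and therefore processes, per machine
    for rank in (0, 7, 8, 31):
        host = hosts[host_for_rank(rank, local_size)]
        gpu = gpu_for_rank(rank, local_size)
        print(f"rank {rank:2d} -> host {host}, GPU {gpu}")
```

If your MPI launcher distributes ranks in a different order, such as round robin across hosts, the modulo no longer yields the local GPU index and the local rank reported by the launcher is the safer source.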
By default, open source MXNet occupies all CPU resources in your system. This increases the CPU time required to start training jobs. To accelerate the startup speed, you can configure the following environment variables:
export MXNET_USE_OPERATOR_TUNING=0
export MXNET_USE_NUM_CORES_OPERATOR_TUNING=1
export OMP_NUM_THREADS=1
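The same variables can also be set from Python. The following sketch assumes that the variables must be present in the environment before mxnet is imported. The helper name apply_startup_env is hypothetical.

```python
import os

# Values taken from the export commands above.
STARTUP_ENV = {
    "MXNET_USE_OPERATOR_TUNING": "0",
    "MXNET_USE_NUM_CORES_OPERATOR_TUNING": "1",
    "OMP_NUM_THREADS": "1",
}


def apply_startup_env(env=None):
    """Write the startup settings into `env` (os.environ by default).

    Hypothetical helper. Existing values are kept, so settings made by the
    launching shell or script still take precedence.
    """
    target = os.environ if env is None else env
    for name, value in STARTUP_ENV.items():
        target.setdefault(name, value)
    return target


if __name__ == "__main__":
    apply_startup_env()
    # import mxnet as mx  # import the framework only after the variables are set
    for name in STARTUP_ENV:
        print(name, "=", os.environ[name])
```

Because setdefault is used, a value such as OMP_NUM_THREADS=4 exported by config.sh is left unchanged.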
This section describes how to call Horovod APIs to perform distributed training in MXNet and how to adapt your original training code for AIACC-Training.
AIACC-Training for MXNet supports Horovod APIs. The Horovod APIs provided by AIACC-Training are used in the same way as the corresponding operations in Horovod. If you have used Horovod to perform distributed training, you only need to change the import statement in your training code to the following code:
import perseus.mxnet as hvd
If your training code is not written for distributed training, you can perform the following operations to adapt it for distributed training that is compatible with Horovod APIs.
1. At the beginning of the main() function, call the following function to initialize the Perseus Horovod module.
Note
Make sure that you call this function before you call other Perseus APIs.
hvd.init()
2. Bind the current process to a GPU.
# rank and size
rank = hvd.rank()
num_workers = hvd.size()
local_rank = hvd.local_rank()
# Horovod: pin GPU to local rank
context = mx.gpu(local_rank)
3. Create an optimizer.
In most cases, the learning rate of a model must be multiplied by the value of hvd.size().
Note
You do not need to increase the learning rate of specific models, such as the Bidirectional Encoder Representations from Transformers (BERT) model. You can determine whether to increase the learning rate based on the training convergence results.
learning_rate = ...
optimizer_params = {'learning_rate': learning_rate * hvd.size()}
opt = mx.optimizer.create(optimizer, **optimizer_params)
4. Broadcast parameters.
# Horovod: fetch and broadcast parameters
params = net.collect_params()
if params is not None:
    hvd.broadcast_parameters(params)
5. Wrap the optimizer in a distributed trainer.
# Horovod: create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)
6. Start training.
In this example, four machines are used and each machine is configured with eight GPUs. Sample code:
mpirun -np 32 -npernode 8 -hostfile mpi_host.txt ./train.sh
In the preceding command, mpi_host.txt is a regular MPI machine file and is similar to the host file of MXNet SSHLauncher. Sample code of mpi_host.txt:
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
After you start training, each GPU works as a process and returns its independent output. The overall training performance is evaluated by the output of each machine.
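The learning rate adjustment in step 3 is plain arithmetic. The following sketch works through it for the 32 processes started by the preceding mpirun command. The base learning rate of 0.1 is an illustrative value that is not taken from this topic, and the helper name scaled_learning_rate is hypothetical.

```python
def scaled_learning_rate(base_lr, num_workers, scale=True):
    """Multiply the single-worker learning rate by the number of workers.

    Pass scale=False for models that converge better without the increase.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr * num_workers if scale else base_lr


if __name__ == "__main__":
    num_workers = 4 * 8  # the value hvd.size() would return for 4 machines x 8 GPUs
    base_lr = 0.1        # illustrative single-GPU learning rate (assumption)
    print(scaled_learning_rate(base_lr, num_workers))
    print(scaled_learning_rate(base_lr, num_workers, scale=False))
```

In a real training script, num_workers would come from hvd.size() after hvd.init() is called.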
Perseus implements Synchronized Batch Normalization (SyncBatchNorm) based on the official MXNet code in src/operator/contrib/sync_batch_norm-inl.h and loads libperseus_mxnet.so to call the communication APIs of Perseus. This way, SyncBatchNorm is implemented within operators. SyncBatchNorm supports local and global modes.
In scenarios in which the batch size is small, such as object detection scenarios, the mean and variance that BatchNorm calculates on each GPU deviate significantly from the statistics of the whole batch. This results in an accuracy loss. SyncBatchNorm corrects these deviations and ensures the performance of the batch normalization (BN) layer. Even in large-scale distributed training scenarios, SyncBatchNorm can achieve higher precision than BatchNorm. Compared with BatchNorm, SyncBatchNorm can improve convergence precision while maintaining overall training performance.
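The effect of a small per-GPU batch can be illustrated without a framework. The following sketch computes the mean and variance on each worker separately and then from partial statistics that are summed across all workers. It assumes that sums, sums of squares, and counts are the values that get combined, which is a common formulation and not necessarily how Perseus implements it. The function names are hypothetical and the activation values are made up for illustration.

```python
import statistics


def local_stats(values):
    """Per-worker partial statistics: (sum, sum of squares, count)."""
    return sum(values), sum(v * v for v in values), len(values)


def synced_mean_var(per_worker_values):
    """Combine the partial statistics of all workers into one mean and variance."""
    total = total_sq = 0.0
    count = 0
    for values in per_worker_values:
        s, sq, n = local_stats(values)
        total += s
        total_sq += sq
        count += n
    mean = total / count
    var = total_sq / count - mean * mean  # biased variance: E[x^2] - E[x]^2
    return mean, var


if __name__ == "__main__":
    # Eight workers with a batch size of 2 each; the numbers are made up.
    batches = [[0.1, 0.9], [1.4, 0.2], [0.7, 0.5], [1.1, 0.3],
               [0.0, 1.6], [0.8, 0.6], [1.2, 0.4], [0.9, 1.0]]
    for i, batch in enumerate(batches):
        print(f"worker {i}: mean={statistics.mean(batch):.3f} "
              f"var={statistics.pvariance(batch):.3f}")
    mean, var = synced_mean_var(batches)
    print(f"synchronized: mean={mean:.3f} var={var:.3f}")
```

The per-worker values scatter around the synchronized values, which is the deviation that a synchronized layer removes.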
1. Apply the perseus-mxnet-sync-bn.patch patch.
patch -p1 < perseus-mxnet-sync-bn.patch
2. Compile the MXNet source code.
make USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1 USE_NCCL=1 USE_LIBJPEG_TURBO=1 MPI_ROOT=/usr/local -j24
3. Invoke SyncBatchNorm().
PerseusSyncBatchNorm is implemented based on the official MXNet code. Therefore, PerseusSyncBatchNorm is compatible with SyncBatchNorm. To change the synchronization mode, change SyncBatchNorm to PerseusSyncBatchNorm and add the comm_scope parameter to the call. For example, you can set comm_scope to 0 when you use mx.gluon.contrib.nn.PerseusSyncBatchNorm() or set comm_scope to 0 when you use mx.sym.contrib.PerseusSyncBatchNorm().
4. Change the mode.
The following information describes local and global modes:
The following figure compares the precision between BatchNorm and PerseusSyncBatchNorm. In this example, a Faster R-CNN model implemented based on GluonCV and adapted to Perseus is used. One machine configured with eight GPUs is used, and the batch size of each GPU is 2.
In the preceding figure, the precision achieved by PerseusSyncBatchNorm is higher than the precision achieved by BatchNorm from the 1st epoch to the 20th epoch. The highest mAP rises from 31.3% to 34.6%.
For example, you use a machine that is configured with eight GPUs. If the memory of GPU 0 is occupied by the remaining seven processes and the GPU memory occupied by each process ranges from 200 MB to 500 MB, no memory is available on GPU 0.
GPU memory in MXNet is allocated based on the built-in cpu_pinned memory mechanism. By default, the memory of GPU 0 is allocated to the processes. You can bind processes to different GPUs. For more information, see Step 2.
Undefined symbols error? The error indicates that the symbols related to N-dimensional array (NDArray) are not defined.
The system reports the error because a version of MXNet earlier than 1.4 is installed by using pip. As a result, the symbols required by Perseus are not exported from libmxnet.so. You can upgrade MXNet to MXNet 1.4 or later, or recompile MXNet and install it again.
You can configure the following environment variables to accelerate the startup of the training script:
export MXNET_USE_OPERATOR_TUNING=0
export MXNET_USE_NUM_CORES_OPERATOR_TUNING=1
export OMP_NUM_THREADS=1
The system reports the create cusolver handle failed error because specific machines that you use for a training job are occupied by another training job. In this case, you can run nvidia-smi by using mpirun to check whether the machines are occupied.
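One way to run nvidia-smi on every machine is to launch it through mpirun with one process per host. The following sketch only assembles the command. It assumes the mpi_host.txt file and the mpirun options (-np, -npernode, and -hostfile) that are already used in this topic. The helper name build_gpu_check_command is hypothetical.

```python
import shlex


def build_gpu_check_command(num_hosts, hostfile="mpi_host.txt"):
    """Build an mpirun command that runs nvidia-smi once on each host."""
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")
    return ["mpirun", "-np", str(num_hosts), "-npernode", "1",
            "-hostfile", hostfile, "nvidia-smi"]


if __name__ == "__main__":
    command = build_gpu_check_command(num_hosts=4)
    # Print the command instead of running it, so the sketch works without MPI.
    print(" ".join(shlex.quote(part) for part in command))
```

On a machine where MPI is installed, the list can be passed to subprocess.run(command, check=True) to execute the check.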