AIACC-Training accelerates the training of models built on mainstream AI computing frameworks, including TensorFlow, PyTorch, MXNet, and Caffe, and improves training performance. This topic describes how to manually install AIACC-Training for the TensorFlow, PyTorch, and MXNet frameworks.

Prerequisites

An Alibaba Cloud GPU-accelerated instance is created and meets the following requirements (commands to verify them follow this list):
  • The instance uses a CentOS 7.x or Ubuntu 16.04 image.
  • CUDA 10.1, 10.0, or 9.0 is installed.
  • Python 3.6 or 2.7 is installed.
  • An AI computing framework is installed.
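You can check these requirements from a shell on the instance. The following is a minimal sketch, assuming nvcc is on the PATH and TensorFlow is the installed framework (substitute torch or mxnet as appropriate):
  # Check the CUDA toolkit version (assumes nvcc is on the PATH)
  nvcc --version
  # Check the Python version
  python3 --version
  # Check the framework version (TensorFlow shown as an example)
  python3 -c "import tensorflow as tf; print(tf.__version__)"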

Background information

AIACC-Training is deeply optimized for executing training tasks on Alibaba Cloud IaaS resources, and uses a single set of core code to accelerate training tasks across mainstream AI computing frameworks.

Procedure

  1. Connect to the instance. For more information, see Connect to a Linux instance by using Workbench.
  2. Install OpenMPI 4, the dependency common to all frameworks.
    • CentOS 7.x
      # install OpenMPI4
      wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/openmpi-4.0.3-1.el7.x86_64.rpm
      rpm -ivh openmpi-4.0.3-1.el7.x86_64.rpm
    • Ubuntu 16.04
      # install OpenMPI4
      wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/openmpi_4.0.3-1_amd64.deb
      dpkg -i openmpi_4.0.3-1_amd64.deb
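    To confirm that OpenMPI installed correctly on either distribution, check the version reported by mpirun:
      # Verify the OpenMPI installation; the output should report version 4.0.3
      mpirun --version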
  3. Install the AI computing framework support package.
    The support package for each AI computing framework is provided as a WHL file. Choose the download address that matches your combination of CUDA version, framework version, and Python version, as listed below. In the address formats, ${CUDA_VER} is the CUDA version written without the dot and prefixed with cuda (for example, cuda100 for CUDA 10.0), and ${TENSORFLOW_VER}, ${TORCH_VER}, and ${MXNET_VER} are the framework versions. A worked example of composing an address follows the list.
    TensorFlow
      Supported versions:
      • CUDA 9.0 + TensorFlow 1.12
      • CUDA 10.0 + TensorFlow 1.14
      • CUDA 10.0 + TensorFlow 1.15
      • CUDA 10.1 + TensorFlow 2.1
      Download address format (identical for Python 3 and Python 2):
      https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/${CUDA_VER}/perseus_tensorflow-1.3.2%2B${TENSORFLOW_VER}-py2.py3-none-manylinux1_x86_64.whl
      Example (CUDA 10.0 + TensorFlow 1.14):
      https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda100/perseus_tensorflow-1.3.2%2B1.14-py2.py3-none-manylinux1_x86_64.whl
    PyTorch
      Supported versions:
      • CUDA 9.0 + PyTorch 1.2
      • CUDA 10.0 + PyTorch 1.3
      • CUDA 10.0 + PyTorch 1.4
      Download address formats:
      • Python 3.6: https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/${CUDA_VER}/perseus_torch-1.3.2.post2%2B${TORCH_VER}-cp36-cp36m-linux_x86_64.whl
      • Python 2.7: https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/${CUDA_VER}/perseus_torch-1.3.2.post2%2B${TORCH_VER}-cp27-cp27mu-linux_x86_64.whl
      Examples:
      • CUDA 10.0 + PyTorch 1.3 + Python 3.6: https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda100/perseus_torch-1.3.2.post2%2B1.3-cp36-cp36m-linux_x86_64.whl
      • CUDA 10.0 + PyTorch 1.3 + Python 2.7: https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda100/perseus_torch-1.3.2.post2%2B1.3-cp27-cp27mu-linux_x86_64.whl
    MXNet
      Supported versions:
      • CUDA 9.0 + MXNet 1.4
      • CUDA 9.0 + MXNet 1.5
      • CUDA 10.0 + MXNet 1.4
      • CUDA 10.0 + MXNet 1.5
      Download address format (identical for Python 3 and Python 2):
      https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/${CUDA_VER}/perseus_mxnet-1.3.2%2B${MXNET_VER}-py2.py3-none-manylinux1_x86_64.whl
      Example (CUDA 10.0 + MXNet 1.5):
      https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda100/perseus_mxnet-1.3.2%2B1.5-py2.py3-none-manylinux1_x86_64.whl
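    For instance, the address for CUDA 10.0 + PyTorch 1.3 + Python 3.6 can be composed from the Python 3.6 format above by substituting the variables. A minimal sketch, assuming the cuda100 naming pattern shown in the examples:
      # Substitute the CUDA and framework versions into the address format
      CUDA_VER=cuda100
      TORCH_VER=1.3
      wget "https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/${CUDA_VER}/perseus_torch-1.3.2.post2%2B${TORCH_VER}-cp36-cp36m-linux_x86_64.whl"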
    The following commands provide an example of how to download and install the framework support package for CUDA 10.0 + TensorFlow 1.14:
    wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda100/perseus_tensorflow-1.3.2%2B1.14-py2.py3-none-manylinux1_x86_64.whl
    # wget saves the file with %2B in the URL decoded to a literal +
    pip3 install perseus_tensorflow-1.3.2+1.14-py2.py3-none-manylinux1_x86_64.whl
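    To verify the installation, you can query pip for the package; the distribution name below is inferred from the wheel file name:
    # Confirm the support package is installed (pip treats perseus_tensorflow and perseus-tensorflow as the same name)
    pip3 show perseus-tensorflow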
    Note To use AIACC-Training to accelerate training, you only need to make minimal modifications to the model code. For more information, see Adapt the model code to AIACC-Training.