AIACC-Training accelerates distributed model training tasks that are built on mainstream AI computing frameworks, such as PyTorch, TensorFlow, MXNet, and Caffe. AIACC-Training is compatible with the API operations of PyTorch DistributedDataParallel (DDP) and Horovod, so you can apply it directly to code that is written for these distributed training frameworks. This article describes how to install AIACC-Training 1.5.0.
An Alibaba Cloud GPU-accelerated instance that meets the following requirements is created:
This article uses AIACC-Training 1.5.0 as an example. You can use one of the following methods to install AIACC-Training:
Installation methods | Note |
Method 1: Install AIACC-Training in an existing AI software environment | If a deep learning training AI environment is deployed, you can install AIACC-Training in automatic or manual mode. |
Method 2: Install a Conda environment that contains AIACC-Training | If you want to use a Conda environment, you can create a Conda environment that contains AIACC-Training. |
Method 3: Install a Docker image that is configured with AIACC-Training | If you want to use a Docker environment, you can install a Docker image that is configured with AIACC-Training. |
Note
If you select AIACC-Training when you create an ECS instance in the ECS console, AIACC-Training 1.3.3 is automatically installed when the ECS instance is created. We recommend that you install AIACC-Training 1.5.0 by using one of the three methods above.
Alibaba Cloud provides AIACC-Training software packages for different versions of deep learning frameworks. The following table lists the versions of the frameworks that are supported by AIACC-Training.
Note
If a deep learning training environment is deployed, you can install AIACC-Training in automatic or manual mode. Before you install AIACC-Training, make sure that your environment meets the following requirements:
Note
If you re-install your deep learning framework, you must re-install AIACC-Training.
AIACC-Training provides Python software packages for different framework versions. You can run a single script to install AIACC-Training in automatic mode. Sample code:
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/install_AIACC-Training.sh && bash install_AIACC-Training.sh
Note
The default installation script uses python3 as the Python interpreter. To use a different interpreter, append its name as an argument after the script name. For example, to use python, run the following command to install AIACC-Training:
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/install_AIACC-Training.sh && bash install_AIACC-Training.sh python
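The automatic installation described above can be wrapped so that a mistyped interpreter name is caught before anything is downloaded. The following is a minimal sketch: `run_installer` is a hypothetical helper name that is not part of AIACC-Training, and it assumes that the installation script accepts the interpreter name as its only argument, as the note describes.

```shell
# Sketch only: run_installer is a hypothetical wrapper, not part of AIACC-Training.
installer_url="https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/install_AIACC-Training.sh"

run_installer() {
  interpreter="${1:-}"   # empty means: keep the script default (python3)
  # Fail early if the requested interpreter is not on PATH.
  if [ -n "$interpreter" ] && ! command -v "$interpreter" >/dev/null 2>&1; then
    echo "interpreter not found on PATH: $interpreter" >&2
    return 1
  fi
  # Download the script, then run it with the optional interpreter argument.
  wget -O install_AIACC-Training.sh "$installer_url" &&
    bash install_AIACC-Training.sh ${interpreter:+"$interpreter"}
}

# Usage (downloads and runs the installer):
#   run_installer          # default interpreter
#   run_installer python   # pass "python" as in the note above
```

Saving the script to a file before running it also gives you a chance to read it first.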
You can run the following command to install the latest AIACC-Training software package with pip in manual mode.
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_torch-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}m-linux_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
Where,
${cuda_version}: the version of CUDA. You must remove periods (.) from the version number. For example, if you use CUDA 11.0, set cuda_version=110.
${framework_version}: the version of the framework. For example, if you use PyTorch 1.7.1, set framework_version=1.7.1.
${python_version}: the version of Python. You must remove periods (.) from the version number. For example, if you use Python 3.6, set python_version=36.
If you want to use the WHL packages of Python 3.8 or later, you can use the following download URL:
https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}-linux_x86_64.whl
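The two URL patterns above differ only in the ABI tag: the first ends in `cp${python_version}m`, while the pattern for Python 3.8 or later drops the trailing `m`. A small helper can pick the right form from `python_version`. This is a sketch under that assumption; `build_wheel_url` is a hypothetical name, and a generated URL resolves only if Alibaba Cloud publishes a package for that combination.

```shell
# Sketch only: build_wheel_url is a hypothetical helper, not part of AIACC-Training.
# Arguments: cuda_version (no periods), framework, framework_version,
#            python_version (no periods, for example 36 or 38).
build_wheel_url() {
  base="https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com"
  abi_suffix="m"
  # Python 3.8 or later: the ABI tag has no "m" suffix.
  if [ "$4" -ge 38 ]; then
    abi_suffix=""
  fi
  # %%2B prints the literal "%2B" (URL-encoded "+") used in the file name.
  printf '%s/cuda%s/perseus_%s-1.5.0%%2B%s-cp%s-cp%s%s-linux_x86_64.whl\n' \
    "$base" "$1" "$2" "$3" "$4" "$4" "$abi_suffix"
}

# Usage (performs the installation):
#   pip install --force-reinstall "$(build_wheel_url 110 torch 1.7.1 36)"
```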
In this example, PyTorch 1.7.1, CUDA 11.0, and Python 3.6 are used. Sample code:
cuda_version=110 #The version cannot contain periods (.).
framework=torch
framework_version=1.7.1
python_version=36
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}m-linux_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-py2.py3-none-manylinux1_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
In this example, TensorFlow 1.15.0, CUDA 10.0, and Python 3.6 are used. Sample code:
cuda_version=100 #The version cannot contain periods (.).
framework=tensorflow
framework_version=1.15.0
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-py2.py3-none-manylinux1_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
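The version variables in the examples above can also be derived from version strings reported by the environment instead of being typed by hand. The following sketch shows the period-stripping step; `strip_periods` and `major_minor` are hypothetical helper names, and the commented usage assumes that python3 and PyTorch are already installed.

```shell
# Sketch only: both helpers are hypothetical and use standard POSIX tools.

# Remove all periods: "11.0" -> "110", "3.6" -> "36".
strip_periods() {
  printf '%s\n' "$1" | tr -d '.'
}

# Keep only the first two components: "3.6.9" -> "3.6".
major_minor() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Usage (assumes python3 and PyTorch are installed):
#   python_version=$(strip_periods "$(major_minor "$(python3 -c 'import platform; print(platform.python_version())')")")
#   cuda_version=$(strip_periods "$(python3 -c 'import torch; print(torch.version.cuda)')")
```

Note that the pip download URLs expect the versions without periods, while the Conda and Docker names later in this article keep the periods in the CUDA version.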
Conda is an open source system that is used to manage software packages and environments. This system is supported on different platforms. Alibaba Cloud provides a single command that allows you to create a Conda environment that contains AIACC-Training. In the Conda environment, CUDA Toolkit, Python 3, deep learning frameworks, and the latest AIACC-Training software are installed. This helps you quickly build and manage different deep learning frameworks and framework versions. This also helps you significantly improve training performance by using AIACC-Training.
conda env create -f https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/${framework}_${framework_version}_cu${cuda_version}_py${python_version}.yaml
In this example, PyTorch 1.7.1, CUDA 11.0, and Python 3.6 are used. Sample code:
cuda_version=11.0
framework=torch
framework_version=1.7.1
python_version=36
conda env create -f https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/${framework}_${framework_version}_cu${cuda_version}_py${python_version}.yaml
Where,
${cuda_version}: the version of CUDA. Specify the full version number, including periods (.). The version cannot be later than the CUDA version that is installed on the GPU-accelerated instance.
${framework}: the type of the deep learning framework. If you use TensorFlow, MXNet, or PyTorch, set the parameter to tensorflow, mxnet, or torch, respectively.
Note
If the system prompts that the download URL of the Conda environment cannot be found, the specified framework version is not supported. For more information, see Supported frameworks.
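Because an unsupported combination only shows up as a missing download URL, it can help to build the URL in one place and inspect it before you create the environment. The following is a sketch; `build_conda_env_url` is a hypothetical helper name, and the commented `curl` check assumes that the server answers HEAD requests.

```shell
# Sketch only: build_conda_env_url is a hypothetical helper.
# Arguments: framework, framework_version, cuda_version (with periods),
#            python_version (no periods).
build_conda_env_url() {
  printf 'https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/%s_%s_cu%s_py%s.yaml\n' \
    "$1" "$2" "$3" "$4"
}

# Usage (the curl check and conda call need network access):
#   url=$(build_conda_env_url torch 1.7.1 11.0 36)
#   curl --fail --silent --head "$url" >/dev/null && conda env create -f "$url"
```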
You can download a Docker image that is configured with AIACC-Training. In the Docker image, CUDA, Python 3, deep learning frameworks, and the latest AIACC-Training software are installed. This helps you quickly deploy deep learning environments and manage different CUDA environments. This also helps you significantly improve training performance by using AIACC-Training.
Before you install the Docker image, make sure that your environment meets the following requirements:
Run the following command based on the required framework version and environment to download a Docker image that is configured with AIACC-Training:
docker pull registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:${os_type}-cu${cuda_version}-${framework}${framework_version}-py${python_version}-latest
The following table describes the parameters in the command.
You can run a single command to download the Docker image that is configured with AIACC-Training. In this example, CentOS 7, CUDA 11.0, and TensorFlow 2.4.0 are used. Sample code:
os_type=centos7
cuda_version=11.0
framework=tf
framework_version=2.4.0
python_version=36
docker pull registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:${os_type}-cu${cuda_version}-${framework}${framework_version}-py${python_version}-latest
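If you pull images for several framework versions, the image reference can be assembled by a small function instead of repeating the long string. This is a sketch; `build_image_ref` is a hypothetical helper name, and the values are passed through unchanged, so the framework must be given in the short form that the tag uses (`tf` in the example above).

```shell
# Sketch only: build_image_ref is a hypothetical helper.
# Arguments: os_type, cuda_version (with periods), framework,
#            framework_version, python_version (no periods).
build_image_ref() {
  printf 'registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:%s-cu%s-%s%s-py%s-latest\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Usage (downloads the image):
#   docker pull "$(build_image_ref centos7 11.0 tf 2.4.0 36)"
```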
For more information about how to use Docker to perform distributed training, see Horovod in Docker.
Note
You may need to add options to the docker run command that you use to start the container. For example, you can add --shm-size=1g --ulimit memlock=-1 to the command.
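A container start command that includes the options mentioned in the note might look like the following sketch. `start_container` is a hypothetical helper name. The `--gpus all` option is an assumption that is not taken from this article: it requires Docker 19.03 or later with the NVIDIA Container Toolkit, and older setups may use a different mechanism to expose GPUs. Set `DRY_RUN=1` to print the command instead of running it.

```shell
# Sketch only: start_container is a hypothetical helper.
# Arguments: image reference, followed by the command to run in the container.
start_container() {
  image="$1"
  shift
  # --shm-size and --ulimit come from the note above; --gpus all is an assumption.
  set -- docker run --rm -it --gpus all --shm-size=1g --ulimit memlock=-1 "$image" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # print the assembled command without running it
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1 start_container registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:centos7-cu11.0-tf2.4.0-py36-latest bash
```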