AIACC-Training significantly improves the performance of AI training. AIACC-Training
can optimize models that are developed based on mainstream AI computing frameworks,
including TensorFlow, PyTorch, Apache MXNet, and Caffe. This topic describes how to
automatically install AIACC-Training and test a demo.
Background information
Conda is an open source package management system and environment management system
that can run on different platforms. Miniconda is a minimal installer for Conda and
is used to deploy environments. You can create GPU-accelerated instances that have
AIACC-Training preinstalled in a conda environment. You can use Miniconda to select
different conda environments. Then, in a conda environment, you can install and switch
between deep learning frameworks and use AIACC-Training to significantly improve
training performance.
Automatically install AIACC-Training
AIACC-Training depends on GPU drivers, NVIDIA Compute Unified Device Architecture
(CUDA), and NVIDIA CUDA Deep Neural Network (cuDNN). When you create a GPU-accelerated
instance, you must select Auto-install GPU Driver and AIACC-Training.

Conda environments contain AIACC-Training and OpenMPI dependencies. However, you must
install deep learning frameworks on your own. For more information about how to install
deep learning frameworks, see Select a conda environment and install a deep learning framework.
CUDA versions determine the versions of deep learning frameworks that you can install.
The following table lists the versions of deep learning frameworks that you can install
based on the CUDA versions.
CUDA version | Default conda environment | Framework version |
CUDA 11.0 | aiacct_tr1.7.0_cu11.0_py36 | TensorFlow 2.4 |
CUDA 10.1 | aiacct_tf2.1_cu10.1_py36 | TensorFlow 2.1 |
CUDA 10.0 | aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36 | TensorFlow 1.15 + PyTorch 1.4.0 + MXNet 1.5.0, or TensorFlow 1.14 + PyTorch 1.3.0 + MXNet 1.4.0 |
CUDA 9.0 | aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36 | TensorFlow 1.12 + PyTorch 1.3.0 + MXNet 1.5.0 |
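The mapping in the table can also be written as a small lookup, which may be convenient in provisioning scripts. The following is a minimal sketch, not part of AIACC-Training: the function name default_env_for_cuda is hypothetical, and the mapping only mirrors the rows listed in the table.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with AIACC-Training): map a CUDA version
# to the default conda environment listed in the table in this topic.
default_env_for_cuda() {
  case "$1" in
    11.0) echo "aiacct_tr1.7.0_cu11.0_py36" ;;
    10.1) echo "aiacct_tf2.1_cu10.1_py36" ;;
    10.0) echo "aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36" ;;
    9.0)  echo "aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36" ;;
    *)    echo "no default environment listed for CUDA $1" >&2
          return 1 ;;
  esac
}

# Example: print the default environment for CUDA 10.1.
default_env_for_cuda 10.1
```

A version that is not in the table makes the function print a message to standard error and return a non-zero status, so a calling script can stop early.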
Select a conda environment and install a deep learning framework
- Connect to a GPU-accelerated instance. For more information, see Connect to a Linux instance by using password authentication.
- Select a conda environment.
- Initialize Miniconda.
. /root/miniconda/etc/profile.d/conda.sh
- View existing conda environments.
conda env list
The following figure shows an example of the command output.

- Activate the conda environment that you want to use.
conda activate [environments_name]
The following figure shows an example of the command output.

In the preceding figure, aiacct_tf2.1_cu10.1_py36 indicates the following items:
- TensorFlow 2.1
- CUDA 10.1
- Python 3.6
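The naming convention described above can be decoded mechanically. The following sketch is an illustration only: the function name describe_env is hypothetical, and the meaning of the prefixes (tf, tr, mx, cu, py) is inferred from the environment names and framework versions listed in this topic.

```shell
#!/bin/sh
# Hypothetical helper: split a conda environment name on "_" and translate
# each prefixed token. The body runs in a subshell so that the IFS change
# does not leak into the calling shell.
describe_env() (
  IFS=_
  for part in $1; do
    case "$part" in
      tf*) echo "TensorFlow ${part#tf}" ;;
      tr*) echo "PyTorch ${part#tr}" ;;
      mx*) echo "MXNet ${part#mx}" ;;
      cu*) echo "CUDA ${part#cu}" ;;
      py*) v=${part#py}        # for example "36"
           minor=${v#?}        # everything after the first digit: "6"
           echo "Python ${v%"$minor"}.$minor" ;;
    esac
  done
)

describe_env aiacct_tf2.1_cu10.1_py36
# Prints:
#   TensorFlow 2.1
#   CUDA 10.1
#   Python 3.6
```

Tokens without a known prefix, such as the leading aiacct, are skipped.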
- Install a deep learning framework.
install_frameworks.sh
The install_frameworks.sh script includes the command to install a deep learning framework that is suitable
for the selected conda environment. The following figure shows an example of the script.

The following figure shows an example of the script output.
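The initialization, activation, and installation commands above can be chained in one guarded function so that a failure stops the sequence early. This is a minimal sketch under stated assumptions: the function name setup_env is hypothetical, the default path is the conda.sh path used in this topic, and install_frameworks.sh is assumed to be callable by name once the environment is active, as in the step above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three commands in this procedure.
# $1: conda environment name
# $2: optional path to conda.sh (defaults to the path used in this topic)
setup_env() {
  env_name=$1
  conda_sh=${2:-/root/miniconda/etc/profile.d/conda.sh}

  if [ -z "$env_name" ]; then
    echo "usage: setup_env <environment_name> [path to conda.sh]" >&2
    return 2
  fi
  if [ ! -f "$conda_sh" ]; then
    echo "conda initialization script not found: $conda_sh" >&2
    return 1
  fi

  . "$conda_sh" || return 1                # initialize Miniconda
  conda activate "$env_name" || return 1   # select the conda environment
  install_frameworks.sh                    # install the matching framework
}

# Example (run on the GPU-accelerated instance):
#   setup_env aiacct_tf2.1_cu10.1_py36
```

Because the function must change the current shell's environment, source the file that defines it rather than running it as a separate script.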

- Test a demo.
By default, the ali-perseus-demos.tgz demo file is stored in the /root directory. In this example, the TensorFlow demo is tested.
- If you use TensorFlow 2.1, perform the following operations:
- Decompress the demo test package.
tar -xvf ali-perseus-demos.tgz
- Go to the directory of the TensorFlow demo.
cd ali-perseus-demos/tensorflow2-examples
- Run the test script in the directory.
For example, run the following command:
python tensorflow2_keras_mnist_perseus.py
This demo trains a model on the Modified National Institute of Standards and Technology
(MNIST) dataset. AIACC-Training improves training performance while keeping the same
level of precision as your benchmark code. The following figure shows
an example of the training results.

- If you use TensorFlow 1.14, perform the following operations:
- Decompress the demo test package.
tar -xvf ali-perseus-demos.tgz
- Go to the directory of the TensorFlow demo.
cd ali-perseus-demos/tensorflow-benchmarks
- View the test command in the README.txt file.
- Go to the directory where the test script of the demo of TensorFlow 1.14 is stored.
For example, run the following command:
cd benchmarks-tf1.14
- Modify the test command based on the number of GPUs that correspond to the instance
specifications and run the command.
The following sample command is for an instance that has one GPU:
mpirun --allow-run-as-root --bind-to none -np 1 -npernode 1 \
--mca btl_tcp_if_include eth0 \
--mca orte_keep_fqdn_hostnames t \
-x NCCL_SOCKET_IFNAME=eth0 \
-x LD_LIBRARY_PATH \
./config-fp16-tf.sh
This demo uses synthetic data to test the training speed. The following figure shows
an example of the training results.
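To adapt the sample mpirun command to a different GPU count, only the process counts need to change. The following sketch prints the command instead of running it so that you can review it first. It assumes a single instance with one training process per GPU, which means that -np and -npernode are both set to the GPU count. The function name print_mpirun_cmd is hypothetical, and all other options are copied unchanged from the sample command.

```shell
#!/bin/sh
# Hypothetical helper: print the mpirun command for one instance with the
# given number of GPUs. Assumption: one process per GPU on a single node.
print_mpirun_cmd() {
  gpus=$1
  case "$gpus" in
    ''|*[!0-9]*|0)
      echo "GPU count must be a positive integer" >&2
      return 1 ;;
  esac
  printf '%s\n' \
    "mpirun --allow-run-as-root --bind-to none -np $gpus -npernode $gpus \\" \
    "  --mca btl_tcp_if_include eth0 \\" \
    "  --mca orte_keep_fqdn_hostnames t \\" \
    "  -x NCCL_SOCKET_IFNAME=eth0 \\" \
    "  -x LD_LIBRARY_PATH \\" \
    "  ./config-fp16-tf.sh"
}

# Example: print the command for an instance that has 8 GPUs.
print_mpirun_cmd 8
```

On the instance, the GPU count can typically be read with nvidia-smi -L, which prints one line per GPU. Check the README.txt file in the demo directory before you rely on the generated command.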

Delete Miniconda
You can delete Miniconda if you no longer need AIACC-Training. By default, Miniconda is
installed for the root user, so perform the following steps as the root user.
- Delete the miniconda folder.
- Delete the environment variables and output.
- Modify the /root/.bashrc file and comment out the environment variables and output that are related to Miniconda
and AIACC-Training.
The following figure shows an example of the modified environment variables and output.

- Make the new environment variables take effect.
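The steps above can be sketched as follows. The exact lines to comment out depend on what the installer wrote to /root/.bashrc, so this example makes assumptions that you should verify first: the function name comment_out_lines is hypothetical, related lines are assumed to contain the word miniconda unless you pass another pattern, and the folder is assumed to be /root/miniconda, the path used earlier in this topic.

```shell
#!/bin/sh
# Hypothetical helper: comment out every line in a file that matches a
# pattern and is not already a comment. A backup is saved as <file>.bak.
# $1: file to edit
# $2: optional pattern (assumption: related lines mention "miniconda")
comment_out_lines() {
  file=$1
  pattern=${2:-miniconda}
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  cp "$file" "$file.bak" || return 1
  awk -v pat="$pattern" '
    $0 ~ pat && $0 !~ /^[ \t]*#/ { print "# " $0; next }
    { print }
  ' "$file.bak" > "$file"
}

# Example (run as the root user on the instance, after reviewing the file):
#   rm -rf /root/miniconda            # delete the miniconda folder
#   comment_out_lines /root/.bashrc   # comment out the related lines
#   . /root/.bashrc                   # make the changes take effect
```

If lines related to AIACC-Training do not mention miniconda, call the function again with a pattern that matches them, or edit the file manually.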