AIACC-Training optimizes models that are built on mainstream AI computing frameworks, including TensorFlow, PyTorch, MXNet, and Caffe, to improve training performance. This topic describes how to automatically install AIACC-Training and test the demos.

Background information

conda is an open source package management system and environment management system that runs on different platforms. Miniconda is a minimal installer for conda. When you create a GPU-accelerated instance, you can configure a conda environment that contains AIACC-Training to be installed automatically. You can then use Miniconda to select a conda environment. In that environment, you can install and switch between deep learning frameworks and use AIACC-Training to improve training performance.

Automatically install AIACC-Training

AIACC-Training depends on the GPU driver, CUDA, and cuDNN. When you create a GPU-accelerated instance, select Auto-install GPU Driver and then select Auto-install AIACC-Training. For more information, see Create an NVIDIA GPU-accelerated instance.

conda environments contain dependency packages such as AIACC-Training and OpenMPI, but do not contain deep learning frameworks. For more information about how to install a deep learning framework, see Select a conda environment and install a deep learning framework.

CUDA versions determine the versions of deep learning frameworks that can be installed. The following table describes the mappings between CUDA versions and versions of deep learning frameworks.
| CUDA version | Default conda environment | Versions of the deep learning frameworks that can be installed |
|---|---|---|
| CUDA 11.0 | aiacct_tr1.7.0_cu11.0_py36 | TensorFlow 2.4 |
| CUDA 10.1 | aiacct_tf2.1_cu10.1_py36 | TensorFlow 2.1 |
| CUDA 10.0 | aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36 | TensorFlow 1.15 + PyTorch 1.4.0 + MXNet 1.5.0, or TensorFlow 1.14 + PyTorch 1.3.0 + MXNet 1.4.0 |
| CUDA 9.0 | aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36 | TensorFlow 1.12 + PyTorch 1.3.0 + MXNet 1.5.0 |
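
The mapping between CUDA versions and default conda environments can also be expressed as a small lookup. The following is a minimal sketch, not part of AIACC-Training: the function name default_env_for_cuda is hypothetical, and the environment names are copied from the table in this topic. The commented usage assumes that nvcc is on your PATH.

```shell
# Hypothetical helper (not shipped with AIACC-Training): print the default
# conda environment that this topic lists for a given CUDA version.
default_env_for_cuda() {
    case "$1" in
        11.0) echo "aiacct_tr1.7.0_cu11.0_py36" ;;
        10.1) echo "aiacct_tf2.1_cu10.1_py36" ;;
        10.0) echo "aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36" ;;
        9.0)  echo "aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36" ;;
        *)
            # Versions that the table does not list are reported as an error.
            echo "no default environment listed for CUDA $1" >&2
            return 1
            ;;
    esac
}

# Example usage (assumes nvcc is installed; its output contains "release X.Y"):
#   cuda_version="$(nvcc --version | sed -n 's/.*release \([0-9.]*\).*/\1/p')"
#   default_env_for_cuda "$cuda_version"
```

Run conda env list on the instance to confirm which environments are actually present, because the installed set may differ from the table.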

Select a conda environment and install a deep learning framework

  1. Connect to the instance. For more information, see Connect to a Linux instance by using Workbench.
  2. Select a conda environment.
    1. Initialize Miniconda.
      . /root/miniconda/etc/profile.d/conda.sh
    2. View all conda environments.
      conda env list
      The following figure shows an example command output. [Figure: aiacc-training-envlist]
    3. Select a conda environment.
      conda activate [environments_name]
      The following figure shows an example command output. [Figure: aiacc-training-activate]
      The environment name describes its contents. For example, aiacct_tf2.1_cu10.1_py36 indicates:
      • TensorFlow 2.1
      • CUDA 10.1
      • Python 3.6
  3. Install a deep learning framework.
    install_frameworks.sh
    The install_frameworks.sh script includes the commands used to install deep learning frameworks that are compatible with the selected conda environment. The following figure shows a sample script. [Figure: install-frameworks-script]
    The following figure shows an example script output. [Figure: install-frameworks]
  4. Test the demo.
    By default, the ali-perseus-demos.tgz demo file is located in the /root directory. In this example, the TensorFlow demo is tested.
    • For TensorFlow 2.1, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow2-examples
      3. Run the test script in the directory.
        The following command is used as an example:
        python tensorflow2_keras_mnist_perseus.py
        This demo trains on the Modified National Institute of Standards and Technology (MNIST) dataset. It is intended to show improved training performance at the same level of precision as your benchmark code. The following figure shows an example training result. [Figure: tf2.1-demo]
    • For TensorFlow 1.14, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow-benchmarks
      3. View the test command in README.txt.
      4. Go to the directory where the test script of the corresponding version resides.
        The following command is used as an example:
        cd benchmarks-tf1.14
      5. Modify the test command to match the number of GPUs on your instance type, and then run the command.
        The following command is used as an example:
        mpirun --allow-run-as-root --bind-to none -np 1 -npernode 1  \
               --mca btl_tcp_if_include eth0  \
               --mca orte_keep_fqdn_hostnames t   \
               -x NCCL_SOCKET_IFNAME=eth0   \
               -x LD_LIBRARY_PATH   \
               ./config-fp16-tf.sh
        This demo trains on synthetic data to measure the training speed. The following figure shows an example training result. [Figure: tf1.14-demo]
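
The last step of the TensorFlow 1.14 procedure asks you to adapt the test command to the number of GPUs. The following is a minimal sketch of one way to do that. It assumes a single node that runs one process per GPU, so -np and -npernode are both set to the GPU count. This is an assumption, not something this topic states, so check README.txt for the recommended values. The function name build_mpirun_cmd is hypothetical. The remaining flags are copied from the example command.

```shell
# Hypothetical helper: print the example mpirun command with the process
# count set to the given number of GPUs.
# Assumption: single node, one process per GPU, so -np equals -npernode.
build_mpirun_cmd() {
    gpus="$1"
    # Reject anything that is not a positive integer.
    case "$gpus" in
        ''|*[!0-9]*|0)
            echo "GPU count must be a positive integer" >&2
            return 1
            ;;
    esac
    printf '%s\n' "mpirun --allow-run-as-root --bind-to none -np $gpus -npernode $gpus --mca btl_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH ./config-fp16-tf.sh"
}

# Example usage (nvidia-smi -L prints one line per GPU):
#   gpu_count="$(nvidia-smi -L | wc -l)"
#   build_mpirun_cmd "$gpu_count"
```

The helper only prints the command so that you can review it before you run it in the benchmarks-tf1.14 directory.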

Delete Miniconda

You can delete Miniconda if you no longer need AIACC-Training. By default, the root user can install and delete Miniconda.

  1. Delete the miniconda folder.
    rm -rf /root/miniconda
  2. Delete relevant environment variables and output.
    1. Modify the /root/.bashrc file and comment out the environment variables and output related to Miniconda and AIACC-Training.
      The following figure shows an example command output. [Figure: bashrc-file]
    2. Make the changes to the environment variables take effect.
      source /root/.bashrc
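
Commenting out the Miniconda and AIACC-Training entries can also be scripted. The following is a minimal sketch under a stated assumption: the relevant lines in /root/.bashrc contain the text "miniconda" or "aiacc". The actual entries on your instance may differ, so review the file before and after the change. The function name comment_out_miniconda_lines is hypothetical.

```shell
# Hypothetical helper: comment out lines that mention Miniconda or AIACC.
# Assumption: the relevant lines contain "miniconda" or "aiacc"; this topic
# does not list the exact entries, so review the result yourself.
comment_out_miniconda_lines() {
    file="$1"
    # Save a backup as <file>.bak, then prefix each matching line that is
    # not already a comment with "# ".
    sed -i.bak -E '/[Mm]iniconda|AIACC|aiacc/ s/^[^#]/# &/' "$file"
}

# Example usage:
#   comment_out_miniconda_lines /root/.bashrc
#   diff /root/.bashrc.bak /root/.bashrc   # review what changed
#   source /root/.bashrc
```

The backup file lets you restore the original .bashrc if the pattern matches a line that you want to keep.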