AIACC-Training significantly improves the performance of AI training. AIACC-Training can optimize models that are developed based on mainstream AI computing frameworks, including TensorFlow, PyTorch, Apache MXNet, and Caffe. This topic describes how to automatically install AIACC-Training and test a demo.

Background information

Conda is an open source package management system and environment management system that can run on different platforms. Miniconda is a minimal installer for Conda and is used to deploy environments. You can create GPU-accelerated instances that have AIACC-Training preinstalled in a conda environment. You can use Miniconda to select different conda environments. Then, in a conda environment, you can install and switch between deep learning frameworks and use AIACC-Training to significantly improve training performance.

Automatically install AIACC-Training

AIACC-Training depends on GPU drivers, NVIDIA Compute Unified Device Architecture (CUDA), and NVIDIA CUDA Deep Neural Network (cuDNN). When you create a GPU-accelerated instance, you must select Auto-install GPU Driver and AIACC-Training.

Conda environments contain AIACC-Training and OpenMPI dependencies. However, you must install deep learning frameworks on your own. For more information about how to install deep learning frameworks, see Select a conda environment and install a deep learning framework.

CUDA versions determine the versions of deep learning frameworks that you can install. The following table lists the versions of deep learning frameworks that you can install based on the CUDA versions.
CUDA version | Default conda environment | Framework version
CUDA 11.0 | aiacct_tr1.7.0_cu11.0_py36 | TensorFlow 2.4
CUDA 10.1 | aiacct_tf2.1_cu10.1_py36 | TensorFlow 2.1
CUDA 10.0 | aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36 | TensorFlow 1.15 + PyTorch 1.4.0 + MXNet 1.5.0, or TensorFlow 1.14 + PyTorch 1.3.0 + MXNet 1.4.0
CUDA 9.0 | aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36 | TensorFlow 1.12 + PyTorch 1.3.0 + MXNet 1.5.0
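
The default environment names in the table appear to encode their contents. The following sketch splits a name on underscores and prints what each part seems to stand for. The prefix meanings (tf, tr, mx, cu, py) are inferred from the CUDA 10.0 and CUDA 9.0 rows and are an assumption, not a documented naming rule. The function name describe_env is hypothetical. If the output disagrees with the table, the table takes precedence.

```shell
#!/bin/sh
# describe_env NAME
# Print one line per recognized component of a conda environment name.
# The prefix-to-framework mapping below is inferred, not documented.
describe_env() {
    printf '%s\n' "$1" | tr '_' '\n' | while IFS= read -r part; do
        case "$part" in
            tf*) echo "TensorFlow ${part#tf}" ;;
            tr*) echo "PyTorch ${part#tr}" ;;
            mx*) echo "MXNet ${part#mx}" ;;
            cu*) echo "CUDA ${part#cu}" ;;
            py*) ver=${part#py}          # for example "36"
                 minor=${ver#?}          # everything after the first digit
                 major=${ver%"$minor"}   # the first digit
                 echo "Python ${major}.${minor}" ;;
        esac
    done
}

describe_env aiacct_tf2.1_cu10.1_py36
```

Parts that match no known prefix, such as the leading aiacct, are skipped.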

Select a conda environment and install a deep learning framework

  1. Connect to a GPU-accelerated instance. For more information, see Connect to a Linux instance by using password authentication.
  2. Select a conda environment.
    1. Initialize Miniconda.
      . /root/miniconda/etc/profile.d/conda.sh
    2. View existing conda environments.
      conda env list
      The following figure shows an example of the command output. [Figure: aiacc-training-envlist]
    3. Select a conda environment.
      conda activate [environments_name]
      The following figure shows an example of the command output. [Figure: aiacc-training-activate]
      In the preceding figure, aiacct_tf2.1_cu10.1_py36 indicates the following items:
      • TensorFlow 2.1
      • CUDA 10.1
      • Python 3.6
  3. Install a deep learning framework.
    install_frameworks.sh
    The install_frameworks.sh script includes the command to install a deep learning framework that is suitable for the selected conda environment. The following figure shows an example of the script. [Figure: install-frameworks-script]
    The following figure shows an example of the script output. [Figure: install-frameworks]
  4. Test a demo.
    By default, the ali-perseus-demos.tgz demo file is stored in the /root directory. In this example, the TensorFlow demo is tested.
    • If you use TensorFlow 2.1, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow2-examples
      3. Run the test script in the directory.
        The following command runs the test script:
        python tensorflow2_keras_mnist_perseus.py
        This demo trains on the Modified National Institute of Standards and Technology (MNIST) dataset. AIACC-Training improves training performance while keeping the same level of precision as your benchmark code. The following figure shows an example of the training results. [Figure: tf2.1-demo]
    • If you use TensorFlow 1.14, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow-benchmarks
      3. View the test command in the README.txt file.
      4. Go to the directory that contains the test script for the TensorFlow 1.14 demo.
        For example:
        cd benchmarks-tf1.14
      5. Modify the test command to match the number of GPUs that the instance type provides, and then run the command.
        The following command is an example:
        mpirun --allow-run-as-root --bind-to none -np 1 -npernode 1  \
               --mca btl_tcp_if_include eth0  \
               --mca orte_keep_fqdn_hostnames t   \
               -x NCCL_SOCKET_IFNAME=eth0   \
               -x LD_LIBRARY_PATH   \
               ./config-fp16-tf.sh
        This demo uses synthetic data to test the training speed. The following figure shows an example of the training results. [Figure: tf1.14-demo]
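
The step that adjusts the test command to the number of GPUs can be sketched as follows. This is a minimal sketch that assumes a single node and one MPI process per GPU, so that -np and -npernode both equal the GPU count. That assumption is not stated in this topic, and the command in README.txt remains the reference. The function name build_mpirun_cmd is hypothetical. The sketch only prints the command and does not run it.

```shell
#!/bin/sh
# build_mpirun_cmd GPU_COUNT
# Print the demo command with -np and -npernode set to GPU_COUNT.
# Assumption: single node, one MPI process per GPU.
build_mpirun_cmd() {
    gpus=$1
    printf 'mpirun --allow-run-as-root --bind-to none -np %s -npernode %s' "$gpus" "$gpus"
    printf ' --mca btl_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t'
    printf ' -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH ./config-fp16-tf.sh\n'
}

# Count GPUs from the "GPU N: ..." lines of nvidia-smi -L when the tool exists.
gpu_count=0
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_count=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
fi
# Fall back to one process if no GPU could be detected.
[ "${gpu_count:-0}" -ge 1 ] || gpu_count=1

build_mpirun_cmd "$gpu_count"
```

Review the printed command against README.txt before you run it from the benchmarks-tf1.14 directory.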

Delete Miniconda

You can delete Miniconda if you no longer need AIACC-Training. By default, Miniconda is installed and deleted as the root user.

  1. Delete the miniconda folder.
    rm -rf /root/miniconda
  2. Delete the environment variables and output.
    1. Modify the /root/.bashrc file and comment out the environment variables and output that are related to Miniconda and AIACC-Training.
      The following figure shows an example of the modified environment variables and output. [Figure: bashrc-file]
    2. Make the new environment variables take effect.
      source /root/.bashrc
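
The edit to /root/.bashrc can also be scripted. The following sketch comments out every line that matches a pattern and keeps a backup of the original file. The exact lines to comment out are shown only in the figure, so the pattern miniconda|aiacc|AIACC used here is an assumption, and the function name comment_out_matching is hypothetical. Try it on a copy of the file first and check the result.

```shell
#!/bin/sh
# comment_out_matching FILE PATTERN
# Prefix every not-yet-commented line that matches PATTERN with "# ".
# PATTERN is an extended regular expression and must not contain "/".
# The original file is kept as FILE.bak.
comment_out_matching() {
    file=$1
    pattern=$2
    sed -i.bak -E "/${pattern}/ s/^([^#])/# \1/" "$file"
}

# Example (run against a copy first):
#   cp /root/.bashrc /tmp/bashrc.test
#   comment_out_matching /tmp/bashrc.test 'miniconda|aiacc|AIACC'
#   diff /tmp/bashrc.test.bak /tmp/bashrc.test
```

Lines that already start with # are left unchanged, so running the function twice does not add a second comment marker.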