AIACC-Training optimizes models built on mainstream AI computing frameworks, including TensorFlow, PyTorch, MXNet, and Caffe, to improve training performance. This topic describes how to automatically install AIACC-Training and test the demos.

Background information

Conda is an open source package and environment management system that can run on different platforms. Miniconda is a minimal installer of conda. When you create a GPU-accelerated instance, you can configure a conda environment that contains AIACC-Training to be automatically installed. You can use Miniconda to select a conda environment. Then, in the conda environment, you can install and switch between deep learning frameworks and use AIACC-Training to improve training performance.

Automatically install AIACC-Training

AIACC-Training depends on the GPU driver, CUDA, and cuDNN. When you create a GPU-accelerated instance, select Auto-install GPU Driver and then select AIACC-Training after you select an image.

Conda environments contain dependency packages such as AIACC-Training and OpenMPI, but do not contain deep learning frameworks. For more information about how to install a deep learning framework, see Select a conda environment and install a deep learning framework.

CUDA versions determine the versions of deep learning frameworks that can be installed. The following table describes the mappings between CUDA versions and versions of deep learning frameworks.

| CUDA version | Default conda environment | Version of the deep learning framework that can be installed |
|---|---|---|
| CUDA 11.0 | aiacct_tr1.7.0_cu11.0_py36 | TensorFlow 2.4 |
| CUDA 10.1 | aiacct_tf2.1_cu10.1_py36 | TensorFlow 2.1 |
| CUDA 10.0 | aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36 | TensorFlow 1.15 + PyTorch 1.4.0 + MXNet 1.5.0, or TensorFlow 1.14 + PyTorch 1.3.0 + MXNet 1.4.0 |
| CUDA 9.0 | aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36 | TensorFlow 1.12 + PyTorch 1.3.0 + MXNet 1.5.0 |
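
The mapping between CUDA versions and default conda environments can also be expressed as a small lookup, which may be convenient in provisioning scripts. The following is a minimal sketch and not part of AIACC-Training: the function name default_env_for_cuda is hypothetical, and the environment names are copied from this topic, so check them against the output of conda env list on your instance.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with AIACC-Training): print the default
# conda environment name for a given CUDA version.
# The names below are copied from this topic; they may differ on your image.
default_env_for_cuda() {
    case "$1" in
        11.0) echo "aiacct_tr1.7.0_cu11.0_py36" ;;
        10.1) echo "aiacct_tf2.1_cu10.1_py36" ;;
        10.0) echo "aiacct_tf1.15_tr1.4.0_mx1.5.0_cu10.0_py36" ;;
        9.0)  echo "aiacct_tf1.12_tr1.3.0_mx1.5.0_cu9.0_py36" ;;
        *)    echo "unsupported CUDA version: $1" >&2; return 1 ;;
    esac
}

# Example: look up the environment for CUDA 10.1.
default_env_for_cuda 10.1
```

The returned name could then be passed to conda activate after Miniconda is initialized.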

Select a conda environment and install a deep learning framework

  1. Connect to the instance. For more information, see Connect to a Linux instance by using password authentication.
  2. Select a conda environment.
    1. Initialize Miniconda.
      . /root/miniconda/etc/profile.d/conda.sh
    2. View all conda environments.
      conda env list
      The following figure shows an example command output. (Figure: aiacc-training-envlist)
    3. Select a conda environment.
      conda activate [environments_name]
      The following figure shows an example command output. (Figure: aiacc-training-activate)
      In the preceding figure, aiacct_tf2.1_cu10.1_py36 indicates the following items:
      • TensorFlow 2.1
      • CUDA 10.1
      • Python 3.6
  3. Install a deep learning framework.
    install_frameworks.sh
    The install_frameworks.sh script contains the commands that install the deep learning frameworks compatible with the selected conda environment. The following figure shows a sample script. (Figure: install-frameworks-script)
    The following figure shows an example script output. (Figure: install-frameworks)
  4. Test the demo.
    By default, the ali-perseus-demos.tgz demo file is located in the /root directory. In this example, the TensorFlow demo is tested.
    • For TensorFlow 2.1, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow2-examples
      3. Run the test script in the directory.
        Example command:
        python tensorflow2_keras_mnist_perseus.py
        This demo trains a model on the Modified National Institute of Standards and Technology (MNIST) dataset. AIACC-Training is intended to improve training performance while achieving the same level of precision as your benchmark code. The following figure shows an example training result. (Figure: tf2.1-demo)
    • For TensorFlow 1.14, perform the following operations:
      1. Decompress the demo test package.
        tar -xvf ali-perseus-demos.tgz
      2. Go to the directory of the TensorFlow demo.
        cd ali-perseus-demos/tensorflow-benchmarks
      3. View the test command in README.txt.
      4. Go to the directory where the test script of the corresponding version resides.
        Example command:
        cd benchmarks-tf1.14
      5. Modify and run the test command based on the number of GPUs with which the specified instance type is equipped.
        Example command:
        mpirun --allow-run-as-root --bind-to none -np 1 -npernode 1  \
               --mca btl_tcp_if_include eth0  \
               --mca orte_keep_fqdn_hostnames t   \
               -x NCCL_SOCKET_IFNAME=eth0   \
               -x LD_LIBRARY_PATH   \
               ./config-fp16-tf.sh
        This demo uses synthetic data for training to test the training speed. The following figure shows an example training result. (Figure: tf1.14-demo)
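
The TensorFlow 1.14 steps ask you to adapt the mpirun command to the number of GPUs on the instance. The following sketch shows one way to generate that command for a given GPU count. It rests on assumptions that are not confirmed by this topic: the function name build_benchmark_cmd is hypothetical, and both -np and -npernode are set to the GPU count on a single instance. The script only prints the command so that you can compare it with README.txt before you run anything.

```shell
#!/bin/sh
# Hypothetical helper: print the benchmark command for a given number of GPUs.
# ASSUMPTION: on a single instance, -np and -npernode both equal the GPU count.
# All other options are copied unchanged from the example in this topic.
build_benchmark_cmd() {
    gpus="$1"
    # Reject empty, non-numeric, or zero values.
    case "$gpus" in
        ''|*[!0-9]*|0) echo "GPU count must be a positive integer" >&2; return 1 ;;
    esac
    printf 'mpirun --allow-run-as-root --bind-to none -np %s -npernode %s' "$gpus" "$gpus"
    printf ' --mca btl_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t'
    printf ' -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH ./config-fp16-tf.sh\n'
}

# On a GPU instance, the count could be taken from nvidia-smi, for example:
#   build_benchmark_cmd "$(nvidia-smi -L | grep -c '^GPU ')"
build_benchmark_cmd 8
```

Printing the command instead of executing it keeps the sketch safe to run on any machine, including one without GPUs or OpenMPI.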

Delete Miniconda

You can delete Miniconda if AIACC-Training is no longer needed. By default, the root user can install and delete Miniconda.

  1. Delete the miniconda folder.
    rm -rf /root/miniconda
  2. Remove the related environment variables and output.
    1. Modify the /root/.bashrc file and comment out the environment variables and output related to Miniconda and AIACC-Training.
      The following figure shows an example of the modified file. (Figure: bashrc-file)
    2. Make the changes to the environment variables take effect.
      source /root/.bashrc
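
Commenting out the Miniconda and AIACC-Training lines in /root/.bashrc can also be scripted. The following is a minimal sketch under a stated assumption: the lines to disable contain the text "miniconda" or "aiacc" (case-insensitive). This topic does not list the exact lines, so inspect your own file first. The function name comment_out_conda_lines is hypothetical, and a backup copy with the .bak suffix is kept.

```shell
#!/bin/sh
# Hypothetical helper: comment out lines that mention Miniconda or AIACC.
# ASSUMPTION: the relevant lines contain "miniconda" or "aiacc"
# (case-insensitive). Review the file manually before relying on this.
comment_out_conda_lines() {
    file="$1"
    cp "$file" "$file.bak" || return 1   # keep a backup of the original file
    awk '
        # Leave lines that are already comments untouched.
        /^[ \t]*#/ { print; next }
        # Prefix matching lines with "# " to disable them.
        tolower($0) ~ /miniconda|aiacc/ { print "# " $0; next }
        { print }
    ' "$file.bak" > "$file"
}

# Example usage (run as the root user):
#   comment_out_conda_lines /root/.bashrc
#   source /root/.bashrc
```

If the result is not what you expect, the original file can be restored from the .bak copy.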