This topic describes how to create an NVIDIA GPU-accelerated instance and install a GPU driver so that the instance can use its GPUs.

Prerequisites

You must complete the following preparations to create an ECS instance:
  1. Create an account and complete the account information.
  2. Alibaba Cloud provides a default VPC in each region. If you do not want to use the default VPC, you can create a VPC and a VSwitch in the region where you plan to create the instance. For more information, see Create an IPv4 VPC network.
  3. Alibaba Cloud provides a default security group in each region. If you do not want to use the default security group, you can create a security group in the region where you plan to create the instance. For more information, see Create a security group.
If you need other extended features, you must complete corresponding preparations:
  • To specify an SSH key pair when you create a Linux instance, you must create the SSH key pair in the corresponding region. For more information, see Create an SSH key pair.
  • To add user data for the instance, you must first prepare user data. For more information about how to prepare user data, see Prepare user data.
  • To associate an ECS instance with an instance RAM role, you must create the RAM role, attach permission policies to the role, and then bind the role to the instance. For more information, see Bind an instance RAM role.

Procedure

This section covers the settings that require particular attention when you create an NVIDIA GPU-accelerated instance in the ECS console. For other general configurations, see Create an instance by using the wizard.
Note If you call the RunInstances operation to create an instance, you can upload the automatic installation script only by setting the UserData parameter. For information about how to prepare the automatic installation script, see the Automatic installation script section in this topic.
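The RunInstances operation expects the UserData value to be Base64-encoded. The following is a minimal sketch of one way to encode a locally saved copy of the script. The function name encode_user_data and the file name in the usage comment are hypothetical, and the sketch assumes that a base64 utility is available on the machine where you prepare the request.

```shell
#!/bin/sh
# Encode a user-data script as a single-line Base64 string.
# Assumption: the path of the saved script is passed as the first argument.
encode_user_data() {
    # tr strips the line wraps that some base64 implementations insert.
    base64 < "$1" | tr -d '\n'
}

# Example usage (hypothetical file name):
#   USER_DATA=$(encode_user_data ./gpu_auto_install.sh)
#   Then pass "$USER_DATA" as the UserData parameter of RunInstances.
```

Keeping the encoded value on a single line makes it easier to pass as one parameter value.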
  1. Go to the Custom Launch tab in the ECS console.
  2. Configure the settings in the Basic Configurations step.
    Note GPU-accelerated instance types are available only in specific regions and zones. For more information, see ECS Instance Types Available for Each Region. Select a billing method and enter an instance type name to search for the instance type.
    The following table describes the parameters that require attention.
    Parameter Description
    Instance Type Set Architecture to Heterogeneous Computing and then set Category to Virtualization Compute Optimized Type with GPU or Compute Optimized Type with GPU. Then, select an instance type.
    The selected instance type affects the types of drivers that can be installed on the instance. The instances of vGPU-accelerated instance families such as vgn6i and vgn5i are generated from a full GPU virtualization solution with mediated pass-through. You can install only GRID drivers on these instances. However, you can install GPU drivers and GRID drivers on GPU-accelerated compute optimized instances.
    • GPU drivers: used to drive physical GPUs.
    • GRID drivers: used to provide instances with graphics acceleration capabilities.
    Image The selected image affects how the GPU driver and GRID driver are installed. For more information, see Table 1.
    The following table describes how the drivers are installed.
    Table 1. Installation methods of the drivers
    Instance type Driver type Installation method of the driver
    vGPU-accelerated instance family such as vgn6i and vgn5i GRID driver No images that are pre-installed with GRID drivers are provided. You must purchase a GRID license, and manually install a GRID driver and activate the license after the instance is created.
    GPU-accelerated compute optimized instance families GPU driver You can use one of the following methods to install the GPU driver:
    • Select Auto-install GPU Driver. For more information, see Configure the automatic installation script.
      Note Only some Linux public images allow the GPU driver to be automatically installed when you create instances. If you select Shared Image or Custom Image when you create an instance, you can install the GPU driver only after you create the instance.
    • Select an Alibaba Cloud Marketplace image that is pre-installed with a GPU driver and relevant software. Alibaba Cloud Marketplace provides images that have operating systems, application environments, and various software pre-installed. Alibaba Cloud Marketplace images are reviewed by Alibaba Cloud to ensure quality and stability. You can use these images to deploy ECS instances without additional configurations.

      For example, you can select the NVIDIA GPU Cloud Virtual Machine Image deep learning image. The image is pre-installed with a deep learning framework optimized for NVIDIA GPUs and an optimized environment for HPC application containers. For more information, see Deploy an NGC environment on instances with GPU capabilities.

    • Manually install the GPU driver after you create the instance. For more information, see Manually install a GPU driver.
    GPU-accelerated compute optimized instance families GRID driver GPU-accelerated compute optimized instances can be installed with a GPU driver. However, no images that are pre-installed with GRID drivers are provided to create instances. You must purchase a GRID license, and manually install a GRID driver and activate the license after the instance is created.
  3. Configure the settings in the Networking step.
    The following table describes the parameters that require attention.
    Parameter Description
    Network Type Select VPC.
    Public IP Address If you select an image of Windows 2008 R2 or an earlier version in the Basic Configurations step, you cannot connect to the instance by using a VNC management terminal after the GPU driver is installed. A black screen or the startup interface persists when you attempt to connect to the instance. You can select Assign Public IPv4 Address in the Public IP Address section in the Networking step, or associate an elastic IP address (EIP) after you create the instance. This way, you can connect to the instance over other protocols such as Remote Desktop Protocol (RDP), PC over IP (PCoIP), and XenDesktop HDX 3D.
    Note RDP does not support applications such as DirectX and OpenGL. You must install the VNC service and client on your own.
  4. Configure the settings in the System Configurations step.
    The following table describes the parameters that require attention.
    Parameter Description
    Logon Credentials We recommend that you select Key Pair or Password. If you select Set Later, you must bind an SSH key pair or set a password by using the password reset feature before you can connect to the instance by using a VNC management terminal. Then, you must restart the instance for the modification to take effect. If you restart the instance while the GPU driver is being installed, the installation fails.
    User Data cloud-init automatically runs the script entered in the User Data section when the instance is started for the first time after the instance is created.
    • If you selected Auto-install GPU Driver, Auto-install AIACC-Training, or Auto-install AIACC-Inference in the Basic Configurations step, the automatic installation script is displayed in the User Data section.
    • If you did not select Auto-install GPU Driver, Auto-install AIACC-Training, or Auto-install AIACC-Inference in the Basic Configurations step, you can manually enter the automatic installation script in the User Data section. For information about how to prepare the automatic installation script, see the Automatic installation script section in this topic.
  5. Configure the parameters in the Grouping step, confirm the configurations in the Preview step, and then click Create Order or Create Instance.
    If you enter the automatic installation script in the User Data section, the GPU driver, AIACC-Training, or AIACC-Inference is automatically installed on the instance after the instance is started. After the GPU driver is installed, the instance is automatically restarted so that the GPU driver takes effect.
    Note The GPU driver is more stable in persistence mode. The automatic installation script automatically enables the persistence mode for the GPU driver. Then, the script adds the corresponding commands as a Linux system service to ensure that the persistence mode is automatically enabled for the GPU driver on instance startup.
    The automatic installation process may take 10 to 20 minutes based on the internal bandwidth and the number of CPU cores of different instance types. You can connect to the instance to view the installation process. You can also view the /root/auto_install/auto_install.log installation log after the installation is complete. The following table describes the display effects of the installation process.
    Installation process Display effect
    The installation is in progress. The installation progress bar is displayed.
    The installation succeeds. ALL INSTALL OK appears as the installation result.
    The installation fails. INSTALL FAIL appears as the installation result.
    Notice When the installation is in progress, the GPUs are unavailable. To prevent installation failures and keep the instance available, do not perform operations or install other GPU-related software on the instance until the installation is complete.
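The result check described above can be sketched as a small shell function. This is an illustration only: it assumes that the result strings ALL INSTALL OK and INSTALL FAIL are also written verbatim to /root/auto_install/auto_install.log, and the function name install_status and its output labels are hypothetical.

```shell
#!/bin/sh
# Report the installation status based on the result strings listed in the table.
# Assumption: the log file contains the same strings that are displayed on screen.
install_status() {
    log_file="${1:-/root/auto_install/auto_install.log}"
    if [ ! -f "$log_file" ]; then
        echo "UNKNOWN"        # no log yet, or a different path
    elif grep -q "INSTALL FAIL" "$log_file"; then
        echo "FAILED"
    elif grep -q "ALL INSTALL OK" "$log_file"; then
        echo "OK"
    else
        echo "IN_PROGRESS"    # log exists but contains no result yet
    fi
}

# Example usage on the instance:
#   install_status
```

Because the function only reads the log, it does not interfere with a running installation.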

Configure the automatic installation script

When you create an instance in the ECS console, you can select Auto-install GPU Driver, Auto-install AIACC-Training, or Auto-install AIACC-Inference in the Image section of the Basic Configurations step. If you select Auto-install GPU Driver, the GPU driver, CUDA, and the NVIDIA CUDA Deep Neural Network library (cuDNN) are installed.
The following section describes the features of GPU drivers, AIACC-Training, and AIACC-Inference, and the available versions of the GPU driver, CUDA, and the cuDNN library.
  • GPU drivers are used to drive physical GPUs. GPU drivers work efficiently when used together with CUDA and the cuDNN library. For a new business system, we recommend that you select the latest versions of the GPU driver, CUDA, and the cuDNN library. The following table lists the available versions of the GPU driver, CUDA, and the cuDNN library.
    CUDA | GPU driver | cuDNN | Supported version of the public image (only images supplied and tested by Alibaba Cloud) | Supported instance family
    11.0.2 | 450.80.02 | 8.0.4 | Alibaba Cloud Linux 2; Ubuntu 20.04, Ubuntu 18.04, and Ubuntu 16.04; CentOS 8.x and CentOS 7.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
    10.2.89 | 450.80.02, 440.64.00 | 8.0.4, 7.6.5 | Alibaba Cloud Linux 2; Ubuntu 18.04 and Ubuntu 16.04; CentOS 8.x, CentOS 7.x, and CentOS 6.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
    10.1.168 | 450.80.02, 440.64.00 | 8.0.4, 7.6.5, 7.5.0 | Ubuntu 18.04 and Ubuntu 16.04; CentOS 7.x and CentOS 6.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
    10.0.130 | 450.80.02, 440.64.00 | 7.6.5, 7.5.0, 7.4.2, 7.3.1 | Ubuntu 18.04 and Ubuntu 16.04; CentOS 7.x and CentOS 6.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
    9.2.148 | 450.80.02, 440.64.00, 390.116 | 7.6.5, 7.5.0, 7.4.2, 7.3.1, 7.1.4 | Ubuntu 16.04; CentOS 7.x and CentOS 6.x | gn6v, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6e, and ebmgn5i
    9.0.176 | 450.80.02, 440.64.00, 390.116 | 7.6.5, 7.5.0, 7.4.2, 7.3.1, 7.1.4, 7.0.5 | Ubuntu 16.04; CentOS 7.x and CentOS 6.x; SUSE 12sp2 | gn6v, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6e, and ebmgn5i
    8.0.61 | 450.80.02, 440.64.00, 390.116 | 7.1.3, 7.0.5 | Ubuntu 16.04; CentOS 7.x and CentOS 6.x | gn5 and gn5i; ebmgn5i
    Note If you replace the operating system after the instance is created, make sure that you use an image that allows GPU drivers to be automatically installed to prevent failures in automatic installation.
  • AIACC-Training is an AI accelerator developed by Alibaba Cloud. AIACC-Training can accelerate major AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe to achieve significant gains in training performance. For more information, see Use AIACC-Training.
    Note AIACC-Training is not supported in CentOS 8, CentOS 6, SUSE Linux, or Alibaba Cloud Linux.
  • AIACC-Inference is an AI accelerator developed by Alibaba Cloud. AIACC-Inference can accelerate the major AI computing framework TensorFlow and exportable frameworks in the Open Neural Network Exchange (ONNX) format to achieve significant gains in inference performance. For more information, see Use AIACC-Inference.
    Note AIACC-Inference is not supported in CentOS 8, CentOS 6, SUSE Linux, or Alibaba Cloud Linux.
If you selected Auto-install GPU Driver, Auto-install AIACC-Training, or Auto-install AIACC-Inference in the Basic Configurations step, the automatic installation script is displayed in the User Data section of the System Configurations step. cloud-init automatically runs the automatic installation script when the instance is started for the first time after the instance is created.
Note If you did not select Auto-install GPU Driver, Auto-install AIACC-Training, or Auto-install AIACC-Inference in the Basic Configurations step, you can manually enter the automatic installation script in the System Configurations step. For information about how to prepare the automatic installation script, see the Automatic installation script section in this topic.
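Before you enter versions manually, it can help to confirm that the combination appears in the version table in this section. The following sketch transcribes the CUDA, GPU driver, and cuDNN columns of that table into a lookup. The function name is_listed_combo is hypothetical, and the check does not cover the image or instance family columns.

```shell
#!/bin/sh
# Return 0 if the CUDA / GPU driver / cuDNN combination is listed in the table, 1 otherwise.
# The values are transcribed from the table in this topic and are not queried from any service.
is_listed_combo() {
    cuda="$1"; driver="$2"; cudnn="$3"
    case "$cuda" in
        11.0.2)   drivers="450.80.02";                   cudnns="8.0.4" ;;
        10.2.89)  drivers="450.80.02 440.64.00";         cudnns="8.0.4 7.6.5" ;;
        10.1.168) drivers="450.80.02 440.64.00";         cudnns="8.0.4 7.6.5 7.5.0" ;;
        10.0.130) drivers="450.80.02 440.64.00";         cudnns="7.6.5 7.5.0 7.4.2 7.3.1" ;;
        9.2.148)  drivers="450.80.02 440.64.00 390.116"; cudnns="7.6.5 7.5.0 7.4.2 7.3.1 7.1.4" ;;
        9.0.176)  drivers="450.80.02 440.64.00 390.116"; cudnns="7.6.5 7.5.0 7.4.2 7.3.1 7.1.4 7.0.5" ;;
        8.0.61)   drivers="450.80.02 440.64.00 390.116"; cudnns="7.1.3 7.0.5" ;;
        *) return 1 ;;
    esac
    # Match whole tokens by padding both the list and the candidate with spaces.
    case " $drivers " in *" $driver "*) ;; *) return 1 ;; esac
    case " $cudnns " in *" $cudnn "*) ;; *) return 1 ;; esac
    return 0
}

# Example usage:
#   is_listed_combo 10.2.89 440.64.00 8.0.4 && echo "listed"
```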

Automatic installation script

The automatic installation script has been updated to v3.2. The latest version provides the following benefits:
  • Provides the latest versions of the GPU driver, CUDA, and the cuDNN library.
  • Shows the installation progress after you connect to the instance.
The following section lists the content of the automatic installation script:
#! /bin/sh

#Please input version to install
IS_INSTALL_AIACC_TRAIN=""
IS_INSTALL_AIACC_INFERENCE=""
DRIVER_VERSION=""
CUDA_VERSION=""
CUDNN_VERSION=""
IS_INSTALL_RAPIDS="FALSE"

INSTALL_DIR="/root/auto_install"

#using .deb to install driver and cuda on Ubuntu 18.04 and Ubuntu 16.04
#using .run to install driver and cuda on CentOS, SUSE, and Ubuntu 20.04
auto_install_script="auto_install_v3.2.sh"

script_download_url=$(curl http://100.100.100.200/latest/meta-data/source-address | head -1)"/opsx/ecs/linux/binary/script/${auto_install_script}"
echo $script_download_url

mkdir $INSTALL_DIR && cd $INSTALL_DIR
wget -t 10 --timeout=10 $script_download_url && sh ${INSTALL_DIR}/${auto_install_script} $DRIVER_VERSION $CUDA_VERSION $CUDNN_VERSION $IS_INSTALL_AIACC_TRAIN $IS_INSTALL_AIACC_INFERENCE $IS_INSTALL_RAPIDS
Note If you use a CentOS, SUSE, or Ubuntu 20.04 image to create the instance, the .run installation package is used when you run the automatic installation script. If you use an Ubuntu 18.04 or Ubuntu 16.04 image, the .deb installation package is used when you run the automatic installation script.
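The package selection described in the preceding note can be sketched as follows. This is not taken from the automatic installation script itself, whose internal logic is not shown in this topic. The function name package_type is hypothetical, and the sketch assumes that the operating system is identified by the ID and VERSION_ID fields of /etc/os-release (for example, ubuntu, centos, and sles).

```shell
#!/bin/sh
# Map an OS ID and version to the installation package type described in the note.
# Assumption: the arguments follow /etc/os-release conventions (ID, VERSION_ID).
package_type() {
    os_id="$1"; os_version="$2"
    case "$os_id" in
        ubuntu)
            case "$os_version" in
                16.04|18.04) echo "deb" ;;
                20.04)       echo "run" ;;
                *)           echo "unknown" ;;   # not covered by the note
            esac ;;
        centos|sles) echo "run" ;;
        *)           echo "unknown" ;;           # not covered by the note
    esac
}

# Example usage on an instance:
#   . /etc/os-release && package_type "$ID" "$VERSION_ID"
```

Images that the note does not mention are reported as unknown rather than guessed.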
To use the automatic installation script, you must modify the version parameters of the GPU driver, CUDA, and cuDNN library in the installation script, and specify whether to install AIACC-Training and AIACC-Inference.
  • If you want to install AIACC-Training, set IS_INSTALL_AIACC_TRAIN to TRUE. Otherwise, set IS_INSTALL_AIACC_TRAIN to FALSE.
  • If you want to install AIACC-Inference, set IS_INSTALL_AIACC_INFERENCE to TRUE. Otherwise, set IS_INSTALL_AIACC_INFERENCE to FALSE.
Example:
IS_INSTALL_AIACC_TRAIN="FALSE"
IS_INSTALL_AIACC_INFERENCE="FALSE"
DRIVER_VERSION="440.64.00"
CUDA_VERSION="10.2.89"
CUDNN_VERSION="8.0.4"
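If you keep a local copy of the script with the empty variables shown earlier, the values can be filled in with sed instead of by hand. The following is a minimal sketch under that assumption. The function name fill_versions and the file names in the usage comment are hypothetical, and both AIACC options are set to FALSE to match the example above.

```shell
#!/bin/sh
# Print a copy of the script template with the version variables filled in.
# Assumption: the template contains the VARIABLE="" lines exactly as listed in this topic.
fill_versions() {
    template="$1"; driver="$2"; cuda="$3"; cudnn="$4"
    sed -e "s/^DRIVER_VERSION=\"\"/DRIVER_VERSION=\"$driver\"/" \
        -e "s/^CUDA_VERSION=\"\"/CUDA_VERSION=\"$cuda\"/" \
        -e "s/^CUDNN_VERSION=\"\"/CUDNN_VERSION=\"$cudnn\"/" \
        -e "s/^IS_INSTALL_AIACC_TRAIN=\"\"/IS_INSTALL_AIACC_TRAIN=\"FALSE\"/" \
        -e "s/^IS_INSTALL_AIACC_INFERENCE=\"\"/IS_INSTALL_AIACC_INFERENCE=\"FALSE\"/" \
        "$template"
}

# Example usage (hypothetical file names):
#   fill_versions ./auto_install_template.sh 440.64.00 10.2.89 8.0.4 > ./user_data.sh
```

The function writes to standard output and leaves the template unchanged, so the same template can be reused for other version combinations.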