Elastic GPU Service FAQ

Last Updated: Mar 04, 2024

This topic provides answers to commonly asked questions about Elastic GPU Service to help you troubleshoot related issues.

Why do GPU-accelerated Windows instances not support features such as DirectX?

Windows Remote Desktop Protocol (RDP) does not support DirectX or Open Graphics Library (OpenGL) applications. On your GPU-accelerated Windows instances, you must install a VNC server and client, such as TightVNC, or use remote connection clients that support protocols such as PC over IP (PCoIP) and XenDesktop HDX 3D.

Do GPU-accelerated instances support Android emulators?

Only the following GPU-accelerated compute-optimized Elastic Compute Service (ECS) Bare Metal Instance families support Android emulators: ebmgn7e, ebmgn7i, ebmgn7, ebmgn6ia, ebmgn6e, ebmgn6v, and ebmgn6i.

Can I change the instance types of GPU-accelerated instances?

You can change the instance types of GPU-accelerated instances within the same instance family except for the following instance families:

  • gn5: a GPU-accelerated compute-optimized instance family that uses local storage

  • vgn5i: a vGPU-accelerated instance family

For more information, see Instance families that support instance type changes.

Do pay-as-you-go GPU-accelerated instances support the economical mode?

The GPU-accelerated instance families that use local storage, such as gn5, do not support the economical mode. For more information, see Economical mode.

What are the differences between GPUs and CPUs?

The following comparison describes the differences between GPUs and CPUs.

  • ALU: A GPU contains a large number of arithmetic logic units (ALUs) that can be used for large-scale parallel computing. A CPU contains a small number of powerful ALUs.

  • Logic control unit: A GPU uses simple logic control units. A CPU uses complex logic control units.

  • Cache: A GPU has a small cache that serves threads and is not used to store data for later access. A CPU has a large cache that stores data to speed up subsequent access and reduce latency.

  • Response mode: A GPU collects tasks and then processes them in batches. A CPU can respond to a task in real time.

  • Scenario: GPUs suit compute-intensive, high-throughput scenarios in which multiple threads run in parallel to process highly similar tasks. CPUs suit serial computing scenarios that involve complex logic and require fast responses.

Can regular ECS instance families be upgraded or changed to GPU-accelerated instance families?

No, regular ECS instance families cannot be upgraded or changed to GPU-accelerated instance families.

For more information, see Instance families that support instance type changes.

Why am I unable to view GPUs by running the nvidia-smi command after I purchase GPU-accelerated instances?

In most cases, you are unable to view GPUs by running the nvidia-smi command because no NVIDIA driver is installed on the GPU-accelerated instances. Install drivers based on the instance families of the GPU-accelerated instances.

For more information about the installation scenarios and guidelines of drivers, see Installation guideline for NVIDIA Tesla and GRID drivers.
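Before you install a driver, you can confirm that the GPU device is visible to the instance and that no NVIDIA kernel module is loaded yet. The following is a minimal sketch for a Linux instance; it only checks the current state and does not change anything:

  # Confirm that the GPU appears on the PCI bus. A Tesla or GRID device should be listed.
  lspci | grep -i nvidia

  # Check whether an NVIDIA kernel module is loaded. Empty output means that no driver is installed yet.
  lsmod | grep nvidia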

Which drivers do I need to install on vGPU-accelerated instances?

The instances of vGPU-accelerated instance families such as vgn6i and vgn5i are configured with vGPUs that are generated from GPU virtualization based on the mediated pass-through method. You can install only GRID drivers on these instances. Install the GRID driver that matches the OS type of your vGPU-accelerated instances.
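After you install a GRID driver, you can check whether the driver is loaded and licensed from within a Linux vGPU-accelerated instance. The following is a minimal sketch; the exact license fields in the output vary by driver version:

  # Confirm that the vGPU is recognized by the GRID driver.
  nvidia-smi

  # Query the licensing information that the GRID driver reports.
  nvidia-smi -q | grep -i -A 2 license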

Why is the CUDA version displayed by the nvidia-smi command different from the CUDA version that I selected when I created a GPU-accelerated instance?

The CUDA Version field in the nvidia-smi output indicates the highest CUDA version that the installed driver supports, not the CUDA version that you selected when you created the GPU-accelerated instance.
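To check the CUDA toolkit version that is actually installed, compare the two values. The following is a minimal sketch for a Linux instance; it assumes that the CUDA toolkit is installed and that nvcc is on the PATH:

  # The CUDA Version shown here is the highest version that the installed driver supports.
  nvidia-smi | grep "CUDA Version"

  # The version of the CUDA toolkit that is actually installed.
  nvcc --version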

Which drivers do I need to install when I use tools such as OpenGL and Direct3D for graphics computing on GPU-accelerated compute-optimized instances?

You can install drivers on GPU-accelerated compute-optimized instances based on the OS types of the instances. For more information about the installation scenarios and guidelines of drivers, see Installation guideline for NVIDIA Tesla and GRID drivers.

How do I view GPU monitoring data?

You can log on to the CloudMonitor console or call the DescribeMetricList operation to view the GPU monitoring data. For more information, see GPU monitoring.
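For example, if you use the Alibaba Cloud CLI, a call to the DescribeMetricList operation may look like the following sketch. The namespace, metric name, and instance ID are placeholders for illustration; use the metric names that are listed in GPU monitoring:

  # Query a GPU utilization metric of one instance over 60-second periods.
  # The namespace and metric name are examples; replace them with the values from GPU monitoring.
  aliyun cms DescribeMetricList \
    --Namespace acs_ecs_dashboard \
    --MetricName gpu_gpu_usedutilization \
    --Dimensions '[{"instanceId":"i-bp1example"}]' \
    --Period 60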

How do I transmit data between GPU-accelerated instances and regular ECS instances?

GPU-accelerated instances deliver the same user experience as regular ECS instances and additionally provide GPU acceleration. By default, data can be transmitted over the internal network between GPU-accelerated instances and regular ECS instances that belong to the same security group. You do not need to configure network connectivity.
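For example, you can copy files from a regular ECS instance to a GPU-accelerated instance over the internal network by using the private IP address of the GPU-accelerated instance. The following is a minimal sketch; the IP address and paths are placeholders:

  # Run this command on the regular ECS instance. 192.168.XX.XX is the private IP address of the GPU-accelerated instance.
  scp -r /data/training_set root@192.168.XX.XX:/data/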

What do I do if a black screen appears on a VNC client when I use the VNC client to connect to a GPU-accelerated Windows instance on which a GRID driver is installed?

Cause: After you install a GRID driver on a GPU-accelerated Windows instance, the GRID driver takes over the display output of the virtual machine. The Virtual Network Computing (VNC) client can no longer obtain the display output that is processed by the integrated GPU on the instance. As a result, a black screen appears on the VNC client. This behavior is expected.

Solution: Connect to the GPU-accelerated Windows instance from a client on your computer. For more information, see Connect to a Windows instance by using a username and password.

How do I view the details of the GPUs that are used by GPU-accelerated instances?

The methods that you can use to view the details of the GPUs vary based on the OS types of your GPU-accelerated instances. The following information describes how to view the details:

  • If your GPU-accelerated instances run Linux, run the nvidia-smi command to view the details of the GPUs.

  • If your GPU-accelerated instances run Windows, view the details of the GPUs in Device Manager on the instances.

If you want to view information about the GPUs, such as the idle rate, utilization, temperature, and power, go to the CloudMonitor console. For more information, see GPU monitoring.
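For example, on a Linux instance, you can query specific GPU properties in a single command. The following is a minimal sketch that uses the query fields of nvidia-smi:

  # List the GPU name, driver version, temperature, utilization, power draw, and memory usage in CSV format.
  nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,power.draw,memory.total,memory.used --format=csv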

How do I obtain GRID licenses?

You can obtain GRID licenses based on the OS types of your GPU-accelerated instances.

How do I install cGPU?

We recommend that you install and use cGPU by using the GPU sharing component that is provided by Container Service for Kubernetes (ACK). For more information, see Configure the GPU sharing component.

How do I disable automatic installation of a GPU driver when I change the OS?

If you select Auto-install GPU Driver when you create a GPU-accelerated instance, the instance is created with a GPU driver automatically installed. If you want to change the OS of the instance and disable automatic installation of the GPU driver, perform the following steps.

Note

For more information about how to automatically install a GPU driver when you create a GPU-accelerated instance, see Create a GPU-accelerated Linux instance configured with a GPU driver.

  1. Stop the GPU-accelerated instance.

    For more information, see Stop an instance.

  2. On the Instance page, find the stopped GPU-accelerated instance and click the More icon in the Actions column. In the Instance Settings section, click Set User Data.

  3. In the User Data field, delete the user data and click Confirm.

  4. Change the OS of the GPU-accelerated instance.

    The system changes the OS of a GPU-accelerated instance by replacing the system disk of the instance with a new image. For more information, see Replace the operating system (system disk) of an instance.
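After the OS is changed and the instance is running again, you can confirm from within the instance that no user data remains, because the instance metadata service exposes the current user data. The following is a minimal sketch that you run on the instance:

  # Query the user data of the current instance from the metadata service.
  # A 404 response indicates that no user data is configured, so no GPU driver is installed automatically.
  curl -i http://100.100.100.200/latest/user-data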

What do I do if a GPU has fallen off the bus due to an XID 119 or XID 120 error?

  • Problem description

    A GPU has fallen off the bus. For example, an error message that indicates that a GPU fails to start appears when you use the GPU on a Linux machine. After you run the sh nvidia-bug-report.sh command, you can find XID 119 or XID 120 error messages in the generated log.

    Note

    For more information, see Common XID Errors in the official NVIDIA documentation.

  • Cause

    The preceding issue may be caused by the abnormal running status of the GPU System Processor (GSP) component of the GPU. NVIDIA does not provide a specific driver version to fix the issue. We recommend that you disable the GSP feature before you use the GPU.

    Note

    For more information about GSP, see Chapter 42. GSP Firmware in the official NVIDIA documentation.

  • Solution

    1. Log on to the GPU-accelerated instance.

      For more information, see Connect to a Linux instance by using a password or key.

    2. Run the following command to disable the GSP component:

      echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
    3. Restart the GPU-accelerated instance.

      For more information, see Restart instances.

    4. Log on to the GPU-accelerated instance again.

    5. Run the following command to obtain the value of the EnableGpuFirmware parameter:

      cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
      • If EnableGpuFirmware:0 is returned, the GSP component is disabled and the issue is fixed.

        Important

        In this case, you can run the nvidia-smi command to check the GPU status, which is expected to be normal.

      • If EnableGpuFirmware:0 is not returned, the GSP component is not disabled. Proceed to the next step.

    6. Run the following command to check whether the GPU is running as expected:

      nvidia-smi

      If an error is returned, the issue persists on the GPU. Contact Alibaba Cloud technical support to shut down the instance and migrate data.

What do I do if the "undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12" error message is reported when I use PyTorch on a GPU-accelerated Linux instance?

  • Problem description

    When you use PyTorch on a GPU-accelerated Linux instance, the following error message is reported:

    >>> import torch
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 235, in <module>
        from torch._C import *  # noqa: F403
    ImportError: /usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
  • Cause

    The CUDA version installed on the GPU-accelerated instance may be incompatible with the PyTorch version. For more information about the mapping between CUDA and PyTorch versions, see the Previous PyTorch versions tab in the official PyTorch documentation.

    In this example, running the pip install torch command installs PyTorch 2.1.2, which requires CUDA 12.1. However, the CUDA version that was automatically installed on the purchased GPU-accelerated instance is 12.0, which does not match the CUDA version that PyTorch requires. You can compare the two versions as shown in the verification sketch at the end of this answer.

  • Solution

    If you selected Auto-install GPU Driver on the Public Images tab in the Image section when you purchased a GPU-accelerated instance, you can change the CUDA version to 12.1 by using one of the following methods:

    • Method 1: Manually install CUDA

      Manually install CUDA of version 12.1. For more information, see NVIDIA CUDA Installation Guide for Linux.

    • Method 2: Install CUDA by using a custom script

      1. Release the GPU-accelerated instance.

        For more information, see Release instances.

      2. Purchase a new GPU-accelerated instance.

        For more information, see Create a GPU-accelerated instance. The following section describes how to configure key parameters:

        • On the Public Images tab in the Image section, do not select Auto-install GPU Driver.

        • In the field in the User Data part of the Advanced Settings (Optional) section, enter a custom script to install the NVIDIA Tesla driver of version 535.129.03 and CUDA of version 12.1.1. The following sample code provides an example of a custom script:

          Sample code of a custom script

          #!/bin/sh
          
          #Please input version to install
          DRIVER_VERSION="535.129.03"
          CUDA_VERSION="12.1.1"
          CUDNN_VERSION="8.9.7.29"
          IS_INSTALL_eRDMA="FALSE"
          IS_INSTALL_RDMA="FALSE"
          IS_INSTALL_AIACC_TRAIN="FALSE"
          IS_INSTALL_AIACC_INFERENCE="FALSE"
          IS_INSTALL_RAPIDS="FALSE"
          INSTALL_DIR="/root/auto_install"
          
          #using .run to install driver and cuda
          auto_install_script="auto_install.sh"
          
          script_download_url=$(curl http://100.100.100.200/latest/meta-data/source-address | head -1)"/opsx/ecs/linux/binary/script/${auto_install_script}"
          echo $script_download_url
          
          rm -rf $INSTALL_DIR
          mkdir -p $INSTALL_DIR
          cd $INSTALL_DIR
          wget -t 10 --timeout=10 $script_download_url && bash ${INSTALL_DIR}/${auto_install_script} $DRIVER_VERSION $CUDA_VERSION $CUDNN_VERSION $IS_INSTALL_AIACC_TRAIN $IS_INSTALL_AIACC_INFERENCE $IS_INSTALL_RDMA $IS_INSTALL_eRDMA $IS_INSTALL_RAPIDS
    • Method 3: Modify a custom script and change the OS

      1. Stop the GPU-accelerated instance.

        For more information, see Stop instances.

      2. On the Instance page, find the stopped GPU-accelerated instance and click the More icon in the Actions column. In the Instance Settings section, click Set User Data.

      3. Modify the user data and click Confirm.

        In this example, the values of the DRIVER_VERSION, CUDA_VERSION, and CUDNN_VERSION parameters are changed to the following versions:

        ...
        DRIVER_VERSION="535.129.03"
        CUDA_VERSION="12.1.1"
        CUDNN_VERSION="8.9.7.29"
        ...

      4. Change the OS of the GPU-accelerated instance.

        For more information, see Replace the operating system (system disk) of an instance.

        After the GPU-accelerated instance is restarted, the system re-installs the new versions of the NVIDIA Tesla driver, CUDA, and cuDNN.
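    After the reinstallation completes, you can verify that the CUDA toolkit and the installed PyTorch build now target the same CUDA version. The following is a minimal sketch; it assumes that python3 is available and that nvcc is on the PATH:

      # Version of the installed CUDA toolkit (expected: 12.1).
      nvcc --version

      # PyTorch version and the CUDA version that the installed PyTorch build targets (expected: 12.1).
      python3 -c "import torch; print(torch.__version__, torch.version.cuda)"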