
Elastic GPU Service: Elastic GPU Service FAQ

Last Updated: Jul 18, 2024

This topic provides answers to commonly asked questions about Elastic GPU Service to help you troubleshoot related issues.


Why does Windows not support features such as DirectX?

Windows Remote Desktop Protocol (RDP) does not support applications that use DirectX or Open Graphics Library (OpenGL). To run such applications, install a VNC server and client such as TightVNC, or a remote connection client that supports a protocol such as PC over IP (PCoIP) or XenDesktop HDX 3D, on your Windows instance.

Do GPU-accelerated instances support Android emulators?

Only the following GPU-accelerated compute-optimized Elastic Compute Service (ECS) Bare Metal Instance families support Android emulators: ebmgn7e, ebmgn7i, ebmgn7, ebmgn6ia, ebmgn6e, ebmgn6v, and ebmgn6i.

Can I change the instance types of GPU-accelerated instances?

You can change the instance types of GPU-accelerated instances within the same instance family except for the following instance families:

  • gn5: a GPU-accelerated compute-optimized instance family that uses local storage

  • vgn5i: a vGPU-accelerated instance family

For more information, see Instance families that support instance type changes.

Do pay-as-you-go GPU-accelerated instances support the economical mode?

The GPU-accelerated instance families that use local storage, such as gn5, do not support the economical mode. For more information about the economical mode, see Economical mode.

What are the differences between GPUs and CPUs?

The following table describes the differences between GPUs and CPUs.

| Item | GPU | CPU |
| --- | --- | --- |
| ALU | A large number of arithmetic logic units (ALUs) that can be used for large-scale parallel computing. | A small number of powerful ALUs. |
| Logic control unit | Simple logic control units. | Complex logic control units. |
| Cache | A small cache that serves threads and is not used to store frequently accessed data. | A large cache that stores data to speed up data access and reduce latency. |
| Response mode | Collects tasks and then processes them in batches. | Responds to individual tasks in real time. |
| Scenarios | Compute-intensive, high-throughput scenarios in which many threads run in parallel to process highly similar tasks. | Serial computing scenarios that involve complex logic and require fast responses. |

Can regular ECS instance families be upgraded or changed to GPU-accelerated instance families?

No, regular ECS instance families cannot be upgraded or changed to GPU-accelerated instance families.

For more information, see Instance families that support instance type changes.

Why am I unable to view GPUs by running the nvidia-smi command after I purchase GPU-accelerated instances?

In most cases, you cannot view the GPUs of GPU-accelerated instances by running the nvidia-smi command because NVIDIA drivers are not installed, or failed to install, on the instances. Install the drivers that match the instance families of your GPU-accelerated instances.

For more information about the installation scenarios and guideline of drivers, see Installation guideline for NVIDIA Tesla and GRID drivers.
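Before reinstalling anything, you can check whether the NVIDIA kernel module has ever been loaded. The helper below is a hypothetical diagnostic sketch, not an official tool: it looks for the /proc/driver/nvidia/version file that the driver creates when it is loaded. The root-directory argument exists only so the sketch can be exercised against a temporary directory; pass an empty string to inspect the real /proc path on an actual instance.

```shell
# Hypothetical helper: the NVIDIA kernel module creates
# /proc/driver/nvidia/version when it is loaded. The $1 argument is a root
# directory to inspect; it exists only so this sketch can run anywhere.
check_driver() {
  if [ -e "$1/proc/driver/nvidia/version" ]; then
    echo "driver loaded"
  else
    echo "driver missing"
  fi
}

# Exercise the sketch against an empty temporary root (no driver files there):
root=$(mktemp -d)
result=$(check_driver "$root")
echo "$result"   # prints "driver missing"
```

If the file is missing on a real instance, install the driver as described above; if it exists but nvidia-smi still fails, the driver installation is likely corrupted.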

Which drivers do I need to install on vGPU-accelerated instances?

The instances of vGPU-accelerated instance families such as vgn6i and vgn5i are configured with vGPUs that are generated from GPU virtualization based on the mediated pass-through method. You can install only GRID drivers on the instances, and you must install the GRID drivers based on the OS types of your vGPU-accelerated instances.

Why is the CUDA version inconsistent with the version that I selected when I created a GPU-accelerated instance?

After you run the nvidia-smi command, the system displays the latest CUDA version that the installed driver supports, not the CUDA version that you selected when you created the GPU-accelerated instance.
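This can be seen in the nvidia-smi banner itself. The sketch below parses a hypothetical sample of the banner line (the version numbers are illustrative); the "CUDA Version" shown there is the highest CUDA release the driver supports, while nvcc --version reports the toolkit that is actually installed.

```shell
# Hypothetical sample of the first nvidia-smi banner line (illustrative values):
banner='| NVIDIA-SMI 535.129.03  Driver Version: 535.129.03  CUDA Version: 12.2 |'

# Extract the maximum CUDA version that the driver supports:
max_cuda=$(echo "$banner" | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
echo "$max_cuda"   # prints "12.2"
```

On a real instance, compare this value with the output of nvcc --version to see both numbers side by side.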

Which drivers do I need to install when I use tools such as OpenGL and Direct3D for graphics computing on GPU-accelerated compute-optimized instances?

You can install drivers on GPU-accelerated compute-optimized instances based on the OS types of the instances.

How do I view GPU monitoring data?

You can log on to the CloudMonitor console or call the DescribeMetricList operation to view the GPU monitoring data. For more information, see GPU monitoring.

How do I transmit data between GPU-accelerated instances and regular ECS instances?

GPU-accelerated instances deliver the same level of experience as regular ECS instances. GPU-accelerated instances also provide GPU acceleration. By default, data can be transmitted between GPU-accelerated instances and regular ECS instances that belong to the same security group over an internal network. You do not need to configure network connectivity.

What do I do if a black screen appears on a VNC client when I use the VNC client to connect to a GPU-accelerated Windows instance on which a GRID driver is installed?

Cause: After you install a GRID driver on a GPU-accelerated Windows instance, the GRID driver controls the output display of the virtual machine. The Virtual Network Computing (VNC) client can no longer obtain the output display that is processed by the integrated GPU on the instance, so a black screen appears on the VNC client. This behavior is expected.

Solution: Connect to the GPU-accelerated Windows instance from a client on your computer. For more information, see Connect to a Windows instance by using a username and password.

How do I view the details of the GPUs that are used by GPU-accelerated instances?

The methods that you can use to view the details of the GPUs vary based on the OS types of your GPU-accelerated instances. The following information describes how to view the details:

  • If your GPU-accelerated instances run Linux, run the nvidia-smi command to view the details of the GPUs.

  • If your GPU-accelerated instances run Windows, view the details of the GPUs in Device Manager on the instances.

If you want to view information about the GPUs, such as the idle rate, utilization, temperature, and power, go to the CloudMonitor console. For more information, see GPU monitoring.
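For scripting on Linux, nvidia-smi can also emit machine-readable output. The sketch below parses a hypothetical sample line (the GPU name and values are illustrative); the --query-gpu properties named in the comment are standard nvidia-smi query fields.

```shell
# Hypothetical sample of one line of:
#   nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu --format=csv,noheader
sample='NVIDIA A10, 37 %, 54'

# Split the comma-separated fields:
util=$(echo "$sample" | awk -F', ' '{print $2}')
temp=$(echo "$sample" | awk -F', ' '{print $3}')
echo "utilization=$util temperature=${temp}C"
```

The same parsing works on the live command output, which makes it easy to feed GPU utilization into your own monitoring scripts.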

How do I obtain GRID licenses?

You can obtain GRID licenses based on the OS types of your GPU-accelerated instances.

How do I install cGPU?

We recommend that you install cGPU by using the GPU sharing component that is provided by Container Service for Kubernetes (ACK). For more information, see Configure the GPU sharing component.

How do I disable automatic installation of a GPU driver when I change the OS?

If you select Auto-install GPU Driver when you create a GPU-accelerated instance, the instance is created with a GPU driver automatically installed. If you want to change the OS of the instance and disable automatic installation of the GPU driver, perform the following steps:

Note

For more information about how to automatically install a GPU driver when you create a GPU-accelerated instance, see Create a GPU-accelerated Linux instance configured with a GPU driver.

  1. Stop the GPU-accelerated instance.

    For more information, see Stop instances.

  2. On the Instances page, find the stopped GPU-accelerated instance and click the icon in the Actions column. In the Instance Settings section, click Set User Data.

  3. In the User Data field, delete the user data and click Confirm.

  4. Change the OS of the GPU-accelerated instance.

    The system changes the OS of a GPU-accelerated instance by replacing the system disk of the instance. Therefore, you can change the OS of a GPU-accelerated instance by replacing the image of the instance. For more information, see Replace the operating system (system disk) of an instance.
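The console steps above can, in principle, also be performed through the API. The sketch below is a hypothetical, unverified alternative: it assumes the aliyun CLI is installed and configured, and that the ECS ModifyInstanceAttribute operation accepts an empty UserData value to clear the field (verify this against the API reference before relying on it). The instance ID is a placeholder, and the command is only assembled and printed, not executed.

```shell
# Hypothetical sketch: clear the user data through the ECS API instead of the
# console. The instance ID is a placeholder; the command is not executed here.
instance_id="i-xxxxxxxxxxxxxxxxxxxx"
empty_user_data=$(printf '' | base64)   # UserData must be Base64-encoded
cmd="aliyun ecs ModifyInstanceAttribute --InstanceId $instance_id --UserData '$empty_user_data'"
echo "$cmd"
```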

What do I do if a GPU has fallen off the bus due to an XID 119 or XID 120 error?

  • Problem description

    A GPU has fallen off the bus. For example, an error message that indicates that a GPU fails to start appears when you use the GPU on a Linux machine. After you run the sh nvidia-bug-report.sh command, you can view XID 119 or XID 120 error messages in the log.

    Note

    For more information, see Common XID Errors in the official NVIDIA documentation.

  • Cause

    The preceding issue may be caused by the abnormal running status of the GPU System Processor (GSP) component of the GPU. NVIDIA does not provide a specific driver version to fix the issue. We recommend that you disable the GSP feature before you use the GPU.

    Note

    For more information about GSP, see Chapter 42. GSP Firmware in the official NVIDIA documentation.

  • Solution

    1. Log on to the GPU-accelerated instance.

      For more information, see Connect to a Linux instance by using a password or key.

    2. Run the following command to disable the GSP component:

      echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
    3. Restart the GPU-accelerated instance.

      For more information, see Restart instances.

    4. Log on to the GPU-accelerated instance again.

    5. Run the following command to obtain the value of the EnableGpuFirmware parameter:

      cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
      • If EnableGpuFirmware:0 is returned, the GSP component is disabled and the issue is fixed.

        Important

        In this case, you can run the nvidia-smi command to check the GPU status, which is expected to be normal.

      • If EnableGpuFirmware:0 is not returned, the GSP component is not disabled. Proceed to the next step.

    6. Run the following command to check whether the NVIDIA GPU is running as expected:

      nvidia-smi

      If an error is returned, the issue persists on the GPU. Contact Alibaba Cloud technical support to shut down the instance and migrate data.
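The configuration written in step 2 can be sketched and sanity-checked offline. The snippet below writes the same modprobe option to a temporary file so that it can run anywhere; on a real instance the target path is /etc/modprobe.d/nvidia-gsp.conf, and a restart is required before /proc/driver/nvidia/params reflects the change.

```shell
# Write the GSP-disable option to a temp file (stand-in for
# /etc/modprobe.d/nvidia-gsp.conf on a real instance):
conf=$(mktemp)
echo "options nvidia NVreg_EnableGpuFirmware=0" > "$conf"

# Verify that the option was written as expected:
cat "$conf"   # prints "options nvidia NVreg_EnableGpuFirmware=0"
grep -q 'NVreg_EnableGpuFirmware=0' "$conf" && echo "GSP will be disabled on next module load"
```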

What do I do if the "undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12" error message is reported when I use PyTorch on a GPU-accelerated Linux instance?

  • Problem description

    When you use PyTorch on a GPU-accelerated Linux instance, the following error message is reported:

    >>> import torch
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 235, in <module>
        from torch._C import *  # noqa: F403
    ImportError: /usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
  • Cause

    The CUDA version installed on the GPU-accelerated instance may be incompatible with the PyTorch version. For more information about the mappings between CUDA and PyTorch versions, see the Previous PyTorch versions tab in the official PyTorch documentation.

    The PyTorch version that is installed by running the pip install torch command is 2.1.2, which requires CUDA 12.1. However, the CUDA version that is automatically installed on the purchased GPU-accelerated instance is 12.0, which does not match the CUDA version required by PyTorch.

  • Solution

    If you selected Auto-install GPU Driver on the Public Images tab in the Image section when you purchased a GPU-accelerated instance, you can change the CUDA version to 12.1 by using one of the following methods:

    • Method 1: Manually install CUDA

      Manually install CUDA of version 12.1. For more information, see NVIDIA CUDA Installation Guide for Linux.

    • Method 2: Install CUDA by using a custom script

      1. Release the GPU-accelerated instance.

        For more information, see Release instances.

      2. Purchase a new GPU-accelerated instance.

        For more information, see Create a GPU-accelerated instance. The following section describes how to configure key parameters:

        • On the Public Images tab in the Image section, do not select Auto-install GPU Driver.

        • In the field in the User Data part of the Advanced Settings(Optional) section, enter a custom script to install the NVIDIA Tesla driver of version 535.129.03 and CUDA of version 12.1.1. The following code shows a sample custom script:

          Sample custom script

          #!/bin/sh
          
          #Please input version to install
          DRIVER_VERSION="535.129.03"
          CUDA_VERSION="12.1.1"
          CUDNN_VERSION="8.9.7.29"
          IS_INSTALL_eRDMA="FALSE"
          IS_INSTALL_RDMA="FALSE"
          IS_INSTALL_AIACC_TRAIN="FALSE"
          IS_INSTALL_AIACC_INFERENCE="FALSE"
          IS_INSTALL_RAPIDS="FALSE"
          INSTALL_DIR="/root/auto_install"
          
          #using .run to install driver and cuda
          auto_install_script="auto_install.sh"
          
          script_download_url=$(curl http://100.100.100.200/latest/meta-data/source-address | head -1)"/opsx/ecs/linux/binary/script/${auto_install_script}"
          echo $script_download_url
          
          rm -rf $INSTALL_DIR
          mkdir -p $INSTALL_DIR
          cd $INSTALL_DIR
          wget -t 10 --timeout=10 $script_download_url && bash ${INSTALL_DIR}/${auto_install_script} $DRIVER_VERSION $CUDA_VERSION $CUDNN_VERSION $IS_INSTALL_AIACC_TRAIN $IS_INSTALL_AIACC_INFERENCE $IS_INSTALL_RDMA $IS_INSTALL_eRDMA $IS_INSTALL_RAPIDS
    • Method 3: Modify a custom script and change the OS

      1. Stop the GPU-accelerated instance.

        For more information, see Stop instances.

      2. On the Instances page, find the stopped GPU-accelerated instance and click the icon in the Actions column. In the Instance Settings section, click Set User Data.

      3. Modify the user data and click Confirm.

        In this example, the values of the DRIVER_VERSION, CUDA_VERSION, and CUDNN_VERSION parameters are changed to the following versions:

        ...
        DRIVER_VERSION="535.129.03"
        CUDA_VERSION="12.1.1"
        CUDNN_VERSION="8.9.7.29"
        ...


      4. Change the OS of the GPU-accelerated instance.

        For more information, see Replace the operating system (system disk) of an instance.

        After the GPU-accelerated instance is restarted, the system re-installs the new versions of the NVIDIA Tesla driver, CUDA, and cuDNN.
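The root cause above (PyTorch built for CUDA 12.1 while CUDA 12.0 is installed) boils down to a version comparison. The sketch below hard-codes the versions from this example; on a real instance you could obtain them from python3 -c 'import torch; print(torch.version.cuda)' and from the nvcc --version output.

```shell
# Versions from this example (hard-coded so the sketch is self-contained):
torch_cuda="12.1"    # CUDA release that the installed PyTorch build expects
system_cuda="12.0"   # CUDA toolkit installed on the instance

# A major.minor mismatch explains the undefined-symbol import error:
if [ "$torch_cuda" != "$system_cuda" ]; then
  echo "mismatch: PyTorch was built for CUDA $torch_cuda but CUDA $system_cuda is installed"
fi
```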

What do I do if Persistence-M and the ECC state become ineffective after an instance restart, even though I enabled Persistence-M by running the nvidia-smi -pm 1 command?

  • Problem description

    If you run the nvidia-smi -pm 1 command to enable Persistence-M when you install a Linux Tesla driver of version 535 or later on a GPU-accelerated compute-optimized instance, the following issues may be caused:

    • Persistence-M becomes ineffective after the instance is restarted. That is, Persistence-M is still in the default Off state.

    • The ECC state fails to be configured.

    • The MIG feature fails to be configured.

  • Cause

    Tesla drivers of version 535 or later do not support enabling Persistence-M by running the nvidia-smi -pm 1 command. As a result, the preceding issues occur after the instance is restarted.

  • Solution

    Check whether the following information exists in the dmesg log. If the following information exists, we recommend that you use the NVIDIA Persistence Daemon to enable Persistence-M. For more information, see (Optional) Enable Persistence-M by using the NVIDIA Persistence Daemon.

    NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
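A small sketch of how to detect this notice in saved log output. The sample string is the message quoted above; on a real instance, pipe dmesg into the same grep.

```shell
# Sample of the kernel log line quoted above (on a real instance, use:
#   dmesg | grep -q 'Please use nvidia-persistenced'):
log='NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.'

if echo "$log" | grep -q 'Please use nvidia-persistenced'; then
  echo "enable Persistence-M via the NVIDIA Persistence Daemon"
fi
```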

What do I do if an error is reported when an application runs in a CUDA environment of a lower version?

  • Problem description

    For a vGPU-accelerated Linux instance on which a GRID driver and a lower CUDA version are installed, an error is reported when you run a CUDA application that requires a higher CUDA version on the instance. In the following example, the installed CUDA version is 11.4 and the matrixMul application requires CUDA 12.2.

  • Cause

    CUDA versions vary based on GPU driver versions. For more information about the mappings between CUDA and driver versions, see CUDA Toolkit Major Component Versions. As a result, an application that requires a higher CUDA version, such as CUDA 12.2, cannot run in a CUDA 11.4 environment. The following solution fixes the error by installing the CUDA compat package.

  • Solution

    1. Connect to your GPU-accelerated Linux instance.

      In this example, an Ubuntu 20.04 instance is connected. For more information, see Connect to a Linux instance by using a password or key.

    2. Run the following command to download the CUDA 12.2 compat package:

      In this example, the compat package of Ubuntu 20.04 and x86_64 is downloaded. You can download the compat package based on the OS and architecture of your GPU-accelerated instance from the Index of /compute/cuda/repos page.

      sudo wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/cuda-compat-12-2_535.104.05-1_amd64.deb
    3. Run the following command to extract files from the .deb file and decompress the files to the specified directory:

      Important

      In this example, the files are decompressed to the /home directory. Replace the directory with your actual directory.

      sudo dpkg -x cuda-compat-12-2_535.104.05-1_amd64.deb /home
    4. Run the following commands in sequence to configure a CUDA environment of a higher version:

      echo "export LD_LIBRARY_PATH=/home/usr/local/cuda-12.2/compat:$LD_LIBRARY_PATH" >> ~/.bashrc
      source ~/.bashrc
    5. Run the application to check whether it runs as expected.

      In this example, the matrixMul application is run and returns the expected results.
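The export in step 4 works because the dynamic loader searches the LD_LIBRARY_PATH directories from left to right, so prepending the compat directory makes its newer CUDA 12.2 libraries take precedence over the older system ones. A minimal sketch using the same path as the dpkg -x step above:

```shell
# Prepend the compat directory, exactly as in step 4 (the path matches the
# /home destination that dpkg -x used above):
export LD_LIBRARY_PATH="/home/usr/local/cuda-12.2/compat:${LD_LIBRARY_PATH:-}"

# The first entry is the first directory the dynamic loader searches:
echo "${LD_LIBRARY_PATH%%:*}"   # prints "/home/usr/local/cuda-12.2/compat"
```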