This topic answers frequently asked questions about Elastic GPU Service to help you troubleshoot related issues.
Why do GPU-accelerated Windows instances not support features such as DirectX?
Windows Remote Desktop Protocol (RDP) does not support DirectX or Open Graphics Library (OpenGL) applications. To use these applications, install TightVNC and a TightVNC client, or a remote connection client that supports a protocol such as PC over IP (PCoIP) or XenDesktop HDX 3D, on your GPU-accelerated Windows instances.
Do GPU-accelerated instances support Android emulators?
Only the following GPU-accelerated compute-optimized Elastic Compute Service (ECS) Bare Metal Instance families support Android emulators: ebmgn7e, ebmgn7i, ebmgn7, ebmgn6ia, ebmgn6e, ebmgn6v, and ebmgn6i.
Can I change the instance types of GPU-accelerated instances?
You can change the instance types of GPU-accelerated instances within the same instance family except for the following instance families:
gn5: a GPU-accelerated compute-optimized instance family that uses local storage
vgn5i: a vGPU-accelerated instance family
For more information, see Instance families that support instance type changes.
Do pay-as-you-go GPU-accelerated instances support the economical mode?
The GPU-accelerated instance families that use local storage, such as gn5, do not support the economical mode. For more information, see Economical mode.
What are the differences between GPUs and CPUs?
The following table describes the differences between GPUs and CPUs.
Comparison item | GPU | CPU |
ALU | A large number of arithmetic logic units (ALUs) that can be used for large-scale parallel computing. | A small number of powerful ALUs. |
Logic control unit | Simple logic control units. | Complex logic control units. |
Cache | A small cache that serves threads and is not used to store accessed data. | A large cache that stores data to increase the speed of data access and reduce latency. |
Response mode | GPUs collect tasks and then process them in batches. | CPUs respond to each task in real time. |
Scenario | Compute-intensive and high-throughput scenarios where multiple threads run in parallel to process highly similar tasks. | Serial computing scenarios that involve complex logic and require high response speeds. |
Can regular ECS instance families be upgraded or changed to GPU-accelerated instance families?
No, regular ECS instance families cannot be upgraded or changed to GPU-accelerated instance families.
For more information, see Instance families that support instance type changes.
Why am I unable to view GPUs by running the nvidia-smi command after I purchase GPU-accelerated instances?
In most cases, the nvidia-smi command does not display GPUs because NVIDIA drivers are not installed on the GPU-accelerated instances. Install the driver that matches the instance family of your GPU-accelerated instances. The following information describes the drivers that you can install and how to install the drivers:
If you purchase vGPU-accelerated instances, you must install GRID drivers. For more information, see Install a GRID driver on a vGPU-accelerated Linux instance or Install a GRID driver on a GPU-accelerated compute-optimized or vGPU-accelerated Windows instance.
If you purchase GPU-accelerated compute-optimized instances, you can install GPU drivers. For more information, see Install a Tesla driver on a GPU-accelerated compute-optimized Linux instance or Install a Tesla driver on a GPU-accelerated compute-optimized Windows instance.
For more information about the installation scenarios and guidelines for drivers, see Installation guideline for NVIDIA Tesla and GRID drivers.
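Before you install a driver, it can help to confirm whether the instance sees a GPU and whether an NVIDIA kernel module is already loaded. The following is a minimal sketch, not an official procedure. It assumes a Linux instance on which the standard lspci and lsmod utilities are available and on which lspci resolves the vendor name to text that contains "NVIDIA". The function name diagnose_gpu_driver is hypothetical.

```shell
# Classify the driver state from captured `lspci` and `lsmod` output.
# $1: text printed by `lspci`
# $2: text printed by `lsmod`
diagnose_gpu_driver() {
    local lspci_output=$1
    local lsmod_output=$2
    if ! printf '%s\n' "$lspci_output" | grep -qi 'nvidia'; then
        # No NVIDIA device is visible on the PCI bus.
        echo "no-gpu-device"
    elif ! printf '%s\n' "$lsmod_output" | grep -q '^nvidia'; then
        # A device is present, but no nvidia* kernel module is loaded.
        echo "driver-not-loaded"
    else
        echo "driver-loaded"
    fi
}

# Usage on an instance (not executed here):
#   diagnose_gpu_driver "$(lspci)" "$(lsmod)"
```

A result of driver-not-loaded suggests that a driver still needs to be installed as described in the topics listed above.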
Which drivers do I need to install on vGPU-accelerated instances?
The instances of vGPU-accelerated instance families such as vgn6i and vgn5i are configured with vGPUs that are generated from GPU virtualization based on the mediated pass-through method. You can install only GRID drivers on the instances. You must install GRID drivers based on the OS types of your vGPU-accelerated instances. The following information describes how to install the drivers:
If you want to install Windows GRID drivers, go to the Alibaba Cloud Marketplace homepage to purchase the images that contain the GRID drivers, and use the images to install the drivers. For example, you can purchase images such as Windows Server 2019 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image and Windows Server 2016 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image.
For more information, see Install a GRID driver on a GPU-accelerated compute-optimized or vGPU-accelerated Windows instance.
If you want to install Linux GRID drivers, see Install a GRID driver on a vGPU-accelerated Linux instance.
Why does the CUDA version displayed on a GPU-accelerated instance differ from the version that I selected when I created the instance?
The nvidia-smi command displays the latest CUDA version that the GPU-accelerated instance supports, not the CUDA version that you selected when you created the GPU-accelerated instance.
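To see both version numbers side by side, you can compare the banner printed by nvidia-smi with the output of nvcc --version. The following is a minimal sketch. It assumes that the CUDA toolkit is installed and that nvcc is on the PATH, and it relies on the usual "CUDA Version: X.Y" and "release X.Y" text in the respective outputs. Both function names are hypothetical helpers.

```shell
# Extract the CUDA version shown in the nvidia-smi banner.
# $1: text printed by `nvidia-smi`
driver_cuda_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Extract the CUDA toolkit version reported by the compiler.
# $1: text printed by `nvcc --version`
toolkit_cuda_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Usage on an instance (not executed here):
#   echo "nvidia-smi reports: $(driver_cuda_version "$(nvidia-smi)")"
#   echo "toolkit installed:  $(toolkit_cuda_version "$(nvcc --version)")"
```

If the two values differ, that is consistent with the behavior described above and does not by itself indicate a fault.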
Which drivers do I need to install when I use tools such as OpenGL and Direct3D for graphics computing on GPU-accelerated compute-optimized instances?
You can install drivers on GPU-accelerated compute-optimized instances based on the OS types of the instances. The following information describes the drivers that you can install and how to install the drivers:
If your GPU-accelerated compute-optimized instances run Linux, install GPU drivers. For more information, see Install a Tesla driver on a GPU-accelerated compute-optimized Linux instance.
If your GPU-accelerated compute-optimized instances run Windows, go to the Alibaba Cloud Marketplace homepage to purchase the images that contain GRID drivers, and use the images to install the drivers. For example, you can purchase images such as Windows Server 2019 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image and Windows Server 2016 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image.
For more information, see Install a GRID driver on a GPU-accelerated compute-optimized or vGPU-accelerated Windows instance.
How do I view GPU monitoring data?
You can log on to the CloudMonitor console or call the DescribeMetricList operation to view the GPU monitoring data. For more information, see GPU monitoring.
How do I transmit data between GPU-accelerated instances and regular ECS instances?
Apart from GPU acceleration, GPU-accelerated instances work in the same way as regular ECS instances. By default, GPU-accelerated instances and regular ECS instances that belong to the same security group can transmit data to each other over the internal network. You do not need to configure network connectivity.
What do I do if a black screen appears on a VNC client when I use the VNC client to connect to a GPU-accelerated Windows instance on which a GRID driver is installed?
Cause: After you install a GRID driver on a GPU-accelerated Windows instance, the GRID driver controls the output display of the virtual machine. The Virtual Network Computing (VNC) client can no longer obtain the output display that is processed by the integrated GPU on the instance, so a black screen appears on the VNC client. This behavior is expected.
Solution: Connect to the GPU-accelerated Windows instance from a client on your computer. For more information, see Connect to a Windows instance by using a username and password.
How do I view the details of the GPUs that are used by GPU-accelerated instances?
The methods that you can use to view the details of the GPUs vary based on the OS types of your GPU-accelerated instances. The following information describes how to view the details:
If your GPU-accelerated instances run Linux, run the nvidia-smi command to view the details of the GPUs.
If your GPU-accelerated instances run Windows, view the details of the GPUs in Device Manager.
If you want to view information about the GPUs, such as the idle rate, utilization, temperature, and power, go to the CloudMonitor console. For more information, see GPU monitoring.
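On a Linux instance, the same kind of information can also be read directly from nvidia-smi in CSV form. The following is a minimal sketch, assuming that an NVIDIA driver is installed. The function names gpu_details and format_gpu_line are hypothetical, and the selected fields are only one possible choice.

```shell
# Query one CSV line per GPU: index, name, utilization (%),
# temperature (degrees C), and power draw (W).
gpu_details() {
    nvidia-smi \
        --query-gpu=index,name,utilization.gpu,temperature.gpu,power.draw \
        --format=csv,noheader,nounits
}

# Turn one CSV line into a short readable summary.
# $1: a line such as "0, Tesla T4, 35, 48, 27.50"
format_gpu_line() {
    printf '%s\n' "$1" |
        awk -F', *' '{ printf "GPU %s (%s): util=%s%% temp=%sC power=%sW\n", $1, $2, $3, $4, $5 }'
}

# Usage on an instance (not executed here):
#   gpu_details | while IFS= read -r line; do format_gpu_line "$line"; done
```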
How do I obtain GRID licenses?
You can obtain GRID licenses based on the OS types of your GPU-accelerated instances.
If you want to install GRID drivers on GPU-accelerated Windows instances, go to the Alibaba Cloud Marketplace homepage to purchase the images that contain the GRID drivers, and use the images to install the drivers. For example, you can purchase images such as Windows Server 2019 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image and Windows Server 2016 DataCenter 64-bit (English) Preinstalled GRID 13 Driver Image.
For more information, see Install a GRID driver on a GPU-accelerated compute-optimized or vGPU-accelerated Windows instance.
If you want to install GRID drivers on vGPU-accelerated Linux instances, submit a ticket to obtain the GRID licenses, and install the drivers. For more information, see Install a GRID driver on a vGPU-accelerated Linux instance.
How do I install cGPU?
We recommend that you install and use cGPU by using the GPU sharing component that is provided by Container Service for Kubernetes (ACK). For more information, see Configure the GPU sharing component.
How do I disable automatic installation of a GPU driver when I change the OS?
If you select Auto-install GPU Driver when you create a GPU-accelerated instance, the instance is created with a GPU driver automatically installed. If you want to change the OS of the instance and disable automatic installation of the GPU driver, perform the following steps.
For more information about how to automatically install a GPU driver when you create a GPU-accelerated instance, see Create a GPU-accelerated Linux instance configured with a GPU driver.
Stop the GPU-accelerated instance.
For more information, see Stop an instance.
On the Instance page, find the stopped GPU-accelerated instance and click the More icon in the Actions column. In the Instance Settings section, click Set User Data.
In the User Data field, delete the user data and click Confirm.
Change the OS of the GPU-accelerated instance.
The system changes the OS of a GPU-accelerated instance by replacing the system disk of the instance. Therefore, you can change the OS of a GPU-accelerated instance by replacing the image of the instance. For more information, see Replace the operating system (system disk) of an instance.
What do I do if a GPU has fallen off the bus due to an XID 119 or XID 120 error?
Problem description
A GPU has fallen off the bus. For example, an error message that indicates a GPU fails to be started appears when you use the GPU on a Linux machine. After you run the sh nvidia-bug-report.sh command, you can view XID 119 or XID 120 error messages in the log. The following figure shows an example of XID 119 error messages.
Note: For more information, see Common XID Errors in the official NVIDIA documentation.
Cause
The preceding issue may be caused by the abnormal running status of the GPU System Processor (GSP) component of the GPU. NVIDIA does not provide a specific driver version to fix the issue. We recommend that you disable the GSP feature before you use the GPU.
Note: For more information about GSP, see Chapter 42. GSP Firmware in the official NVIDIA documentation.
Solution
Log on to the GPU-accelerated instance.
For more information, see Connect to a Linux instance by using a password or key.
Run the following command to disable the GSP component:
echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
Restart the GPU-accelerated instance.
For more information, see Restart instances.
Log on to the GPU-accelerated instance again.
Run the following command to obtain the value of the EnableGpuFirmware parameter:
cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
If EnableGpuFirmware:0 is returned, the GSP component is disabled and the issue is fixed.
Important: In this case, you can run the nvidia-smi command to check the GPU status, which is expected to be normal.
If EnableGpuFirmware:0 is not returned, the GSP component is not disabled. Proceed to the next step.
Run the following command to check whether the GPU is running as expected:
nvidia-smi
If an error is returned, the issue persists on the GPU. Contact Alibaba Cloud technical support to shut down the instance and migrate data.
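The verification after the restart can be wrapped in a small helper. The following is a minimal sketch, not part of the official procedure. The function name gsp_state is hypothetical, and the parsing assumes that the parameter appears on a line of its own as "EnableGpuFirmware:" followed by a number, with or without a space after the colon.

```shell
# Report whether GSP firmware is disabled, based on the driver params file.
# $1 (optional): path to the params file; defaults to the path used above.
gsp_state() {
    local params_file=${1:-/proc/driver/nvidia/params}
    local value
    if [ ! -r "$params_file" ]; then
        echo "unknown"   # driver not loaded or file not readable
        return 0
    fi
    # The trailing colon in the pattern keeps similarly named
    # parameters from matching.
    value=$(sed -n 's/^EnableGpuFirmware: *\([0-9][0-9]*\).*/\1/p' "$params_file" | head -n 1)
    case "$value" in
        0)  echo "disabled" ;;
        "") echo "unknown" ;;
        *)  echo "enabled" ;;
    esac
}

# Usage on an instance (not executed here):
#   gsp_state
```

A result of disabled corresponds to the EnableGpuFirmware:0 case in the steps above. Any other result means that the remaining checks still apply.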
What do I do if the "undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12" error message is reported when I use PyTorch on a GPU-accelerated Linux instance?
Problem description
When you use PyTorch on a GPU-accelerated Linux instance, the following error message is reported:
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 235, in <module>
    from torch._C import *  # noqa: F403
ImportError: /usr/local/lib/python3.8/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
Cause
The CUDA version installed on the GPU-accelerated instance may be incompatible with the PyTorch version. For more information about the mapping between CUDA and PyTorch versions, see the Previous PyTorch versions tab in the official PyTorch documentation.
The PyTorch version installed by running the pip install torch command is 2.1.2, which requires CUDA 12.1. However, the CUDA version automatically installed on the purchased GPU-accelerated instance is 12.0, which does not match the CUDA version required by PyTorch.
Solution
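Before you change anything, you can confirm the mismatch described in the Cause section by comparing the CUDA version that the pip packages expect with the CUDA version on the instance. The following is a minimal sketch. It assumes that pip installed the CUDA runtime as a package named nvidia-cuda-runtime-cu12 alongside PyTorch, which may not hold for every PyTorch build. The function names are hypothetical.

```shell
# Print the CUDA major.minor version of the runtime package listed by pip.
# $1: text printed by `pip list`
wheel_cuda_version() {
    printf '%s\n' "$1" |
        awk '$1 == "nvidia-cuda-runtime-cu12" { split($2, v, "."); print v[1] "." v[2] }'
}

# Compare the version expected by the pip packages with the instance version.
# $1: text printed by `pip list`
# $2: CUDA version installed on the instance, such as 12.0
check_cuda_match() {
    local required
    required=$(wheel_cuda_version "$1")
    if [ -z "$required" ]; then
        echo "unknown"
    elif [ "$required" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: PyTorch wheels expect CUDA $required, instance has CUDA $2"
    fi
}

# Usage on an instance (not executed here):
#   check_cuda_match "$(pip list)" 12.0
```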
If you selected Auto-install GPU Driver on the Public Images tab in the Image section when you purchased a GPU-accelerated instance, you can change the CUDA version to 12.1 by using one of the following methods: