This topic helps you troubleshoot and resolve issues with Elastic GPU Service by summarizing common issues encountered when using GPUs.
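| Category | Related questions |
| --- | --- |
| GPU-accelerated instance | Do GPU-accelerated instances support Android emulators? |
| | Can the configuration of a GPU-accelerated instance be changed? |
| | Can a standard ECS instance family be upgraded or changed to a GPU-accelerated instance family? |
| | How do I transfer data between a GPU-accelerated instance and a standard ECS instance? |
| | What is the difference between a GPU and a CPU? |
| GPU card | After I purchase a GPU-accelerated instance, why can't the nvidia-smi command find the GPU card? |
| | How do I view the details of a GPU card? |
| | A GPU initialization failure (such as RmInitAdapter failed!) occurs when I use a GPU on Linux |
| GPU memory | Why does an instance with 48 GB of GPU memory show about 3 GB less in nvidia-smi? |
| | How do I disable the ECC feature to free up GPU memory? |
| | What do I do if an error indicating a GPU is in use by another client occurs when I disable ECC? |
| GPU driver | What driver do I need to install for a vGPU-accelerated instance? |
| | Can I upgrade CUDA to 12.4 or the NVIDIA driver to 550 or later on a vGPU-accelerated instance? |
| | What driver do I need to install to use tools such as OpenGL and Direct3D for graphics acceleration on a GPU-accelerated compute-optimized instance? |
| | Why is the CUDA version I see after installation different from the one I selected when creating the GPU-accelerated instance? |
| | After I install a GRID driver on a Windows GPU-accelerated instance, what do I do if a black screen appears when I use a VNC connection from the console? |
| | How do I get a GRID License? |
| | How do I upgrade a GPU driver (Tesla or GRID)? |
| | A system crash and a kernel NULL pointer dereference error occur after you install NVIDIA driver version 570.124.xx (Linux) or 572.61 (Windows) |
| | The nvidia-smi command returns a "No devices were found" error if you select NVIDIA Proprietary for the kernel module type during driver installation |
| GPU monitoring | How do I view the resource usage (vCPU, network traffic, bandwidth, and disk) of a GPU-accelerated instance? |
| Others | How do I install the cGPU service? |
| | The nvidia-smi -r command hangs after you install the cGPU service |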
GPU-accelerated instances
Do GPU-accelerated instances support Android emulators?
Android emulators can be installed on only some GPU-accelerated instances.
Android emulators are supported only on the following GPU-accelerated compute-optimized ECS Bare Metal Instance families: ebmgn7e, ebmgn7i, ebmgn7, ebmgn6ia, ebmgn6e, ebmgn6v, ebmgn6i.
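The family list above can be turned into a quick check. The following is a minimal sketch, assuming instance type names follow the `ecs.<family>.<size>` pattern; the function names and the sample type names are illustrative and not part of any SDK.

```python
# Families listed above as supporting Android emulators.
ANDROID_EMULATOR_FAMILIES = {
    "ebmgn7e", "ebmgn7i", "ebmgn7", "ebmgn6ia", "ebmgn6e", "ebmgn6v", "ebmgn6i",
}


def instance_family(instance_type):
    """Extract the family from a type name such as 'ecs.ebmgn7e.32xlarge'.

    Assumption: names follow the 'ecs.<family>.<size>' pattern.
    """
    parts = instance_type.strip().lower().split(".")
    if len(parts) != 3 or parts[0] != "ecs":
        raise ValueError(f"unexpected instance type format: {instance_type!r}")
    return parts[1]


def supports_android_emulator(instance_type):
    """Return True if the instance family is in the supported list above."""
    return instance_family(instance_type) in ANDROID_EMULATOR_FAMILIES
```

For example, `supports_android_emulator("ecs.ebmgn7e.32xlarge")` returns `True`, while a type from a family that is not in the list returns `False`.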
Can the configuration of a GPU-accelerated instance be changed?
Some GPU-accelerated instances support configuration changes.
Supported instance types are listed in Instance type change restrictions and checks.
Can a standard ECS instance family be upgraded or changed to a GPU-accelerated instance family?
No. Standard ECS instance families cannot be changed to GPU-accelerated instance families.
For the instance types that do support configuration changes, see Instance type change restrictions and checks.
How do I transfer data between a GPU-accelerated instance and a standard ECS instance?
No special settings are required to transfer data.
GPU-accelerated instances behave like standard ECS instances: instances in the same security group communicate over the internal network by default.
What is the difference between a GPU and a CPU?
The following table compares GPUs and CPUs.
| Comparison | GPU | CPU |
| --- | --- | --- |
| Arithmetic Logic Unit (ALU) | Many ALUs optimized for large-scale parallel computation. | Few but powerful ALUs. |
| Control unit | Has a relatively simple control unit. | Has a complex control unit. |
| Cache | Has a small cache that serves threads instead of storing accessed data. | Has large cache structures that can store data to improve access speed and reduce latency. |
| Response method | Integrates all tasks before batch processing. | Responds to individual tasks in real-time. |
| Scenarios | Suitable for compute-intensive, highly similar, and multi-threaded parallel high-throughput computing scenarios. | Suitable for logically complex serial computing scenarios that require fast response times. |
GPU cards
After I purchase a GPU-accelerated instance, why can't the nvidia-smi command find the GPU card?
Cause: The nvidia-smi command cannot find the GPU card because the Tesla or GRID driver is not installed or the installation failed.
Solution: To use the high-performance features of your GPU-accelerated instance, you must install the correct driver for your instance type:
- vGPU-accelerated instances require a GRID driver.
- GPU-accelerated compute-optimized instances support Tesla or GRID drivers.
How do I view the details of a GPU card?
The method varies by operating system:
- On Linux, run the nvidia-smi command to view the GPU card details.
- On Windows, you can view the GPU card details in .
To view information such as GPU idle rate, usage, temperature, and power, go to the CloudMonitor console. For more information, see GPU monitoring.
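On Linux, the same details can also be collected in a script. The following is a minimal sketch that calls nvidia-smi with its `--query-gpu` option and parses the CSV output. The helper names are illustrative, and the sample values in the usage note are made up.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi --query-gpu.
QUERY_FIELDS = [
    "name", "memory.total", "memory.used",
    "utilization.gpu", "temperature.gpu", "power.draw",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        values = [v.strip() for v in record]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi and return the parsed details (requires a working driver)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on saved output, for example `parse_gpu_csv("NVIDIA A10, 23028, 0, 0, 35, 28.50\n")`.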
A GPU initialization failure (such as RmInitAdapter failed!) occurs when I use a GPU on Linux
- Symptoms: The GPU device goes offline and the system cannot recognize the GPU card. For example, on a Linux system, a GPU initialization failure error occurs. After you run the sh nvidia-bug-report.sh command, the RmInitAdapter failed error message appears in the generated log, as shown in the following example:

  NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
  NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
  NVRM: crashcatWayfinderGetReportQueue_V1: insufficiently-sized L1 wayfinder scratch location 0
  NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
  NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2015)
  NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 0

- Cause: The GPU System Processor (GSP) component may be in an abnormal state. This causes the device to go offline and the system to be unable to detect the GPU card.
- Solution: Restart the instance from the console. This action performs a complete GPU reset and usually resolves the issue. If the issue persists, see GPU device loss due to XID 119/XID 120 errors when using a GPU for further troubleshooting. We recommend that you disable the GSP feature.
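To spot this failure in a saved kernel log without reading it line by line, a small scanner can help. The following is a minimal sketch: the signatures come from the sample log above, and the function name is illustrative. It only flags matching lines and does not diagnose the root cause.

```python
import re

# Signatures taken from the sample log above.
GSP_FAILURE_PATTERNS = [
    re.compile(r"RmInitAdapter failed"),
    re.compile(r"Cannot initialize GSP firmware RM"),
    re.compile(r"rm_init_adapter failed"),
]


def find_gpu_init_failures(log_text):
    """Return NVRM lines that match a known GPU initialization failure signature."""
    hits = []
    for line in log_text.splitlines():
        if "NVRM:" in line and any(p.search(line) for p in GSP_FAILURE_PATTERNS):
            hits.append(line.strip())
    return hits
```

Pass it the text of the generated log. An empty result means none of these signatures were found, which does not by itself prove that the GPU is healthy.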
GPU memory
Why does an instance with 48 GB of GPU memory show about 3 GB less in nvidia-smi?
ECC (Error-Correcting Code) is enabled and uses approximately 2-3 GB of GPU memory on a 48 GB instance. Run nvidia-smi to check ECC status (OFF = disabled, ON = enabled).
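The ECC state can also be read per GPU in a script. The following is a minimal sketch, assuming the `ecc.mode.current` field of `nvidia-smi --query-gpu`, which reports values such as Enabled, Disabled, or [N/A] rather than the ON/OFF shown in the default table. The helper names are illustrative.

```python
import subprocess


def parse_ecc_modes(text):
    """Map GPU index to True (ECC enabled), False (disabled), or None (not reported)."""
    modes = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        state = parts[1].lower()
        modes[int(parts[0])] = {"enabled": True, "disabled": False}.get(state)
    return modes


def query_ecc_modes():
    """Run nvidia-smi and return the current ECC mode of each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,ecc.mode.current",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_modes(result.stdout)
```

A value of `None` means the GPU did not report an ECC mode, which is typical for cards without ECC support.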
How do I disable the ECC feature to free up GPU memory?
- Command line: Stop all processes that use the GPU. Run nvidia-smi -e 0 to disable ECC. Then, run nvidia-smi -r to reset the GPU.
- Startup script: Add nvidia-smi -e 0 and nvidia-smi -r to the first line of the /etc/rc.local startup script. For some systems, the path is /etc/rc.d/rc.local. Then, restart the instance.
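The command-line method can be wrapped in a small helper that defaults to a dry run, so the sequence can be reviewed before it touches the GPU. This is a minimal sketch with illustrative function names. It assumes root privileges and that no process is using the GPU when it runs for real.

```python
import subprocess


def ecc_disable_commands():
    """Commands from the steps above: disable ECC, then reset the GPU."""
    return [["nvidia-smi", "-e", "0"], ["nvidia-smi", "-r"]]


def disable_ecc(dry_run=True):
    """Run (or, by default, only list) the ECC disable sequence in order."""
    handled = []
    for cmd in ecc_disable_commands():
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
        handled.append(" ".join(cmd))
    return handled
```

Calling `disable_ecc()` only returns the two command strings. Pass `dry_run=False` to execute them.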
What do I do if an error indicating a GPU is in use by another client occurs when I disable ECC?
This error indicates that a component or process is still using the GPU. Make sure no GPU processes are running on the machine. If you cannot stop them manually, create a snapshot backup. Then, add the nvidia-smi -e 0 and nvidia-smi -r commands to the /etc/rc.local startup script. For some systems, the path is /etc/rc.d/rc.local. Restart the instance for the changes to take effect.
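Before you fall back to the startup script, it can help to see which processes still hold the GPU. The following is a minimal sketch that parses the output of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader`. The helper names and the sample process names are illustrative. This query lists compute processes only, so other components that hold the device may not appear.

```python
import subprocess


def parse_compute_apps(text):
    """Parse 'pid, process_name' lines into a list of (pid, name) tuples."""
    apps = []
    for line in text.splitlines():
        pid, sep, name = line.partition(",")
        if sep and pid.strip().isdigit():
            apps.append((int(pid.strip()), name.strip()))
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes that are using a GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

An empty list suggests that no compute process is attached. If the error still occurs, the startup-script approach described above remains the fallback.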
GPU drivers
What driver do I need to install for a vGPU-accelerated instance?
vGPU-accelerated instances require a GRID driver.
For general-purpose computing or graphics acceleration scenarios, you can load the GRID driver during instance creation or install it with Cloud Assistant afterward:
- Load the GRID driver during instance creation. For more information, see Load a GRID driver from an image with a pre-installed driver.
- Install the GRID driver with Cloud Assistant after creation.
Can I upgrade CUDA to 12.4 or the NVIDIA driver to 550 or later on a vGPU-accelerated instance?
No.
vGPU-accelerated instances use the platform-provided GRID driver with a fixed version. You cannot install drivers from the NVIDIA website. To upgrade CUDA or the driver, use a gn or ebm series instance instead.
What driver do I need to install to use tools such as OpenGL and Direct3D for graphics acceleration on a GPU-accelerated compute-optimized instance?
Install the driver based on your operating system:
- Linux GPU-accelerated compute-optimized instances require a Tesla driver.
- Windows GPU-accelerated compute-optimized instances require a GRID driver.
Why is the CUDA version I see after installation different from the one I selected when creating the GPU-accelerated instance?
The nvidia-smi command shows the highest CUDA version that your GPU-accelerated instance supports, not the version you selected during instance creation.
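To read these values in a script, the header line of the nvidia-smi output can be parsed. The following is a minimal sketch, assuming the usual header layout that contains "Driver Version:" and "CUDA Version:". The function name and the version numbers in the usage note are illustrative.

```python
import re


def parse_smi_versions(text):
    """Return (driver_version, max_supported_cuda_version) from nvidia-smi output.

    Either element is None if the corresponding field is not found.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )
```

For a header such as `| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |`, the function returns `("550.54.15", "12.4")`. The second value is the highest CUDA version that the driver supports, not the toolkit version that is installed.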
After I install a GRID driver on a Windows GPU-accelerated instance, what do I do if a black screen appears when I use a VNC connection from the console?
- Cause: The GRID driver takes over display output. VNC can no longer render from the integrated graphics, causing a black screen. This is expected behavior.
- Solution: Connect to the GPU-accelerated instance using Workbench. For more information, see Connect to a Windows instance by using Workbench.
How do I get a GRID License?
The method depends on your operating system:
- On Windows, use a pre-installed driver image or install the driver manually.
- On Linux, use a pre-installed driver image or Cloud Assistant.
How do I upgrade a GPU driver (Tesla or GRID)?
You cannot directly upgrade a GPU driver. Uninstall the old version, restart the instance, and then install the new version. For more information, see Upgrade a Tesla or GRID driver.
Upgrade during off-peak hours, and back up disk data by creating a snapshot first. For more information, see Create a snapshot.
A system crash and a kernel NULL pointer dereference error occur after you install NVIDIA driver version 570.124.xx (Linux) or 572.61 (Windows)
- Symptoms: On some instance types, the system reports a kernel NULL pointer dereference error either during the installation of NVIDIA driver version 570.124.xx (Linux) or 572.61 (Windows), or when running the nvidia-smi command after the installation.
- Solution: Avoid using driver version 570.124.xx (Linux) or 572.61 (Windows). We recommend that you use version 570.133.20 (Linux) or 572.83 (Windows) or later.
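The version guidance above can be encoded as a simple check, for example in a provisioning script. The following is a minimal sketch: the rule table restates the versions named above, and the function names and return labels are illustrative. The "other" label only means that the text above says nothing about that version.

```python
def parse_version(version):
    """Turn a dotted version string such as '570.133.20' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# (affected version prefix, minimum recommended version) per OS, from the text above.
DRIVER_RULES = {
    "linux": ((570, 124), (570, 133, 20)),
    "windows": ((572, 61), (572, 83)),
}


def check_driver(os_name, version):
    """Classify a driver version as 'affected', 'recommended', or 'other'."""
    affected_prefix, recommended = DRIVER_RULES[os_name.lower()]
    parsed = parse_version(version)
    if parsed[:len(affected_prefix)] == affected_prefix:
        return "affected"
    if parsed >= recommended:
        return "recommended"
    return "other"
```

For example, `check_driver("linux", "570.124.06")` returns `"affected"` and `check_driver("windows", "572.83")` returns `"recommended"`.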
The nvidia-smi command returns a "No devices were found" error if you select NVIDIA Proprietary for the kernel module type during driver installation
- Symptoms: On some instance types, if you select NVIDIA Proprietary for the kernel module type during driver installation, the nvidia-smi command returns a No devices were found error after the installation. The other available kernel module type on this screen is MIT/GPL.
- Cause: Not all GPU models are compatible with the NVIDIA Proprietary driver.
- Recommended kernel module type configuration:
  - For Blackwell architecture GPUs: You must use the open-source driver (select MIT/GPL).
  - For Turing, Ampere, Ada Lovelace, and Hopper architecture GPUs: We recommend that you use the open-source driver (select MIT/GPL).
  - For Maxwell, Pascal, and Volta architecture GPUs: You can only select NVIDIA Proprietary.
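The recommendations above amount to a lookup from GPU architecture to kernel module type. The following is a minimal sketch of that lookup. The function name and the "required"/"recommended" labels are illustrative, and the architecture name must be supplied by the caller because this sketch does not detect it.

```python
# Architecture groups as listed in the recommendations above.
OPEN_REQUIRED = {"blackwell"}
OPEN_RECOMMENDED = {"turing", "ampere", "ada lovelace", "hopper"}
PROPRIETARY_ONLY = {"maxwell", "pascal", "volta"}


def kernel_module_type(architecture):
    """Return (installer choice, strength) for a GPU architecture name."""
    arch = architecture.strip().lower()
    if arch in OPEN_REQUIRED:
        return ("MIT/GPL", "required")
    if arch in OPEN_RECOMMENDED:
        return ("MIT/GPL", "recommended")
    if arch in PROPRIETARY_ONLY:
        return ("NVIDIA Proprietary", "required")
    raise ValueError(f"architecture not covered by the list above: {architecture!r}")
```

For example, `kernel_module_type("Hopper")` returns `("MIT/GPL", "recommended")`. Architectures that are not named above raise an error instead of guessing.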
GPU monitoring
How do I view the resource usage (vCPU, network traffic, bandwidth, and disk) of a GPU-accelerated instance?
You can use one of the following methods to view monitoring data such as vCPU usage, memory, average system load, internal bandwidth, public bandwidth, network connections, disk usage and reads, GPU usage, GPU memory usage, and GPU power.
- Product console
  - ECS console: Provides vCPU usage, network traffic, disk I/O, and GPU metrics. For more information, see View monitoring information in the ECS console.
  - CloudMonitor console: Provides fine-grained infrastructure, OS, GPU, network, process, and disk monitoring. For more information, see Host monitoring.
- Expenses and Costs center

  On the View Usage Details page, filter by Time Period, Commodity Name, Billable Item, and Time Unit. Click Export CSV to export usage data. For more information, see Billing details.

  For example, to view the traffic usage of an ECS instance, select ECS - Pay-As-You-Go for Product name, Outbound traffic for Billable item, Public traffic for Metering specification (the specification name is ECS_FLOW), and Hour for Metering granularity.

  Note: Usage details show raw resource consumption, which differs from billable usage in billing details. These results are for reference only and cannot be used for reconciliation.
Others
How do I install the cGPU service?
Install the cGPU service through the Docker runtime in ACK. This is the recommended method for both enterprise users and individual users who have completed identity verification. For more information, see Manage the shared GPU scheduling component.
The nvidia-smi -r command hangs after you install the cGPU service
-
Symptoms: When the cGPU service is loaded (verify with
lsmod | grep cgpu), thenvidia-smi -rcommand hangs when resetting the GPU. An error also appears in thedmesglog.[527717.881425] NVRM: Attempting to remove device 0000:08:00.0 with non-zero usage count! -
Cause: The cGPU component is still using the GPU device. This blocks the hardware reset operation.
-
Solution:
-
Uninstall cGPU: Uninstall the cGPU component. After the uninstallation, the
nvidia-smi -rcommand resumes and returns a result. -
Restart the instance: If the issue persists after the uninstallation, restart the instance from the console. Running the reboot command inside the instance is not effective.
ImportantDo not reset the GPU by running commands such as
nvidia-smi -r, detaching the device, or reinstalling the driver when the cGPU service is loaded. Always uninstall the cGPU service first to prevent failures. -