ACK Pro clusters include a built-in GPU node diagnostics feature that collects GPU metrics and surfaces nvidia-smi status codes and XID error messages, so you can identify and resolve GPU-accelerated node problems without opening a support ticket.
This page covers how to run a node diagnostic and how to interpret the results using the nvidia-smi status code reference and XID error reference.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster. For more information, see Create an ACK managed cluster.
- A cluster in the Running state on the Clusters page of the ACK console.
Run a node diagnostic
- Log on to the ACK console. In the left-side navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left-side navigation pane, choose Inspections and Diagnostics > Diagnostics.
- On the Diagnosis page, click Node diagnosis.
- In the Select node panel, specify a Node name, read the warning and select I know and agree, then click Create diagnosis. Wait until the Status column of the diagnostic report shows Success.
If you diagnose a single GPU-accelerated node, the report displays GPU metrics for the selected node. Use the nvidia-smi status code reference and XID error reference below to interpret the results and take action.
If you need to submit a ticket for further support, include the diagnostic information from the report in your ticket.
nvidia-smi status code reference
NVIDIA System Management Interface (nvidia-smi) is a command-line utility for monitoring and managing NVIDIA GPU devices. When a diagnostic runs, the report includes an NVIDIASMIStatusCode field. Look up that code in the table below to identify the root cause and next steps.
For driver-related codes, check the driver installation log at /var/log/nvidia-installer.log and run dmesg | grep -i nv to look for driver error messages. For hardware-related codes, submit a ticket to request Elastic Compute Service (ECS) technical support.
| Status code | Category | Description | Action |
|---|---|---|---|
| 0 | — | Success | None required |
| 3 | Driver | The requested operation is not available on the target device. The device may not support nvidia-smi, or a driver issue exists. | Check /var/log/nvidia-installer.log and run dmesg \| grep -i nv |
| 6 | Driver | A query to find an object was unsuccessful. | Check /var/log/nvidia-installer.log and run dmesg \| grep -i nv |
| 8 | Hardware | The external power cables of a device are not properly attached. | Submit a ticket to request ECS technical support |
| 9 | Driver | The NVIDIA driver is not loaded. | Check /var/log/nvidia-installer.log and run dmesg \| grep -i nv |
| 10 | — | The NVIDIA kernel detected an interrupt issue with a GPU. | Check /var/log/nvidia-installer.log, run dmesg \| grep -i nv, or check the XID in the report |
| 12 | — | The NVML (NVIDIA Management Library) shared library cannot be found or loaded. | Check /var/log/nvidia-installer.log, run dmesg \| grep -i nv, or check the XID in the report |
| 13 | — | The local version of NVML does not implement this function. | Check /var/log/nvidia-installer.log, run dmesg \| grep -i nv, or check the XID in the report |
| 14 | Hardware | The infoROM is corrupted. | Submit a ticket to request ECS technical support |
| 15 | Hardware | The GPU has fallen off the bus. | Submit a ticket to request ECS technical support |
| 255 | — | Other errors or internal driver errors occurred. | Check /var/log/nvidia-installer.log, run dmesg \| grep -i nv, or check the XID in the report |
| -1 | — | nvidia-smi timed out. | Check /var/log/nvidia-installer.log, run dmesg \| grep -i nv, or check the XID in the report |
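If you process diagnostic reports in a script, the table above can be expressed as a small lookup. The following Python sketch is illustrative only: the names StatusInfo, STATUS_CODES, and lookup_status are hypothetical, the category label "unspecified" stands in for rows that have no category in the table, and the fallback for codes that are not listed is an assumption rather than documented behavior.

```python
from typing import NamedTuple


class StatusInfo(NamedTuple):
    category: str  # "none", "driver", "hardware", or "unspecified"
    action: str


# Action texts copied from the reference table.
DRIVER = "Check /var/log/nvidia-installer.log and run dmesg | grep -i nv"
HARDWARE = "Submit a ticket to request ECS technical support"
MIXED = (
    "Check /var/log/nvidia-installer.log, run dmesg | grep -i nv, "
    "or check the XID in the report"
)

# Keys are the values of the NVIDIASMIStatusCode field in the report.
STATUS_CODES = {
    0: StatusInfo("none", "None required"),
    3: StatusInfo("driver", DRIVER),
    6: StatusInfo("driver", DRIVER),
    8: StatusInfo("hardware", HARDWARE),
    9: StatusInfo("driver", DRIVER),
    10: StatusInfo("unspecified", MIXED),
    12: StatusInfo("unspecified", MIXED),
    13: StatusInfo("unspecified", MIXED),
    14: StatusInfo("hardware", HARDWARE),
    15: StatusInfo("hardware", HARDWARE),
    255: StatusInfo("unspecified", MIXED),
    -1: StatusInfo("unspecified", MIXED),
}


def lookup_status(code: int) -> StatusInfo:
    """Return the category and suggested action for a status code."""
    # Assumption: codes missing from the table are escalated with the report.
    fallback = StatusInfo(
        "unknown",
        "Code not in the reference table; include the report in a ticket",
    )
    return STATUS_CODES.get(code, fallback)


if __name__ == "__main__":
    for code in (0, 9, 15, -1):
        info = lookup_status(code)
        print(f"{code}: [{info.category}] {info.action}")
```

Keeping the action strings as shared constants means that a change to the recommended steps only needs to be made in one place.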
XID error reference
XID messages are error reports that the NVIDIA driver writes to the kernel log or OS event log. Each XID indicates a hardware problem, NVIDIA software problem, or application problem, and includes the error location and code.
In the diagnostic report, check the XID exceptions on GPU-accelerated node field:
- Empty: no XID errors were detected.
- Not empty: one or more XID errors exist. Look up the XID in the tables below to determine the action.
XIDs you can troubleshoot yourself
For the following XIDs, follow these steps before escalating:
- Resubmit the workload and check whether the same XID recurs.
- If the same XID recurs, inspect your application code and analyze the logs to confirm the error is not caused by the code.
- If the code is not the cause, submit a ticket.
| XID | Description | Most likely cause | Action |
|---|---|---|---|
| 13 | Graphics Engine Exception | An out-of-bounds array access or an instruction error. A hardware error in rare cases. | Debug your application. |
| 31 | GPU memory page fault | The application accessed an illegal address. Driver or hardware error in rare cases. | Check your application for invalid memory accesses. |
| 43 | GPU stopped processing | The application encountered an error. | Check application logs for the root cause. |
| 45 | Preemptive cleanup due to a previous error | The application was manually stopped, or stopped because of another error such as a hardware issue or resource limit. XID 45 indicates the result, not the cause. Analyze the logs to locate the underlying issue. | No action for XID 45 itself. Investigate the preceding error in your logs. |
| 68 | NVDEC0 Exception | Hardware or driver error. | Follow the three troubleshooting steps above. |
XIDs that require a support ticket
For the following XIDs, submit a ticket and include the diagnostic information from the GPU-accelerated node in the ticket.
| XID | Description | Root cause |
|---|---|---|
| 32 | Invalid or corrupted push buffer stream | A PCIe bus quality issue. The error is reported by the DMA (Direct Memory Access) controller of the PCIe bus, which manages communication between the NVIDIA driver and the GPU. |
| 38 | Driver firmware error | A driver firmware issue. |
| 48 | Double Bit ECC Error (DBE) | An uncorrectable GPU memory ECC (Error Correction Code) error. The error is also reported to your application. The GPU or node typically needs to be reset to recover. |
| 61 | Internal micro-controller breakpoint or warning | The GPU internal engine stopped, affecting running workloads. |
| 62 | Internal micro-controller halt | Similar to XID 61. |
| 63 | ECC page retirement or row remapping recording event | A GPU memory hardware error triggered the ECC mechanism. The retirement or remapping event was successfully recorded in infoROM. Volta architecture: ECC page retirement recorded. Ampere architecture: row remapping recorded. |
| 64 | ECC page retirement or row remapper recording failure | Similar to XID 63, but the retirement or remapping information failed to be recorded in infoROM. |
| 74 | NVLink error | An NVLink hardware error. The GPU encountered a critical hardware failure and must be repaired. |
| 79 | GPU has fallen off the bus | The PCIe bus cannot find the GPU. The GPU encountered a critical hardware failure and must be repaired. |
| 92 | High single-bit ECC error rate | A hardware or driver error. |
| 94 | Contained ECC error | An uncorrectable GPU memory ECC error occurred, but the ECC mechanism successfully suppressed it to prevent it from affecting other applications. Only the faulty application is impacted. |
| 95 | Uncontained ECC error | Similar to XID 94, but the error suppression failed. Other applications on the GPU-accelerated node are also affected. |
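To check a node's kernel log against the two tables above, you could extract the XID numbers and sort them into the "troubleshoot yourself" and "submit a ticket" groups. The Python sketch below is a minimal illustration. It assumes that XID lines look like `NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ...`, which can vary between driver versions. The function names and the sample log lines are hypothetical and are not output from a real node.

```python
import re

# XID groups taken from the two reference tables.
SELF_SERVICE_XIDS = {13, 31, 43, 45, 68}
TICKET_XIDS = {32, 38, 48, 61, 62, 63, 64, 74, 79, 92, 94, 95}

# Assumed kernel-log shape: "NVRM: Xid (<device>): <xid>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def extract_xids(kernel_log: str) -> list[tuple[str, int]]:
    """Return (device, xid) pairs in the order they appear in the log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(kernel_log)]


def triage(xid: int) -> str:
    """Map an XID to the table it belongs to."""
    if xid in SELF_SERVICE_XIDS:
        return "troubleshoot"  # follow the three steps before escalating
    if xid in TICKET_XIDS:
        return "ticket"  # submit a ticket with the diagnostic information
    return "unlisted"  # not covered by either table


# Hypothetical sample input, for illustration only.
sample = """\
[  101.5] NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python, Ch 00000008
[  202.7] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.
[  303.1] usb 1-1: new high-speed USB device
"""

if __name__ == "__main__":
    for device, xid in extract_xids(sample):
        print(device, xid, triage(xid))
```

An XID that appears in neither table is reported as "unlisted" rather than guessed at, so it can be looked up separately.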