Configure GPU fault alerting and solutions

Last Updated: Dec 17, 2025

To resolve GPU faults in Container Service for Kubernetes (ACK) clusters, ACK provides monitoring, diagnostics, alerting, and recovery mechanisms from various perspectives. This topic describes how to troubleshoot and fix GPU faults.

Background Information

  1. Configure routine monitoring and alerting: You can configure alert rules based on GPU metrics (ACK GPU Monitoring 2.0) and events (ACK Node Problem Detector) to meet your business requirements. When a fault occurs on a GPU, the system triggers an alert at the earliest opportunity. This improves the responsiveness of your system to GPU faults.

  2. Container Intelligence Service (CIS)-assisted diagnostics and analytics: When a fault occurs on a GPU, the relevant GPU alerts or events may provide only limited information about the GPU status and the details of the fault. In this case, you can use CIS to diagnose the node to which the faulty GPU belongs or the pod that uses the faulty GPU. CIS generates a diagnosis report that provides detailed information about the fault. This information helps you identify the type of fault and select a suitable solution.

  3. Fault isolation and recovery: To mitigate the impact of specific faults, ACK allows you to enable node isolation and GPU isolation. After the fault is fixed, you can cancel the isolation to run your applications as normal.

Step 1: Configure routine monitoring and alerting

Node Problem Detector (NPD) supports regular GPU inspections and helps you detect GPU faults. You can specify contacts to receive alerts when NPD detects GPU faults.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, search for ack-node-problem-detector. When the component appears in the results, click Install.

    Note

    If you have previously installed the component, make sure that the version is 1.2.20 or later. For more information, see ack-node-problem-detector.

  4. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Operations > Alerts.

  5. On the Alerts page, click the Alert Contacts tab. Then, click Create and configure a contact based on the on-screen instructions.

  6. On the Alert Rules tab, select Alert Rule Set for GPU Monitoring and click Status. Then, click Modify Contacts and select the contact you created.
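
After you configure the alert rules, you can optionally verify from the command line that ack-node-problem-detector is running and that GPU-related node events are reported. The following is a minimal sketch that assumes kubectl access to the cluster; the pod names and namespace may differ depending on the component version.

    # Check that the node-problem-detector pods are running on the nodes.
    kubectl get pods -n kube-system -o wide | grep node-problem-detector

    # Inspect a node's conditions and recent events. GPU faults surface with
    # reasons such as NodeHasGPUXidError. Replace <node-name> with your node.
    kubectl describe node <node-name>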

Step 2: Use CIS to perform diagnostics and analysis

When faults occur on GPUs in a cluster, the cluster administrator is notified by text messages, emails, or DingTalk messages. You can log on to the ACK console and use CIS to locate and analyze the faults.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Inspections and Diagnostics > Diagnostics.

  3. On the Diagnosis page, click Node diagnosis. In the upper-left corner of the Node diagnosis page, click Diagnosis.

  4. In the Select node panel, specify the Node name parameter, read and select I know and agree, and then click Create diagnosis.

  5. After the diagnosis is completed, the results of all diagnostic items are displayed. For more information about the diagnostic items supported by the node diagnostics feature, see Node diagnostics.

    If an Xid error occurs on a GPU, {"GPU-326dc0ce-XXXX-77b6-XXXX-9a2eeXXXX":["43"]} is displayed for the GPUXIDErrors item, which indicates that an Xid 43 error occurred on the GPU with that UUID. For more information about the GPU errors in the diagnosis reports provided by CIS, see Events of GPU faults detected by NPD.
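
The diagnosis report identifies the faulty GPU by its UUID. If you need to map the UUID to a device index, for example to isolate the GPU by index as described in Step 3, you can query the GPUs on the node. The following sketch assumes that you can log on to the node and that the NVIDIA driver is installed; it is not part of the CIS workflow.

    # List the index, UUID, and name of each GPU on the node.
    nvidia-smi --query-gpu=index,uuid,name --format=csv

    # Optionally, search the kernel log for Xid messages reported by the driver.
    dmesg -T | grep -i xid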

Step 3: Isolate the faulty GPU

Manually isolate the faulty GPU

When a fault is detected on a GPU, you can isolate the GPU to prevent other pods from being scheduled to it. For more information, see Configure and manage the NVIDIA Device Plugin. For example, if the index of the faulty GPU is 1 or the UUID of the faulty GPU is GPU-xxx-xxx-xxx, create or modify a file named unhealthyDevices.json in the /etc/nvidia-device-plugin/ directory on the node to which the GPU belongs, and add the index or UUID to the file. Example:

// Specify the index of the GPU.
{
    "index": ["1"]
}

// Specify the UUID of the GPU.
{
    "uuid": ["GPU-xxx-xxx-xxx"]
}

Save the change and exit. After the fault is fixed, remove the relevant GPU item from the file to cancel GPU isolation.
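
For reference, the following sketch shows one way to create the file from a shell on the node and then check that the node no longer advertises the isolated GPU. The UUID is a placeholder; replace it with the UUID reported in the diagnosis report. Depending on the device plugin version, the change may take a short while or a restart of the device plugin to take effect.

    # On the node: isolate the GPU with the specified UUID (placeholder value).
    mkdir -p /etc/nvidia-device-plugin
    echo '{"uuid": ["GPU-xxx-xxx-xxx"]}' > /etc/nvidia-device-plugin/unhealthyDevices.json

    # From a machine with kubectl access: check the allocatable GPU count of the node.
    kubectl describe node <node-name> | grep nvidia.com/gpu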

Automatically isolate the faulty GPU

  1. Log on to the node to which the faulty GPU belongs and modify the /etc/kubernetes/manifests/nvidia-device-plugin.yml file by deleting the following environment variable:

    Note

    If the environment variable does not exist, automatic GPU isolation is already enabled.

          env:
          - name: DP_DISABLE_HEALTHCHECKS
            value: all
  2. After you delete the environment variable, run the following commands to restart the device plugin:

    mv /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/
    # Wait a few seconds for the system to delete the original pod. 
    mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests/
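
After you move the manifest back, you can confirm that the static pod was re-created and that the environment variable is no longer set. This is a minimal check that assumes kubectl access and that the static pod runs in the kube-system namespace; the pod name is a placeholder that you can copy from the output of the first command.

    # The nvidia-device-plugin static pod on the node should be running again.
    kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

    # The following command should return no output if DP_DISABLE_HEALTHCHECKS was removed.
    kubectl get pod -n kube-system <nvidia-device-plugin-pod-name> -o yaml | grep DP_DISABLE_HEALTHCHECKS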

Events of GPU faults detected by NPD

Event cause: NodeGPULostCard
Event content: Node has lost GPU card
Description: A GPU falls off the bus on the node.
Mitigation: Restart the node. If the issue persists, submit a ticket to contact the technical support team.

Event cause: NodeHasGPUXidError
Event content: Node GPU Xid error has occurred
Description: An Xid error occurs on a GPU on the node and an Xid message is reported.
Mitigation: Restart the node. If the issue persists, submit a ticket to contact the technical support team.

Event cause: NodeHasNvidiaSmiError
Event content: Node GPU nvidia-smi error, maybe infoROM or device fault
Description: The nvidia-smi command is run on the node. If the command fails, or if the output contains ERR or an infoROM error, the faulty GPU is reported.
Mitigation: Restart the node. If the issue persists, submit a ticket to contact the technical support team.

Event cause: NodeHasGPUECCError
Event content: Node GPU maybe have some ECC errors
Description: The node is checked for Error Correction Code (ECC) errors. If an ECC error occurs, the faulty GPU is reported.
Mitigation: Restart the node. If the issue persists, submit a ticket to contact the technical support team.

Event cause: NodeGPUHasHighTemperature
Event content: Node GPU have high temperature, above 85 degrees
Description: The temperature of the GPU is checked. If the temperature exceeds 89 degrees Celsius, the GPU is reported.
Mitigation: This event is only a warning.
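
If you prefer to watch for these events from the command line instead of relying on alerts, you can filter node events by the reasons in the preceding table. The following is a sketch that assumes kubectl access; the events reported by NPD are recorded against Node objects.

    # List events for a specific fault reason, for example Xid errors.
    kubectl get events -A --field-selector reason=NodeHasGPUXidError,involvedObject.kind=Node

    # Or scan all events for the GPU fault reasons reported by NPD.
    kubectl get events -A | grep -E 'NodeGPULostCard|NodeHasGPUXidError|NodeHasNvidiaSmiError|NodeHasGPUECCError|NodeGPUHasHighTemperature'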