Container Service for Kubernetes: GPU anomaly detection and automatic fencing

Last Updated: Dec 04, 2025

This topic describes how to install, configure, and use the ACK GPU anomaly detection component. This component helps you monitor the health of GPU resources in your ACK environment to improve cluster reliability and efficiency.

Prerequisites

  • ack-node-problem-detector (NPD) version 1.2.24 or later is installed.

  • If you use ack-nvidia-device-plugin 0.17.0 or later with NPD 1.2.24 or later, NPD automatically fences a GPU card when it detects an anomaly. When NPD detects that the GPU has recovered, it automatically deactivates the fencing.

    To view and upgrade the ack-nvidia-device-plugin component, see View the NVIDIA Device Plugin version.

ack-node-problem-detector (NPD) is a component that monitors anomaly events on cluster nodes. Created by ACK, NPD is an enhancement of the open source node-problem-detector project. It includes a wide range of check items to improve anomaly detection in GPU scenarios. When an anomaly is detected, the component generates a Kubernetes Event or a Kubernetes Node Condition based on the anomaly type.
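Because detected anomalies surface as standard Kubernetes Events and Node Conditions, you can inspect them with kubectl. The following is a minimal sketch; the sample event message is illustrative, with its key=value layout taken from the Message formats documented in the table in this topic:

```shell
# Cluster-side commands (require cluster access; <NODE_NAME> is a placeholder):
#   kubectl get events -n kube-system --field-selector reason=NvidiaXID48Error
#   kubectl describe node <NODE_NAME>   # the Conditions section lists types such as NvidiaXID48Error
# Event messages use a key=value;key=value layout. Parsing a sample message:
msg='GpuIds=0;TS=1733300000;Xid=48;MSG=An NVIDIA XID 48 error has occurred.'
echo "$msg" | tr ';' '\n' | grep '^Xid='
```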

Notes

  • NVIDIA XIDs and SXIDs are written by the GPU driver to /var/log/messages or /var/log/syslog through the NVRM event mechanism. NPD records whether each XID and SXID has been processed. If you restart the node after an XID or SXID is detected, NPD will not generate an Event or Node Condition for that XID or SXID. This occurs even if the underlying issue is not resolved (for example, XID 79 indicates that the GPU device must be replaced). NPD considers the XID resolved after a restart.

  • NPD detects NVIDIA XIDs and SXIDs by checking the /var/log/messages file or /var/log/syslog file on the node. If the dmesg log is redirected to another file, NPD cannot detect NVIDIA XIDs and SXIDs.

  • When a GPU on a node experiences an anomaly, ACK automatically fences the faulty GPU. This prevents new jobs from being scheduled to the faulty device. Automatic fencing does not restore the GPU to a normal state. You still need to manually restart the node or perform hardware maintenance based on the specific anomaly type. Enabling automatic fencing might also cause unexpected behavior. For example, an 8-card job may fail to schedule if one card on the node becomes faulty. For more information about how to disable automatic GPU fencing, see How do I disable automatic fencing for abnormal GPU cards in NPD?

  • Starting from NPD 1.2.29, the GPU anomaly detection plugin in NPD is deployed separately as a DaemonSet named ack-accel-health-monitor.

  • In some cases, a GPU anomaly on a node might prevent GPU containers from being created on that node. If the GPU anomaly detection container itself is affected and fails to start, detection cannot run correctly.

  • The NPD GPU detection plugin pod needs to check the status of GPU devices and components. This requires elevated permissions, such as privileged=true. The following permissions are required.

    Cluster RBAC permissions:

      • Node: get

      • Node/Status: update

      • Events: create

    Container permissions:

      • privileged: true

      • Read-only mounts of the host's /dev/kmsg, /usr/lib, /usr/lib64, /etc, and /proc
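As noted above, NPD reads XIDs and SXIDs from /var/log/messages or /var/log/syslog. You can quickly check on the node whether XID messages are present where NPD looks by searching the kernel log directly. This is a sketch; the sample log line uses illustrative values, with the `NVRM: Xid` prefix that the NVIDIA driver writes:

```shell
# On the node (run as root; which log file exists depends on the OS):
grep -h 'NVRM: Xid' /var/log/messages /var/log/syslog 2>/dev/null || true
# A typical driver log line (illustrative values):
line='NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.'
echo "$line" | grep -oE 'Xid \(PCI:[0-9a-f:.]+\): [0-9]+'
```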

Check items and repair recommendations

In the following table, if the repair suggestion is None, no hardware operations are required. Check your application configuration instead.

Check item name

Generates Node Condition

Generates Event

Description

Fences GPU card by default

Repair suggestion

NvidiaXID13Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID13Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 13 error has occurred.

  • Graphics Engine Exception.

  • This is usually an array-index out of bounds or an instruction error. It is rarely a hardware issue.

No

None

NvidiaXID31Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID31Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 31 error has occurred.

  • GPU memory page fault.

  • This is usually an illegal address access by the application. It is rarely a driver or hardware issue.

No

None

NvidiaXID43Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID43Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 43 error has occurred.

  • GPU stopped processing.

  • This event is recorded when your application encounters a software-induced exception and must be terminated. The GPU is still healthy.

  • In most cases, this does not indicate a problem with the driver, but an error in your application.

No

None

NvidiaXID45Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID45Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 45 error has occurred.

  • Preemptive cleanup due to previous errors. This is most likely to occur when running multiple CUDA applications and hitting a DBE.

  • This event is recorded when your application is aborted and the kernel driver terminates the GPU application running on the GPU.

  • Control-C, GPU reset, and sigkill are all examples of an application being aborted, which can trigger this event.

  • In many cases, this does not indicate an error, but an action by you or the system.

No

None

NvidiaXID48Error

Yes

  • Type: NvidiaXID48Error

  • Reason: NodeHasNvidiaXID48Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 48 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID48Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 48 error has occurred.

  • Double-bit ECC error (DBE).

  • This event is recorded when the GPU detects that an uncorrectable error has occurred. This is also reported to the application. A GPU reset or node restart is required to clear this error.

Yes

Restart the node.

NvidiaXID63Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID63Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 63 error has occurred.

  • ECC page retirement or row remapping recording event.

  • When an application encounters a GPU memory hardware error, the NVIDIA self-correction mechanism retires or remaps the faulty memory region. The retirement and remapping information must be recorded in the infoROM to be permanent.

  • Volta architecture: Records the ECC page retirement event to the infoROM successfully.

  • Ampere architecture: Records the row remapping event to the infoROM successfully.

No

None

NvidiaXID64Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID64Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 64 error has occurred.

  • ECC page retirement or row remapper recording failure.

  • The trigger scenario is similar to XID 63. XID 63 indicates that the retirement and remapping information was successfully recorded to the infoROM. XID 64 indicates that the recording operation failed.

No

None

NvidiaXID74Error

Yes

  • Type: NvidiaXID74Error

  • Reason: NodeHasNvidiaXID74Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 74 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID74Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 74 error has occurred.

  • Fatal NVLINK Error.

  • An XID generated by an NVLink hardware error. This event indicates a critical hardware failure in the GPU. The GPU must be taken offline for maintenance.

Yes

Hardware maintenance.

NvidiaXID79Error

Yes

  • Type: NvidiaXID79Error

  • Reason: NodeHasNvidiaXID79Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 79 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID79Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 79 error has occurred.

  • GPU has fallen off the bus.

  • The GPU hardware has detected that the card has fallen off the bus and is no longer accessible. This event indicates a critical hardware failure in the GPU. The GPU must be taken offline for maintenance.

Yes

Hardware maintenance.

NvidiaXID94Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID94Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 94 error has occurred.

  • Contained ECC error.

  • When an application encounters an uncorrectable GPU memory ECC error, the NVIDIA error containment mechanism tries to contain the error within the application that caused the problem. This prevents the error from affecting other applications on the GPU. When the containment mechanism successfully contains the error, an XID 94 event is generated. It only affects the application that encountered the uncorrectable ECC error.

No

None

NvidiaXID95Error

Yes

  • Type: NvidiaXID95Error

  • Reason: NodeHasNvidiaXID95Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 95 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID95Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 95 error has occurred.

  • Uncontained ECC error.

  • XID 95 indicates that the containment failed. This means all applications running on the GPU are affected. The affected GPU must be reset before the applications can be restarted.

Yes

Restart the node.

NvidiaXID119Error

Yes

  • Type: NvidiaXID119Error

  • Reason: NodeHasNvidiaXID119Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 119 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID119Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 119 error has occurred.

  • GSP RPC Timeout.

  • A timeout occurred while waiting for the GSP core to respond to an RPC message.

Yes

Restart the node.

NvidiaXID120Error

Yes

  • Type: NvidiaXID120Error

  • Reason: NodeHasNvidiaXID120Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 120 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID120Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 120 error has occurred.

  • GSP Error.

  • An error occurred in the code running on the GPU's GSP core.

Yes

Restart the node.

NvidiaXID140Error

Yes

  • Type: NvidiaXID140Error

  • Reason: NodeHasNvidiaXID140Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 140 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID140Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 140 error has occurred.

  • Unrecovered ECC Error.

  • This event can occur when the GPU driver detects uncorrectable errors in the GPU memory. These errors affect the driver's ability to mark pages for dynamic page retirement or row remapping. A GPU reset is required.

Yes

Restart the node.

NvidiaEccModeNotEnabled

Yes

  • Type: NvidiaEccModeNotEnabled

  • Reason: EccModeNotEnabled

  • Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaEccModeNotEnabled

  • Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.

ECC Mode is not enabled on the node.

No

Enable ECC Mode and restart the node.

NvidiaPendingRetiredPages

Yes

  • Type: NvidiaPendingRetiredPages

  • Reason: NodeHasNvidiaPendingRetiredPages

  • Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaPendingRetiredPages

  • Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.

  • The GPU has retired pages in a pending state.

  • A GPU reset is required for these retired pages to take effect.

Yes

Restart the node.

NvidiaRemappingRowsFailed

Yes

  • Type: NvidiaRemappedRowsFailed

  • Reason: GPUMemoryRemappingRowsFailed

  • Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row remapping.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaRemappedRowsFailed

  • Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row remapping.

The GPU has a row remapping failure.

Yes

Hardware maintenance.

NvidiaRemappingRowsRequireReset

Yes

  • Type: NvidiaRemappingRowsRequireReset

  • Reason: UncontainedEccError

  • Message: GpuIds=xxx;MSG=Row remapping requires a GPU reset.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaRemappingRowsRequireReset

  • Message: GpuIds=xxx;MSG=Row remapping requires a GPU reset.

The GPU has encountered an uncorrectable, uncontained error that requires a GPU reset to recover. The GPU should be reset as soon as possible to restore operation.

Yes

Restart the node.

NvidiaDeviceLost

Yes

  • Type: NvidiaDeviceLost

  • Reason: NodeHasNvidiaDeviceLost

  • Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaDeviceLost

  • Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible.

  • The GPU has fallen off the bus or has otherwise become inaccessible.

Yes

Hardware maintenance.

NvidiaInfoRomCorrupted

Yes

  • Type: NvidiaInfoRomCorrupted

  • Reason: NodeHasNvidiaInfoRomCorrupted

  • Message: GpuIds=xxx;MSG=The GPU infoROM is corrupted.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaInfoRomCorrupted

  • Message: GpuIds=xxx;MSG=The GPU infoROM is corrupted.

  • The infoROM is corrupted.

Yes

Hardware maintenance.

NvidiaPowerCableErr

Yes

  • Type: NvidiaPowerCableErr

  • Reason: NodeHasNvidiaPowerCableErr

  • Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached.

Yes (generates events continuously until the issue is resolved)

  • Type: Warning

  • Reason: NvidiaPowerCableErr

  • Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached.

  • The device's external power cables are not properly attached.

Yes

Hardware maintenance.

NvidiaXID44Error

Yes

  • Type: NvidiaXID44Error

  • Reason: NodeHasNvidiaXID44Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 44 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID44Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 44 error has occurred.

  • Graphics Engine fault during context switch

  • This usually means an uncorrectable error occurred on the GPU, and the error is also reported to the user application. A GPU reset or node restart is required to clear this error.

Yes

Restart the node.

NvidiaXID61Error

Yes

  • Type: NvidiaXID61Error

  • Reason: NodeHasNvidiaXID61Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 61 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID61Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 61 error has occurred.

  • Internal micro-controller breakpoint/warning (newer drivers)

  • This usually means an uncorrectable error occurred on the GPU, and the error is also reported to the user application. A GPU reset or node restart is required to clear this error.

Yes

Restart the node.

NvidiaXID62Error

Yes

  • Type: NvidiaXID62Error

  • Reason: NodeHasNvidiaXID62Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 62 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID62Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 62 error has occurred.

  • Internal micro-controller halt (newer drivers)

  • These anomalies mean an uncorrectable error occurred on the GPU, and the error is also reported to the user application. A GPU reset or node restart is required to clear this error.

Yes

Restart the node.

NvidiaXID69Error

Yes

  • Type: NvidiaXID69Error

  • Reason: NodeHasNvidiaXID69Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 69 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID69Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID 69 error has occurred.

  • Graphics Engine class error

  • These anomalies mean an uncorrectable error occurred on the GPU, and the error is also reported to the user application. A GPU reset or node restart is required to clear this error.

Yes

Restart the node.

NvidiaXID[code]Error

No

Yes (a maximum of three events is generated)

  • Type: Warning

  • Reason: NvidiaXID[code]Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An NVIDIA XID [code] error has occurred.

Other XIDs not listed in this table.

No

Submit a ticket.

NvidiaSXID[code]Error

No

Yes (a maximum of three events is generated)

  • Type: Warning

  • Reason: NvidiaSXID[code]Error

  • Message: TS=xxx;NVSwitchIds=xxx;MSG=An NVIDIA SXID [code] error has occurred.

  • SXID errors can be divided into three categories:

    • Correctable: The error has been corrected. System behavior is not affected by this type of error. No additional recovery is needed.

    • Fatal: The error is fatal to the device. System behavior is affected. The only way to recover from this error is to reset the device or restart the system.

    • Non-fatal: The error is not fatal to the device. System behavior is affected. Resetting the device or restarting the system may not be required.

No

None

Other related events

In an exclusive GPU scenario, NPD automatically fences GPU cards by default based on the anomaly check items. After a GPU is fenced, new GPU application pods are not assigned to it. You can check the effect of fencing by viewing the number of nvidia.com/gpu resources reported on the Kubernetes Node. After the GPU card recovers, ACK automatically deactivates the fencing.
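One way to observe the effect, assuming fenced cards are reported as unhealthy so that the node's allocatable nvidia.com/gpu count drops below its capacity. The node name and counts below are placeholders:

```shell
# Cluster-side commands (require cluster access; <NODE_NAME> is a placeholder):
#   kubectl get node <NODE_NAME> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
#   kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# Illustrative values for an 8-GPU node with one card fenced:
capacity=8
allocatable=7
echo "fenced GPUs: $((capacity - allocatable))"
```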

Trigger

Event content

Description

GPU card fencing

Yes

  • Type: Warning

  • Reason: NvidiaDeviceIsolated

  • Message: GpuIds=xxx;MSG=nvidia device has been isolated due to detected issues.

The GPU card is fenced due to a detected anomaly.

GPU card fencing deactivation

Yes

  • Type: Normal

  • Reason: NvidiaDeviceRecovered

  • Message: GpuIds=xxx;MSG=nvidia device has recovered from the fault.

The GPU card has recovered from the anomaly, and the fencing is deactivated.

FAQ

How do I disable automatic fencing for abnormal GPU cards in NPD?

Important

The following method to disable the GPU fencing feature in NPD is a temporary solution. This configuration is lost when you upgrade NPD. You must re-apply the configuration by following these steps after the upgrade.

Background

When a GPU on a node experiences an anomaly, ACK automatically fences the faulty GPU through NPD. This prevents jobs from being scheduled to the faulty GPU. However, automatic fencing does not perform automatic repair. You still need to manually restart or repair the node. We recommend that you configure GPU anomaly alerts to ensure prompt handling.

  • After a GPU is fenced, if the remaining GPUs on the node are insufficient for a job's requirements (for example, an 8-card job when only 7 cards are available), the job will fail to schedule. This may leave GPU resources idle.

  • After the GPU status returns to normal, the fencing on the GPU device is automatically deactivated.

  • To disable automatic fencing so that faulty GPUs continue to report resources and are not fenced, see the following solutions.

Solutions

  1. Disable the automatic GPU fencing feature in NPD.

    For component versions 1.2.24 and later, but earlier than 1.2.28

    1. Edit the NPD component YAML file.

      kubectl edit ds -n kube-system ack-node-problem-detector-daemonset
    2. Change the EnabledIsolateGPU configuration to false.

      Before:

      --EnabledIsolateGPU=true

      After:

      --EnabledIsolateGPU=false

    For component version 1.2.28 and later

    1. Edit the NPD component YAML file.

      kubectl edit ds ack-accel-health-monitor -n kube-system
    2. Change the GenerateNvidiaGpuIsolationFile configuration to false.

      Before:

      --GenerateNvidiaGpuIsolationFile=true

      After:

      --GenerateNvidiaGpuIsolationFile=false
  2. Deactivate existing GPU fencing.

    To remove existing fencing from a GPU, log on to the node where the XID error occurred and delete the /etc/nvidia-device-plugin/unhealthyDevices.json file. This deactivates the GPU fencing on the node. To prevent the GPU from being fenced again, follow the steps in the previous section to disable the automatic fencing feature.
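    The file-based deactivation above can be sketched as follows. This simulation uses a temporary directory so it is safe to run anywhere; on a real node you would delete the actual /etc/nvidia-device-plugin/unhealthyDevices.json as root. The file's contents are product-internal and are not shown here:

```shell
# Local simulation of the isolation-file lifecycle (safe to run anywhere).
tmp=$(mktemp -d)
mkdir -p "$tmp/etc/nvidia-device-plugin"
touch "$tmp/etc/nvidia-device-plugin/unhealthyDevices.json"  # stands in for the real file
# Deleting the file is what deactivates the fencing on the node:
rm "$tmp/etc/nvidia-device-plugin/unhealthyDevices.json"
ls -A "$tmp/etc/nvidia-device-plugin"  # prints nothing: fencing state cleared
```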