This topic describes how to install, configure, and use the ACK GPU anomaly detection component. This component helps you monitor the health of GPU resources in your ACK environment to improve cluster reliability and efficiency.
Prerequisites
ack-node-problem-detector (NPD) version 1.2.24 or later is installed.
If you use ack-nvidia-device-plugin 0.17.0 or later with NPD 1.2.24 or later, NPD automatically fences a GPU card when it detects an anomaly. When NPD detects that the GPU has recovered, it automatically deactivates the fencing.
To view and upgrade the ack-nvidia-device-plugin component, see View the NVIDIA Device Plugin version.
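If you are not sure which versions are installed, you can query the component DaemonSets directly. A minimal sketch, assuming both components run in the kube-system namespace and that the image tag reflects the component version (verify against your cluster):

```bash
# List the NPD and NVIDIA device plugin DaemonSets with their images.
# The namespace and name patterns are assumptions; adjust to your cluster.
kubectl get ds -n kube-system -o wide | grep -Ei 'node-problem-detector|nvidia-device-plugin'
```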
ack-node-problem-detector (NPD) is a component that monitors anomaly events on cluster nodes. Created by ACK, NPD is an enhancement of the open source node-problem-detector project. It includes a wide range of check items to improve anomaly detection in GPU scenarios. When an anomaly is detected, the component generates a Kubernetes Event or a Kubernetes Node Condition based on the anomaly type.
Notes
NVIDIA XIDs and SXIDs are written by the GPU driver to `/var/log/messages` or `/var/log/syslog` through the NVRM event mechanism. NPD records whether each XID and SXID has been processed. If you restart the node after an XID or SXID is detected, NPD does not generate an Event or Node Condition for that XID or SXID again, even if the underlying issue is not resolved (for example, XID 79 indicates that the GPU device must be replaced). NPD considers the XID resolved after a restart.
NPD detects NVIDIA XIDs and SXIDs by checking the `/var/log/messages` or `/var/log/syslog` file on the node. If the dmesg log is redirected to another file, NPD cannot detect NVIDIA XIDs and SXIDs.
When a GPU on a node experiences an anomaly, ACK automatically fences the faulty GPU. This prevents new jobs from being scheduled to the faulty device. Automatic fencing does not restore the GPU to a normal state. You still need to manually restart the node or perform hardware maintenance based on the specific anomaly type. Enabling automatic fencing might cause unexpected behavior. For example, an 8-card job may fail to schedule if one card becomes faulty. You can disable automatic GPU fencing in the following ways:
Upgrade the NPD component: Starting from NPD 1.2.29, the automatic fencing feature for faulty GPU devices is disabled by default in the NPD GPU detection plugin.
Manually disable automatic fencing: For detailed steps, see How do I disable automatic fencing for faulty GPU cards in NPD?.
The NVIDIA Device Plugin component supports automatic fencing for faulty GPU cards in specific versions, but the method to disable this feature is different. For more information, see How do I disable the native GPU fencing feature of the NVIDIA Device Plugin?.
Starting from NPD 1.2.29, the GPU anomaly detection plugin in NPD is deployed separately as a DaemonSet named ack-accel-health-monitor.
In some cases, a GPU anomaly on a node might prevent GPU containers from being created on that node. The GPU anomaly detection container might be affected and fail to start. This prevents the detection from running correctly.
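If the detection container cannot run, you can still check for XID entries manually by searching the kernel log that NPD reads. A minimal sketch, assuming the default log locations and the standard NVRM log prefix:

```bash
# Run on the GPU node. XID entries are written with the "NVRM: Xid" prefix;
# the log file is /var/log/messages or /var/log/syslog depending on the OS.
grep -i "NVRM: Xid" /var/log/messages /var/log/syslog 2>/dev/null
```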
The NPD GPU detection plugin pod needs to check the status of GPU devices and components, which requires elevated permissions such as `privileged=true`. See the following table for details.
| Cluster RBAC permissions | Container permissions |
| --- | --- |
| Node: get | privileged: true |
| Node/Status: update | Read-only mount of the host's `/dev/kmsg` |
| Events: create | Read-only mount of the host's `/usr/lib` |
|  | Read-only mount of the host's `/etc` |
|  | Read-only mount of the host's `/usr/lib64` |
|  | Read-only mount of the host's `/proc` |
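To see how these permissions appear on a running cluster, you can inspect the security context and host mounts of the detection DaemonSet. A minimal sketch, assuming NPD 1.2.29 or later (where the plugin runs as the ack-accel-health-monitor DaemonSet); for earlier versions, substitute the ack-node-problem-detector-daemonset name:

```bash
# Show the container security context (expect privileged: true).
kubectl get ds ack-accel-health-monitor -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].securityContext}'

# Show the host paths mounted into the pod (expect /dev/kmsg, /usr/lib, /etc, ...).
kubectl get ds ack-accel-health-monitor -n kube-system \
  -o jsonpath='{range .spec.template.spec.volumes[*]}{.hostPath.path}{"\n"}{end}'
```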
Check items and repair recommendations
If the repair suggestion is None, no hardware operations are required; instead, check your application configuration.
| Check item name | Generates Node Condition | Generates Event | Description | Fences GPU card by default | Repair suggestion |
| --- | --- | --- | --- | --- | --- |
| NvidiaXID13Error | No | Yes |  | No | None |
| NvidiaXID31Error | No | Yes |  | No | None |
| NvidiaXID43Error | No | Yes |  | No | None |
| NvidiaXID45Error | No | Yes |  | No | None |
| NvidiaXID48Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID63Error | No | Yes |  | No | None |
| NvidiaXID64Error | No | Yes |  | No | None |
| NvidiaXID74Error | Yes | Yes |  | Yes | Hardware maintenance. |
| NvidiaXID79Error | Yes | Yes |  | Yes | Hardware maintenance. |
| NvidiaXID94Error | No | Yes |  | No | None |
| NvidiaXID95Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID119Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID120Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID140Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaEccModeNotEnabled | Yes | Yes (generates events continuously until the issue is resolved) | ECC Mode is not enabled on the node. | No | Enable ECC Mode and restart the node. |
| NvidiaPendingRetiredPages | Yes | Yes (generates events continuously until the issue is resolved) |  | Yes | Restart the node. |
| NvidiaRemappingRowsFailed | Yes | Yes (generates events continuously until the issue is resolved) | The GPU has a row remapping failure. | Yes | Hardware maintenance. |
| NvidiaRemappingRowsRequireReset | Yes | Yes (generates events continuously until the issue is resolved) | The GPU has encountered an uncorrectable, uncontained error that requires a GPU reset to recover. The GPU should be reset as soon as possible to restore operation. | Yes | Restart the node. |
| NvidiaDeviceLost | Yes | Yes (generates events continuously until the issue is resolved) |  | Yes | Hardware maintenance. |
| NvidiaInfoRomCorrupted | Yes | Yes (generates events continuously until the issue is resolved) |  | Yes | Hardware maintenance. |
| NvidiaPowerCableErr | Yes | Yes (generates events continuously until the issue is resolved) |  | Yes | Hardware maintenance. |
| NvidiaXID44Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID61Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID62Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID69Error | Yes | Yes |  | Yes | Restart the node. |
| NvidiaXID[code]Error | No | Yes (generates only three events) | Other XIDs not listed in this table. | No |  |
| NvidiaSXID[code]Error | No | Yes (generates only three events) |  | No | None |
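After the component is running, you can check whether any of these check items have fired on a node by looking at its Node Conditions and at recent Events. A minimal sketch, using a placeholder node name (substitute your own):

```bash
NODE=cn-hangzhou.192.168.0.1   # placeholder node name

# Node Conditions added by the check items (for example, NvidiaXID48Error).
kubectl get node "$NODE" \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'

# Recent Events reported for the node.
kubectl get events --all-namespaces --field-selector involvedObject.name="$NODE"
```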
Other related events
In an exclusive GPU scenario, NPD automatically fences GPU cards by default based on the anomaly check items. After a GPU is fenced, new GPU application pods are not assigned to it. You can check the effect of fencing by viewing the number of nvidia.com/gpu resources reported on the Kubernetes Node. After the GPU card recovers, ACK automatically deactivates the fencing.
| Trigger | Event content | Description |
| --- | --- | --- |
| GPU card fencing | Yes | The GPU card is fenced due to a detected anomaly. |
| GPU card fencing deactivation | Yes | The GPU card has recovered from the anomaly, and the fencing is deactivated. |
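To confirm whether fencing has taken effect on a node, you can compare the allocatable nvidia.com/gpu count with the number of physical cards. A minimal sketch (the node name is a placeholder):

```bash
# The allocatable GPU count drops when a card is fenced and is restored after recovery.
kubectl get node cn-hangzhou.192.168.0.1 \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```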
FAQ
How do I disable automatic fencing for faulty GPU cards in NPD?
The following method to disable the GPU fencing feature in NPD is a temporary solution. This configuration is lost when you upgrade NPD. You must re-apply the configuration by following these steps after the upgrade.
Background
When a GPU on a node experiences an anomaly, ACK automatically fences the faulty GPU through NPD. This prevents jobs from being scheduled to the faulty GPU. However, automatic fencing does not perform automatic repair. You still need to manually restart or repair the node. We recommend that you configure GPU anomaly alerts to ensure prompt handling.
After a GPU is fenced, if the remaining GPUs on the node are insufficient for a job's requirements (for example, an 8-card job when only 7 cards are available), the job will fail to schedule. This may leave GPU resources idle.
After the GPU status returns to normal, the fencing on the GPU device is automatically deactivated.
To disable automatic fencing so that faulty GPUs continue to report resources and are not fenced, see the following solutions.
Solutions
Disable the automatic GPU fencing feature in NPD.
For component versions 1.2.24 and later, but earlier than 1.2.28
1. Edit the NPD component YAML file.

   `kubectl edit ds -n kube-system ack-node-problem-detector-daemonset`

2. Change the `EnabledIsolateGPU` configuration to `false`.

   Before: `--EnabledIsolateGPU=true`

   After: `--EnabledIsolateGPU=false`
For component version 1.2.28 and later
1. Edit the NPD component YAML file.

   `kubectl edit ds ack-accel-health-monitor -n kube-system`

2. Change the `GenerateNvidiaGpuIsolationFile` configuration to `false`.

   Before: `--GenerateNvidiaGpuIsolationFile=true`

   After: `--GenerateNvidiaGpuIsolationFile=false`
Deactivate existing GPU fencing.
To remove existing fencing from a GPU, log on to the node where the XID error occurred and delete the `/etc/nvidia-device-plugin/unhealthyDevices.json` file. This deactivates the GPU fencing on the node. To prevent the GPU from being fenced again, follow the steps in the previous section to disable the automatic fencing feature.
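The following sketch pulls the preceding steps together: confirm that the fencing flag is disabled in the DaemonSet, remove the existing isolation file on the affected node, and recheck the reported GPU count. It assumes NPD 1.2.28 or later and a placeholder node name; for earlier versions, use the DaemonSet name and flag described above:

```bash
# 1. Verify the flag after editing the DaemonSet (expect ...=false).
kubectl get ds ack-accel-health-monitor -n kube-system -o yaml | grep GenerateNvidiaGpuIsolationFile

# 2. On the node where the XID occurred, remove the isolation file.
sudo rm -f /etc/nvidia-device-plugin/unhealthyDevices.json

# 3. Confirm the node reports its full GPU count again (placeholder node name).
kubectl get node cn-hangzhou.192.168.0.1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```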