GPU faults can be caused by various factors, including hardware faults, driver faults, GPU containerized environment faults, and application faults. These faults may compromise the performance of computing tasks and interrupt applications. This topic provides a structured, end-to-end GPU fault diagnostics workflow to help you quickly identify and resolve GPU issues.
Before you start, we recommend that you learn about common GPU faults and suggested solutions.
Introduction
To quickly identify and fix GPU faults, an end-to-end fault diagnostics workflow is needed. The workflow consists of the following stages:
Triggers of fault diagnostics: Identify the sources that can initiate the diagnostics workflow, such as events, alerts, inspections, and applications.
Fault diagnostics: Check whether faults exist, observe the symptoms, and identify the cause by using logs, monitoring systems, and other diagnostic tools.
Fault isolation: Isolate the faulty component from the workflow to prevent the fault from spreading.
Fault confirmation: After diagnostics, confirm that the fault is not a false positive and decide on the appropriate measures.
Fault recovery: Fix the fault based on the cause identified in the previous step.
Fault isolation removal: After the issue is resolved, return the repaired resources to the production environment and restore your workloads.
The following figure shows the entire workflow.
Step 1: Triggers of fault diagnostics
Triggers of fault diagnostics include Kubernetes events, Managed Service for Prometheus, routine inspections, manually triggered diagnostics, ECS events, applications, and application controllers.
Kubernetes events
Kubernetes events record what happens in a cluster and are generated when resources transition between states. For example, an event is generated when a pod is created, deleted, or fails to be scheduled. In Kubernetes GPU fault diagnostics and recovery scenarios, you can use the ack-node-problem-detector component to monitor the status of GPU-accelerated nodes in real time. When a GPU fault is detected, the ack-node-problem-detector component reports an event. The fault diagnostics and recovery system captures the event and triggers the GPU fault diagnostics and recovery workflow. For more information about how to enable event monitoring in a cluster, see Event monitoring.
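The following is a minimal sketch, assuming the kubernetes Python client and sufficient RBAC permissions, of how an external controller might watch node-related events and trigger diagnostics. The keyword filter and the trigger_gpu_diagnostics function are illustrative assumptions; the exact event reasons reported by ack-node-problem-detector depend on your component version and configuration.

```python
# Minimal sketch: watch node-related cluster events and trigger diagnostics
# when a GPU-related event is reported. The keyword filter below is an
# assumption; the exact reasons depend on your ack-node-problem-detector
# configuration.
from kubernetes import client, config, watch

GPU_EVENT_KEYWORDS = ("gpu", "xid", "nvidia")  # hypothetical keyword filter


def trigger_gpu_diagnostics(node_name: str, reason: str, message: str) -> None:
    # Placeholder: call your own diagnostics pipeline here.
    print(f"Diagnostics triggered for node {node_name}: {reason} - {message}")


def main() -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    w = watch.Watch()
    # Only watch events whose involved object is a node.
    for item in w.stream(v1.list_event_for_all_namespaces,
                         field_selector="involvedObject.kind=Node"):
        event = item["object"]
        reason = (event.reason or "").lower()
        message = (event.message or "").lower()
        if any(k in reason or k in message for k in GPU_EVENT_KEYWORDS):
            trigger_gpu_diagnostics(event.involved_object.name,
                                    event.reason, event.message)


if __name__ == "__main__":
    main()
```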
Managed Service for Prometheus
Managed Service for Prometheus is a monitoring and alerting service based on open source Prometheus that collects time series data in real time. In Container Service for Kubernetes (ACK) clusters, Managed Service for Prometheus collects the metrics of nodes, pods, and containers, such as GPU utilization, memory usage, and hardware temperature. When an abnormal metric is detected, Managed Service for Prometheus triggers an alert to start fault diagnostics.
GPU monitoring 2.0 provided by ACK clusters is a full-stack GPU monitoring system developed based on NVIDIA Data Center GPU Manager (DCGM). For more information about how to enable GPU monitoring and the introduction to the related metrics, see GPU monitoring 2.0.
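As a rough illustration, the following sketch queries a Prometheus-compatible endpoint for DCGM GPU metrics and flags anomalies. The endpoint URL, the temperature threshold, and the metric names (standard dcgm-exporter names such as DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_XID_ERRORS) are assumptions; verify them against the metrics actually exposed by GPU monitoring 2.0 in your cluster.

```python
# Minimal sketch: query a Prometheus-compatible endpoint for DCGM GPU metrics
# and flag anomalies. The endpoint URL, threshold, and metric names are
# assumptions; verify them against your GPU monitoring 2.0 setup.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint
TEMP_THRESHOLD_C = 85  # assumed threshold


def query(promql: str) -> list:
    # Standard Prometheus HTTP API for instant queries.
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


def check_gpu_health() -> list:
    findings = []
    # GPUs running hotter than the threshold.
    for sample in query(f"DCGM_FI_DEV_GPU_TEMP > {TEMP_THRESHOLD_C}"):
        findings.append(("high_temperature", sample["metric"], sample["value"][1]))
    # XID errors reported by the driver.
    for sample in query("DCGM_FI_DEV_XID_ERRORS > 0"):
        findings.append(("xid_error", sample["metric"], sample["value"][1]))
    return findings


if __name__ == "__main__":
    for kind, labels, value in check_gpu_health():
        print(kind, labels.get("Hostname", "unknown"), value)
```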
Routine inspections
You can enable the cluster inspection feature in an ACK cluster and configure routine inspection rules. For example, you can configure a rule to inspect GPU-accelerated nodes at 21:00 every Sunday. This way, the fault diagnostics and recovery workflow can be triggered when a GPU-accelerated node enters an abnormal state. For more information about how to enable cluster inspection in an ACK cluster, see Cluster Inspections.
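The ACK cluster inspection feature is configured in the console and does not require custom code. As a self-managed illustration of what a routine GPU inspection can check on a node, the following sketch queries nvidia-smi and reports basic anomalies; the temperature threshold is an assumption.

```python
# Minimal sketch of a routine GPU inspection script run on a GPU-accelerated
# node. This is a self-managed illustration only; the ACK cluster inspection
# feature is configured in the console and does not require this script.
import subprocess

TEMP_THRESHOLD_C = 85  # assumed threshold


def inspect_gpus() -> list:
    findings = []
    try:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,temperature.gpu,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        # nvidia-smi failing at all is itself a strong fault signal.
        return [("nvidia_smi_unavailable", str(exc))]
    for line in out.strip().splitlines():
        index, temp, _util = [field.strip() for field in line.split(",")]
        if temp.isdigit() and int(temp) > TEMP_THRESHOLD_C:
            findings.append(("high_temperature", f"GPU {index}: {temp} C"))
    return findings


if __name__ == "__main__":
    for kind, detail in inspect_gpus():
        print(kind, detail)  # feed findings into your diagnostics workflow
```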
Manually triggered diagnostics
In some scenarios, if you want to diagnose a GPU-accelerated node, you can manually trigger the fault diagnostics and recovery workflow. For more information, see Diagnose GPU-accelerated nodes.
ECS events
In some scenarios, Elastic Compute Service (ECS) events are reported when GPU faults occur. You must restart the corresponding node to migrate the GPU. For more information about ECS events, see Overview.
Applications
GPU fault diagnostics can be triggered by applications in the following ways:
Logs: Applications may record detailed operational logs. If exceptions or errors are found in the logs, you can analyze the log content to trigger fault diagnostics (see the sketch after this list).
Monitoring metrics of applications: Many applications expose their performance metrics, such as the response time and processing speed. The fault diagnostics workflow is triggered when abnormal metrics are detected.
Diagnostic tools of applications: Some applications have built-in diagnostic tools to monitor their status. The diagnostic tools automatically start fault diagnostics when they detect faults.
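The following is a minimal sketch of the log-based trigger described above: it scans an application log file for common GPU error signatures and reports matches that could start the diagnostics workflow. The signature patterns and the log path are illustrative assumptions.

```python
# Minimal sketch: scan an application log file for common GPU error
# signatures. The patterns and the log path are illustrative assumptions.
import re

GPU_ERROR_PATTERNS = [
    re.compile(r"CUDA error", re.IGNORECASE),
    re.compile(r"CUDA_ERROR_\w+"),
    re.compile(r"\bXid\b"),
    re.compile(r"GPU is lost", re.IGNORECASE),
]


def scan_log(path: str) -> list:
    matches = []
    with open(path, "r", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if any(p.search(line) for p in GPU_ERROR_PATTERNS):
                matches.append((lineno, line.strip()))
    return matches


if __name__ == "__main__":
    # Hypothetical log path; matches here could trigger the diagnostics workflow.
    for lineno, line in scan_log("/var/log/my-training-app.log"):
        print(f"line {lineno}: {line}")
```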
Application controllers
In some scenarios, when the controller to which an application belongs detects an abnormal application state, the controller triggers the fault diagnostics and recovery workflow at the application layer.
Step 2: Fault diagnostics
Fault diagnostics is an important part of the fault diagnostics and recovery workflow. The following figure shows the fault diagnostics workflow in an ACK cluster. The workflow covers the following diagnostic items:
Hardware fault diagnostics.
Driver fault diagnostics.
Containerized environment fault diagnostics.
cGPU fault diagnostics.
Application fault diagnostics.
Other diagnostic systems. For example, when a GPU error log system is installed in the cluster, the system can quickly diagnose the cause of faults after receiving application error logs.
The system diagnoses layers in sequence from the lower hardware layer to the upper application layer because faults in upper layers may be caused by faults in lower layers. For example, if an application cannot use GPU resources, the issue may lie in the underlying GPU hardware. This bottom-up workflow avoids interference from lower-layer faults when diagnosing upper-layer faults, as sketched below.
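The following sketch only illustrates the bottom-up ordering; the individual check functions are placeholders for real hardware, driver, containerized environment, cGPU, and application checks.

```python
# Illustration of the bottom-up diagnostic order. The check functions are
# placeholders; in a real system each would run the corresponding layer's
# diagnostics and return a fault description, or None if the layer is healthy.
from typing import Callable, Optional, Tuple


def check_hardware(node: str) -> Optional[str]:
    return None  # placeholder


def check_driver(node: str) -> Optional[str]:
    return None  # placeholder


def check_container_env(node: str) -> Optional[str]:
    return None  # placeholder


def check_cgpu(node: str) -> Optional[str]:
    return None  # placeholder


def check_application(node: str) -> Optional[str]:
    return None  # placeholder


# Ordered from the lowest layer to the highest layer.
CHECKS: Tuple[Tuple[str, Callable[[str], Optional[str]]], ...] = (
    ("hardware", check_hardware),
    ("driver", check_driver),
    ("containerized environment", check_container_env),
    ("cGPU", check_cgpu),
    ("application", check_application),
)


def diagnose(node: str) -> Optional[Tuple[str, str]]:
    # Stop at the first (lowest) layer that reports a fault, because faults in
    # upper layers may simply be symptoms of this lower-layer fault.
    for layer, check in CHECKS:
        fault = check(node)
        if fault is not None:
            return layer, fault
    return None
```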
Step 3: Fault isolation
If a fault is detected during the fault diagnostics workflow, fault isolation is performed. The following figure shows the fault isolation workflow.
In the preceding figure, fault isolation consists of scheduling layer isolation and application layer isolation. When only an application fault occurs, the application does not affect the use of GPU resources by other applications on the node. Therefore, neither scheduling layer isolation nor application layer isolation is required in that case.
Scheduling layer isolation:
Node isolation: Perform the cordon operation on the node (a minimal cordon sketch is provided after this list). Driver faults, containerized environment faults, and cGPU faults affect the use of GPUs on the entire node, so you must isolate the node.
GPU isolation: Report the IDs of healthy GPUs to the kubelet by using the Device Plugin and ignore the IDs of unhealthy GPUs. When a hardware fault affects only a single GPU and does not affect the use of other GPUs on the node, isolating only the faulty GPU reduces GPU resource waste. However, if a single GPU failure affects the use of other GPUs on the node, node isolation is required.
Application layer isolation: In addition to scheduling layer isolation, the application controller also needs to perform isolation at the application layer in some scenarios. For example:
Avoid scheduling application pods that tolerate all taints to the faulty node.
Avoid scheduling application pods to the faulty node if nodes or GPUs are already specified for the pods.
In elastic training scenarios, the application controller may terminate other healthy pods of a training task on the faulty node and then launch these pods on other nodes. The controller also avoids scheduling other training tasks to the faulty node during the fault and recovery periods.
For example, during elastic training, if a pod of a training task is restarted due to a GPU-accelerated node fault, the event source of the node reports the fault to the training task controller. The controller identifies the fault and related event in advance and performs a scale-in operation to isolate the faulty node or GPU. During the node fault and recovery periods, the application controller avoids repeatedly migrating training task pods. For more information about how to deploy elastic model training tasks and scale resources for training tasks, see Elastic training based on Horovod.
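The following is a minimal sketch of scheduling layer node isolation, assuming the kubernetes Python client and sufficient RBAC permissions. It marks a node unschedulable, which has the same effect as kubectl cordon; the node name is hypothetical.

```python
# Minimal sketch: cordon a faulty GPU-accelerated node (scheduling layer
# isolation) by marking it unschedulable, the same effect as
# `kubectl cordon <node>`.
from kubernetes import client, config


def cordon_node(node_name: str) -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    # New pods are no longer scheduled to the node; existing pods keep running
    # until the node is drained.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})


if __name__ == "__main__":
    cordon_node("gpu-node-example")  # hypothetical node name
```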
Step 4: Fault confirmation
After fault isolation is completed, you need to further confirm whether the fault is a false positive.
If the diagnostic result turns out to be a false positive, cancel the recovery workflow.
If the fault is confirmed, assess, based on your business scenario, whether you can perform operations that may affect your workloads (such as restarting a node) during the recovery workflow.
Step 5: Fault recovery
After you confirm that the fault must be fixed, initiate the fault recovery workflow. The following figure shows the workflow.
You must select an appropriate recovery method based on the fault.
Hardware faults, driver faults, and cGPU faults: You must ensure that no applications use the GPU resources on the faulty node, which means the recovery operation must be performed after the node is drained (a minimal drain sketch is provided after this list). However, not all types of applications support node draining. For example, if all pods of an elastic training task on a node are drained, the training task may fail because the current state of the task may not be saved. Therefore, before the node is drained, the system notifies the application controller to migrate the pods that use GPU resources on the node, and you are notified after the migration is completed.
Containerized environment faults: Reinstall the NVIDIA Container Toolkit or restart the pod that runs the Device Plugin. You do not need to migrate the applications on the GPU-accelerated node.
Application faults: Identify and resolve the problem in the application.
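The following is a minimal drain sketch, assuming the kubernetes Python client and sufficient RBAC permissions: it evicts the pods on an already cordoned node before hardware, driver, or cGPU recovery. DaemonSet pods are skipped, the node name is hypothetical, and on older client versions the eviction class may be named V1beta1Eviction instead of V1Eviction; PodDisruptionBudgets may also reject evictions.

```python
# Minimal sketch: evict the pods on an already cordoned node before hardware,
# driver, or cGPU recovery. DaemonSet pods are skipped because they would be
# re-created on the same node. On older client versions, use V1beta1Eviction
# instead of V1Eviction.
from kubernetes import client, config


def drain_node(node_name: str) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        owners = pod.metadata.owner_references or []
        if any(owner.kind == "DaemonSet" for owner in owners):
            continue  # DaemonSet pods are managed by their controller
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(name=pod.metadata.name,
                                         namespace=pod.metadata.namespace))
        # PodDisruptionBudgets may reject the eviction; handle errors as needed.
        v1.create_namespaced_pod_eviction(name=pod.metadata.name,
                                          namespace=pod.metadata.namespace,
                                          body=eviction)


if __name__ == "__main__":
    drain_node("gpu-node-example")  # hypothetical node name
```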
Step 6: Fault isolation removal
After the fault is fixed, you must remove fault isolation so that the repaired node or GPU can be used in the production environment as normal and applications run smoothly. For scheduling layer isolation, this means making the node schedulable again or reporting the repaired GPU as healthy, as sketched below.
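A minimal sketch of removing node-level isolation, assuming the kubernetes Python client: it marks the repaired node schedulable again, which has the same effect as kubectl uncordon; the node name is hypothetical.

```python
# Minimal sketch: mark the repaired node schedulable again, the same effect
# as `kubectl uncordon <node>`.
from kubernetes import client, config


def uncordon_node(node_name: str) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {"spec": {"unschedulable": False}})


if __name__ == "__main__":
    uncordon_node("gpu-node-example")  # hypothetical node name
```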