
How can I troubleshoot GPU issues in a Kubernetes cluster?

Last Updated: Nov 30, 2020

Overview

An Xid error message may occur when Kubernetes schedules graphics processing unit (GPU) resources. When this error occurs, the number of GPUs that are available for scheduling is smaller than the actual number of GPUs on the nodes in the Kubernetes cluster. This topic describes how to collect and analyze diagnostic information about this issue.

 

Description

Collect diagnostic information

To collect diagnostic information, you must download and run a diagnostic script, confirm the root cause of the issue based on the diagnostic results, and then save the log file. The following sections describe the procedure in detail.

 

Download the diagnostic script

Log on to a master node, and on the command line, run the following command to download the diagnostic script:

curl -o /usr/local/bin/diagnose_gpu.sh http://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_gpu.sh
chmod +x /usr/local/bin/diagnose_gpu.sh
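As an optional sanity check (a sketch, not part of the official procedure), you can confirm that the script was downloaded and is executable before you run it:

```shell
# Optional sanity check: confirm the script exists, is executable,
# and begins with a shell interpreter line.
test -x /usr/local/bin/diagnose_gpu.sh && head -n 1 /usr/local/bin/diagnose_gpu.sh
```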

 

Run the script

On the command line, run the following command to check how to use the script:

diagnose_gpu.sh -h

The help information is returned:


Usage: diagnose_gpu.sh [ OPTION ]
  --nodes NODE_IP_LIST  give the IP of node to be diagnosed,eg: --nodes 192.168.1.1,192.168.1.2
  -h,     --help        print the help information.

You can run the script by setting the --nodes option. The option specifies the GPU-accelerated nodes that you want to diagnose. For example, on the command line, run the following command to diagnose the GPU-accelerated nodes that are assigned the IP addresses 192.168.1.1 and 192.168.1.2:

diagnose_gpu.sh --nodes 192.168.1.1,192.168.1.2

On the command line, run the following command to diagnose all GPU-accelerated nodes:

diagnose_gpu.sh

 

Check the diagnostic report

After you run the script, a simple report in the following format is displayed on the terminal.

================================================ Report ========================================
NODE NAME:                cn-XXX.10.X.X.60
NODE IP:                  10.X.X.60
DEVICE PLUGIN POD NAME:   nvidia-device-plugin-cn-XXX.10.X.X.60
DEVICE PLUGIN POD STATUS: Running
NVIDIA VERSION:
  NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: N/A
COMMON XID ERRORS:
  not found Xid errors.
--------------------------------------------------------------------------------------------
NODE NAME:                cn-XXX.10.X.X.61
NODE IP:                  10.X.X.61
DEVICE PLUGIN POD NAME:   nvidia-device-plugin-cn-XXX.10.X.X.61
DEVICE PLUGIN POD STATUS: Running
NVIDIA VERSION:
  NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: N/A
COMMON XID ERRORS:
  store xid errors to /root/diagnose_gpu_1573439265.tar.gz 
--------------------------------------------------------------------------------------------
================================================  End   ========================================

The diagnostic results in the report for both GPU-accelerated nodes include the following items:

  • NODE NAME: the name of the GPU-accelerated node that is diagnosed.
  • NODE IP: the IP address of the GPU-accelerated node.
  • DEVICE PLUGIN POD NAME: the name of the pod on which the NVIDIA device plug-in is installed. The pod runs on the GPU-accelerated node.
  • DEVICE PLUGIN POD STATUS: the status of the pod on which the NVIDIA device plug-in is installed. The pod runs on the GPU-accelerated node.
  • NVIDIA VERSION: the version of the NVIDIA driver that is installed on the GPU-accelerated node.
  • COMMON XID ERRORS: indicates whether any Xid errors occurred on the GPU-accelerated node. If errors are found, they are saved to the tar.gz file whose path is shown in the report. If no errors occurred, the message "not found Xid errors." is returned.
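If you save the report to a file, the fields above can be scanned with standard text tools. The following sketch flags any node whose device plug-in pod is not in the Running state; the file name report.txt is an assumption:

```shell
# Flag nodes whose device-plugin pod status is anything other than Running.
# report.txt is a hypothetical saved copy of the terminal report.
awk -F': *' '
  /NODE NAME/                                   { node = $2 }
  /DEVICE PLUGIN POD STATUS/ && $2 != "Running" { print node " -> " $2 }
' report.txt
```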

 

Collect logs

After you run the diagnostic script, the following output appears on the terminal. You must save the tar.gz file that is indicated in the message. If you cannot fix the error, submit a ticket to Alibaba Cloud and provide the tar.gz file in the ticket to request technical support.

2019-11-11/10:27:52  DEBUG  reports has been generated,please upload /root/diagnose_gpu_1573439265.tar.gz to us.
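If you want to inspect the archive yourself before submitting a ticket, decompress it. The timestamped file name is taken from the sample output above and will differ in your environment; the xid_errors subdirectory follows from the analysis procedure in this topic:

```shell
# Decompress the diagnostic archive and list the per-node Xid logs, if any.
tar -xzf /root/diagnose_gpu_1573439265.tar.gz -C /root
ls /root/xid_errors
```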

 

Analyze an Xid error message

The NVIDIA driver monitors the GPU devices and logs an Xid error message when it detects a problem. Each error message corresponds to an error code. For more information, see Xid Errors.

  1. Based on the diagnostic results, you can check the details of the detected Xid errors in the Common XID Errors section. The following example shows the Xid error that occurred on a GPU-accelerated node:
    NODE NAME: cn-XXX.10.X.X.61
    NODE IP: 10.X.X.61
    DEVICE PLUGIN POD NAME: nvidia-device-plugin-cn-XXX.10.X.X.61
    DEVICE PLUGIN POD STATUS: Running
    NVIDIA VERSION:
    NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: N/A
    COMMON XID ERRORS:
    store xid errors to /root/diagnose_gpu_1573439265.tar.gz
  2. Decompress the file /root/diagnose_gpu_1573439265.tar.gz and go to the xid_errors subdirectory. Find the node that is assigned the IP address 10.X.X.61 to obtain the following Xid error message, and verify that the error code is 43.
    [1296323.160491] NVRM: Xid (PCI:0000:00:08): 43, Ch 00000008, engmask 00000101
  3. Look up the error code on the NVIDIA official website to check the details of the Xid error message. For more information, see Xid errors. The following figure shows the details of the error message.

    The following figure shows a more detailed description of the error message.
  4. The description of the error message shows that the error is not caused by a defect in the NVIDIA driver, but by an application. In this case, check the relevant application code to troubleshoot the issue.
  5. To fix other Xid errors, perform the preceding steps. If an error is caused by a defect of the NVIDIA driver, report the error to NVIDIA and request technical support.
    Note: Xid error code 31 does not affect subsequent GPU resource scheduling and can be ignored. However, pay attention to the stdout logs and workload logs of your containers, because these logs may contain the error messages that cause the issue.
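The error codes in the extracted logs can be pulled out with a one-liner. This is a sketch; the log file name under the xid_errors subdirectory is an assumption:

```shell
# List the distinct Xid error codes found in a node's log file.
# The log file name is hypothetical; substitute the file for your node.
grep -oE 'Xid \(PCI:[^)]*\): [0-9]+' xid_errors/node_10.X.X.61.log \
  | awk -F': ' '{ print $2 }' | sort -u
```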

 

Solutions

Restore the number of available GPUs for a cluster at the earliest opportunity

If an error occurs on a GPU-accelerated node, delete the pod that runs the NVIDIA device plug-in on that node. Kubernetes then automatically restarts a new pod that runs the NVIDIA device plug-in. To delete the pod, run kubectl delete po [$POD_NAME] -n kube-system on the command line. In this example, the pod nvidia-device-plugin-cn-XXX.10.X.X.60 is deleted.

Note: [$POD_NAME] specifies the name of the pod that you want to delete.

  1. On the command line, run the following command to delete the pod nvidia-device-plugin-cn-XXX.10.X.X.60, which runs the faulty NVIDIA device plug-in:
    kubectl delete po nvidia-device-plugin-cn-XXX.10.X.X.60 -n kube-system
  2. On the command line, run the following command to check whether the pod that runs the NVIDIA device plug-in is in the Running state on each node. If the pod is not in the Running state, follow the instructions in the Collect logs section to collect diagnostic information and submit a ticket:
    kubectl get po -n kube-system -o wide | grep nvidia-device-plugin

 

Hardware faults

If the detected Xid error is caused by hardware faults, check related hardware and determine whether to replace the faulty hardware.

 

Disable health checks for the NVIDIA device plug-in

To disable health checks for the NVIDIA device plug-in on a node, log on to the node and edit the configuration file /etc/kubernetes/manifests/nvidia-device-plugin.yml. Add the following configuration items under the env keyword if they do not already exist in the configuration file.

Note: If you disable health checks for the NVIDIA device plug-in, Xid errors cannot be captured.

- name: DP_DISABLE_HEALTHCHECKS
  value: all

In this example, k8s-device-plugin:1.12 is used. The following sample code shows the expected result of the configurations.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  labels:
    component: nvidia-device-plugin
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  priorityClassName: system-node-critical
  hostNetwork: true
  containers:
  - image: registry-vpc.cn-XXX.aliyuncs.com/acs/k8s-device-plugin:1.12
    name: nvidia-device-plugin-ctr
    # Make this a Guaranteed pod so that it is never evicted because of the node's resource consumption.
    resources:
      limits:
        memory: "300Mi"
        cpu: "500m"
      requests:
        memory: "300Mi"
        cpu: "500m"
    env:
      - name: DP_DISABLE_HEALTHCHECKS
        value: all
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins
  volumes:
    - name: device-plugin
      hostPath:
        path: /var/lib/kubelet/device-plugins

 

Application scope

  • Alibaba Cloud Container Service for Kubernetes