This topic describes how to collect diagnostic data from GPU-accelerated nodes.

Pod anomalies

If a pod that requests GPU resources fails to run as expected on a GPU-accelerated node, perform the following steps to collect diagnostic data:

  1. Run the following command to query the node on which the pod runs:
    In this example, the failed pod is named test-pod and belongs to the test-namespace namespace.
    kubectl get pod test-pod -n test-namespace -o wide
    The NODE column in the output indicates the node that you need to log on to, as shown in the sample output after this procedure.
  2. Log on to the GPU-accelerated node and run the following command to download and run a diagnostic script:
    curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash -s -- --pod test-pod
    Expected output:
    Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
  3. Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz and diagnose-gpu.log files in the current directory to the Container Service for Kubernetes (ACK) technical team for analysis.
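
The following is a sample of the output of the command in Step 1. The pod status, IP address, and node name are illustrative examples; in your output, the NODE column shows the GPU-accelerated node that hosts the pod and that you need to log on to in Step 2:

    NAME       READY   STATUS             RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
    test-pod   0/1     CrashLoopBackOff   3          10m   172.20.1.23   cn-beijing.192.168.0.100   <none>           <none>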

GPU-accelerated node anomalies

If a GPU-accelerated node fails to run as expected or errors occur in the runtime environment of the node, perform the following steps to collect diagnostic data:

  1. Log on to the GPU-accelerated node and run the following command to download and run a diagnostic script:
    curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash
    Expected output:
    Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
  2. Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz file in the current directory to the ACK technical team for analysis.
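
Before you submit the ticket, you can run the following command on the node to confirm that the diagnostic archive was generated in the current directory. The exact file name contains a timestamp, so a wildcard is used here for illustration:

    ls -lh diagnose-gpu_*.tar.gz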