This topic describes how to collect diagnostic data from GPU-accelerated nodes.

Pod anomalies

If a pod that requests GPU resources fails to run as expected on a GPU-accelerated node, perform the following steps to collect diagnostic data:

  1. Run the following command to query the node on which the pod runs:
    In this example, the failed pod is named test-pod and belongs to the test-namespace namespace.
    kubectl get pod test-pod -n test-namespace -o wide
    The NODE column in the output indicates the node that you need to log on to, as shown in the sample output after this procedure.
  2. Log on to the GPU-accelerated node and run the following command to download and run a diagnostic script:
    curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash -s -- --pod test-pod
    Expected output:
    Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
  3. Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz and diagnose-gpu.log files in the current directory to the Container Service for Kubernetes (ACK) technical team for analysis.
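
The following is a sample of the output of the command in Step 1. The pod status, IP address, and node name are illustrative examples; in your output, the NODE column shows the GPU-accelerated node that hosts the pod and that you need to log on to in Step 2:

    NAME       READY   STATUS             RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
    test-pod   0/1     CrashLoopBackOff   3          10m   172.20.1.23   cn-beijing.192.168.0.100   <none>           <none>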

GPU-accelerated node anomalies

If a GPU-accelerated node fails to run as expected or errors occur in the runtime environment of the node, perform the following steps to collect diagnostic data:

  1. Log on to the GPU-accelerated node and run the following command to download and run a diagnostic script:
    curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash
    Expected output:
    Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
  2. Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz file in the current directory to the ACK technical team for analysis.
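
Before you submit the ticket, you can run the following command on the node to confirm that the diagnostic archive was generated in the current directory. The exact file name contains a timestamp, so a wildcard is used here for illustration:

    ls -lh diagnose-gpu_*.tar.gz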