This topic describes how to use Kubernetes Event Center to monitor GPU-accelerated instances and configure alerts for GPU errors that generate Xid messages. It also explains what Xid messages are and what they indicate.
An Xid message is an error report from the NVIDIA driver that is printed to the kernel logs or event logs of the operating system. An Xid message indicates that a general GPU error has occurred. Typically, the error occurs because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. The messages can indicate issues with the hardware, the NVIDIA software, or your applications.
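To make the shape of an Xid message concrete, the following Python snippet is a minimal sketch of how such a line could be recognized in a kernel log. It assumes the commonly seen `NVRM: Xid (<PCI address>): <code>, <detail>` layout. The sample line, the `XidEvent` type, and the `parse_xid` helper are hypothetical and used only for illustration. They are not part of Kubernetes Event Center, and the exact text may differ between driver versions.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    """Hypothetical container for the fields of one Xid log line."""
    pci_address: str
    code: int
    detail: str


# Assumed layout: "NVRM: Xid (PCI:0000:3b:00): 79, <detail>".
# The "PCI:" prefix is treated as optional, and the detail part may be absent.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?"
)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid fields found in a log line, or None if there are none."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(
        pci_address=match.group(1),
        code=int(match.group(2)),
        detail=(match.group(3) or "").strip(),
    )


# Illustrative sample line, not taken from a real system.
sample = "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus."
event = parse_xid(sample)
print(event)
```

The numeric code is the part that identifies the type of error, so a parser like this would typically be used to group or filter log lines by code before investigating them further.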
GPU-accelerated instances are prone to errors that generate Xid messages. You can use Kubernetes Event Center to detect Xid errors and trigger alerts. This allows you to discover issues and locate causes at the earliest opportunity.
- Log on to the Log Service console and click Start in the K8s Event Center section.
For more information, see Create and use a Kubernetes event center.
- In the left-side navigation pane of the K8s Event Center page, click the target cluster, and then click Event Overview.
On the Event Overview tab, you can view Xid messages and the triggered alerts.
- In the left-side navigation pane, click the target cluster, and then click Alert Configuration in the drop-down list.
- Click Add Notification Method. On the Add Notification Method page, configure the notification method, and then click OK.
You can choose to receive alerts through SMS messages, emails, or DingTalk notifications, and then customize the alert content. In the following example, alerts are sent through SMS messages.
- After you configure the notification method, click Modify in the upper-right corner of the Events tab. Select Kubernetes GPU Xid Alerts, and select SMS in the Kubernetes GPU Xid Alerts drop-down list.
- On the Events tab, click Save.
After an alert is triggered, you receive an SMS message from Alibaba Cloud.