This topic describes how to use Kubernetes Event Center to monitor GPU-accelerated instances and configure alerts for Xid messages, which indicate GPU errors. Xid messages provide diagnostic information that you can use to debug errors reported by the NVIDIA driver.
Prerequisites
- A managed or dedicated GPU cluster is created. For more information, see Create a managed GPU cluster or Create a dedicated GPU cluster for heterogeneous computing.
- Kubernetes Event Center is created for the cluster. For more information, see Create and use a Kubernetes event center.
Background information
An Xid message is an error report from the NVIDIA driver that is printed to the kernel log or event log of the operating system. An Xid message indicates that a general GPU error occurred. In most cases, the error occurs because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. You can use Xid messages to determine whether an issue originates in hardware, NVIDIA software, or the application.
GPU drivers are prone to Xid errors. You can use Kubernetes Event Center to monitor Xid errors and configure alerts so that you can identify and troubleshoot issues as early as possible.
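To illustrate what an Xid message in the kernel log carries, the following Python sketch extracts the PCI address, the Xid code, and the trailing detail text from log lines. It is only an illustration and is not how Kubernetes Event Center detects Xid errors. The line shape `NVRM: Xid (PCI:0000:3b:00): 79, ...` is an assumption based on commonly seen driver output and may differ across driver versions. The names `XidEvent`, `parse_xid_line`, and `find_xid_events` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line shape (may vary by driver version):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is treated as optional because some drivers omit it.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+)(?:, (?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    """Hypothetical container for the fields of one Xid message."""
    pci_address: str
    code: int
    detail: str


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the Xid fields found in one log line, or None if there are none."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(
        pci_address=match.group("pci"),
        code=int(match.group("code")),
        detail=(match.group("detail") or "").strip(),
    )


def find_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Collect every Xid message from an iterable of kernel log lines."""
    events = []
    for line in lines:
        event = parse_xid_line(line)
        if event is not None:
            events.append(event)
    return events
```

For example, you could pass the lines of saved kernel log output to `find_xid_events` and group the results by `code` to see which error types recur on a node.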