
Simple Log Service: Collect monitoring data from NVIDIA GPU servers

Last Updated: Nov 03, 2023

NVIDIA GPU servers expose a variety of metrics, such as GPU utilization, memory usage, and temperature. You can collect these metrics to the Full-stack Monitoring application and then view them in the Simple Log Service console.

Prerequisites

A Full-stack Monitoring instance is created. For more information, see Create an instance.

Step 1: Install an NVIDIA GPU driver

Simple Log Service uses the nvidia-smi command to collect GPU information. The command is included in the NVIDIA GPU driver. Therefore, you must install a GPU driver before you can use Simple Log Service to collect monitoring data from NVIDIA GPU servers. For more information, see Install a GPU driver on a GPU-accelerated compute-optimized Linux instance. If you use a GPU-accelerated instance of Elastic Compute Service (ECS) on which the driver is automatically installed, you can skip this step. You can run the commands in the following sketch to verify the installation.
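
The following commands are a minimal sketch that you can run on the GPU server to verify that the driver and the nvidia-smi command work as expected. The queried fields are only examples; run nvidia-smi --help-query-gpu to list all supported fields.

  # Print the driver version and the status of each GPU.
  nvidia-smi

  # Query sample GPU metrics in CSV format.
  nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu --format=csv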

Step 2: Create a Logtail configuration

  1. Log on to the Simple Log Service console.

  2. On the Intelligent O&M tab of the Log Application section, click Full-stack Monitoring.

  3. On the Full-stack Monitoring page, click the instance.

  4. On the Data Import page, turn on NVIDIA GPU in the Middleware Monitoring section.

    If this is the first time that you create a Logtail configuration for host monitoring, you are directed to the configuration page after you turn on the switch. If you have already created a Logtail configuration, click the Create icon to go to the configuration page.

  5. Create a machine group.
    • If a machine group is available, click Use Existing Machine Groups.
    • If no machine groups are available, perform the following steps to create a machine group. In this example, an ECS instance is used.

      1. On the ECS Instances tab, select Manually Select Instances. Then, select the ECS instance that you want to use and click Create.

        For more information, see Install Logtail on ECS instances.

        Important

        If you want to collect logs from an ECS instance that belongs to a different Alibaba Cloud account than Simple Log Service, a server in a data center, or a server of a third-party cloud service provider, you must manually install Logtail V0.16.50 or later for Linux on the server. For more information, see Install Logtail on a Linux server. After the installation is complete, you must also configure a user identifier on the server. For more information, see Configure a user identifier. A minimal command sketch is provided after this procedure.

      2. After Logtail is installed, click Complete Installation.
      3. In the Create Machine Group step, configure the Name parameter and click Next.

        Simple Log Service allows you to create IP address-based machine groups and custom identifier-based machine groups. For more information, see Create an IP address-based machine group and Create a custom identifier-based machine group.

    Important Make sure that the server on which you want to install Logtail can connect to the NVIDIA GPU server whose monitoring data you want to collect.
  6. Select the new machine group from Source Server Groups and move the machine group to Applied Server Groups. Then, click Next.
    Important If you apply a machine group immediately after you create the machine group, the heartbeat status of the machine group may be FAIL. This issue occurs because the machine group is not yet connected to Simple Log Service. To resolve this issue, click Automatic Retry. If the issue persists, see What do I do if no heartbeat connections are detected on Logtail?
  7. In the Specify Data Source step, configure the parameters and click Complete. The following list describes the parameters.
    • Configuration Name: The name of the Logtail configuration. You can enter a custom name.
    • Cluster Name: The name of the NVIDIA GPU cluster. You can enter a custom name.

      After you configure this parameter, Simple Log Service adds a tag in the cluster=<cluster name> format to the NVIDIA GPU monitoring data that is collected by using the Logtail configuration.

      Important Make sure that the cluster name is unique. Otherwise, data conflicts may occur.
    • Nvidia SMI Path: The path of the nvidia-smi binary. Default value: /usr/bin/nvidia-smi. If you are not sure where nvidia-smi is installed, see the sketch after this procedure.
    • Custom Tags: The custom tags that you want to add to the collected NVIDIA GPU monitoring data. Each tag is a key-value pair.

      After you configure this parameter, Simple Log Service adds the custom tags to the NVIDIA GPU monitoring data that is collected by using the Logtail configuration.

    After you complete the configuration, Simple Log Service automatically creates assets such as Metricstores. For more information, see Assets.
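
If you installed Logtail manually on a server outside your Alibaba Cloud account, as described in the Important note in the machine group step, the following commands are a minimal sketch of how to configure a user identifier on a Linux server. The sketch assumes the default Logtail installation paths, and the account ID 1234567890123456 is a hypothetical placeholder that you must replace with the ID of the Alibaba Cloud account to which your Simple Log Service project belongs.

  # Create the directory for user identifiers if it does not exist.
  sudo mkdir -p /etc/ilogtail/users

  # Create an empty file that is named after the Alibaba Cloud account ID.
  # 1234567890123456 is a hypothetical placeholder.
  sudo touch /etc/ilogtail/users/1234567890123456

  # Check whether Logtail is running. This check can also help when the
  # heartbeat status of the machine group is FAIL.
  sudo /etc/init.d/ilogtaild status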
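
If you are not sure which value to enter for the Nvidia SMI Path parameter, the following command is a minimal sketch of how to locate the nvidia-smi binary on the GPU server:

  # Print the full path of the nvidia-smi binary, for example, /usr/bin/nvidia-smi.
  which nvidia-smi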

What to do next

After NVIDIA GPU monitoring data is collected to the Full-stack Monitoring application, the application automatically creates dedicated dashboards for the monitoring data. You can use the dashboards to analyze the monitoring data. For more information, see View dashboards.