NVIDIA GPU servers expose multiple metrics. You can collect these metrics from NVIDIA GPU servers into the Full-stack Monitoring application and monitor them on visualized dashboards.

Prerequisites

Procedure

  1. Log on to the Log Service console.
  2. In the Log Application section, click Full-stack Monitoring.
  3. On the Full-stack Monitoring page, click the instance.
  4. On the Data Import page, enable Nvidia GPU.

    If this is the first time that you create a Logtail configuration for host monitoring, turn on the switch to go to the configuration page. If you have already created a Logtail configuration, click the Create icon to go to the configuration page.

  5. In the Install Logtail step, select the machine on which you want to install Logtail and click Next.
    • If you want to install Logtail on an Elastic Compute Service (ECS) instance, select the ECS instance on the ECS Instances tab and click Execute Now. For more information, see Install Logtail on ECS instances.
    • If you want to install Logtail on a self-managed Linux server or a Linux server from a third-party cloud, you must manually install Logtail V0.16.50 or later on the server. For more information, see Install Logtail on a Linux server.
    Notice: Make sure that the server on which you want to install Logtail can connect to the NVIDIA GPU server whose metric data you want to collect.
  6. In the Create Machine Group step, create a machine group and click Next.
    Log Service allows you to create IP address-based machine groups and custom identifier-based machine groups. For more information, see Create an IP address-based machine group and Create a custom ID-based machine group.
  7. In the Machine Group Settings step, select the machine group that you created in the Source Server Groups section and move it to the Applied Server Groups section. Then, click Next.
    Notice: If you apply a machine group immediately after it is created, the heartbeat status of the machine group may be FAIL because the machine group has not yet connected to Log Service. In this case, click Automatic Retry. If the issue persists, see What do I do if no heartbeat connections are detected on Logtail?
  8. In the Specify Data Source step, configure the following parameters and click Complete.
    • Configuration Name: The name of the Logtail configuration. You can enter a custom value.
    • Cluster Name: The name of the NVIDIA GPU cluster. You can enter a custom value.
      After you configure this parameter, Log Service adds a tag in the cluster=Cluster name format to the NVIDIA GPU monitoring data that is collected by using the Logtail configuration.
      Notice: Make sure that the cluster name is unique. Otherwise, data conflicts may occur.
    • Nvidia SMI Path: The path to the nvidia-smi executable. Default value: /usr/bin/nvidia-smi.
    • Custom Tags: The custom tags, specified as key-value pairs. After you configure this parameter, Log Service adds the tags to the NVIDIA GPU monitoring data that is collected by using the Logtail configuration.

    After you complete the configuration, Log Service automatically creates assets such as Metricstores. For more information, see Assets.
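
Before you run the procedure, it can help to confirm two values that it depends on: the minimum Logtail version (V0.16.50) and the Nvidia SMI Path. The following Python sketch is one possible pre-flight check, not part of Log Service or Logtail. The helper names (version_at_least, parse_gpu_rows, query_gpus) and the queried fields are assumptions chosen for illustration; they are not the metric list that Logtail collects. The sketch also assumes that you obtain the installed Logtail version string yourself and that GPU names contain no commas.

```python
import os
import subprocess

# Default value of the Nvidia SMI Path parameter; adjust if nvidia-smi lives elsewhere.
NVIDIA_SMI_PATH = "/usr/bin/nvidia-smi"

# Minimum Logtail version named in the Install Logtail step.
MIN_LOGTAIL_VERSION = "0.16.50"

# Illustrative nvidia-smi query fields (assumption: not the list that Logtail uses).
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def version_at_least(version, minimum=MIN_LOGTAIL_VERSION):
    """Compare dotted numeric versions such as '0.16.50' component by component."""
    def parse(text):
        return [int(part) for part in text.strip().lstrip("Vv").split(".")]
    return parse(version) >= parse(minimum)


def parse_gpu_rows(csv_text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Skip lines that do not match the expected number of fields.
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus(path=NVIDIA_SMI_PATH):
    """Run nvidia-smi once and return the parsed rows; raise if the path is unusable."""
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise FileNotFoundError(f"nvidia-smi is not an executable file at {path}")
    result = subprocess.run(
        [path, "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return parse_gpu_rows(result.stdout)
```

On the GPU server, calling query_gpus() returns one dictionary per GPU if the path is valid, and raises an error otherwise, which suggests that the Nvidia SMI Path parameter needs a different value. Calling version_at_least with the installed Logtail version string indicates whether the version meets the stated minimum.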

What to do next

After NVIDIA GPU monitoring data is collected into Log Service, the Full-stack Monitoring application automatically creates dedicated dashboards for the data. You can use the dashboards to analyze the monitoring data. For more information, see View dashboards.