Query GPU metric data

After you install the CloudMonitor agent on a GPU-accelerated compute optimized Elastic Compute Service (ECS) instance, CloudMonitor collects GPU metrics from the instance. You can also create alert rules for these metrics. If the value of a metric meets the specified alert condition, CloudMonitor triggers an alert and sends an alert notification. This helps you monitor GPU status in real time.

Prerequisites

A GPU-accelerated compute optimized ECS instance is created. The required GPU driver is installed on the instance. For more information, see Create a GPU-accelerated elastic container instance.
Note
If you install the CloudMonitor agent before you install the GPU driver, you must restart the CloudMonitor agent. For more information about how to restart the CloudMonitor agent, see How can I restart the CloudMonitor agent for C++?
The CloudMonitor agent is installed on the ECS instance. For more information, see Install and uninstall the CloudMonitor agent for C++.

GPU metrics

You can view GPU metrics based on GPUs, instances, and application groups. The following table lists the GPU metrics.

Metric	Unit	MetricName	Dimensions
(Agent)gpu_decoder_utilization	%	gpu_decoder_utilization	userId, instanceId, and gpuId
(Agent)gpu_encoder_utilization	%	gpu_encoder_utilization	userId, instanceId, and gpuId
(Agent)gpu_gpu_temperature	°C	gpu_gpu_temperature	userId, instanceId, and gpuId
(Agent)gpu_gpu_usedutilization	%	gpu_gpu_usedutilization	userId, instanceId, and gpuId
(Agent)gpu_memory_freespace	Byte	gpu_memory_freespace	userId, instanceId, and gpuId
(Agent)gpu_memory_freeutilization	%	gpu_memory_freeutilization	userId, instanceId, and gpuId
(Agent)gpu_memory_usedspace	Byte	gpu_memory_usedspace	userId, instanceId, and gpuId
(Agent)gpu_memory_usedutilization	%	gpu_memory_usedutilization	userId, instanceId, and gpuId
(Agent)gpu_power_readings_power_draw	W	gpu_power_readings_power_draw	userId, instanceId, and gpuId
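
The MetricName and Dimensions columns are the values you would pass when you query these metrics programmatically. The following Python sketch only builds a parameter dictionary and does not call any SDK. It assumes a DescribeMetricList-style query that takes Namespace, MetricName, Period, and a JSON-encoded Dimensions string, and it assumes that ECS host metrics use the acs_ecs_dashboard namespace. The function name build_gpu_metric_query and the sample IDs are hypothetical. Verify the parameter names and the namespace against the CloudMonitor API reference before you use them.

```python
import json

# Assumption: namespace used for ECS host metrics. Check the API reference.
ECS_NAMESPACE = "acs_ecs_dashboard"

# MetricName values taken from the table above.
GPU_METRIC_NAMES = {
    "gpu_decoder_utilization",
    "gpu_encoder_utilization",
    "gpu_gpu_temperature",
    "gpu_gpu_usedutilization",
    "gpu_memory_freespace",
    "gpu_memory_freeutilization",
    "gpu_memory_usedspace",
    "gpu_memory_usedutilization",
    "gpu_power_readings_power_draw",
}


def build_gpu_metric_query(metric_name, instance_id, gpu_id=None, period=60):
    """Build request parameters for one GPU metric (hypothetical helper).

    gpu_id is optional: omit it to query at the instance level.
    userId is left out here on the assumption that the account is
    identified by the credentials used to sign the request.
    """
    if metric_name not in GPU_METRIC_NAMES:
        raise ValueError(f"unknown GPU metric: {metric_name}")

    dimension = {"instanceId": instance_id}
    if gpu_id is not None:
        dimension["gpuId"] = gpu_id

    return {
        "Namespace": ECS_NAMESPACE,
        "MetricName": metric_name,
        "Period": str(period),
        # Dimensions is sent as a JSON array of key-value objects.
        "Dimensions": json.dumps([dimension]),
    }


if __name__ == "__main__":
    # Placeholder instance and GPU IDs for illustration only.
    print(build_gpu_metric_query("gpu_gpu_temperature", "i-example", "0"))
```

Validating the metric name against the table catches misspelled names such as gpu_memory_userdspace before a request is sent.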

View GPU metric data in the CloudMonitor console

Log on to the CloudMonitor console.
In the left-side navigation pane, click Cloud Service Monitoring > Host Monitoring.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
Click the GPU Monitoring tab.
On the GPU Monitoring tab, view the monitoring charts for GPU metrics.
You can view the GPU metrics of the host. You can also configure alert rules for specific GPU metrics and view alerts. For more information, see Step 2: Create an alert rule for the host and Step 3: View host alerts.
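
To illustrate what an alert condition on a GPU metric expresses, the following Python sketch checks a series of metric values against a threshold locally. It is not how CloudMonitor evaluates alert rules. The function name breaches_threshold, the "greater than" comparison, and the rule that several consecutive data points must exceed the threshold are all assumptions made for illustration.

```python
def breaches_threshold(values, threshold, consecutive=3):
    """Return True if the most recent `consecutive` values all exceed `threshold`.

    values: metric samples ordered from oldest to newest, for example
            gpu_gpu_temperature readings in degrees Celsius.
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be a positive integer")
    if len(values) < consecutive:
        # Not enough data points yet to decide.
        return False
    return all(v > threshold for v in values[-consecutive:])


if __name__ == "__main__":
    # Hypothetical temperature samples.
    samples = [71.0, 79.5, 84.0, 86.5, 88.0]
    print(breaches_threshold(samples, threshold=80.0, consecutive=3))
```

Requiring several consecutive breaches is one common way to avoid notifications caused by a single short spike.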
