CloudMonitor collects GPU metrics from ECS instances through the CloudMonitor agent. You can create alert rules to receive notifications when metrics breach configured thresholds.
Prerequisites
- A GPU-accelerated ECS instance is created and the GPU driver is installed. For more information, see Create a GPU instance.
  Note: If you install the CloudMonitor agent before the GPU driver, restart the agent after the driver is installed. For more information, see How do I restart the C++ version of the CloudMonitor agent?
- The CloudMonitor agent is installed on the ECS instance. For more information, see Install the CloudMonitor agent.
Metrics
GPU metrics are available at the GPU, instance, and application group levels. The following table lists the available metrics.
| Metric | Unit | Metric name | Dimensions |
| --- | --- | --- | --- |
| (Agent) GPU decoder utilization | % | gpu_decoder_utilization | userId, instanceId, gpuId |
| (Agent) GPU encoder utilization | % | gpu_encoder_utilization | userId, instanceId, gpuId |
| (Agent) GPU temperature | °C | gpu_temperature | userId, instanceId, gpuId |
| (Agent) GPU utilization | % | gpu_utilization | userId, instanceId, gpuId |
| (Agent) GPU memory free space | Byte | gpu_memory_freespace | userId, instanceId, gpuId |
| (Agent) GPU memory free utilization | % | gpu_memory_free_utilization | userId, instanceId, gpuId |
| (Agent) GPU memory used space | Byte | gpu_memory_usedspace | userId, instanceId, gpuId |
| (Agent) GPU memory utilization | % | gpu_memory_utilization | userId, instanceId, gpuId |
| (Agent) GPU power draw | W | gpu_power_readings_power_draw | userId, instanceId, gpuId |
| (Agent) Instance-level decoder utilization | % | instance_gpu_decoder_utilization | userId, instanceId |
| (Agent) Instance-level encoder utilization | % | instance_gpu_encoder_utilization | userId, instanceId |
| (Agent) Instance-level GPU temperature | °C | instance_gpu_temperature | userId, instanceId |
| (Agent) Instance-level GPU utilization | % | instance_gpu_utilization | userId, instanceId |
| (Agent) Instance-level GPU memory free space | Byte | instance_gpu_memory_freespace | userId, instanceId |
| (Agent) Instance-level GPU memory free utilization | % | instance_gpu_memory_free_utilization | userId, instanceId |
| (Agent) Instance-level GPU memory used space | Byte | instance_gpu_memory_usedspace | userId, instanceId |
| (Agent) Instance-level GPU memory utilization | % | instance_gpu_memory_utilization | userId, instanceId |
| (Agent) Instance-level GPU power draw | W | instance_gpu_power_readings_power_draw | userId, instanceId |
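The dimensions listed for each metric determine which identifiers a query for that metric needs. The following Python sketch shows one way to check a set of dimension values against the table before you use them. The helper names `required_dimensions` and `build_dimensions`, the example IDs, and the choice to serialize the dimensions as a JSON string are assumptions made for illustration. They are not part of CloudMonitor, so check the CloudMonitor API reference for the exact request format.

```python
import json

# Dimension keys for each metric level, taken from the metrics table.
GPU_LEVEL_DIMENSIONS = ("userId", "instanceId", "gpuId")
INSTANCE_LEVEL_DIMENSIONS = ("userId", "instanceId")


def required_dimensions(metric_name):
    """Return the dimension keys a GPU metric expects, based on its name prefix."""
    # Check the longer prefix first: instance-level names also contain "gpu_".
    if metric_name.startswith("instance_gpu_"):
        return INSTANCE_LEVEL_DIMENSIONS
    if metric_name.startswith("gpu_"):
        return GPU_LEVEL_DIMENSIONS
    raise ValueError(f"not a GPU metric: {metric_name}")


def build_dimensions(metric_name, **values):
    """Validate the supplied dimension values and serialize them as JSON.

    The JSON string format is an assumption for illustration only.
    """
    expected = required_dimensions(metric_name)
    missing = [key for key in expected if key not in values]
    unexpected = [key for key in values if key not in expected]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return json.dumps({key: values[key] for key in expected})


# Hypothetical identifiers; replace them with your own account, instance, and GPU IDs.
print(build_dimensions("gpu_temperature",
                       userId="1234567890", instanceId="i-example", gpuId="0"))
print(build_dimensions("instance_gpu_temperature",
                       userId="1234567890", instanceId="i-example"))
```

Passing `gpuId` for an instance-level metric, or omitting it for a GPU-level metric, raises a `ValueError`, which surfaces the mismatch before any request is made.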
For instance-level metrics such as instance_gpu_decoder_utilization and instance_gpu_temperature, each statistic is calculated across all GPUs on the instance:
- Average: The average value across all GPUs on the instance. For example, two GPUs with values a and b yield (a + b) / 2.
- Maximum: The highest value among all GPUs on the instance. For example, two GPUs with values a and b yield max(a, b).
- Minimum: The lowest value among all GPUs on the instance. For example, two GPUs with values a and b yield min(a, b).
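The three statistics above can be reproduced from per-GPU values with a few lines of Python. This is a minimal sketch of the arithmetic only. The function name `aggregate_instance_metric` and the sample temperatures are hypothetical, and the sketch does not describe how the CloudMonitor agent computes these values internally.

```python
from statistics import mean


def aggregate_instance_metric(per_gpu_values):
    """Combine per-GPU samples into instance-level Average, Maximum, and Minimum."""
    if not per_gpu_values:
        raise ValueError("at least one GPU value is required")
    return {
        "Average": mean(per_gpu_values),  # (a + b) / 2 for two GPUs
        "Maximum": max(per_gpu_values),   # max(a, b)
        "Minimum": min(per_gpu_values),   # min(a, b)
    }


# Hypothetical gpu_temperature readings (°C) from two GPUs on one instance.
stats = aggregate_instance_metric([60.0, 70.0])
print(stats)  # {'Average': 65.0, 'Maximum': 70.0, 'Minimum': 60.0}
```

The same function works for any number of GPUs, because each statistic is defined over all GPUs on the instance.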
View GPU monitoring data
- Log on to the CloudMonitor console.
- In the left-side navigation pane, open the Host Monitoring page.
- On the Host Monitoring page, click the name of the target instance, or click View Charts in its Actions column.
- Click the GPU monitoring tab to view the GPU monitoring charts for the host.
GPU metrics also support alerting. For more information, see Create an alert rule for a host and View alerts.