ACS GPU-HPN node monitoring metrics - Container Compute Service

This topic introduces Prometheus metrics for GPU-HPN nodes in ACS clusters.

Metrics

Metric	Description	Label	Example
node_cpu_seconds_total	The total CPU time used on the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object. mode: The type of time slice, which can be idle, iowait, irq, nice, softirq, steal, system, or user.	node_cpu_seconds_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx",mode="user"} 135268.20999999988
node_boot_time_seconds	The time at which the GPU-HPN node was reserved at purchase, as a Unix timestamp in seconds. If a failure triggers auto repair on the node, this metric is updated to the time at which the most recent auto repair completed.	None	node_boot_time_seconds 1.735635132e+09
node_memory_MemAvailable_bytes	The amount of available memory on the node, in bytes.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_memory_MemAvailable_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.070595100672e+12
node_memory_MemFree_bytes	The amount of free memory on the node, in bytes.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_memory_MemFree_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.069967446016e+12
node_memory_MemTotal_bytes	The total amount of memory on the node, in bytes.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_memory_MemTotal_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.9327352832e+12
node_disk_read_bytes_total	The total number of bytes read from the disks of the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_disk_read_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.36580096e+08
node_disk_reads_completed_total	The total number of completed disk read operations on the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_disk_reads_completed_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 2530
node_disk_writes_completed_total	The total number of completed disk write operations on the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_disk_writes_completed_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 85965
node_disk_written_bytes_total	The total number of bytes written to the disks of the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_disk_written_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 7.331622912e+09
node_network_receive_bytes_total	The total number of bytes received by the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_network_receive_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 4.5447566e+07
node_network_transmit_bytes_total	The total number of bytes sent by the node.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	node_network_transmit_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 8.6421368e+07
DCGM_FI_DEV_COUNT	The number of devices.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	DCGM_FI_DEV_COUNT{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 8
DCGM_FI_DEV_FB_TOTAL	The total amount of the frame buffer in MB.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object.	DCGM_FI_DEV_FB_TOTAL{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.56672e+06
DCGM_FI_DEV_FB_USED	The amount of the used frame buffer in MB.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object. UUID: The unique identifier of the device. modelName: The model name of the device. device: The name of the device. gpu: The device number.	DCGM_FI_DEV_FB_USED{NodeName="cn-wulanchabu-c.cr-xxx",UUID="GPU-hashID",instance="cn-wulanchabu-c.cr-xxx",modelName="mode-name-demo"} 9672
DCGM_FI_DEV_GPU_UTIL	The GPU utilization, which is a percentage value.	NodeName: The name of the node, which corresponds to `spec.nodeName` in the Node object. instance: The name of the node, which corresponds to `spec.nodeName` in the Node object. UUID: The unique identifier of the device. modelName: The model name of the device. device: The name of the device. gpu: The device number.	DCGM_FI_DEV_GPU_UTIL{NodeName="cn-wulanchabu-c.cr-xxx",UUID="GPU-hashID",instance="cn-wulanchabu-c.cr-xxx",modelName="mode-name-demo"} 56
sysom_imc_node_event	Node-level memory bandwidth performance monitoring (the sum across multiple NUMA sockets). The collection time window is 30 seconds.	instance: The name of the node, which corresponds to `spec.nodeName` in the Node object. value: The type of memory bandwidth metric. Valid values: bw_rd (read bandwidth, in MB/s), bw_wr (write bandwidth, in MB/s), and rlat (average read latency, in ns). Other types are not currently supported.	sysom_imc_node_event{instance="cn-wulanchabu-c.cr-akrjaz1r0csm2qdrk227",value="bw_rd"} 780
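
As an illustration of how the gauges in the table can be combined, the following sketch derives a memory utilization percentage from node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes. The function name memory_used_percent is hypothetical, and treating "total minus available" as used memory is a common convention assumed here, not something this topic prescribes.

```python
def memory_used_percent(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Return the share of node memory in use, as a percentage.

    Assumption: used memory = MemTotal - MemAvailable.
    """
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return (1.0 - mem_available_bytes / mem_total_bytes) * 100.0


# Example values taken from the table above.
mem_available = 1.070595100672e12  # node_memory_MemAvailable_bytes
mem_total = 1.9327352832e12        # node_memory_MemTotal_bytes

print(f"memory used: {memory_used_percent(mem_available, mem_total):.1f}%")
```

With the example values from the table, this comes to roughly 44.6 percent of node memory in use.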

FAQ

How do I distinguish ACS pod metrics with the same name, such as DCGM_FI_DEV_FB_USED, when I configure a Grafana dashboard?

Pod metrics carry the Namespace and Pod labels, which you can use to distinguish metrics with the same name when writing PromQL queries.
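
A minimal sketch of that label-based filtering is shown below. It builds two PromQL selectors for DCGM_FI_DEV_FB_USED: one for a specific pod and one for series without a Pod label. The helper build_selector, the namespace default, and the pod name demo-pod are hypothetical. The sketch also assumes that node-level series carry no Pod label, in which case the PromQL matcher Pod="" selects them.

```python
def build_selector(metric: str, labels: dict) -> str:
    """Render a PromQL instant vector selector such as metric{a="1",b="2"}."""
    parts = []
    for name, value in labels.items():
        # Escape backslashes and double quotes inside label values.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return f"{metric}{{{','.join(parts)}}}"


# Pod-level series: filter on the Namespace and Pod labels (hypothetical values).
pod_query = build_selector(
    "DCGM_FI_DEV_FB_USED", {"Namespace": "default", "Pod": "demo-pod"}
)

# Node-level series: an empty-string matcher selects series without a Pod label
# (assumption: node-level series do not carry this label).
node_query = build_selector("DCGM_FI_DEV_FB_USED", {"Pod": ""})

print(pod_query)
print(node_query)
```

The resulting strings can be pasted into a Grafana panel query, one panel per scope.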

What causes the value of a cumulative metric, such as node_cpu_seconds_total, to reset to zero?

Take node_cpu_seconds_total, a cumulative metric that indicates the total CPU time consumed, as an example. On traditional ECS nodes, the operating system collects this value, and the value resets to zero when the node restarts. GPU-HPN nodes in ACS clusters are not physical machines, and the cumulative value is collected by the ACS monitoring component. Changes or upgrades to the component, as well as fault migrations during the lifecycle of a GPU-HPN node, can change the underlying physical resources. When that happens, the value of the cumulative metric resets to zero.

We recommend that you query cumulative metrics through a rate function, such as irate, instead of reading the raw values, because PromQL rate functions account for counter resets. If you have configured threshold-based alerts on these metrics, we recommend that you add filter conditions to avoid false alarms.
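
To show why reset-aware calculation matters, the sketch below sums the growth of a counter over a series of samples and treats any drop in value as a reset to zero. This mirrors the general idea behind Prometheus rate functions but is a simplified illustration, not the Prometheus implementation. The function name counter_increase and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Sum the growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The counter restarted from zero, so the new value is the
                # growth accumulated since the reset.
                total += value
        previous = value
    return total


# Hypothetical samples of a counter that resets between 130 and 5.
samples = [100.0, 130.0, 5.0, 25.0]

naive = samples[-1] - samples[0]       # negative: misleading after a reset
adjusted = counter_increase(samples)   # 30 + 5 + 20

print(naive, adjusted)
```

A threshold alert on the raw value or on the naive difference would misfire at the reset, whereas the adjusted increase stays non-negative.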

What is the definition of the timestamp in the original metrics?

GPU-HPN node metrics carry a timestamp in the standard Prometheus format, expressed in milliseconds since the Unix epoch. It indicates the time at which the resource metric was collected. The format is as follows:

node_cpu_seconds_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx",mode="idle"} 17.509999999999998 1735112457237

To have Prometheus use this timestamp, use the honor_timestamps setting in the Prometheus scrape configuration. The built-in Prometheus dashboard in ACS has this feature enabled by default.
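
For readers who process the raw metrics outside Prometheus, the following sketch parses a sample line like the one above and converts the trailing millisecond timestamp into a UTC datetime. The function parse_sample and both regular expressions are hypothetical helpers that cover only simple lines of this shape, not the full Prometheus exposition format.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified patterns: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_sample(line):
    """Return (name, labels, value, collected_at) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    collected_at = None
    if match.group("ts") is not None:
        # The timestamp is in milliseconds since the Unix epoch.
        collected_at = EPOCH + timedelta(milliseconds=int(match.group("ts")))
    return match.group("name"), labels, float(match.group("value")), collected_at


line = (
    'node_cpu_seconds_total{NodeName="cn-wulanchabu-c.cr-xxx",'
    'instance="cn-wulanchabu-c.cr-xxx",mode="idle"} '
    '17.509999999999998 1735112457237'
)
name, labels, value, collected_at = parse_sample(line)
print(name, labels["mode"], value, collected_at.isoformat())
```

Lines without a trailing timestamp, such as the node_boot_time_seconds example, yield None for collected_at.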