This topic introduces Prometheus metrics for GPU-HPN nodes in ACS clusters.
Metrics
Metric | Description | Label | Example |
node_cpu_seconds_total | The total CPU time used on the node. | NodeName, instance, mode | node_cpu_seconds_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx",mode="user"} 135268.20999999988 |
node_boot_time_seconds | The time at which the GPU-HPN node was reserved at purchase, as a Unix timestamp in seconds. If a failure triggers auto repair on the node, this metric is updated to the time at which the most recent auto repair event completed. | None | node_boot_time_seconds 1.735635132e+09 |
node_memory_MemAvailable_bytes | The amount of available memory on the node, in bytes. | NodeName, instance | node_memory_MemAvailable_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.070595100672e+12 |
node_memory_MemFree_bytes | The amount of free memory on the node, in bytes. | NodeName, instance | node_memory_MemFree_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.069967446016e+12 |
node_memory_MemTotal_bytes | The total amount of memory on the node, in bytes. | NodeName, instance | node_memory_MemTotal_bytes{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.9327352832e+12 |
node_disk_read_bytes_total | The total number of bytes read from the disks of the node. | NodeName, instance | node_disk_read_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.36580096e+08 |
node_disk_reads_completed_total | The total number of completed disk read operations on the node. | NodeName, instance | node_disk_reads_completed_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 2530 |
node_disk_writes_completed_total | The total number of completed disk write operations on the node. | NodeName, instance | node_disk_writes_completed_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 85965 |
node_disk_written_bytes_total | The total number of bytes written to the disks of the node. | NodeName, instance | node_disk_written_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 7.331622912e+09 |
node_network_receive_bytes_total | The total number of bytes received by the node. | NodeName, instance | node_network_receive_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 4.5447566e+07 |
node_network_transmit_bytes_total | The total number of bytes sent by the node. | NodeName, instance | node_network_transmit_bytes_total{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 8.6421368e+07 |
DCGM_FI_DEV_COUNT | The number of GPU devices on the node. | NodeName, instance | DCGM_FI_DEV_COUNT{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 8 |
DCGM_FI_DEV_FB_TOTAL | The total frame buffer size, in MB. | NodeName, instance | DCGM_FI_DEV_FB_TOTAL{NodeName="cn-wulanchabu-c.cr-xxx",instance="cn-wulanchabu-c.cr-xxx"} 1.56672e+06 |
DCGM_FI_DEV_FB_USED | The amount of frame buffer in use, in MB. | NodeName, UUID, instance, modelName | DCGM_FI_DEV_FB_USED{NodeName="cn-wulanchabu-c.cr-xxx",UUID="GPU-hashID",instance="cn-wulanchabu-c.cr-xx",modelName="mode-name-demo"} 9672 |
DCGM_FI_DEV_GPU_UTIL | The GPU utilization, as a percentage. | NodeName, UUID, instance, modelName | DCGM_FI_DEV_GPU_UTIL{NodeName="cn-wulanchabu-c.cr-xxx",UUID="GPU-hashID",instance="cn-wulanchabu-c.cr-xx",modelName="mode-name-demo"} 56 |
sysom_imc_node_event | Node-level memory bandwidth, summed across the NUMA sockets of the node. The collection time window is 30 seconds. | instance, value | sysom_imc_node_event{instance="cn-wulanchabu-c.cr-akrjaz1r0csm2qdrk227",value="bw_rd"} 780 |
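
The values in the Example column appear to follow the Prometheus text exposition format: a metric name, an optional set of labels in braces, and a numeric value. The following Python sketch is one way to read such lines and derive two commonly used figures from them. It is illustrative only: the helper name parse_sample is hypothetical, the regular expressions handle only simple lines like the ones in the table (no escaped quotes, no timestamps), the placeholder label values are shortened, and the memory-in-use formula is a common convention rather than something this topic defines.

```python
import re
from datetime import datetime, timezone

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces, for example NodeName="abc".
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value). Hypothetical helper."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


# Values copied from the Example column; label values are placeholders.
SAMPLES = [
    'node_boot_time_seconds 1.735635132e+09',
    'node_memory_MemAvailable_bytes{NodeName="node-a",instance="node-a"} 1.070595100672e+12',
    'node_memory_MemTotal_bytes{NodeName="node-a",instance="node-a"} 1.9327352832e+12',
]

values = {}
for sample_line in SAMPLES:
    name, labels, value = parse_sample(sample_line)
    values[name] = value

# Assumed convention: memory in use = 1 - available / total.
memory_used_ratio = 1 - (
    values["node_memory_MemAvailable_bytes"] / values["node_memory_MemTotal_bytes"]
)

# node_boot_time_seconds is a Unix timestamp, so it converts directly to a date.
boot_time = datetime.fromtimestamp(values["node_boot_time_seconds"], tz=timezone.utc)

print(f"memory in use: {memory_used_ratio:.1%}")
print(f"boot time (UTC): {boot_time.isoformat()}")
```

In a real deployment these figures would normally be computed by querying the monitoring backend rather than by parsing text; the sketch only shows how the example values in the table relate to each other.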