Managed Service for Prometheus: Basic metrics for container clusters

Last Updated: Mar 11, 2026

Managed Service for Prometheus collects a default set of basic metrics from container clusters at no charge. Any metric that is not listed on this page is classified as a custom metric and incurs charges. Billing for custom metrics began on January 6, 2020. For billing details, see Billing overview.

Note

The basic metric scope defined on this page takes effect from 00:00:00 on November 12, 2024 (UTC+8). Custom metrics are charged based on the volume of observability data written or the number of data reports.
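The classification rule above (listed metrics are basic, everything else is custom) can be sketched as a simple set lookup. This is only an illustration: the set below is a small excerpt of names from the cAdvisor table on this page, the helper names are hypothetical, and actual billing is determined by the service, not by client-side code.

```python
# Hypothetical excerpt: a few basic metric names copied from the cAdvisor
# table on this page. This is NOT the complete basic-metric list.
BASIC_METRICS_EXCERPT = frozenset({
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "machine_cpu_cores",
    "up",
})


def classify_metric(name, basic_metrics=BASIC_METRICS_EXCERPT):
    """Return "basic" if the name is in the supplied set, otherwise "custom"."""
    return "basic" if name in basic_metrics else "custom"


def split_by_class(names, basic_metrics=BASIC_METRICS_EXCERPT):
    """Group de-duplicated metric names into basic and custom lists."""
    result = {"basic": [], "custom": []}
    for name in sorted(set(names)):
        result[classify_metric(name, basic_metrics)].append(name)
    return result


if __name__ == "__main__":
    # "my_app_requests_total" is a made-up application metric.
    print(split_by_class(["up", "my_app_requests_total"]))
```

A check like this can help estimate which of the metrics that a cluster exposes would fall outside the free scope, provided the lookup set is filled with the full list from this page.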

Metric categories

Each category corresponds to a Prometheus scrape target. The following table lists all categories and metric counts.

| Category | Job name | Metrics |
| --- | --- | --- |
| cAdvisor | _arms/kubelet/cadvisor | 56 |
| ACK control plane API server | apiserver | 486 |
| Node Exporter | node-exporter | 421 |
| kube-state-metrics | _kube-state-metrics | 132 |
| kube-events | _arms/kube-event | 59 |
| CoreDNS | arms-ack-coredns | 102 |
| CSI clusters | k8s-csi-cluster-pv | 17 |
| CSI nodes | k8s-csi-node-pv | 107 |
| GPU-Exporter | gpu-exporter | 64 |
| Cost-Exporter | alibaba-cloud-cost-exporter | 14 |
| Ingress | arms-ack-ingress | 134 |
| Koordinator | kube-system, koordlet-metrics-podmonitor, or koord-manager-metrics-service | 14 |
| ETCD | etcd | 198 |
| Scheduler | ack-scheduler | 293 |
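Because each category maps to a scrape job, one way to inspect what a job actually reports is to count series per metric name for that job. The sketch below only builds the PromQL text. It assumes that the job names in the table appear unchanged as the value of the `job` label and that they contain no regular expression metacharacters that need escaping. The function names are hypothetical.

```python
def job_selector(job_names):
    """Build a PromQL label selector that matches one or more job names."""
    names = list(job_names)
    if not names:
        raise ValueError("at least one job name is required")
    if len(names) == 1:
        # A single job uses an exact match.
        return '{job="%s"}' % names[0]
    # Several jobs (for example, Koordinator) use a regex alternation.
    return '{job=~"%s"}' % "|".join(names)


def series_count_by_metric_query(job_names):
    """Build a PromQL query that counts series per metric name for the jobs."""
    return "count by (__name__) (%s)" % job_selector(job_names)


if __name__ == "__main__":
    print(series_count_by_metric_query(["apiserver"]))
    print(series_count_by_metric_query([
        "kube-system",
        "koordlet-metrics-podmonitor",
        "koord-manager-metrics-service",
    ]))
```

The resulting string can be pasted into any PromQL query editor. Comparing the returned metric names with the tables on this page shows which series belong to the basic scope.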

cAdvisor

Job name: _arms/kubelet/cadvisor

This category contains container resource usage metrics that cAdvisor collects through the kubelet. The metrics cover CPU, memory, file system, network, and GPU allocation for individual containers.

| Metric | Description |
| --- | --- |
| container_cpu_usage_seconds_total | The total CPU time consumed by the container in seconds. |
| container_fs_usage_bytes | The number of bytes used by the container file system. |
| container_memory_cache | The memory cache size of the container in bytes. |
| container_memory_usage_bytes | The amount of memory used by the container in bytes. |
| container_memory_working_set_bytes | The memory working set size (WSS) of the container in bytes. |
| container_network_receive_bytes_total | The total network traffic received by the container in bytes. |
| container_network_transmit_bytes_total | The total network traffic transmitted by the container in bytes. |
| container_scrape_error | Indicates whether an error occurred while container metrics were collected. The value is 1 if an error occurred and 0 otherwise. |
| DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | The ratio of the GPU computing power allocated to the container to the total computing power of the GPU. The value ranges from 0 to 1. In exclusive GPU mode, or in shared GPU mode in which the container requests only GPU memory, the value is 0, which indicates that the allocation of GPU computing power is unlimited. For example, if a GPU provides a total of 100 compute units (CUs) and 30 CUs are allocated to a container, the ratio is 30/100 = 0.3. |
| DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | The amount of GPU memory allocated to the container. |
| DCGM_CUSTOM_DEV_FB_ALLOCATED | The ratio of the allocated GPU memory to the total memory of the GPU. The value ranges from 0 to 1. |
| DCGM_CUSTOM_DEV_FB_TOTAL | The total memory of the GPU. |
| DCGM_CUSTOM_DEV_HEALTH | The health status of the GPU. |
| DCGM_CUSTOM_PROCESS_DECODE_UTIL | The decoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_ENCODE_UTIL | The encoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | The memory copy utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_USED | The amount of GPU memory used by GPU threads. |
| DCGM_CUSTOM_PROCESS_SM_UTIL | The streaming multiprocessor (SM) utilization of GPU threads. |
| DCGM_CUSTOM_PROF_MEM_BANDWIDTH_USED | The GPU memory bandwidth used. |
| DCGM_CUSTOM_PROF_TENS_TFPS_USED | The tensor core utilization. |
| DCGM_FI_DEV_DEC_UTIL | The decoder utilization. |
| DCGM_FI_DEV_ENC_UTIL | The encoder utilization. |
| DCGM_FI_DEV_FB_FREE | The amount of free frame buffer memory. |
| DCGM_FI_DEV_FB_USED | The amount of used frame buffer memory. The value is the same as the Memory-Usage value returned by the nvidia-smi command. |
| DCGM_FI_DEV_GPU_TEMP | The GPU temperature. |
| DCGM_FI_DEV_GPU_UTIL | The GPU utilization within a cycle of 1 second or 1/6 second. The cycle varies based on the GPU model. A cycle is a period of time during which one or more kernel functions remain active. This metric indicates only that one or more kernel functions are occupying GPU resources. It does not show detailed GPU usage information. |
| DCGM_FI_DEV_MEM_CLOCK | The memory clock speed. |
| DCGM_FI_DEV_MEM_COPY_UTIL | The memory bandwidth utilization. For example, the maximum memory bandwidth of NVIDIA V100 is 900 GB/s. If the memory bandwidth used is 450 GB/s, the memory bandwidth utilization is 50%. |
| DCGM_FI_DEV_POWER_USAGE | The power usage. |
| DCGM_FI_DEV_SM_CLOCK | The SM clock speed. |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | The total energy consumed since the driver was last loaded. |
| DCGM_FI_DEV_XID_ERRORS | The last XID error that occurred within a period of time. |
| DCGM_FI_PROF_DRAM_ACTIVE | The fraction of cycles during which the device memory interface is sending or receiving data. The value is an average within a time interval rather than an instantaneous value. A larger value indicates higher device memory utilization. If the value is 1 (100%), a DRAM command is executed every cycle within the entire interval. In practice, the peak value is approximately 0.8 (80%). If the value is 0.2 (20%), 20% of the cycles within the time interval are spent reading from or writing to device memory. |
| DCGM_FI_PROF_NVLINK_RX_BYTES | The RX rate of NVLink. The bytes received exclude the header. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per direction per link. |
| DCGM_FI_PROF_NVLINK_TX_BYTES | The TX rate of NVLink. The bytes transmitted exclude the header. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per direction per link. |
| DCGM_FI_PROF_PCIE_RX_BYTES | The RX rate of PCIe. The bytes received include both the header and payload. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
| DCGM_FI_PROF_PCIE_TX_BYTES | The TX rate of PCIe. The bytes transmitted include both the header and payload. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The fraction of cycles during which the Tensor (HMMA/IMMA) pipe is active. The value is an average within a time interval rather than an instantaneous value. A larger value indicates higher tensor core utilization. If the value is 1 (100%), a Tensor instruction is issued every cycle within the entire interval. One instruction completes in two cycles. If the value is 0.2 (20%), one of the following conditions may exist: (1) 20% of the SMs run at 100% tensor core utilization throughout the time interval; (2) all SMs run at 20% tensor core utilization throughout the time interval; (3) all SMs run at 100% tensor core utilization for 20% of the time interval; or (4) another combination. |
| DCGM_FI_PROF_SM_ACTIVE | The ratio of cycles during which at least one warp on an SM is active. The value is an average across all SMs and does not vary with the number of warps per thread block. A warp is considered active after it is scheduled and resources are allocated to it. An active warp is not necessarily computing; for example, it may be waiting for a memory request. A value below 0.5 indicates low GPU utilization. For high GPU utilization, the value should be greater than 0.8. Assume that a GPU has N SMs. If a kernel function that uses N thread blocks runs on all SMs throughout the time interval, the value is 1 (100%). If a kernel function that uses N/5 thread blocks runs throughout the time interval, the value is 0.2. If a kernel function that uses N thread blocks runs for only 20% of the time interval, the value is 0.2. |
| machine_cpu_cores | The number of CPU cores on the machine. |
| machine_memory_bytes | The machine memory in bytes. |
| node_exporter_build_info | The build information about the node exporter. |
| nvidia_gpu_duty_cycle | The percentage of time over the past sample period during which the NVIDIA GPU was occupied. |
| nvidia_gpu_memory_total_bytes | The total memory of the NVIDIA GPU in bytes. |
| nvidia_gpu_memory_used_bytes | The memory used by the NVIDIA GPU in bytes. |
| nvidia_gpu_num_devices | The number of NVIDIA GPUs. |
| nvidia_gpu_power_usage_milliwatts | The power consumption of the NVIDIA GPU in milliwatts. |
| nvidia_gpu_temperature_celsius | The temperature of the NVIDIA GPU in °C. |
| rdma_service_monitor_local_ack_timeout_err | The number of timeout errors that occurred in the remote direct memory access (RDMA) network. |
| rdma_service_monitor_out_of_seq | The number of out-of-order packets in the RDMA network. |
| rdma_service_monitor_packet_seq_err | The number of packet sequence errors in the RDMA network. |
| rdma_service_monitor_rx_bytes | The number of bytes received over the RDMA network. |
| rdma_service_monitor_rx_packets | The number of packets received over the RDMA network. |
| rdma_service_monitor_tx_bytes | The number of bytes sent over the RDMA network. |
| rdma_service_monitor_tx_packets | The number of packets sent over the RDMA network. |
| up | Indicates whether the scrape target was reachable during the last collection. The value is 1 if the scrape succeeded and 0 if it failed. |
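The ratio examples in the GPU descriptions above (30 of 100 compute units, 450 of 900 GB/s) and the DCGM_FI_PROF_SM_ACTIVE thresholds can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names and the "moderate" label for values between the two thresholds are assumptions made for illustration and are not defined by the service.

```python
def fraction(part, whole):
    """Return part / whole as a ratio between 0 and 1."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def sm_activity_level(sm_active):
    """Classify a DCGM_FI_PROF_SM_ACTIVE value using the thresholds above.

    Below 0.5 is described as low utilization and above 0.8 as high.
    The "moderate" label for the range in between is an assumption.
    """
    if sm_active < 0.5:
        return "low"
    if sm_active > 0.8:
        return "high"
    return "moderate"


if __name__ == "__main__":
    # DCGM_CUSTOM_CONTAINER_CP_ALLOCATED example: 30 of 100 compute units.
    print(fraction(30, 100))
    # DCGM_FI_DEV_MEM_COPY_UTIL example: 450 GB/s used of 900 GB/s maximum.
    print(fraction(450, 900))
    print(sm_activity_level(0.2))
```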

ACK control plane API server

Job name: apiserver

Control plane component metrics for Container Service for Kubernetes (ACK) Pro clusters and ACK dedicated clusters. For Pro clusters, this includes the API server, etcd, Scheduler, Kube Controller Manager, and Cloud Controller Manager. For dedicated clusters, this includes the API server.
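Many metrics in this category are Prometheus histograms, exposed as three series that end in _bucket, _sum, and _count (for example, apiserver_request_duration_seconds_bucket). The sketch below shows one way such series can be interpreted offline: an average from the increase of _sum and _count, and a quantile estimate by linear interpolation across cumulative buckets, similar in spirit to the PromQL histogram_quantile function. The function names and the sample numbers are hypothetical. In practice, the calculation is usually done in PromQL rather than in client code.

```python
import math


def average_from_sum_count(sum_increase, count_increase):
    """Average observation over a window: increase of _sum / increase of _count."""
    if count_increase <= 0:
        return math.nan
    return sum_increase / count_increase


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the math.inf bucket (the le="+Inf" series).
    The lower bound of the first bucket is assumed to be 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf or count == prev_count:
                # Cannot interpolate: fall back to the last finite bound.
                return prev_bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Made-up cumulative bucket counts for a latency histogram in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(estimate_quantile(0.95, sample))
    print(average_from_sum_count(12.0, 100))
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.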

MetricDescription
aggregator_discovery_aggregation_count_totalThe count of discovery aggregations performed by the aggregator.
aggregator_openapi_v2_regeneration_countThe number of regenerations based on OpenAPI 2.0.
aggregator_openapi_v2_regeneration_durationThe amount of time consumed for regenerations based on OpenAPI 2.0.
aggregator_unavailable_apiserviceThe APIServices that are unavailable to the aggregator.
aggregator_unavailable_apiservice_countThe count of APIServices that are unavailable to the aggregator.
aggregator_unavailable_apiservice_totalThe total number of APIServices that are unavailable to the aggregator.
aliyun_prometheus_agent_append_duration_secondsThe additional time spent by the Prometheus agent in seconds.
aliyun_prometheus_agent_job_discovery_statusThe job status that is discovered by the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of target scrapes performed by the Prometheus agent.
aliyun_prometheus_agent_target_infoThe information about targets scraped by the Prometheus agent.
apiextensions_apiserver_validation_ratcheting_seconds_bucketThe distribution of incremental time intervals for validation in seconds in the APIServer.
apiextensions_apiserver_validation_ratcheting_seconds_countThe count of incremental time intervals for validation in seconds in the APIServer.
apiextensions_apiserver_validation_ratcheting_seconds_sumThe sum of incremental time intervals for validation in seconds in the APIServer.
apiextensions_openapi_v2_regeneration_countThe number of API extension regenerations based on OpenAPI 2.0.
apiextensions_openapi_v3_regeneration_countThe number of API extension regenerations based on OpenAPI 3.0.
apiserver_accepted_listall_requests_totalThe total number of ListAll requests accepted by the APIServer.
apiserver_admission_controller_admission_duration_seconds_bucketThe distribution of APIServer admission controller durations in seconds.
apiserver_admission_controller_admission_duration_seconds_countThe count of APIServer admission controller durations in seconds.
apiserver_admission_controller_admission_duration_seconds_sumThe sum of APIServer admission controller durations in seconds.
apiserver_admission_step_admission_duration_seconds_bucketThe distribution of APIServer admission step durations in seconds.
apiserver_admission_step_admission_duration_seconds_countThe count of APIServer admission step durations per second.
apiserver_admission_step_admission_duration_seconds_sumThe sum of APIServer admission step durations in seconds.
apiserver_admission_step_admission_duration_seconds_summaryThe summary of APIServer admission step durations in seconds.
apiserver_admission_step_admission_duration_seconds_summary_countThe summary count of APIServer admission step durations in seconds.
apiserver_admission_step_admission_duration_seconds_summary_sumThe summary total of APIServer admission step durations in seconds.
apiserver_admission_webhook_admission_duration_seconds_bucketThe distribution of APIServer admission webhook durations in seconds.
apiserver_admission_webhook_admission_duration_seconds_countThe count of APIServer admission webhook durations in seconds.
apiserver_admission_webhook_admission_duration_seconds_sumThe sum of APIServer admission webhook durations in seconds.
apiserver_admission_webhook_fail_open_countThe count of times that the APIServer admission webhook is configured as fail open.
apiserver_admission_webhook_rejection_countThe count of requests rejected by the APIServer admission webhook.
apiserver_admission_webhook_request_totalThe total number of requests to the APIServer admission webhook.
apiserver_audit_error_totalThe total number of APIServer audit errors.
apiserver_audit_event_totalThe total number of APIServer audit events.
apiserver_audit_level_totalThe total number of APIServer audit levels.
apiserver_audit_requests_rejected_totalThe total number of rejected APIServer requests.
apiserver_authorization_decisions_totalThe total number of authorization decisions made by the APIServer.
apiserver_cache_list_fetched_objects_totalThe total number of objects obtained by the APIServer cache list.
apiserver_cache_list_returned_objects_totalThe total number of objects returned by the APIServer cache list.
apiserver_cache_list_totalThe total number of operations performed by the APIServer cache list.
apiserver_cacher_received_eventsThe number of events received by the APIServer cache.
apiserver_cacher_sended_events_latency_milliseconds_bucketThe distribution of APIServer event sending latencies in milliseconds.
apiserver_cacher_sended_events_latency_milliseconds_countThe count of APIServer event sending latencies in milliseconds.
apiserver_cacher_sended_events_latency_milliseconds_sumThe total of APIServer event sending latencies in milliseconds.
apiserver_cacher_watcher_channel_lengthThe watcher channel length of the APIServer cache.
apiserver_cel_compilation_duration_seconds_bucketThe distribution of APIServer Common Expression Language (CEL) compilation latencies in seconds.
apiserver_cel_compilation_duration_seconds_countThe count of APIServer CEL compilations.
apiserver_cel_compilation_duration_seconds_sumThe total time consumed for APIServer CEL compilations in seconds.
apiserver_cel_evaluation_duration_seconds_bucketThe distribution of APIServer CEL evaluation latencies in seconds.
apiserver_cel_evaluation_duration_seconds_countThe count of APIServer CEL evaluations.
apiserver_cel_evaluation_duration_seconds_sumThe total of APIServer CEL evaluation latencies in seconds.
apiserver_client_certificate_expiration_seconds_bucketThe distribution of remaining seconds until APIServer client certificate expiration.
apiserver_client_certificate_expiration_seconds_countThe count of remaining seconds until APIServer client certificate expiration.
apiserver_client_certificate_expiration_seconds_sumThe total remaining seconds until APIServer client certificate expiration.
apiserver_clusterip_repair_ip_errors_totalThe total number of ClusterIP errors fixed by the APIServer.
apiserver_clusterip_repair_reconcile_errors_totalThe total number of ClusterIP reconcile errors fixed by the APIServer.
apiserver_conversion_webhook_duration_seconds_bucketThe distribution of APIServer conversion webhook latencies in seconds.
apiserver_conversion_webhook_duration_seconds_countThe count of APIServer conversion webhook calls.
apiserver_conversion_webhook_duration_seconds_sumThe total of APIServer conversion webhook latencies in seconds.
apiserver_conversion_webhook_request_totalThe total number of APIServer conversion webhook requests.
apiserver_crd_conversion_webhook_duration_seconds_bucketThe distribution of APIServer Custom Resource Definition (CRD) conversion webhook latencies in seconds.
apiserver_crd_conversion_webhook_duration_seconds_countThe count of APIServer CRD conversion webhook calls.
apiserver_crd_conversion_webhook_duration_seconds_sumThe total of APIServer CRD conversion webhook latencies in seconds.
apiserver_crd_webhook_conversion_duration_seconds_bucketThe distribution of APIServer CRD webhook conversion latencies in seconds.
apiserver_crd_webhook_conversion_duration_seconds_countThe count of APIServer CRD webhook conversions.
apiserver_crd_webhook_conversion_duration_seconds_sumThe total of APIServer CRD webhook conversion latencies in seconds.
apiserver_created_watchersThe number of watchers created by the APIServer.
apiserver_current_inflight_requestsThe number of requests that are being processed by the APIServer.
apiserver_current_inqueue_requestsThe maximum number of queued requests in the APIServer.
apiserver_dropped_requests_totalThe total number of requests dropped by the APIServer.
apiserver_encryption_config_controller_automatic_reload_failures_totalThe number of times that the encryption configuration controller of the APIServer failed to be automatically reloaded.
apiserver_encryption_config_controller_automatic_reload_success_totalThe number of times that the encryption configuration controller of the APIServer was automatically reloaded.
apiserver_envelope_encryption_dek_cache_fill_percentThe percentage of APIServer envelope encryption Data Encryption Key (DEK) cache filled.
apiserver_error_watchersThe number of watchers in the Error state in the APIServer.
apiserver_flowcontrol_current_executing_requestsThe number of requests being processed by APIServer rate limiting.
apiserver_flowcontrol_current_executing_seatsThe number of seats occupied by APIServer rate limiting.
apiserver_flowcontrol_current_inqueue_requestsThe number of requests pending in queues in the APF system.
apiserver_flowcontrol_current_inqueue_seatsThe number of seats pending in APIServer rate limiting queues.
apiserver_flowcontrol_current_limit_seatsThe number of seats limited by APIServer rate limiting.
apiserver_flowcontrol_current_rThe current R value of APIServer rate limiting.
apiserver_flowcontrol_demand_seats_averageThe average number of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_bucketThe distribution of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_countThe count of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_high_watermarkThe high watermark of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_smoothedThe smoothed value of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_stdevThe standard deviation of seats requested by APIServer rate limiting.
apiserver_flowcontrol_demand_seats_sumThe sum of seats requested by APIServer rate limiting.
apiserver_flowcontrol_dispatch_rThe scheduling R value of APIServer rate limiting.
apiserver_flowcontrol_dispatched_requests_totalThe total number of requests scheduled by APIServer rate limiting.
apiserver_flowcontrol_latest_sThe recent S value bounds of APIServer rate limiting.
apiserver_flowcontrol_lower_limit_seatsThe lower bound of seats in APIServer rate limiting.
apiserver_flowcontrol_next_discounted_s_boundsThe next discounted S value bounds of APIServer rate limiting.
apiserver_flowcontrol_next_s_boundsThe next S value bounds of APIServer rate limiting.
apiserver_flowcontrol_nominal_limit_seatsThe nominal upper bound of seats in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_samples_bucketThe distribution of priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_samples_countThe count of priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_samples_sumThe sum of priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_watermarks_bucketThe distribution of watermark levels for priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_watermarks_countThe count of watermark levels for priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_count_watermarks_sumThe sum of watermark levels for priority level request samples in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_utilization_bucketThe distribution of request utilization samples by priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_utilization_countThe count of request utilization samples by priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_request_utilization_sumThe sum of request utilization by priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_count_samples_bucketThe distribution of seat samples for priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_count_samples_countThe count of seat samples for priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_count_samples_sumThe sum of seat samples for priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_count_watermarks_bucketThe distribution of watermark levels for seat samples in APIServer rate limiting by priority level.
apiserver_flowcontrol_priority_level_seat_count_watermarks_countThe count of watermark levels for seat samples in APIServer rate limiting by priority level.
apiserver_flowcontrol_priority_level_seat_count_watermarks_sumThe sum of watermark levels for seat samples in APIServer rate limiting by priority level.
apiserver_flowcontrol_priority_level_seat_utilization_bucketThe distribution of seat utilization samples by priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_utilization_countThe count of seat utilization samples by priority level in APIServer rate limiting.
apiserver_flowcontrol_priority_level_seat_utilization_sumThe sum of seat utilization by priority level in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_current_requests_bucketThe distribution of current read/write requests in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_current_requests_countThe count of current read/write requests in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_current_requests_sumThe sum of current read/write requests in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_samples_bucketThe distribution of read/write request count samples in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_samples_countThe count of read/write request count samples in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_samples_sumThe sum of read/write request count samples in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_watermarks_bucketThe distribution of read/write request count watermarks in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_watermarks_countThe count of read/write request count watermarks in APIServer rate limiting.
apiserver_flowcontrol_read_vs_write_request_count_watermarks_sumThe sum of read/write request count watermarks in APIServer rate limiting.
apiserver_flowcontrol_rejected_requests_totalThe total number of requests rejected by APIServer rate limiting.
apiserver_flowcontrol_request_concurrency_in_useThe count of concurrent requests in APIServer rate limiting.
apiserver_flowcontrol_request_concurrency_limitThe concurrent request limit in APIServer rate limiting.
apiserver_flowcontrol_request_dispatch_no_accommodation_totalThe total number of requests that could not be accommodated by the scheduling of APIServer rate limiting.
apiserver_flowcontrol_request_execution_seconds_bucketThe distribution of request latencies in seconds in APIServer rate limiting.
apiserver_flowcontrol_request_execution_seconds_countThe count of request latencies in seconds in APIServer rate limiting.
apiserver_flowcontrol_request_execution_seconds_sumThe sum of request latencies in seconds in APIServer rate limiting.
apiserver_flowcontrol_request_queue_length_after_enqueue_bucketThe distribution of request queue lengths after enqueuing in APIServer rate limiting.
apiserver_flowcontrol_request_queue_length_after_enqueue_countThe count of request queue lengths after enqueuing in APIServer rate limiting.
apiserver_flowcontrol_request_queue_length_after_enqueue_sumThe sum of request queue lengths after enqueuing in APIServer rate limiting.
apiserver_flowcontrol_request_wait_duration_seconds_bucketThe distribution of request waiting durations in seconds in APIServer rate limiting.
apiserver_flowcontrol_request_wait_duration_seconds_countThe count of request waiting durations in seconds in APIServer rate limiting.
apiserver_flowcontrol_request_wait_duration_seconds_sumThe sum of request waiting durations in seconds in APIServer rate limiting.
apiserver_flowcontrol_seat_fair_fracThe fair share ratios determined by the APIServer during the last borrowing adjustment period.
apiserver_flowcontrol_target_seatsThe target number of seats in APIServer rate limiting.
apiserver_flowcontrol_upper_limit_seatsThe upper bound of seats in APIServer rate limiting.
apiserver_flowcontrol_watch_count_samples_bucketThe distribution of observed samples in APIServer rate limiting.
apiserver_flowcontrol_watch_count_samples_countThe count of observed samples in APIServer rate limiting.
apiserver_flowcontrol_watch_count_samples_sumThe sum of observed samples in APIServer rate limiting.
apiserver_flowcontrol_work_estimated_seats_bucketThe distribution of estimated seats in APIServer rate limiting.
apiserver_flowcontrol_work_estimated_seats_countThe count of estimated seats in APIServer rate limiting.
apiserver_flowcontrol_work_estimated_seats_sumThe sum of estimated seats in APIServer rate limiting.
apiserver_init_events_totalThe total number of initialization events in the APIServer.
apiserver_kube_aggregator_x509_insecure_sha1_totalThe number of requests using insecure Secure Hash Algorithm 1 (SHA1) signatures.
apiserver_kube_aggregator_x509_missing_san_totalThe total number of x509 certificates missing Subject Alternative Names (SANs) in APIServer kube-aggregator.
apiserver_longrunning_gaugeThe long-running meter in the APIServer.
apiserver_longrunning_requestsThe long-running requests in the APIServer.
apiserver_nodeport_repair_reconcile_errors_totalThe total number of node port fix reconcile errors in the APIServer.
apiserver_realtime_watchersThe number of real-time observers in the APIServer.
apiserver_registered_watchersThe number of registered watchers in the APIServer.
apiserver_request_aborts_totalThe total number of suspended APIServer requests.
apiserver_request_body_size_bytes_bucketThe distribution of APIServer request body sizes in bytes.
apiserver_request_body_size_bytes_countThe count of APIServer request body sizes in bytes.
apiserver_request_body_size_bytes_sumThe sum of APIServer request body sizes in bytes.
apiserver_request_countThe number of APIServer requests.
apiserver_request_duration_seconds_bucketThe distribution of APIServer request latencies in seconds.
apiserver_request_duration_seconds_countThe count of APIServer request latencies in seconds.
apiserver_request_duration_seconds_sumThe sum of APIServer request latencies in seconds.
apiserver_request_filter_duration_seconds_bucketThe distribution of request filter latencies in seconds.
apiserver_request_filter_duration_seconds_countThe count of request filter latencies in seconds.
apiserver_request_filter_duration_seconds_sumThe sum of request filter latencies in seconds.
apiserver_request_latencies_summaryThe summary of APIServer request latencies.
apiserver_request_no_resourceversion_list_totalThe total number of unversioned LIST requests.
apiserver_request_post_timeout_totalThe total number of timed out POST requests.
apiserver_request_sli_duration_seconds_bucketThe distribution of Service Level Indicator (SLI) request latencies in seconds.
apiserver_request_sli_duration_seconds_countThe count of SLI request latencies in seconds.
apiserver_request_sli_duration_seconds_sumThe sum of SLI request latencies in seconds.
apiserver_request_slo_duration_seconds_bucketThe distribution of Service Level Objective (SLO) request latencies in seconds.
apiserver_request_slo_duration_seconds_countThe count of SLO request latencies in seconds.
apiserver_request_slo_duration_seconds_sumThe sum of SLO request latencies in seconds.
apiserver_request_terminations_totalThe total number of terminated API requests.
apiserver_request_timestamp_comparison_time_bucketThe distribution of time spent in timestamp comparison of API requests.
apiserver_request_timestamp_comparison_time_countThe count of API request samples for timestamp comparison.
apiserver_request_timestamp_comparison_time_sumThe sum of time spent in timestamp comparison of API requests.
apiserver_request_totalThe total number of API requests.
apiserver_requested_deprecated_apisThe count of APIServer requests for deprecated APIs.
apiserver_response_sizes_bucketThe distribution of response body sizes of API requests.
apiserver_response_sizes_countThe count of response body sizes of API requests.
apiserver_response_sizes_sumThe sum of response body sizes of API requests.
apiserver_selfrequest_totalThe total number of APIServer self-requests.
apiserver_storage_data_key_generation_duration_seconds_bucketThe distribution of time consumed by the APIServer to generate data keys in seconds.
apiserver_storage_data_key_generation_duration_seconds_countThe count of time consumed by the APIServer to generate data keys in seconds.
apiserver_storage_data_key_generation_duration_seconds_sumThe sum of time consumed by the APIServer to generate data keys in seconds.
apiserver_storage_data_key_generation_failures_totalThe total number of data key generation failures.
apiserver_storage_db_total_size_in_bytesThe total size of APIServer databases in bytes.
apiserver_storage_decode_errors_totalThe total number of decoding errors in the APIServer.
apiserver_storage_envelope_transformation_cache_misses_totalThe total number of envelope conversion cache misses in the APIServer.
apiserver_storage_events_received_totalThe total number of events received by the APIServer.
apiserver_storage_list_evaluated_objects_totalThe total number of evaluated objects in the APIServer storage list.
apiserver_storage_list_fetched_objects_totalThe total number of objects obtained by the APIServer storage list.
apiserver_storage_list_returned_objects_totalThe total number of objects returned by the APIServer storage list.
apiserver_storage_list_totalThe total number of operations performed by the APIServer storage list.
apiserver_storage_objectsThe number of objects stored in the APIServer.
apiserver_storage_size_bytesThe total size of objects stored in the APIServer.
apiserver_terminated_watchers_totalThe total number of watchers terminated by the APIServer.
apiserver_tls_handshake_errors_totalThe total number of requests with Transport Layer Security (TLS) handshake errors in the APIServer.
apiserver_too_large_resourceversion_errorsThe total number of requests whose resource version is too large in the APIServer.
apiserver_watch_cache_events_dispatched_totalThe total number of events dispatched by the APIServer watch cache.
apiserver_watch_cache_events_received_totalThe total number of events received by the APIServer watch cache.
apiserver_watch_cache_initializations_totalThe total number of cache initializations observed by the APIServer.
apiserver_watch_cache_read_wait_seconds_bucketThe distribution of cache read waiting durations in seconds observed by the APIServer.
apiserver_watch_cache_read_wait_seconds_countThe count of cache read waiting durations in seconds observed by the APIServer.
apiserver_watch_cache_read_wait_seconds_sumThe sum of cache read waiting durations in seconds observed by the APIServer.
apiserver_watch_cache_watch_cache_initializations_totalThe total number of cache initializations observed by the APIServer.
apiserver_watch_events_sizes_bucketThe distribution of sizes of events observed by the APIServer.
apiserver_watch_events_sizes_countThe count of sizes of events observed by the APIServer.
apiserver_watch_events_sizes_sumThe sum of sizes of events observed by the APIServer.
apiserver_watch_events_totalThe total number of events observed by the APIServer.
apiserver_webhooks_x509_insecure_sha1_totalThe number of requests using insecure SHA1 signatures.
apiserver_webhooks_x509_missing_san_totalThe total number of missing SANs in APIServer webhooks.
authenticated_user_requestsThe total number of authenticated user requests.
authentication_attemptsThe number of authentication attempts.
authentication_duration_seconds_bucketThe distribution of authentication durations in seconds.
authentication_duration_seconds_countThe count of authentication durations in seconds.
authentication_duration_seconds_sumThe sum of authentication durations in seconds.
authentication_token_cache_active_fetch_countThe count of active fetches for the authentication token cache.
authentication_token_cache_fetch_totalThe total number of times the authentication token was retrieved from the cache.
authentication_token_cache_request_duration_seconds_bucketThe distribution of request durations in seconds for authentication token cache.
authentication_token_cache_request_duration_seconds_countThe count of request durations in seconds for authentication token cache.
authentication_token_cache_request_duration_seconds_sumThe sum of request durations in seconds for authentication token cache.
authentication_token_cache_request_totalThe total number of requests for authentication token cache.
authorization_attempts_totalThe total number of authorization attempts.
authorization_duration_seconds_bucketThe distribution of authorization durations in seconds.
authorization_duration_seconds_countThe count of authorization durations in seconds.
authorization_duration_seconds_sumThe sum of authorization durations in seconds.
cardinality_enforcement_unexpected_categorizations_totalThe total number of unexpected classifications in classification execution.
countThe count details.
cpu_utilization_coreThe CPU utilization of the core.
disabled_metric_totalThe total number of disabled metrics.
disabled_metrics_totalThe total number of disabled metrics.
etcd_bookmark_countsThe number of ETCD bookmarks.
etcd_db_total_size_in_bytesThe total size of ETCD databases in bytes.
etcd_lease_object_counts_bucketThe distribution of objects attached to a single ETCD lease.
etcd_lease_object_counts_countThe count of objects attached to a single ETCD lease.
etcd_lease_object_counts_sumThe sum of objects attached to a single ETCD lease.
etcd_object_countsThe number of ETCD objects.
etcd_request_duration_seconds_bucketThe distribution of ETCD request latencies in seconds.
etcd_request_duration_seconds_countThe count of ETCD request latencies in seconds.
etcd_request_duration_seconds_sumThe sum of ETCD request latencies in seconds.
etcd_request_errors_totalThe total number of failed ETCD requests.
etcd_requests_totalThe total number of ETCD requests.
etcd_watcher_channel_lengthThe channel length of the ETCD watcher.
etcd_watcher_received_eventsThe number of events received by the ETCD watcher.
etcd_watcher_sended_events_latency_milliseconds_bucketThe distribution of event sending latencies of the ETCD watcher in milliseconds.
etcd_watcher_sended_events_latency_milliseconds_countThe count of event sending latencies of the ETCD watcher in milliseconds.
etcd_watcher_sended_events_latency_milliseconds_sumThe sum of event sending latencies of the ETCD watcher in milliseconds.
field_validation_request_duration_seconds_bucketThe distribution of field validation request latencies in seconds.
field_validation_request_duration_seconds_countThe count of field validation request latencies in seconds.
field_validation_request_duration_seconds_sumThe sum of field validation request latencies in seconds.
get_token_countThe number of obtained tokens.
get_token_fail_countThe number of token obtaining failures.
go_cgo_go_to_c_calls_calls_totalThe total number of C function calls made by cgo.
go_cpu_classes_gc_mark_assist_cpu_seconds_totalThe total CPU seconds spent on garbage collection (GC) mark assistance by Go.
go_cpu_classes_gc_mark_dedicated_cpu_seconds_totalThe total CPU seconds spent on dedicated GC marking by Go.
go_cpu_classes_gc_mark_idle_cpu_seconds_totalThe total CPU seconds spent on idle GC marking by Go.
go_cpu_classes_gc_pause_cpu_seconds_totalThe total CPU seconds spent on GC pauses by Go.
go_cpu_classes_gc_total_cpu_seconds_totalThe total CPU seconds spent on GC by Go.
go_cpu_classes_idle_cpu_seconds_totalThe total CPU idle time in Go.
go_cpu_classes_scavenge_assist_cpu_seconds_totalThe total CPU seconds spent by Go on eagerly returning unused memory to the operating system (scavenging) in response to memory pressure.
go_cpu_classes_scavenge_background_cpu_seconds_totalThe total CPU seconds spent by Go on background scavenging, which returns unused memory to the operating system.
go_cpu_classes_scavenge_total_cpu_seconds_totalThe total CPU seconds spent by Go on scavenging.
go_cpu_classes_total_cpu_seconds_totalThe total CPU seconds.
go_cpu_classes_user_cpu_seconds_totalThe user CPU time.
go_gc_cycles_automatic_gc_cycles_totalThe total number of automatic GC cycles.
go_gc_cycles_forced_gc_cycles_totalThe total number of forced GC cycles.
go_gc_cycles_total_gc_cycles_totalThe total number of GC cycles.
go_gc_duration_secondsThe GC pause time in seconds.
go_gc_duration_seconds_countThe count of GC pause time in seconds.
go_gc_duration_seconds_sumThe sum of GC pause time in seconds.
go_gc_gogc_percentThe Go GC target percentage (GOGC).
go_gc_gomemlimit_bytesThe GC memory limit in bytes.
go_gc_heap_allocs_by_size_bytes_bucketThe distribution of allocated heap memory sizes in bytes.
go_gc_heap_allocs_by_size_bytes_countThe count of allocated heap memory sizes in bytes.
go_gc_heap_allocs_by_size_bytes_sumThe sum of allocated heap memory sizes in bytes.
go_gc_heap_allocs_by_size_bytes_total_bucketThe distribution of all allocated heap memory sizes in bytes.
go_gc_heap_allocs_by_size_bytes_total_countThe count of all allocated heap memory sizes in bytes.
go_gc_heap_allocs_by_size_bytes_total_sumThe sum of all allocated heap memory sizes in bytes.
go_gc_heap_allocs_bytes_totalThe total number of bytes allocated on the heap.
go_gc_heap_allocs_objects_totalThe total number of objects allocated on the heap.
go_gc_heap_frees_by_size_bytes_bucketThe distribution of released heap memory sizes in bytes.
go_gc_heap_frees_by_size_bytes_countThe count of released heap memory sizes in bytes.
go_gc_heap_frees_by_size_bytes_sumThe sum of released heap memory sizes in bytes.
go_gc_heap_frees_by_size_bytes_total_bucketThe distribution of all released heap memory sizes in bytes.
go_gc_heap_frees_by_size_bytes_total_countThe count of all released heap memory sizes in bytes.
go_gc_heap_frees_by_size_bytes_total_sumThe sum of all released heap memory sizes in bytes.
go_gc_heap_frees_bytes_totalThe total number of bytes released from the heap.
go_gc_heap_frees_objects_totalThe total number of objects released from the heap.
go_gc_heap_goal_bytesThe expected heap size in bytes.
go_gc_heap_live_bytesThe heap memory occupied by live objects in bytes.
go_gc_heap_objects_objectsThe number of objects that occupy the heap memory.
go_gc_heap_tiny_allocs_objects_totalThe total number of tiny object allocations.
go_gc_limiter_last_enabled_gc_cycleThe GC cycle during which the GC CPU limiter was last enabled.
go_gc_pauses_seconds_bucketThe distribution of GC pause durations.
go_gc_pauses_seconds_countThe count of GC pause durations.
go_gc_pauses_seconds_sumThe sum of GC pause durations.
go_gc_pauses_seconds_total_bucketThe distribution of all GC pause durations.
go_gc_pauses_seconds_total_countThe count of all GC pause durations.
go_gc_pauses_seconds_total_sumThe sum of all GC pause durations.
go_gc_scan_globals_bytesThe number of bytes scanned in global variables.
go_gc_scan_heap_bytesThe number of bytes scanned in the heap.
go_gc_scan_stack_bytesThe number of bytes scanned in the stack.
go_gc_scan_total_bytesThe total number of scanned bytes.
go_gc_stack_starting_size_bytesThe initial stack size in bytes.
go_godebug_non_default_behavior_execerrdot_events_totalThe count of non-default behavior debug events related to the execerrdot debug setting.
go_godebug_non_default_behavior_gocachehash_events_totalThe count of non-default behavior debug events related to the gocachehash debug setting.
go_godebug_non_default_behavior_gocachetest_events_totalThe count of non-default behavior debug events related to the gocachetest debug setting.
go_godebug_non_default_behavior_gocacheverify_events_totalThe count of non-default behavior debug events related to the gocacheverify debug setting.
go_godebug_non_default_behavior_gotypesalias_events_totalThe count of non-default behavior debug events related to the gotypesalias debug setting.
go_godebug_non_default_behavior_http2client_events_totalThe count of non-default behavior debug events related to the http2client debug setting.
go_godebug_non_default_behavior_http2server_events_totalThe count of non-default behavior debug events related to the http2server debug setting.
go_godebug_non_default_behavior_httplaxcontentlength_events_totalThe count of non-default behavior debug events related to the httplaxcontentlength debug setting.
go_godebug_non_default_behavior_httpmuxgo121_events_totalThe count of non-default behavior debug events related to the httpmuxgo121 debug setting.
go_godebug_non_default_behavior_installgoroot_events_totalThe count of non-default behavior debug events related to the installgoroot debug setting.
go_godebug_non_default_behavior_jstmpllitinterp_events_totalThe count of non-default behavior debug events related to the jstmpllitinterp debug setting.
go_godebug_non_default_behavior_multipartmaxheaders_events_totalThe count of non-default behavior debug events related to the multipartmaxheaders debug setting.
go_godebug_non_default_behavior_multipartmaxparts_events_totalThe count of non-default behavior debug events related to the multipartmaxparts debug setting.
go_godebug_non_default_behavior_multipathtcp_events_totalThe count of non-default behavior debug events related to the multipathtcp debug setting.
go_godebug_non_default_behavior_panicnil_events_totalThe count of non-default behavior debug events related to the panicnil debug setting.
go_godebug_non_default_behavior_randautoseed_events_totalThe count of non-default behavior debug events related to the randautoseed debug setting.
go_godebug_non_default_behavior_tarinsecurepath_events_totalThe count of non-default behavior debug events related to the tarinsecurepath debug setting.
go_godebug_non_default_behavior_tls10server_events_totalThe count of non-default behavior debug events related to the tls10server debug setting.
go_godebug_non_default_behavior_tlsmaxrsasize_events_totalThe count of non-default behavior debug events related to the tlsmaxrsasize debug setting.
go_godebug_non_default_behavior_tlsrsakex_events_totalThe count of non-default behavior debug events related to the tlsrsakex debug setting.
go_godebug_non_default_behavior_tlsunsafeekm_events_totalThe count of non-default behavior debug events related to the tlsunsafeekm debug setting.
go_godebug_non_default_behavior_x509sha1_events_totalThe count of non-default behavior debug events related to the x509sha1 debug setting.
go_godebug_non_default_behavior_x509usefallbackroots_events_totalThe count of non-default behavior debug events related to the x509usefallbackroots debug setting.
go_godebug_non_default_behavior_x509usepolicies_events_totalThe count of non-default behavior debug events related to the x509usepolicies debug setting.
go_godebug_non_default_behavior_zipinsecurepath_events_totalThe count of non-default behavior debug events related to the zipinsecurepath debug setting.
go_goroutinesThe number of goroutines.
go_infoThe Go environment information, such as the Go version.
go_memory_classes_heap_free_bytesThe amount of idle heap memory in bytes.
go_memory_classes_heap_objects_bytesThe amount of heap memory occupied by objects in bytes.
go_memory_classes_heap_released_bytesThe amount of heap memory released in bytes.
go_memory_classes_heap_stacks_bytesThe amount of memory reserved for the stack in bytes.
go_memory_classes_heap_unused_bytesThe amount of heap memory not used in bytes.
go_memory_classes_metadata_mcache_free_bytesThe amount of idle memory in mcache in bytes.
go_memory_classes_metadata_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memory_classes_metadata_mspan_free_bytesThe amount of idle memory in mspan in bytes.
go_memory_classes_metadata_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memory_classes_metadata_other_bytesThe amount of memory occupied by other metadata in bytes.
go_memory_classes_os_stacks_bytesThe amount of memory reserved for the operating system stack in bytes.
go_memory_classes_other_bytesThe amount of memory used for other purposes in bytes.
go_memory_classes_profiling_buckets_bytesThe bytes used by profiling buckets.
go_memory_classes_total_bytesThe total memory in bytes.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by hash tables in the operating system in bytes.
go_memstats_frees_totalThe total number of releases.
go_memstats_gc_cpu_fractionThe fraction of available CPU time used by GC since the program started.
go_memstats_gc_sys_bytesThe amount of memory used by GC in the operating system in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, in seconds since the Unix epoch.
go_memstats_lookups_totalThe total number of lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size at which the next GC is triggered, in bytes.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_sched_gomaxprocs_threadsThe number of threads determined by GOMAXPROCS.
go_sched_goroutines_goroutinesThe number of goroutines.
go_sched_latencies_seconds_bucketThe distribution of scheduling latencies in seconds.
go_sched_latencies_seconds_countThe count of scheduling latencies in seconds.
go_sched_latencies_seconds_sumThe sum of scheduling latencies in seconds.
go_sched_pauses_stopping_gc_seconds_bucketThe distribution of stop-the-world GC pause durations in seconds.
go_sched_pauses_stopping_gc_seconds_countThe count of stop-the-world GC pause durations in seconds.
go_sched_pauses_stopping_gc_seconds_sumThe sum of stop-the-world GC pause durations in seconds.
go_sched_pauses_stopping_other_seconds_bucketThe distribution of stopping latencies for non-GC stop-the-world pauses in seconds.
go_sched_pauses_stopping_other_seconds_countThe count of stopping latencies for non-GC stop-the-world pauses in seconds.
go_sched_pauses_stopping_other_seconds_sumThe sum of stopping latencies for non-GC stop-the-world pauses in seconds.
go_sched_pauses_total_gc_seconds_bucketThe distribution of all GC pause durations in seconds.
go_sched_pauses_total_gc_seconds_countThe count of all GC pause durations in seconds.
go_sched_pauses_total_gc_seconds_sumThe sum of all GC pause durations in seconds.
go_sched_pauses_total_other_seconds_bucketThe distribution of total durations of non-GC stop-the-world pauses in seconds.
go_sched_pauses_total_other_seconds_countThe count of total durations of non-GC stop-the-world pauses in seconds.
go_sched_pauses_total_other_seconds_sumThe sum of total durations of non-GC stop-the-world pauses in seconds.
go_sync_mutex_wait_total_seconds_totalThe total waiting duration for Mutex locks in seconds.
go_threadsThe number of Go threads.
grpc_client_handled_totalThe total number of requests handled by the gRPC client.
grpc_client_msg_received_totalThe total number of messages received by the gRPC client.
grpc_client_msg_sent_totalThe total number of messages sent by the gRPC client.
grpc_client_started_totalThe total number of RPCs started by the gRPC client.
hidden_metric_totalThe total number of hidden metrics.
hidden_metrics_totalThe total number of hidden metrics.
http_request_duration_microsecondsThe HTTP request latency in microseconds.
http_request_size_bytesThe HTTP request size in bytes.
http_requests_totalThe total number of HTTP requests.
http_response_size_bytesThe HTTP response body size in bytes.
jobThe job name.
job_instance_modeThe job instance mode.
kube_apiserver_clusterip_allocator_allocated_ipsKubernetes APIServer: The number of allocated cluster IP addresses.
kube_apiserver_clusterip_allocator_allocation_errors_totalKubernetes APIServer: The total number of errors that occurred in cluster IP address allocations.
kube_apiserver_clusterip_allocator_allocation_totalKubernetes APIServer: The total number of cluster IP address allocations.
kube_apiserver_clusterip_allocator_available_ipsKubernetes APIServer: The number of available cluster IP addresses.
kube_apiserver_nodeport_allocator_allocated_portsKubernetes APIServer: The number of allocated node ports.
kube_apiserver_nodeport_allocator_allocation_errors_totalKubernetes APIServer: The total number of errors that occurred in node port allocations.
kube_apiserver_nodeport_allocator_allocation_totalKubernetes APIServer: The total number of node port allocations.
kube_apiserver_nodeport_allocator_available_portsKubernetes APIServer: The number of available node ports.
kube_apiserver_pod_logs_backend_tls_failure_totalKubernetes APIServer: The total number of pod/log requests that failed due to TLS verification errors.
kube_apiserver_pod_logs_insecure_backend_totalKubernetes APIServer: The total number of insecure pod/log requests.
kube_apiserver_pod_logs_pods_logs_backend_tls_failure_totalKubernetes APIServer: The total number of pod/log requests that failed due to TLS verification errors.
kube_apiserver_pod_logs_pods_logs_insecure_backend_totalKubernetes APIServer: The total number of insecure pod/log requests.
kubelet_container_log_filesystem_used_bytesKubelet: The space of the file system used by container logs in bytes.
kubelet_node_nameKubelet: The node name.
kubelet_pleg_relist_duration_seconds_bucketKubelet: The distribution of PLEG relisting durations in seconds.
kubelet_pod_worker_duration_seconds_bucketKubelet: The distribution of durations for syncing a single pod in seconds.
kubelet_volume_stats_available_bytesKubelet: The number of available bytes in the volume.
kubelet_volume_stats_capacity_bytesKubelet: The volume capacity in bytes.
kubelet_volume_stats_inodesKubelet: The total number of inodes in the volume.
kubelet_volume_stats_inodes_freeKubelet: The number of free inodes in the volume.
kubelet_volume_stats_inodes_usedKubelet: The number of used inodes in the volume.
kubelet_volume_stats_used_bytesKubelet: The number of used bytes in the volume.
kubernetes_build_infoThe Kubernetes build information.
kubernetes_feature_enabledIndicates whether each Kubernetes feature is enabled.
last_list_all_response_size_in_bytesThe total size of all response bodies in the recent list in bytes.
memory_utilization_byteThe used memory in bytes.
node_authorizer_graph_actions_duration_seconds_bucketNode authorizer: The distribution of graph operation durations in seconds.
node_authorizer_graph_actions_duration_seconds_countNode authorizer: The count of graph operation durations in seconds.
node_authorizer_graph_actions_duration_seconds_sumNode authorizer: The sum of graph operation durations in seconds.
pod_security_evaluations_totalThe total number of pod security evaluations.
pod_security_exemptions_totalThe total number of pod security exemptions.
process_cpu_seconds_totalThe total process CPU seconds.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, in seconds since the Unix epoch.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
registered_metric_totalThe total number of registered metrics.
registered_metrics_totalThe total number of registered metrics.
rest_client_exec_plugin_certificate_rotation_age_bucketREST client plug-in: The distribution of certificate rotation ages in seconds.
rest_client_exec_plugin_certificate_rotation_age_countREST client plug-in: The count of certificate rotation ages in seconds.
rest_client_exec_plugin_certificate_rotation_age_sumREST client plug-in: The sum of certificate rotation ages in seconds.
rest_client_exec_plugin_ttl_secondsREST client plug-in: The time to live (TTL) of the certificate in seconds.
rest_client_request_duration_seconds_bucketThe distribution of REST client request durations in seconds.
rest_client_request_duration_seconds_countThe count of REST client request durations in seconds.
rest_client_request_duration_seconds_sumThe sum of REST client request durations in seconds.
rest_client_request_latency_seconds_bucketThe distribution of REST client request latencies in seconds.
rest_client_request_size_bytes_bucketThe distribution of REST client request-body sizes in bytes.
rest_client_request_size_bytes_countThe count of REST client request-body sizes in bytes.
rest_client_request_size_bytes_sumThe sum of REST client request-body sizes in bytes.
rest_client_requests_totalThe number of REST client requests.
rest_client_response_size_bytes_bucketThe distribution of REST client response-body sizes in bytes.
rest_client_response_size_bytes_countThe count of REST client response-body sizes in bytes.
rest_client_response_size_bytes_sumThe sum of REST client response-body sizes in bytes.
rest_client_transport_cache_entriesThe number of transport entries of the REST client.
rest_client_transport_create_calls_totalThe total number of transport creation calls of the REST client.
scheduler_pending_podsScheduler: The number of pods to be scheduled.
scheduler_pod_scheduling_attempts_bucketScheduler: The distribution of pod scheduling attempts.
scheduler_scheduler_cache_sizeThe scheduler cache size.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
serviceaccount_invalid_legacy_auto_token_uses_totalThe total number of uses of invalid legacy automatic service account tokens.
serviceaccount_legacy_auto_token_uses_totalThe total number of uses of legacy automatic service account tokens.
serviceaccount_legacy_manual_token_uses_totalThe total number of uses of legacy manual service account tokens.
serviceaccount_legacy_tokens_totalThe total number of legacy service account tokens.
serviceaccount_stale_tokens_totalThe total number of stale service account tokens.
serviceaccount_valid_tokens_totalThe total number of valid service account tokens.
ssh_tunnel_open_countThe number of opened Secure Shell (SSH) tunnels.
ssh_tunnel_open_fail_countThe number of SSH tunnels that failed to be opened.
upIndicates whether the scrape target is reachable: 1 if the last scrape succeeded, 0 otherwise.
watch_cache_capacityThe capacity of the watch cache.
watch_cache_capacity_decrease_totalThe total number of watch cache capacity decrease events.
watch_cache_capacity_increase_totalThe total number of watch cache capacity increase events.
workqueue_adds_totalThe total number of additions to the work queue.
workqueue_depthThe work queue depth.
workqueue_longest_running_processor_secondsThe longest running processor time in the work queue in seconds.
workqueue_queue_duration_seconds_bucketThe distribution of queueing durations in the work queue in seconds.
workqueue_queue_duration_seconds_countThe count of queueing durations in the work queue in seconds.
workqueue_queue_duration_seconds_sumThe sum of queueing durations in the work queue in seconds.
workqueue_retries_totalThe total number of retries in the work queue.
workqueue_unfinished_work_secondsThe duration of unfinished work in the work queue in seconds.
workqueue_work_duration_seconds_bucketThe distribution of work durations in the work queue in seconds.
workqueue_work_duration_seconds_countThe count of work durations in the work queue in seconds.
workqueue_work_duration_seconds_sumThe sum of work durations in the work queue in seconds.
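
Many metrics in the preceding table come in groups of three series with the suffixes _bucket, _count, and _sum. These are the series of a Prometheus histogram. The following is a minimal sketch, not part of the product, of how the three series relate: the mean is _sum divided by _count, and a quantile can be estimated from the cumulative _bucket counts by linear interpolation, similar in spirit to the PromQL histogram_quantile function. The function names and the sample bucket values are hypothetical and chosen only for illustration.

```python
import math


def average(sum_value, count_value):
    """Mean observation: <metric>_sum divided by <metric>_count."""
    return sum_value / count_value if count_value else math.nan


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs, one per
    <metric>_bucket series (the upper bound is the `le` label value).
    The last bucket is expected to have an upper bound of +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket, so the best
                # available answer is the highest finite upper bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical samples of a latency histogram, in seconds.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
mean = average(30.0, 100)
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.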

Node Exporter

Job name: node-exporter

Hardware and OS-level metrics from cluster nodes, including CPU, memory, disk, network, and filesystem statistics.
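
Most metrics in this category whose names end in _total are counters: their raw values only grow, so they are usually read as a rate between two samples. The following is a minimal sketch, assuming two hypothetical samples of node_cpu_seconds_total with the label mode="idle" summed over all cores of one node. The function names and the sample values are illustrative and are not part of the product.

```python
def counter_rate(value_start, value_end, interval_seconds):
    """Per-second increase of a counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = value_end - value_start
    if delta < 0:
        # The counter was reset (for example, the exporter restarted).
        # Treat the new value as the increase since the reset.
        delta = value_end
    return delta / interval_seconds


def cpu_utilization(idle_start, idle_end, interval_seconds, cores):
    """Fraction of CPU capacity in use, derived from idle CPU seconds."""
    idle_per_second = counter_rate(idle_start, idle_end, interval_seconds)
    return 1.0 - idle_per_second / cores


# Hypothetical samples taken 60 seconds apart on a 2-core node.
utilization = cpu_utilization(1000.0, 1090.0, 60, cores=2)
```

With these sample values, the node was idle for 90 of the 120 available CPU seconds, which gives a utilization of 0.25.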

MetricDescription
ALERTSThe alerts.
ALERTS_FOR_STATEThe time at which each alert became active, used to restore the alert state.
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
countThe Go-specific count details.
go_gc_duration_secondsThe Go GC pause duration in seconds.
go_gc_duration_seconds_countThe number of Go GC pauses.
go_gc_duration_seconds_sumThe total Go GC pause duration in seconds.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by hash tables in the operating system in bytes.
go_memstats_frees_totalThe total number of releases.
go_memstats_gc_cpu_fractionThe fraction of available CPU time used by GC since the program started.
go_memstats_gc_sys_bytesThe amount of memory used by GC in the operating system in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, in seconds since the Unix epoch.
go_memstats_lookups_totalThe total number of lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size at which the next GC is triggered, in bytes.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_threadsThe number of threads.
instanceThe instance.
instance_deviceThe instance device.
jobThe job name.
k8s_node_cpu_utilizationThe CPU utilization of Kubernetes nodes.
k8s_node_disk_utilizationThe disk usage of Kubernetes nodes.
k8s_node_memory_utilizationThe memory usage of Kubernetes nodes.
node_arp_entriesThe number of Address Resolution Protocol (ARP) entries on the node.
node_boot_time_secondsThe boot time of the node, in seconds since the Unix epoch.
node_context_switches_totalThe total number of context switches on the node.
node_cooling_device_cur_stateThe current state of the cooling device of the node.
node_cooling_device_max_stateThe maximum state of the cooling device of the node.
node_cpu_core_throttles_totalThe total number of CPU core throttling events on the node.
node_cpu_frequency_max_hertzThe maximum CPU frequency of the node in Hertz.
node_cpu_frequency_min_hertzThe minimum CPU frequency of the node in Hertz.
node_cpu_guest_seconds_totalThe total time that the node CPUs spent running guests (virtual machines) in seconds.
node_cpu_package_throttles_totalThe total number of CPU package throttling events on the node.
node_cpu_scaling_frequency_hertzThe current scaled CPU frequency of the node in Hertz.
node_cpu_scaling_frequency_max_hertzThe maximum scaled CPU frequency of the node in Hertz.
node_cpu_scaling_frequency_min_hertzThe minimum scaled CPU frequency of the node in Hertz.
node_cpu_scaling_governorThe current CPU frequency scaling governor of the node.
node_cpu_seconds_totalThe total CPU time consumed on the node.
node_disk_device_mapper_infoThe DeviceMapper information of the node.
node_disk_discard_time_seconds_totalThe total disk discard time of the node in seconds.
node_disk_discarded_sectors_totalThe total number of discarded disk sectors of the node.
node_disk_discards_completed_totalThe total number of completed disk discards of the node.
node_disk_discards_merged_totalThe total number of merged disk discards of the node.
node_disk_filesystem_infoThe file system information of the node.
node_disk_flush_requests_time_seconds_totalThe total flush request duration of the node in seconds.
node_disk_flush_requests_totalThe total number of flush requests of the node.
node_disk_infoThe node disk information.
node_disk_io_nowThe number of disk I/O operations currently in progress on the node.
node_disk_io_time_seconds_totalThe total disk I/O duration of the node in seconds.
node_disk_io_time_weighted_seconds_totalThe total weighted disk I/O time of the node in seconds.
node_disk_read_bytes_totalThe total number of bytes read from the disk of the node.
node_disk_read_time_seconds_totalThe total disk read time of the node in seconds.
node_disk_reads_completed_totalThe total number of completed disk reads of the node.
node_disk_reads_merged_totalThe total number of merged disk reads of the node.
node_disk_write_time_seconds_totalThe total disk write time of the node in seconds.
node_disk_writes_completed_totalThe total number of completed disk writes of the node.
node_disk_writes_merged_totalThe total number of merged disk writes of the node.
node_disk_written_bytes_totalThe total number of bytes written to the disk of the node.
node_dmi_infoThe Desktop Management Interface (DMI) information of the node.
node_edac_correctable_errors_totalThe total number of correctable memory errors of the node.
node_edac_csrow_correctable_errors_totalThe total number of correctable memory errors in chip-select rows of the node.
node_edac_csrow_uncorrectable_errors_totalThe total number of uncorrectable memory errors in chip-select rows of the node.
node_edac_uncorrectable_errors_totalThe total number of uncorrectable memory errors of the node.
node_entropy_available_bitsThe number of bits of available entropy of the node.
node_entropy_pool_size_bitsThe number of bits of the entropy pool of the node.
node_exporter_build_infoThe build information of the node exporter.
node_filefd_allocatedThe number of allocated file descriptors of the node.
node_filefd_maximumThe maximum number of file descriptors of the node.
node_filesystem_avail_bytesThe available bytes of the node file system.
node_filesystem_device_errorIndicates whether an error occurred while retrieving statistics for the file system device of the node.
node_filesystem_filesThe total number of file nodes (inodes) in the file system of the node.
node_filesystem_files_freeThe number of free file nodes (inodes) in the file system of the node.
node_filesystem_free_bytesThe amount of free space in the file system of the node in bytes.
node_filesystem_readonlyThe read-only state of the file system of the node.
node_filesystem_size_bytesThe total size of the file system of the node in bytes.
node_forks_totalThe total number of process forks of the node.
node_infiniband_excessive_buffer_overrun_errors_totalThe total number of InfiniBand excessive buffer overflow errors on the node.
node_infiniband_infoThe InfiniBand information of the node.
node_infiniband_link_downed_totalThe total number of InfiniBand link down events on the node.
node_infiniband_link_error_recovery_totalThe total number of InfiniBand link error recoveries on the node.
node_infiniband_local_link_integrity_errors_totalThe total number of InfiniBand local link integrity errors of the node.
node_infiniband_multicast_packets_received_totalThe total number of InfiniBand multicast packets received on the node.
node_infiniband_multicast_packets_transmitted_totalThe total number of InfiniBand multicast packets sent from the node.
node_infiniband_physical_state_idThe physical state ID of the InfiniBand port on the node.
node_infiniband_port_constraint_errors_received_totalThe total number of InfiniBand port constraint errors received on the node.
node_infiniband_port_constraint_errors_transmitted_totalThe total number of InfiniBand port constraint errors sent from the node.
node_infiniband_port_data_received_bytes_totalThe total bytes of data received by the InfiniBand port of the node.
node_infiniband_port_data_transmitted_bytes_totalThe total bytes of data sent by the InfiniBand port of the node.
node_infiniband_port_discards_transmitted_totalThe total number of outbound packets discarded by the InfiniBand port of the node.
node_infiniband_port_errors_received_totalThe total number of errors received by the InfiniBand port of the node.
node_infiniband_port_packets_received_totalThe total number of packets received by the InfiniBand port of the node.
node_infiniband_port_packets_transmitted_totalThe total number of packets sent by the InfiniBand port of the node.
node_infiniband_port_receive_remote_physical_errors_totalThe total number of remote physical errors received by the InfiniBand port of the node.
node_infiniband_port_receive_switch_relay_errors_totalThe total number of switch relay errors received by the InfiniBand port of the node.
node_infiniband_port_transmit_wait_totalThe total number of send waits on the InfiniBand port of the node.
node_infiniband_rate_bytes_per_secondThe InfiniBand port rate in bytes per second on the node.
node_infiniband_state_idThe state ID of the InfiniBand port of the node.
node_infiniband_symbol_error_totalThe total number of InfiniBand symbol errors of the node.
node_infiniband_unicast_packets_received_totalThe total number of unicast packets received on the InfiniBand port of the node.
node_infiniband_unicast_packets_transmitted_totalThe total number of unicast packets sent by the InfiniBand port of the node.
node_infiniband_vl15_dropped_totalThe total number of VL15 packets dropped by the InfiniBand port of the node.
node_intr_totalThe total number of interrupts serviced on the node.
node_load1The 1-minute load average of the node.
node_load15The 15-minute load average of the node.
node_load5The 5-minute load average of the node.
node_memory_Active_anon_bytesThe size of anonymous active memory on the node in bytes.
node_memory_Active_bytesThe size of active memory on the node in bytes.
node_memory_Active_file_bytesThe size of active file memory on the node (in bytes).
node_memory_AnonHugePages_bytesThe size of anonymous huge pages on the node (in bytes).
node_memory_AnonPages_bytesThe size of anonymous pages on the node (in bytes).
node_memory_Bounce_bytesThe size of bounce pages on the node (in bytes).
node_memory_Buffers_bytesThe size of buffers memory on the node (in bytes).
node_memory_Cached_bytesThe size of cached memory on the node (in bytes).
node_memory_CmaFree_bytesThe size of Contiguous Memory Allocator (CMA) free memory on the node (in bytes).
node_memory_CmaTotal_bytesThe total size of CMA memory on the node (in bytes).
node_memory_CommitLimit_bytesThe commit limit of memory on the node (in bytes).
node_memory_Committed_AS_bytesThe committed address space of memory on the node (in bytes).
node_memory_DirectMap1G_bytesThe size of 1 GB direct map memory on the node (in bytes).
node_memory_DirectMap2M_bytesThe size of 2 MB direct map memory on the node (in bytes).
node_memory_DirectMap4k_bytesThe size of 4 KB direct map memory on the node (in bytes).
node_memory_Dirty_bytesThe size of dirty memory on the node (in bytes).
node_memory_DupText_bytesThe size of duplicate text memory on the node (in bytes).
node_memory_FileHugePages_bytesThe size of file huge pages memory on the node (in bytes).
node_memory_FilePmdMapped_bytesThe size of file-backed memory mapped through page middle directory (PMD) entries on the node (in bytes).
node_memory_HardwareCorrupted_bytesThe size of hardware corrupted memory on the node (in bytes).
node_memory_HugePages_FreeThe number of free huge pages on the node.
node_memory_HugePages_RsvdThe number of reserved huge pages on the node.
node_memory_HugePages_SurpThe number of surplus huge pages on the node.
node_memory_HugePages_TotalThe total number of huge pages on the node.
node_memory_Hugepagesize_bytesThe size of huge pages on the node (in bytes).
node_memory_Hugetlb_bytesThe size of Hugetlb memory on the node (in bytes).
node_memory_Inactive_anon_bytesThe size of inactive anonymous memory on the node (in bytes).
node_memory_Inactive_bytesThe size of inactive memory on the node (in bytes).
node_memory_Inactive_file_bytesThe size of inactive file memory on the node (in bytes).
node_memory_KernelStack_bytesThe size of KernelStack memory on the node (in bytes).
node_memory_KReclaimable_bytesThe size of KReclaimable memory on the node (in bytes).
node_memory_Mapped_bytesThe size of mapped memory on the node (in bytes).
node_memory_MemAvailable_bytesThe size of available memory on the node (in bytes).
node_memory_MemFree_bytesThe size of free memory on the node (in bytes).
node_memory_MemTotal_bytesThe total size of memory on the node (in bytes).
node_memory_MemZeroed_bytesThe size of zeroed memory on the node (in bytes).
node_memory_Mlocked_bytesThe size of locked memory on the node (in bytes).
node_memory_NFS_Unstable_bytesThe size of unstable NFS memory on the node (in bytes).
node_memory_PageTables_bytesThe size of page table memory on the node (in bytes).
node_memory_Percpu_bytesThe size of per-CPU memory on the node (in bytes).
node_memory_Shmem_bytesThe size of shared memory on the node (in bytes).
node_memory_ShmemHugePages_bytesThe size of shared huge pages memory on the node (in bytes).
node_memory_ShmemPmdMapped_bytesThe size of shared memory page middle directory (PMD) mapping on the node (in bytes).
node_memory_Slab_bytesThe size of Slab memory on the node (in bytes).
node_memory_SReclaimable_bytesThe size of SReclaimable memory on the node (in bytes).
node_memory_SUnreclaim_bytesThe size of SUnreclaim memory on the node (in bytes).
node_memory_SwapCached_bytesThe size of cached swap space on the node (in bytes).
node_memory_SwapFree_bytesThe size of free swap space on the node (in bytes).
node_memory_SwapTotal_bytesThe total size of swap space on the node (in bytes).
node_memory_Unevictable_bytesThe size of unevictable memory on the node (in bytes).
node_memory_VmallocChunk_bytesThe size of vmallocChunk memory on the node (in bytes).
node_memory_VmallocTotal_bytesThe total size of vmalloc memory on the node (in bytes).
node_memory_VmallocUsed_bytesThe size of used vmalloc memory on the node (in bytes).
node_memory_Writeback_bytesThe size of writeback memory on the node (in bytes).
node_memory_WritebackTmp_bytesThe size of temporary writeback memory on the node (in bytes).
node_netstat_Icmp_InErrorsThe number of Internet Control Message Protocol (ICMP) receive errors on the node.
node_netstat_Icmp_InMsgsThe number of received ICMP messages.
node_netstat_Icmp_OutMsgsThe number of sent ICMP messages.
node_netstat_Icmp6_InErrorsThe number of ICMPv6 receive errors.
node_netstat_Icmp6_InMsgsThe number of ICMPv6 messages received.
node_netstat_Icmp6_OutMsgsThe number of ICMPv6 messages sent.
node_netstat_Ip_ForwardingThe status of IP forwarding.
node_netstat_Ip6_InOctetsThe number of bytes received over IPv6.
node_netstat_Ip6_OutOctetsThe number of bytes sent over IPv6.
node_netstat_IpExt_InOctetsThe number of bytes received for IP extended statistics.
node_netstat_IpExt_OutOctetsThe number of bytes sent for IP extended statistics.
node_netstat_Tcp_ActiveOpensThe number of active TCP connections opened.
node_netstat_Tcp_CurrEstabThe current number of established TCP connections.
node_netstat_Tcp_InErrsThe number of TCP receive errors.
node_netstat_Tcp_InSegsThe number of TCP segments received.
node_netstat_Tcp_OutRstsThe number of TCP resets sent.
node_netstat_Tcp_OutSegsThe number of TCP segments sent.
node_netstat_Tcp_PassiveOpensThe number of passive TCP connections opened.
node_netstat_Tcp_RetransSegsThe number of TCP segments retransmitted.
node_netstat_TcpExt_ListenDropsThe number of TCP connections dropped from the listen queue.
node_netstat_TcpExt_ListenOverflowsThe number of times the listen queue overflowed.
node_netstat_TcpExt_SyncookiesFailedThe number of times SYN_COOKIE validation failed.
node_netstat_TcpExt_SyncookiesRecvThe number of SYN_COOKIES received.
node_netstat_TcpExt_SyncookiesSentThe number of SYN_COOKIES sent.
node_netstat_TcpExt_TCPOFOQueueThe number of packets queued in the TCP out-of-order (OFO) queue.
node_netstat_TcpExt_TCPSynRetransThe number of TCP SYN retransmissions.
node_netstat_TcpExt_TCPTimeoutsThe number of TCP timeouts.
node_netstat_Udp_InDatagramsThe number of UDP datagrams received.
node_netstat_Udp_InErrorsThe number of UDP receive errors.
node_netstat_Udp_NoPortsThe number of UDP packets with unreachable destination ports.
node_netstat_Udp_OutDatagramsThe number of UDP datagrams sent.
node_netstat_Udp_RcvbufErrorsThe number of UDP receive buffer errors.
node_netstat_Udp_SndbufErrorsThe number of UDP send buffer errors.
node_netstat_Udp6_InDatagramsThe number of IPv6 UDP datagrams received.
node_netstat_Udp6_InErrorsThe number of IPv6 UDP receive errors.
node_netstat_Udp6_NoPortsThe number of IPv6 UDP packets with unreachable destination ports.
node_netstat_Udp6_OutDatagramsThe number of IPv6 UDP datagrams sent.
node_netstat_Udp6_RcvbufErrorsThe number of IPv6 UDP receive buffer errors.
node_netstat_Udp6_SndbufErrorsThe number of IPv6 UDP send buffer errors.
node_netstat_UdpLite_InErrorsThe number of UDP Lite receive errors.
node_netstat_UdpLite6_InErrorsThe number of IPv6 UDP Lite receive errors.
node_network_address_assign_typeThe assignment type of the network address.
node_network_carrierThe carrier state of the network interface.
node_network_carrier_changes_totalThe total number of network carrier changes.
node_network_carrier_down_changes_totalThe total number of times the network carrier changed to the down state.
node_network_carrier_up_changes_totalThe total number of times the network carrier changed to the up state.
node_network_device_idThe device ID of the network interface.
node_network_dormantThe status of network dormancy.
node_network_flagsThe network flags.
node_network_iface_idThe network interface ID.
node_network_iface_linkThe link state of the network interface.
node_network_iface_link_modeThe link mode of the network interface.
node_network_infoThe information about the network interface.
node_network_mtu_bytesThe maximum transmission unit size in bytes on the network.
node_network_name_assign_typeThe assignment type of the network name.
node_network_net_dev_groupThe network device group to which the network device belongs.
node_network_protocol_typeThe network protocol type.
node_network_receive_bytes_totalThe total number of bytes received cumulatively.
node_network_receive_compressed_totalThe total number of compressed packets received.
node_network_receive_drop_totalThe total number of packets dropped while receiving.
node_network_receive_errs_totalThe total number of receive errors.
node_network_receive_fifo_totalThe total number of receive first-in, first-out (FIFO) buffer errors while receiving.
node_network_receive_frame_totalThe total number of frame alignment errors while receiving.
node_network_receive_multicast_totalThe total number of multicast packets received.
node_network_receive_nohandler_totalThe total number of receptions without a handler.
node_network_receive_packets_totalThe total number of packets received.
node_network_speed_bytesThe network speed in bytes per second.
node_network_transmit_bytes_totalThe total number of bytes sent cumulatively.
node_network_transmit_carrier_totalThe total number of carrier errors while sending.
node_network_transmit_colls_totalThe total number of transmission collisions.
node_network_transmit_compressed_totalThe total number of compressed packets sent.
node_network_transmit_drop_totalThe total number of packets sent but dropped.
node_network_transmit_errs_totalThe total number of send errors.
node_network_transmit_fifo_totalThe total number of FIFO buffer errors while sending.
node_network_transmit_packets_totalThe total number of packets sent.
node_network_transmit_queue_lengthThe length of the send queue.
node_network_upIndicates whether the network interface is enabled.
node_nf_conntrack_entriesThe number of entries in the connection tracking table.
node_nf_conntrack_entries_limitThe limit of entries in the connection tracking table.
node_nf_conntrack_stat_dropThe drop count for connection tracking.
node_nf_conntrack_stat_early_dropThe early drop count for connection tracking.
node_nf_conntrack_stat_foundThe successful lookup count for connection tracking.
node_nf_conntrack_stat_ignoreThe ignore count for connection tracking.
node_nf_conntrack_stat_insertThe insert count for connection tracking.
node_nf_conntrack_stat_insert_failedThe insert failure count for connection tracking.
node_nf_conntrack_stat_invalidThe invalid count for connection tracking.
node_nf_conntrack_stat_search_restartThe search restart count for connection tracking.
node_nfs_connections_totalThe total number of NFS connections.
node_nfs_packets_totalThe total number of NFS packets.
node_nfs_requests_totalThe total number of NFS requests.
node_nfs_rpc_authentication_refreshes_totalThe total number of NFS Remote Procedure Call (RPC) authentication refreshes.
node_nfs_rpc_retransmissions_totalThe total number of NFS RPC retransmissions.
node_nfs_rpcs_totalThe total number of NFS RPCs.
node_nfsd_connections_totalThe total number of connections to the NFS server.
node_nfsd_disk_bytes_read_totalThe total number of bytes read from the disk by the NFS server.
node_nfsd_disk_bytes_written_totalThe total number of bytes written to the disk by the NFS server.
node_nfsd_file_handles_stale_totalThe total number of stale file handles on the NFS server.
node_nfsd_packets_totalThe total number of packets processed by the NFS server.
node_nfsd_read_ahead_cache_not_found_totalThe total number of times the read-ahead cache of the NFS server was not found.
node_nfsd_read_ahead_cache_size_blocksThe size of blocks in the read-ahead cache of the NFS server.
node_nfsd_reply_cache_hits_totalThe total number of hits in the NFS server reply cache.
node_nfsd_reply_cache_misses_totalThe total number of misses in the NFS server reply cache.
node_nfsd_reply_cache_nocache_totalThe total number of no-cache situations in the NFS server reply cache.
node_nfsd_requests_totalThe total number of requests to the NFS server.
node_nfsd_rpc_errors_totalThe total number of RPC errors on the NFS server.
node_nfsd_server_rpcs_totalThe total number of RPCs processed by the NFS server.
node_nfsd_server_threadsThe number of threads on the NFS server.
node_nvme_infoThe information about Non-Volatile Memory Express (NVMe).
node_os_infoThe information about the operating system.
node_os_versionThe version of the operating system.
node_pressure_cpu_waiting_seconds_totalThe total seconds the CPU has spent waiting under pressure.
node_pressure_io_stalled_seconds_totalThe total seconds the I/O has been stalled under pressure.
node_pressure_io_waiting_seconds_totalThe total seconds the I/O has spent waiting under pressure.
node_pressure_memory_stalled_seconds_totalThe total seconds memory has been stalled under pressure.
node_pressure_memory_waiting_seconds_totalThe total seconds memory has spent waiting under pressure.
node_processes_max_processesThe maximum number of processes.
node_processes_max_threadsThe maximum number of threads.
node_processes_pidsThe number of process IDs.
node_processes_stateThe distribution of process states.
node_processes_threadsThe number of threads.
node_procs_blockedThe number of blocked processes.
node_procs_runningThe number of running processes.
node_schedstat_running_seconds_totalThe total seconds run in scheduling statistics.
node_schedstat_timeslices_totalThe total number of time slices in scheduling statistics.
node_schedstat_waiting_seconds_totalThe total seconds waited in scheduling statistics.
node_scrape_collector_duration_secondsThe duration of the scrape collector in seconds.
node_scrape_collector_successIndicates whether the collector scrape succeeded.
node_selinux_enabledIndicates whether Security-Enhanced Linux (SELinux) is enabled.
node_sockstat_FRAG_inuseThe number of FRAG sockets in use.
node_sockstat_FRAG_memoryThe amount of memory occupied by FRAG sockets.
node_sockstat_FRAG6_inuseThe number of FRAG6 sockets in use.
node_sockstat_FRAG6_memoryThe amount of memory occupied by FRAG6 sockets.
node_sockstat_RAW_inuseThe number of RAW sockets in use.
node_sockstat_RAW6_inuseThe number of RAW6 sockets in use.
node_sockstat_sockets_usedThe total number of sockets in use.
node_sockstat_TCP_allocThe number of TCP sockets allocated.
node_sockstat_TCP_inuseThe number of TCP sockets in use.
node_sockstat_TCP_memThe amount of memory used by TCP sockets.
node_sockstat_TCP_mem_bytesThe number of bytes of memory used by TCP sockets.
node_sockstat_TCP_orphanThe number of orphaned TCP sockets.
node_sockstat_TCP_twThe number of TCP sockets in the TIME_WAIT state.
node_sockstat_TCP6_inuseThe number of TCP6 sockets in use.
node_sockstat_UDP_inuseThe number of UDP sockets in use.
node_sockstat_UDP_memThe amount of memory used by UDP sockets.
node_sockstat_UDP_mem_bytesThe number of bytes of memory used by UDP sockets.
node_sockstat_UDP6_inuseThe number of IPv6 UDP sockets in use.
node_sockstat_UDPLITE_inuseThe number of UDP-Lite sockets in use.
node_sockstat_UDPLITE6_inuseThe number of UDP-Lite6 sockets in use.
node_softnet_backlog_lenThe length of the soft interrupt queue.
node_softnet_cpu_collision_totalThe total number of CPU collisions in soft interrupts.
node_softnet_dropped_totalThe total number of soft interrupts dropped.
node_softnet_flow_limit_count_totalThe total number of flow limit counts in soft interrupts.
node_softnet_processed_totalThe total number of soft interrupts processed.
node_softnet_received_rps_totalThe total number of times a CPU was woken up by Receive Packet Steering (RPS) to process packets.
node_softnet_times_squeezed_totalThe total number of times soft interrupts were squeezed.
node_textfile_scrape_errorThe number of text file scrape errors.
node_thermal_zone_tempThe temperature of the thermal zone.
node_time_clocksource_available_infoThe available clock source information.
node_time_clocksource_current_infoThe information about the current clock source.
node_time_secondsThe system time of the node, in seconds since the Unix epoch.
node_time_zone_offset_secondsThe time zone offset in seconds.
node_timex_estimated_error_secondsThe estimated time error in seconds.
node_timex_frequency_adjustment_ratioThe frequency adjustment ratio of the system clock.
node_timex_loop_time_constantThe time adjustment loop constant.
node_timex_maxerror_secondsThe maximum error in seconds.
node_timex_offset_secondsThe time offset in seconds.
node_timex_pps_calibration_totalThe total number of pulse per second (PPS) calibrations.
node_timex_pps_error_totalThe total number of PPS errors.
node_timex_pps_frequency_hertzThe PPS frequency in Hz.
node_timex_pps_jitter_secondsThe PPS jitter in seconds.
node_timex_pps_jitter_totalThe cumulative PPS jitter.
node_timex_pps_shift_secondsThe PPS offset in seconds.
node_timex_pps_stability_exceeded_totalThe number of times PPS stability exceeded limits.
node_timex_pps_stability_hertzThe PPS stability frequency in hertz.
node_timex_statusThe status of clock time adjustments.
node_timex_sync_statusThe synchronization status of the clock.
node_timex_tai_offset_secondsThe International Atomic Time (TAI) offset in seconds.
node_timex_tick_secondsThe tick interval of the clock in seconds.
node_udp_queuesThe statistics of UDP queues.
node_uname_infoThe system information (uname).
node_vmstat_oom_killThe number of out-of-memory (OOM) kills in VM statistics.
node_vmstat_pgfaultThe number of page faults in VM statistics.
node_vmstat_pgmajfaultThe number of major page faults in VM statistics.
node_vmstat_pgpginThe number of page ins in VM statistics.
node_vmstat_pgpgoutThe number of page outs in VM statistics.
node_vmstat_pswpinThe number of swap page ins in VM statistics.
node_vmstat_pswpoutThe number of swap page outs in VM statistics.
node_xfs_allocation_btree_compares_totalThe total number of B-tree comparisons for XFS allocation.
node_xfs_allocation_btree_lookups_totalThe total number of B-tree lookups for XFS allocation.
node_xfs_allocation_btree_records_deleted_totalThe total number of B-tree records deleted for XFS allocation.
node_xfs_allocation_btree_records_inserted_totalThe total number of B-tree records inserted for XFS allocation.
node_xfs_block_map_btree_compares_totalThe total number of B-tree comparisons for XFS block mapping.
node_xfs_block_map_btree_lookups_totalThe total number of B-tree lookups for XFS block mapping.
node_xfs_block_map_btree_records_deleted_totalThe total number of B-tree records deleted for XFS block mapping.
node_xfs_block_map_btree_records_inserted_totalThe total number of B-tree records inserted for XFS block mapping.
node_xfs_block_mapping_extent_list_compares_totalThe total number of extent list comparisons for XFS block mapping.
node_xfs_block_mapping_extent_list_deletions_totalThe total number of extent list deletions for XFS block mapping.
node_xfs_block_mapping_extent_list_insertions_totalThe total number of extent list insertions for XFS block mapping.
node_xfs_block_mapping_extent_list_lookups_totalThe total number of extent list lookups for XFS block mapping.
node_xfs_block_mapping_reads_totalThe total number of reads for XFS block mapping.
node_xfs_block_mapping_unmaps_totalThe total number of unmappings for XFS block mapping.
node_xfs_block_mapping_writes_totalThe total number of writes for XFS block mapping.
node_xfs_directory_operation_create_totalThe total number of directory creation operations in XFS.
node_xfs_directory_operation_getdents_totalThe total number of directory entry retrieval operations in XFS.
node_xfs_directory_operation_lookup_totalThe total number of directory lookup operations in XFS.
node_xfs_directory_operation_remove_totalThe total number of directory removal operations in XFS.
node_xfs_extent_allocation_blocks_allocated_totalThe total number of blocks allocated in XFS.
node_xfs_extent_allocation_blocks_freed_totalThe total number of blocks freed in XFS.
node_xfs_extent_allocation_extents_allocated_totalThe total number of extents allocated in XFS.
node_xfs_extent_allocation_extents_freed_totalThe total number of extents freed in XFS.
node_xfs_inode_operation_attempts_totalThe total number of attempts at inode operations in XFS.
node_xfs_inode_operation_attribute_changes_totalThe total number of attribute change operations on inodes in XFS.
node_xfs_inode_operation_duplicates_totalThe total number of duplicate operations on inodes in XFS.
node_xfs_inode_operation_found_totalThe total number of hits in inode operations in XFS.
node_xfs_inode_operation_missed_totalThe total number of misses in inode operations in XFS.
node_xfs_inode_operation_reclaims_totalThe total number of reclaim operations on inodes in XFS.
node_xfs_inode_operation_recycled_totalThe total number of reuse operations on inodes in XFS.
node_xfs_read_calls_totalThe total number of read calls in XFS.
node_xfs_vnode_active_totalThe total number of active vnodes in XFS.
node_xfs_vnode_allocate_totalThe total number of vnode allocations in XFS.
node_xfs_vnode_get_totalThe total number of vnode retrievals in XFS.
node_xfs_vnode_hold_totalThe total number of vnodes held in XFS.
node_xfs_vnode_reclaim_totalThe total number of vnodes reclaimed in XFS.
node_xfs_vnode_release_totalThe total number of vnodes released in XFS.
node_xfs_vnode_remove_totalThe total number of vnodes removed in XFS.
node_xfs_write_calls_totalThe total number of write calls in XFS.
process_cpu_seconds_totalThe total user and system CPU time consumed by the process, in seconds.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, in seconds since the Unix epoch.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
promhttp_metric_handler_errors_totalThe total number of errors from the Prometheus HTTP metric handler.
promhttp_metric_handler_requests_in_flightThe current number of requests being handled by the Prometheus HTTP metric handler.
promhttp_metric_handler_requests_totalThe total number of requests handled by the Prometheus HTTP metric handler.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upIndicates whether the scrape target is reachable (1) or unreachable (0).
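
Many of the Node Exporter metrics above are raw values that are usually combined or differentiated before they are useful. Metrics whose names end in _total are typically cumulative counters, and the node_memory_*_bytes metrics are gauges. The following Python sketch shows one way to derive memory utilization and a per-second rate from samples that have already been retrieved. The function names are hypothetical and are not part of Managed Service for Prometheus, and the rate calculation is a simplification that does not reproduce the extrapolation performed by the PromQL rate() function.

```python
def memory_utilization(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Return the fraction of node memory in use.

    Inputs correspond to node_memory_MemTotal_bytes and
    node_memory_MemAvailable_bytes sampled at the same time.
    """
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 1.0 - mem_available_bytes / mem_total_bytes


def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    """Return the average per-second increase between two counter samples.

    Intended for cumulative counters such as node_network_receive_bytes_total.
    Timestamps are in seconds. A decrease between samples is treated as a
    counter reset (assumption: the counter restarted from zero).
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("curr_ts must be later than prev_ts")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset: only the growth since the restart is known.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(memory_utilization(16e9, 4e9))
    print(counter_rate(1000.0, 0.0, 4000.0, 30.0))
```

Treating a decrease as a reset keeps the result non-negative when a node or exporter restarts between two samples.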

kube-state-metrics

Job name: _kube-state-metrics

Kubernetes object state metrics generated from the Kubernetes API. Covers Deployments, DaemonSets, StatefulSets, Pods, nodes, jobs, HPAs, and other cluster resources.

MetricDescription
kube_configmap_infoThe information about the ConfigMap.
kube_cronjob_annotationsThe annotations of the Kubernetes CronJob.
kube_cronjob_createdThe creation time of the Kubernetes CronJob.
kube_cronjob_infoThe information about the Kubernetes CronJob.
kube_cronjob_labelsThe labels of the Kubernetes CronJob.
kube_cronjob_metadata_resource_versionThe metadata resource version of the Kubernetes CronJob.
kube_cronjob_next_schedule_timeThe next schedule time of the Kubernetes CronJob.
kube_cronjob_spec_failed_job_history_limitThe failed job history limit of the Kubernetes CronJob.
kube_cronjob_spec_starting_deadline_secondsThe starting deadline seconds of the Kubernetes CronJob.
kube_cronjob_spec_successful_job_history_limitThe successful job history limit of the Kubernetes CronJob.
kube_cronjob_spec_suspendThe suspend status of the Kubernetes CronJob.
kube_cronjob_status_activeThe number of active jobs of the Kubernetes CronJob.
kube_cronjob_status_last_schedule_timeThe last schedule time of the Kubernetes CronJob.
kube_cronjob_status_last_successful_timeThe last successful execution time of the Kubernetes CronJob.
kube_daemonset_createdThe creation time of the Kubernetes DaemonSet.
kube_daemonset_status_current_number_scheduledThe current number of scheduled nodes for the Kubernetes DaemonSet.
kube_daemonset_status_desired_number_scheduledThe desired number of scheduled nodes for the Kubernetes DaemonSet.
kube_daemonset_status_number_availableThe number of available nodes in the Kubernetes DaemonSet.
kube_daemonset_status_number_misscheduledThe number of nodes that run a daemon pod of the Kubernetes DaemonSet but are not supposed to.
kube_daemonset_status_number_readyThe number of ready nodes in the Kubernetes DaemonSet.
kube_daemonset_status_number_unavailableThe number of unavailable nodes in the Kubernetes DaemonSet.
kube_daemonset_status_updated_number_scheduledThe number of updated scheduled nodes in the Kubernetes DaemonSet.
kube_daemonset_updated_number_scheduledThe number of updated scheduled nodes in the Kubernetes DaemonSet.
kube_deployment_createdThe creation time of the Kubernetes Deployment.
kube_deployment_labelsThe labels of the Kubernetes Deployment.
kube_deployment_metadata_generationThe metadata generation of the Kubernetes Deployment.
kube_deployment_spec_replicasThe number of replicas specified in the Kubernetes Deployment.
kube_deployment_spec_strategy_rollingupdate_max_unavailableThe maximum number of unavailable pods during rolling update of the Kubernetes Deployment.
kube_deployment_status_observed_generationThe observed generation of the Kubernetes Deployment.
kube_deployment_status_replicasThe total number of replicas in the Kubernetes Deployment.
kube_deployment_status_replicas_availableThe number of available replicas in the Kubernetes Deployment.
kube_deployment_status_replicas_readyThe number of ready replicas in the Kubernetes Deployment.
kube_deployment_status_replicas_unavailableThe number of unavailable replicas in the Kubernetes Deployment.
kube_deployment_status_replicas_updatedThe number of updated replicas in the Kubernetes Deployment.
kube_horizontalpodautoscaler_infoThe information about the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_labelsThe labels of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_metadata_generationThe metadata generation of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_max_replicasThe maximum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_min_replicasThe minimum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_target_metricThe target metrics of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_conditionThe status conditions of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_current_replicasThe current number of replicas in the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_desired_replicasThe desired number of replicas in the Kubernetes HorizontalPodAutoscaler.
kube_hpa_labelsThe labels of the Kubernetes HorizontalPodAutoscaler.
kube_hpa_metadata_generationThe metadata generation of the Kubernetes HorizontalPodAutoscaler.
kube_hpa_spec_max_replicasThe maximum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.
kube_hpa_spec_min_replicasThe minimum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.
kube_hpa_spec_target_metricThe target metrics of the Kubernetes HorizontalPodAutoscaler.
kube_hpa_status_conditionThe status conditions of the Kubernetes HorizontalPodAutoscaler.
kube_hpa_status_current_replicasThe current number of replicas in the Kubernetes HorizontalPodAutoscaler.
kube_hpa_status_desired_replicasThe desired number of replicas in the Kubernetes HorizontalPodAutoscaler.
kube_ingress_infoThe information about the Ingress.
kube_job_createdThe creation time of the Job.
kube_job_failedWhether the Job has failed its execution.
kube_job_infoThe information about the Job.
kube_job_spec_completionsThe desired number of successfully finished Pods for the Job.
kube_job_status_activeThe number of actively running Pods of the Job.
kube_job_status_failedThe number of Pods of the Job that reached the Failed phase.
kube_job_status_succeededThe number of Pods of the Job that reached the Succeeded phase.
kube_namespace_createdThe creation time of the namespace.
kube_namespace_labelsThe labels of the namespace.
kube_namespace_status_phaseThe phase of the namespace status.
kube_node_infoThe information about the node.
kube_node_labelsThe labels of the node.
kube_node_spec_taintThe taint configurations of the node.
kube_node_spec_unschedulableThe unschedulable flag of the node.
kube_node_status_allocatableThe allocatable resources of the node.
kube_node_status_allocatable_cpu_coresThe number of allocatable CPU cores of the node.
kube_node_status_allocatable_memory_bytesThe allocatable memory of the node in bytes.
kube_node_status_allocatable_podsThe number of allocatable Pods on the node.
kube_node_status_capacityThe capacity of the node.
kube_node_status_capacity_cpu_coresThe CPU capacity of the node in cores.
kube_node_status_capacity_memory_bytesThe memory capacity of the node in bytes.
kube_node_status_capacity_podsThe Pod capacity of the node.
kube_node_status_conditionThe status conditions of the node.
kube_persistentvolume_status_phaseThe phase of the PersistentVolume (PV) status.
kube_persistentvolumeclaim_infoThe information about the PersistentVolumeClaim (PVC).
kube_persistentvolumeclaim_resource_requests_storage_bytesThe storage resource request of the PVC.
kube_persistentvolumeclaim_status_phaseThe phase of the PVC status.
kube_pod_completion_timeThe completion time of the Pod.
kube_pod_container_infoThe information about the Pod container.
kube_pod_container_resource_limitsThe resource limit of the Pod container.
kube_pod_container_resource_limits_cpu_coresThe CPU limit of the Pod container in cores.
kube_pod_container_resource_limits_memory_bytesThe memory limit of the Pod container in bytes.
kube_pod_container_resource_requestsThe resource requests of the Pod container.
kube_pod_container_resource_requests_cpu_coresThe CPU request of the Pod container in cores.
kube_pod_container_resource_requests_memory_bytesThe memory request of the Pod container in bytes.
kube_pod_container_status_last_terminated_reasonThe last termination reason of the Pod container.
kube_pod_container_status_readyThe ready status of the Pod container.
kube_pod_container_status_restarts_totalThe total number of restarts for the Pod container.
kube_pod_container_status_runningThe running status of the Pod container.
kube_pod_container_status_terminatedThe terminated status of the Pod container.
kube_pod_container_status_terminated_reasonThe termination reason of the Pod container.
kube_pod_container_status_waitingThe waiting status of the Pod container.
kube_pod_container_status_waiting_reasonThe waiting reason of the Pod container.
kube_pod_createdThe creation time of the Pod.
kube_pod_deletion_timestampThe deletion timestamp of the Pod.
kube_pod_infoThe information about the Pod.
kube_pod_labelsThe labels of the Pod.
kube_pod_ownerThe owner of the Pod.
kube_pod_start_timeThe start time of the Pod.
kube_pod_status_container_ready_timeThe time when the containers of the Pod became ready.
kube_pod_status_initialized_timeThe time when the Pod finished initialization.
kube_pod_status_phaseThe phase of the Pod status.
kube_pod_status_readyThe ready status of the Pod.
kube_pod_status_ready_timeThe ready time of the Pod.
kube_pod_status_reasonThe reason for the Pod status.
kube_pod_status_scheduled_timeThe scheduling time of the Pod.
kube_pod_status_unschedulableThe unschedulable flag of the Pod.
kube_replicaset_ownerThe owner of the ReplicaSet.
kube_replicaset_status_ready_replicasThe number of ready replicas in the ReplicaSet.
kube_resource_relationshipThe relationships between resources.
kube_resourcequotaThe resource quota.
kube_resourcequota_createdThe creation time of the resource quota.
kube_secret_infoThe information about the secret.
kube_service_infoThe information about the service.
kube_service_spec_typeThe type specification of the service.
kube_service_status_load_balancer_ingressThe load balancer ingress information of the service status.
kube_statefulset_createdThe creation time of the StatefulSet.
kube_statefulset_metadata_generationThe metadata generation of the StatefulSet.
kube_statefulset_replicasThe desired number of replicas of the StatefulSet.
kube_statefulset_status_replicasThe number of replicas reported in the status of the StatefulSet.
kube_statefulset_status_replicas_availableThe number of available replicas reported in the status of the StatefulSet.
kube_statefulset_status_replicas_readyThe number of ready replicas reported in the status of the StatefulSet.
kube_statefulset_status_replicas_updatedThe number of updated replicas reported in the status of the StatefulSet.
process_cpu_seconds_totalThe total number of CPU seconds used by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
rest_client_requests_totalThe number of REST client requests.
upThe connectivity of metric collection.
workqueue_adds_totalThe total number of additions to the work queue.
workqueue_depthThe work queue depth.
workqueue_queue_duration_seconds_bucketThe distribution of queue duration in seconds for the work queue.
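
The replica metrics in this table are usually compared with each other to judge rollout health. The following Python sketch is illustrative only and is not part of Managed Service for Prometheus. The function name unavailable_ratio and the sample values are hypothetical, and the sketch assumes that you have already read the latest values of kube_deployment_spec_replicas and kube_deployment_status_replicas_available for a single Deployment.

```python
def unavailable_ratio(spec_replicas: float, available_replicas: float) -> float:
    """Return the fraction of desired replicas that are not available.

    spec_replicas:      latest value of kube_deployment_spec_replicas
    available_replicas: latest value of kube_deployment_status_replicas_available
    """
    if spec_replicas <= 0:
        # A Deployment scaled to zero has nothing that can be unavailable.
        return 0.0
    # During a surge, more replicas than desired can be available; clamp at 0.
    missing = max(spec_replicas - available_replicas, 0.0)
    return missing / spec_replicas


if __name__ == "__main__":
    # Hypothetical sample: 4 replicas desired, 3 currently available.
    print(unavailable_ratio(4, 3))
```

A value of 0 means that every desired replica is available. A value that stays above 0 for a long time may indicate a stalled rollout.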

kube-events

Job name: _arms/kube-event

Metrics from the Kubernetes event collector, including event processing statistics and Prometheus agent scrape data.

MetricDescription
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
eventer_events_error_totalThe total number of event processing errors.
eventer_events_normal_totalThe total number of normal events.
eventer_events_warning_totalThe total number of warning events.
eventer_exporter_duration_milliseconds_countThe count of samples for exporter duration in milliseconds.
eventer_exporter_duration_milliseconds_sumThe sum of exporter duration in milliseconds.
eventer_manager_last_time_secondsThe last operation time of the event manager in seconds.
eventer_scraper_duration_milliseconds_countThe count of samples for scraper duration in milliseconds.
eventer_scraper_duration_milliseconds_sumThe sum of scraper duration in milliseconds.
eventer_scraper_events_total_numberThe total number of events scraped.
eventer_scraper_last_time_secondsThe last execution time of the scraper in seconds.
go_gc_duration_secondsThe Go GC pause duration in seconds.
go_gc_duration_seconds_countThe number of Go GC cycles observed.
go_gc_duration_seconds_sumThe total Go GC pause duration in seconds.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by the profiling bucket hash table in bytes.
go_memstats_frees_totalThe total number of heap objects freed.
go_memstats_gc_cpu_fractionThe fraction of the available CPU time used by GC since the program started. The value ranges from 0 to 1.
go_memstats_gc_sys_bytesThe amount of memory used for GC system metadata in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, as a UNIX timestamp in seconds.
go_memstats_lookups_totalThe total number of pointer lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size in bytes at which the next GC is triggered.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_threadsThe number of threads.
process_cpu_seconds_totalThe total process CPU seconds.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, as a UNIX timestamp in seconds.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
promhttp_metric_handler_requests_in_flightThe current number of requests being handled by the Prometheus HTTP metric handler.
promhttp_metric_handler_requests_totalThe total number of requests handled by the Prometheus HTTP metric handler.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upThe connectivity of metric collection.
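
Series that end in _sum and _count are normally read as a pair. The following Python sketch shows one way to derive a mean duration for the interval between two scrapes. It assumes that eventer_exporter_duration_milliseconds_sum and eventer_exporter_duration_milliseconds_count follow the usual Prometheus convention of cumulative values. The function name and the sample numbers are hypothetical.

```python
from typing import Optional


def mean_duration_ms(
    sum_prev: float, count_prev: float, sum_now: float, count_now: float
) -> Optional[float]:
    """Mean duration in milliseconds between two scrapes of a _sum/_count pair.

    Returns None when no new observations were recorded or when the counters
    were reset (for example, after the exporter restarted).
    """
    delta_count = count_now - count_prev
    delta_sum = sum_now - sum_prev
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count


if __name__ == "__main__":
    # Hypothetical samples: 10 new observations that took 300 ms in total.
    print(mean_duration_ms(1200.0, 40, 1500.0, 50))
```

The same calculation applies to the eventer_scraper_duration_milliseconds pair, under the same assumption.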

CoreDNS

Job name: arms-ack-coredns

Metrics from CoreDNS, the cluster DNS service. Covers query counts, resolution latencies, cache performance, and DNS error tracking.

MetricDescription
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
coredns_autopath_success_count_totalThe total number of successful automatic path resolutions in CoreDNS.
coredns_autopath_success_totalThe total number of successful automatic path resolutions in CoreDNS.
coredns_build_infoThe build information of CoreDNS.
coredns_cache_drops_totalThe total number of cache drops in CoreDNS.
coredns_cache_entriesThe number of cache entries in CoreDNS.
coredns_cache_evictions_totalThe total number of cache evictions in CoreDNS.
coredns_cache_hits_totalThe total number of cache hits in CoreDNS.
coredns_cache_misses_totalThe total number of cache misses in CoreDNS.
coredns_cache_requests_totalThe total number of cache requests in CoreDNS.
coredns_cache_sizeThe size of the cache in CoreDNS.
coredns_dns_do_requests_totalThe total number of DNS requests with the DO (DNSSEC OK) bit set in CoreDNS.
coredns_dns_request_count_totalThe total count of DNS requests in CoreDNS.
coredns_dns_request_duration_seconds_bucketThe histogram buckets of DNS request durations in seconds in CoreDNS.
coredns_dns_request_duration_seconds_countThe count of DNS request durations in seconds in CoreDNS.
coredns_dns_request_duration_seconds_sumThe sum of DNS request durations in seconds in CoreDNS.
coredns_dns_request_size_bytes_bucketThe histogram buckets of DNS request sizes in bytes in CoreDNS.
coredns_dns_request_size_bytes_countThe count of DNS request sizes in bytes in CoreDNS.
coredns_dns_request_size_bytes_sumThe sum of DNS request sizes in bytes in CoreDNS.
coredns_dns_request_type_count_totalThe total count of DNS request types in CoreDNS.
coredns_dns_requests_totalThe total number of DNS requests in CoreDNS.
coredns_dns_response_rcode_count_totalThe total count of DNS response codes in CoreDNS.
coredns_dns_response_size_bytes_bucketThe histogram buckets of DNS response sizes in bytes in CoreDNS.
coredns_dns_response_size_bytes_countThe count of DNS response sizes in bytes in CoreDNS.
coredns_dns_response_size_bytes_sumThe sum of DNS response sizes in bytes in CoreDNS.
coredns_dns_responses_totalThe total number of DNS responses in CoreDNS.
coredns_forward_conn_cache_hits_totalThe total number of cache hits for forwarded connections in CoreDNS.
coredns_forward_conn_cache_misses_totalThe total number of cache misses for forwarded connections in CoreDNS.
coredns_forward_healthcheck_broken_totalThe total number of times that all upstreams were unhealthy for forwarded connections in CoreDNS.
coredns_forward_healthcheck_failure_count_totalThe total count of health check failures for forwarded connections in CoreDNS.
coredns_forward_healthcheck_failures_totalThe total number of health check failures for forwarded connections in CoreDNS.
coredns_forward_max_concurrent_rejects_totalThe total number of forwarded queries rejected because the maximum number of concurrent queries was reached in CoreDNS.
coredns_forward_request_count_totalThe total count of forwarded requests in CoreDNS.
coredns_forward_request_duration_seconds_bucketThe histogram buckets of forwarded request durations in seconds in CoreDNS.
coredns_forward_request_duration_seconds_countThe count of forwarded request durations in seconds in CoreDNS.
coredns_forward_request_duration_seconds_sumThe sum of forwarded request durations in seconds in CoreDNS.
coredns_forward_requests_totalThe total number of forwarded requests in CoreDNS.
coredns_forward_response_rcode_count_totalThe total count of forwarded response codes in CoreDNS.
coredns_forward_responses_totalThe total number of forwarded responses in CoreDNS.
coredns_forward_sockets_openThe number of open sockets for forwarded connections in CoreDNS.
coredns_health_request_duration_seconds_bucketThe histogram buckets of health check request durations in seconds in CoreDNS.
coredns_health_request_duration_seconds_countThe count of health check request durations in seconds in CoreDNS.
coredns_health_request_duration_seconds_sumThe sum of health check request durations in seconds in CoreDNS.
coredns_health_request_failures_totalThe total number of health check request failures in CoreDNS.
coredns_hosts_entriesThe number of host entries in CoreDNS.
coredns_hosts_reload_timestamp_secondsThe timestamp of the last host reload in CoreDNS in seconds.
coredns_kubernetes_dns_programming_duration_seconds_bucketThe histogram buckets of Kubernetes DNS programming durations in seconds in CoreDNS.
coredns_kubernetes_dns_programming_duration_seconds_countThe count of Kubernetes DNS programming durations in seconds in CoreDNS.
coredns_kubernetes_dns_programming_duration_seconds_sumThe sum of Kubernetes DNS programming durations in seconds in CoreDNS.
coredns_local_localhost_requests_totalThe total number of localhost requests in CoreDNS.
coredns_panic_count_totalThe total number of panics in CoreDNS.
coredns_panics_totalThe total count of panics in CoreDNS.
coredns_plugin_enabledThe enabling status of CoreDNS plugins.
coredns_reload_failed_totalThe total number of reload failures in CoreDNS.
coredns_reload_version_infoThe version information of CoreDNS reloads.
coredns_template_matches_totalThe total number of template matches in CoreDNS.
go_gc_duration_secondsThe Go GC pause duration in seconds.
go_gc_duration_seconds_countThe number of Go GC cycles observed.
go_gc_duration_seconds_sumThe total Go GC pause duration in seconds.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by the profiling bucket hash table in bytes.
go_memstats_frees_totalThe total number of heap objects freed.
go_memstats_gc_cpu_fractionThe fraction of the available CPU time used by GC since the program started. The value ranges from 0 to 1.
go_memstats_gc_sys_bytesThe amount of memory used for GC system metadata in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, as a UNIX timestamp in seconds.
go_memstats_lookups_totalThe total number of pointer lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size in bytes at which the next GC is triggered.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_threadsThe number of threads.
process_cpu_seconds_totalThe total process CPU seconds.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, as a UNIX timestamp in seconds.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upThe connectivity of metric collection.
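
The _bucket series in this table can be used to estimate latency or size quantiles. The following Python sketch assumes the standard Prometheus histogram convention, in which each bucket has an upper bound (the le label) and a cumulative count. It applies a simplified form of the linear interpolation that the PromQL histogram_quantile function uses. The function name and the sample bucket values are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, for example taken from
             coredns_dns_request_duration_seconds_bucket. The last bucket
             is expected to have an upper bound of +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total <= 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical cumulative counts for four buckets.
    sample = [(0.001, 50), (0.01, 90), (0.1, 100), (math.inf, 100)]
    print(estimate_quantile(0.95, sample))
```

The result is an estimate. Its accuracy depends on how closely the bucket boundaries match the actual distribution of observations.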

CSI clusters

Job name: k8s-csi-cluster-pv

Cluster-level Container Storage Interface (CSI) metrics for PersistentVolume (PV) and PersistentVolumeClaim (PVC) monitoring.

MetricDescription
alibaba_cloud_storage_operator_build_infoThe build information of the Alibaba Cloud storage operator.
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
cluster_pv_detail_num_totalThe total number of PVs in the cluster, reported with PV details.
cluster_pv_status_num_totalThe total number of PVs in the cluster, reported by PV status.
cluster_pvc_detail_num_totalThe total number of PVCs in the cluster, reported with PVC details.
cluster_pvc_status_num_totalThe total number of PVCs in the cluster, reported by PVC status.
cluster_scrape_collector_duration_secondsThe duration of the cluster scrape collector in seconds.
cluster_scrape_collector_successThe number of successful scrapes by the cluster collector.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upThe connectivity of metric collection.

CSI nodes

Job name: k8s-csi-node-pv

Node-level CSI metrics covering disk I/O, volume health, NFS/NAS performance, and storage driver build information.

MetricDescription
alibaba_cloud_csi_driver_build_infoThe build information about the Container Storage Interface (CSI) driver.
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
cluster_scrape_collector_duration_secondsThe duration of the cluster scrape collector in seconds.
cluster_scrape_collector_successThe number of successful scrapes by the cluster collector.
container_fs_available_bytesThe available bytes of the container file system.
container_fs_inodes_freeThe number of available inodes in the container file system.
container_fs_inodes_totalThe total number of inodes in the container file system.
container_fs_inodes_usedThe number of used inodes in the container file system.
container_fs_limit_bytesThe limit of bytes in the container file system.
container_fs_usage_bytesThe used bytes in the container file system.
ephemeral_storage_pod_available_bytesThe available ephemeral storage of the Pod in bytes.
ephemeral_storage_pod_inodes_freeThe number of available inodes in the ephemeral storage of the Pod.
ephemeral_storage_pod_inodes_totalThe total number of inodes in the ephemeral storage of the Pod.
ephemeral_storage_pod_inodes_usedThe number of used inodes in the ephemeral storage of the Pod.
ephemeral_storage_pod_limit_bytesThe ephemeral storage limit of the Pod in bytes.
ephemeral_storage_pod_usage_bytesThe used ephemeral storage of the Pod in bytes.
node_volume_backend_posix_access_total_counterThe total counter for Portable Operating System Interface (POSIX) access to the node volume backend.
node_volume_backend_posix_getattr_total_counterThe total counter for POSIX getattr calls to the node volume backend.
node_volume_backend_posix_getmode_total_counterThe total counter for POSIX getmode operations to the node volume backend.
node_volume_backend_posix_link_total_counterThe total counter for POSIX link operations to the node volume backend.
node_volume_backend_posix_lookup_total_counterThe total counter for POSIX lookup operations to the node volume backend.
node_volume_backend_posix_mknod_total_counterThe total counter for POSIX mknod operations to the node volume backend.
node_volume_backend_posix_readdir_total_counterThe total counter for POSIX readdir operations to the node volume backend.
node_volume_backend_posix_readlink_total_counterThe total counter for POSIX readlink operations to the node volume backend.
node_volume_backend_posix_remove_total_counterThe total counter for POSIX remove operations to the node volume backend.
node_volume_backend_posix_rename_total_counterThe total counter for POSIX rename operations to the node volume backend.
node_volume_backend_posix_setattr_total_counterThe total counter for POSIX setattr operations to the node volume backend.
node_volume_backend_posix_statfs_total_counterThe total counter for POSIX statfs operations to the node volume backend.
node_volume_backend_read_bytes_total_counterThe total counter for bytes read from the node volume backend.
node_volume_backend_read_completed_total_counterThe total number of completed read requests to the node volume backend.
node_volume_backend_read_time_milliseconds_total_counterThe total milliseconds spent on reads to the node volume backend.
node_volume_backend_write_bytes_total_counterThe total number of bytes written to the node volume backend.
node_volume_backend_write_completed_total_counterThe total number of completed write requests to the node volume backend.
node_volume_backend_write_time_milliseconds_total_counterThe total milliseconds spent on writes to the node volume backend.
node_volume_capacity_bytes_availableThe available capacity of the node volume in bytes.
node_volume_capacity_bytes_available_counterThe available capacity of the node volume in bytes (counter).
node_volume_capacity_bytes_totalThe total capacity of the node volume in bytes.
node_volume_capacity_bytes_total_counterThe total capacity of the node volume in bytes (counter).
node_volume_capacity_bytes_usedThe used capacity of the node volume in bytes.
node_volume_capacity_bytes_used_counterThe used capacity of the node volume in bytes (counter).
node_volume_hot_spot_head_file_topThe top files by HEAD requests in the node volume hot spots.
node_volume_hot_spot_read_file_topThe top files read in the node volume hot spots.
node_volume_hot_spot_write_file_topThe top files written in the node volume hot spots.
node_volume_inode_bytes_available_counterThe counter for available inode bytes in the node volume.
node_volume_inode_bytes_total_counterThe counter for total inode bytes in the node volume.
node_volume_inode_bytes_used_counterThe counter for used inode bytes in the node volume.
node_volume_inodes_availableThe number of available inodes in the node volume.
node_volume_inodes_totalThe total number of inodes in the node volume.
node_volume_inodes_usedThe number of used inodes in the node volume.
node_volume_io_nowThe number of I/O operations currently in progress in the node volume.
node_volume_io_time_seconds_totalThe total seconds spent on I/O in the node volume.
node_volume_oss_delete_object_total_counterThe total counter for Object Storage Service (OSS) object deletions in the node volume.
node_volume_oss_get_object_total_counterThe total counter for OSS object GETs in the node volume.
node_volume_oss_head_object_total_counterThe total counter for OSS object HEADs (metadata requests) in the node volume.
node_volume_oss_post_object_total_counterThe total counter for OSS object POSTs in the node volume.
node_volume_oss_put_object_total_counterThe total counter for OSS object PUTs in the node volume.
node_volume_posix_access_total_counterThe total counter for POSIX accesses in the node volume.
node_volume_posix_chmod_total_counterThe total counter for POSIX chmod operations in the node volume.
node_volume_posix_chown_total_counterThe total counter for POSIX chown operations in the node volume.
node_volume_posix_create_total_counterThe total counter for POSIX creations in the node volume.
node_volume_posix_flush_total_counterThe total counter for POSIX flushes in the node volume.
node_volume_posix_fsync_total_counterThe total counter for POSIX fsyncs in the node volume.
node_volume_posix_mkdir_total_counterThe total counter for POSIX mkdir operations in the node volume.
node_volume_posix_open_total_counterThe total counter for POSIX opens in the node volume.
node_volume_posix_opendir_total_counterThe total counter for POSIX opendir operations in the node volume.
node_volume_posix_read_total_counterThe total counter for POSIX reads in the node volume.
node_volume_posix_readdir_total_counterThe total counter for POSIX readdir operations in the node volume.
node_volume_posix_release_total_counterThe total counter for POSIX releases in the node volume.
node_volume_posix_rename_total_counterThe total counter for POSIX renames in the node volume.
node_volume_posix_rmdir_total_counterThe total counter for POSIX rmdir operations in the node volume.
node_volume_posix_truncate_total_counterThe total counter for POSIX truncate operations in the node volume.
node_volume_posix_write_total_counterThe total counter for POSIX writes in the node volume.
node_volume_read_bytes_totalThe total number of bytes read from the node volume.
node_volume_read_bytes_total_counterThe total number of bytes read from the node volume (counter).
node_volume_read_completed_totalThe total number of completed read requests to the node volume.
node_volume_read_completed_total_counterThe total number of completed read requests to the node volume (counter).
node_volume_read_merged_totalThe total number of merged read operations in the node volume.
node_volume_read_queue_time_milliseconds_totalThe total milliseconds spent on read queue in the node volume.
node_volume_read_rtt_time_milliseconds_totalThe total milliseconds spent on read round-trip time in the node volume.
node_volume_read_sent_bytes_totalThe total number of bytes sent during reads in the node volume.
node_volume_read_time_milliseconds_totalThe total milliseconds spent on reads in the node volume.
node_volume_read_time_milliseconds_total_counterThe total milliseconds spent on reads in the node volume (counter).
node_volume_read_timeouts_totalThe total number of read timeouts in the node volume.
node_volume_read_transmissions_totalThe total number of read transmissions in the node volume.
node_volume_vg_free_bytesThe free bytes in the volume group (VG) of the node volume.
node_volume_vg_size_bytesThe total bytes in the VG of the node volume.
node_volume_write_bytes_totalThe total number of bytes written to the node volume.
node_volume_write_bytes_total_counterThe total number of bytes written to the node volume (counter).
node_volume_write_completed_totalThe total number of completed write requests to the node volume.
node_volume_write_completed_total_counterThe total number of completed write requests to the node volume (counter).
node_volume_write_merged_totalThe total number of merged write operations in the node volume.
node_volume_write_queue_time_milliseconds_totalThe total milliseconds spent on write queue in the node volume.
node_volume_write_recv_bytes_totalThe total number of bytes received during writes in the node volume.
node_volume_write_rtt_time_milliseconds_totalThe total milliseconds spent on write round-trip time in the node volume.
node_volume_write_time_milliseconds_totalThe total milliseconds spent on writes in the node volume.
node_volume_write_time_milliseconds_total_counterThe total milliseconds spent on writes in the node volume (counter).
node_volume_write_timeouts_totalThe total number of write timeouts in the node volume.
node_volume_write_transmissions_totalThe total number of write transmissions in the node volume.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upThe connectivity of metric collection.
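
The node volume counters in the table above are cumulative, so a throughput or an average latency is derived from the difference between two scrapes. The following Python sketch illustrates that arithmetic under stated assumptions: the sample values are invented, the helper names (counter_delta, avg_write_latency_ms, write_throughput_bytes_per_s) are hypothetical, and a drop in a counter value is treated as a reset to zero. It is an illustration only, not a description of how Managed Service for Prometheus processes these metrics.

```python
def counter_delta(previous, current):
    """Increase of a cumulative counter between two scrapes.

    Assumption: if the value dropped, the counter was reset to zero,
    so the increase since the reset equals the current value.
    """
    return current if current < previous else current - previous


def avg_write_latency_ms(time_prev, time_curr, done_prev, done_curr):
    """Average milliseconds per completed write between two scrapes.

    time_*: samples of node_volume_write_time_milliseconds_total
    done_*: samples of node_volume_write_completed_total
    Returns None when no write completed in the interval.
    """
    completed = counter_delta(done_prev, done_curr)
    if completed == 0:
        return None
    return counter_delta(time_prev, time_curr) / completed


def write_throughput_bytes_per_s(bytes_prev, bytes_curr, interval_s):
    """Bytes written per second between two scrapes.

    bytes_*: samples of node_volume_write_bytes_total
    interval_s: seconds between the two scrapes (must be positive)
    """
    return counter_delta(bytes_prev, bytes_curr) / interval_s


if __name__ == "__main__":
    # Invented samples taken 30 seconds apart.
    print(avg_write_latency_ms(5000, 5600, 1000, 1200))
    print(write_throughput_bytes_per_s(0, 3000, 30))
```

In a PromQL query the same idea is usually expressed with rate() over a range vector; the sketch only makes the underlying arithmetic explicit.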

GPU-Exporter

Job name: gpu-exporter

GPU metrics collected through NVIDIA Data Center GPU Manager (DCGM), covering utilization, memory, temperature, power, PCIe/NVLink throughput, and per-process GPU usage.

MetricDescription
DCGM_CUSTOM_ALLOCATE_MODEThe mode in which the node runs. A value of 0 indicates that no GPU Pods are running on the node. A value of 1 indicates that the GPU Pods on the current node run in an exclusive GPU mode. A value of 2 indicates that the GPU Pods on the current node run in a shared GPU mode.
DCGM_CUSTOM_CONTAINER_CP_ALLOCATEDThe ratio of the GPU computing power allocated to the container to the total computing power of the GPU. The value ranges from 0 to 1. In exclusive GPU mode or in shared GPU mode in which the container requests only GPU memory, the value of this metric is 0, which indicates that the allocation of GPU computing power is unlimited. For example, if a GPU provides a total of 100 compute units (CUs) of GPU computing power and allocates 30 CUs to a container, the ratio of the GPU computing power allocated to the container is calculated by using the following formula: 30/100 = 0.3.
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATEDThe amount of GPU memory allocated to the container.
DCGM_CUSTOM_DEV_FB_ALLOCATEDThe ratio of the allocated GPU memory to the total memory of the GPU. The value ranges from 0 to 1.
DCGM_CUSTOM_DEV_FB_TOTALThe total memory of the GPU.
DCGM_CUSTOM_ILLEGAL_PROCESS_DECODE_UTILThe illegal process decode utilization.
DCGM_CUSTOM_ILLEGAL_PROCESS_ENCODE_UTILThe illegal process encode utilization.
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_COPY_UTILThe memory copy utilization of illegal processes.
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_USEDThe memory used by illegal processes.
DCGM_CUSTOM_ILLEGAL_PROCESS_SM_UTILThe SM utilization of illegal processes.
DCGM_CUSTOM_PROCESS_DECODE_UTILThe decoder utilization of GPU threads.
DCGM_CUSTOM_PROCESS_ENCODE_UTILThe encoder utilization of GPU threads.
DCGM_CUSTOM_PROCESS_MEM_COPY_UTILThe memory copy utilization of GPU threads.
DCGM_CUSTOM_PROCESS_MEM_USEDThe amount of GPU memory used by GPU threads.
DCGM_CUSTOM_PROCESS_SM_UTILThe SM utilization of GPU threads.
DCGM_FI_DEV_APP_MEM_CLOCKThe memory application clock speed.
DCGM_FI_DEV_APP_SM_CLOCKThe SM application clock speed.
DCGM_FI_DEV_BAR1_FREEThe remaining Base Address Register 1 (BAR1).
DCGM_FI_DEV_BAR1_TOTALThe total size of device BAR1.
DCGM_FI_DEV_BAR1_USEDThe used BAR1.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATIONThe time of the violation due to board limitations.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONSThe reasons for clock throttling.
DCGM_FI_DEV_COUNTThe number of devices.
DCGM_FI_DEV_DEC_UTILThe decoder utilization.
DCGM_FI_DEV_ENC_UTILThe encoder utilization.
DCGM_FI_DEV_FB_FREEThe amount of free frame buffer memory.
DCGM_FI_DEV_FB_USEDThe amount of used frame buffer memory. The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command.
DCGM_FI_DEV_GPU_TEMPThe GPU temperature.
DCGM_FI_DEV_GPU_UTILThe GPU utilization within a cycle of 1 second or 1/6 second. The cycle varies based on the GPU model. A cycle is a period of time during which one or more kernel functions remain active. This metric only indicates that one or more kernel functions are occupying GPU resources. The metric does not display detailed GPU usage information.
DCGM_FI_DEV_LOW_UTIL_VIOLATIONThe time of the violation due to low utilization.
DCGM_FI_DEV_MEM_CLOCKThe memory clock speed.
DCGM_FI_DEV_MEM_COPY_UTILThe memory bandwidth utilization. For example, the maximum memory bandwidth of NVIDIA V100 is 900 GB/s. If the memory bandwidth used is 450 GB/s, the memory bandwidth utilization is 50%.
DCGM_FI_DEV_MEMORY_TEMPThe memory temperature.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTALThe total NVLink bandwidth.
DCGM_FI_DEV_PCIE_REPLAY_COUNTERThe PCIe replay counter.
DCGM_FI_DEV_POWER_USAGEThe power usage.
DCGM_FI_DEV_POWER_VIOLATIONThe time of the violation due to power limitations.
DCGM_FI_DEV_PSTATEThe performance state (P-state) of the device.
DCGM_FI_DEV_RELIABILITY_VIOLATIONThe time of the violation due to board reliability.
DCGM_FI_DEV_RETIRED_DBEThe number of pages retired due to double bit errors.
DCGM_FI_DEV_RETIRED_PENDINGThe number of pages to be retired. These pages are marked as unavailable due to errors in the GPU memory.
DCGM_FI_DEV_RETIRED_SBEThe number of pages retired due to single bit errors.
DCGM_FI_DEV_SM_CLOCKThe SM clock speed.
DCGM_FI_DEV_SYNC_BOOST_VIOLATIONThe time of the violation due to sync boost limitations.
DCGM_FI_DEV_THERMAL_VIOLATIONThe time of the violation due to thermal limitations.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTIONThe total energy consumed since the driver was last loaded.
DCGM_FI_DEV_VIDEO_CLOCKThe video clock speed.
DCGM_FI_DEV_XID_ERRORSThe last XID error that occurred within a period of time.
DCGM_FI_PROF_DRAM_ACTIVEThe fraction of cycles during which the device memory interface is sending or receiving data. The value is an average within a time interval rather than an instantaneous value. A larger value indicates higher device memory utilization. A value of 1 (100%) means that a DRAM command is executed every cycle within the entire interval. In practice, the peak value is about 0.8 (80%). If the value of this metric is 0.2 (20%), 20% of the cycles within the time interval are spent reading from or writing to device memory.
DCGM_FI_PROF_GR_ENGINE_ACTIVEThe percentage of time that the Graphics or Compute engines were active within a time interval. The value indicates the average across all Graphics and Compute engines. A Graphics or Compute engine is considered active when a Graphics or Compute context is bound to a thread and the Graphics or Compute context is in a busy state.
DCGM_FI_PROF_NVLINK_RX_BYTESThe rate of data received over NVLink, excluding protocol headers. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per direction per link.
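DCGM_FI_PROF_NVLINK_TX_BYTESThe rate of data transmitted over NVLink, excluding protocol headers. The value is an average within a time interval rather than an instantaneous value.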
DCGM_FI_PROF_PCIE_RX_BYTESThe rate of data received over PCIe, including both the header and payload. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
DCGM_FI_PROF_PCIE_TX_BYTESThe rate of data transmitted over PCIe, including both the header and payload. The value is an average within a time interval rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
DCGM_FI_PROF_PIPE_FP16_ACTIVEThe fraction of cycles during which the FP16 (half-precision) pipeline was active. The value is an average value within a time interval rather than an instantaneous value. A higher value indicates higher utilization of the FP16 cores. A value of 1 (100%) means that an FP16 instruction was executed every two cycles throughout the entire time interval (for example, on Volta-type cards). If the value of this metric is 0.2 (20%), one of the following conditions may exist: The FP16 core utilization of 20% of the SMs within the time interval is 100%. The FP16 core utilization of all SMs within the time interval is 20%. The FP16 core utilization of all SMs within 20% of the time interval is 100%. Other conditions.
DCGM_FI_PROF_PIPE_FP32_ACTIVEThe fraction of cycles during which the FMA (Fused Multiply-Add) pipeline was active. The FMA operations include both FP32 (single-precision) and integer operations. The value is an average value within a time interval rather than an instantaneous value. A higher value indicates higher utilization of the FP32 cores. A value of 1 (100%) means that an FP32 instruction was executed every two cycles throughout the entire time interval (for example, on Volta-type cards). If the value of this metric is 0.2 (20%), one of the following conditions may exist: The FP32 core utilization of 20% of the SMs within the time interval is 100%. The FP32 core utilization of all SMs within the time interval is 20%. The FP32 core utilization of all SMs within 20% of the time interval is 100%. Other conditions.
DCGM_FI_PROF_PIPE_FP64_ACTIVEThe fraction of cycles during which the FP64 (double-precision) pipeline was active. The value is an average value within a time interval rather than an instantaneous value. A higher value indicates higher utilization of the FP64 cores. A value of 1 (100%) means that an FP64 instruction was executed every four cycles throughout the entire time interval (for example, on Volta-type cards). If the value of this metric is 0.2 (20%), one of the following conditions may exist: The FP64 core utilization of 20% of the SMs within the time interval is 100%. The FP64 core utilization of all SMs within the time interval is 20%. The FP64 core utilization of all SMs within 20% of the time interval is 100%. Other conditions.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVEThe cycle fraction for the Tensor (HMMA/IMMA) pipe being in the Active state. The value is an average value within a time interval rather than an instantaneous value. A larger value of this metric indicates higher tensor core utilization. If the value is 1 (100%), a Tensor instruction is issued every cycle within the entire interval. One instruction completes in two cycles. If the value of this metric is 0.2 (20%), one of the following conditions may exist: The tensor core utilization of 20% of the SMs within the time interval is 100%. The tensor core utilization of all SMs within the time interval is 20%. The tensor core utilization of all SMs within 20% of the time interval is 100%. Other conditions.
DCGM_FI_PROF_SM_ACTIVEThe ratio of cycles during which at least one warp on an SM remains active. The value is an average of all SMs. The value does not vary with the number of warps included in the thread block. A warp is considered active once it is scheduled and resources are allocated to it, whether it is computing or in a non-computing state such as waiting for memory requests. If the value of this metric drops below 0.5, the GPU utilization is low. For high GPU utilization, the value should be greater than 0.8. Assume that a GPU has N SMs. If a kernel function runs N thread blocks that occupy all SMs throughout a time interval, the value of this metric is 1 (100%). If a kernel function runs N/5 thread blocks throughout a time interval, the value of this metric is 0.2. If a kernel function runs N thread blocks during only 20% of the cycles within a time interval, the value of this metric is 0.2.
DCGM_FI_PROF_SM_OCCUPANCYThe ratio of warps resident on an SM to the maximum number of warps that can reside on that SM, averaged over all SMs within a time interval. A higher occupancy does not necessarily indicate higher GPU utilization. Only in workloads where GPU memory bandwidth is the limiting factor (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher occupancy indicate more effective GPU utilization.
nvidia_gpu_allocated_num_devicesThe number of allocated GPU devices. Warning: This metric will be deprecated in the future.
nvidia_gpu_memory_allocated_bytesThe full memory of GPU devices. Warning: This metric will be deprecated in the future and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.
nvidia_gpu_sharing_memoryThe memory allocated for GPU sharing. Warning: This metric will be deprecated in the future and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.
upThe connectivity of metric collection.
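
Several descriptions in the table above rest on simple arithmetic: the allocation and bandwidth ratios (30/100 = 0.3, 450/900 = 50%) and the statement that a DCGM_FI_PROF_PIPE_* value of 0.2 can arise from different usage patterns. The following Python sketch works through both. It assumes that the profiling value behaves like an average of per-SM activity over all SMs and all sampling instants in the interval; the 5-SM GPU, the sample grids, and the function names are hypothetical and serve only as an illustration.

```python
from statistics import mean


def ratio(part, whole):
    """Fraction of a total, for example allocated CUs / total CUs."""
    return part / whole


def pipe_active(per_sm_samples):
    """Average activity over all SMs and all sampling instants.

    per_sm_samples: one list per SM, each holding that SM's pipeline
    activity (0.0 to 1.0) at successive instants in the interval.
    Assumption: every SM has the same number of samples.
    """
    return mean(mean(samples) for samples in per_sm_samples)


# Hypothetical GPU with 5 SMs and 5 sampling instants per interval.
# 20% of the SMs are fully busy for the whole interval.
some_sms_always_busy = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
# All SMs are 20% busy for the whole interval.
all_sms_partly_busy = [[0.2] * 5 for _ in range(5)]
# All SMs are fully busy during 20% of the interval.
all_sms_busy_briefly = [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(5)]

if __name__ == "__main__":
    print(ratio(30, 100))    # compute units allocated to a container
    print(ratio(450, 900))   # memory bandwidth used / maximum
    for grid in (some_sms_always_busy, all_sms_partly_busy, all_sms_busy_briefly):
        print(round(pipe_active(grid), 3))
```

All three grids produce the same average, which is why a single value of 0.2 cannot tell these usage patterns apart.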

Cost-Exporter

Job name: alibaba-cloud-cost-exporter

Cost and billing metrics for cluster nodes, covering pricing across pay-as-you-go, subscription, and spot instance models.

MetricDescription
deducted_by_cash_couponsThe amount deducted by cash coupons for the current instance.
deducted_by_prepaid_cardThe amount deducted by prepaid cards for the current instance.
invoice_discountThe discount amount for the current instance.
list_priceThe unit price for the current instance.
node_current_priceThe actual price of the current node.
node_payAsYouGo_priceThe pay-as-you-go price of the current node.
node_payByPeriod_priceThe subscription price of the current node.
node_spot_priceThe spot price of the current node.
outstanding_amountThe outstanding amount for the current instance.
payent_amountThe cash payment amount for the current instance.
pretax_amountThe payable amount for the current instance.
pretax_gross_amountThe original amount for the current instance.
usageThe resource usage for the current instance.
upThe connectivity of metric collection.

Ingress

Job name: arms-ack-ingress

Metrics from the NGINX Ingress controller, covering request rates, latency distributions, response sizes, connection states, and SSL certificate expiration.

MetricDescription
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
go_cgo_go_to_c_calls_calls_totalThe total number of C function calls made by cgo.
go_gc_cycles_automatic_gc_cycles_totalThe total number of automatic GC cycles.
go_gc_cycles_forced_gc_cycles_totalThe total number of forced GC cycles.
go_gc_cycles_total_gc_cycles_totalThe total number of GC cycles.
go_gc_duration_secondsThe Go GC pause duration in seconds.
go_gc_duration_seconds_countThe number of Go GC pauses observed.
go_gc_duration_seconds_sumThe total Go GC pause duration in seconds.
go_gc_heap_allocs_by_size_bytes_total_bucketThe distribution of Go GC heap allocations classified by size in bytes.
go_gc_heap_allocs_by_size_bytes_total_countThe count of Go GC heap allocations classified by size in bytes.
go_gc_heap_allocs_by_size_bytes_total_sumThe sum of Go GC heap allocations classified by size in bytes.
go_gc_heap_allocs_bytes_totalThe total bytes allocated in the Go GC heap.
go_gc_heap_allocs_objects_totalThe total objects allocated in the Go GC heap.
go_gc_heap_frees_by_size_bytes_total_bucketThe distribution of Go GC heap releases classified by size in bytes.
go_gc_heap_frees_by_size_bytes_total_countThe count of Go GC heap releases classified by size in bytes.
go_gc_heap_frees_by_size_bytes_total_sumThe sum of Go GC heap releases classified by size in bytes.
go_gc_heap_frees_bytes_totalThe total bytes released in the Go GC heap.
go_gc_heap_frees_objects_totalThe total objects released in the Go GC heap.
go_gc_heap_goal_bytesThe target size of the Go GC heap in bytes.
go_gc_heap_objects_objectsThe number of objects in the Go GC heap.
go_gc_heap_tiny_allocs_objects_totalThe total number of small object allocations in the Go GC.
go_gc_limiter_last_enabled_gc_cycleThe last enabled GC cycle.
go_gc_pauses_seconds_total_bucketThe distribution of Go GC pause time in seconds.
go_gc_pauses_seconds_total_countThe count of Go GC pause time in seconds.
go_gc_pauses_seconds_total_sumThe sum of Go GC pause time in seconds.
go_gc_stack_starting_size_bytesThe starting size of the Go GC stack in bytes.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memory_classes_heap_free_bytesThe amount of idle heap memory in bytes.
go_memory_classes_heap_objects_bytesThe amount of heap memory occupied by objects in bytes.
go_memory_classes_heap_released_bytesThe amount of heap memory released in bytes.
go_memory_classes_heap_stacks_bytesThe amount of memory reserved for the stack in bytes.
go_memory_classes_heap_unused_bytesThe amount of heap memory not used in bytes.
go_memory_classes_metadata_mcache_free_bytesThe amount of idle memory in mcache in bytes.
go_memory_classes_metadata_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memory_classes_metadata_mspan_free_bytesThe amount of idle memory in mspan in bytes.
go_memory_classes_metadata_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memory_classes_metadata_other_bytesThe amount of memory occupied by other metadata in bytes.
go_memory_classes_os_stacks_bytesThe amount of memory reserved for the operating system stack in bytes.
go_memory_classes_other_bytesThe amount of memory used for other purposes in bytes.
go_memory_classes_profiling_buckets_bytesThe bytes used by profiling buckets.
go_memory_classes_total_bytesThe total memory in bytes.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by the profiling bucket hash table in bytes.
go_memstats_frees_totalThe total number of releases.
go_memstats_gc_cpu_fractionThe fraction of the program's available CPU time used by the GC since the program started. The value ranges from 0 to 1.
go_memstats_gc_sys_bytesThe amount of memory used by GC in the operating system in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, in seconds since the Unix epoch.
go_memstats_lookups_totalThe total number of lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size in bytes at which the next GC is triggered.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_sched_gomaxprocs_threadsThe maximum parallelism of the Go scheduler in threads.
go_sched_goroutines_goroutinesThe current number of goroutines in the Go scheduler.
go_sched_latencies_seconds_bucketThe distribution of Go scheduling latencies in seconds.
go_sched_latencies_seconds_countThe count of Go scheduling latencies in seconds.
go_sched_latencies_seconds_sumThe sum of Go scheduling latencies in seconds.
go_threadsThe number of Go threads.
nginx_ingress_controller_admission_config_sizeThe size of the NGINX Ingress controller Admission Config.
nginx_ingress_controller_admission_render_durationThe rendering duration of the NGINX Ingress controller Admission Config.
nginx_ingress_controller_admission_render_ingressesThe number of Ingresses rendered by the NGINX Ingress controller.
nginx_ingress_controller_admission_roundtrip_durationThe round-trip processing duration of the NGINX Ingress controller.
nginx_ingress_controller_admission_tested_durationThe testing duration of the NGINX Ingress controller.
nginx_ingress_controller_admission_tested_ingressesThe number of Ingresses tested by the NGINX Ingress controller.
nginx_ingress_controller_build_infoThe build information of the NGINX Ingress controller.
nginx_ingress_controller_bytes_sent_bucketThe distribution of total bytes sent by the NGINX Ingress controller.
nginx_ingress_controller_bytes_sent_countThe count of total bytes sent by the NGINX Ingress controller.
nginx_ingress_controller_bytes_sent_sumThe sum of total bytes sent by the NGINX Ingress controller.
nginx_ingress_controller_check_errorsThe number of check errors in the NGINX Ingress controller.
nginx_ingress_controller_check_successThe number of successful checks in the NGINX Ingress controller.
nginx_ingress_controller_config_hashThe configuration hash of the NGINX Ingress controller.
nginx_ingress_controller_config_last_reload_successfulThe success status of the last configuration reload in the NGINX Ingress controller.
nginx_ingress_controller_config_last_reload_successful_timestamp_secondsThe timestamp of the last successful configuration reload in the NGINX Ingress controller in seconds.
nginx_ingress_controller_connect_duration_seconds_bucketThe distribution of connection durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_connect_duration_seconds_countThe count of connection durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_connect_duration_seconds_sumThe sum of connection durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_errorsThe number of errors in the NGINX Ingress controller.
nginx_ingress_controller_header_duration_seconds_bucketThe distribution of header processing durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_header_duration_seconds_countThe count of header processing durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_header_duration_seconds_sumThe sum of header processing durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_ingress_upstream_latency_secondsThe upstream latency in the NGINX Ingress controller in seconds.
nginx_ingress_controller_ingress_upstream_latency_seconds_countThe count of upstream latencies in the NGINX Ingress controller.
nginx_ingress_controller_ingress_upstream_latency_seconds_sumThe sum of upstream latencies in the NGINX Ingress controller.
nginx_ingress_controller_leader_election_statusThe leader election status of the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_connectionsThe number of connections in the nginx process of the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_connections_totalThe total number of connections in the nginx process of the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_cpu_seconds_totalThe total CPU time in seconds consumed by the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_num_procsThe number of nginx processes in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_oldest_start_time_secondsThe oldest start time in seconds of the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_read_bytes_totalThe total number of bytes read by the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_requests_totalThe total number of requests processed by the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_resident_memory_bytesThe resident memory size in bytes of the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_virtual_memory_bytesThe virtual memory size in bytes of the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_nginx_process_write_bytes_totalThe total number of bytes written by the nginx process in the NGINX Ingress controller.
nginx_ingress_controller_orphan_ingressThe number of orphaned Ingresses in the NGINX Ingress controller.
nginx_ingress_controller_request_duration_seconds_bucketThe distribution of request durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_request_duration_seconds_countThe count of request durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_request_duration_seconds_sumThe sum of request durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_request_size_bucketThe distribution of request sizes in the NGINX Ingress controller.
nginx_ingress_controller_request_size_countThe count of request sizes in the NGINX Ingress controller.
nginx_ingress_controller_request_size_sumThe sum of request sizes in the NGINX Ingress controller.
nginx_ingress_controller_requestsThe total number of requests in the NGINX Ingress controller.
nginx_ingress_controller_response_duration_seconds_bucketThe distribution of response durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_response_duration_seconds_countThe count of response durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_response_duration_seconds_sumThe sum of response durations in the NGINX Ingress controller in seconds.
nginx_ingress_controller_response_size_bucketThe distribution of response sizes in the NGINX Ingress controller.
nginx_ingress_controller_response_size_countThe count of response sizes in the NGINX Ingress controller.
nginx_ingress_controller_response_size_sumThe sum of response sizes in the NGINX Ingress controller.
nginx_ingress_controller_ssl_certificate_infoThe SSL certificate information in the NGINX Ingress controller.
nginx_ingress_controller_ssl_expire_time_secondsThe expiration time of the SSL certificate in the NGINX Ingress controller, in seconds since the Unix epoch.
nginx_ingress_controller_successThe number of successes in the NGINX Ingress controller.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upThe connectivity of metric collection.
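
Many metrics in the table above come in _bucket, _count, and _sum triplets, which together describe a histogram. The following Python sketch shows one way such a triplet can be read: the mean is the sum divided by the count, and a quantile can be estimated by linear interpolation inside the cumulative bucket that contains the target rank, similar in spirit to the PromQL histogram_quantile function. The bucket bounds and counts are invented, the function names are hypothetical, and days_until_expiry assumes that nginx_ingress_controller_ssl_expire_time_seconds holds a Unix timestamp.

```python
import math


def histogram_mean(total_sum, total_count):
    """Mean observation: value of *_sum divided by value of *_count."""
    return total_sum / total_count if total_count else math.nan


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound;
    the last pair is the +Inf bucket. Assumes the lowest bucket
    starts at 0 and values are spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound or empty bucket: fall back
                # to the highest bound reached so far.
                return prev_bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


def days_until_expiry(expire_timestamp, now):
    """Days left before a certificate expires (both in Unix seconds)."""
    return (expire_timestamp - now) / 86400


if __name__ == "__main__":
    # Invented request-duration histogram: 100 requests in total.
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_mean(23.0, 100))
    print(histogram_quantile(0.95, buckets))
```

The quantile is only as precise as the bucket layout allows: every observation inside a bucket is treated as evenly spread between the bucket's bounds.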

Koordinator

Job name: kube-system, koordlet-metrics-podmonitor, or koord-manager-metrics-service

Metrics from the Koordinator scheduling framework for fine-grained resource orchestration and workload management.

MetricDescription
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
koord_manager_recommender_recommendation_workload_targetThe recommended specification metric for workload in the resource profiling feature.
koordlet_container_resource_limitsThe limit metric for container resources.
koordlet_container_resource_requestsThe request metric for container resources.
koordlet_node_priority_resource_reclaimableThe reclaimable resources of the node, classified by priority.
koordlet_node_resource_allocatableThe allocatable resource metric for the node.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
slo_manager_recommender_recommendation_workload_targetThe resource specifications that are recommended based on the workload by the resource profiling feature. This metric is discontinued.
upThe connectivity of metric collection.

ETCD

Job name: etcd

Metrics from the etcd key-value store backing the Kubernetes control plane, including request latency, database size, leader elections, and disk I/O.

MetricDescription
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
cpu_utilization_coreThe CPU core utilization.
etcd_cluster_versionThe version of the cluster.
etcd_debugging_auth_revisionThe authentication revision number for ETCD debugging.
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucketThe distribution of ETCD debugging disk backend commit rebalance duration in seconds.
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_countThe count of ETCD debugging disk backend commit rebalance duration in seconds.
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_sumThe sum of ETCD debugging disk backend commit rebalance duration in seconds.
etcd_debugging_disk_backend_commit_spill_duration_seconds_bucketThe distribution of ETCD debugging disk backend commit spill duration in seconds.
etcd_debugging_disk_backend_commit_spill_duration_seconds_countThe count of ETCD debugging disk backend commit spill duration in seconds.
etcd_debugging_disk_backend_commit_spill_duration_seconds_sumThe sum of ETCD debugging disk backend commit spill duration in seconds.
etcd_debugging_disk_backend_commit_write_duration_seconds_bucketThe distribution of ETCD debugging disk backend commit write duration in seconds.
etcd_debugging_disk_backend_commit_write_duration_seconds_countThe count of ETCD debugging disk backend commit write duration in seconds.
etcd_debugging_disk_backend_commit_write_duration_seconds_sumThe sum of ETCD debugging disk backend commit write duration in seconds.
etcd_debugging_lease_granted_totalThe total number of lease grants in ETCD debugging.
etcd_debugging_lease_renewed_totalThe total number of lease renewals in ETCD debugging.
etcd_debugging_lease_revoked_totalThe total number of lease revocations in ETCD debugging.
etcd_debugging_lease_ttl_total_bucketThe distribution of lease TTLs in ETCD debugging.
etcd_debugging_lease_ttl_total_countThe count of lease TTLs in ETCD debugging.
etcd_debugging_lease_ttl_total_sumThe sum of lease TTLs in ETCD debugging.
etcd_debugging_mvcc_compact_revisionThe compaction revision number for ETCD debugging MVCC.
etcd_debugging_mvcc_current_revisionThe current revision number for ETCD debugging MVCC.
etcd_debugging_mvcc_db_compaction_keys_totalThe total number of keys compacted in the ETCD debugging MVCC database.
etcd_debugging_mvcc_db_compaction_lastThe last compaction time for the ETCD debugging MVCC database.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucketThe distribution of MVCC database compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_countThe count of MVCC database compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sumThe sum of MVCC database compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucketThe distribution of MVCC database compaction total durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_countThe count of MVCC database compaction total durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sumThe sum of MVCC database compaction total durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_db_total_size_in_bytesThe total size of the MVCC database in bytes for ETCD debugging.
etcd_debugging_mvcc_delete_totalThe total number of delete operations in ETCD debugging MVCC.
etcd_debugging_mvcc_events_totalThe total number of events in ETCD debugging.
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucketThe distribution of MVCC index compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_countThe count of MVCC index compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sumThe sum of MVCC index compaction pause durations in milliseconds for ETCD debugging.
etcd_debugging_mvcc_keys_totalThe total number of keys in ETCD debugging MVCC.
etcd_debugging_mvcc_pending_events_totalThe total number of pending events in ETCD debugging MVCC.
etcd_debugging_mvcc_put_totalThe total number of put operations in ETCD debugging MVCC.
etcd_debugging_mvcc_range_totalThe total number of range queries in ETCD debugging MVCC.
etcd_debugging_mvcc_slow_watcher_totalThe total number of slow watchers in ETCD debugging.
etcd_debugging_mvcc_total_put_size_in_bytesThe total size of MVCC puts in bytes for ETCD debugging.
etcd_debugging_mvcc_txn_totalThe total number of MVCC transactions in ETCD debugging.
etcd_debugging_mvcc_watch_stream_totalThe total number of watch streams in ETCD debugging.
etcd_debugging_mvcc_watcher_totalThe total number of watchers in ETCD debugging.
etcd_debugging_server_lease_expired_totalThe total number of expired leases in ETCD debugging.
etcd_debugging_snap_save_marshalling_duration_seconds_bucketThe distribution of snapshot save marshalling durations in seconds for ETCD debugging.
etcd_debugging_snap_save_marshalling_duration_seconds_countThe count of snapshot save marshalling durations in seconds for ETCD debugging.
etcd_debugging_snap_save_marshalling_duration_seconds_sumThe sum of snapshot save marshalling durations in seconds for ETCD debugging.
etcd_debugging_snap_save_total_duration_seconds_bucketThe distribution of snapshot save durations in seconds for ETCD debugging.
etcd_debugging_snap_save_total_duration_seconds_countThe count of snapshot save durations in seconds for ETCD debugging.
etcd_debugging_snap_save_total_duration_seconds_sumThe sum of snapshot save durations in seconds for ETCD debugging.
etcd_debugging_store_expires_totalThe total number of expired items in ETCD debugging storage.
etcd_debugging_store_reads_totalThe total number of reads in ETCD debugging storage.
etcd_debugging_store_watch_requests_totalThe total number of watch requests in ETCD debugging storage.
etcd_debugging_store_watchersThe current number of active watchers in ETCD debugging storage.
etcd_debugging_store_writes_totalThe total number of writes in ETCD debugging storage.
etcd_disk_backend_commit_duration_seconds_bucketThe distribution of disk backend commit durations in seconds for ETCD.
etcd_disk_backend_commit_duration_seconds_countThe count of disk backend commit durations in seconds for ETCD.
etcd_disk_backend_commit_duration_seconds_sumThe sum of disk backend commit durations in seconds for ETCD.
etcd_disk_backend_defrag_duration_seconds_bucketThe distribution of disk backend defragmentation durations in seconds for ETCD.
etcd_disk_backend_defrag_duration_seconds_countThe count of disk backend defragmentation durations in seconds for ETCD.
etcd_disk_backend_defrag_duration_seconds_sumThe sum of disk backend defragmentation durations in seconds for ETCD.
etcd_disk_backend_snapshot_duration_seconds_bucketThe distribution of disk backend snapshot durations in seconds for ETCD.
etcd_disk_backend_snapshot_duration_seconds_countThe count of disk backend snapshot durations in seconds for ETCD.
etcd_disk_backend_snapshot_duration_seconds_sumThe sum of disk backend snapshot durations in seconds for ETCD.
etcd_disk_defrag_inflightThe number of ongoing disk defragmentations in ETCD.
etcd_disk_wal_fsync_duration_seconds_bucketThe distribution of WAL sync durations in seconds for ETCD disk.
etcd_disk_wal_fsync_duration_seconds_countThe count of WAL sync durations in seconds for ETCD disk.
etcd_disk_wal_fsync_duration_seconds_sumThe sum of WAL sync durations in seconds for ETCD disk.
etcd_disk_wal_write_bytes_totalThe total number of bytes written to the WAL in ETCD disk.
etcd_grpc_proxy_cache_hits_totalThe total number of cache hits in the ETCD gRPC proxy.
etcd_grpc_proxy_cache_keys_totalThe total number of cache keys in the ETCD gRPC proxy.
etcd_grpc_proxy_cache_misses_totalThe total number of cache misses in the ETCD gRPC proxy.
etcd_grpc_proxy_events_coalescing_totalThe total number of event coalescings in the ETCD gRPC proxy.
etcd_grpc_proxy_watchers_coalescing_totalThe total number of watcher coalescings in the ETCD gRPC proxy.
etcd_mvcc_db_open_read_transactionsThe number of open read transactions in the ETCD MVCC database.
etcd_mvcc_db_total_size_in_bytesThe total size of the MVCC database in bytes for ETCD.
etcd_mvcc_db_total_size_in_use_in_bytesThe total size in use of the MVCC database in bytes for ETCD.
etcd_mvcc_delete_totalThe total number of deletes in ETCD MVCC.
etcd_mvcc_hash_duration_seconds_bucketThe distribution of MVCC hash durations in seconds for ETCD.
etcd_mvcc_hash_duration_seconds_countThe count of MVCC hash durations in seconds for ETCD.
etcd_mvcc_hash_duration_seconds_sumThe sum of MVCC hash durations in seconds for ETCD.
etcd_mvcc_hash_rev_duration_seconds_bucketThe distribution of MVCC hash revision durations in seconds for ETCD.
etcd_mvcc_hash_rev_duration_seconds_countThe count of MVCC hash revision durations in seconds for ETCD.
etcd_mvcc_hash_rev_duration_seconds_sumThe sum of MVCC hash revision durations in seconds for ETCD.
etcd_mvcc_put_totalThe total number of put operations in ETCD MVCC.
etcd_mvcc_range_totalThe total number of range queries in ETCD MVCC.
etcd_mvcc_txn_totalThe total number of MVCC transactions in ETCD.
etcd_network_active_peersThe number of active peers in the ETCD network.
etcd_network_client_grpc_received_bytes_totalThe total number of bytes received by the ETCD network client via gRPC.
etcd_network_client_grpc_sent_bytes_totalThe total number of bytes sent by the ETCD network client via gRPC.
etcd_network_disconnected_peers_totalThe total number of disconnected peers in the ETCD network.
etcd_network_peer_received_bytes_totalThe total number of bytes received by the ETCD network peer.
etcd_network_peer_received_failures_totalThe total number of receive failures in the ETCD network peer.
etcd_network_peer_round_trip_time_seconds_bucketThe distribution of round trip times for the ETCD network peer in seconds.
etcd_network_peer_round_trip_time_seconds_countThe count of round trip times for the ETCD network peer in seconds.
etcd_network_peer_round_trip_time_seconds_sumThe sum of round trip times for the ETCD network peer in seconds.
etcd_network_peer_sent_bytes_totalThe total number of bytes sent by the ETCD network peer.
etcd_network_peer_sent_failures_totalThe total number of send failures by the ETCD network peer.
etcd_network_server_stream_failures_totalThe total number of stream failures in the ETCD network server.
etcd_network_snapshot_receive_inflights_totalThe number of concurrent snapshot receive requests in the ETCD network.
etcd_network_snapshot_receive_successThe number of successful snapshot receives in the ETCD network.
etcd_network_snapshot_receive_total_duration_seconds_bucketThe distribution of snapshot receive durations in seconds for the ETCD network.
etcd_network_snapshot_receive_total_duration_seconds_countThe count of snapshot receive durations in seconds for the ETCD network.
etcd_network_snapshot_receive_total_duration_seconds_sumThe sum of snapshot receive durations in seconds for the ETCD network.
etcd_network_snapshot_send_inflights_totalThe number of concurrent snapshot send requests in the ETCD network.
etcd_network_snapshot_send_successThe number of successful snapshot sends in the ETCD network.
etcd_network_snapshot_send_total_duration_seconds_bucketThe distribution of snapshot send durations in seconds for the ETCD network.
etcd_network_snapshot_send_total_duration_seconds_countThe count of snapshot send durations in seconds for the ETCD network.
etcd_network_snapshot_send_total_duration_seconds_sumThe sum of snapshot send durations in seconds for the ETCD network.
etcd_server_apply_duration_seconds_bucketThe distribution of apply durations in seconds for the ETCD server.
etcd_server_apply_duration_seconds_countThe count of apply durations in seconds for the ETCD server.
etcd_server_apply_duration_seconds_sumThe sum of apply durations in seconds for the ETCD server.
etcd_server_client_requests_totalThe total number of client requests to the ETCD server.
etcd_server_go_versionThe Go version of the ETCD server.
etcd_server_has_leaderIndicates whether a leader exists in the ETCD server.
etcd_server_health_failuresThe number of health check failures in the ETCD server.
etcd_server_health_successThe number of successful health checks in the ETCD server.
etcd_server_heartbeat_send_failures_totalThe total number of heartbeat send failures in the ETCD server.
etcd_server_idThe ID of the ETCD server.
etcd_server_is_leaderIndicates whether the ETCD server is a leader.
etcd_server_is_learnerIndicates whether the ETCD server is a learner.
etcd_server_leader_changes_seen_totalThe total number of leader changes witnessed by the ETCD server.
etcd_server_learner_promote_successesThe number of successful learner promotions in the ETCD server.
etcd_server_proposals_applied_totalThe total number of applied proposals in the ETCD server.
etcd_server_proposals_committed_totalThe total number of committed proposals in the ETCD server.
etcd_server_proposals_failed_totalThe total number of failed proposals in the ETCD server.
etcd_server_proposals_pendingThe current number of pending proposals in the ETCD server.
etcd_server_quota_backend_bytesThe backend storage quota in bytes for the ETCD server.
etcd_server_read_indexes_failed_totalThe total number of read index failures in the ETCD server.
etcd_server_slow_apply_totalThe total number of slow apply requests in the ETCD server.
etcd_server_slow_read_indexes_totalThe total number of slow read indexes in the ETCD server.
etcd_server_snapshot_apply_in_progress_totalThe total number of snapshots being applied in the ETCD server.
etcd_server_versionThe version of the ETCD server.
etcd_snap_db_fsync_duration_seconds_bucketThe distribution of ETCD snapshot database fsync durations in seconds.
etcd_snap_db_fsync_duration_seconds_countThe count of ETCD snapshot database fsync durations in seconds.
etcd_snap_db_fsync_duration_seconds_sumThe sum of ETCD snapshot database fsync durations in seconds.
etcd_snap_db_save_total_duration_seconds_bucketThe distribution of ETCD snapshot database save durations in seconds.
etcd_snap_db_save_total_duration_seconds_countThe count of ETCD snapshot database save durations in seconds.
etcd_snap_db_save_total_duration_seconds_sumThe sum of ETCD snapshot database save durations in seconds.
etcd_snap_fsync_duration_seconds_bucketThe distribution of ETCD snapshot fsync durations in seconds.
etcd_snap_fsync_duration_seconds_countThe count of ETCD snapshot fsync durations in seconds.
etcd_snap_fsync_duration_seconds_sumThe sum of ETCD snapshot fsync durations in seconds.
go_gc_duration_secondsThe Go GC pause duration in seconds.
go_gc_duration_seconds_countThe count of Go GC pauses observed.
go_gc_duration_seconds_sumThe total Go GC pause duration in seconds.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memstats_alloc_bytesThe amount of memory allocated in bytes.
go_memstats_alloc_bytes_totalThe cumulative amount of memory allocated in bytes.
go_memstats_buck_hash_sys_bytesThe amount of memory used by the profiling bucket hash table in bytes.
go_memstats_frees_totalThe total number of heap object frees.
go_memstats_gc_cpu_fractionThe fraction of available CPU time used by GC since the program started (a value between 0 and 1).
go_memstats_gc_sys_bytesThe amount of memory used for GC system metadata in bytes.
go_memstats_heap_alloc_bytesThe amount of heap memory allocated in bytes.
go_memstats_heap_idle_bytesThe amount of idle heap memory in bytes.
go_memstats_heap_inuse_bytesThe amount of heap memory in use in bytes.
go_memstats_heap_objectsThe number of objects allocated on the heap.
go_memstats_heap_released_bytesThe amount of heap memory released in bytes.
go_memstats_heap_sys_bytesThe amount of memory allocated to the heap by the operating system in bytes.
go_memstats_last_gc_time_secondsThe time of the last GC, in seconds since the Unix epoch.
go_memstats_lookups_totalThe total number of lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size, in bytes, at which the next GC is triggered.
go_memstats_other_sys_bytesThe amount of memory allocated for other purposes by the operating system in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of memory allocated to the stack by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_threadsThe number of threads.
grpc_server_handled_totalThe total number of RPCs completed by the gRPC server, regardless of success or failure.
grpc_server_msg_received_totalThe total number of RPC stream messages received by the gRPC server.
grpc_server_msg_sent_totalThe total number of RPC stream messages sent by the gRPC server.
grpc_server_started_totalThe total number of RPCs started on the gRPC server.
memory_utilization_byteThe memory usage in bytes.
os_fd_limitThe file descriptor limit of the operating system.
os_fd_usedThe number of file descriptors used by the operating system.
process_cpu_seconds_totalThe total number of CPU seconds used by the process.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, in seconds since the Unix epoch.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
promhttp_metric_handler_requests_in_flightThe current number of requests being handled by the Prometheus HTTP metric handler.
promhttp_metric_handler_requests_totalThe total number of requests handled by the Prometheus HTTP metric handler.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
upIndicates whether the scrape target is reachable: 1 if the last scrape succeeded, 0 otherwise.
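
Many metrics in the table above come in _bucket, _count, and _sum triples, which together describe one Prometheus histogram: _bucket holds cumulative counts per upper bound (the le label), _count the number of observations, and _sum their total. The following is a minimal Python sketch of how a mean and an estimated quantile can be derived from such a triple. The sample values are hypothetical and are not taken from a real cluster, and the interpolation only approximates what the PromQL histogram_quantile() function does. In practice, the calculation is usually applied to the increase of each series over a time window rather than to the raw cumulative counters.

```python
import math


def histogram_mean(total_sum, total_count):
    """Mean observation: value of the _sum series divided by the _count series."""
    return total_sum / total_count if total_count else math.nan


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the _bucket series: one pair per `le` label value,
    including the +Inf bucket. Values are interpolated linearly inside
    the bucket that contains the requested rank.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sample for etcd_disk_wal_fsync_duration_seconds.
wal_fsync_buckets = [(0.001, 50), (0.002, 80), (0.004, 95), (0.008, 99), (math.inf, 100)]
wal_fsync_sum, wal_fsync_count = 0.17, 100

mean_seconds = histogram_mean(wal_fsync_sum, wal_fsync_count)
p99_seconds = histogram_quantile(0.99, wal_fsync_buckets)
print(f"mean={mean_seconds:.4f}s p99={p99_seconds:.4f}s")
```

The same pattern applies to any of the duration histograms listed above, such as etcd_disk_backend_commit_duration_seconds or etcd_network_peer_round_trip_time_seconds.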

Scheduler

Job name: ack-scheduler

Metrics from the ACK Kubernetes scheduler, covering scheduling latency, queue depth, plugin execution times, and preemption statistics.

MetricDescription
aggregator_discovery_aggregation_count_totalThe count of discovery aggregations performed by the aggregator.
aliyun_prometheus_agent_append_duration_secondsThe duration of the Prometheus agent append operations in seconds.
aliyun_prometheus_agent_job_discovery_statusThe discovery status of the Prometheus agent collection jobs.
aliyun_prometheus_agent_scrape_custom_errorThe number of custom collection errors of the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_totalThe total number of scrapes by the Prometheus agent per target.
aliyun_prometheus_agent_target_infoThe target information of the Prometheus agent.
apiserver_audit_event_totalThe total number of APIServer audit events.
apiserver_audit_requests_rejected_totalThe total number of APIServer audit request rejections.
apiserver_client_certificate_expiration_seconds_bucketThe distribution of remaining seconds until APIServer client certificate expiration.
apiserver_client_certificate_expiration_seconds_countThe count of remaining seconds until APIServer client certificate expiration.
apiserver_client_certificate_expiration_seconds_sumThe sum of remaining seconds until APIServer client certificate expiration.
apiserver_delegated_authn_request_duration_seconds_bucketThe distribution of delegated authentication request durations in seconds for the APIServer.
apiserver_delegated_authn_request_duration_seconds_countThe count of delegated authentication request durations in seconds for the APIServer.
apiserver_delegated_authn_request_duration_seconds_sumThe sum of delegated authentication request durations in seconds for the APIServer.
apiserver_delegated_authn_request_totalThe total number of delegated authentication requests for the APIServer.
apiserver_delegated_authz_request_duration_seconds_bucketThe distribution of delegated authorization request durations in seconds for the APIServer.
apiserver_delegated_authz_request_duration_seconds_countThe count of delegated authorization request durations in seconds for the APIServer.
apiserver_delegated_authz_request_duration_seconds_sumThe sum of delegated authorization request durations in seconds for the APIServer.
apiserver_delegated_authz_request_totalThe total number of delegated authorization requests for the APIServer.
apiserver_encryption_config_controller_automatic_reload_failures_totalThe total number of automatic reload failures for the APIServer encryption configuration controller.
apiserver_encryption_config_controller_automatic_reload_success_totalThe total number of successful automatic reloads for the APIServer encryption configuration controller.
apiserver_envelope_encryption_dek_cache_fill_percentThe fill percentage of the envelope encryption data encryption key (DEK) cache for the APIServer.
apiserver_storage_data_key_generation_duration_seconds_bucketThe distribution of data key generation durations for the APIServer storage.
apiserver_storage_data_key_generation_duration_seconds_countThe count of data key generation durations for the APIServer storage.
apiserver_storage_data_key_generation_duration_seconds_sumThe sum of data key generation durations for the APIServer storage.
apiserver_storage_data_key_generation_failures_totalThe total number of data key generation failures for the APIServer storage.
apiserver_storage_envelope_transformation_cache_misses_totalThe total number of envelope transformation cache misses for the APIServer storage.
apiserver_webhooks_x509_insecure_sha1_totalThe total count of insecure SHA1 usage in X509 certificates for APIServer Webhooks.
apiserver_webhooks_x509_missing_san_totalThe total count of missing SANs in X509 certificates for APIServer Webhooks.
authenticated_user_requestsThe number of authenticated user requests.
authentication_attemptsThe number of authentication attempts.
authentication_duration_seconds_bucketThe distribution of authentication durations in seconds.
authentication_duration_seconds_countThe count of authentication durations in seconds.
authentication_duration_seconds_sumThe sum of authentication durations in seconds.
authentication_token_cache_active_fetch_countThe count of active fetches for the authentication token cache.
authentication_token_cache_fetch_totalThe total number of fetches for the authentication token cache.
authentication_token_cache_request_duration_seconds_bucketThe distribution of request durations in seconds for the authentication token cache.
authentication_token_cache_request_duration_seconds_countThe count of request durations in seconds for the authentication token cache.
authentication_token_cache_request_duration_seconds_sumThe sum of request durations in seconds for the authentication token cache.
authentication_token_cache_request_totalThe total number of requests for the authentication token cache.
authorization_attempts_totalThe total number of authorization attempts.
authorization_duration_seconds_bucketThe distribution of authorization durations in seconds.
authorization_duration_seconds_countThe count of authorization durations in seconds.
authorization_duration_seconds_sumThe sum of authorization durations in seconds.
cardinality_enforcement_unexpected_categorizations_totalThe total number of unexpected categorizations during cardinality enforcement.
cpu_utilization_coreThe CPU core utilization.
disabled_metric_totalThe total number of disabled metrics.
disabled_metrics_totalThe total number of disabled metrics.
go_cgo_go_to_c_calls_calls_totalThe total number of Go to C calls via cgo.
go_cpu_classes_gc_mark_assist_cpu_seconds_totalThe total number of CPU seconds for GC mark assist.
go_cpu_classes_gc_mark_dedicated_cpu_seconds_totalThe total number of dedicated CPU seconds for GC marking in Go.
go_cpu_classes_gc_mark_idle_cpu_seconds_totalThe total number of CPU seconds spent on GC marking using otherwise idle CPU in Go.
go_cpu_classes_gc_pause_cpu_seconds_totalThe total number of CPU seconds for GC pauses in Go.
go_cpu_classes_gc_total_cpu_seconds_totalThe total number of CPU seconds for all GC activities in Go.
go_cpu_classes_idle_cpu_seconds_totalThe total number of idle CPU seconds in Go.
go_cpu_classes_scavenge_assist_cpu_seconds_totalThe total number of CPU seconds for GC scavenging assist.
go_cpu_classes_scavenge_background_cpu_seconds_totalThe total number of CPU seconds for background GC scavenging.
go_cpu_classes_scavenge_total_cpu_seconds_totalThe total CPU seconds for scavenge in Go CPU classes.
go_cpu_classes_total_cpu_seconds_totalThe total CPU seconds summed across all Go CPU classes.
go_cpu_classes_user_cpu_seconds_totalThe total user CPU seconds summed across Go CPU classes.
go_gc_cycles_automatic_gc_cycles_totalThe total number of automatic GC cycles in Go.
go_gc_cycles_forced_gc_cycles_totalThe total number of forced GC cycles in Go.
go_gc_cycles_total_gc_cycles_totalThe total number of GC cycles in Go.
go_gc_duration_secondsThe duration of Go GC in seconds.
go_gc_duration_seconds_countThe count of Go GC durations in seconds.
go_gc_duration_seconds_sumThe sum of Go GC pause durations in seconds.
go_gc_gogc_percentThe Go GC target percentage (GOGC).
go_gc_gomemlimit_bytesThe heap memory limit in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_bucketThe distribution of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_countThe count of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_sumThe sum of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_total_bucketThe distribution of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_total_countThe count of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_by_size_bytes_total_sumThe sum of heap allocations by size in bytes for Go GC.
go_gc_heap_allocs_bytes_totalThe total bytes allocated in the Go GC heap.
go_gc_heap_allocs_objects_totalThe total number of objects allocated on the heap for Go GC.
go_gc_heap_frees_by_size_bytes_bucketThe distribution of heap releases by size in bytes for Go GC.
go_gc_heap_frees_by_size_bytes_countThe count of heap releases by size in bytes for Go GC.
go_gc_heap_frees_by_size_bytes_sumThe sum of heap releases by size in bytes for Go GC.
go_gc_heap_frees_by_size_bytes_total_bucketThe distribution of total heap releases by size in bytes for Go GC.
go_gc_heap_frees_by_size_bytes_total_countThe count of total heap releases by size in bytes for Go GC.
go_gc_heap_frees_by_size_bytes_total_sumThe sum of total heap releases by size in bytes for Go GC.
go_gc_heap_frees_bytes_totalThe total bytes released in the Go GC heap.
go_gc_heap_frees_objects_totalThe total number of objects freed from the heap for Go GC.
go_gc_heap_goal_bytesThe target heap size in bytes for Go GC.
go_gc_heap_live_bytesThe live heap size in bytes for Go GC.
go_gc_heap_objects_objectsThe number of objects in the heap for Go GC.
go_gc_heap_tiny_allocs_objects_totalThe total number of tiny object allocations in the heap for Go GC.
go_gc_limiter_last_enabled_gc_cycleThe last enabled GC cycle for the Go GC limiter.
go_gc_pauses_seconds_bucketThe distribution of GC pause durations in seconds.
go_gc_pauses_seconds_countThe count of GC pause durations in seconds.
go_gc_pauses_seconds_sumThe sum of GC pause durations in seconds.
go_gc_pauses_seconds_total_bucketThe distribution of total GC pause durations in seconds.
go_gc_pauses_seconds_total_countThe count of total GC pause durations in seconds.
go_gc_pauses_seconds_total_sumThe sum of total GC pause durations in seconds.
go_gc_scan_globals_bytesThe number of global bytes scanned during Go GC.
go_gc_scan_heap_bytesThe number of heap bytes scanned during Go GC.
go_gc_scan_stack_bytesThe number of stack bytes scanned during Go GC.
go_gc_scan_total_bytesThe total number of bytes scanned during Go GC.
go_gc_stack_starting_size_bytesThe starting size of the Go GC stack in bytes.
go_godebug_non_default_behavior_execerrdot_events_totalThe total number of execerrdot events for non-default Go debug behavior.
go_godebug_non_default_behavior_gocachehash_events_totalThe total number of Go cache hash events for non-default Go behavior.
go_godebug_non_default_behavior_gocachetest_events_totalThe total number of gocachetest events for non-default Go debug behavior.
go_godebug_non_default_behavior_gocacheverify_events_totalThe total number of gocacheverify events for non-default Go behavior.
go_godebug_non_default_behavior_gotypesalias_events_totalThe total number of gotypesalias events for non-default Go debug behavior.
go_godebug_non_default_behavior_http2client_events_totalThe total number of http2client events for non-default Go debug behavior.
go_godebug_non_default_behavior_http2server_events_totalThe total number of http2server events for non-default Go behavior.
go_godebug_non_default_behavior_httplaxcontentlength_events_totalThe total number of HTTP lax content length events for non-default Go behavior.
go_godebug_non_default_behavior_httpmuxgo121_events_totalThe total number of httpmuxgo121 events for non-default Go behavior.
go_godebug_non_default_behavior_installgoroot_events_totalThe total number of installgoroot events for non-default Go debug behavior.
go_godebug_non_default_behavior_jstmpllitinterp_events_totalThe total number of jstmpllitinterp events for non-default Go debug behavior.
go_godebug_non_default_behavior_multipartmaxheaders_events_totalThe total number of multipart max headers events for non-default Go behavior.
go_godebug_non_default_behavior_multipartmaxparts_events_totalThe total number of multipartmaxparts events for non-default Go debug behavior.
go_godebug_non_default_behavior_multipathtcp_events_totalThe total number of multipathtcp events for non-default Go debug behavior.
go_godebug_non_default_behavior_panicnil_events_totalThe total number of panicnil events (panic called with a nil value) for non-default Go debug behavior.
go_godebug_non_default_behavior_randautoseed_events_totalThe total number of random auto-seed events for non-default Go behavior.
go_godebug_non_default_behavior_tarinsecurepath_events_totalThe total number of tarinsecurepath events for non-default Go debug behavior.
go_godebug_non_default_behavior_tls10server_events_totalThe total number of tls10server events for non-default Go debug behavior.
go_godebug_non_default_behavior_tlsmaxrsasize_events_totalThe total number of tlsmaxrsasize events for non-default Go debug behavior.
go_godebug_non_default_behavior_tlsrsakex_events_totalThe total number of TLS RSA key exchange events for non-default Go debug behavior.
go_godebug_non_default_behavior_tlsunsafeekm_events_totalThe total number of TLS insecure EKM events for non-default Go debug behavior.
go_godebug_non_default_behavior_x509sha1_events_totalThe total number of x509sha1 events for non-default Go debug behavior.
go_godebug_non_default_behavior_x509usefallbackroots_events_totalThe total number of X509 use fallback roots events for non-default Go behavior.
go_godebug_non_default_behavior_x509usepolicies_events_totalThe total number of x509usepolicies events for non-default Go debug behavior.
go_godebug_non_default_behavior_zipinsecurepath_events_totalThe total number of zipinsecurepath events for non-default Go debug behavior.
go_goroutinesThe number of goroutines.
go_infoThe Go-specific information.
go_memory_classes_heap_free_bytesThe free bytes in the heap.
go_memory_classes_heap_objects_bytesThe bytes used by heap objects.
go_memory_classes_heap_released_bytesThe released bytes in the heap for memory classes.
go_memory_classes_heap_stacks_bytesThe heap bytes reserved for stack space.
go_memory_classes_heap_unused_bytesThe unused bytes in the heap.
go_memory_classes_metadata_mcache_free_bytesThe free bytes in metadata mcache.
go_memory_classes_metadata_mcache_inuse_bytesThe in-use bytes in metadata mcache.
go_memory_classes_metadata_mspan_free_bytesThe free bytes in metadata mspan.
go_memory_classes_metadata_mspan_inuse_bytesThe in-use bytes in metadata mspan.
go_memory_classes_metadata_other_bytesThe bytes reserved for or used by other runtime metadata.
go_memory_classes_os_stacks_bytesThe bytes used by OS stacks in memory classes.
go_memory_classes_other_bytesThe bytes used by other runtime memory, such as execution trace buffers and debugging structures.
go_memory_classes_profiling_buckets_bytesThe bytes used by profiling buckets.
go_memory_classes_total_bytesThe total bytes of memory mapped by the Go runtime into the process.
go_memstats_alloc_bytesThe allocated bytes.
go_memstats_alloc_bytes_totalThe total allocated bytes.
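go_memstats_buck_hash_sys_bytesThe bytes used by the profiling bucket hash table.
go_memstats_frees_totalThe total number of heap object frees.
go_memstats_gc_cpu_fractionThe fraction of available CPU time used by GC since the program started (a value between 0 and 1).
go_memstats_gc_sys_bytesThe bytes used for GC system metadata.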
go_memstats_heap_alloc_bytesThe allocated bytes on the heap.
go_memstats_heap_idle_bytesThe idle bytes on the heap.
go_memstats_heap_inuse_bytesThe in-use bytes on the heap.
go_memstats_heap_objectsThe number of objects on the heap.
go_memstats_heap_released_bytesThe released bytes on the heap.
go_memstats_heap_sys_bytesThe heap bytes obtained from the operating system.
go_memstats_last_gc_time_secondsThe time of the last GC, in seconds since the Unix epoch.
go_memstats_lookups_totalThe total number of lookups.
go_memstats_mallocs_totalThe total number of allocations.
go_memstats_mcache_inuse_bytesThe amount of memory in use in mcache in bytes.
go_memstats_mcache_sys_bytesThe amount of memory allocated to mcache by the operating system in bytes.
go_memstats_mspan_inuse_bytesThe amount of memory in use in mspan in bytes.
go_memstats_mspan_sys_bytesThe amount of memory allocated to mspan by the operating system in bytes.
go_memstats_next_gc_bytesThe heap size, in bytes, at which the next GC is triggered.
go_memstats_other_sys_bytesThe amount of memory obtained from the operating system for other system allocations in bytes.
go_memstats_stack_inuse_bytesThe amount of stack memory in use in bytes.
go_memstats_stack_sys_bytesThe amount of stack memory allocated by the operating system in bytes.
go_memstats_sys_bytesThe total memory allocated by the operating system in bytes.
go_sched_gomaxprocs_threadsThe number of threads determined by GOMAXPROCS.
go_sched_goroutines_goroutinesThe number of goroutines.
go_sched_latencies_seconds_bucketThe distribution of Go scheduling latencies in seconds.
go_sched_latencies_seconds_countThe count of Go scheduling latencies in seconds.
go_sched_latencies_seconds_sumThe sum of Go scheduling latencies in seconds.
go_sched_pauses_stopping_gc_seconds_bucketThe distribution of stopping GC pause seconds.
go_sched_pauses_stopping_gc_seconds_countThe count of stopping GC pause seconds.
go_sched_pauses_stopping_gc_seconds_sumThe sum of stopping GC pause seconds.
go_sched_pauses_stopping_other_seconds_bucketThe distribution of other stopping seconds for Go scheduler pauses.
go_sched_pauses_stopping_other_seconds_countThe count of other stopping seconds for Go scheduler pauses.
go_sched_pauses_stopping_other_seconds_sumThe sum of other stopping seconds for Go scheduler pauses.
go_sched_pauses_total_gc_seconds_bucketThe distribution of total GC seconds for Go scheduler pauses.
go_sched_pauses_total_gc_seconds_countThe count of total GC seconds for Go scheduler pauses.
go_sched_pauses_total_gc_seconds_sumThe sum of total GC seconds for Go scheduler pauses.
go_sched_pauses_total_other_seconds_bucketThe distribution of other pause seconds.
go_sched_pauses_total_other_seconds_countThe count of other pause seconds.
go_sched_pauses_total_other_seconds_sumThe sum of other pause seconds.
go_sync_mutex_wait_total_seconds_totalThe total seconds of Go sync mutex wait.
go_threadsThe number of Go threads.
hidden_metric_totalThe total number of hidden metrics.
hidden_metrics_totalThe total number of hidden metrics.
kubernetes_build_infoThe Kubernetes build information.
kubernetes_feature_enabledThe Kubernetes enabled features.
leader_election_master_statusThe master status of leader election.
memory_utilization_byteThe used memory in bytes.
process_cpu_seconds_totalThe total CPU seconds of the process.
process_max_fdsThe maximum number of file descriptors for the process.
process_open_fdsThe number of file descriptors opened by the process.
process_resident_memory_bytesThe resident memory size of the process in bytes.
process_start_time_secondsThe start time of the process, in seconds since the Unix epoch.
process_virtual_memory_bytesThe number of virtual memory bytes for the process.
process_virtual_memory_max_bytesThe maximum number of virtual memory bytes for the process.
registered_metric_totalThe total number of registered metrics.
registered_metrics_totalThe total number of registered metrics.
rest_client_exec_plugin_certificate_rotation_age_bucketThe distribution of certificate rotation age for REST client exec plugin.
rest_client_exec_plugin_certificate_rotation_age_countThe count of certificate rotation age for REST client exec plugin.
rest_client_exec_plugin_certificate_rotation_age_sumThe sum of certificate rotation age for REST client exec plugin.
rest_client_rate_limiter_duration_seconds_bucketThe distribution of rate limiter durations for REST client.
rest_client_rate_limiter_duration_seconds_countThe count of rate limiter durations for REST client.
rest_client_rate_limiter_duration_seconds_sumThe sum of rate limiter durations for REST client.
rest_client_request_duration_seconds_bucketThe distribution of request durations in seconds for REST client.
rest_client_request_duration_seconds_countThe count of request durations in seconds for REST client.
rest_client_request_duration_seconds_sumThe sum of request durations in seconds for REST client.
rest_client_request_retries_totalThe total number of request retries for REST client.
rest_client_request_size_bytes_bucketThe distribution of request sizes in bytes for REST client.
rest_client_request_size_bytes_countThe count of request sizes in bytes for REST client.
rest_client_request_size_bytes_sumThe sum of request sizes in bytes for REST client.
rest_client_requests_totalThe total number of requests for REST client.
rest_client_response_size_bytes_bucketThe distribution of response sizes in bytes for REST client.
rest_client_response_size_bytes_countThe count of response sizes in bytes for REST client.
rest_client_response_size_bytes_sumThe sum of response sizes in bytes for REST client.
rest_client_transport_cache_entriesThe number of transport cache entries for REST client.
rest_client_transport_create_calls_totalThe total number of transport create calls for REST client.
scheduler_binding_duration_seconds_bucketThe distribution of binding durations in seconds for the scheduler.
scheduler_binding_duration_seconds_countThe count of binding durations in seconds for the scheduler.
scheduler_binding_duration_seconds_sumThe sum of binding durations in seconds for the scheduler.
scheduler_e2e_scheduling_duration_seconds_bucketThe distribution of end-to-end scheduling durations for the scheduler.
scheduler_e2e_scheduling_duration_seconds_countThe count of end-to-end scheduling durations for the scheduler.
scheduler_e2e_scheduling_duration_seconds_sumThe sum of end-to-end scheduling durations for the scheduler.
scheduler_framework_extension_point_duration_seconds_bucketThe distribution of extension point durations for the scheduler framework.
scheduler_framework_extension_point_duration_seconds_countThe count of extension point durations for the scheduler framework.
scheduler_framework_extension_point_duration_seconds_sumThe sum of extension point durations for the scheduler framework.
scheduler_goroutinesThe number of goroutines for the scheduler.
scheduler_pending_podsThe number of pending pods for the scheduler.
scheduler_plugin_evaluation_totalThe total number of plugin evaluations for the scheduler.
scheduler_plugin_execution_duration_seconds_bucketThe distribution of execution durations in seconds for the scheduler plugins.
scheduler_plugin_execution_duration_seconds_countThe count of execution durations in seconds for the scheduler plugins.
scheduler_plugin_execution_duration_seconds_sumThe sum of execution durations in seconds for the scheduler plugins.
scheduler_pod_preemption_victims_bucketThe distribution of preemption victims for the scheduler.
scheduler_pod_preemption_victims_countThe count of preemption victims for the scheduler.
scheduler_pod_preemption_victims_sumThe sum of preemption victims for the scheduler.
scheduler_pod_scheduling_attempts_bucketThe distribution of pod scheduling attempts for the scheduler.
scheduler_pod_scheduling_attempts_countThe count of pod scheduling attempts for the scheduler.
scheduler_pod_scheduling_attempts_sumThe sum of pod scheduling attempts for the scheduler.
scheduler_pod_scheduling_duration_seconds_bucketThe distribution of pod scheduling durations in seconds for the scheduler.
scheduler_pod_scheduling_duration_seconds_countThe count of pod scheduling durations in seconds for the scheduler.
scheduler_pod_scheduling_duration_seconds_sumThe sum of pod scheduling durations in seconds for the scheduler.
scheduler_pod_scheduling_sli_duration_seconds_bucketThe distribution of SLI durations for pod scheduling.
scheduler_pod_scheduling_sli_duration_seconds_countThe count of SLI durations for pod scheduling.
scheduler_pod_scheduling_sli_duration_seconds_sumThe sum of SLI durations for pod scheduling.
scheduler_preemption_attempts_totalThe total number of preemption attempts for the scheduler.
scheduler_preemption_victims_bucketThe distribution of preemption victims for the scheduler.
scheduler_preemption_victims_countThe count of preemption victims for the scheduler.
scheduler_preemption_victims_sumThe sum of preemption victims for the scheduler.
scheduler_queue_incoming_pods_totalThe total number of incoming pods for the scheduler.
scheduler_schedule_attempts_totalThe total number of scheduling attempts for the scheduler.
scheduler_scheduler_cache_sizeThe scheduler cache size.
scheduler_scheduler_goroutinesThe number of goroutines for the scheduler.
scheduler_scheduling_algorithm_duration_seconds_bucketThe distribution of scheduling algorithm durations in seconds.
scheduler_scheduling_algorithm_duration_seconds_countThe count of scheduling algorithm durations in seconds.
scheduler_scheduling_algorithm_duration_seconds_sumThe sum of scheduling algorithm durations in seconds.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_bucketThe distribution of predicate evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_countThe count of predicate evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_sumThe sum of predicate evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_bucketThe distribution of preemption evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_countThe count of preemption evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_sumThe sum of preemption evaluation seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_priority_evaluation_seconds_bucketThe distribution of priority evaluation durations in seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_priority_evaluation_seconds_countThe count of priority evaluation durations in seconds for the scheduling algorithm.
scheduler_scheduling_algorithm_priority_evaluation_seconds_sumThe sum of priority evaluation durations in seconds for the scheduling algorithm.
scheduler_scheduling_attempt_duration_seconds_bucketThe distribution of scheduling attempt durations.
scheduler_scheduling_attempt_duration_seconds_countThe count of scheduling attempt durations.
scheduler_scheduling_attempt_duration_seconds_sumThe sum of scheduling attempt durations.
scheduler_scheduling_duration_secondsThe summary of scheduling durations in seconds.
scheduler_scheduling_duration_seconds_countThe count of scheduling durations in seconds.
scheduler_scheduling_duration_seconds_sumThe sum of scheduling durations in seconds.
scheduler_total_preemption_attemptsThe total number of preemption attempts by the scheduler.
scheduler_unschedulable_podsThe number of pods that the scheduler cannot schedule.
scheduler_volume_scheduling_duration_seconds_bucketThe distribution of volume scheduling durations in seconds.
scheduler_volume_scheduling_duration_seconds_countThe count of volume scheduling durations in seconds.
scheduler_volume_scheduling_duration_seconds_sumThe sum of volume scheduling durations in seconds.
scheduler_volume_scheduling_stage_error_totalThe number of errors that are returned during volume scheduling.
scrape_duration_secondsThe scrape duration in seconds.
scrape_samples_post_metric_relabelingThe number of scraped samples after metric relabeling.
scrape_samples_scrapedThe number of scraped samples.
scrape_series_addedThe number of new series added during the scrape.
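upThe connectivity of metric collection. A value of 1 indicates that the target was scraped successfully, and a value of 0 indicates that the scrape failed.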
workqueue_adds_totalThe total number of additions to the work queue.
workqueue_depthThe work queue depth.
workqueue_longest_running_processor_secondsThe longest running processor duration in seconds for the work queue.
workqueue_queue_duration_seconds_bucketThe distribution of queue durations in seconds for the work queue.
workqueue_queue_duration_seconds_countThe count of queue durations in seconds for the work queue.
workqueue_queue_duration_seconds_sumThe sum of queue durations in seconds for the work queue.
workqueue_retries_totalThe total number of retries in the work queue.
workqueue_unfinished_work_secondsThe unfinished work duration in seconds for the work queue.
workqueue_work_duration_seconds_bucketThe distribution of work durations for the work queue.
workqueue_work_duration_seconds_countThe count of work durations for the work queue.
workqueue_work_duration_seconds_sumThe sum of work durations for the work queue.
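
Many metrics in the preceding table come as `_bucket`, `_count`, and `_sum` series, which together describe a Prometheus histogram. The following Python code is a minimal sketch of how the three series relate: the mean is `_sum` divided by `_count`, and a quantile can be estimated by linear interpolation inside the cumulative buckets, similar in spirit to the PromQL `histogram_quantile` function. The function names and the sample bucket values are hypothetical and are not taken from a real cluster.

```python
import math


def histogram_mean(total_sum, total_count):
    """Mean observation = _sum / _count; NaN when nothing was observed."""
    if total_count == 0:
        return math.nan
    return total_sum / total_count


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the `le` label of a `_bucket` series: counts are
    cumulative and the last upper bound is +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound or an empty bucket: fall back to the lower edge.
                return prev_bound
            # Linear interpolation between the bucket edges.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical samples of a *_duration_seconds_bucket series (le, cumulative count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
example_mean = histogram_mean(30.0, 100)             # _sum = 30.0, _count = 100
example_p95 = estimate_quantile(0.95, example_buckets)
```

In this sketch, the 95th percentile falls into the bucket between 0.5 and 1.0 seconds, so the estimate is interpolated between those two bounds. The accuracy of such an estimate depends on how finely the bucket bounds are spaced.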

cAdvisor (job name: _arms/kubelet/cadvisor)

Metric

Description

container_cpu_usage_seconds_total

The total CPU time consumed by the container in seconds.

container_fs_usage_bytes

The number of bytes used by the container file system.

container_memory_cache

The memory cache size of the container in bytes.

container_memory_usage_bytes

The amount of memory used by the container in bytes.

container_memory_working_set_bytes

The memory working set size (WSS) of the container in bytes.

container_network_receive_bytes_total

The total network traffic received by the container in bytes.

container_network_transmit_bytes_total

The total network traffic transmitted by the container in bytes.

container_scrape_error

The number of container metric scraping errors.

DCGM_CUSTOM_CONTAINER_CP_ALLOCATED

The ratio of the GPU computing power allocated to the container to the total computing power of the GPU. The value ranges from 0 to 1. In exclusive GPU mode or in shared GPU mode in which the container requests only GPU memory, the value of this metric is 0, which indicates that the allocation of GPU computing power is unlimited. For example, if a GPU provides a total of 100 compute units (CUs) of GPU computing power and allocates 30 CUs to a container, the ratio of the GPU computing power allocated to the container is calculated by using the following formula: 30/100 = 0.3.

DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED

The amount of GPU memory allocated to the container.

DCGM_CUSTOM_DEV_FB_ALLOCATED

The ratio of the allocated GPU memory to the total memory of the GPU. The value ranges from 0 to 1.

DCGM_CUSTOM_DEV_FB_TOTAL

The total memory of the GPU.

DCGM_CUSTOM_DEV_HEALTH

The health status of the GPU.

DCGM_CUSTOM_PROCESS_DECODE_UTIL

The decoder utilization of GPU threads.

DCGM_CUSTOM_PROCESS_ENCODE_UTIL

The encoder utilization of GPU threads.

DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL

The memory copy utilization of GPU threads.

DCGM_CUSTOM_PROCESS_MEM_USED

The amount of GPU memory used by GPU threads.

DCGM_CUSTOM_PROCESS_SM_UTIL

The streaming multiprocessor (SM) utilization of GPU threads.

DCGM_CUSTOM_PROF_MEM_BANDWIDTH_USED

The GPU memory bandwidth used.

DCGM_CUSTOM_PROF_TENS_TFPS_USED

The tensor core utilization.

DCGM_FI_DEV_DEC_UTIL

The decoder utilization.

DCGM_FI_DEV_ENC_UTIL

The encoder utilization.

DCGM_FI_DEV_FB_FREE

The amount of free frame buffer memory.

DCGM_FI_DEV_FB_USED

The amount of used frame buffer memory. The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command.

DCGM_FI_DEV_GPU_TEMP

The GPU temperature.

DCGM_FI_DEV_GPU_UTIL

The GPU utilization, which is the percentage of time within a sample period during which one or more kernel functions remain active. The sample period is 1 second or 1/6 second, depending on the GPU model. This metric indicates only that one or more kernel functions are occupying GPU resources. It does not show how fully the GPU is used.

DCGM_FI_DEV_MEM_CLOCK

The memory clock speed.

DCGM_FI_DEV_MEM_COPY_UTIL

The memory bandwidth utilization. For example, the maximum memory bandwidth of NVIDIA V100 is 900 GB/s. If the memory bandwidth used is 450 GB/s, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_POWER_USAGE

The power usage.

DCGM_FI_DEV_SM_CLOCK

The SM clock speed.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

The total energy consumed since the driver was last loaded.

DCGM_FI_DEV_XID_ERRORS

The last XID error that occurred within a period of time.

DCGM_FI_PROF_DRAM_ACTIVE

The cycle fraction for memory bandwidth utilization when sending data to device memory or receiving data from device memory.

The value is an average value within a time interval rather than an instantaneous value.

A larger value of this metric indicates higher device memory utilization.

If the value is 1 (100%), a DRAM command is executed every cycle within the entire interval. In practice, the peak value of the metric is approximately 0.8 (80%).

If the value of this metric is 0.2 (20%), 20% of the cycles within the time interval are spent reading from or writing to device memory.

DCGM_FI_PROF_NVLINK_RX_BYTES

The rate of data received over NVLink. The bytes counted exclude the header.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per direction per link.

DCGM_FI_PROF_NVLINK_TX_BYTES

The rate of data transmitted over NVLink. The bytes counted exclude the header. The value is an average value within a time interval rather than an instantaneous value.

DCGM_FI_PROF_PCIE_RX_BYTES

The rate of data received over PCIe. The bytes counted include both the header and payload.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is received within 1 second, the RX rate is 1 GB/s regardless of whether the transfer occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PCIE_TX_BYTES

The rate of data transmitted over PCIe. The bytes counted include both the header and payload.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

The cycle fraction for the Tensor (HMMA/IMMA) pipe being in the Active state.

The value is an average value within a time interval rather than an instantaneous value.

A larger value of this metric indicates higher tensor core utilization.

If the value is 1 (100%), a Tensor instruction is issued every other cycle within the entire interval, because one instruction completes in two cycles.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:

The tensor core utilization of 20% of the SMs within the time interval is 100%.

The tensor core utilization of all SMs within the time interval is 20%.

The tensor core utilization of all SMs within 20% of the time interval is 100%.

Other conditions.

DCGM_FI_PROF_SM_ACTIVE

The ratio of cycles during which at least one warp on an SM is active. The value is an average across all SMs and does not vary with the number of warps in a thread block. A warp is considered active once it is scheduled and resources are allocated to it. An active warp may be computing or may be in a non-computing state, such as waiting for memory requests.

A value below 0.5 indicates low GPU utilization. To ensure high GPU utilization, make sure that the value is greater than 0.8.

Assume that a GPU has N SMs. If a kernel function uses N thread blocks and runs during the entire time interval, the value of this metric is 1 (100%). If a kernel function uses N/5 thread blocks and runs during the entire time interval, the value is 0.2 (20%). If a kernel function uses N thread blocks but runs during only 20% of the time interval, the value is also 0.2 (20%).

machine_cpu_cores

The number of CPU cores on the machine.

node_exporter_build_info

The build information about the node exporter.

nvidia_gpu_duty_cycle

The percentage of time over the past sample period during which the NVIDIA GPU was occupied.

nvidia_gpu_memory_total_bytes

The total memory of the NVIDIA GPU in bytes.

nvidia_gpu_memory_used_bytes

The memory used by the NVIDIA GPU in bytes.

nvidia_gpu_num_devices

The number of NVIDIA GPUs.

nvidia_gpu_power_usage_milliwatts

The power consumption of the NVIDIA GPU in milliwatts.

nvidia_gpu_temperature_celsius

The temperature of the NVIDIA GPU in °C.

rdma_service_monitor_local_ack_timeout_err

The number of timeout errors that occurred in the remote direct memory access (RDMA) network.

rdma_service_monitor_out_of_seq

The number of out-of-order packets in the RDMA network.

rdma_service_monitor_packet_seq_err

The number of out-of-order packet errors in the RDMA network.

rdma_service_monitor_rx_bytes

The number of bytes received over the RDMA network.

rdma_service_monitor_rx_packets

The number of packets received over the RDMA network.

rdma_service_monitor_tx_bytes

The number of bytes sent over the RDMA network.

rdma_service_monitor_tx_packets

The number of packets sent over the RDMA network.

up

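The connectivity of metric collection. A value of 1 indicates that the target was scraped successfully, and a value of 0 indicates that the scrape failed.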
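
Several GPU metric descriptions in the preceding table rely on simple arithmetic: an allocation ratio (30/100 = 0.3), a bandwidth utilization (450 GB/s out of 900 GB/s = 50%), an average transfer rate over a time interval, and an activity value averaged across all SMs and cycles. The following Python code is a minimal sketch that reproduces those calculations. All function names are hypothetical, and the 16-lane PCIe link and the 5 SM by 10 cycle grid are assumptions made only for illustration.

```python
def ratio(part, whole):
    """Return part / whole, for example allocated capacity over total capacity."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def average_rate(bytes_transferred, interval_seconds):
    """Average rate over an interval; steady and bursty transfers give the same value."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return bytes_transferred / interval_seconds


def mean_activity(active):
    """Average a grid of 0/1 flags, active[sm][cycle], over all SMs and cycles."""
    cells = [flag for row in active for flag in row]
    return sum(cells) / len(cells)


# DCGM_CUSTOM_CONTAINER_CP_ALLOCATED: 30 of 100 compute units allocated.
cp_allocated = ratio(30, 100)

# DCGM_FI_DEV_MEM_COPY_UTIL: 450 GB/s used out of a 900 GB/s maximum.
mem_copy_util = ratio(450, 900)

# DCGM_FI_PROF_PCIE_TX_BYTES: 1 GB sent within 1 second.
tx_rate_bytes_per_s = average_rate(1_000_000_000, 1)

# Theoretical PCIe Gen3 maximum for an assumed 16-lane link, in MB/s.
pcie_gen3_x16_mb_per_s = 985 * 16

# DCGM_FI_PROF_SM_ACTIVE for an assumed GPU with 5 SMs observed over 10 cycles.
n_sms, n_cycles = 5, 10
all_busy = [[1] * n_cycles for _ in range(n_sms)]
one_fifth_of_sms = [[1] * n_cycles] + [[0] * n_cycles for _ in range(n_sms - 1)]
one_fifth_of_time = [[1] * 2 + [0] * (n_cycles - 2) for _ in range(n_sms)]

sm_active_values = [
    mean_activity(all_busy),
    mean_activity(one_fifth_of_sms),
    mean_activity(one_fifth_of_time),
]
```

The last two grids yield the same averaged value even though the workloads differ, which is why an averaged activity metric alone cannot tell you whether a few SMs are always busy or all SMs are busy for a short time.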

ACK control plane API server (control plane components for ACK Pro clusters: APIServer, ETCD, Scheduler, Kube Controller Manager, and Cloud Controller Manager; control plane component for ACK dedicated clusters: APIServer) (job name: apiserver)

Metric

Description

aggregator_discovery_aggregation_count_total

The count of discovery aggregations performed by the aggregator.

aggregator_openapi_v2_regeneration_count

The number of regenerations based on OpenAPI 2.0.

aggregator_openapi_v2_regeneration_duration

The amount of time consumed for regenerations based on OpenAPI 2.0.

aggregator_unavailable_apiservice

The APIServices that are unavailable to the aggregator.

aggregator_unavailable_apiservice_count

The count of APIServices that are unavailable to the aggregator.

aggregator_unavailable_apiservice_total

The total number of APIServices that are unavailable to the aggregator.

aliyun_prometheus_agent_append_duration_seconds

The time spent by the Prometheus agent appending scraped data, in seconds.

aliyun_prometheus_agent_job_discovery_status

The job status that is discovered by the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of target scrapes performed by the Prometheus agent.

aliyun_prometheus_agent_target_info

The information about targets scraped by the Prometheus agent.

apiextensions_apiserver_validation_ratcheting_seconds_bucket

The distribution of time spent on validation ratcheting comparisons in the APIServer, in seconds.

apiextensions_apiserver_validation_ratcheting_seconds_count

The count of validation ratcheting comparisons in the APIServer.

apiextensions_apiserver_validation_ratcheting_seconds_sum

The sum of time spent on validation ratcheting comparisons in the APIServer, in seconds.

apiextensions_openapi_v2_regeneration_count

The number of API extension regenerations based on OpenAPI 2.0.

apiextensions_openapi_v3_regeneration_count

The number of API extension regenerations based on OpenAPI 3.0.

apiserver_accepted_listall_requests_total

The total number of ListAll requests accepted by the APIServer.

apiserver_admission_controller_admission_duration_seconds_bucket

The distribution of APIServer admission controller durations in seconds.

apiserver_admission_controller_admission_duration_seconds_count

The count of APIServer admission controller durations in seconds.

apiserver_admission_controller_admission_duration_seconds_sum

The sum of APIServer admission controller durations in seconds.

apiserver_admission_step_admission_duration_seconds_bucket

The distribution of APIServer admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_count

The count of APIServer admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_sum

The sum of APIServer admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_summary

The summary of APIServer admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_summary_count

The summary count of APIServer admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_summary_sum

The summary sum of APIServer admission step durations in seconds.

apiserver_admission_webhook_admission_duration_seconds_bucket

The distribution of APIServer admission webhook durations in seconds.

apiserver_admission_webhook_admission_duration_seconds_count

The count of APIServer admission webhook durations in seconds.

apiserver_admission_webhook_admission_duration_seconds_sum

The sum of APIServer admission webhook durations in seconds.

apiserver_admission_webhook_fail_open_count

The number of times that an APIServer admission webhook call failed open.

apiserver_admission_webhook_rejection_count

The count of requests rejected by the APIServer admission webhook.

apiserver_admission_webhook_request_total

The total number of requests to the APIServer admission webhook.

apiserver_audit_error_total

The total number of APIServer audit errors.

apiserver_audit_event_total

The total number of APIServer audit events.

apiserver_audit_level_total

The total number of APIServer audit events counted by audit policy level.

apiserver_audit_requests_rejected_total

The total number of APIServer requests rejected because of an error in the audit logging backend.

apiserver_authorization_decisions_total

The total number of authorization decisions made by the APIServer.

apiserver_cache_list_fetched_objects_total

The total number of objects obtained by the APIServer cache list.

apiserver_cache_list_returned_objects_total

The total number of objects returned by the APIServer cache list.

apiserver_cache_list_total

The total number of operations performed by the APIServer cache list.

apiserver_cacher_received_events

The number of events received by the APIServer cache.

apiserver_cacher_sended_events_latency_milliseconds_bucket

The distribution of APIServer event sending latencies in milliseconds.

apiserver_cacher_sended_events_latency_milliseconds_count

The count of APIServer event sending latencies in milliseconds.

apiserver_cacher_sended_events_latency_milliseconds_sum

The total of APIServer event sending latencies in milliseconds.

apiserver_cacher_watcher_channel_length

The watcher channel length of the APIServer cache.

apiserver_cel_compilation_duration_seconds_bucket

The distribution of APIServer Common Expression Language (CEL) compilation latencies in seconds.

apiserver_cel_compilation_duration_seconds_count

The count of APIServer CEL compilations.

apiserver_cel_compilation_duration_seconds_sum

The total time consumed for APIServer CEL compilations in seconds.

apiserver_cel_evaluation_duration_seconds_bucket

The distribution of APIServer CEL evaluation latencies in seconds.

apiserver_cel_evaluation_duration_seconds_count

The count of APIServer CEL evaluations.

apiserver_cel_evaluation_duration_seconds_sum

The total of APIServer CEL evaluation latencies in seconds.

apiserver_client_certificate_expiration_seconds_bucket

The distribution of remaining seconds until APIServer client certificate expiration.

apiserver_client_certificate_expiration_seconds_count

The count of remaining seconds until APIServer client certificate expiration.

apiserver_client_certificate_expiration_seconds_sum

The total remaining seconds until APIServer client certificate expiration.

apiserver_clusterip_repair_ip_errors_total

The total number of ClusterIP errors detected by the APIServer repair loop.

apiserver_clusterip_repair_reconcile_errors_total

The total number of reconciliation failures in the APIServer ClusterIP repair loop.

apiserver_conversion_webhook_duration_seconds_bucket

The distribution of APIServer conversion webhook latencies in seconds.

apiserver_conversion_webhook_duration_seconds_count

The count of APIServer conversion webhook calls.

apiserver_conversion_webhook_duration_seconds_sum

The total of APIServer conversion webhook latencies in seconds.

apiserver_conversion_webhook_request_total

The total number of APIServer conversion webhook requests.

apiserver_crd_conversion_webhook_duration_seconds_bucket

The distribution of APIServer Custom Resource Definition (CRD) conversion webhook latencies in seconds.

apiserver_crd_conversion_webhook_duration_seconds_count

The count of APIServer CRD conversion webhook calls.

apiserver_crd_conversion_webhook_duration_seconds_sum

The total of APIServer CRD conversion webhook latencies in seconds.

apiserver_crd_webhook_conversion_duration_seconds_bucket

The distribution of APIServer CRD webhook conversion latencies in seconds.

apiserver_crd_webhook_conversion_duration_seconds_count

The count of APIServer CRD webhook conversions.

apiserver_crd_webhook_conversion_duration_seconds_sum

The total of APIServer CRD webhook conversion latencies in seconds.

apiserver_created_watchers

The number of watchers created by the APIServer.

apiserver_current_inflight_requests

The number of requests that are being processed by the APIServer.

apiserver_current_inqueue_requests

The maximum number of queued requests in the APIServer.

apiserver_dropped_requests_total

The total number of requests dropped by the APIServer.

apiserver_encryption_config_controller_automatic_reload_failures_total

The number of failed automatic reloads of the APIServer encryption configuration.

apiserver_encryption_config_controller_automatic_reload_success_total

The number of successful automatic reloads of the APIServer encryption configuration.

apiserver_envelope_encryption_dek_cache_fill_percent

The percentage of APIServer envelope encryption Data Encryption Key (DEK) cache filled.

apiserver_error_watchers

The number of watchers in the Error state in the APIServer.

apiserver_flowcontrol_current_executing_requests

The number of requests being processed by APIServer rate limiting.

apiserver_flowcontrol_current_executing_seats

The number of seats occupied by APIServer rate limiting.

apiserver_flowcontrol_current_inqueue_requests

The number of requests pending in APIServer rate limiting queues.

apiserver_flowcontrol_current_inqueue_seats

The number of seats pending in APIServer rate limiting queues.

apiserver_flowcontrol_current_limit_seats

The current limit on the number of seats in APIServer rate limiting.

apiserver_flowcontrol_current_r

The current R value of APIServer rate limiting.

apiserver_flowcontrol_demand_seats_average

The average number of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_bucket

The distribution of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_count

The count of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_high_watermark

The high watermark of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_smoothed

The smoothed value of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_stdev

The standard deviation of seats requested by APIServer rate limiting.

apiserver_flowcontrol_demand_seats_sum

The sum of seats requested by APIServer rate limiting.

apiserver_flowcontrol_dispatch_r

The R value of APIServer rate limiting at the time of the last dispatch.

apiserver_flowcontrol_dispatched_requests_total

The total number of requests dispatched by APIServer rate limiting.

apiserver_flowcontrol_latest_s

The S value of the most recently dispatched request in APIServer rate limiting.

apiserver_flowcontrol_lower_limit_seats

The lower bound of seats in APIServer rate limiting.

apiserver_flowcontrol_next_discounted_s_bounds

The next discounted S value bounds of APIServer rate limiting.

apiserver_flowcontrol_next_s_bounds

The next S value bounds of APIServer rate limiting.

apiserver_flowcontrol_nominal_limit_seats

The nominal upper bound of seats in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_samples_bucket

The distribution of priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_samples_count

The count of priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_samples_sum

The sum of priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_watermarks_bucket

The distribution of watermark levels for priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_watermarks_count

The count of watermark levels for priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_count_watermarks_sum

The sum of watermark levels for priority level request samples in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_utilization_bucket

The distribution of request utilization samples by priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_utilization_count

The count of request utilization samples by priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_request_utilization_sum

The sum of request utilization by priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_count_samples_bucket

The distribution of seat samples for priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_count_samples_count

The count of seat samples for priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_count_samples_sum

The sum of seat samples for priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_count_watermarks_bucket

The distribution of watermark levels for seat samples in APIServer rate limiting by priority level.

apiserver_flowcontrol_priority_level_seat_count_watermarks_count

The count of watermark levels for seat samples in APIServer rate limiting by priority level.

apiserver_flowcontrol_priority_level_seat_count_watermarks_sum

The sum of watermark levels for seat samples in APIServer rate limiting by priority level.

apiserver_flowcontrol_priority_level_seat_utilization_bucket

The distribution of seat utilization samples by priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_utilization_count

The count of seat utilization samples by priority level in APIServer rate limiting.

apiserver_flowcontrol_priority_level_seat_utilization_sum

The sum of seat utilization by priority level in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_current_requests_bucket

The distribution of current read/write requests in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_current_requests_count

The count of current read/write requests in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_current_requests_sum

The sum of current read/write requests in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_samples_bucket

The distribution of read/write request count samples in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_samples_count

The count of read/write request count samples in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_samples_sum

The sum of read/write request count samples in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_watermarks_bucket

The distribution of read/write request count watermarks in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_watermarks_count

The count of read/write request count watermarks in APIServer rate limiting.

apiserver_flowcontrol_read_vs_write_request_count_watermarks_sum

The sum of read/write request count watermarks in APIServer rate limiting.

apiserver_flowcontrol_rejected_requests_total

The total number of requests rejected by APIServer rate limiting.

apiserver_flowcontrol_request_concurrency_in_use

The count of concurrent requests in APIServer rate limiting.

apiserver_flowcontrol_request_concurrency_limit

The concurrent request limit in APIServer rate limiting.

apiserver_flowcontrol_request_dispatch_no_accommodation_total

The total number of dispatch attempts in APIServer rate limiting that found no request that could be accommodated.

apiserver_flowcontrol_request_execution_seconds_bucket

The distribution of request execution durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_request_execution_seconds_count

The count of request execution durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_request_execution_seconds_sum

The sum of request execution durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_request_queue_length_after_enqueue_bucket

The distribution of request queue lengths after enqueuing in APIServer rate limiting.

apiserver_flowcontrol_request_queue_length_after_enqueue_count

The count of request queue lengths after enqueuing in APIServer rate limiting.

apiserver_flowcontrol_request_queue_length_after_enqueue_sum

The sum of request queue lengths after enqueuing in APIServer rate limiting.

apiserver_flowcontrol_request_wait_duration_seconds_bucket

The distribution of request waiting durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_request_wait_duration_seconds_count

The count of request waiting durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_request_wait_duration_seconds_sum

The sum of request waiting durations in seconds in APIServer rate limiting.

apiserver_flowcontrol_seat_fair_frac

The fair share ratios determined by the APIServer during the last borrowing adjustment period.

apiserver_flowcontrol_target_seats

The target number of seats in APIServer rate limiting.

apiserver_flowcontrol_upper_limit_seats

The upper bound of seats in APIServer rate limiting.

apiserver_flowcontrol_watch_count_samples_bucket

The distribution of watch count samples in APIServer rate limiting.

apiserver_flowcontrol_watch_count_samples_count

The count of watch count samples in APIServer rate limiting.

apiserver_flowcontrol_watch_count_samples_sum

The sum of watch count samples in APIServer rate limiting.

apiserver_flowcontrol_work_estimated_seats_bucket

The distribution of estimated seats in APIServer rate limiting.

apiserver_flowcontrol_work_estimated_seats_count

The count of estimated seats in APIServer rate limiting.

apiserver_flowcontrol_work_estimated_seats_sum

The sum of estimated seats in APIServer rate limiting.

apiserver_init_events_total

The total number of initialization events in the APIServer.

apiserver_kube_aggregator_x509_insecure_sha1_total

The number of requests using insecure Secure Hash Algorithm 1 (SHA1) signatures.

apiserver_kube_aggregator_x509_missing_san_total

The total number of x509 certificates missing Subject Alternative Names (SANs) in APIServer kube-aggregator.

apiserver_longrunning_gauge

The gauge of long-running requests in the APIServer.

apiserver_longrunning_requests

The number of long-running requests in the APIServer.

apiserver_nodeport_repair_reconcile_errors_total

The total number of NodePort repair reconcile errors in the APIServer.

apiserver_realtime_watchers

The number of real-time watchers in the APIServer.

apiserver_registered_watchers

The number of registered watchers in the APIServer.

apiserver_request_aborts_total

The total number of aborted APIServer requests.

apiserver_request_body_size_bytes_bucket

The distribution of APIServer request body sizes in bytes.

apiserver_request_body_size_bytes_count

The count of APIServer request body sizes in bytes.

apiserver_request_body_size_bytes_sum

The sum of APIServer request body sizes in bytes.

apiserver_request_count

The number of APIServer requests.

apiserver_request_duration_seconds_bucket

The distribution of APIServer request latencies in seconds.

apiserver_request_duration_seconds_count

The count of APIServer request latencies in seconds.

apiserver_request_duration_seconds_sum

The sum of APIServer request latencies in seconds.

apiserver_request_filter_duration_seconds_bucket

The distribution of request filter latencies in seconds.

apiserver_request_filter_duration_seconds_count

The count of request filter latencies in seconds.

apiserver_request_filter_duration_seconds_sum

The sum of request filter latencies in seconds.

apiserver_request_latencies_summary

The summary of APIServer request latencies.

apiserver_request_no_resourceversion_list_total

The total number of LIST requests that did not specify a resource version.

apiserver_request_post_timeout_total

The total number of timed out POST requests.

apiserver_request_sli_duration_seconds_bucket

The distribution of Service Level Indicator (SLI) request latencies in seconds.

apiserver_request_sli_duration_seconds_count

The count of SLI request latencies in seconds.

apiserver_request_sli_duration_seconds_sum

The sum of SLI request latencies in seconds.

apiserver_request_slo_duration_seconds_bucket

The distribution of Service Level Objective (SLO) request latencies in seconds.

apiserver_request_slo_duration_seconds_count

The count of SLO request latencies in seconds.

apiserver_request_slo_duration_seconds_sum

The sum of SLO request latencies in seconds.

apiserver_request_terminations_total

The total number of terminated API requests.

apiserver_request_timestamp_comparison_time_bucket

The distribution of time spent in timestamp comparison of API requests.

apiserver_request_timestamp_comparison_time_count

The count of API request samples for timestamp comparison.

apiserver_request_timestamp_comparison_time_sum

The sum of time spent in timestamp comparison of API requests.

apiserver_request_total

The total number of API requests.

apiserver_requested_deprecated_apis

The count of APIServer requests for deprecated APIs.

apiserver_response_sizes_bucket

The distribution of response body sizes of API requests.

apiserver_response_sizes_count

The count of response body sizes of API requests.

apiserver_response_sizes_sum

The sum of response body sizes of API requests.

apiserver_selfrequest_total

The total number of APIServer self-requests.

apiserver_storage_data_key_generation_duration_seconds_bucket

The distribution of time consumed by the APIServer to generate data keys in seconds.

apiserver_storage_data_key_generation_duration_seconds_count

The count of time consumed by the APIServer to generate data keys in seconds.

apiserver_storage_data_key_generation_duration_seconds_sum

The sum of time consumed by the APIServer to generate data keys in seconds.

apiserver_storage_data_key_generation_failures_total

The total number of data key generation failures.

apiserver_storage_db_total_size_in_bytes

The total size of APIServer databases in bytes.

apiserver_storage_decode_errors_total

The total number of decoding errors in the APIServer.

apiserver_storage_envelope_transformation_cache_misses_total

The total number of envelope transformation cache misses in the APIServer.

apiserver_storage_events_received_total

The total number of events received by the APIServer.

apiserver_storage_list_evaluated_objects_total

The total number of evaluated objects in the APIServer storage list.

apiserver_storage_list_fetched_objects_total

The total number of objects obtained by the APIServer storage list.

apiserver_storage_list_returned_objects_total

The total number of objects returned by the APIServer storage list.

apiserver_storage_list_total

The total number of operations performed by the APIServer storage list.

apiserver_storage_objects

The number of objects stored in the APIServer.

apiserver_storage_size_bytes

The total size of objects stored in the APIServer.

apiserver_terminated_watchers_total

The total number of watchers terminated by the APIServer.

apiserver_tls_handshake_errors_total

The total number of requests with Transport Layer Security (TLS) handshake errors in the APIServer.

apiserver_too_large_resourceversion_errors

The total number of requests that failed because the requested resource version is too large in the APIServer.

apiserver_watch_cache_events_dispatched_total

The total number of events dispatched by the APIServer watch cache.

apiserver_watch_cache_events_received_total

The total number of events received by the APIServer watch cache.

apiserver_watch_cache_initializations_total

The total number of cache initializations observed by the APIServer.

apiserver_watch_cache_read_wait_seconds_bucket

The distribution of cache read waiting durations in seconds observed by the APIServer.

apiserver_watch_cache_read_wait_seconds_count

The count of cache read waiting durations in seconds observed by the APIServer.

apiserver_watch_cache_read_wait_seconds_sum

The sum of cache read waiting durations in seconds observed by the APIServer.

apiserver_watch_cache_watch_cache_initializations_total

The total number of cache initializations observed by the APIServer.

apiserver_watch_events_sizes_bucket

The distribution of sizes of events observed by the APIServer.

apiserver_watch_events_sizes_count

The count of sizes of events observed by the APIServer.

apiserver_watch_events_sizes_sum

The sum of sizes of events observed by the APIServer.

apiserver_watch_events_total

The total number of events observed by the APIServer.

apiserver_webhooks_x509_insecure_sha1_total

The number of requests using insecure SHA1 signatures.

apiserver_webhooks_x509_missing_san_total

The total number of missing SANs in APIServer webhooks.

authenticated_user_requests

The total number of authenticated user requests.

authentication_attempts

The number of authentication attempts.

authentication_duration_seconds_bucket

The distribution of authentication durations in seconds.

authentication_duration_seconds_count

The count of authentication durations in seconds.

authentication_duration_seconds_sum

The sum of authentication durations in seconds.

authentication_token_cache_active_fetch_count

The count of active fetches for the authentication token cache.

authentication_token_cache_fetch_total

The total number of times the authentication token was retrieved from the cache.

authentication_token_cache_request_duration_seconds_bucket

The distribution of request durations in seconds for authentication token cache.

authentication_token_cache_request_duration_seconds_count

The count of request durations in seconds for authentication token cache.

authentication_token_cache_request_duration_seconds_sum

The sum of request durations in seconds for authentication token cache.

authentication_token_cache_request_total

The total number of requests for authentication token cache.

authorization_attempts_total

The total number of authorization attempts.

authorization_duration_seconds_bucket

The distribution of authorization durations in seconds.

authorization_duration_seconds_count

The count of authorization durations in seconds.

authorization_duration_seconds_sum

The sum of authorization durations in seconds.

cardinality_enforcement_unexpected_categorizations_total

The total number of unexpected categorizations during cardinality enforcement.

count

The count details.

cpu_utilization_core

The CPU utilization of the core.

disabled_metric_total

The total number of disabled metrics.

disabled_metrics_total

The total number of disabled metrics.

etcd_bookmark_counts

The number of ETCD bookmarks.

etcd_db_total_size_in_bytes

The total size of ETCD databases in bytes.

etcd_lease_object_counts_bucket

The distribution of objects attached to a single ETCD lease.

etcd_lease_object_counts_count

The count of objects attached to a single ETCD lease.

etcd_lease_object_counts_sum

The sum of objects attached to a single ETCD lease.

etcd_object_counts

The number of ETCD objects.

etcd_request_duration_seconds_bucket

The distribution of ETCD request latencies in seconds.

etcd_request_duration_seconds_count

The count of ETCD request latencies in seconds.

etcd_request_duration_seconds_sum

The sum of ETCD request latencies in seconds.

etcd_request_errors_total

The total number of failed ETCD requests.

etcd_requests_total

The total number of ETCD requests.

etcd_watcher_channel_length

The channel length of the ETCD watcher.

etcd_watcher_received_events

The number of events received by the ETCD watcher.

etcd_watcher_sended_events_latency_milliseconds_bucket

The distribution of event sending latencies of the ETCD watcher in milliseconds.

etcd_watcher_sended_events_latency_milliseconds_count

The count of event sending latencies of the ETCD watcher in milliseconds.

etcd_watcher_sended_events_latency_milliseconds_sum

The sum of event sending latencies of the ETCD watcher in milliseconds.

field_validation_request_duration_seconds_bucket

The distribution of field validation request latencies in seconds.

field_validation_request_duration_seconds_count

The count of field validation request latencies in seconds.

field_validation_request_duration_seconds_sum

The sum of field validation request latencies in seconds.

get_token_count

The number of obtained tokens.

get_token_fail_count

The number of failed attempts to obtain a token.

grpc_client_handled_total

The total number of requests handled by the gRPC client.

grpc_client_msg_received_total

The total number of messages received by the gRPC client.

grpc_client_msg_sent_total

The total number of messages sent by the gRPC client.

grpc_client_started_total

The total number of RPCs started by the gRPC client.

http_request_duration_microseconds

The HTTP request latency in microseconds.

http_request_size_bytes

The HTTP request size in bytes.

http_requests_total

The total number of HTTP requests.

http_response_size_bytes

The HTTP response body size in bytes.

job

The job name.

job_instance_mode

The job instance mode.

kube_apiserver_clusterip_allocator_allocated_ips

Kubernetes APIServer: The number of allocated cluster IP addresses.

kube_apiserver_clusterip_allocator_allocation_errors_total

Kubernetes APIServer: The total number of errors that occurred in cluster IP address allocations.

kube_apiserver_clusterip_allocator_allocation_total

Kubernetes APIServer: The total number of cluster IP address allocations.

kube_apiserver_clusterip_allocator_available_ips

Kubernetes APIServer: The number of available cluster IP addresses.

kube_apiserver_nodeport_allocator_allocated_ports

Kubernetes APIServer: The number of allocated node ports.

kube_apiserver_nodeport_allocator_allocation_errors_total

Kubernetes APIServer: The total number of errors that occurred in node port allocations.

kube_apiserver_nodeport_allocator_allocation_total

Kubernetes APIServer: The total number of node port allocations.

kube_apiserver_nodeport_allocator_available_ports

Kubernetes APIServer: The number of available node ports.

kube_apiserver_pod_logs_backend_tls_failure_total

Kubernetes APIServer: The total number of pod/log requests that failed due to TLS verification errors.

kube_apiserver_pod_logs_insecure_backend_total

Kubernetes APIServer: The total number of insecure pod/log requests.

kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total

Kubernetes APIServer: The total number of pod/log requests that failed due to TLS verification errors.

kube_apiserver_pod_logs_pods_logs_insecure_backend_total

Kubernetes APIServer: The total number of insecure pod/log requests.

kubelet_container_log_filesystem_used_bytes

Kubelet: The space of the file system used by container logs in bytes.

kubelet_node_name

Kubelet: The node name.

kubelet_pleg_relist_duration_seconds_bucket

Kubelet: The distribution of PLEG relisting durations in seconds.

kubelet_pod_worker_duration_seconds_bucket

Kubelet: The distribution of pod worker durations in seconds.

kubelet_volume_stats_available_bytes

Kubelet: The number of available bytes in the volume.

kubelet_volume_stats_capacity_bytes

Kubelet: The volume capacity in bytes.

kubelet_volume_stats_inodes

Kubelet: The total number of inodes in the volume.

kubelet_volume_stats_inodes_free

Kubelet: The number of free inodes in the volume.

kubelet_volume_stats_inodes_used

Kubelet: The number of used inodes in the volume.

kubelet_volume_stats_used_bytes

Kubelet: The number of used bytes in the volume.

kubernetes_build_info

The Kubernetes build information.

kubernetes_feature_enabled

Indicates whether a Kubernetes feature is enabled.

last_list_all_response_size_in_bytes

The total size of all response bodies in the recent list in bytes.

memory_utilization_byte

The used memory in bytes.

node_authorizer_graph_actions_duration_seconds_bucket

Node authorizer: The distribution of graph operation durations in seconds.

node_authorizer_graph_actions_duration_seconds_count

Node authorizer: The count of graph operation durations in seconds.

node_authorizer_graph_actions_duration_seconds_sum

Node authorizer: The sum of graph operation durations in seconds.

pod_security_evaluations_total

The total number of pod security evaluations.

pod_security_exemptions_total

The total number of pod security exemptions.

registered_metric_total

The total number of registered metrics.

registered_metrics_total

The total number of registered metrics.

rest_client_exec_plugin_certificate_rotation_age_bucket

REST client plug-in: The distribution of certificate rotation ages in seconds.

rest_client_exec_plugin_certificate_rotation_age_count

REST client plug-in: The count of certificate rotation ages in seconds.

rest_client_exec_plugin_certificate_rotation_age_sum

REST client plug-in: The sum of certificate rotation ages in seconds.

rest_client_exec_plugin_ttl_seconds

REST client plug-in: The time to live (TTL) of the certificate in seconds.

rest_client_request_duration_seconds_bucket

The distribution of REST client request durations in seconds.

rest_client_request_duration_seconds_count

The count of REST client request durations in seconds.

rest_client_request_duration_seconds_sum

The sum of REST client request durations in seconds.

rest_client_request_latency_seconds_bucket

The distribution of REST client request latencies in seconds.

rest_client_request_size_bytes_bucket

The distribution of REST client request-body sizes in bytes.

rest_client_request_size_bytes_count

The count of REST client request-body sizes in bytes.

rest_client_request_size_bytes_sum

The sum of REST client request-body sizes in bytes.

rest_client_requests_total

The number of REST client requests.

rest_client_response_size_bytes_bucket

The distribution of REST client response-body sizes in bytes.

rest_client_response_size_bytes_count

The count of REST client response-body sizes in bytes.

rest_client_response_size_bytes_sum

The sum of REST client response-body sizes in bytes.

rest_client_transport_cache_entries

The number of transport cache entries of the REST client.

rest_client_transport_create_calls_total

The total number of transport creation calls of the REST client.

scheduler_pending_pods

Scheduler: The number of pods to be scheduled.

scheduler_pod_scheduling_attempts_bucket

Scheduler: The distribution of pod scheduling attempts.

scheduler_scheduler_cache_size

The scheduler cache size.

serviceaccount_invalid_legacy_auto_token_uses_total

The total number of uses of invalid legacy automatic service account tokens.

serviceaccount_legacy_auto_token_uses_total

The total number of uses of legacy automatic service account tokens.

serviceaccount_legacy_manual_token_uses_total

The total number of uses of legacy manual service account tokens.

serviceaccount_legacy_tokens_total

The total number of legacy service account tokens.

serviceaccount_stale_tokens_total

The total number of stale service account tokens.

serviceaccount_valid_tokens_total

The total number of valid service account tokens.

ssh_tunnel_open_count

The number of opened Secure Shell (SSH) tunnels.

ssh_tunnel_open_fail_count

The number of SSH tunnels that failed to be opened.

up

The connectivity of metric collection.

watch_cache_capacity

The capacity of the watch cache.

watch_cache_capacity_decrease_total

The total number of capacity decreases of the watch cache.

watch_cache_capacity_increase_total

The total number of capacity increases of the watch cache.

workqueue_adds_total

The total number of additions to the work queue.

workqueue_depth

The work queue depth.

workqueue_longest_running_processor_seconds

The longest running processor time in the work queue in seconds.

workqueue_queue_duration_seconds_bucket

The distribution of queueing durations in the work queue in seconds.

workqueue_queue_duration_seconds_count

The count of queueing durations in the work queue in seconds.

workqueue_queue_duration_seconds_sum

The sum of queueing durations in the work queue in seconds.

workqueue_retries_total

The total number of retries in the work queue.

workqueue_unfinished_work_seconds

The duration of unfinished work in the work queue in seconds.

workqueue_work_duration_seconds_bucket

The distribution of work durations in the work queue in seconds.

workqueue_work_duration_seconds_count

The count of work durations in the work queue in seconds.

workqueue_work_duration_seconds_sum

The sum of work durations in the work queue in seconds.
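Many metrics in this table come in _bucket, _count, and _sum triples, which together describe one Prometheus histogram. The following Python sketch is not part of the product. It shows, under stated assumptions, how the three series relate: the average is _sum divided by _count, and a quantile can be estimated from the cumulative _bucket counts by linear interpolation, similar in spirit to the PromQL histogram_quantile function. The function names and the sample numbers are hypothetical.

```python
import math


def histogram_mean(total_sum, total_count):
    """Average observation: the _sum series divided by the _count series."""
    return total_sum / total_count if total_count else math.nan


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the `le` label of a _bucket series and is assumed to
    include the +Inf bucket. Observations are assumed to be spread evenly
    inside each bucket (linear interpolation).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound or empty bucket: fall back to the
                # last known finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for apiserver_request_duration_seconds_bucket.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print("estimated p95 latency (s):", histogram_quantile(0.95, sample_buckets))
print("mean latency (s):", histogram_mean(30.0, 100))
```

In practice these calculations are normally applied to the rate of each series over a time window rather than to the raw cumulative values, so that the result reflects recent traffic only.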

Node Exporter (job name: node-exporter)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

job

The job name.

node_boot_time_seconds

The boot time of the node, as a UNIX timestamp in seconds.

node_context_switches_total

The total number of context switches on the node.

node_cpu_seconds_total

The total CPU time consumed on the node.

node_disk_io_now

The number of disk I/O operations currently in progress on the node.

node_disk_io_time_seconds_total

The total disk I/O duration of the node in seconds.

node_disk_io_time_weighted_seconds_total

The total weighted disk I/O time of the node in seconds.

node_disk_read_bytes_total

The total number of bytes read from the disk of the node.

node_disk_read_time_seconds_total

The total disk read time of the node in seconds.

node_disk_reads_completed_total

The total number of completed disk reads of the node.

node_disk_reads_merged_total

The total number of merged disk reads of the node.

node_disk_write_time_seconds_total

The total disk write time of the node in seconds.

node_disk_writes_completed_total

The total number of completed disk writes of the node.

node_disk_writes_merged_total

The total number of merged disk writes of the node.

node_disk_written_bytes_total

The total number of bytes written to the disk of the node.

node_exporter_build_info

The build information of the node exporter.

node_filefd_allocated

The number of allocated file descriptors of the node.

node_filefd_maximum

The maximum number of file descriptors of the node.

node_filesystem_avail_bytes

The available bytes of the node file system.

node_filesystem_free_bytes

The amount of free space in the file system of the node in bytes.

node_filesystem_size_bytes

The total size of the file system of the node in bytes.

node_intr_total

The total number of interrupts on the node.

node_load1

The 1-minute load on the node.

node_load15

The 15-minute load on the node.

node_load5

The 5-minute load on the node.

node_memory_MemAvailable_bytes

The size of available memory on the node (in bytes).

node_memory_MemFree_bytes

The size of free memory on the node (in bytes).

node_memory_MemTotal_bytes

The total size of memory on the node (in bytes).

node_memory_Slab_bytes

The size of Slab memory on the node (in bytes).

node_memory_SReclaimable_bytes

The size of SReclaimable memory on the node (in bytes).

node_netstat_Tcp_InErrs

The number of TCP receive errors.

node_netstat_Tcp_InSegs

The number of TCP segments received.

node_netstat_Tcp_OutSegs

The number of TCP segments sent.

node_netstat_Tcp_PassiveOpens

The number of passive TCP connections opened.

node_netstat_Tcp_RetransSegs

The number of TCP segments retransmitted.

node_network_receive_bytes_total

The total number of bytes received cumulatively.

node_network_receive_drop_total

The total number of packets dropped while receiving.

node_network_receive_errs_total

The total number of receive errors.

node_network_receive_packets_total

The total number of packets received.

node_network_transmit_bytes_total

The total number of bytes sent cumulatively.

node_network_transmit_drop_total

The total number of packets dropped while transmitting.

node_network_transmit_errs_total

The total number of send errors.

node_network_transmit_packets_total

The total number of packets sent.

node_network_up

Indicates whether the network interface is enabled.

node_processes_max_processes

The maximum number of processes.

node_processes_max_threads

The maximum number of threads.

node_processes_pids

The number of process IDs.

node_processes_state

The distribution of process states.

node_processes_threads

The number of threads.

node_schedstat_running_seconds_total

The total time in seconds that CPUs spent running processes, from scheduler statistics.

node_sockstat_TCP_alloc

The number of TCP sockets allocated.

node_sockstat_TCP_inuse

The number of TCP sockets in use.

node_sockstat_TCP_mem

The amount of memory used by TCP sockets, in pages.

node_sockstat_TCP_mem_bytes

The number of bytes of memory used by TCP sockets.

node_sockstat_TCP_tw

The number of TCP sockets in the TIME_WAIT state.

node_time_zone_offset_seconds

The time zone offset in seconds.

node_timex_offset_seconds

The time offset in seconds.

node_timex_sync_status

The synchronization status of the clock.

node_uname_info

The system information (uname).

node_vmstat_pgfault

The number of page faults in VM statistics.

node_vmstat_pgmajfault

The number of major page faults in VM statistics.

node_vmstat_pgpgin

The number of page ins in VM statistics.

node_vmstat_pgpgout

The number of page outs in VM statistics.

up

The connectivity of metric collection.
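Most Node Exporter metrics are either gauges (for example, node_memory_MemAvailable_bytes) or ever-increasing counters (names ending in _total). The following Python sketch is not part of the product. It illustrates, with hypothetical helper names and sample values, two common derivations: a utilization ratio from a pair of gauges, and a per-second rate from two samples of a counter. Treating a drop in the counter value as a reset is an assumption of this sketch.

```python
def utilization(total, available):
    """Fraction in use, for example 1 - MemAvailable / MemTotal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - available / total


def counter_rate(first, last):
    """Per-second rate between two (timestamp_seconds, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Assumed counter reset (for example, a node reboot): count only the
        # increase observed since the reset.
        delta = v1
    return delta / (t1 - t0)


# Hypothetical samples: node_memory_MemTotal_bytes and
# node_memory_MemAvailable_bytes, then node_network_receive_bytes_total
# scraped 60 seconds apart.
print("memory utilization:", utilization(16e9, 4e9))
print("receive bytes per second:", counter_rate((0, 1_000_000), (60, 1_600_000)))
```

The same utilization helper applies to node_filesystem_size_bytes with node_filesystem_avail_bytes, and the rate helper applies to any counter in this table, such as node_cpu_seconds_total or node_disk_read_bytes_total.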

kube-state-metrics (job name: _kube-state-metrics)

Metric

Description

kube_configmap_info

The information about the ConfigMap.

kube_cronjob_annotations

The annotations of the Kubernetes CronJob.

kube_cronjob_created

The creation time of the Kubernetes CronJob.

kube_cronjob_info

The information about the Kubernetes CronJob.

kube_cronjob_labels

The labels of the Kubernetes CronJob.

kube_cronjob_metadata_resource_version

The metadata resource version of the Kubernetes CronJob.

kube_cronjob_next_schedule_time

The next schedule time of the Kubernetes CronJob.

kube_cronjob_spec_failed_job_history_limit

The failed job history limit of the Kubernetes CronJob.

kube_cronjob_spec_starting_deadline_seconds

The starting deadline seconds of the Kubernetes CronJob.

kube_cronjob_spec_successful_job_history_limit

The successful job history limit of the Kubernetes CronJob.

kube_cronjob_spec_suspend

The suspend status of the Kubernetes CronJob.

kube_cronjob_status_active

The number of active jobs of the Kubernetes CronJob.

kube_cronjob_status_last_schedule_time

The last schedule time of the Kubernetes CronJob.

kube_cronjob_status_last_successful_time

The last successful execution time of the Kubernetes CronJob.

kube_daemonset_created

The creation time of the Kubernetes DaemonSet.

kube_daemonset_status_current_number_scheduled

The current number of scheduled nodes for the Kubernetes DaemonSet.

kube_daemonset_status_desired_number_scheduled

The desired number of scheduled nodes for the Kubernetes DaemonSet.

kube_daemonset_status_number_available

The number of available nodes in the Kubernetes DaemonSet.

kube_daemonset_status_number_misscheduled

The number of nodes that run a pod of the Kubernetes DaemonSet but are not supposed to (misscheduled nodes).

kube_daemonset_status_number_ready

The number of ready nodes in the Kubernetes DaemonSet.

kube_daemonset_status_number_unavailable

The number of unavailable nodes in the Kubernetes DaemonSet.

kube_daemonset_status_updated_number_scheduled

The number of updated scheduled nodes in the Kubernetes DaemonSet.

kube_daemonset_updated_number_scheduled

The number of updated scheduled nodes in the Kubernetes DaemonSet.

kube_deployment_created

The creation time of the Kubernetes Deployment.

kube_deployment_labels

The labels of the Kubernetes Deployment.

kube_deployment_metadata_generation

The metadata generation of the Kubernetes Deployment.

kube_deployment_spec_replicas

The number of replicas specified in the Kubernetes Deployment.

kube_deployment_spec_strategy_rollingupdate_max_unavailable

The maximum number of unavailable pods during rolling update of the Kubernetes Deployment.

kube_deployment_status_observed_generation

The observed generation of the Kubernetes Deployment.

kube_deployment_status_replicas

The total number of replicas in the Kubernetes Deployment.

kube_deployment_status_replicas_available

The number of available replicas in the Kubernetes Deployment.

kube_deployment_status_replicas_ready

The number of ready replicas in the Kubernetes Deployment.

kube_deployment_status_replicas_unavailable

The number of unavailable replicas in the Kubernetes Deployment.

kube_deployment_status_replicas_updated

The number of updated replicas in the Kubernetes Deployment.

kube_horizontalpodautoscaler_info

The information about the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_labels

The labels of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_metadata_generation

The metadata generation of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_max_replicas

The maximum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_min_replicas

The minimum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_target_metric

The target metrics of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_condition

The status conditions of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_current_replicas

The current number of replicas in the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_desired_replicas

The desired number of replicas in the Kubernetes HorizontalPodAutoscaler.

kube_hpa_labels

The labels of the Kubernetes HorizontalPodAutoscaler.

kube_hpa_metadata_generation

The metadata generation of the Kubernetes HorizontalPodAutoscaler.

kube_hpa_spec_max_replicas

The maximum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.

kube_hpa_spec_min_replicas

The minimum number of replicas specified in the Kubernetes HorizontalPodAutoscaler.

kube_hpa_spec_target_metric

The target metrics of the Kubernetes HorizontalPodAutoscaler.

kube_hpa_status_condition

The status conditions of the Kubernetes HorizontalPodAutoscaler.

kube_hpa_status_current_replicas

The current number of replicas in the Kubernetes HorizontalPodAutoscaler.

kube_hpa_status_desired_replicas

The desired number of replicas in the Kubernetes HorizontalPodAutoscaler.

kube_ingress_info

The information about the Ingress.

kube_job_created

The creation time of the Job.

kube_job_failed

The total number of failures for the job.

kube_job_info

The information about the Job.

kube_job_spec_completions

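The desired number of successfully completed Pods for the Job.

kube_job_status_active

The number of actively running Pods of the Job.

kube_job_status_failed

The number of Pods of the Job that reached the Failed phase.

kube_job_status_succeeded

The number of Pods of the Job that reached the Succeeded phase.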

kube_namespace_created

The creation time of the namespace.

kube_namespace_labels

The labels of the namespace.

kube_namespace_status_phase

The phase of the namespace status.

kube_node_info

The information about the node.

kube_node_labels

The labels of the node.

kube_node_spec_taint

The taint configurations of the node.

kube_node_spec_unschedulable

The unschedulable flag of the node.

kube_node_status_allocatable

The allocatable resources of the node.

kube_node_status_allocatable_cpu_cores

The allocatable CPU cores of the node.

kube_node_status_allocatable_memory_bytes

The allocatable memory bytes of the node.

kube_node_status_allocatable_pods

The allocatable number of Pods on the node.

kube_node_status_capacity

The capacity of the node.

kube_node_status_capacity_cpu_cores

The capacity CPU cores of the node.

kube_node_status_capacity_memory_bytes

The capacity memory bytes of the node.

kube_node_status_capacity_pods

The capacity number of Pods on the node.

kube_node_status_condition

The status conditions of the node.

kube_persistentvolume_status_phase

The phase of the PersistentVolume (PV) status.

kube_persistentvolumeclaim_info

The information about the PersistentVolumeClaim (PVC).

kube_persistentvolumeclaim_resource_requests_storage_bytes

The storage resource request of the PVC.

kube_persistentvolumeclaim_status_phase

The phase of the PVC status.

kube_pod_completion_time

The completion time of the Pod.

kube_pod_container_info

The information about the Pod container.

kube_pod_container_resource_limits

The resource limit of the Pod container.

kube_pod_container_resource_limits_cpu_cores

The CPU core limit of the Pod container.

kube_pod_container_resource_limits_memory_bytes

The memory byte limit of the Pod container.

kube_pod_container_resource_requests

The resource requests of the Pod container.

kube_pod_container_resource_requests_cpu_cores

The CPU core requests of the Pod container.

kube_pod_container_resource_requests_memory_bytes

The memory byte requests of the Pod container.

kube_pod_container_status_last_terminated_reason

The last termination reason of the Pod container.

kube_pod_container_status_ready

The ready status of the Pod container.

kube_pod_container_status_restarts_total

The total number of restarts for the Pod container.

kube_pod_container_status_running

The running status of the Pod container.

kube_pod_container_status_terminated

The terminated status of the Pod container.

kube_pod_container_status_terminated_reason

The termination reason of the Pod container.

kube_pod_container_status_waiting

The waiting status of the Pod container.

kube_pod_container_status_waiting_reason

The waiting reason of the Pod container.

kube_pod_created

The creation time of the Pod.

kube_pod_deletion_timestamp

The deletion timestamp of the Pod.

kube_pod_info

The information about the Pod.

kube_pod_labels

The labels of the Pod.

kube_pod_owner

The owner of the Pod.

kube_pod_start_time

The start time of the Pod.

kube_pod_status_container_ready_time

The container ready time of the Pod status.

kube_pod_status_initialized_time

The initialization completion time of the Pod status.

kube_pod_status_phase

The phase of the Pod status.

kube_pod_status_ready

The ready status of the Pod.

kube_pod_status_ready_time

The ready time of the Pod.

kube_pod_status_reason

The reason for the Pod status.

kube_pod_status_scheduled_time

The scheduling time of the Pod.

kube_pod_status_unschedulable

The unschedulable flag of the Pod.

kube_replicaset_owner

The owner of the ReplicaSet.

kube_replicaset_status_ready_replicas

The number of ready replicas in the ReplicaSet.

kube_resource_relationship

The relationships between resources.

kube_resourcequota

The resource quota.

kube_resourcequota_created

The creation time of the resource quota.

kube_secret_info

The information about the secret.

kube_service_info

The information about the service.

kube_service_spec_type

The type specification of the service.

kube_service_status_load_balancer_ingress

The load balancer ingress information of the service status.

kube_statefulset_created

The creation time of the StatefulSet.

kube_statefulset_metadata_generation

The metadata generation of the StatefulSet.

kube_statefulset_replicas

The number of replicas in the StatefulSet.

kube_statefulset_status_replicas

The number of replicas reported in the status of the StatefulSet.

kube_statefulset_status_replicas_available

The number of available replicas reported in the status of the StatefulSet.

kube_statefulset_status_replicas_ready

The number of ready replicas reported in the status of the StatefulSet.

kube_statefulset_status_replicas_updated

The number of updated replicas reported in the status of the StatefulSet.

rest_client_requests_total

The number of REST client requests.

up

The connectivity of metric collection.

workqueue_adds_total

The total number of additions to the work queue.

workqueue_depth

The work queue depth.

workqueue_queue_duration_seconds_bucket

The distribution of queue duration in seconds for the work queue.
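As an illustration of how the Deployment metrics in this table can be combined, the following is a minimal Python sketch. It assumes that the latest sample values for a single Deployment have already been fetched into a mapping keyed by metric name. The function name `rollout_complete`, the example values, and the completion rule itself are assumptions made for illustration and are not part of Managed Service for Prometheus.

```python
from typing import Mapping


def rollout_complete(samples: Mapping[str, float]) -> bool:
    """Return True if a Deployment looks fully rolled out.

    `samples` maps metric names to their latest value for one Deployment.
    The rule below is an illustrative heuristic, not an official definition.
    """
    desired = samples["kube_deployment_spec_replicas"]
    # The controller must have observed the latest revision of the spec.
    observed = (
        samples["kube_deployment_status_observed_generation"]
        >= samples["kube_deployment_metadata_generation"]
    )
    updated = samples["kube_deployment_status_replicas_updated"] == desired
    available = samples["kube_deployment_status_replicas_available"] == desired
    none_unavailable = samples["kube_deployment_status_replicas_unavailable"] == 0
    return observed and updated and available and none_unavailable


# Hypothetical sample values for a Deployment with three replicas.
example = {
    "kube_deployment_spec_replicas": 3,
    "kube_deployment_metadata_generation": 7,
    "kube_deployment_status_observed_generation": 7,
    "kube_deployment_status_replicas_updated": 3,
    "kube_deployment_status_replicas_available": 3,
    "kube_deployment_status_replicas_unavailable": 0,
}
print(rollout_complete(example))
```

A stricter or looser rule may suit your workloads better, for example tolerating a number of unavailable replicas up to the value of kube_deployment_spec_strategy_rollingupdate_max_unavailable.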

kube-events (job name: _arms/kube-event)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

eventer_events_error_total

The total number of event processing errors.

eventer_events_normal_total

The total number of normal events.

eventer_events_warning_total

The total number of warning events.

eventer_exporter_duration_milliseconds_count

The count of samples for exporter duration in milliseconds.

eventer_exporter_duration_milliseconds_sum

The sum of exporter duration in milliseconds.

eventer_manager_last_time_seconds

The last operation time of the event manager in seconds.

eventer_scraper_duration_milliseconds_count

The count of samples for scraper duration in milliseconds.

eventer_scraper_duration_milliseconds_sum

The sum of scraper duration in milliseconds.

eventer_scraper_events_total_number

The total number of events scraped.

eventer_scraper_last_time_seconds

The last execution time of the scraper in seconds.

up

The connectivity of metric collection.
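The `_sum` and `_count` pairs in this table, such as eventer_exporter_duration_milliseconds_sum and eventer_exporter_duration_milliseconds_count, can be combined into an average duration per operation. The following minimal Python sketch assumes that both series are cumulative and that you hold their values from two scrapes. The function name `average_duration_ms` and its parameters are hypothetical.

```python
from typing import Optional


def average_duration_ms(
    sum_start: float,
    count_start: float,
    sum_end: float,
    count_end: float,
) -> Optional[float]:
    """Average duration per operation between two scrapes, in milliseconds.

    Returns None when no new operations were recorded in the window,
    which also covers a counter reset between the two scrapes.
    """
    operations = count_end - count_start
    if operations <= 0:
        return None
    return (sum_end - sum_start) / operations


# Hypothetical values: 10 new operations that took 300 ms in total.
print(average_duration_ms(1200.0, 40, 1500.0, 50))
```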

CoreDNS (job name: arms-ack-coredns)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

coredns_autopath_success_count_total

The total number of successful automatic path resolutions in CoreDNS.

coredns_autopath_success_total

The total number of successful automatic path resolutions in CoreDNS.

coredns_build_info

The build information of CoreDNS.

coredns_cache_drops_total

The total number of cache drops in CoreDNS.

coredns_cache_entries

The number of cache entries in CoreDNS.

coredns_cache_evictions_total

The total number of cache evictions in CoreDNS.

coredns_cache_hits_total

The total number of cache hits in CoreDNS.

coredns_cache_misses_total

The total number of cache misses in CoreDNS.

coredns_cache_requests_total

The total number of cache requests in CoreDNS.

coredns_cache_size

The size of the cache in CoreDNS.

coredns_dns_do_requests_total

The total number of DNS requests with the DO (DNSSEC OK) bit set in CoreDNS.

coredns_dns_request_count_total

The total count of DNS requests in CoreDNS.

coredns_dns_request_duration_seconds_bucket

The histogram buckets of DNS request durations in seconds in CoreDNS.

coredns_dns_request_duration_seconds_count

The count of DNS request durations in seconds in CoreDNS.

coredns_dns_request_duration_seconds_sum

The sum of DNS request durations in seconds in CoreDNS.

coredns_dns_request_size_bytes_bucket

The histogram buckets of DNS request sizes in bytes in CoreDNS.

coredns_dns_request_size_bytes_count

The count of DNS request sizes in bytes in CoreDNS.

coredns_dns_request_size_bytes_sum

The sum of DNS request sizes in bytes in CoreDNS.

coredns_dns_request_type_count_total

The total count of DNS request types in CoreDNS.

coredns_dns_requests_total

The total number of DNS requests in CoreDNS.

coredns_dns_response_rcode_count_total

The total count of DNS response codes in CoreDNS.

coredns_dns_response_size_bytes_bucket

The histogram buckets of DNS response sizes in bytes in CoreDNS.

coredns_dns_response_size_bytes_count

The count of DNS response sizes in bytes in CoreDNS.

coredns_dns_response_size_bytes_sum

The sum of DNS response sizes in bytes in CoreDNS.

coredns_dns_responses_total

The total number of DNS responses in CoreDNS.

coredns_forward_conn_cache_hits_total

The total number of cache hits for forwarded connections in CoreDNS.

coredns_forward_conn_cache_misses_total

The total number of cache misses for forwarded connections in CoreDNS.

coredns_forward_healthcheck_broken_total

The total number of times that the health checks of all upstreams failed for forwarded connections in CoreDNS.

coredns_forward_healthcheck_failure_count_total

The total count of health check failures for forwarded connections in CoreDNS.

coredns_forward_healthcheck_failures_total

The total number of health check failures for forwarded connections in CoreDNS.

coredns_forward_max_concurrent_rejects_total

The total number of queries rejected because the maximum number of concurrent queries was reached for forwarded connections in CoreDNS.

coredns_forward_request_count_total

The total count of forwarded requests in CoreDNS.

coredns_forward_request_duration_seconds_bucket

The histogram buckets of forwarded request durations in seconds in CoreDNS.

coredns_forward_request_duration_seconds_count

The count of forwarded request durations in seconds in CoreDNS.

coredns_forward_request_duration_seconds_sum

The sum of forwarded request durations in seconds in CoreDNS.

coredns_forward_requests_total

The total number of forwarded requests in CoreDNS.

coredns_forward_response_rcode_count_total

The total count of forwarded response codes in CoreDNS.

coredns_forward_responses_total

The total number of forwarded responses in CoreDNS.

coredns_forward_sockets_open

The number of open sockets for forwarded connections in CoreDNS.

coredns_health_request_duration_seconds_bucket

The histogram buckets of health check request durations in seconds in CoreDNS.

coredns_health_request_duration_seconds_count

The count of health check request durations in seconds in CoreDNS.

coredns_health_request_duration_seconds_sum

The sum of health check request durations in seconds in CoreDNS.

coredns_health_request_failures_total

The total number of health check request failures in CoreDNS.

coredns_hosts_entries

The number of host entries in CoreDNS.

coredns_hosts_reload_timestamp_seconds

The timestamp of the last host reload in CoreDNS in seconds.

coredns_kubernetes_dns_programming_duration_seconds_bucket

The histogram buckets of Kubernetes DNS programming durations in seconds in CoreDNS.

coredns_kubernetes_dns_programming_duration_seconds_count

The count of Kubernetes DNS programming durations in seconds in CoreDNS.

coredns_kubernetes_dns_programming_duration_seconds_sum

The sum of Kubernetes DNS programming durations in seconds in CoreDNS.

coredns_local_localhost_requests_total

The total number of localhost requests in CoreDNS.

coredns_panic_count_total

The total number of panics in CoreDNS.

coredns_panics_total

The total count of panics in CoreDNS.

coredns_plugin_enabled

The enabling status of CoreDNS plugins.

coredns_reload_failed_total

The total number of reload failures in CoreDNS.

coredns_reload_version_info

The version information of CoreDNS reloads.

coredns_template_matches_total

The total number of template matches in CoreDNS.

up

The connectivity of metric collection.
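The series in this table that end in `_bucket` follow the Prometheus histogram convention: each bucket carries a cumulative count of observations at or below its upper bound. The following minimal Python sketch shows one way to estimate a quantile from such buckets by linear interpolation, similar in spirit to the PromQL `histogram_quantile` function. The function name `estimate_quantile` and the bucket values are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound, and the last pair must use math.inf as its bound.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the overflow bucket, so report
                # the highest finite bound instead of infinity.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return upper
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for coredns_dns_request_duration_seconds_bucket.
example_buckets = [(0.001, 10), (0.01, 60), (0.1, 90), (math.inf, 100)]
print(estimate_quantile(0.5, example_buckets))
```

The estimate is only as precise as the bucket boundaries allow, so results near a wide bucket should be read as approximate.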

CSI clusters (job name: k8s-csi-cluster-pv)

Metric

Description

alibaba_cloud_storage_operator_build_info

The build information about the storage operations system on Alibaba Cloud.

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

cluster_pv_detail_num_total

The total number of detailed PV information in the cluster.

cluster_pv_status_num_total

The total number of PV states in the cluster.

cluster_pvc_detail_num_total

The total number of detailed PVC information in the cluster.

cluster_pvc_status_num_total

The total number of PVC states in the cluster.

cluster_scrape_collector_duration_seconds

The duration of the cluster scrape collector in seconds.

cluster_scrape_collector_success

The number of successful scrapes by the cluster collector.

up

The connectivity of metric collection.

CSI nodes (job name: k8s-csi-node-pv)

Metric

Description

alibaba_cloud_csi_driver_build_info

The build information about the Container Storage Interface (CSI) driver.

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

cluster_scrape_collector_duration_seconds

The duration of the cluster scrape collector in seconds.

cluster_scrape_collector_success

The number of successful scrapes by the cluster collector.

container_fs_available_bytes

The available bytes of the container file system.

container_fs_inodes_free

The number of available inodes in the container file system.

container_fs_inodes_total

The total number of inodes in the container file system.

container_fs_inodes_used

The number of used inodes in the container file system.

container_fs_limit_bytes

The limit of bytes in the container file system.

container_fs_usage_bytes

The used bytes in the container file system.

ephemeral_storage_pod_available_bytes

The available ephemeral storage of the Pod in bytes.

ephemeral_storage_pod_inodes_free

The number of free inodes in the ephemeral storage of the Pod.

ephemeral_storage_pod_inodes_total

The total number of inodes in the ephemeral storage of the Pod.

ephemeral_storage_pod_inodes_used

The number of used inodes in the ephemeral storage of the Pod.

ephemeral_storage_pod_limit_bytes

The ephemeral storage limit of the Pod in bytes.

ephemeral_storage_pod_usage_bytes

The ephemeral storage used by the Pod in bytes.

node_volume_backend_posix_access_total_counter

The total counter for Portable Operating System Interface (POSIX) access to the node volume backend.

node_volume_backend_posix_getattr_total_counter

The total counter for POSIX getattr calls to the node volume backend.

node_volume_backend_posix_getmode_total_counter

The total counter for POSIX getmode operations to the node volume backend.

node_volume_backend_posix_link_total_counter

The total counter for POSIX link operations to the node volume backend.

node_volume_backend_posix_lookup_total_counter

The total counter for POSIX lookup operations to the node volume backend.

node_volume_backend_posix_mknod_total_counter

The total counter for POSIX mknod operations to the node volume backend.

node_volume_backend_posix_readdir_total_counter

The total counter for POSIX readdir operations to the node volume backend.

node_volume_backend_posix_readlink_total_counter

The total counter for POSIX readlink operations to the node volume backend.

node_volume_backend_posix_remove_total_counter

The total counter for POSIX remove operations to the node volume backend.

node_volume_backend_posix_rename_total_counter

The total counter for POSIX rename operations to the node volume backend.

node_volume_backend_posix_setattr_total_counter

The total counter for POSIX setattr operations to the node volume backend.

node_volume_backend_posix_statfs_total_counter

The total counter for POSIX statfs operations to the node volume backend.

node_volume_backend_read_bytes_total_counter

The total counter for bytes read from the node volume backend.

node_volume_backend_read_completed_total_counter

The total number of completed read requests to the node volume backend.

node_volume_backend_read_time_milliseconds_total_counter

The total milliseconds spent on reads to the node volume backend.

node_volume_backend_write_bytes_total_counter

The total number of bytes written to the node volume backend.

node_volume_backend_write_completed_total_counter

The total number of completed write requests to the node volume backend.

node_volume_backend_write_time_milliseconds_total_counter

The total milliseconds spent on writes to the node volume backend.

node_volume_capacity_bytes_available

The available capacity of the node volume in bytes.

node_volume_capacity_bytes_available_counter

The available capacity of the node volume in bytes.

node_volume_capacity_bytes_total

The total capacity of the node volume in bytes.

node_volume_capacity_bytes_total_counter

The total capacity of the node volume in bytes (counter).

node_volume_capacity_bytes_used

The used capacity of the node volume in bytes.

node_volume_capacity_bytes_used_counter

The used capacity of the node volume in bytes (counter).

node_volume_hot_spot_head_file_top

The top hot spot files in the node volume.

node_volume_hot_spot_read_file_top

The top files read in the node volume hot spots.

node_volume_hot_spot_write_file_top

The top files written in the node volume hot spots.

node_volume_inode_bytes_available_counter

The counter for available inode bytes in the node volume.

node_volume_inode_bytes_total_counter

The counter for total inode bytes in the node volume.

node_volume_inode_bytes_used_counter

The counter for used inode bytes in the node volume.

node_volume_inodes_available

The number of available inodes in the node volume.

node_volume_inodes_total

The total number of inodes in the node volume.

node_volume_inodes_used

The number of used inodes in the node volume.

node_volume_io_now

The current I/O count in the node volume.

node_volume_io_time_seconds_total

The total seconds spent on I/O in the node volume.

node_volume_oss_delete_object_total_counter

The total counter for Object Storage Service (OSS) object deletions in the node volume.

node_volume_oss_get_object_total_counter

The total counter for OSS object gets in the node volume.

node_volume_oss_head_object_total_counter

The total counter for OSS object metadata (HEAD) requests in the node volume.

node_volume_oss_post_object_total_counter

The total counter for OSS object POSTs in the node volume.

node_volume_oss_put_object_total_counter

The total counter for OSS object PUTs in the node volume.

node_volume_posix_access_total_counter

The total counter for POSIX accesses in the node volume.

node_volume_posix_chmod_total_counter

The total counter for POSIX chmod operations in the node volume.

node_volume_posix_chown_total_counter

The total counter for POSIX chown operations in the node volume.

node_volume_posix_create_total_counter

The total counter for POSIX creations in the node volume.

node_volume_posix_flush_total_counter

The total counter for POSIX flushes in the node volume.

node_volume_posix_fsync_total_counter

The total counter for POSIX fsyncs in the node volume.

node_volume_posix_mkdir_total_counter

The total counter for POSIX mkdir operations in the node volume.

node_volume_posix_open_total_counter

The total counter for POSIX opens in the node volume.

node_volume_posix_opendir_total_counter

The total counter for POSIX opendir operations in the node volume.

node_volume_posix_read_total_counter

The total counter for POSIX reads in the node volume.

node_volume_posix_readdir_total_counter

The total counter for POSIX readdir operations in the node volume.

node_volume_posix_release_total_counter

The total counter for POSIX releases in the node volume.

node_volume_posix_rename_total_counter

The total counter for POSIX renames in the node volume.

node_volume_posix_rmdir_total_counter

The total counter for POSIX rmdir operations in the node volume.

node_volume_posix_truncate_total_counter

The total counter for POSIX truncate operations in the node volume.

node_volume_posix_write_total_counter

The total counter for POSIX writes in the node volume.

node_volume_read_bytes_total

The total number of bytes read from the node volume.

node_volume_read_bytes_total_counter

The total number of bytes read from the node volume (counter).

node_volume_read_completed_total

The total number of completed read requests to the node volume.

node_volume_read_completed_total_counter

The total number of completed read requests to the node volume (counter).

node_volume_read_merged_total

The total number of merged read operations in the node volume.

node_volume_read_queue_time_milliseconds_total

The total milliseconds spent on read queue in the node volume.

node_volume_read_rtt_time_milliseconds_total

The total milliseconds spent on read round-trip time in the node volume.

node_volume_read_sent_bytes_total

The total number of bytes sent during reads in the node volume.

node_volume_read_time_milliseconds_total

The total milliseconds spent on reads in the node volume.

node_volume_read_time_milliseconds_total_counter

The total milliseconds spent on reads in the node volume (counter).

node_volume_read_timeouts_total

The total number of read timeouts in the node volume.

node_volume_read_transmissions_total

The total number of read transmissions in the node volume.

node_volume_vg_free_bytes

The free bytes in the volume group (VG) of the node volume.

node_volume_vg_size_bytes

The total bytes in the VG of the node volume.

node_volume_write_bytes_total

The total number of bytes written to the node volume.

node_volume_write_bytes_total_counter

The total number of bytes written to the node volume (counter).

node_volume_write_completed_total

The total number of completed write requests to the node volume.

node_volume_write_completed_total_counter

The total number of completed write requests to the node volume (counter).

node_volume_write_merged_total

The total number of merged write operations in the node volume.

node_volume_write_queue_time_milliseconds_total

The total milliseconds spent on write queue in the node volume.

node_volume_write_recv_bytes_total

The total number of bytes received during writes in the node volume.

node_volume_write_rtt_time_milliseconds_total

The total milliseconds spent on write round-trip time in the node volume.

node_volume_write_time_milliseconds_total

The total milliseconds spent on writes in the node volume.

node_volume_write_time_milliseconds_total_counter

The total milliseconds spent on writes in the node volume (counter).

node_volume_write_timeouts_total

The total number of write timeouts in the node volume.

node_volume_write_transmissions_total

The total number of write transmissions in the node volume.

up

The connectivity of metric collection.
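Many series in this table, such as node_volume_read_bytes_total, are described as running totals. Assuming they behave as cumulative Prometheus counters that restart from zero when the exporter restarts, the following minimal Python sketch derives the increase over a window and a per-second rate from consecutive samples. The function names `counter_increase` and `per_second_rate` are hypothetical.

```python
from typing import Sequence


def counter_increase(values: Sequence[float]) -> float:
    """Total increase across samples listed in time order.

    A drop between two samples is treated as a counter reset, so the
    new value is counted as growth from zero.
    """
    increase = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current
    return increase


def per_second_rate(values: Sequence[float], window_seconds: float) -> float:
    """Average per-second rate over a window that covers all samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return counter_increase(values) / window_seconds


# Hypothetical samples of a byte counter with one reset in the middle.
print(counter_increase([100, 150, 20, 50]))
```

Dividing the increase of node_volume_read_time_milliseconds_total by the increase of node_volume_read_completed_total over the same window gives an approximate average read latency, under the same assumption.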

GPU-Exporter (job name: gpu-exporter)

Metric

Description

DCGM_CUSTOM_ALLOCATE_MODE

The mode in which the node runs. A value of 0 indicates that no GPU Pods are running on the node. A value of 1 indicates that the GPU Pods on the current node run in an exclusive GPU mode. A value of 2 indicates that the GPU Pods on the current node run in a shared GPU mode.

DCGM_CUSTOM_CONTAINER_CP_ALLOCATED

The ratio of the GPU computing power allocated to the container to the total computing power of the GPU. The value ranges from 0 to 1. In exclusive GPU mode or in shared GPU mode in which the container requests only GPU memory, the value of this metric is 0, which indicates that the allocation of GPU computing power is unlimited. For example, if a GPU provides a total of 100 compute units (CUs) of GPU computing power and allocates 30 CUs to a container, the ratio of the GPU computing power allocated to the container is calculated by using the following formula: 30/100 = 0.3.

DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED

The amount of GPU memory allocated to the container.

DCGM_CUSTOM_DEV_FB_ALLOCATED

The ratio of the allocated GPU memory to the total memory of the GPU. The value ranges from 0 to 1.

DCGM_CUSTOM_DEV_FB_TOTAL

The total memory of the GPU.

DCGM_CUSTOM_ILLEGAL_PROCESS_DECODE_UTIL

The illegal process decode utilization.

DCGM_CUSTOM_ILLEGAL_PROCESS_ENCODE_UTIL

The illegal process encode utilization.

DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_COPY_UTIL

The memory copy utilization of illegal processes.

DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_USED

The memory used by illegal processes.

DCGM_CUSTOM_ILLEGAL_PROCESS_SM_UTIL

The SM utilization of illegal processes.

DCGM_CUSTOM_PROCESS_DECODE_UTIL

The decoder utilization of GPU threads.

DCGM_CUSTOM_PROCESS_ENCODE_UTIL

The encoder utilization of GPU threads.

DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL

The memory copy utilization of GPU threads.

DCGM_CUSTOM_PROCESS_MEM_USED

The amount of GPU memory used by GPU threads.

DCGM_CUSTOM_PROCESS_SM_UTIL

The SM utilization of GPU threads.

DCGM_FI_DEV_APP_MEM_CLOCK

The memory application clock speed.

DCGM_FI_DEV_APP_SM_CLOCK

The SM application clock speed.

DCGM_FI_DEV_BAR1_FREE

The amount of free Base Address Register 1 (BAR1) memory.

DCGM_FI_DEV_BAR1_TOTAL

The total size of device BAR1.

DCGM_FI_DEV_BAR1_USED

The amount of used BAR1 memory.

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION

The time of the violation due to board limitations.

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS

The reasons for clock throttling.

DCGM_FI_DEV_COUNT

The number of devices.

DCGM_FI_DEV_DEC_UTIL

The decoder utilization.

DCGM_FI_DEV_ENC_UTIL

The encoder utilization.

DCGM_FI_DEV_FB_FREE

The amount of free frame buffer memory.

DCGM_FI_DEV_FB_USED

The amount of used frame buffer memory. The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command.

DCGM_FI_DEV_GPU_TEMP

The GPU temperature.

DCGM_FI_DEV_GPU_UTIL

The GPU utilization within a cycle of 1 second or 1/6 second. The cycle varies based on the GPU model. A cycle is a period of time during which one or more kernel functions remain active. This metric only indicates that one or more kernel functions are occupying GPU resources. The metric does not display detailed GPU usage information.

DCGM_FI_DEV_LOW_UTIL_VIOLATION

The time of the violation due to low utilization.

DCGM_FI_DEV_MEM_CLOCK

The memory clock speed.

DCGM_FI_DEV_MEM_COPY_UTIL

The memory bandwidth utilization. For example, the maximum memory bandwidth of NVIDIA V100 is 900 GB/s. If the memory bandwidth used is 450 GB/s, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_MEMORY_TEMP

The memory temperature.

DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL

The total NVLink bandwidth.

DCGM_FI_DEV_PCIE_REPLAY_COUNTER

The PCIe replay counter.

DCGM_FI_DEV_POWER_USAGE

The power usage.

DCGM_FI_DEV_POWER_VIOLATION

The time of the violation due to power limitations.

DCGM_FI_DEV_PSTATE

The performance state (P-state) of the device.

DCGM_FI_DEV_RELIABILITY_VIOLATION

The time of the violation due to board reliability.

DCGM_FI_DEV_RETIRED_DBE

The number of pages retired due to double bit errors.

DCGM_FI_DEV_RETIRED_PENDING

The number of pages to be retired. These pages are marked as unavailable due to errors in the GPU memory.

DCGM_FI_DEV_RETIRED_SBE

The number of pages retired due to single bit errors.

DCGM_FI_DEV_SM_CLOCK

The SM clock speed.

DCGM_FI_DEV_SYNC_BOOST_VIOLATION

The time of the violation due to sync boost limitations.

DCGM_FI_DEV_THERMAL_VIOLATION

The time of the violation due to thermal limitations.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

The total energy consumed since the driver was last loaded.

DCGM_FI_DEV_VIDEO_CLOCK

The video clock speed.

DCGM_FI_DEV_XID_ERRORS

The last XID error that occurred within a period of time.

DCGM_FI_PROF_DRAM_ACTIVE

The cycle fraction for memory bandwidth utilization when sending data to device memory or receiving data from device memory.

The value is an average value within a time interval rather than an instantaneous value.

A larger value of this metric indicates higher device memory utilization.

If the value is 1 (100%), a DRAM command is executed every cycle within the entire interval. In practice, the achievable peak value is approximately 0.8 (80%).

If the value of this metric is 0.2 (20%), 20% of the cycles within the time interval are spent reading from or writing to device memory.

DCGM_FI_PROF_GR_ENGINE_ACTIVE

The percentage of time that the Graphics or Compute engines were active within a time interval. The value indicates the average across all Graphics and Compute engines. A Graphics or Compute engine is considered active when a Graphics or Compute context is bound to a thread and the Graphics or Compute context is in a busy state.

DCGM_FI_PROF_NVLINK_RX_BYTES

The TX rate of NVLink and the RX rate of NVLink. The bytes transmitted or received exclude the header.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per direction per link.
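The point that the reported rate is an interval average, independent of burstiness, can be illustrated as follows. This is a minimal sketch: the four-slice traffic patterns, the `average_rate` helper, and the decimal gigabyte are assumptions chosen for illustration.

```python
def average_rate(bytes_per_slice: list[float], interval_seconds: float) -> float:
    """Average transfer rate in bytes per second over the whole interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return sum(bytes_per_slice) / interval_seconds


GB = 1e9  # decimal gigabyte, an assumption for this sketch

# Hypothetical traffic patterns, each covering a 1-second interval in four slices.
steady = [0.25 * GB] * 4            # constant rate across the interval
bursty = [1.0 * GB, 0.0, 0.0, 0.0]  # all data sent in the first slice

steady_rate = average_rate(steady, 1.0)
bursty_rate = average_rate(bursty, 1.0)
print(steady_rate / GB, bursty_rate / GB)
```

Both patterns move 1 GB in 1 second, so both report the same average rate.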

DCGM_FI_PROF_NVLINK_TX_BYTES

The total number of bytes sent through NVLink.

DCGM_FI_PROF_PCIE_RX_BYTES

The TX rate of PCIe and the RX rate of PCIe. The bytes transmitted or received include both the header and payload.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PCIE_TX_BYTES

The TX rate of PCIe and the RX rate of PCIe. The bytes transmitted or received include both the header and payload.

The value is an average value within a time interval rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the transmission occurs at a consistent rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
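To put an observed PCIe rate in context, the per-lane figure above can be scaled by the link width. The sketch below assumes a hypothetical x16 Gen3 link; the helper names and the lane count are illustrative and do not come from DCGM.

```python
PCIE_GEN3_MB_PER_S_PER_LANE = 985  # theoretical per-lane figure quoted above


def pcie_gen3_peak_mb_per_s(lanes: int) -> int:
    """Theoretical peak bandwidth of a PCIe Gen3 link with the given lane count."""
    if lanes <= 0:
        raise ValueError("lane count must be positive")
    return PCIE_GEN3_MB_PER_S_PER_LANE * lanes


def link_utilization(observed_mb_per_s: float, lanes: int) -> float:
    """Fraction of the theoretical peak that the observed rate represents."""
    return observed_mb_per_s / pcie_gen3_peak_mb_per_s(lanes)


# Hypothetical x16 link carrying 7880 MB/s.
print(pcie_gen3_peak_mb_per_s(16), link_utilization(7880, 16))
```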

DCGM_FI_PROF_PIPE_FP16_ACTIVE

The fraction of cycles during which the FP16 (half-precision) pipeline was active.

The value is an average value within a time interval rather than an instantaneous value.

A higher value indicates higher utilization of the FP16 cores.

A value of 1 (100%) means that an FP16 instruction was executed every two cycles throughout the entire time interval (for example, on Volta-type cards).

If the value of this metric is 0.2 (20%), one of the following conditions may exist:

The FP16 core utilization of 20% of the SMs within the time interval is 100%.

The FP16 core utilization of all SMs within the time interval is 20%.

The FP16 core utilization of all SMs within 20% of the time interval is 100%.

Other conditions.
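The three conditions listed above are indistinguishable in the metric because the value is averaged over both SMs and time. The following sketch models this with a hypothetical GPU of 10 SMs and an interval split into 10 equal slices; the grid sizes and the `pipe_active` helper are assumptions for illustration only.

```python
from statistics import fmean

NUM_SMS = 10     # hypothetical SM count
NUM_SLICES = 10  # hypothetical number of equal time slices in the interval


def pipe_active(activity):
    """Average per-SM, per-slice activity into one value for the interval."""
    return fmean(value for row in activity for value in row)


# 20% of the SMs are fully busy for the whole interval; the rest are idle.
some_sms_busy = [[1.0] * NUM_SLICES if sm < 2 else [0.0] * NUM_SLICES
                 for sm in range(NUM_SMS)]
# Every SM runs at 20% utilization for the whole interval.
all_sms_partly_busy = [[0.2] * NUM_SLICES for _ in range(NUM_SMS)]
# Every SM is fully busy, but only during the first 20% of the interval.
all_sms_briefly_busy = [[1.0 if t < 2 else 0.0 for t in range(NUM_SLICES)]
                        for _ in range(NUM_SMS)]

for name, activity in [("some SMs busy", some_sms_busy),
                       ("all SMs partly busy", all_sms_partly_busy),
                       ("all SMs briefly busy", all_sms_briefly_busy)]:
    print(name, round(pipe_active(activity), 3))
```

All three patterns average to the same value, which is why the metric alone cannot tell them apart.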

DCGM_FI_PROF_PIPE_FP32_ACTIVE

The fraction of cycles during which the FMA (Fused Multiply-Add) pipeline was active. The FMA operations include both FP32 (single-precision) and integer operations.

The value is an average value within a time interval rather than an instantaneous value.

A higher value indicates higher utilization of the FP32 cores.

A value of 1 (100%) means that an FP32 instruction was executed every two cycles throughout the entire time interval (for example, on Volta-type cards).

If the value of this metric is 0.2 (20%), one of the following conditions may exist:

The FP32 core utilization of 20% of the SMs within the time interval is 100%.

The FP32 core utilization of all SMs within the time interval is 20%.

The FP32 core utilization of all SMs within 20% of the time interval is 100%.

Other conditions.

DCGM_FI_PROF_PIPE_FP64_ACTIVE

The fraction of cycles during which the FP64 (double-precision) pipeline was active.

The value is an average value within a time interval rather than an instantaneous value.

A higher value indicates higher utilization of the FP64 cores.

A value of 1 (100%) means that an FP64 instruction was executed every four cycles throughout the entire time interval (for example, on Volta-type cards).

If the value of this metric is 0.2 (20%), one of the following conditions may exist:

The FP64 core utilization of 20% of the SMs within the time interval is 100%.

The FP64 core utilization of all SMs within the time interval is 20%.

The FP64 core utilization of all SMs within 20% of the time interval is 100%.

Other conditions.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

The cycle fraction for the Tensor (HMMA/IMMA) pipe being in the Active state.

The value is an average value within a time interval rather than an instantaneous value.

A larger value of this metric indicates higher tensor core utilization.

If the value is 1 (100%), a Tensor instruction is issued every cycle within the entire interval. One instruction completes in two cycles.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:

The tensor core utilization of 20% of the SMs within the time interval is 100%.

The tensor core utilization of all SMs within the time interval is 20%.

The tensor core utilization of all SMs within 20% of the time interval is 100%.

Other conditions.

DCGM_FI_PROF_SM_ACTIVE

The ratio of cycles during which at least one warp on an SM remains active. The value is an average of all SMs and does not vary with the number of warps in a thread block.

A warp is considered active after it is scheduled and resources are allocated to it. An active warp may be computing or in a non-computing state, such as waiting for memory requests.

If the value of this metric drops below 0.5, the GPU utilization is low. To ensure high GPU utilization, make sure that the value is greater than 0.8.

Assume that a GPU has N SMs:

If a kernel function that uses N thread blocks runs on all SMs throughout the time interval, the value of this metric is 1 (100%).

If a kernel function that uses N/5 thread blocks runs throughout the time interval, the value of this metric is 0.2.

If a kernel function that uses N thread blocks runs during 20% of the cycles within the time interval, the value of this metric is 0.2.
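The 0.5 and 0.8 guidance for DCGM_FI_PROF_SM_ACTIVE can be turned into a simple check. This is a sketch only: the label strings and the `classify_sm_active` helper are hypothetical, and the "moderate" band between the two thresholds is an interpretation rather than something the text defines.

```python
LOW_THRESHOLD = 0.5   # below this, the text treats GPU utilization as low
HIGH_THRESHOLD = 0.8  # above this, the text treats GPU utilization as high


def classify_sm_active(value: float) -> str:
    """Map a DCGM_FI_PROF_SM_ACTIVE value (0.0 to 1.0) to a coarse label."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("SM activity must be between 0 and 1")
    if value < LOW_THRESHOLD:
        return "low"
    if value > HIGH_THRESHOLD:
        return "high"
    return "moderate"  # hypothetical label for the range the text leaves open


for sample in (0.2, 0.65, 0.9):
    print(sample, classify_sm_active(sample))
```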

DCGM_FI_PROF_SM_OCCUPANCY

The ratio of warps resident on an SM to the maximum number of warps that can reside on that SM, averaged over all SMs within a time interval. A higher occupancy does not necessarily indicate higher GPU utilization. A higher occupancy indicates more effective GPU utilization only in workloads where GPU memory bandwidth is the limiting factor (see DCGM_FI_PROF_DRAM_ACTIVE).
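The occupancy ratio can be sketched for a single snapshot in time as follows. The warp counts, the four-SM GPU, and the limit of 64 warps per SM are hypothetical values; the real metric additionally averages over the time interval, which this sketch omits.

```python
def sm_occupancy(resident_warps_per_sm: list[int], max_warps_per_sm: int) -> float:
    """Average, over all SMs, of resident warps divided by the per-SM maximum."""
    if max_warps_per_sm <= 0:
        raise ValueError("maximum warps per SM must be positive")
    if not resident_warps_per_sm:
        raise ValueError("at least one SM is required")
    ratios = [resident / max_warps_per_sm for resident in resident_warps_per_sm]
    return sum(ratios) / len(ratios)


# Hypothetical GPU with four SMs, each able to hold up to 64 resident warps.
occupancy = sm_occupancy([32, 16, 64, 16], 64)
print(occupancy)
```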

nvidia_gpu_allocated_num_devices

The number of allocated GPU devices. Warning: This metric will be deprecated in the future.

nvidia_gpu_memory_allocated_bytes

The full memory of GPU devices. Warning: This metric will be deprecated in the future and replaced by DCGM_CUSTOM_DEV_FB_allocated.

nvidia_gpu_sharing_memory

The memory allocated for GPU sharing. Warning: This metric will be deprecated in the future and replaced by DCGM_CUSTOM_DEV_FB_allocated.

up

The connectivity of metric collection.

Cost-Exporter (job name: alibaba-cloud-cost-exporter)

Metric

Description

deducted_by_cash_coupons

The amount deducted by cash coupons for the current instance.

deducted_by_prepaid_card

The prepaid card discount amount for the current instance.

invoice_discount

The discount amount for the current instance.

list_price

The unit price for the current instance.

node_current_price

The actual price of the current node.

node_payAsYouGo_price

The pay-as-you-go price of the current node.

node_payByPeriod_price

The subscription price of the current node.

node_spot_price

The spot price of the current node.

outstanding_amount

The outstanding amount for the current instance.

payent_amount

The cash payment amount for the current instance.

pretax_amount

The payable amount for the current instance.

pretax_gross_amount

The original amount for the current instance.

usage

The resource usage for the current instance.

up

The connectivity of metric collection.

Ingress (job name: arms-ack-ingress, ingress-ask-default)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

nginx_ingress_controller_admission_config_size

The size of the NGINX Ingress controller Admission Config.

nginx_ingress_controller_admission_render_duration

The rendering duration of the NGINX Ingress controller Admission Config.

nginx_ingress_controller_admission_render_ingresses

The number of Ingresses rendered by the NGINX Ingress controller.

nginx_ingress_controller_admission_roundtrip_duration

The round-trip processing duration of the NGINX Ingress controller.

nginx_ingress_controller_admission_tested_duration

The testing duration of the NGINX Ingress controller.

nginx_ingress_controller_admission_tested_ingresses

The number of Ingresses tested by the NGINX Ingress controller.

nginx_ingress_controller_build_info

The build information of the NGINX Ingress controller.

nginx_ingress_controller_bytes_sent_bucket

The distribution of total bytes sent by the NGINX Ingress controller.

nginx_ingress_controller_bytes_sent_count

The count of total bytes sent by the NGINX Ingress controller.

nginx_ingress_controller_bytes_sent_sum

The sum of total bytes sent by the NGINX Ingress controller.

nginx_ingress_controller_check_errors

The number of check errors in the NGINX Ingress controller.

nginx_ingress_controller_check_success

The number of successful checks in the NGINX Ingress controller.

nginx_ingress_controller_config_hash

The configuration hash of the NGINX Ingress controller.

nginx_ingress_controller_config_last_reload_successful

The success status of the last configuration reload in the NGINX Ingress controller.

nginx_ingress_controller_config_last_reload_successful_timestamp_seconds

The timestamp of the last successful configuration reload in the NGINX Ingress controller in seconds.

nginx_ingress_controller_connect_duration_seconds_bucket

The distribution of connection durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_connect_duration_seconds_count

The count of connection durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_connect_duration_seconds_sum

The sum of connection durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_errors

The number of errors in the NGINX Ingress controller.

nginx_ingress_controller_header_duration_seconds_bucket

The distribution of header processing durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_header_duration_seconds_count

The count of header processing durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_header_duration_seconds_sum

The sum of header processing durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_ingress_upstream_latency_seconds

The upstream latency in the NGINX Ingress controller in seconds.

nginx_ingress_controller_ingress_upstream_latency_seconds_count

The count of upstream latencies in the NGINX Ingress controller.

nginx_ingress_controller_ingress_upstream_latency_seconds_sum

The sum of upstream latencies in the NGINX Ingress controller.

nginx_ingress_controller_leader_election_status

The leader election status of the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_connections

The number of connections in the nginx process of the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_connections_total

The total number of connections in the nginx process of the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_cpu_seconds_total

The total CPU time, in seconds, consumed by the nginx process in the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_num_procs

The number of nginx processes in the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_oldest_start_time_seconds

The start time of the oldest nginx process in the NGINX Ingress controller, in seconds since the Unix epoch.

nginx_ingress_controller_nginx_process_read_bytes_total

The total number of bytes read by the nginx process in the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_requests_total

The total number of requests processed by the nginx process in the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_resident_memory_bytes

The resident memory size in bytes of the nginx process in the NGINX Ingress controller.

nginx_ingress_controller_nginx_process_virtual_memory_bytes

The amount of virtual memory that is used by an NGINX process in bytes.

nginx_ingress_controller_nginx_process_write_bytes_total

The total number of bytes written by the nginx process in the NGINX Ingress controller.

nginx_ingress_controller_orphan_ingress

The number of orphaned Ingresses in the NGINX Ingress controller.

nginx_ingress_controller_request_duration_seconds_bucket

The distribution of request durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_request_duration_seconds_count

The count of request durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_request_duration_seconds_sum

The sum of request durations in the NGINX Ingress controller in seconds.
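The _bucket, _count, and _sum series of a histogram are cumulative, so a typical use is to compare two scrapes and divide the change in _sum by the change in _count to obtain a mean duration for the window. The sketch below assumes hypothetical scrape values and a hypothetical helper name; it does not query Prometheus.

```python
from typing import Optional


def average_request_duration(sum_start: float, count_start: int,
                             sum_end: float, count_end: int) -> Optional[float]:
    """Mean request duration in seconds between two scrapes of _sum and _count."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No new requests in the window (or a counter reset): no meaningful average.
        return None
    return (sum_end - sum_start) / delta_count


# Hypothetical scrapes: 50 new requests took 15 seconds in total.
mean_seconds = average_request_duration(10.0, 100, 25.0, 150)
print(mean_seconds)
```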

nginx_ingress_controller_request_size_bucket

The distribution of request sizes in the NGINX Ingress controller.

nginx_ingress_controller_request_size_count

The count of request sizes in the NGINX Ingress controller.

nginx_ingress_controller_request_size_sum

The sum of request sizes in the NGINX Ingress controller.

nginx_ingress_controller_requests

The total number of requests in the NGINX Ingress controller.

nginx_ingress_controller_response_duration_seconds_bucket

The distribution of response durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_response_duration_seconds_count

The count of response durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_response_duration_seconds_sum

The sum of response durations in the NGINX Ingress controller in seconds.

nginx_ingress_controller_response_size_bucket

The distribution of response sizes in the NGINX Ingress controller.

nginx_ingress_controller_response_size_count

The count of response sizes in the NGINX Ingress controller.

nginx_ingress_controller_response_size_sum

The sum of response sizes in the NGINX Ingress controller.

nginx_ingress_controller_ssl_certificate_info

The SSL certificate information in the NGINX Ingress controller.

nginx_ingress_controller_ssl_expire_time_seconds

The expiration time of the SSL certificate in the NGINX Ingress controller in seconds.

nginx_ingress_controller_success

The number of successes in the NGINX Ingress controller.

up

The connectivity of metric collection.

Koordinator (job name: kube-system, koordlet-metrics-podmonitor, or koord-manager-metrics-service)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

koord_manager_recommender_recommendation_workload_target

The recommended specification metric for workload in the resource profiling feature.

koordlet_container_resource_limits

The limit metric for container resources.

koordlet_container_resource_requests

The request metric for container resources.

koordlet_node_priority_resource_reclaimable

The priority metric for node resources.

koordlet_node_resource_allocatable

The allocatable resource metric for the node.

slo_manager_recommender_recommendation_workload_target

The resource specifications that are recommended based on the workload by the resource profiling feature. This metric is discontinued.

up

The connectivity of metric collection.

ETCD (job name: etcd)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

etcd_cluster_version

The version of the cluster.

etcd_debugging_auth_revision

The authentication revision number for ETCD debugging.

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket

The distribution of ETCD debugging disk backend commit rebalance duration in seconds.

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_count

The count of ETCD debugging disk backend commit rebalance duration in seconds.

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_sum

The sum of ETCD debugging disk backend commit rebalance duration in seconds.

etcd_debugging_disk_backend_commit_spill_duration_seconds_bucket

The distribution of ETCD debugging disk backend commit spill duration.

etcd_debugging_disk_backend_commit_spill_duration_seconds_count

The count of ETCD debugging disk backend commit spill duration.

etcd_debugging_disk_backend_commit_spill_duration_seconds_sum

The sum of ETCD debugging disk backend commit spill duration.

etcd_debugging_disk_backend_commit_write_duration_seconds_bucket

The distribution of ETCD debugging disk backend commit write duration in seconds.

etcd_debugging_disk_backend_commit_write_duration_seconds_count

The count of ETCD debugging disk backend commit write duration in seconds.

etcd_debugging_disk_backend_commit_write_duration_seconds_sum

The sum of ETCD debugging disk backend commit write duration in seconds.

etcd_debugging_lease_granted_total

The total number of lease grants in ETCD debugging.

etcd_debugging_lease_renewed_total

The total number of lease renewals in ETCD debugging.

etcd_debugging_lease_revoked_total

The total number of lease revocations in ETCD debugging.

etcd_debugging_lease_ttl_total_bucket

The distribution of lease TTLs in ETCD debugging.

etcd_debugging_lease_ttl_total_count

The count of lease TTLs in ETCD debugging.

etcd_debugging_lease_ttl_total_sum

The sum of lease TTLs in ETCD debugging.

etcd_debugging_mvcc_compact_revision

The compaction revision number for ETCD debugging MVCC.

etcd_debugging_mvcc_current_revision

The current revision number for ETCD debugging MVCC.

etcd_debugging_mvcc_db_compaction_keys_total

The total number of keys compacted in the ETCD debugging MVCC database.

etcd_debugging_mvcc_db_compaction_last

The last compaction time for the ETCD debugging MVCC database.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket

The distribution of MVCC database compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count

The count of MVCC database compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum

The sum of MVCC database compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket

The distribution of MVCC database compaction total durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count

The count of MVCC database compaction total durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sum

The sum of MVCC database compaction total durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_db_total_size_in_bytes

The total size of the MVCC database in bytes for ETCD debugging.

etcd_debugging_mvcc_delete_total

The total number of delete operations in ETCD debugging MVCC.

etcd_debugging_mvcc_events_total

The total number of events in ETCD debugging.

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket

The distribution of MVCC index compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_count

The count of MVCC index compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sum

The sum of MVCC index compaction pause durations in milliseconds for ETCD debugging.

etcd_debugging_mvcc_keys_total

The total number of keys in ETCD debugging MVCC.

etcd_debugging_mvcc_pending_events_total

The total number of pending events in ETCD debugging MVCC.

etcd_debugging_mvcc_put_total

The total number of put operations in ETCD debugging MVCC.

etcd_debugging_mvcc_range_total

The total number of range queries in ETCD MVCC.

etcd_debugging_mvcc_slow_watcher_total

The total number of slow watchers in ETCD debugging.

etcd_debugging_mvcc_total_put_size_in_bytes

The total size of MVCC puts in bytes for ETCD debugging.

etcd_debugging_mvcc_txn_total

The total number of MVCC transactions in ETCD debugging.

etcd_debugging_mvcc_watch_stream_total

The total number of watch streams in ETCD debugging.

etcd_debugging_mvcc_watcher_total

The total number of watchers in ETCD debugging.

etcd_debugging_server_lease_expired_total

The total number of expired leases in ETCD debugging.

etcd_debugging_snap_save_marshalling_duration_seconds_bucket

The distribution of snapshot save marshalling durations in seconds for ETCD debugging.

etcd_debugging_snap_save_marshalling_duration_seconds_count

The count of snapshot save marshalling durations in seconds for ETCD debugging.

etcd_debugging_snap_save_marshalling_duration_seconds_sum

The sum of snapshot save marshalling durations in seconds for ETCD debugging.

etcd_debugging_snap_save_total_duration_seconds_bucket

The distribution of snapshot save durations in seconds for ETCD debugging.

etcd_debugging_snap_save_total_duration_seconds_count

The count of snapshot save durations in seconds for ETCD debugging.

etcd_debugging_snap_save_total_duration_seconds_sum

The sum of snapshot save durations in seconds for ETCD debugging.

etcd_debugging_store_expires_total

The total number of expired items in ETCD debugging storage.

etcd_debugging_store_reads_total

The total number of reads in ETCD debugging storage.

etcd_debugging_store_watch_requests_total

The total number of watch requests in ETCD debugging storage.

etcd_debugging_store_watchers

The total number of watchers in ETCD debugging storage.

etcd_debugging_store_writes_total

The total number of writes in ETCD debugging storage.

etcd_disk_backend_commit_duration_seconds_bucket

The distribution of disk backend commit durations in seconds for ETCD.

etcd_disk_backend_commit_duration_seconds_count

The count of disk backend commit durations in seconds for ETCD.

etcd_disk_backend_commit_duration_seconds_sum

The sum of disk backend commit durations in seconds for ETCD.

etcd_disk_backend_defrag_duration_seconds_bucket

The distribution of disk backend defragmentation durations in seconds for ETCD.

etcd_disk_backend_defrag_duration_seconds_count

The count of disk backend defragmentation durations in seconds for ETCD.

etcd_disk_backend_defrag_duration_seconds_sum

The sum of disk backend defragmentation durations in seconds for ETCD.

etcd_disk_backend_snapshot_duration_seconds_bucket

The distribution of disk backend snapshot durations in seconds for ETCD.

etcd_disk_backend_snapshot_duration_seconds_count

The count of disk backend snapshot durations in seconds for ETCD.

etcd_disk_backend_snapshot_duration_seconds_sum

The sum of disk backend snapshot durations in seconds for ETCD.

etcd_disk_defrag_inflight

The number of ongoing disk defragmentations in ETCD.

etcd_disk_wal_fsync_duration_seconds_bucket

The distribution of WAL sync durations in seconds for ETCD disk.

etcd_disk_wal_fsync_duration_seconds_count

The count of WAL sync durations in seconds for ETCD disk.

etcd_disk_wal_fsync_duration_seconds_sum

The sum of WAL sync durations in seconds for ETCD disk.

etcd_disk_wal_write_bytes_total

The total number of bytes written to the WAL in ETCD disk.

etcd_grpc_proxy_cache_hits_total

The total number of cache hits in the ETCD gRPC proxy.

etcd_grpc_proxy_cache_keys_total

The total number of cache keys in the ETCD gRPC proxy.

etcd_grpc_proxy_cache_misses_total

The total number of cache misses in the ETCD gRPC proxy.

etcd_grpc_proxy_events_coalescing_total

The total number of event coalescings in the ETCD gRPC proxy.

etcd_grpc_proxy_watchers_coalescing_total

The total number of watcher coalescings in the ETCD gRPC proxy.

etcd_mvcc_db_open_read_transactions

The number of open read transactions in the ETCD MVCC database.

etcd_mvcc_db_total_size_in_bytes

The total size of the MVCC database in bytes for ETCD.

etcd_mvcc_db_total_size_in_use_in_bytes

The total size in use of the MVCC database in bytes for ETCD.

etcd_mvcc_delete_total

The total number of deletes in ETCD MVCC.

etcd_mvcc_hash_duration_seconds_bucket

The distribution of MVCC hash durations in seconds for ETCD.

etcd_mvcc_hash_duration_seconds_count

The count of MVCC hash durations in seconds for ETCD.

etcd_mvcc_hash_duration_seconds_sum

The sum of MVCC hash durations in seconds for ETCD.

etcd_mvcc_hash_rev_duration_seconds_bucket

The distribution of MVCC hash revision durations in seconds for ETCD.

etcd_mvcc_hash_rev_duration_seconds_count

The count of MVCC hash revision durations in seconds for ETCD.

etcd_mvcc_hash_rev_duration_seconds_sum

The sum of MVCC hash revision durations in seconds for ETCD.

etcd_mvcc_put_total

The total number of put operations in ETCD MVCC.

etcd_mvcc_range_total

The total number of range queries in ETCD MVCC.

etcd_mvcc_txn_total

The total number of MVCC transactions in ETCD.

etcd_network_active_peers

The number of active peers in the ETCD network.

etcd_network_client_grpc_received_bytes_total

The total number of bytes received by the ETCD network client via gRPC.

etcd_network_client_grpc_sent_bytes_total

The total number of bytes sent by the ETCD network client via gRPC.

etcd_network_disconnected_peers_total

The total number of disconnected peers in the ETCD network.

etcd_network_peer_received_bytes_total

The total number of bytes received by the ETCD network peer.

etcd_network_peer_received_failures_total

The total number of receive failures in the ETCD network peer.

etcd_network_peer_round_trip_time_seconds_bucket

The distribution of round trip times for the ETCD network peer in seconds.

etcd_network_peer_round_trip_time_seconds_count

The count of round trip times for the ETCD network peer in seconds.

etcd_network_peer_round_trip_time_seconds_sum

The sum of round trip times for the ETCD network peer in seconds.

etcd_network_peer_sent_bytes_total

The total number of bytes sent by the ETCD network peer.

etcd_network_peer_sent_failures_total

The total number of send failures by the ETCD network peer.

etcd_network_server_stream_failures_total

The total number of stream failures in the ETCD network server.

etcd_network_snapshot_receive_inflights_total

The number of concurrent snapshot receive requests in the ETCD network.

etcd_network_snapshot_receive_success

The number of successful snapshot receives in the ETCD network.

etcd_network_snapshot_receive_total_duration_seconds_bucket

The distribution of snapshot receive durations in seconds for the ETCD network.

etcd_network_snapshot_receive_total_duration_seconds_count

The count of snapshot receive durations in seconds for the ETCD network.

etcd_network_snapshot_receive_total_duration_seconds_sum

The sum of snapshot receive durations in seconds for the ETCD network.

etcd_network_snapshot_send_inflights_total

The number of concurrent snapshot send requests in the ETCD network.

etcd_network_snapshot_send_success

The number of successful snapshot sends in the ETCD network.

etcd_network_snapshot_send_total_duration_seconds_bucket

The distribution of snapshot send durations in seconds for the ETCD network.

etcd_network_snapshot_send_total_duration_seconds_count

The count of snapshot send durations in seconds for the ETCD network.

etcd_network_snapshot_send_total_duration_seconds_sum

The sum of snapshot send durations in seconds for the ETCD network.

etcd_server_apply_duration_seconds_bucket

The distribution of apply durations in seconds for the ETCD server.

etcd_server_apply_duration_seconds_count

The count of apply durations in seconds for the ETCD server.

etcd_server_apply_duration_seconds_sum

The sum of apply durations in seconds for the ETCD server.

etcd_server_client_requests_total

The total number of client requests to the ETCD server.

etcd_server_go_version

The Go version of the ETCD server.

etcd_server_has_leader

Indicates whether a leader exists in the ETCD server.

etcd_server_health_failures

The number of health check failures in the ETCD server.

etcd_server_health_success

The number of successful health checks in the ETCD server.

etcd_server_heartbeat_send_failures_total

The total number of heartbeat send failures in the ETCD server.

etcd_server_id

The ID of the ETCD server.

etcd_server_is_leader

Indicates whether the ETCD server is a leader.

etcd_server_is_learner

Indicates whether the ETCD server is a learner.

etcd_server_leader_changes_seen_total

The total number of leader changes witnessed by the ETCD server.

etcd_server_learner_promote_successes

The number of successful learner promotions in the ETCD server.

etcd_server_proposals_applied_total

The total number of applied proposals in the ETCD server.

etcd_server_proposals_committed_total

The total number of committed proposals in the ETCD server.

etcd_server_proposals_failed_total

The total number of failed proposals in the ETCD server.

etcd_server_proposals_pending

The current number of pending proposals in the ETCD server.

etcd_server_quota_backend_bytes

The backend storage quota in bytes for the ETCD server.
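One way to use etcd_server_quota_backend_bytes is to compare it with etcd_mvcc_db_total_size_in_bytes to see how much of the backend quota is consumed. The sketch below uses hypothetical sample values and a hypothetical helper name, and it does not fetch the metrics itself.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def backend_quota_usage(db_total_size_bytes: int, quota_backend_bytes: int) -> float:
    """Fraction of the ETCD backend quota occupied by the database."""
    if quota_backend_bytes <= 0:
        raise ValueError("quota must be positive")
    return db_total_size_bytes / quota_backend_bytes


# Hypothetical samples: a 1 GiB database under a 2 GiB quota.
usage = backend_quota_usage(1 * GIB, 2 * GIB)
print(f"{usage:.0%}")
```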

etcd_server_read_indexes_failed_total

The total number of read index failures in the ETCD server.

etcd_server_slow_apply_total

The total number of slow apply requests in the ETCD server.

etcd_server_slow_read_indexes_total

The total number of slow read indexes in the ETCD server.

etcd_server_snapshot_apply_in_progress_total

The total number of snapshots being applied in the ETCD server.

etcd_server_version

The version of the ETCD server.

etcd_snap_db_fsync_duration_seconds_bucket

The distribution of ETCD snapshot database fsync durations in seconds.

etcd_snap_db_fsync_duration_seconds_count

The count of ETCD snapshot database fsync durations in seconds.

etcd_snap_db_fsync_duration_seconds_sum

The sum of ETCD snapshot database fsync durations in seconds.

etcd_snap_db_save_total_duration_seconds_bucket

The distribution of ETCD snapshot database save durations in seconds.

etcd_snap_db_save_total_duration_seconds_count

The count of ETCD snapshot database save durations in seconds.

etcd_snap_db_save_total_duration_seconds_sum

The sum of ETCD snapshot database save durations in seconds.

etcd_snap_fsync_duration_seconds_bucket

The distribution of ETCD snapshot fsync durations in seconds.

etcd_snap_fsync_duration_seconds_count

The count of ETCD snapshot fsync durations in seconds.

etcd_snap_fsync_duration_seconds_sum

The sum of ETCD snapshot fsync durations in seconds.

grpc_server_handled_total

The total number of requests handled by the gRPC server.

grpc_server_msg_received_total

The total number of RPC stream messages received by the gRPC server.

grpc_server_msg_sent_total

The total number of RPC stream messages sent by the gRPC server.

grpc_server_started_total

The total number of RPCs started on the gRPC server.

os_fd_limit

The file descriptor limit of the process.

os_fd_used

The number of file descriptors in use by the process.

up

Indicates whether the scrape target is reachable. Valid values: 1 (the last scrape succeeded) and 0 (the last scrape failed).
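Several of the ETCD metrics above are most useful when combined. The following minimal Python sketch shows two common derivations: the mean duration from a histogram's `_sum` and `_count` samples, and the share of the file descriptor limit in use. The function names `mean_duration` and `fd_utilization` and all sample values are hypothetical and are used for illustration only. They are not part of Managed Service for Prometheus.

```python
from typing import Optional


def mean_duration(sum_seconds: float, count: float) -> Optional[float]:
    """Average observation, from a histogram's _sum and _count samples."""
    if count <= 0:
        return None  # No observations yet, so the mean is undefined.
    return sum_seconds / count


def fd_utilization(used: float, limit: float) -> Optional[float]:
    """Fraction of the file descriptor limit in use (os_fd_used / os_fd_limit)."""
    if limit <= 0:
        return None
    return used / limit


# Hypothetical sample values, not real measurements.
samples = {
    "etcd_snap_fsync_duration_seconds_sum": 1.5,
    "etcd_snap_fsync_duration_seconds_count": 6,
    "os_fd_used": 128,
    "os_fd_limit": 1024,
}

print(mean_duration(
    samples["etcd_snap_fsync_duration_seconds_sum"],
    samples["etcd_snap_fsync_duration_seconds_count"],
))
print(fd_utilization(samples["os_fd_used"], samples["os_fd_limit"]))
```

In practice, the `_sum` and `_count` values are counters, so a dashboard would usually divide their increases over a time window rather than the raw totals.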

Scheduler (job name: ack-scheduler)

Metric

Description

aggregator_discovery_aggregation_count_total

The count of discovery aggregations performed by the aggregator.

aliyun_prometheus_agent_append_duration_seconds

The duration of the Prometheus agent append operations in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the Prometheus agent collection jobs.

aliyun_prometheus_agent_scrape_custom_error

The number of custom collection errors of the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Prometheus agent per target.

aliyun_prometheus_agent_target_info

The target information of the Prometheus agent.

apiserver_audit_event_total

The total number of APIServer audit events.

apiserver_audit_requests_rejected_total

The total number of APIServer audit request rejections.

apiserver_client_certificate_expiration_seconds_bucket

The distribution of remaining seconds until APIServer client certificate expiration.

apiserver_client_certificate_expiration_seconds_count

The count of remaining seconds until APIServer client certificate expiration.

apiserver_client_certificate_expiration_seconds_sum

The sum of remaining seconds until APIServer client certificate expiration.

apiserver_delegated_authn_request_duration_seconds_bucket

The distribution of delegated authentication request durations in seconds for the APIServer.

apiserver_delegated_authn_request_duration_seconds_count

The count of delegated authentication request durations in seconds for the APIServer.

apiserver_delegated_authn_request_duration_seconds_sum

The sum of delegated authentication request durations in seconds for the APIServer.

apiserver_delegated_authn_request_total

The total number of delegated authentication requests for the APIServer.

apiserver_delegated_authz_request_duration_seconds_bucket

The distribution of delegated authorization request durations in seconds for the APIServer.

apiserver_delegated_authz_request_duration_seconds_count

The count of delegated authorization request durations in seconds for the APIServer.

apiserver_delegated_authz_request_duration_seconds_sum

The sum of delegated authorization request durations in seconds for the APIServer.

apiserver_delegated_authz_request_total

The total number of delegated authorization requests to the API server.

apiserver_encryption_config_controller_automatic_reload_failures_total

The total number of automatic reload failures for the APIServer encryption configuration controller.

apiserver_encryption_config_controller_automatic_reload_success_total

The total number of successful automatic reloads for the APIServer encryption configuration controller.

apiserver_envelope_encryption_dek_cache_fill_percent

The fill percentage of the data encryption key (DEK) cache used for envelope encryption in the APIServer.

apiserver_storage_data_key_generation_duration_seconds_bucket

The distribution of data key generation durations for the APIServer storage.

apiserver_storage_data_key_generation_duration_seconds_count

The count of data key generation durations for the APIServer storage.

apiserver_storage_data_key_generation_duration_seconds_sum

The sum of data key generation durations for the APIServer storage.

apiserver_storage_data_key_generation_failures_total

The total number of data key generation failures for the APIServer storage.

apiserver_storage_envelope_transformation_cache_misses_total

The total number of envelope transformation cache misses for the APIServer storage.

apiserver_webhooks_x509_insecure_sha1_total

The total number of requests to APIServer webhooks whose serving certificates use insecure SHA-1 signatures.

apiserver_webhooks_x509_missing_san_total

The total number of requests to APIServer webhooks whose serving certificates are missing the Subject Alternative Name (SAN) extension.

authenticated_user_requests

The number of authenticated user requests.

authentication_attempts

The number of authentication attempts.

authentication_duration_seconds_bucket

The distribution of authentication durations in seconds.

authentication_duration_seconds_count

The count of authentication durations in seconds.

authentication_duration_seconds_sum

The sum of authentication durations in seconds.

authentication_token_cache_active_fetch_count

The count of active fetches for the authentication token cache.

authentication_token_cache_fetch_total

The total number of fetches for the authentication token cache.

authentication_token_cache_request_duration_seconds_bucket

The distribution of request durations in seconds for the authentication token cache.

authentication_token_cache_request_duration_seconds_count

The count of request durations in seconds for the authentication token cache.

authentication_token_cache_request_duration_seconds_sum

The sum of request durations in seconds for the authentication token cache.

authentication_token_cache_request_total

The total number of requests for the authentication token cache.

authorization_attempts_total

The total number of authorization attempts.

authorization_duration_seconds_bucket

The distribution of authorization durations in seconds.

authorization_duration_seconds_count

The count of authorization durations in seconds.

authorization_duration_seconds_sum

The sum of authorization durations in seconds.

cardinality_enforcement_unexpected_categorizations_total

The total number of unexpected categorizations during cardinality enforcement.

kubernetes_build_info

The Kubernetes build information.

kubernetes_feature_enabled

Indicates whether each Kubernetes feature gate is enabled.

leader_election_master_status

Indicates whether the reporting instance holds the leader election lease. Valid values: 1 (leader) and 0 (backup).

registered_metric_total

The total number of registered metrics.

registered_metrics_total

The total number of registered metrics.

rest_client_exec_plugin_certificate_rotation_age_bucket

The distribution of certificate rotation ages for the REST client exec plugin.

rest_client_exec_plugin_certificate_rotation_age_count

The count of certificate rotation ages for the REST client exec plugin.

rest_client_exec_plugin_certificate_rotation_age_sum

The sum of certificate rotation ages for the REST client exec plugin.

rest_client_rate_limiter_duration_seconds_bucket

The distribution of rate limiter durations in seconds for the REST client.

rest_client_rate_limiter_duration_seconds_count

The count of rate limiter durations in seconds for the REST client.

rest_client_rate_limiter_duration_seconds_sum

The sum of rate limiter durations in seconds for the REST client.

rest_client_request_duration_seconds_bucket

The distribution of request durations in seconds for the REST client.

rest_client_request_duration_seconds_count

The count of request durations in seconds for the REST client.

rest_client_request_duration_seconds_sum

The sum of request durations in seconds for the REST client.

rest_client_request_retries_total

The total number of request retries for the REST client.

rest_client_request_size_bytes_bucket

The distribution of request sizes in bytes for the REST client.

rest_client_request_size_bytes_count

The count of request sizes in bytes for the REST client.

rest_client_request_size_bytes_sum

The sum of request sizes in bytes for the REST client.

rest_client_requests_total

The total number of requests sent by the REST client.

rest_client_response_size_bytes_bucket

The distribution of response sizes in bytes for the REST client.

rest_client_response_size_bytes_count

The count of response sizes in bytes for the REST client.

rest_client_response_size_bytes_sum

The sum of response sizes in bytes for the REST client.

rest_client_transport_cache_entries

The number of transport cache entries for the REST client.

rest_client_transport_create_calls_total

The total number of transport create calls for the REST client.

scheduler_binding_duration_seconds_bucket

The distribution of binding durations in seconds for the scheduler.

scheduler_binding_duration_seconds_count

The count of binding durations in seconds for the scheduler.

scheduler_binding_duration_seconds_sum

The sum of binding durations in seconds for the scheduler.

scheduler_e2e_scheduling_duration_seconds_bucket

The distribution of end-to-end scheduling durations in seconds for the scheduler.

scheduler_e2e_scheduling_duration_seconds_count

The count of end-to-end scheduling durations in seconds for the scheduler.

scheduler_e2e_scheduling_duration_seconds_sum

The sum of end-to-end scheduling durations in seconds for the scheduler.

scheduler_framework_extension_point_duration_seconds_bucket

The distribution of extension point durations in seconds for the scheduler framework.

scheduler_framework_extension_point_duration_seconds_count

The count of extension point durations in seconds for the scheduler framework.

scheduler_framework_extension_point_duration_seconds_sum

The sum of extension point durations in seconds for the scheduler framework.

scheduler_goroutines

The number of goroutines for the scheduler.

scheduler_pending_pods

The number of pending pods for the scheduler.

scheduler_plugin_evaluation_total

The total number of plugin evaluations for the scheduler.

scheduler_plugin_execution_duration_seconds_bucket

The distribution of execution durations in seconds for the scheduler plugins.

scheduler_plugin_execution_duration_seconds_count

The count of execution durations in seconds for the scheduler plugins.

scheduler_plugin_execution_duration_seconds_sum

The sum of execution durations in seconds for the scheduler plugins.

scheduler_pod_preemption_victims_bucket

The distribution of preemption victims for the scheduler.

scheduler_pod_preemption_victims_count

The count of preemption victims for the scheduler.

scheduler_pod_preemption_victims_sum

The sum of preemption victims for the scheduler.

scheduler_pod_scheduling_attempts_bucket

The distribution of pod scheduling attempts for the scheduler.

scheduler_pod_scheduling_attempts_count

The count of pod scheduling attempts for the scheduler.

scheduler_pod_scheduling_attempts_sum

The sum of pod scheduling attempts for the scheduler.

scheduler_pod_scheduling_duration_seconds_bucket

The distribution of pod scheduling durations in seconds for the scheduler.

scheduler_pod_scheduling_duration_seconds_count

The count of pod scheduling durations in seconds for the scheduler.

scheduler_pod_scheduling_duration_seconds_sum

The sum of pod scheduling durations in seconds for the scheduler.

scheduler_pod_scheduling_sli_duration_seconds_bucket

The distribution of service level indicator (SLI) durations in seconds for pod scheduling.

scheduler_pod_scheduling_sli_duration_seconds_count

The count of service level indicator (SLI) durations in seconds for pod scheduling.

scheduler_pod_scheduling_sli_duration_seconds_sum

The sum of service level indicator (SLI) durations in seconds for pod scheduling.

scheduler_preemption_attempts_total

The total number of preemption attempts for the scheduler.

scheduler_preemption_victims_bucket

The distribution of preemption victims for the scheduler.

scheduler_preemption_victims_count

The count of preemption victims for the scheduler.

scheduler_preemption_victims_sum

The sum of preemption victims for the scheduler.

scheduler_queue_incoming_pods_total

The total number of incoming pods for the scheduler.

scheduler_schedule_attempts_total

The total number of scheduling attempts for the scheduler.

scheduler_scheduler_cache_size

The scheduler cache size.

scheduler_scheduler_goroutines

The number of goroutines for the scheduler.

scheduler_scheduling_algorithm_duration_seconds_bucket

The distribution of scheduling algorithm durations in seconds.

scheduler_scheduling_algorithm_duration_seconds_count

The count of scheduling algorithm durations in seconds.

scheduler_scheduling_algorithm_duration_seconds_sum

The sum of scheduling algorithm durations in seconds.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_bucket

The distribution of predicate evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_count

The count of predicate evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_sum

The sum of predicate evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_bucket

The distribution of preemption evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_count

The count of preemption evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_sum

The sum of preemption evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_priority_evaluation_seconds_bucket

The distribution of priority evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_priority_evaluation_seconds_count

The count of priority evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_algorithm_priority_evaluation_seconds_sum

The sum of priority evaluation durations in seconds for the scheduling algorithm.

scheduler_scheduling_attempt_duration_seconds_bucket

The distribution of scheduling attempt durations.

scheduler_scheduling_attempt_duration_seconds_count

The count of scheduling attempt durations.

scheduler_scheduling_attempt_duration_seconds_sum

The sum of scheduling attempt durations.

scheduler_scheduling_duration_seconds

The scheduling durations in seconds, reported as summary quantiles.

scheduler_scheduling_duration_seconds_count

The count of scheduling durations in seconds.

scheduler_scheduling_duration_seconds_sum

The sum of scheduling durations in seconds.

scheduler_total_preemption_attempts

The total number of preemption attempts by the scheduler.

scheduler_unschedulable_pods

The number of pods that the scheduler cannot schedule.

scheduler_volume_scheduling_duration_seconds_bucket

The distribution of volume scheduling durations in seconds.

scheduler_volume_scheduling_duration_seconds_count

The count of volume scheduling durations in seconds.

scheduler_volume_scheduling_duration_seconds_sum

The sum of volume scheduling durations in seconds.

scheduler_volume_scheduling_stage_error_total

The total number of errors returned during the volume scheduling stages.

scrape_duration_seconds

The scrape duration in seconds.

scrape_samples_post_metric_relabeling

The number of scraped samples after metric relabeling.

scrape_samples_scraped

The number of scraped samples.

scrape_series_added

The number of new series added during the scrape.

up

Indicates whether the scrape target is reachable. Valid values: 1 (the last scrape succeeded) and 0 (the last scrape failed).

workqueue_adds_total

The total number of additions to the work queue.

workqueue_depth

The work queue depth.

workqueue_longest_running_processor_seconds

The longest running processor duration in seconds for the work queue.

workqueue_queue_duration_seconds_bucket

The distribution of queue durations in seconds for the work queue.

workqueue_queue_duration_seconds_count

The count of queue durations in seconds for the work queue.

workqueue_queue_duration_seconds_sum

The sum of queue durations in seconds for the work queue.

workqueue_retries_total

The total number of retries in the work queue.

workqueue_unfinished_work_seconds

The unfinished work duration in seconds for the work queue.

workqueue_work_duration_seconds_bucket

The distribution of work durations in seconds for the work queue.

workqueue_work_duration_seconds_count

The count of work durations in seconds for the work queue.

workqueue_work_duration_seconds_sum

The sum of work durations in seconds for the work queue.
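Many of the scheduler metrics above are histograms exposed as `_bucket`, `_count`, and `_sum` series. As an illustration of how the `_bucket` series can be read, the following Python sketch estimates a quantile from cumulative bucket counts by using linear interpolation within a bucket, which is similar in spirit to the PromQL `histogram_quantile` function. This is a simplified sketch, not the Prometheus implementation. The function name `estimate_quantile` and the bucket values are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper bound `le`, cumulative count) pairs.

    The list must include the +Inf bucket. Returns NaN if it is missing
    or if the histogram has no observations.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # Position of the requested quantile among observations.
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # The quantile falls in the +Inf bucket or in an empty bucket:
                # fall back to the previous bucket's upper bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets for a duration histogram, in seconds.
example_buckets = [(0.25, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, example_buckets))
```

The estimate is only as precise as the bucket boundaries allow, so the result should be treated as an approximation of the true quantile.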

References