What are the basic metrics for container clusters supported by Managed Service for Prometheus - Application Real-Time Monitoring Service

This topic describes the basic metrics for container clusters that are supported by Managed Service for Prometheus.

Important

Billing for Managed Service for Prometheus is based on the data write volume or the number of reported data points. Metrics are divided into two types:
- Basic metrics: Managed Service for Prometheus provides free data reporting and writing for basic metrics collected from Alibaba Cloud container services, such as Container Service for Kubernetes (ACK), ACS, ASK, ACK One, and ACK Edge. This benefit does not apply to other types of container clusters.
- Custom metrics: Any metric that is not a basic metric is a custom metric. Billing for custom metrics started on January 6, 2020.

Starting from 00:00:00 (UTC+8) on November 12, 2024, Managed Service for Prometheus will adjust the scope of basic metrics collected from Alibaba Cloud container service clusters. The adjusted metric scope is described below.

Note that the scope of basic metrics collected by default for container clusters is limited to the metrics described in this topic.

Container cluster metrics outside this scope are custom metrics and are subject to charges. For more information about billing, see Billing of Prometheus instances.
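
As a rough illustration of the basic/custom distinction above, the following Python sketch partitions scraped metric names by billing type. The `BASIC_METRICS` set is a hand-copied, deliberately incomplete subset of the names listed in this topic, and `split_by_billing_type` and `my_app_requests_total` are hypothetical names used only for illustration. The sketch matches on metric name alone and ignores job names and cluster types, so it should not be read as how billing is actually evaluated.

```python
# Hypothetical, incomplete subset of the basic metric names listed in this topic.
BASIC_METRICS = frozenset({
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "DCGM_FI_DEV_GPU_UTIL",
    "up",
})


def split_by_billing_type(metric_names):
    """Partition metric names into (basic, custom) lists, preserving input order."""
    basic, custom = [], []
    for name in metric_names:
        # Anything not in the basic set is treated as a custom (billable) metric.
        (basic if name in BASIC_METRICS else custom).append(name)
    return basic, custom


# "my_app_requests_total" is a made-up application metric.
scraped = ["up", "container_cpu_usage_seconds_total", "my_app_requests_total"]
basic, custom = split_by_billing_type(scraped)
print("basic:", basic)
print("custom (billable):", custom)
```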

cAdvisor (Job name: _arms/kubelet/cadvisor)

Metric	Description
container_cpu_usage_seconds_total	Total container CPU usage time.
container_fs_usage_bytes	Container file system usage in bytes.
container_memory_cache	Container memory cache.
container_memory_usage_bytes	Container memory usage in bytes.
container_memory_working_set_bytes	Container memory working set in bytes.
container_network_receive_bytes_total	Total bytes received by the container network.
container_network_transmit_bytes_total	Total bytes transmitted by the container network.
container_scrape_error	Container metric scrape error.
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED	The proportion of computing power allocated to a container on a GPU card relative to the total computing power of that GPU. The value ranges from 0 to 1. For exclusive GPUs or shared GPUs that only request GPU memory, this metric is 0, which indicates no limit on computing power. For example, if a GPU card has 100 units of computing power and 30 units are allocated to a container, the allocated computing power ratio for that container is 30/100 = 0.3.
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED	The GPU memory allocated to the container.
DCGM_CUSTOM_DEV_FB_ALLOCATED	The proportion of allocated GPU memory to the total GPU memory. The value ranges from 0 to 1.
DCGM_CUSTOM_DEV_FB_TOTAL	The total GPU memory of the GPU card.
DCGM_CUSTOM_DEV_HEALTH	GPU health status.
DCGM_CUSTOM_PROCESS_DECODE_UTIL	The decoder utilization of the GPU thread.
DCGM_CUSTOM_PROCESS_ENCODE_UTIL	The encoder utilization of the GPU thread.
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL	The memory copy utilization of the GPU thread.
DCGM_CUSTOM_PROCESS_MEM_USED	The GPU memory currently used by the GPU thread.
DCGM_CUSTOM_PROCESS_SM_UTIL	The SM utilization of the GPU thread.
DCGM_CUSTOM_PROF_MEM_BANDWIDTH_USED	GPU memory bandwidth usage.
DCGM_CUSTOM_PROF_TENS_TFPS_USED	The usage of the GPU tensor core.
DCGM_FI_DEV_DEC_UTIL	Decoder utilization.
DCGM_FI_DEV_ENC_UTIL	Encoder utilization.
DCGM_FI_DEV_FB_FREE	The amount of available framebuffer memory.
DCGM_FI_DEV_FB_USED	The amount of used framebuffer memory. This value corresponds to the used value of Memory-Usage in the nvidia-smi command.
DCGM_FI_DEV_GPU_TEMP	GPU temperature.
DCGM_FI_DEV_GPU_UTIL	GPU utilization. This is the percentage of time one or more kernel functions are active on the GPU over a period, such as 1s or 1/6s, depending on the GPU product. This metric only shows that a GPU resource is in use by a kernel function, but does not show the specific usage.
DCGM_FI_DEV_MEM_CLOCK	Memory clock frequency.
DCGM_FI_DEV_MEM_COPY_UTIL	Memory bandwidth utilization. For example, for an NVIDIA V100 GPU, the maximum memory bandwidth is 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%.
DCGM_FI_DEV_POWER_USAGE	Power usage.
DCGM_FI_DEV_SM_CLOCK	SM clock frequency.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	The energy consumed since the driver was loaded.
DCGM_FI_DEV_XID_ERRORS	The last XID error number that occurred within a period of time.
DCGM_FI_PROF_DRAM_ACTIVE	Memory bandwidth utilization. The fraction of cycles where data is sent to or received from the device memory. This value is an average over the time interval, not an instantaneous value. A higher value indicates higher utilization of the device memory. A value of 1 (100%) means that a DRAM instruction is executed in every cycle within the time interval. In practice, a peak of about 0.8 (80%) is the maximum achievable value. A value of 0.2 (20%) means that 20% of the cycles are used to read from or write to the device memory within the time interval.
DCGM_FI_PROF_NVLINK_RX_BYTES	The rate of data received over NVLink, excluding protocol headers. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
DCGM_FI_PROF_NVLINK_TX_BYTES	Total bytes transmitted over NVLink (send direction).
DCGM_FI_PROF_PCIE_RX_BYTES	The rate of data received over the PCIe bus, including protocol headers and data payloads. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per channel.
DCGM_FI_PROF_PCIE_TX_BYTES	The rate of data transmitted over the PCIe bus, including protocol headers and data payloads. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per channel.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	The fraction of cycles where the Tensor (HMMA/IMMA) Pipe is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of Tensor Cores. A value of 1 (100%) means that a Tensor instruction is issued every other instruction cycle (one instruction is completed in two cycles). A value of 0.2 (20%) could mean any of the following: (1) 20% of the SMs' Tensor Cores run at 100% utilization throughout the interval; (2) 100% of the SMs' Tensor Cores run at 20% utilization throughout the interval; (3) 100% of the SMs' Tensor Cores run at 100% utilization for 1/5 of the interval; (4) another combination.
DCGM_FI_PROF_SM_ACTIVE	The fraction of time that at least one warp is active on a Streaming Multiprocessor (SM) within a time interval. This value is the average for all SMs and is not sensitive to the number of threads per block. A warp is active when it is scheduled and allocated resources. It can be in a computing or non-computing state, such as waiting for a memory request. A value less than 0.5 indicates inefficient GPU utilization, and a value of 0.8 or greater is necessary, but not sufficient, for effective GPU utilization. Assume a GPU has N SMs: (1) if a kernel function runs on all SMs using N thread blocks throughout the interval, the value is 1 (100%); (2) if a kernel function runs N/5 thread blocks throughout the interval, the value is 0.2; (3) if a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2.
machine_cpu_cores	Number of machine CPU cores.
node_exporter_build_info	Node exporter build information.
nvidia_gpu_duty_cycle	NVIDIA GPU duty cycle percentage.
nvidia_gpu_memory_total_bytes	Total NVIDIA GPU memory in bytes.
nvidia_gpu_memory_used_bytes	Amount of used NVIDIA GPU memory.
nvidia_gpu_num_devices	Number of NVIDIA GPU devices.
nvidia_gpu_power_usage_milliwatts	NVIDIA GPU power consumption in milliwatts.
nvidia_gpu_temperature_celsius	NVIDIA GPU temperature in Celsius.
rdma_service_monitor_local_ack_timeout_err	Number of RDMA network timeout errors.
rdma_service_monitor_out_of_seq	Number of out-of-sequence RDMA network datagrams.
rdma_service_monitor_packet_seq_err	Number of out-of-sequence RDMA network packet sending errors.
rdma_service_monitor_rx_bytes	RDMA network receive throughput.
rdma_service_monitor_rx_packets	Number of received RDMA network packets.
rdma_service_monitor_tx_bytes	RDMA network send throughput.
rdma_service_monitor_tx_packets	Number of sent RDMA network packets.
up	Connectivity of metric scraping.
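
The ratios and rates quoted in the table above reduce to simple arithmetic. The following Python sketch reproduces the 30/100 and 450/900 examples from the descriptions and shows one way to derive an average per-second rate from two samples of a cumulative counter such as `container_cpu_usage_seconds_total`. The helper names and the counter sample values are hypothetical, and unlike the PromQL `rate()` function the sketch does not handle counter resets.

```python
def ratio(part, whole):
    """Return part / whole as a fraction."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def counter_rate(earlier, later):
    """Average per-second increase between two (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# DCGM_CUSTOM_CONTAINER_CP_ALLOCATED: 30 of 100 compute units allocated.
print(ratio(30, 100))  # 0.3

# DCGM_FI_DEV_MEM_COPY_UTIL: 450 GB/s used out of a 900 GB/s maximum.
print(f"{ratio(450, 900):.0%}")  # 50%

# Hypothetical samples of container_cpu_usage_seconds_total taken 60 s apart.
print(counter_rate((0, 120.0), (60, 150.0)))  # 0.5 (CPU cores used on average)
```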

ACK ControlPlane APIServer (Includes ACK Pro control plane components such as APIServer, etcd, scheduler, KCM, and CCM. ACK Dedicated clusters include only APIServer) (Job name: apiserver)
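
Many metrics in the following table come in `_bucket`, `_count`, and `_sum` triples. In the standard Prometheus histogram exposition, `_bucket` series carry cumulative counts per `le` upper bound, `_count` is the number of observations, and `_sum` is the sum of the observed values. As a minimal sketch with made-up numbers (not values from a real cluster), the Python code below derives an average and an approximate quantile from such a triple, for example `apiserver_request_duration_seconds_*`. The function names are hypothetical, and the interpolation only approximates what the PromQL `histogram_quantile()` function does.

```python
import math


def average_duration(sum_seconds, count):
    """Mean observed value: _sum divided by _count (NaN if nothing was observed)."""
    return sum_seconds / count if count else math.nan


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket (math.inf).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, cum in buckets:
        if cum >= rank:
            if math.isinf(bound) or cum == prev_count:
                # No finite upper bound (or an empty bucket): fall back to the lower bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (cum - prev_count)
        prev_bound, prev_count = bound, cum
    return prev_bound


# Made-up cumulative bucket counts keyed by the "le" upper bound, in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(average_duration(12.5, 100))  # 0.125
print(estimate_quantile(0.95, buckets))
```

Because only the bucket boundaries are known, the quantile is an estimate whose accuracy depends on how finely the buckets are spaced around the value of interest.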

Metric	Description
aggregator_discovery_aggregation_count_total	Total count of aggregations from aggregator discovery
aggregator_openapi_v2_regeneration_count	Aggregator OpenAPI V2 regeneration count
aggregator_openapi_v2_regeneration_duration	Aggregator OpenAPI V2 regeneration duration
aggregator_unavailable_apiservice	Unavailable aggregator APIService
aggregator_unavailable_apiservice_count	The number of unavailable APIServices in the aggregator.
aggregator_unavailable_apiservice_total	Total number of unavailable API services in the aggregator
aliyun_prometheus_agent_append_duration_seconds	Alibaba Cloud Prometheus Agent append duration (seconds)
aliyun_prometheus_agent_job_discovery_status	Alibaba Cloud Prometheus Agent job discovery status
aliyun_prometheus_agent_scrapes_by_target_total	Total scrapes by target for the Alibaba Cloud Prometheus Agent
aliyun_prometheus_agent_target_info	Alibaba Cloud Prometheus Agent target information
apiextensions_apiserver_validation_ratcheting_seconds_bucket	APIServer validation ratcheting seconds bucket
apiextensions_apiserver_validation_ratcheting_seconds_count	Count of APIServer validation ratcheting seconds
apiextensions_apiserver_validation_ratcheting_seconds_sum	Sum of APIServer validation ratcheting durations, in seconds
apiextensions_openapi_v2_regeneration_count	Apiextensions OpenAPI V2 regeneration count
apiextensions_openapi_v3_regeneration_count	Apiextensions OpenAPI V3 regeneration count
apiserver_accepted_listall_requests_total	The total number of listall requests accepted by the APIServer.
apiserver_admission_controller_admission_duration_seconds_bucket	The bucket for the APIServer admission controller admission duration, in seconds.
apiserver_admission_controller_admission_duration_seconds_count	The number of admission requests processed by the APIServer admission controller.
apiserver_admission_controller_admission_duration_seconds_sum	Total admission duration for the APIServer admission controller, in seconds
apiserver_admission_step_admission_duration_seconds_bucket	The histogram bucket for the duration of an APIServer admission step in seconds.
apiserver_admission_step_admission_duration_seconds_count	Count of API server admission step durations in seconds.
apiserver_admission_step_admission_duration_seconds_sum	Total duration of API server admission steps in seconds
apiserver_admission_step_admission_duration_seconds_summary	Summary of the APIServer admission step duration in seconds.
apiserver_admission_step_admission_duration_seconds_summary_count	Summary count of the admission duration of an APIServer admission step in seconds.
apiserver_admission_step_admission_duration_seconds_summary_sum	The sum of the summary of the API server admission step duration, in seconds.
apiserver_admission_webhook_admission_duration_seconds_bucket	APIServer admission webhook admission duration seconds bucket
apiserver_admission_webhook_admission_duration_seconds_count	The count of APIServer admission webhook durations in seconds.
apiserver_admission_webhook_admission_duration_seconds_sum	Sum of the admission duration of API server admission webhooks, in seconds.
apiserver_admission_webhook_fail_open_count	API server admission webhook fail open count
apiserver_admission_webhook_rejection_count	The number of rejections from the API server admission webhook.
apiserver_admission_webhook_request_total	Total number of API server admission webhook requests
apiserver_audit_error_total	Total number of API Server audit errors
apiserver_audit_event_total	Total APIServer audit events
apiserver_audit_level_total	Total number of API server audit events, counted by audit policy level
apiserver_audit_requests_rejected_total	Total number of rejected APIServer audit requests.
apiserver_authorization_decisions_total	Total number of API server authorization decisions
apiserver_cache_list_fetched_objects_total	The total number of objects fetched from the APIServer cache list.
apiserver_cache_list_returned_objects_total	Total number of objects returned by the APIServer cache list
apiserver_cache_list_total	Total number of APIServer cache list operations
apiserver_cacher_received_events	Events received by the APIServer cache
apiserver_cacher_sended_events_latency_milliseconds_bucket	The distribution of latency in milliseconds for events sent by the APIServer cacher.
apiserver_cacher_sended_events_latency_milliseconds_count	The count of latency measurements in milliseconds for events sent by the APIServer cacher.
apiserver_cacher_sended_events_latency_milliseconds_sum	The total latency in milliseconds for events sent by the APIServer cacher.
apiserver_cacher_watcher_channel_length	APIServer cacher watcher channel length
apiserver_cel_compilation_duration_seconds_bucket	Distribution of APIServer CEL compilation durations in seconds
apiserver_cel_compilation_duration_seconds_count	Counter of API server CEL compilations
apiserver_cel_compilation_duration_seconds_sum	Total APIServer CEL compilation duration (seconds)
apiserver_cel_evaluation_duration_seconds_bucket	Distribution of APIServer CEL evaluation durations in seconds.
apiserver_cel_evaluation_duration_seconds_count	The number of API server CEL evaluations.
apiserver_cel_evaluation_duration_seconds_sum	Total duration of APIServer CEL evaluation in seconds
apiserver_client_certificate_expiration_seconds_bucket	Distribution of seconds remaining before the API server client certificate expires.
apiserver_client_certificate_expiration_seconds_count	The number of observations of the remaining lifetime of API server client certificates.
apiserver_client_certificate_expiration_seconds_sum	The total number of seconds remaining before the APIServer client certificate expires.
apiserver_clusterip_repair_ip_errors_total	Total number of ClusterIP errors detected by the API server repair loop
apiserver_clusterip_repair_reconcile_errors_total	The total number of reconciliation errors for ClusterIP repairs by the APIServer.
apiserver_conversion_webhook_duration_seconds_bucket	The distribution of API server conversion webhook durations in seconds.
apiserver_conversion_webhook_duration_seconds_count	The number of APIServer conversion webhook calls
apiserver_conversion_webhook_duration_seconds_sum	Total duration of API server conversion webhooks in seconds
apiserver_conversion_webhook_request_total	Total number of API server conversion webhook requests
apiserver_crd_conversion_webhook_duration_seconds_bucket	The distribution of API Server CRD conversion webhook durations in seconds.
apiserver_crd_conversion_webhook_duration_seconds_count	Count of calls to the APIServer CRD conversion webhook
apiserver_crd_conversion_webhook_duration_seconds_sum	Total duration of APIServer CRD conversion webhooks in seconds.
apiserver_crd_webhook_conversion_duration_seconds_bucket	Distribution of APIServer CRD webhook conversion duration in seconds.
apiserver_crd_webhook_conversion_duration_seconds_count	The total number of APIServer CRD webhook conversions.
apiserver_crd_webhook_conversion_duration_seconds_sum	Total duration of APIServer CRD webhook conversions in seconds.
apiserver_created_watchers	Number of watchers created by the API server
apiserver_current_inflight_requests	The number of requests the APIServer is currently processing.
apiserver_current_inqueue_requests	The current number of requests in the API server queue.
apiserver_dropped_requests_total	The total number of requests dropped by the APIServer.
apiserver_encryption_config_controller_automatic_reload_failures_total	Number of failed automatic reloads for the APIServer encryption configuration controller
apiserver_encryption_config_controller_automatic_reload_success_total	Number of successful automatic reloads for the APIServer encryption configuration controller
apiserver_envelope_encryption_dek_cache_fill_percent	APIServer envelope encryption DEK cache fill percentage
apiserver_error_watchers	Number of APIServer error watchers
apiserver_flowcontrol_current_executing_requests	Number of requests currently being executed by the APIServer throttle
apiserver_flowcontrol_current_executing_seats	Number of seats currently used by the APIServer throttle
apiserver_flowcontrol_current_inqueue_requests	Number of requests in the APIServer throttle queue
apiserver_flowcontrol_current_inqueue_seats	Number of seats in the APIServer throttle queue
apiserver_flowcontrol_current_limit_seats	Current seat limit for the API server throttle
apiserver_flowcontrol_current_r	Current R value of the APIServer throttle
apiserver_flowcontrol_demand_seats_average	Average value of requested seats for APIServer throttling
apiserver_flowcontrol_demand_seats_bucket	Seat distribution for throttled API server requests
apiserver_flowcontrol_demand_seats_count	APIServer throttle request seat count
apiserver_flowcontrol_demand_seats_high_watermark	APIServer throttling request seats high-water mark
apiserver_flowcontrol_demand_seats_smoothed	Smoothing value for APIServer throttle request seats
apiserver_flowcontrol_demand_seats_stdev	Standard deviation of request seats for APIServer throttling
apiserver_flowcontrol_demand_seats_sum	Total requested seats for APIServer throttling
apiserver_flowcontrol_dispatch_r	APIServer throttle scheduling R value
apiserver_flowcontrol_dispatched_requests_total	Total number of requests scheduled by APIServer throttling
apiserver_flowcontrol_latest_s	Recent S value limit for APIServer throttling
apiserver_flowcontrol_lower_limit_seats	Minimum seats for APIServer throttling
apiserver_flowcontrol_next_discounted_s_bounds	Next discounted S-value threshold for the APIServer throttle
apiserver_flowcontrol_next_s_bounds	Next S value threshold for APIServer throttling
apiserver_flowcontrol_nominal_limit_seats	Nominal seat limit for APIServer throttling
apiserver_flowcontrol_priority_level_request_count_samples_bucket	Sample distribution of APIServer requests by throttling priority level
apiserver_flowcontrol_priority_level_request_count_samples_count	Sample count of APIServer requests per throttling priority level
apiserver_flowcontrol_priority_level_request_count_samples_sum	Sum of sampled request counts for the APIServer throttling priority level
apiserver_flowcontrol_priority_level_request_count_watermarks_bucket	Distribution of request count watermarks across APIServer flow control priority levels
apiserver_flowcontrol_priority_level_request_count_watermarks_count	Count of request count watermark observations for APIServer flow control priority levels
apiserver_flowcontrol_priority_level_request_count_watermarks_sum	Sum of request watermarks for APIServer throttling priority levels
apiserver_flowcontrol_priority_level_request_utilization_bucket	Distribution of APIServer request utilization by flow control priority level
apiserver_flowcontrol_priority_level_request_utilization_count	APIServer throttle priority level request utilization count
apiserver_flowcontrol_priority_level_request_utilization_sum	Total request utilization across APIServer throttling priority levels
apiserver_flowcontrol_priority_level_seat_count_samples_bucket	Sample distribution of seats across APIServer throttling priority levels
apiserver_flowcontrol_priority_level_seat_count_samples_count	APIServer throttling priority level seats sample count
apiserver_flowcontrol_priority_level_seat_count_samples_sum	Sum of seat count samples for the APIServer throttle priority level
apiserver_flowcontrol_priority_level_seat_count_watermarks_bucket	Distribution of seat watermarks for API server priority levels
apiserver_flowcontrol_priority_level_seat_count_watermarks_count	Count of seat count watermark observations for APIServer flow control priority levels
apiserver_flowcontrol_priority_level_seat_count_watermarks_sum	Total seats at the watermark for the APIServer throttling priority level
apiserver_flowcontrol_priority_level_seat_utilization_bucket	API server: Seat utilization distribution by throttle priority level
apiserver_flowcontrol_priority_level_seat_utilization_count	APIServer flow control priority level seat utilization count
apiserver_flowcontrol_priority_level_seat_utilization_sum	Total seat utilization across API server throttling priority levels
apiserver_flowcontrol_read_vs_write_current_requests_bucket	Histogram buckets for the current number of read and write requests in APIServer flow control
apiserver_flowcontrol_read_vs_write_current_requests_count	Count of observations of the current number of read and write requests in APIServer flow control
apiserver_flowcontrol_read_vs_write_current_requests_sum	Sum of the observed numbers of current read and write requests in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_samples_bucket	Histogram buckets for sampled read/write request counts in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_samples_count	Number of samples of read/write request counts in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_samples_sum	Sum of sampled read/write request counts in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_watermarks_bucket	Histogram buckets for read/write request count watermarks in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_watermarks_count	Count of read/write request count watermark observations in APIServer flow control
apiserver_flowcontrol_read_vs_write_request_count_watermarks_sum	Sum of read/write request count watermarks in APIServer flow control
apiserver_flowcontrol_rejected_requests_total	Total requests rejected by APIServer throttling
apiserver_flowcontrol_request_concurrency_in_use	APIServer throttled concurrent requests
apiserver_flowcontrol_request_concurrency_limit	Concurrency limit for APIServer request throttling
apiserver_flowcontrol_request_dispatch_no_accommodation_total	Total number of APIServer flow control dispatch attempts that could not be accommodated
apiserver_flowcontrol_request_execution_seconds_bucket	Histogram buckets for the execution time, in seconds, of requests in APIServer flow control
apiserver_flowcontrol_request_execution_seconds_count	Count of request execution time observations in APIServer flow control
apiserver_flowcontrol_request_execution_seconds_sum	Sum of the execution time, in seconds, of requests in APIServer flow control
apiserver_flowcontrol_request_queue_length_after_enqueue_bucket	Histogram buckets for the APIServer flow control queue length after a request is enqueued
apiserver_flowcontrol_request_queue_length_after_enqueue_count	Count of observations of the APIServer flow control queue length after a request is enqueued
apiserver_flowcontrol_request_queue_length_after_enqueue_sum	Sum of the observed APIServer flow control queue lengths after a request is enqueued
apiserver_flowcontrol_request_wait_duration_seconds_bucket	Histogram buckets for the time, in seconds, that requests wait in APIServer flow control queues
apiserver_flowcontrol_request_wait_duration_seconds_count	Count of request wait time observations in APIServer flow control queues
apiserver_flowcontrol_request_wait_duration_seconds_sum	Sum of the time, in seconds, that requests wait in APIServer flow control queues
apiserver_flowcontrol_seat_fair_frac	Fair fraction of the APIServer's concurrency to allocate to each priority level, as computed in the most recent borrowing adjustment period
apiserver_flowcontrol_target_seats	Target seat count for API server throttling
apiserver_flowcontrol_upper_limit_seats	Maximum number of seats for APIServer throttling
apiserver_flowcontrol_watch_count_samples_bucket	Histogram buckets for sampled watch counts in APIServer flow control
apiserver_flowcontrol_watch_count_samples_count	Number of watch count samples in APIServer flow control
apiserver_flowcontrol_watch_count_samples_sum	Sum of sampled watch counts in APIServer flow control
apiserver_flowcontrol_work_estimated_seats_bucket	APIServer flow control's bucket for estimated work seats
apiserver_flowcontrol_work_estimated_seats_count	APIServer flow control estimated seat count
apiserver_flowcontrol_work_estimated_seats_sum	Total estimated seats for APIServer throttling work
apiserver_init_events_total	Total APIServer initialization events
apiserver_kube_aggregator_x509_insecure_sha1_total	Number of requests using insecure SHA1 signatures
apiserver_kube_aggregator_x509_missing_san_total	Total number of kube-aggregator requests to servers whose serving certificates are missing the SAN extension
apiserver_longrunning_gauge	APIServer long-running gauge
apiserver_longrunning_requests	Long-running APIServer requests
apiserver_nodeport_repair_reconcile_errors_total	Total number of reconciliation errors for APIServer NodePort repairs
apiserver_realtime_watchers	Number of real-time APIServer watchers
apiserver_registered_watchers	Number of registered APIServer watchers
apiserver_request_aborts_total	Total aborted APIServer requests
apiserver_request_body_size_bytes_bucket	APIServer request body size in bytes bucket
apiserver_request_body_size_bytes_count	Count of APIServer request body size observations
apiserver_request_body_size_bytes_sum	Total APIServer request body size in bytes
apiserver_request_count	Number of API server requests
apiserver_request_duration_seconds_bucket	Buckets for APIServer request processing time (in seconds)
apiserver_request_duration_seconds_count	Count of APIServer request duration in seconds
apiserver_request_duration_seconds_sum	Total APIServer request duration in seconds
apiserver_request_filter_duration_seconds_bucket	APIServer request filter duration bucket (seconds)
apiserver_request_filter_duration_seconds_count	Count of APIServer request filter durations in seconds.
apiserver_request_filter_duration_seconds_sum	Total duration of APIServer request filters in seconds
apiserver_request_latencies_summary	APIServer request latency distribution summary
apiserver_request_no_resourceversion_list_total	Total number of LIST requests that do not specify a resourceVersion
apiserver_request_post_timeout_total	Total request handler activity recorded after the API server timed out the associated requests
apiserver_request_sli_duration_seconds_bucket	API request Service Level Indicator (SLI) duration seconds bucket
apiserver_request_sli_duration_seconds_count	Count of API request SLI duration observations
apiserver_request_sli_duration_seconds_sum	Total API request SLI duration in seconds
apiserver_request_slo_duration_seconds_bucket	API request SLO duration bucket (seconds)
apiserver_request_slo_duration_seconds_count	API request SLO duration seconds count
apiserver_request_slo_duration_seconds_sum	Total API request SLO duration in seconds
apiserver_request_terminations_total	Total number of API requests terminated by the API server
apiserver_request_timestamp_comparison_time_bucket	Distribution buckets for API request timestamp differences
apiserver_request_timestamp_comparison_time_count	API request timestamp comparison sample count
apiserver_request_timestamp_comparison_time_sum	Total time for API request timestamp comparison
apiserver_request_total	Total API requests
apiserver_requested_deprecated_apis	Number of requests to the API server for deprecated APIs
apiserver_response_sizes_bucket	API response size distribution buckets
apiserver_response_sizes_count	API response size count
apiserver_response_sizes_sum	Total API response size
apiserver_selfrequest_total	Total API server self-requests
apiserver_storage_data_key_generation_duration_seconds_bucket	APIServer storage data key generation duration: seconds buckets
apiserver_storage_data_key_generation_duration_seconds_count	Count of data key generations by APIServer storage
apiserver_storage_data_key_generation_duration_seconds_sum	Total data key generation time for APIServer storage, in seconds
apiserver_storage_data_key_generation_failures_total	Total number of data key generation failures for the APIServer store
apiserver_storage_db_total_size_in_bytes	Total size of the APIServer database (bytes)
apiserver_storage_decode_errors_total	Total APIServer storage decoding errors
apiserver_storage_envelope_transformation_cache_misses_total	Total cache misses for the envelope transform in APIServer storage
apiserver_storage_events_received_total	Total number of etcd events received by APIServer storage
apiserver_storage_list_evaluated_objects_total	Total objects evaluated from APIServer storage for list operations
apiserver_storage_list_fetched_objects_total	Total objects retrieved from the APIServer storage list
apiserver_storage_list_returned_objects_total	Total number of objects in a list response from the APIServer
apiserver_storage_list_total	Total APIServer storage list operations
apiserver_storage_objects	Number of APIServer objects
apiserver_storage_size_bytes	APIServer storage size (bytes)
apiserver_terminated_watchers_total	Total number of watchers terminated by the APIServer
apiserver_tls_handshake_errors_total	Total failed TLS handshake requests for the API server
apiserver_too_large_resourceversion_errors	Number of APIServer requests that failed because the requested resource version was too large
apiserver_watch_cache_events_dispatched_total	Total number of events dispatched by the APIServer watch cache
apiserver_watch_cache_events_received_total	Total number of events received by the APIServer watch cache
apiserver_watch_cache_initializations_total	Total APIServer watch cache initializations
apiserver_watch_cache_read_wait_seconds_bucket	APIServer watch cache read wait time bucket (seconds)
apiserver_watch_cache_read_wait_seconds_count	Count of APIServer watch cache read wait observations
apiserver_watch_cache_read_wait_seconds_sum	Sum of wait time in seconds for APIServer watch cache reads
apiserver_watch_cache_watch_cache_initializations_total	Total APIServer watch cache initializations
apiserver_watch_events_sizes_bucket	Histogram buckets for APIServer watch event sizes
apiserver_watch_events_sizes_count	Count of APIServer watch event size observations
apiserver_watch_events_sizes_sum	Total size of APIServer watch events
apiserver_watch_events_total	Total APIServer watch events
apiserver_webhooks_x509_insecure_sha1_total	Number of requests that use insecure SHA1 signatures
apiserver_webhooks_x509_missing_san_total	Total number of APIServer requests to webhook servers whose serving certificates are missing the SAN extension
authenticated_user_requests	Total number of authenticated user requests
authentication_attempts	Authentication attempts
authentication_duration_seconds_bucket	Authentication procedure duration buckets (seconds)
authentication_duration_seconds_count	Count of authentication duration observations
authentication_duration_seconds_sum	Total authentication duration in seconds
authentication_token_cache_active_fetch_count	Authentication token cache proactive fetch count
authentication_token_cache_fetch_total	Total authentication token cache retrievals
authentication_token_cache_request_duration_seconds_bucket	Authentication token cache request latency distribution buckets (seconds)
authentication_token_cache_request_duration_seconds_count	Count of authentication token cache request duration observations
authentication_token_cache_request_duration_seconds_sum	Total duration of authentication token cache requests in seconds
authentication_token_cache_request_total	Total authentication token cache requests
authorization_attempts_total	Total authorization attempts
authorization_duration_seconds_bucket	Distribution buckets for authorization procedure duration (seconds)
authorization_duration_seconds_count	Authorization duration count (seconds)
authorization_duration_seconds_sum	Total authorization procedure duration in seconds
cardinality_enforcement_unexpected_categorizations_total	Total unexpected categorizations during cardinality enforcement
count	Count
cpu_utilization_core	CPU utilization (core)
disabled_metric_total	Total disabled metrics
disabled_metrics_total	Total disabled metrics
etcd_bookmark_counts	Etcd bookmark count
etcd_db_total_size_in_bytes	Total etcd database size (bytes)
etcd_lease_object_counts_bucket	Histogram buckets for etcd lease object count
etcd_lease_object_counts_count	Number of observations of the etcd lease object count
etcd_lease_object_counts_sum	Sum of etcd lease object counts
etcd_object_counts	etcd object count
etcd_request_duration_seconds_bucket	Histogram buckets for etcd request duration (seconds)
etcd_request_duration_seconds_count	etcd request duration count (seconds)
etcd_request_duration_seconds_sum	Sum of etcd request durations in seconds
etcd_request_errors_total	Total etcd request errors
etcd_requests_total	Total etcd requests
etcd_watcher_channel_length	etcd watcher channel length
etcd_watcher_received_events	Events received by the etcd watcher
etcd_watcher_sended_events_latency_milliseconds_bucket	Histogram buckets for etcd watcher event send latency (ms)
etcd_watcher_sent_events_latency_milliseconds_count	etcd watcher event send latency count (ms)
etcd_watcher_sent_events_latency_milliseconds_sum	Sum of etcd watcher event send latency in milliseconds
field_validation_request_duration_seconds_bucket	Field validation request duration distribution bucket (seconds)
field_validation_request_duration_seconds_count	Field validation request duration count (seconds)
field_validation_request_duration_seconds_sum	Sum of field validation request duration in seconds
get_token_count	Get token count
get_token_fail_count	Failed token acquisition count
grpc_client_handled_total	gRPC client: Total RPCs completed
grpc_client_msg_received_total	gRPC client: Total messages received
grpc_client_msg_sent_total	gRPC client: Total messages sent
grpc_client_started_total	gRPC client: Total RPCs started
hidden_metric_total	Hidden metric: Total
hidden_metrics_total	Hidden metric: Total
http_request_duration_microseconds	HTTP request: Duration (microseconds)
http_request_size_bytes	HTTP request: size (bytes)
http_requests_total	HTTP requests: Total
http_response_size_bytes	HTTP response size (bytes)
Job	Job name
job_instance_mode	Job instance mode
kube_apiserver_clusterip_allocator_allocated_ips	Kubernetes APIServer: Number of IP addresses allocated by the ClusterIP allocator
kube_apiserver_clusterip_allocator_allocation_errors_total	Kubernetes APIServer: Total ClusterIP allocator allocation errors
kube_apiserver_clusterip_allocator_allocation_total	Kubernetes APIServer: Total allocations by the ClusterIP allocator
kube_apiserver_clusterip_allocator_available_ips	Kubernetes APIServer: Number of available IP addresses for the ClusterIP allocator
kube_apiserver_nodeport_allocator_allocated_ports	Kubernetes APIServer: Number of ports allocated by the NodePort allocator
kube_apiserver_nodeport_allocator_allocation_errors_total	Kubernetes APIServer: Total NodePort allocator allocation errors
kube_apiserver_nodeport_allocator_allocation_total	Kubernetes APIServer: Total allocations by the NodePort allocator
kube_apiserver_nodeport_allocator_available_ports	Kubernetes APIServer: Number of available ports for the NodePort allocator
kube_apiserver_pod_logs_backend_tls_failure_total	Kubernetes APIServer: Total pods/logs requests that failed TLS authentication
kube_apiserver_pod_logs_insecure_backend_total	Kubernetes APIServer: Total insecure pods/logs requests
kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total	Kubernetes APIServer: Total pods/logs requests that failed TLS authentication
kube_apiserver_pod_logs_pods_logs_insecure_backend_total	Kubernetes APIServer: Total insecure pods/logs requests
kubelet_container_log_filesystem_used_bytes	Kubelet: File system usage for container logs in bytes
kubelet_node_name	Kubelet: Node name
kubelet_pleg_relist_duration_seconds_bucket	Kubelet: PLEG relist duration buckets (seconds)
kubelet_pod_worker_duration_seconds_bucket	Kubelet: Pod worker duration buckets (seconds)
kubelet_volume_stats_available_bytes	Kubelet: Available bytes in volume stats
kubelet_volume_stats_capacity_bytes	Kubelet: Capacity in bytes from volume statistics
kubelet_volume_stats_inodes	Kubelet: Total inode count on the volume
kubelet_volume_stats_inodes_free	Kubelet: Free inode count on the volume
kubelet_volume_stats_inodes_used	Kubelet: Used inode count for the volume
kubelet_volume_stats_used_bytes	Kubelet: Volume used bytes
kubernetes_build_info	Kubernetes build information
kubernetes_feature_enabled	Kubernetes feature status: Enabled
last_list_all_response_size_in_bytes	Total size of the last list response (bytes)
memory_utilization_byte	Memory utilization: Bytes
node_authorizer_graph_actions_duration_seconds_bucket	Node authorizer: Graph operation duration buckets (seconds)
node_authorizer_graph_actions_duration_seconds_count	Node authorizer: Graph operation duration count (seconds)
node_authorizer_graph_actions_duration_seconds_sum	Node authorizer: Sum of graph operation duration in seconds
pod_security_evaluations_total	Total pod security evaluations
pod_security_exemptions_total	Total pod security exemptions
process_cpu_seconds_total	Total process CPU time in seconds
process_max_fds	Maximum number of file descriptors per process
process_open_fds	Number of open file descriptors for the process
process_resident_memory_bytes	Process resident memory in bytes
process_start_time_seconds	Process startup time (seconds)
process_virtual_memory_bytes	Process virtual memory in bytes
process_virtual_memory_max_bytes	Maximum virtual memory of a process in bytes
registered_metric_total	Total registered metrics
registered_metrics_total	Total registered metrics
rest_client_exec_plugin_certificate_rotation_age_bucket	REST client exec plugin: Certificate rotation age buckets (seconds)
rest_client_exec_plugin_certificate_rotation_age_count	REST client exec plugin: Certificate rotation age count (seconds)
rest_client_exec_plugin_certificate_rotation_age_sum	REST client exec plugin: Sum of certificate rotation age in seconds
rest_client_exec_plugin_ttl_seconds	REST client exec plugin: Certificate TTL in seconds
rest_client_request_duration_seconds_bucket	REST client: Request duration buckets (seconds)
rest_client_request_duration_seconds_count	REST client: Request duration count (seconds)
rest_client_request_duration_seconds_sum	REST client: Sum of request duration in seconds
rest_client_request_latency_seconds_bucket	REST client: Request latency buckets (seconds)
rest_client_request_size_bytes_bucket	REST client: Request size buckets (bytes)
rest_client_request_size_bytes_count	REST client: Request size count (bytes)
rest_client_request_size_bytes_sum	REST client: Sum of request size (bytes)
rest_client_requests_total	REST client: Total requests
rest_client_response_size_bytes_bucket	REST client: Response size buckets (bytes)
rest_client_response_size_bytes_count	REST client: Response size count (bytes)
rest_client_response_size_bytes_sum	REST client: Sum of response size (bytes)
rest_client_transport_cache_entries	REST client: number of transport cache entries
rest_client_transport_create_calls_total	REST client: Total transport creation calls
scheduler_pending_pods	Scheduler: Number of pending pods
scheduler_pod_scheduling_attempts_bucket	Scheduler: Pod scheduling attempt count buckets
scheduler_scheduler_cache_size	Scheduler: Scheduler cache size
scrape_duration_seconds	Scrape duration (seconds)
scrape_samples_post_metric_relabeling	Number of scraped samples (after metric relabeling)
scrape_samples_scraped	Number of scraped samples
scrape_series_added	Number of new series scraped
serviceaccount_invalid_legacy_auto_token_uses_total	Total uses of invalid legacy auto-generated service account tokens
serviceaccount_legacy_auto_token_uses_total	Total uses of legacy auto-generated service account tokens
serviceaccount_legacy_manual_token_uses_total	Total uses of legacy manual service account tokens
serviceaccount_legacy_tokens_total	Total number of legacy service account tokens
serviceaccount_stale_tokens_total	Total number of stale service account tokens
serviceaccount_valid_tokens_total	Total valid service account tokens
ssh_tunnel_open_count	Open SSH tunnel count
ssh_tunnel_open_fail_count	Number of failed SSH tunnel openings
up	Metric collection connectivity
watch_cache_capacity	Watch cache capacity
watch_cache_capacity_decrease_total	Total decreases in watch cache capacity
watch_cache_capacity_increase_total	Total increases in watch cache capacity
workqueue_adds_total	Total additions to the work queue
workqueue_depth	Work queue depth
workqueue_longest_running_processor_seconds	Longest processor run time in the work queue (seconds)
workqueue_queue_duration_seconds_bucket	Work queue wait time buckets (seconds)
workqueue_queue_duration_seconds_count	Work queue wait time count (seconds)
workqueue_queue_duration_seconds_sum	Sum of work queue wait time (seconds)
workqueue_retries_total	Total work queue retries
workqueue_unfinished_work_seconds	Duration of pending work in the work queue (seconds)
workqueue_work_duration_seconds_bucket	Work queue processing time buckets (seconds)
workqueue_work_duration_seconds_count	Work queue processing time count (seconds)
workqueue_work_duration_seconds_sum	Sum of work queue processing time (seconds)
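
Many rows in the table above come in _bucket, _count, and _sum groups, which is how Prometheus exposes a histogram. As a minimal sketch of how the _sum and _count series relate, the following Python function derives the average observed value between two scrapes. The function name and the sample values are hypothetical and are not taken from Managed Service for Prometheus.

```python
from typing import Optional


def mean_between_scrapes(sum_before: float, count_before: float,
                         sum_after: float, count_after: float) -> Optional[float]:
    """Average observation between two scrapes of a histogram.

    Uses the deltas of the cumulative _sum and _count series.
    Returns None when no new observations were recorded or when
    the counters appear to have been reset (count went down).
    """
    delta_count = count_after - count_before
    if delta_count <= 0:
        return None
    return (sum_after - sum_before) / delta_count


# Hypothetical samples of etcd_request_duration_seconds_sum and _count
# taken at two scrape times.
average_seconds = mean_between_scrapes(12.0, 400, 15.0, 500)
print(average_seconds)
```

The same calculation applies to any _sum and _count pair listed here, such as workqueue_queue_duration_seconds or rest_client_request_duration_seconds. The _bucket series are only needed for percentile estimates.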

Node Exporter (Job name: node-exporter)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	Duration of append operations for the Alibaba Cloud Prometheus agent in seconds.
aliyun_prometheus_agent_job_discovery_status	Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	Total number of scrapes by target for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_target_info	Information about the targets of the Alibaba Cloud Prometheus agent.
job	The name of the job.
node_boot_time_seconds	Node boot time, as a Unix timestamp in seconds.
node_context_switches_total	Total number of context switches on the node.
node_cpu_seconds_total	Total CPU time spent in each mode on the node, in seconds.
node_disk_io_now	Number of disk I/O operations currently in progress on the node.
node_disk_io_time_seconds_total	Total time spent on disk I/O on the node, in seconds.
node_disk_io_time_weighted_seconds_total	Total weighted time spent on disk I/O on the node, in seconds.
node_disk_read_bytes_total	Total bytes read from disk on the node.
node_disk_read_time_seconds_total	Total time spent reading from disk on the node, in seconds.
node_disk_reads_completed_total	Total number of completed disk reads on the node.
node_disk_reads_merged_total	Total number of merged disk reads on the node.
node_disk_write_time_seconds_total	Total time spent writing to disk on the node, in seconds.
node_disk_writes_completed_total	Total number of completed disk writes on the node.
node_disk_writes_merged_total	Total number of merged disk writes on the node.
node_disk_written_bytes_total	Total bytes written to disk on the node.
node_exporter_build_info	Build information for Node Exporter.
node_filefd_allocated	Number of allocated file descriptors on the node.
node_filefd_maximum	Maximum number of file descriptors on the node.
node_filesystem_avail_bytes	Number of available bytes in the file system on the node.
node_filesystem_free_bytes	Number of free bytes in the file system on the node.
node_filesystem_size_bytes	Total size of the file system on the node, in bytes.
node_intr_total	Total number of interrupts on the node.
node_load1	1-minute load average on the node.
node_load15	15-minute load average on the node.
node_load5	5-minute load average on the node.
node_memory_MemAvailable_bytes	Available memory on the node, in bytes.
node_memory_MemFree_bytes	Free memory on the node, in bytes.
node_memory_MemTotal_bytes	Total memory on the node, in bytes.
node_memory_Slab_bytes	Slab memory on the node, in bytes.
node_memory_SReclaimable_bytes	Reclaimable slab memory on the node, in bytes.
node_netstat_Tcp_InErrs	Number of TCP receive errors.
node_netstat_Tcp_InSegs	Number of received TCP segments.
node_netstat_Tcp_OutSegs	Number of sent TCP segments.
node_netstat_Tcp_PassiveOpens	Number of passive TCP connection openings.
node_netstat_Tcp_RetransSegs	Number of retransmitted TCP segments.
node_network_receive_bytes_total	Total number of bytes received over the network.
node_network_receive_drop_total	Total number of received packets dropped.
node_network_receive_errs_total	Total number of receive errors.
node_network_receive_packets_total	Total number of packets received.
node_network_transmit_bytes_total	Total number of bytes transmitted over the network.
node_network_transmit_drop_total	Total number of transmitted packets dropped.
node_network_transmit_errs_total	Total number of transmit errors.
node_network_transmit_packets_total	Total number of packets transmitted.
node_network_up	Indicates whether the network interface is enabled.
node_processes_max_processes	Maximum number of processes.
node_processes_max_threads	Maximum number of threads.
node_processes_pids	Number of process IDs.
node_processes_state	Distribution of process states.
node_processes_threads	Number of threads.
node_schedstat_running_seconds_total	Total seconds spent in the running state according to scheduling statistics.
node_sockstat_TCP_alloc	Number of allocated TCP sockets.
node_sockstat_TCP_inuse	Number of TCP sockets in use.
node_sockstat_TCP_mem	Memory usage of TCP sockets, in pages.
node_sockstat_TCP_mem_bytes	Memory usage of TCP sockets, in bytes.
node_sockstat_TCP_tw	Number of TCP sockets in the TIME_WAIT state.
node_time_zone_offset_seconds	Time zone offset in seconds.
node_timex_offset_seconds	Time offset between the local system clock and the reference clock, in seconds.
node_timex_sync_status	Clock synchronization status.
node_uname_info	System information from uname.
node_vmstat_pgfault	Number of page faults from VM statistics.
node_vmstat_pgmajfault	Number of major page faults from VM statistics.
node_vmstat_pgpgin	Number of page-ins from VM statistics.
node_vmstat_pgpgout	Number of page-outs from VM statistics.
up	Connectivity for metric scraping.
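
Several of the gauges above are usually read in pairs, for example node_memory_MemTotal_bytes with node_memory_MemAvailable_bytes, or node_filesystem_size_bytes with node_filesystem_avail_bytes. The following Python sketch shows one common way to turn such a pair into a utilization ratio. The function name and the sample values are hypothetical.

```python
def utilization(total_bytes: float, available_bytes: float) -> float:
    """Fraction of a resource in use, from a total/available gauge pair.

    Returns a value between 0 and 1 when 0 <= available <= total.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - available_bytes / total_bytes


GIB = 2 ** 30

# Hypothetical node with 16 GiB of memory, 4 GiB of it available
# (node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes).
memory_used_ratio = utilization(16 * GIB, 4 * GIB)
print(f"memory utilization: {memory_used_ratio:.0%}")
```

For memory, MemAvailable is generally a better basis than MemFree, because MemFree excludes reclaimable page cache.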

kube-state-metrics (Job name: _kube-state-metrics)

Metric	Description
kube_configmap_info	Information about Kubernetes ConfigMaps
kube_cronjob_annotations	Kubernetes CronJob annotations
kube_cronjob_created	The creation time of the Kubernetes CronJob.
kube_cronjob_info	Kubernetes CronJob information
kube_cronjob_labels	Kubernetes CronJob labels
kube_cronjob_metadata_resource_version	Shows the resource version of the Kubernetes CronJob metadata.
kube_cronjob_next_schedule_time	The next scheduled time of a Kubernetes CronJob.
kube_cronjob_spec_failed_job_history_limit	Kubernetes CronJob failed job history limit
kube_cronjob_spec_starting_deadline_seconds	The starting deadline for the Kubernetes CronJob in seconds.
kube_cronjob_spec_successful_job_history_limit	The retention limit for the history of successful jobs in a Kubernetes CronJob.
kube_cronjob_spec_suspend	The suspend status of a Kubernetes CronJob.
kube_cronjob_status_active	Number of currently running jobs of the Kubernetes CronJob
kube_cronjob_status_last_schedule_time	The last schedule time of the Kubernetes CronJob
kube_cronjob_status_last_successful_time	The last successful running time of the Kubernetes CronJob
kube_daemonset_created	The creation time of the Kubernetes DaemonSet.
kube_daemonset_status_current_number_scheduled	The current number of nodes scheduled for the Kubernetes DaemonSet.
kube_daemonset_status_desired_number_scheduled	The desired number of scheduled nodes for a Kubernetes DaemonSet.
kube_daemonset_status_number_available	Number of available nodes in the Kubernetes DaemonSet
kube_daemonset_status_number_misscheduled	Number of nodes incorrectly running a Kubernetes DaemonSet pod
kube_daemonset_status_number_ready	The number of ready nodes in a Kubernetes DaemonSet.
kube_daemonset_status_number_unavailable	Number of unavailable nodes in the Kubernetes DaemonSet
kube_daemonset_status_updated_number_scheduled	The number of nodes scheduled with the updated Kubernetes DaemonSet.
kube_daemonset_updated_number_scheduled	Number of nodes scheduled with the updated Kubernetes DaemonSet.
kube_deployment_created	The creation time of the Kubernetes deployment.
kube_deployment_labels	Kubernetes deployment labels
kube_deployment_metadata_generation	The generation of the Kubernetes deployment metadata.
kube_deployment_spec_replicas	Number of replicas in the Kubernetes deployment specification
kube_deployment_spec_strategy_rollingupdate_max_unavailable	The maximum number of unavailable pods during a rolling update for a Kubernetes deployment
kube_deployment_status_observed_generation	The observed generation of the Kubernetes deployment.
kube_deployment_status_replicas	Total number of replicas in a Kubernetes deployment
kube_deployment_status_replicas_available	Number of available Kubernetes deployment replicas
kube_deployment_status_replicas_ready	Number of ready replicas in a Kubernetes deployment
kube_deployment_status_replicas_unavailable	Number of unavailable replicas in a Kubernetes deployment
kube_deployment_status_replicas_updated	The number of updated replicas in a Kubernetes deployment.
kube_horizontalpodautoscaler_info	Information about the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_labels	Kubernetes HorizontalPodAutoscaler labels
kube_horizontalpodautoscaler_metadata_generation	The metadata generation of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_max_replicas	The maximum number of replicas in the specification for a Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_min_replicas	The minimum number of replicas for a Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_spec_target_metric	The target metric of a Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_condition	The status condition of a Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_current_replicas	The current number of replicas of the Kubernetes HorizontalPodAutoscaler.
kube_horizontalpodautoscaler_status_desired_replicas	Desired number of replicas for the Kubernetes HorizontalPodAutoscaler
kube_hpa_labels	Kubernetes HorizontalPodAutoscaler labels
kube_hpa_metadata_generation	The metadata generation of the Kubernetes HorizontalPodAutoscaler.
kube_hpa_spec_max_replicas	The maximum number of replicas for a Kubernetes HorizontalPodAutoscaler.
kube_hpa_spec_min_replicas	The minimum number of replicas in the Kubernetes HorizontalPodAutoscaler specification.
kube_hpa_spec_target_metric	The target metric for a Kubernetes HorizontalPodAutoscaler.
kube_hpa_status_condition	Kubernetes HorizontalPodAutoscaler status condition
kube_hpa_status_current_replicas	The current number of replicas for the Kubernetes HorizontalPodAutoscaler.
kube_hpa_status_desired_replicas	The desired number of replicas for a Kubernetes HorizontalPodAutoscaler.
kube_ingress_info	Ingress information
kube_job_created	The time when the job was created.
kube_job_failed	Whether the job has failed its execution
kube_job_info	Job information
kube_job_spec_completions	The number of successfully finished pods required for the job to complete
kube_job_status_active	Number of actively running pods of the job
kube_job_status_failed	The number of pods of the job that have failed.
kube_job_status_succeeded	The number of pods of the job that have succeeded.
kube_namespace_created	The creation time of the namespace.
kube_namespace_labels	Namespace labels
kube_namespace_status_phase	Namespace status phase
kube_node_info	Node information
kube_node_labels	Node labels
kube_node_spec_taint	Node taint configuration
kube_node_spec_unschedulable	Flag indicating whether the node is marked as unschedulable.
kube_node_status_allocatable	The amount of allocatable resources on a node.
kube_node_status_allocatable_cpu_cores	Number of allocatable CPU cores on the node.
kube_node_status_allocatable_memory_bytes	Allocatable memory on the node in bytes
kube_node_status_allocatable_pods	Number of allocatable pods on the node
kube_node_status_capacity	Node capacity
kube_node_status_capacity_cpu_cores	The CPU capacity of a node in cores.
kube_node_status_capacity_memory_bytes	Node memory capacity in bytes
kube_node_status_capacity_pods	Node pod capacity
kube_node_status_condition	Node status condition
kube_persistentvolume_status_phase	The status phase of the persistent volume.
kube_persistentvolumeclaim_info	Persistent Volume Claim information
kube_persistentvolumeclaim_resource_requests_storage_bytes	The amount of storage requested by a persistent volume claim
kube_persistentvolumeclaim_status_phase	The status phase of the persistent volume claim.
kube_pod_completion_time	Pod completion time
kube_pod_container_info	Pod container information
kube_pod_container_resource_limits	Pod container resource limits
kube_pod_container_resource_limits_cpu_cores	Pod container CPU core limit
kube_pod_container_resource_limits_memory_bytes	Pod container memory limit in bytes
kube_pod_container_resource_requests	Pod container resource request
kube_pod_container_resource_requests_cpu_cores	Pod container CPU core request
kube_pod_container_resource_requests_memory_bytes	pod container memory resource request in bytes
kube_pod_container_status_last_terminated_reason	Last termination reason of the pod container
kube_pod_container_status_ready	Pod container readiness status
kube_pod_container_status_restarts_total	Pod container restart count
kube_pod_container_status_running	Pod container running status
kube_pod_container_status_terminated	Pod container termination status
kube_pod_container_status_terminated_reason	Pod container termination reason
kube_pod_container_status_waiting	Pod container waiting status
kube_pod_container_status_waiting_reason	Pod container waiting reason
kube_pod_created	Pod creation time
kube_pod_deletion_timestamp	Pod deletion timestamp
kube_pod_info	Pod information
kube_pod_labels	Pod labels
kube_pod_owner	Pod owner object
kube_pod_start_time	Pod start time
kube_pod_status_container_ready_time	Pod container readiness time
kube_pod_status_initialized_time	Pod status initialization completion time
kube_pod_status_phase	Pod phase
kube_pod_status_ready	Pod readiness status
kube_pod_status_ready_time	Pod readiness time
kube_pod_status_reason	Pod status reason
kube_pod_status_scheduled_time	Pod scheduling time
kube_pod_status_unschedulable	Flag indicating whether the pod is unschedulable
kube_replicaset_owner	ReplicaSet owner object
kube_replicaset_status_ready_replicas	Number of ready replicas in the ReplicaSet
kube_resource_relationship	Resource relationships
kube_resourcequota	Resource quota
kube_resourcequota_created	Resource quota creation time
kube_secret_info	Secret information
kube_service_info	Service information
kube_service_spec_type	Service type
kube_service_status_load_balancer_ingress	Service load balancer ingress status
kube_statefulset_created	StatefulSet creation time
kube_statefulset_metadata_generation	StatefulSet metadata generation
kube_statefulset_replicas	Desired number of replicas for the StatefulSet
kube_statefulset_status_replicas	Number of replicas in the StatefulSet status
kube_statefulset_status_replicas_available	Number of available replicas in the StatefulSet
kube_statefulset_status_replicas_ready	Number of ready replicas in the StatefulSet
kube_statefulset_status_replicas_updated	Number of updated replicas in the StatefulSet
rest_client_requests_total	Total REST client requests
up	Connectivity for metric collection
workqueue_adds_total	Total work queue additions
workqueue_depth	Work queue depth
workqueue_queue_duration_seconds_bucket	Work queue queuing duration distribution (seconds)

kube-events (Job name: _arms/kube-event)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	The duration of an append operation for the Alibaba Cloud Prometheus agent, in seconds.
aliyun_prometheus_agent_job_discovery_status	The discovery status of a scrape job for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrape_custom_error	The number of custom scrape errors for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	The total number of scrapes by target for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_target_info	The target information for the Alibaba Cloud Prometheus agent.
eventer_events_error_total	The total number of event processing errors.
eventer_events_normal_total	The total number of normal events.
eventer_events_warning_total	The total number of warning events.
eventer_exporter_duration_milliseconds_count	The number of samples for the event export duration, in milliseconds.
eventer_exporter_duration_milliseconds_sum	The total event export duration, in milliseconds.
eventer_manager_last_time_seconds	The last operation time of the event manager, in seconds.
eventer_scraper_duration_milliseconds_count	The count of the event scrape duration, in milliseconds.
eventer_scraper_duration_milliseconds_sum	The total event scrape duration, in milliseconds.
eventer_scraper_events_total_number	The total number of events scraped.
eventer_scraper_last_time_seconds	The last running time of the event scrape, in seconds.
up	The connectivity for metric collection.

CoreDNS (Job name: arms-ack-coredns)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.
aliyun_prometheus_agent_job_discovery_status	The status of scrape job discovery for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrape_custom_error	Number of custom scrape errors from the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_scrapes_by_target_total	The total number of scrapes by the Alibaba Cloud Prometheus agent per target.
aliyun_prometheus_agent_target_info	Target information for the Alibaba Cloud Prometheus agent
coredns_autopath_success_count_total	Total success count for CoreDNS autopath.
coredns_autopath_success_total	Total number of successful CoreDNS autopaths.
coredns_build_info	CoreDNS build information
coredns_cache_drops_total	Total CoreDNS cache drop count
coredns_cache_entries	Number of CoreDNS cache entries
coredns_cache_evictions_total	Total number of CoreDNS cache evictions
coredns_cache_hits_total	Total CoreDNS cache hits
coredns_cache_misses_total	Total number of CoreDNS cache misses
coredns_cache_requests_total	Total CoreDNS cache requests
coredns_cache_size	The size of the CoreDNS cache.
coredns_dns_do_requests_total	Total CoreDNS DNS requests with the DO bit set
coredns_dns_request_count_total	Total DNS request count for CoreDNS
coredns_dns_request_duration_seconds_bucket	CoreDNS DNS request duration histogram buckets (seconds)
coredns_dns_request_duration_seconds_count	CoreDNS DNS request duration count (seconds)
coredns_dns_request_duration_seconds_sum	Sum of CoreDNS DNS request duration in seconds
coredns_dns_request_size_bytes_bucket	CoreDNS DNS request size histogram buckets (bytes)
coredns_dns_request_size_bytes_count	CoreDNS DNS request size count (bytes)
coredns_dns_request_size_bytes_sum	Sum of CoreDNS DNS request size (bytes)
coredns_dns_request_type_count_total	The total number of DNS requests in CoreDNS, categorized by request type.
coredns_dns_requests_total	Total DNS requests handled by CoreDNS
coredns_dns_response_rcode_count_total	Total number of CoreDNS DNS responses by response code
coredns_dns_response_size_bytes_bucket	CoreDNS DNS response size histogram buckets (bytes)
coredns_dns_response_size_bytes_count	CoreDNS DNS response size (bytes) count
coredns_dns_response_size_bytes_sum	The sum of CoreDNS DNS response sizes in bytes
coredns_dns_responses_total	Total number of CoreDNS DNS responses
coredns_forward_conn_cache_hits_total	Total CoreDNS forward connection cache hits.
coredns_forward_conn_cache_misses_total	Total misses in the CoreDNS forward connection cache.
coredns_forward_healthcheck_broken_total	Total number of times all CoreDNS forward upstreams were unhealthy
coredns_forward_healthcheck_failure_count_total	Total count of CoreDNS forwarding health check failures
coredns_forward_healthcheck_failures_total	Total CoreDNS forward health check failures
coredns_forward_max_concurrent_rejects_total	Total number of rejections for CoreDNS forwarding due to maximum concurrency
coredns_forward_request_count_total	Total count of requests forwarded by CoreDNS
coredns_forward_request_duration_seconds_bucket	Histogram buckets for CoreDNS forwarded request duration in seconds.
coredns_forward_request_duration_seconds_count	Count of CoreDNS forward request duration (seconds)
coredns_forward_request_duration_seconds_sum	Total duration of CoreDNS forward requests in seconds.
coredns_forward_requests_total	Total number of requests forwarded by CoreDNS
coredns_forward_response_rcode_count_total	Total count of CoreDNS forwarded response codes
coredns_forward_responses_total	Total number of responses forwarded by CoreDNS
coredns_forward_sockets_open	Number of open sockets for CoreDNS forwarding
coredns_health_request_duration_seconds_bucket	Histogram buckets for CoreDNS health check request duration in seconds
coredns_health_request_duration_seconds_count	Number of CoreDNS health check requests.
coredns_health_request_duration_seconds_sum	Total duration of CoreDNS health check requests in seconds.
coredns_health_request_failures_total	Total number of failed CoreDNS health check requests
coredns_hosts_entries	Number of CoreDNS host entries
coredns_hosts_reload_timestamp_seconds	CoreDNS host reload timestamp (seconds)
coredns_kubernetes_dns_programming_duration_seconds_bucket	CoreDNS Kubernetes DNS programming duration histogram buckets (seconds)
coredns_kubernetes_dns_programming_duration_seconds_count	CoreDNS Kubernetes DNS programming duration count (seconds)
coredns_kubernetes_dns_programming_duration_seconds_sum	Sum of CoreDNS Kubernetes DNS programming duration (seconds)
coredns_local_localhost_requests_total	Total CoreDNS requests to localhost
coredns_panic_count_total	Total CoreDNS panics
coredns_panics_total	Total CoreDNS panic count
coredns_plugin_enabled	CoreDNS plugin status
coredns_reload_failed_total	Total CoreDNS reload failures
coredns_reload_version_info	CoreDNS reload version
coredns_template_matches_total	Total CoreDNS template matches
up	Metric collection connectivity
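
The cache counters in the table above, coredns_cache_hits_total and coredns_cache_misses_total, are most useful when combined into a hit ratio. The following Python sketch shows that calculation on counter increases over a time window. The function name and the sample values are hypothetical.

```python
from typing import Optional


def cache_hit_ratio(hits_delta: float, misses_delta: float) -> Optional[float]:
    """Share of cache lookups served from the cache over a time window.

    Both arguments are increases of the cumulative counters over the
    same window. Returns None when there were no lookups.
    """
    lookups = hits_delta + misses_delta
    if lookups <= 0:
        return None
    return hits_delta / lookups


# Hypothetical increases over a five-minute window.
ratio = cache_hit_ratio(hits_delta=900, misses_delta=100)
print(ratio)
```

A persistently low ratio usually means that most queries are forwarded upstream, which also shows up in coredns_forward_requests_total.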

CSI (cluster dimension) (Job name: k8s-csi-cluster-pv)

Metric	Description
alibaba_cloud_storage_operator_build_info	The build information for the Alibaba Cloud storage operator.
aliyun_prometheus_agent_append_duration_seconds	The duration of the append operation for the Alibaba Cloud Prometheus agent, in seconds.
aliyun_prometheus_agent_job_discovery_status	The discovery status of the scrape job for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrape_custom_error	The number of custom scrape errors for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	The total number of scrapes by target for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_target_info	The target information of the Alibaba Cloud Prometheus agent.
cluster_pv_detail_num_total	The total count of detailed information for cluster PVs.
cluster_pv_status_num_total	The total number of cluster PV statuses.
cluster_pvc_detail_num_total	The total count of detailed information for cluster PVCs.
cluster_pvc_status_num_total	The total number of cluster PVC statuses.
cluster_scrape_collector_duration_seconds	The duration of the cluster scrape collector, in seconds.
cluster_scrape_collector_success	The number of successful attempts by the cluster scrape collector.
up	The connectivity for metric scraping.

CSI (node dimension) (Job name: k8s-csi-node-pv)

Metric	Description
alibaba_cloud_csi_driver_build_info	Alibaba Cloud CSI driver build information
aliyun_prometheus_agent_append_duration_seconds	Alibaba Cloud Prometheus agent append operation duration in seconds
aliyun_prometheus_agent_job_discovery_status	Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_scrape_custom_error	Number of custom scrape errors from the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_scrapes_by_target_total	Total number of scrapes by target from the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_target_info	Target information for the Alibaba Cloud Prometheus agent
cluster_scrape_collector_duration_seconds	Duration of the cluster scrape collector in seconds
cluster_scrape_collector_success	Number of successful cluster scrape collections
container_fs_available_bytes	Available bytes in the container file system
container_fs_inodes_free	Available inodes in the container file system
container_fs_inodes_total	Total inodes in the container file system
container_fs_inodes_used	Used inodes in the container file system
container_fs_limit_bytes	Byte limit for the container file system
container_fs_usage_bytes	Used bytes in the container file system
ephemeral_storage_pod_available_bytes	Available bytes of ephemeral storage for the pod
ephemeral_storage_pod_inodes_free	Available inodes of ephemeral storage for the pod
ephemeral_storage_pod_inodes_total	Total inodes of ephemeral storage for the pod
ephemeral_storage_pod_inodes_used	Used inodes of ephemeral storage for the pod
ephemeral_storage_pod_limit_bytes	Byte limit of ephemeral storage for the pod
ephemeral_storage_pod_usage_bytes	Used bytes of ephemeral storage for the pod
node_volume_backend_posix_access_total_counter	Total POSIX access operations on the node volume backend.
node_volume_backend_posix_getattr_total_counter	Total POSIX getattr calls on the node volume backend.
node_volume_backend_posix_getmode_total_counter	Total POSIX get mode operations on the node volume backend.
node_volume_backend_posix_link_total_counter	Total POSIX link operations on the node volume backend.
node_volume_backend_posix_lookup_total_counter	Total POSIX lookup operations on the node volume backend.
node_volume_backend_posix_mknod_total_counter	Total POSIX mknod operations on the node volume backend.
node_volume_backend_posix_readdir_total_counter	Total POSIX readdir operations on the node volume backend.
node_volume_backend_posix_readlink_total_counter	Total POSIX readlink operations on the node volume backend.
node_volume_backend_posix_remove_total_counter	Total POSIX remove operations on the node volume backend.
node_volume_backend_posix_rename_total_counter	Total POSIX rename operations on the node volume backend.
node_volume_backend_posix_setattr_total_counter	Total POSIX setattr operations on the node volume backend.
node_volume_backend_posix_statfs_total_counter	Total POSIX statfs operations on the node volume backend.
node_volume_backend_read_bytes_total_counter	Total bytes read from the node volume backend.
node_volume_backend_read_completed_total_counter	Total completed read requests on the node volume backend.
node_volume_backend_read_time_milliseconds_total_counter	Total read time in milliseconds on the node volume backend.
node_volume_backend_write_bytes_total_counter	Total bytes written to the node volume backend.
node_volume_backend_write_completed_total_counter	Total completed write requests on the node volume backend.
node_volume_backend_write_time_milliseconds_total_counter	Total write time in milliseconds on the node volume backend.
node_volume_capacity_bytes_available	Available capacity of the node volume in bytes.
node_volume_capacity_bytes_available_counter	Counter for the available capacity of the node volume in bytes.
node_volume_capacity_bytes_total	Total capacity of the node volume in bytes.
node_volume_capacity_bytes_total_counter	Counter for the total capacity of the node volume in bytes.
node_volume_capacity_bytes_used	Used capacity of the node volume in bytes.
node_volume_capacity_bytes_used_counter	Counter for the used capacity of the node volume in bytes.
node_volume_hot_spot_head_file_top	Ranking of hot spot head files on the node volume.
node_volume_hot_spot_read_file_top	Ranking of hot spot read files on the node volume.
node_volume_hot_spot_write_file_top	Ranking of hot spot write files on the node volume.
node_volume_inode_bytes_available_counter	Counter for available bytes for inodes on the node volume.
node_volume_inode_bytes_total_counter	Counter for total bytes for inodes on the node volume.
node_volume_inode_bytes_used_counter	Counter for used bytes for inodes on the node volume.
node_volume_inodes_available	Available inodes on the node volume.
node_volume_inodes_total	Total inodes on the node volume.
node_volume_inodes_used	Used inodes on the node volume.
node_volume_io_now	Current I/O operations on the node volume.
node_volume_io_time_seconds_total	Total I/O time on the node volume in seconds.
node_volume_oss_delete_object_total_counter	Total objects deleted from OSS for the node volume.
node_volume_oss_get_object_total_counter	Total objects retrieved from OSS for the node volume.
node_volume_oss_head_object_total_counter	Total head object operations on OSS for the node volume.
node_volume_oss_post_object_total_counter	Total objects posted to OSS for the node volume.
node_volume_oss_put_object_total_counter	Total objects put to OSS for the node volume.
node_volume_posix_access_total_counter	Total POSIX access operations on the node volume.
node_volume_posix_chmod_total_counter	Total POSIX chmod operations on the node volume.
node_volume_posix_chown_total_counter	Total POSIX chown operations on the node volume.
node_volume_posix_create_total_counter	Total POSIX create operations on the node volume.
node_volume_posix_flush_total_counter	Total POSIX flush operations on the node volume.
node_volume_posix_fsync_total_counter	Total POSIX fsync operations on the node volume.
node_volume_posix_mkdir_total_counter	Total POSIX mkdir operations on the node volume.
node_volume_posix_open_total_counter	Total POSIX open operations on the node volume.
node_volume_posix_opendir_total_counter	Total POSIX opendir operations on the node volume.
node_volume_posix_read_total_counter	Total POSIX read operations on the node volume.
node_volume_posix_readdir_total_counter	Total POSIX readdir operations on the node volume.
node_volume_posix_release_total_counter	Total POSIX release operations on the node volume.
node_volume_posix_rename_total_counter	Total POSIX rename operations on the node volume.
node_volume_posix_rmdir_total_counter	Total POSIX rmdir operations on the node volume.
node_volume_posix_truncate_total_counter	Total POSIX truncate operations on the node volume.
node_volume_posix_write_total_counter	Total POSIX write operations on the node volume.
node_volume_read_bytes_total	Total bytes read from the node volume.
node_volume_read_bytes_total_counter	Counter for the total bytes read from the node volume.
node_volume_read_completed_total	Total completed read operations on the node volume.
node_volume_read_completed_total_counter	Counter for total completed read operations on the node volume.
node_volume_read_merged_total	Total merged read operations on the node volume.
node_volume_read_queue_time_milliseconds_total	Total time spent in the read queue on the node volume, in milliseconds.
node_volume_read_rtt_time_milliseconds_total	Total round trip time for read operations on the node volume, in milliseconds.
node_volume_read_sent_bytes_total	Total bytes sent for read operations on the node volume.
node_volume_read_time_milliseconds_total	Total time for read operations on the node volume, in milliseconds.
node_volume_read_time_milliseconds_total_counter	Counter for the total time for read operations on the node volume, in milliseconds.
node_volume_read_timeouts_total	Total read timeouts on the node volume.
node_volume_read_transmissions_total	Total read transmissions on the node volume.
node_volume_vg_free_bytes	Free bytes in the node volume group (VG).
node_volume_vg_size_bytes	Total size of the node volume group (VG) in bytes.
node_volume_write_bytes_total	Total bytes written to the node volume.
node_volume_write_bytes_total_counter	Counter for the total bytes written to the node volume.
node_volume_write_completed_total	Total completed write operations on the node volume.
node_volume_write_completed_total_counter	Counter for total completed write operations on the node volume.
node_volume_write_merged_total	Total merged write operations on the node volume.
node_volume_write_queue_time_milliseconds_total	Total time spent in the write queue on the node volume, in milliseconds.
node_volume_write_recv_bytes_total	Total bytes received for write operations on the node volume.
node_volume_write_rtt_time_milliseconds_total	Total round trip time for write operations on the node volume, in milliseconds.
node_volume_write_time_milliseconds_total	Total time for write operations on the node volume, in milliseconds.
node_volume_write_time_milliseconds_total_counter	Counter for the total time for write operations on the node volume, in milliseconds.
node_volume_write_timeouts_total	Total write timeouts on the node volume.
node_volume_write_transmissions_total	Total write transmissions on the node volume.
up	Connectivity for metric scraping.

GPU-Exporter (Job name: gpu-exporter)

Metric	Description
DCGM_CUSTOM_ALLOCATE_MODE	The GPU allocation mode of the node. Valid values: 0 (None): no GPU pods are running on the node. 1 (Exclusive): GPU pods on the node run in exclusive mode. 2 (Share): GPU pods on the node run in shared mode.
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED	The ratio of the computing power allocated to a container to the total computing power of the GPU card. The value ranges from 0 to 1. For an exclusive GPU, or a shared GPU for which only GPU memory is requested, the value is 0, which means that computing power is not limited. For example, if a GPU card has 100 units of computing power and 30 units are allocated to a container, the allocated computing power ratio is 30/100 = 0.3.
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED	The GPU memory allocated to the container.
DCGM_CUSTOM_DEV_FB_ALLOCATED	The percentage of total GPU memory that is allocated. The value ranges from 0 to 1.
DCGM_CUSTOM_DEV_FB_TOTAL	Indicates the total GPU memory of the GPU.
DCGM_CUSTOM_ILLEGAL_PROCESS_DECODE_UTIL	Illegal process decode utilization
DCGM_CUSTOM_ILLEGAL_PROCESS_ENCODE_UTIL	Illegal process encoding utilization
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_COPY_UTIL	Illegal process memory copy utilization
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_USED	Memory used by illegal process
DCGM_CUSTOM_ILLEGAL_PROCESS_SM_UTIL	Illegal process Streaming Multiprocessor (SM) utilization
DCGM_CUSTOM_PROCESS_DECODE_UTIL	Indicates the decoder utilization of the GPU thread.
DCGM_CUSTOM_PROCESS_ENCODE_UTIL	The encoder utilization of the GPU thread.
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL	Indicates the memory copy utilization of GPU threads.
DCGM_CUSTOM_PROCESS_MEM_USED	The GPU memory currently used by the GPU thread.
DCGM_CUSTOM_PROCESS_SM_UTIL	The SM utilization of GPU threads.
DCGM_FI_DEV_APP_MEM_CLOCK	The application memory clock speed.
DCGM_FI_DEV_APP_SM_CLOCK	The SM application clock frequency.
DCGM_FI_DEV_BAR1_FREE	Indicates the free BAR1 memory.
DCGM_FI_DEV_BAR1_TOTAL	Total size of Base Address Register 1 (BAR1), which maps GPU memory to the system address space.
DCGM_FI_DEV_BAR1_USED	The amount of used BAR1.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION	Indicates a violation due to the board limit. The value is the duration of the violation.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS	The reasons for clock throttling.
DCGM_FI_DEV_COUNT	Number of devices
DCGM_FI_DEV_DEC_UTIL	Indicates the decoder utilization.
DCGM_FI_DEV_ENC_UTIL	Indicates the encoder utilization.
DCGM_FI_DEV_FB_FREE	The amount of free framebuffer memory.
DCGM_FI_DEV_FB_USED	The amount of used framebuffer memory. This value corresponds to the used value for Memory-Usage from the nvidia-smi command.
DCGM_FI_DEV_GPU_TEMP	Indicates the GPU temperature.
DCGM_FI_DEV_GPU_UTIL	Indicates GPU utilization: the percentage of time during a sampling period in which one or more kernel functions are active. The sampling period is 1 s or 1/6 s, depending on the GPU product. This metric shows only that a kernel function is using the GPU. It does not show how the GPU is used.
DCGM_FI_DEV_LOW_UTIL_VIOLATION	A violation triggered by the low utilization limit. The value is the duration of the violation.
DCGM_FI_DEV_MEM_CLOCK	The memory clock frequency.
DCGM_FI_DEV_MEM_COPY_UTIL	Indicates the memory bandwidth utilization. For example, an NVIDIA V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%.
DCGM_FI_DEV_MEMORY_TEMP	Indicates the memory temperature.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL	Total NVLINK bandwidth
DCGM_FI_DEV_PCIE_REPLAY_COUNTER	PCIe replay counter (records the number of retries due to data transmission errors)
DCGM_FI_DEV_POWER_USAGE	Indicates the power usage.
DCGM_FI_DEV_POWER_VIOLATION	Indicates a violation caused by the power limit. The value is the duration of the violation.
DCGM_FI_DEV_PSTATE	Device power state
DCGM_FI_DEV_RELIABILITY_VIOLATION	Indicates a violation caused by the board's reliability limit. The value is the duration of the violation.
DCGM_FI_DEV_RETIRED_DBE	Indicates pages retired due to a double-bit error.
DCGM_FI_DEV_RETIRED_PENDING	Number of pages pending retirement (pages in GPU memory marked as unusable due to faults)
DCGM_FI_DEV_RETIRED_SBE	Indicates pages retired due to a single-bit error.
DCGM_FI_DEV_SM_CLOCK	Indicates the SM clock frequency.
DCGM_FI_DEV_SYNC_BOOST_VIOLATION	Indicates the duration of a violation caused by a sync boost limit.
DCGM_FI_DEV_THERMAL_VIOLATION	Indicates a thermal violation. The value is the duration of the violation.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	The total energy consumed since the driver was loaded.
DCGM_FI_DEV_VIDEO_CLOCK	Video clock frequency
DCGM_FI_DEV_XID_ERRORS	The error number of the most recent XID error that occurred over a period of time.
DCGM_FI_PROF_DRAM_ACTIVE	The fraction of cycles that the device memory is active sending or receiving data. This metric measures Memory Bandwidth Utilization. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means that one DRAM instruction is executed in every cycle during the time interval. In practice, the maximum achievable peak value is approximately 0.8 (80%). For example, a value of 0.2 (20%) means that the device memory is read from or written to during 20% of the cycles in the time interval.
DCGM_FI_PROF_GR_ENGINE_ACTIVE	Indicates the percentage of time that a graphics or compute engine is active over a time interval. This value is the average for all graphics and compute engines. An engine is considered active if a graphics or compute Context is attached to a thread and the Context is busy.
DCGM_FI_PROF_NVLINK_RX_BYTES	The rate of data received over NVLink, excluding protocol headers. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, whether the data is transferred at a constant rate or in a burst. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
DCGM_FI_PROF_NVLINK_TX_BYTES	Total bytes sent over NVLink
DCGM_FI_PROF_PCIE_RX_BYTES	The rate of data received over the PCIe bus, including protocol headers and data payloads. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, whether the data is transferred at a constant rate or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per channel.
DCGM_FI_PROF_PCIE_TX_BYTES	The rate of data sent over the PCIe bus, including protocol headers and data payloads. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is sent in 1 second, the rate is 1 GB/s, whether the data is sent at a constant rate or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per channel.
DCGM_FI_PROF_PIPE_FP16_ACTIVE	The fraction of cycles that the FP16 (half-precision) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP16 Cores. A value of 1 (100%) means that an FP16 instruction is executed every two cycles for the entire time interval (for example, on a Volta-based GPU). A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP16 Cores at 100% utilization for the entire time interval. All SMs run their FP16 Cores at 20% utilization for the entire time interval. All SMs run their FP16 Cores at 100% utilization for one-fifth of the time interval. Other combinations.
DCGM_FI_PROF_PIPE_FP32_ACTIVE	The fraction of cycles that the Fused Multiply-Add (FMA) pipeline is active. FMA operations include both single-precision (FP32) and integer types. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP32 Cores. A value of 1 (100%) means that an FP32 instruction is executed every two cycles for the entire time interval (for example, on a Volta-based GPU). A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP32 Cores at 100% utilization for the entire time interval. All SMs run their FP32 Cores at 20% utilization for the entire time interval. All SMs run their FP32 Cores at 100% utilization for one-fifth of the time interval. Other combinations.
DCGM_FI_PROF_PIPE_FP64_ACTIVE	The fraction of cycles that the FP64 (double-precision) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP64 Cores. A value of 1 (100%) means that an FP64 instruction is executed every four cycles for the entire time interval (for example, on a Volta-based GPU). A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP64 Cores at 100% utilization for the entire time interval. All SMs run their FP64 Cores at 20% utilization for the entire time interval. All SMs run their FP64 Cores at 100% utilization for one-fifth of the time interval. Other combinations.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	The fraction of cycles that the Tensor (HMMA/IMMA) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the Tensor Cores. A value of 1 (100%) means that a Tensor instruction is issued every other instruction cycle for the entire time interval, because one instruction takes two cycles to complete. A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their Tensor Cores at 100% utilization for the entire time interval. All SMs run their Tensor Cores at 20% utilization for the entire time interval. All SMs run their Tensor Cores at 100% utilization for one-fifth of the time interval. Other combinations.
DCGM_FI_PROF_SM_ACTIVE	The percentage of time within an interval that at least one warp is active on a Streaming Multiprocessor (SM). This value is the average across all SMs and is not sensitive to the number of threads per block. A warp is active when it has been scheduled and allocated resources. An active warp can be in a computing or a non-computing state, such as waiting for a memory request. A value below 0.5 indicates that the GPU is underutilized, while a value above 0.8 is necessary for high efficiency. Assume a GPU has N SMs. If a kernel function uses N thread blocks and runs on all N SMs for the entire interval, the value is 1 (100%). If a kernel function runs with N/5 thread blocks during the interval, the value is 0.2. If a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2.
DCGM_FI_PROF_SM_OCCUPANCY	The ratio of active warps to the maximum number of resident warps on a Streaming Multiprocessor (SM). This value is the average across all SMs over a time interval. A higher occupancy does not necessarily mean higher GPU utilization. Higher occupancy indicates more effective GPU utilization only for workloads that are limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE).
nvidia_gpu_allocated_num_devices	The number of allocated GPU devices. Warning: This metric will be deprecated.
nvidia_gpu_memory_allocated_bytes	The allocated memory on the GPU device. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.
nvidia_gpu_sharing_memory	The memory allocated for GPU sharing. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.
up	Connectivity for metric collection
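
The worked examples in the GPU metric descriptions above are simple ratios. The following Python sketch reproduces them for DCGM_CUSTOM_CONTAINER_CP_ALLOCATED, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_PROF_SM_ACTIVE, and the averaged transfer rates. The function names are hypothetical and exist only for illustration; they are not part of GPU-Exporter or any SDK, and the SM model (fraction of busy SMs multiplied by fraction of busy time) is a simplifying assumption.

```python
def allocated_compute_ratio(allocated_units: float, total_units: float) -> float:
    """Share of a GPU card's computing power allocated to one container (0 to 1)."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return allocated_units / total_units


def memory_bandwidth_utilization(current_gb_per_s: float, max_gb_per_s: float) -> float:
    """Memory bandwidth utilization as a percentage of the maximum bandwidth."""
    if max_gb_per_s <= 0:
        raise ValueError("max_gb_per_s must be positive")
    return 100.0 * current_gb_per_s / max_gb_per_s


def sm_active(busy_sm_fraction: float, busy_time_fraction: float) -> float:
    """Average SM activity, assuming a fixed fraction of SMs is busy
    for a fixed fraction of the interval (illustrative model only)."""
    return busy_sm_fraction * busy_time_fraction


def average_rate(gigabytes: float, seconds: float) -> float:
    """Average transfer rate in GB/s over an interval, constant or bursty."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gigabytes / seconds


print(allocated_compute_ratio(30, 100))        # 0.3
print(memory_bandwidth_utilization(450, 900))  # 50.0
print(sm_active(1 / 5, 1.0))                   # N/5 thread blocks for the whole interval
print(sm_active(1.0, 1 / 5))                   # N thread blocks for 1/5 of the interval
print(average_rate(1.0, 1.0))                  # 1.0
```

Both sm_active calls give the same result, which is why a single averaged value cannot distinguish between the scenarios listed in the descriptions.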

Cost-Exporter (Job name: alibaba-cloud-cost-exporter)

Metric	Description
deducted_by_cash_coupons	The amount deducted by coupons from a bill for the current instance.
deducted_by_prepaid_card	The amount deducted by a prepaid card from a bill for the current instance.
invoice_discount	The discount amount for a bill of the current instance.
list_price	The unit price for a bill of the current instance.
node_current_price	The actual price of the current node.
node_payAsYouGo_price	The pay-as-you-go price of the current node.
node_payByPeriod_price	The subscription price of the current node.
node_spot_price	The price of the current node, based on the pricing of a Spot Instance with the same specifications.
outstanding_amount	The outstanding amount for a bill of the current instance.
payent_amount	The cash payment amount for a bill of the current instance.
pretax_amount	The amount payable for a bill of the current instance.
pretax_gross_amount	The original amount for a bill of the current instance.
usage	The resource usage for a bill of the current instance.
up	The connectivity for metric collection.

Ingress (Job name: arms-ack-ingress or ingress-ask-default)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	The duration of an append operation by the Alibaba Cloud Prometheus agent (in seconds).
aliyun_prometheus_agent_job_discovery_status	Status of scrape job discovery for the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_scrape_custom_error	The number of custom scrape errors for the Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	Total number of scrapes by the Alibaba Cloud Prometheus agent per Target
aliyun_prometheus_agent_target_info	Target information for the Alibaba Cloud Prometheus agent
nginx_ingress_controller_admission_config_size	Nginx Ingress controller - Admission configuration size
nginx_ingress_controller_admission_render_duration	Nginx Ingress controller - Rendering duration
nginx_ingress_controller_admission_render_ingresses	Nginx Ingress controller - Rendered Ingress count
nginx_ingress_controller_admission_roundtrip_duration	Nginx Ingress controller - Roundtrip processing duration
nginx_ingress_controller_admission_tested_duration	Nginx Ingress controller - Test duration
nginx_ingress_controller_admission_tested_ingresses	Nginx Ingress controller - Number of Ingresses tested
nginx_ingress_controller_build_info	Nginx Ingress controller - Build information
nginx_ingress_controller_bytes_sent_bucket	Nginx Ingress controller - Total bytes sent (bucket)
nginx_ingress_controller_bytes_sent_count	Nginx Ingress controller - Total bytes sent (count)
nginx_ingress_controller_bytes_sent_sum	Nginx Ingress controller - Sent bytes total (Sum)
nginx_ingress_controller_check_errors	Nginx Ingress controller - Check errors
nginx_ingress_controller_check_success	Nginx Ingress controller - Successful check count
nginx_ingress_controller_config_hash	Nginx Ingress controller - Configuration hash
nginx_ingress_controller_config_last_reload_successful	Nginx Ingress controller - Whether the last configuration reload was successful
nginx_ingress_controller_config_last_reload_successful_timestamp_seconds	Nginx Ingress controller - Timestamp of the last successful configuration reload (seconds)
nginx_ingress_controller_connect_duration_seconds_bucket	Nginx Ingress controller - Connection duration (seconds) - Bucket
nginx_ingress_controller_connect_duration_seconds_count	Nginx Ingress controller - Connection duration (seconds) - Count
nginx_ingress_controller_connect_duration_seconds_sum	Nginx Ingress controller - Connection duration (seconds) - Sum
nginx_ingress_controller_errors	Nginx Ingress controller - Error count
nginx_ingress_controller_header_duration_seconds_bucket	Nginx Ingress controller - Header processing time (seconds) - Bucket
nginx_ingress_controller_header_duration_seconds_count	Nginx Ingress controller - Header processing time (seconds) - Count
nginx_ingress_controller_header_duration_seconds_sum	Total header processing time for the Nginx Ingress controller (seconds)
nginx_ingress_controller_ingress_upstream_latency_seconds	Nginx Ingress controller upstream latency (seconds)
nginx_ingress_controller_ingress_upstream_latency_seconds_count	Nginx Ingress controller upstream latency count
nginx_ingress_controller_ingress_upstream_latency_seconds_sum	Nginx Ingress controller upstream latency sum (seconds)
nginx_ingress_controller_leader_election_status	Nginx Ingress controller leader election status
nginx_ingress_controller_nginx_process_connections	Nginx Ingress controller nginx process connections
nginx_ingress_controller_nginx_process_connections_total	Total connections for the nginx process in the Nginx Ingress controller
nginx_ingress_controller_nginx_process_cpu_seconds_total	Total CPU seconds for the Nginx Ingress controller's nginx process
nginx_ingress_controller_nginx_process_num_procs	Number of Nginx processes for the Nginx Ingress controller
nginx_ingress_controller_nginx_process_oldest_start_time_seconds	Start time of the oldest nginx process in the Nginx Ingress controller (seconds)
nginx_ingress_controller_nginx_process_read_bytes_total	Total bytes read by the nginx process of the Nginx Ingress controller
nginx_ingress_controller_nginx_process_requests_total	Total requests for the Nginx Ingress controller's nginx process
nginx_ingress_controller_nginx_process_resident_memory_bytes	Resident memory size (bytes) of the nginx process for the Nginx Ingress controller
nginx_ingress_controller_nginx_process_virtual_memory_bytes	Virtual memory of the nginx process for the Nginx Ingress controller in bytes
nginx_ingress_controller_nginx_process_write_bytes_total	Total bytes written by the nginx process of the Nginx Ingress controller
nginx_ingress_controller_orphan_ingress	Number of orphan Ingresses for the Nginx Ingress controller
nginx_ingress_controller_request_duration_seconds_bucket	Nginx Ingress controller request latency distribution (seconds)
nginx_ingress_controller_request_duration_seconds_count	Nginx Ingress controller request duration count
nginx_ingress_controller_request_duration_seconds_sum	Sum of Nginx Ingress controller request time (seconds)
nginx_ingress_controller_request_size_bucket	Nginx Ingress controller request size distribution
nginx_ingress_controller_request_size_count	Nginx Ingress controller request size count
nginx_ingress_controller_request_size_sum	Nginx Ingress controller total request size
nginx_ingress_controller_requests	Total Nginx Ingress controller requests
nginx_ingress_controller_response_duration_seconds_bucket	Nginx Ingress controller response time distribution (seconds)
nginx_ingress_controller_response_duration_seconds_count	Nginx Ingress controller response time count
nginx_ingress_controller_response_duration_seconds_sum	Total Nginx Ingress controller response time (seconds)
nginx_ingress_controller_response_size_bucket	Nginx Ingress controller response size distribution
nginx_ingress_controller_response_size_count	Nginx Ingress controller response size count
nginx_ingress_controller_response_size_sum	Total Nginx Ingress controller response size
nginx_ingress_controller_ssl_certificate_info	Nginx Ingress controller SSL certificate information
nginx_ingress_controller_ssl_expire_time_seconds	Nginx Ingress controller SSL certificate expiration time (seconds)
nginx_ingress_controller_success	Nginx Ingress controller success count
up	Metric collection connectivity
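
Several Ingress metrics above come in _bucket, _count, and _sum triplets, which is the standard Prometheus histogram layout. As a minimal sketch, the mean value over a time window can be estimated by dividing the increase of the _sum series by the increase of the _count series. The function name and the sample values below are hypothetical and are used only for illustration.

```python
from typing import Optional


def mean_from_histogram(sum_start: float, sum_end: float,
                        count_start: float, count_end: float) -> Optional[float]:
    """Mean observation over a window, from cumulative _sum and _count samples."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No new observations in the window, or the counters were reset.
        return None
    return (sum_end - sum_start) / delta_count


# Made-up samples of nginx_ingress_controller_request_duration_seconds_sum
# and nginx_ingress_controller_request_duration_seconds_count taken at the
# start and end of a window.
mean_seconds = mean_from_histogram(sum_start=120.0, sum_end=150.0,
                                   count_start=1000, count_end=1200)
print(mean_seconds)  # 0.15
```

Returning None instead of 0 when there are no new observations avoids reporting a misleading zero latency for an idle Ingress.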

Koordinator (Job names: kube-system/koordlet-metrics-podmonitor, koord-manager-metrics-service)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.
aliyun_prometheus_agent_scrapes_by_target_total	The total number of scrapes by the Alibaba Cloud Prometheus agent, per target.
aliyun_prometheus_agent_target_info	The target information for the Alibaba Cloud Prometheus agent.
koord_manager_recommender_recommendation_workload_target	The metric for recommended workload specifications from the resource profiling feature.
koordlet_container_resource_limits	The metric for container resource limits.
koordlet_container_resource_requests	The metric for container resource requests.
koordlet_node_priority_resource_reclaimable	The metric for node resource priority.
koordlet_node_resource_allocatable	The metric for allocatable resources on a node.
slo_manager_recommender_recommendation_workload_target	The metric for recommended workload specifications from the resource profiling feature. (Deprecated)
up	The connectivity for metric scraping.

ACK dedicated etcd component (Job name: etcd)

Metric	Description
aliyun_prometheus_agent_append_duration_seconds	Duration of the append operation for the Alibaba Cloud Prometheus agent (seconds)
aliyun_prometheus_agent_job_discovery_status	Status of scrape job discovery for the Alibaba Cloud Prometheus agent
aliyun_prometheus_agent_scrape_custom_error	The number of errors from custom scrapes by the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	The total number of scrapes by target for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_target_info	Target information for an Alibaba Cloud Prometheus agent
cpu_utilization_core	CPU core utilization
etcd_cluster_version	The version of the etcd cluster.
etcd_debugging_auth_revision	etcd debug authentication revision
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket	Etcd debugging disk backend commit rebalance duration distribution (seconds)
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_count	The count of commit rebalance duration observations for the etcd debugging disk backend.
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_sum	Total commit rebalance duration for the etcd debug disk backend (seconds)
etcd_debugging_disk_backend_commit_spill_duration_seconds_bucket	The distribution of commit spill duration for the etcd debugging disk backend
etcd_debugging_disk_backend_commit_spill_duration_seconds_count	The total number of commit spills for the etcd debug disk backend.
etcd_debugging_disk_backend_commit_spill_duration_seconds_sum	Sum of the commit spill duration for the etcd debugging disk backend (seconds)
etcd_debugging_disk_backend_commit_write_duration_seconds_bucket	Etcd debug disk backend commit write duration distribution (seconds)
etcd_debugging_disk_backend_commit_write_duration_seconds_count	The total number of write commits to the etcd debug disk backend.
etcd_debugging_disk_backend_commit_write_duration_seconds_sum	The total duration of commit writes to the etcd debug disk backend, in seconds.
etcd_debugging_lease_granted_total	Total number of leases granted for etcd debugging
etcd_debugging_lease_renewed_total	The total number of etcd debugging lease renewals
etcd_debugging_lease_revoked_total	Total number of etcd debugging leases revoked.
etcd_debugging_lease_ttl_total_bucket	Distribution of etcd debug lease TTLs
etcd_debugging_lease_ttl_total_count	Total count of etcd debug lease TTLs
etcd_debugging_lease_ttl_total_sum	etcd lease TTL sum (seconds)
etcd_debugging_mvcc_compact_revision	etcd MVCC compaction revision for debugging
etcd_debugging_mvcc_current_revision	Current MVCC revision for etcd debugging
etcd_debugging_mvcc_db_compaction_keys_total	Total keys compacted in the etcd MVCC database for debugging
etcd_debugging_mvcc_db_compaction_last	Last compaction time of the etcd MVCC database for debugging.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket	The bucket for the pause duration in milliseconds during etcd MVCC database compaction for debugging.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count	The count of pause durations (in milliseconds) during MVCC database compaction for etcd debugging.
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum	Sum of pause durations for etcd MVCC database compaction during debugging (milliseconds).
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket	Distribution of the total duration of MVCC database compaction for etcd debugging (in milliseconds)
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count	The count of total-duration observations for etcd debug MVCC database compactions.
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sum	Sum of the total duration of etcd MVCC database compaction for debugging (milliseconds)
etcd_debugging_mvcc_db_total_size_in_bytes	Total size of the etcd debug MVCC database in bytes
etcd_debugging_mvcc_delete_total	Total MVCC delete operations for etcd debugging
etcd_debugging_mvcc_events_total	Total number of MVCC events for etcd debugging
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket	The bucket for the etcd debugging MVCC index compaction pause duration in milliseconds.
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_count	Count of etcd debug MVCC index compaction pauses.
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sum	The sum of pause durations in milliseconds for etcd MVCC index compaction during debugging.
etcd_debugging_mvcc_keys_total	The total number of MVCC keys for etcd debugging.
etcd_debugging_mvcc_pending_events_total	Total number of pending MVCC events for etcd debugging
etcd_debugging_mvcc_put_total	Total number of MVCC put operations for debugging etcd
etcd_debugging_mvcc_range_total	Total etcd MVCC range queries
etcd_debugging_mvcc_slow_watcher_total	Total number of slow watchers for etcd debugging
etcd_debugging_mvcc_total_put_size_in_bytes	Total MVCC put size for etcd debugging (bytes)
etcd_debugging_mvcc_txn_total	Total Multi-Version Concurrency Control (MVCC) transactions for etcd debugging
etcd_debugging_mvcc_watch_stream_total	Total etcd debug MVCC watch streams
etcd_debugging_mvcc_watcher_total	Total number of etcd debug watchers
etcd_debugging_server_lease_expired_total	Total expired leases for the etcd debugging server.
etcd_debugging_snap_save_marshalling_duration_seconds_bucket	Distribution of marshalling durations when saving etcd debug snapshots
etcd_debugging_snap_save_marshalling_duration_seconds_count	The count of marshalling operations for saving an etcd debug snapshot. The duration is measured in seconds.
etcd_debugging_snap_save_marshalling_duration_seconds_sum	The total time in seconds spent marshalling debugging snapshots for saving.
etcd_debugging_snap_save_total_duration_seconds_bucket	The total time it takes to save an etcd debug snapshot, in seconds, by bucket.
etcd_debugging_snap_save_total_duration_seconds_count	Total count of etcd debug snapshot save operations (duration in seconds)
etcd_debugging_snap_save_total_duration_seconds_sum	The total time, in seconds, spent saving etcd debug snapshots.
etcd_debugging_store_expires_total	Total number of etcd debugging store expirations.
etcd_debugging_store_reads_total	Total debug store reads in etcd.
etcd_debugging_store_watch_requests_total	The total number of watch requests for the etcd debug store.
etcd_debugging_store_watchers	Number of etcd debugging store watchers
etcd_debugging_store_writes_total	Total etcd debug store writes
etcd_disk_backend_commit_duration_seconds_bucket	etcd disk backend commit duration bucket (seconds)
etcd_disk_backend_commit_duration_seconds_count	The total number of etcd disk backend commits.
etcd_disk_backend_commit_duration_seconds_sum	Total duration of etcd disk backend commits, in seconds.
etcd_disk_backend_defrag_duration_seconds_bucket	Distribution of etcd disk backend defragmentation duration (seconds)
etcd_disk_backend_defrag_duration_seconds_count	The total number of etcd disk backend defragmentations.
etcd_disk_backend_defrag_duration_seconds_sum	The sum of etcd disk backend defragmentation durations, in seconds.
etcd_disk_backend_snapshot_duration_seconds_bucket	Distribution of etcd disk backend snapshot duration (seconds)
etcd_disk_backend_snapshot_duration_seconds_count	The total count of timed etcd disk backend snapshots.
etcd_disk_backend_snapshot_duration_seconds_sum	Total duration of etcd disk backend snapshots in seconds.
etcd_disk_defrag_inflight	etcd disk defragmentation in progress
etcd_disk_wal_fsync_duration_seconds_bucket	etcd disk WAL fsync duration seconds bucket
etcd_disk_wal_fsync_duration_seconds_count	The total number of etcd disk WAL fsync operations.
etcd_disk_wal_fsync_duration_seconds_sum	Sum of the etcd disk WAL fsync duration in seconds.
etcd_disk_wal_write_bytes_total	Total bytes written to the etcd disk WAL
etcd_grpc_proxy_cache_hits_total	Total number of etcd gRPC proxy cache hits
etcd_grpc_proxy_cache_keys_total	The total number of etcd gRPC proxy cache keys.
etcd_grpc_proxy_cache_misses_total	Total etcd gRPC proxy cache misses
etcd_grpc_proxy_events_coalescing_total	Total number of events merged by the etcd gRPC proxy
etcd_grpc_proxy_watchers_coalescing_total	Total number of coalesced watchers in the etcd gRPC proxy.
etcd_mvcc_db_open_read_transactions	The number of open read transactions in the etcd MVCC database.
etcd_mvcc_db_total_size_in_bytes	Total size of the etcd MVCC database (bytes)
etcd_mvcc_db_total_size_in_use_in_bytes	The total size in use of the etcd MVCC database, in bytes.
etcd_mvcc_delete_total	Total etcd MVCC deletes
etcd_mvcc_hash_duration_seconds_bucket	Bucket for etcd MVCC hash duration in seconds.
etcd_mvcc_hash_duration_seconds_count	Count of etcd MVCC hash durations (seconds)
etcd_mvcc_hash_duration_seconds_sum	Total etcd MVCC hash duration in seconds
etcd_mvcc_hash_rev_duration_seconds_bucket	etcd MVCC hash revision duration distribution (seconds)
etcd_mvcc_hash_rev_duration_seconds_count	The count of etcd MVCC hash revision durations in seconds.
etcd_mvcc_hash_rev_duration_seconds_sum	Sum of etcd MVCC hash revision duration, in seconds
etcd_mvcc_put_total	The total number of etcd MVCC Put operations
etcd_mvcc_range_total	Total number of etcd MVCC range queries
etcd_mvcc_txn_total	Total etcd multiversion concurrency control transactions
etcd_network_active_peers	Number of active etcd network peers
etcd_network_client_grpc_received_bytes_total	Total number of bytes that etcd received from gRPC clients
etcd_network_client_grpc_sent_bytes_total	The total number of bytes that etcd sent to gRPC clients.
etcd_network_disconnected_peers_total	Total number of disconnected peers in the etcd network
etcd_network_peer_received_bytes_total	Total bytes received by the etcd network peer
etcd_network_peer_received_failures_total	Total number of failed receives from etcd network peers
etcd_network_peer_round_trip_time_seconds_bucket	etcd network peer round-trip time distribution (seconds)
etcd_network_peer_round_trip_time_seconds_count	Count of round trip times in seconds for etcd network peers
etcd_network_peer_round_trip_time_seconds_sum	Total round trip time in seconds for etcd network peers
etcd_network_peer_sent_bytes_total	Total bytes sent to etcd peers
etcd_network_peer_sent_failures_total	Total etcd network peer send failures
etcd_network_server_stream_failures_total	Total number of etcd network server stream failures
etcd_network_snapshot_receive_inflights_total	The number of concurrent requests to receive etcd network snapshots.
etcd_network_snapshot_receive_success	The number of etcd network snapshots received successfully.
etcd_network_snapshot_receive_total_duration_seconds_bucket	Distribution bucket for the total duration, in seconds, of receiving etcd network snapshots.
etcd_network_snapshot_receive_total_duration_seconds_count	The total count of etcd network snapshot receive operations.
etcd_network_snapshot_receive_total_duration_seconds_sum	Total time spent receiving etcd network snapshots, in seconds.
etcd_network_snapshot_send_inflights_total	The number of concurrent requests for sending etcd network snapshots.
etcd_network_snapshot_send_success	The number of etcd network snapshots sent successfully.
etcd_network_snapshot_send_total_duration_seconds_bucket	Total duration distribution for sending etcd network snapshots (seconds)
etcd_network_snapshot_send_total_duration_seconds_count	Total number of etcd network snapshot send operations.
etcd_network_snapshot_send_total_duration_seconds_sum	Sum of the total duration for sending etcd network snapshots, in seconds.
etcd_server_apply_duration_seconds_bucket	etcd server apply duration distribution (seconds)
etcd_server_apply_duration_seconds_count	Count of apply operations for the etcd server
etcd_server_apply_duration_seconds_sum	The total time, in seconds, that the etcd server has spent applying requests.
etcd_server_client_requests_total	Total number of client requests to the etcd server
etcd_server_go_version	The Go version of the etcd server
etcd_server_has_leader	Whether the etcd server has a leader (1 if a leader exists, 0 if not).
etcd_server_health_failures	Number of etcd server health check failures
etcd_server_health_success	Number of successful etcd server health checks
etcd_server_heartbeat_send_failures_total	Total number of failed heartbeat sends from the etcd server
etcd_server_id	etcd server ID
etcd_server_is_leader	Whether the etcd server is the leader
etcd_server_is_learner	Whether the etcd server is a learner
etcd_server_leader_changes_seen_total	The total number of leader changes seen by the etcd server.
etcd_server_learner_promote_successes	The number of successful learner promotions in the etcd server.
etcd_server_proposals_applied_total	Total proposals applied on the etcd server
etcd_server_proposals_committed_total	Total number of proposals committed by the etcd server
etcd_server_proposals_failed_total	Total number of failed etcd server proposals
etcd_server_proposals_pending	Number of pending etcd server proposals
etcd_server_quota_backend_bytes	The backend storage quota for the etcd server in bytes.
etcd_server_read_indexes_failed_total	Total number of failed index reads on the etcd server.
etcd_server_slow_apply_total	Total slow applies on the etcd server
etcd_server_slow_read_indexes_total	The total number of slow read indexes for the etcd server.
etcd_server_snapshot_apply_in_progress_total	Total etcd server snapshot applications in progress
etcd_server_version	etcd server version
etcd_snap_db_fsync_duration_seconds_bucket	Distribution of fsync duration for the etcd snapshot database (seconds).
etcd_snap_db_fsync_duration_seconds_count	Total fsync count for the etcd snapshot database
etcd_snap_db_fsync_duration_seconds_sum	Total fsync duration for the etcd snapshot database, in seconds.
etcd_snap_db_save_total_duration_seconds_bucket	The bucket for the total duration, in seconds, to save the etcd snapshot database.
etcd_snap_db_save_total_duration_seconds_count	Total count of etcd snapshot database save operations
etcd_snap_db_save_total_duration_seconds_sum	Sum of the total save duration of the etcd snapshot database (seconds)
etcd_snap_fsync_duration_seconds_bucket	etcd snapshot fsync duration distribution (seconds)
etcd_snap_fsync_duration_seconds_count	Total count of etcd snapshot fsync operations
etcd_snap_fsync_duration_seconds_sum	etcd snapshot fsync total duration (seconds)
grpc_server_handled_total	Total gRPC server requests processed
grpc_server_msg_received_total	Total messages received by the gRPC server
grpc_server_msg_sent_total	Total gRPC server messages sent
grpc_server_started_total	Total number of RPCs started on the gRPC server
memory_utilization_byte	Memory utilization in bytes
os_fd_limit	Operating system file descriptor limit
os_fd_used	Number of operating system file descriptors in use
up	Connectivity for metric collection
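
Many metrics in this table come as `_bucket`, `_count`, and `_sum` series, which is the standard Prometheus histogram layout. As a minimal sketch, assuming two scrapes of a `_sum` series and its matching `_count` series, the average duration per operation in that window is the ratio of the two increases. The function name and the sample values below are made up for illustration and are not real etcd data.

```python
def mean_duration(sum_start, sum_end, count_start, count_end):
    """Average observed duration between two scrapes of a histogram.

    Inputs are hypothetical samples of a *_sum series (seconds) and the
    matching *_count series (number of observations).
    """
    observations = count_end - count_start
    if observations <= 0:
        # No new observations in the window, or the counter was reset.
        return None
    return (sum_end - sum_start) / observations


# Made-up samples standing in for etcd_disk_wal_fsync_duration_seconds_sum
# and etcd_disk_wal_fsync_duration_seconds_count.
avg = mean_duration(sum_start=12.0, sum_end=12.5, count_start=4000, count_end=4250)
print(avg)
```

The same calculation applies to any `_sum` and `_count` pair listed above, provided both samples come from the same series and the same two points in time.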

ACK Dedicated Scheduler (Job name: ack-scheduler)

Metric	Description
aggregator_discovery_aggregation_count_total	Total count of aggregator discovery aggregations.
aliyun_prometheus_agent_append_duration_seconds	Duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.
aliyun_prometheus_agent_job_discovery_status	Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrape_custom_error	Number of custom scrape errors for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_scrapes_by_target_total	Total number of scrapes by target for the Alibaba Cloud Prometheus agent.
aliyun_prometheus_agent_target_info	Target information for the Alibaba Cloud Prometheus agent.
apiserver_audit_event_total	Total number of API server audit events.
apiserver_audit_requests_rejected_total	Total number of rejected API server audit requests.
apiserver_client_certificate_expiration_seconds_bucket	Distribution of remaining seconds until API server client certificate expiration.
apiserver_client_certificate_expiration_seconds_count	Count of remaining seconds until API server client certificate expiration.
apiserver_client_certificate_expiration_seconds_sum	Sum of remaining seconds until API server client certificate expiration.
apiserver_delegated_authn_request_duration_seconds_bucket	Distribution of API server delegated authentication request duration, in seconds.
apiserver_delegated_authn_request_duration_seconds_count	Count of API server delegated authentication request duration.
apiserver_delegated_authn_request_duration_seconds_sum	Sum of API server delegated authentication request duration.
apiserver_delegated_authn_request_total	Total number of API server delegated authentication requests.
apiserver_delegated_authz_request_duration_seconds_bucket	Distribution of API server delegated authorization request duration, in seconds.
apiserver_delegated_authz_request_duration_seconds_count	Count of API server delegated authorization request duration.
apiserver_delegated_authz_request_duration_seconds_sum	Sum of API server delegated authorization request duration, in seconds.
apiserver_delegated_authz_request_total	Total number of API server delegated authorization requests.
apiserver_encryption_config_controller_automatic_reload_failures_total	Total number of automatic reload failures for the API server encryption configuration controller.
apiserver_encryption_config_controller_automatic_reload_success_total	Total number of successful automatic reloads for the API server encryption configuration controller.
apiserver_envelope_encryption_dek_cache_fill_percent	Cache fill percentage for the API server envelope encryption Data Encryption Key (DEK).
apiserver_storage_data_key_generation_duration_seconds_bucket	Distribution of API server storage data key generation duration.
apiserver_storage_data_key_generation_duration_seconds_count	Count of API server storage data key generation duration.
apiserver_storage_data_key_generation_duration_seconds_sum	Sum of API server storage data key generation duration, in seconds.
apiserver_storage_data_key_generation_failures_total	Total number of API server storage data key generation failures.
apiserver_storage_envelope_transformation_cache_misses_total	Total number of cache misses for API server storage envelope transformation.
apiserver_webhooks_x509_insecure_sha1_total	Total count of insecure SHA1 in API server webhook X.509 certificates.
apiserver_webhooks_x509_missing_san_total	Total count of API server webhooks with missing Subject Alternative Name (SAN) in X.509 certificates.
authenticated_user_requests	Authenticated user requests.
authentication_attempts	Number of authentication attempts.
authentication_duration_seconds_bucket	Distribution of authentication duration.
authentication_duration_seconds_count	Count of authentication duration.
authentication_duration_seconds_sum	Sum of authentication duration, in seconds.
authentication_token_cache_active_fetch_count	Count of active fetches from the authentication token cache.
authentication_token_cache_fetch_total	Total number of fetches from the authentication token cache.
authentication_token_cache_request_duration_seconds_bucket	Distribution of authentication token cache request duration.
authentication_token_cache_request_duration_seconds_count	Count of authentication token cache request duration.
authentication_token_cache_request_duration_seconds_sum	Sum of authentication token cache request duration, in seconds.
authentication_token_cache_request_total	Total number of authentication token cache requests.
authorization_attempts_total	Total number of authorization attempts.
authorization_duration_seconds_bucket	Distribution of authorization duration, in seconds.
authorization_duration_seconds_count	Count of authorization duration.
authorization_duration_seconds_sum	Sum of authorization duration.
cardinality_enforcement_unexpected_categorizations_total	Total number of unexpected categorizations from cardinality enforcement.
kubernetes_build_info	Kubernetes build information.
kubernetes_feature_enabled	Enabled Kubernetes feature.
leader_election_master_status	Status of the leader election master.
registered_metric_total	Total number of registered metrics.
registered_metrics_total	Total number of registered metrics.
rest_client_exec_plugin_certificate_rotation_age_bucket	Buckets for the age of rotated certificates for the REST client exec plugin.
rest_client_exec_plugin_certificate_rotation_age_count	Count of the age of rotated certificates for the REST client exec plugin.
rest_client_exec_plugin_certificate_rotation_age_sum	Sum of the age of rotated certificates for the REST client exec plugin.
rest_client_rate_limiter_duration_seconds_bucket	Distribution of REST client rate limiter duration.
rest_client_rate_limiter_duration_seconds_count	Count of REST client rate limiter duration, in seconds.
rest_client_rate_limiter_duration_seconds_sum	Sum of REST client rate limiter duration, in seconds.
rest_client_request_duration_seconds_bucket	Buckets for REST client request duration, in seconds.
rest_client_request_duration_seconds_count	Count of REST client request duration.
rest_client_request_duration_seconds_sum	Sum of REST client request duration, in seconds.
rest_client_request_retries_total	Total number of REST client request retries.
rest_client_request_size_bytes_bucket	Distribution of REST client request size, in bytes.
rest_client_request_size_bytes_count	Count of REST client request size, in bytes.
rest_client_request_size_bytes_sum	Sum of REST client request size, in bytes.
rest_client_requests_total	Total number of REST client requests.
rest_client_response_size_bytes_bucket	Buckets for REST client response size, in bytes.
rest_client_response_size_bytes_count	Count of REST client response size, in bytes.
rest_client_response_size_bytes_sum	Sum of REST client response size, in bytes.
rest_client_transport_cache_entries	Number of REST client transport cache entries.
rest_client_transport_create_calls_total	Total number of REST client transport creation calls.
scheduler_binding_duration_seconds_bucket	Buckets for scheduler binding duration, in seconds.
scheduler_binding_duration_seconds_count	Count of binding duration.
scheduler_binding_duration_seconds_sum	Sum of scheduler binding duration, in seconds.
scheduler_e2e_scheduling_duration_seconds_bucket	Distribution of scheduler end-to-end scheduling duration.
scheduler_e2e_scheduling_duration_seconds_count	Count of scheduler end-to-end scheduling duration.
scheduler_e2e_scheduling_duration_seconds_sum	Sum of scheduler end-to-end scheduling duration, in seconds.
scheduler_framework_extension_point_duration_seconds_bucket	Distribution of scheduler framework extension point duration.
scheduler_framework_extension_point_duration_seconds_count	Count of scheduler framework extension point duration.
scheduler_framework_extension_point_duration_seconds_sum	Sum of scheduler framework extension point duration.
scheduler_goroutines	Number of scheduler goroutines.
scheduler_pending_pods	Number of pending pods in the scheduler.
scheduler_plugin_evaluation_total	Total number of scheduler plugin evaluations.
scheduler_plugin_execution_duration_seconds_bucket	Distribution of scheduler plugin execution duration, in seconds.
scheduler_plugin_execution_duration_seconds_count	Count of scheduler plugin execution duration.
scheduler_plugin_execution_duration_seconds_sum	Sum of scheduler plugin execution duration, in seconds.
scheduler_pod_preemption_victims_bucket	Buckets for the number of pod preemption victims in the scheduler.
scheduler_pod_preemption_victims_count	Count of pod preemption victims in the scheduler.
scheduler_pod_preemption_victims_sum	Sum of pod preemption victims in the scheduler.
scheduler_pod_scheduling_attempts_bucket	Buckets for the number of pod scheduling attempts in the scheduler.
scheduler_pod_scheduling_attempts_count	Count of pod scheduling attempts in the scheduler.
scheduler_pod_scheduling_attempts_sum	Sum of pod scheduling attempts in the scheduler.
scheduler_pod_scheduling_duration_seconds_bucket	Buckets for pod scheduling duration in the scheduler, in seconds.
scheduler_pod_scheduling_duration_seconds_count	Count of pod scheduling duration in the scheduler.
scheduler_pod_scheduling_duration_seconds_sum	Sum of pod scheduling duration in the scheduler, in seconds.
scheduler_pod_scheduling_sli_duration_seconds_bucket	Buckets for pod scheduling Service Level Indicator (SLI) duration.
scheduler_pod_scheduling_sli_duration_seconds_count	Count of pod scheduling Service Level Indicator (SLI) duration in the scheduler.
scheduler_pod_scheduling_sli_duration_seconds_sum	Sum of pod scheduling Service Level Indicator (SLI) duration.
scheduler_preemption_attempts_total	Total number of preemption attempts in the scheduler.
scheduler_preemption_victims_bucket	Buckets for the number of preemption victims in the scheduler.
scheduler_preemption_victims_count	Count of preemption victims in the scheduler.
scheduler_preemption_victims_sum	Total number of preemption victims in the scheduler.
scheduler_queue_incoming_pods_total	Total number of incoming pods in the scheduler queue.
scheduler_schedule_attempts_total	Total number of scheduling attempts in the scheduler.
scheduler_scheduler_cache_size	Size of the scheduler cache.
scheduler_scheduler_goroutines	Number of scheduler goroutines.
scheduler_scheduling_algorithm_duration_seconds_bucket	Distribution of scheduler scheduling algorithm duration, in seconds.
scheduler_scheduling_algorithm_duration_seconds_count	Count of scheduler scheduling algorithm duration, in seconds.
scheduler_scheduling_algorithm_duration_seconds_sum	Sum of scheduler scheduling algorithm duration, in seconds.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_bucket	Buckets for scheduler scheduling algorithm predicate evaluation duration, in seconds.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_count	Count of scheduling algorithm predicate evaluation duration, in seconds.
scheduler_scheduling_algorithm_predicate_evaluation_seconds_sum	Sum of scheduling algorithm predicate evaluation duration, in seconds.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_bucket	Buckets for scheduling algorithm preemption evaluation duration, in seconds.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_count	Count of scheduling algorithm preemption evaluation duration, in seconds.
scheduler_scheduling_algorithm_preemption_evaluation_seconds_sum	Sum of scheduling algorithm preemption evaluation duration, in seconds.
scheduler_scheduling_algorithm_priority_evaluation_seconds_bucket	Buckets for scheduler scheduling algorithm priority evaluation duration, in seconds.
scheduler_scheduling_algorithm_priority_evaluation_seconds_count	Count of scheduling algorithm priority evaluation duration, in seconds.
scheduler_scheduling_algorithm_priority_evaluation_seconds_sum	Sum of scheduling algorithm priority evaluation duration, in seconds.
scheduler_scheduling_attempt_duration_seconds_bucket	Distribution of scheduler scheduling attempt duration.
scheduler_scheduling_attempt_duration_seconds_count	Count of scheduler scheduling attempt duration.
scheduler_scheduling_attempt_duration_seconds_sum	Sum of scheduler scheduling attempt duration, in seconds.
scheduler_scheduling_duration_seconds	Scheduler scheduling duration, in seconds.
scheduler_scheduling_duration_seconds_count	Count of scheduling duration.
scheduler_scheduling_duration_seconds_sum	Sum of scheduling duration.
scheduler_total_preemption_attempts	Total number of preemption attempts by the scheduler.
scheduler_unschedulable_pods	Number of unschedulable pods in the scheduler.
scheduler_volume_scheduling_duration_seconds_bucket	Buckets for volume scheduling duration.
scheduler_volume_scheduling_duration_seconds_count	Count of scheduler volume scheduling duration, in seconds.
scheduler_volume_scheduling_duration_seconds_sum	Sum of scheduler volume scheduling duration, in seconds.
scheduler_volume_scheduling_stage_error_total	Total number of errors in the scheduler volume scheduling stage.
scrape_duration_seconds	Scrape duration, in seconds.
scrape_samples_post_metric_relabeling	Number of scraped samples after metric relabeling.
scrape_samples_scraped	Number of scraped samples.
scrape_series_added	Number of new series added from scrapes.
up	Connectivity for metric scraping.
workqueue_adds_total	Total number of additions to the work queue.
workqueue_depth	Depth of the work queue.
workqueue_longest_running_processor_seconds	Longest running processor time in the work queue, in seconds.
workqueue_queue_duration_seconds_bucket	Buckets for the duration items stay in the work queue, in seconds.
workqueue_queue_duration_seconds_count	Count of the duration items stay in the work queue, in seconds.
workqueue_queue_duration_seconds_sum	Sum of the duration items stay in the work queue, in seconds.
workqueue_retries_total	Total number of retries in the work queue.
workqueue_unfinished_work_seconds	Seconds of unfinished work in the work queue.
workqueue_work_duration_seconds_bucket	Distribution of work duration in the work queue.
workqueue_work_duration_seconds_count	Count of work duration in the work queue.
workqueue_work_duration_seconds_sum	Sum of work duration in the work queue, in seconds.
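
The `_bucket` series in this table hold cumulative counts per upper bound, so a latency quantile can be estimated from them. The following is a minimal sketch under stated assumptions: the bucket bounds and counts are invented, `estimate_quantile` is a hypothetical helper, and the linear interpolation is only similar in spirit to what PromQL's `histogram_quantile` does, not a reproduction of it.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile falls in the +Inf bucket or in an empty bucket:
                # fall back to the last finite bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts standing in for
# scheduler_pod_scheduling_duration_seconds_bucket.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.75, sample))
```

The estimate is only as precise as the bucket boundaries, so a quantile that falls in a wide bucket carries a correspondingly wide error.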

References

To view the metrics for ARMS Application Monitoring, see Application Monitoring metrics.
Configure deprecated metrics