Application Real-Time Monitoring Service: Basic metrics for container clusters

Last Updated: Oct 29, 2025

This topic describes the basic metrics for container clusters that are supported by Managed Service for Prometheus.

Important
  • Billing for Managed Service for Prometheus is based on the data write volume or the number of reported data points. Metrics are divided into two types:

    • Basic metrics: Managed Service for Prometheus provides free data reporting and writing for basic metrics collected from Alibaba Cloud container services, such as Container Service for Kubernetes (ACK), ACS, ASK, ACK One, and ACK Edge. This benefit does not apply to other types of container clusters.

    • Custom metrics: Any metric that is not a basic metric is a custom metric. Billing for custom metrics started on January 6, 2020.

  • Since 00:00:00 (UTC+8) on November 12, 2024, Managed Service for Prometheus has used an adjusted scope of basic metrics collected from Alibaba Cloud container service clusters. The adjusted metric scope is described below.

Note that the scope of basic metrics collected by default for container clusters is limited to the metrics described in this topic.

Container cluster metrics outside this scope are custom metrics and are subject to charges. For more information about billing, see Billing of Prometheus instances.
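
The split between basic and custom metrics can be pictured as a lookup against the metric lists in this topic. The following Python sketch is only an illustration under stated assumptions: `BASIC_METRICS` is a hypothetical, partial allow-list that contains a few metric names from this topic, keyed by job name, and `classify_metric` is a hypothetical helper. Neither is part of a Managed Service for Prometheus API, and actual billing is decided by the service, not by client-side code.

```python
# Hypothetical, partial allow-list keyed by scrape job name.
# Only a few metric names from this topic are included for illustration.
BASIC_METRICS = {
    "_arms/kubelet/cadvisor": {
        "container_cpu_usage_seconds_total",
        "container_memory_working_set_bytes",
        "DCGM_FI_DEV_GPU_UTIL",
        "up",
    },
    "apiserver": {
        "apiserver_request_total",
        "apiserver_request_duration_seconds_bucket",
        "etcd_requests_total",
    },
}


def classify_metric(job: str, metric: str) -> str:
    """Return "basic" if the metric is in the allow-list for the job, else "custom"."""
    if metric in BASIC_METRICS.get(job, set()):
        return "basic"
    return "custom"


if __name__ == "__main__":
    # A metric listed in this topic for the cAdvisor job.
    print(classify_metric("_arms/kubelet/cadvisor", "container_cpu_usage_seconds_total"))
    # An application metric that is not listed in this topic.
    print(classify_metric("_arms/kubelet/cadvisor", "my_app_requests_total"))
```

A check like this could help estimate which of the metrics you scrape fall outside the free scope, provided the allow-list is kept in sync with the full tables in this topic.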

cAdvisor (Job name: _arms/kubelet/cadvisor)

Metric

Description

container_cpu_usage_seconds_total

Total container CPU usage time.

container_fs_usage_bytes

Container file system usage in bytes.

container_memory_cache

Container memory cache.

container_memory_usage_bytes

Container memory usage in bytes.

container_memory_working_set_bytes

Container memory working set in bytes.

container_network_receive_bytes_total

Total bytes received by the container network.

container_network_transmit_bytes_total

Total bytes transmitted by the container network.

container_scrape_error

Container metric scrape error.

DCGM_CUSTOM_CONTAINER_CP_ALLOCATED

The proportion of computing power allocated to a container on a GPU card relative to the total computing power of that GPU. The value ranges from 0 to 1. For exclusive GPUs or shared GPUs that only request GPU memory, this metric is 0, which indicates no limit on computing power. For example, if a GPU card has 100 units of computing power and 30 units are allocated to a container, the allocated computing power ratio for that container is 30/100 = 0.3.

DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED

The GPU memory allocated to the container.

DCGM_CUSTOM_DEV_FB_ALLOCATED

The proportion of allocated GPU memory to the total GPU memory. The value ranges from 0 to 1.

DCGM_CUSTOM_DEV_FB_TOTAL

The total GPU memory of the GPU card.

DCGM_CUSTOM_DEV_HEALTH

GPU health status.

DCGM_CUSTOM_PROCESS_DECODE_UTIL

The decoder utilization of the GPU process.

DCGM_CUSTOM_PROCESS_ENCODE_UTIL

The encoder utilization of the GPU process.

DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL

The memory copy utilization of the GPU process.

DCGM_CUSTOM_PROCESS_MEM_USED

The GPU memory currently used by the GPU process.

DCGM_CUSTOM_PROCESS_SM_UTIL

The SM utilization of the GPU process.

DCGM_CUSTOM_PROF_MEM_BANDWIDTH_USED

GPU memory bandwidth usage.

DCGM_CUSTOM_PROF_TENS_TFPS_USED

The usage of the GPU tensor core.

DCGM_FI_DEV_DEC_UTIL

Decoder utilization.

DCGM_FI_DEV_ENC_UTIL

Encoder utilization.

DCGM_FI_DEV_FB_FREE

The amount of available framebuffer memory.

DCGM_FI_DEV_FB_USED

The amount of used framebuffer memory. This value corresponds to the used value of Memory-Usage in the nvidia-smi command.

DCGM_FI_DEV_GPU_TEMP

GPU temperature.

DCGM_FI_DEV_GPU_UTIL

GPU utilization. This is the percentage of time one or more kernel functions are active on the GPU over a period, such as 1s or 1/6s, depending on the GPU product. This metric only shows that a GPU resource is in use by a kernel function, but does not show the specific usage.

DCGM_FI_DEV_MEM_CLOCK

Memory clock frequency.

DCGM_FI_DEV_MEM_COPY_UTIL

Memory bandwidth utilization. For example, for an NVIDIA V100 GPU, the maximum memory bandwidth is 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_POWER_USAGE

Power usage.

DCGM_FI_DEV_SM_CLOCK

SM clock frequency.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

The energy consumed since the driver was loaded.

DCGM_FI_DEV_XID_ERRORS

The last XID error number that occurred within a period of time.

DCGM_FI_PROF_DRAM_ACTIVE

Memory bandwidth utilization. The fraction of cycles where data is sent to or received from the device memory.

This value is an average over the time interval, not an instantaneous value.

A higher value indicates higher utilization of the device memory.

A value of 1 (100%) means that a DRAM instruction is executed in every cycle within the time interval. In practice, a peak of about 0.8 (80%) is the maximum achievable value.

A value of 0.2 (20%) means that 20% of the cycles are used to read from or write to the device memory within the time interval.

DCGM_FI_PROF_NVLINK_RX_BYTES

The rate of data received over NVLink, excluding protocol headers.

This value is averaged over a time interval and is not instantaneous. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.

DCGM_FI_PROF_NVLINK_TX_BYTES

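The rate of data transmitted over NVLink (send direction), excluding protocol headers. This value is averaged over a time interval and is not instantaneous.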

DCGM_FI_PROF_PCIE_RX_BYTES

The rate of data received over the PCIe bus, including protocol headers and data payloads.

This value is averaged over a time interval and is not instantaneous. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PCIE_TX_BYTES

The rate of data transmitted over the PCIe bus, including protocol headers and data payloads.

This value is averaged over a time interval and is not instantaneous. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

The fraction of cycles where the Tensor (HMMA/IMMA) Pipe is active.

This value is an average over a time interval, not an instantaneous value.

A higher value indicates higher utilization of Tensor Cores.

A value of 1 (100%) means that a Tensor instruction is issued every other instruction cycle. One instruction is completed in two cycles.

A value of 0.2 (20%) could mean:

20% of the SMs' Tensor Cores are running at 100% utilization throughout the interval.

100% of the SMs' Tensor Cores are running at 20% utilization throughout the interval.

For 1/5 of the interval, 100% of the Tensor Cores on the SMs are running at 100% utilization.

Other combinations.

DCGM_FI_PROF_SM_ACTIVE

The fraction of time that at least one warp is active on a Streaming Multiprocessor (SM) within a time interval. This value is the average across all SMs and is not sensitive to the number of threads per block.

A warp is active when it is scheduled and has resources allocated. It can be computing or in a non-computing state, such as waiting for a memory request.

A value less than 0.5 likely indicates inefficient GPU utilization. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU.

Assume a GPU has N SMs:

If a kernel function runs on all SMs using N thread blocks throughout the interval, the value is 1 (100%).

If a kernel function runs N/5 thread blocks throughout the interval, the value is 0.2.

If a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2.

machine_cpu_cores

Number of machine CPU cores.

node_exporter_build_info

Node exporter build information.

nvidia_gpu_duty_cycle

NVIDIA GPU duty cycle percentage.

nvidia_gpu_memory_total_bytes

Total NVIDIA GPU memory in bytes.

nvidia_gpu_memory_used_bytes

Amount of used NVIDIA GPU memory.

nvidia_gpu_num_devices

Number of NVIDIA GPU devices.

nvidia_gpu_power_usage_milliwatts

NVIDIA GPU power consumption in milliwatts.

nvidia_gpu_temperature_celsius

NVIDIA GPU temperature in Celsius.

rdma_service_monitor_local_ack_timeout_err

Number of RDMA network timeout errors.

rdma_service_monitor_out_of_seq

Number of out-of-sequence RDMA network datagrams.

rdma_service_monitor_packet_seq_err

Number of RDMA network packet sequence errors.

rdma_service_monitor_rx_bytes

RDMA network receive throughput.

rdma_service_monitor_rx_packets

Number of received RDMA network packets.

rdma_service_monitor_tx_bytes

RDMA network send throughput.

rdma_service_monitor_tx_packets

Number of sent RDMA network packets.

up

Whether the scrape target is reachable. The value is 1 if the most recent scrape succeeded and 0 if it failed.
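
Several descriptions in the preceding table include worked arithmetic. The following Python sketch reproduces those calculations with the example values from the table. The function names are hypothetical and are not part of DCGM or Managed Service for Prometheus. The `average_activity` helper assumes a simplified model in which an interval-averaged profiling value is the product of the fraction of SMs in use, their utilization while in use, and the fraction of the interval during which they run. Real workloads can combine these factors in other ways.

```python
def allocated_compute_ratio(allocated_units: float, total_units: float) -> float:
    """DCGM_CUSTOM_CONTAINER_CP_ALLOCATED example: share of a GPU's computing power."""
    return allocated_units / total_units


def bandwidth_utilization(current_gb_per_s: float, max_gb_per_s: float) -> float:
    """DCGM_FI_DEV_MEM_COPY_UTIL example: utilization as a percentage."""
    return 100.0 * current_gb_per_s / max_gb_per_s


def average_rate(bytes_transferred: float, interval_seconds: float) -> float:
    """NVLink/PCIe example: rate averaged over the interval, in bytes per second."""
    return bytes_transferred / interval_seconds


def average_activity(sm_fraction: float, utilization: float, time_fraction: float) -> float:
    """Interval-averaged activity under the simplified product model (an assumption).

    sm_fraction: fraction of SMs doing work (0 to 1)
    utilization: how busy those SMs are while working (0 to 1)
    time_fraction: fraction of the interval during which they work (0 to 1)
    """
    return sm_fraction * utilization * time_fraction


if __name__ == "__main__":
    print(allocated_compute_ratio(30, 100))   # 30 of 100 units allocated
    print(bandwidth_utilization(450, 900))    # 450 GB/s of a 900 GB/s maximum
    print(average_rate(1e9, 1.0))             # 1 GB transferred in 1 second
    # Three of the ways a profiling metric can average out to 0.2:
    print(average_activity(0.2, 1.0, 1.0))    # 20% of SMs, fully busy, whole interval
    print(average_activity(1.0, 0.2, 1.0))    # all SMs, 20% busy, whole interval
    print(average_activity(1.0, 1.0, 0.2))    # all SMs, fully busy, 1/5 of the interval
```

Because the three scenarios produce the same averaged value, a reading such as 0.2 alone does not reveal which pattern of use produced it.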

ACK ControlPlane APIServer (Includes ACK Pro control plane components such as APIServer, etcd, scheduler, KCM, and CCM. ACK Dedicated clusters include only APIServer) (Job name: apiserver)

Metric

Description

aggregator_discovery_aggregation_count_total

Total count of aggregations from aggregator discovery

aggregator_openapi_v2_regeneration_count

Aggregator OpenAPI V2 regeneration count

aggregator_openapi_v2_regeneration_duration

Aggregator OpenAPI V2 regeneration duration

aggregator_unavailable_apiservice

Unavailable aggregator APIService

aggregator_unavailable_apiservice_count

The number of unavailable APIServices in the aggregator.

aggregator_unavailable_apiservice_total

Total number of unavailable API services in the aggregator

aliyun_prometheus_agent_append_duration_seconds

Alibaba Cloud Prometheus Agent append duration (seconds)

aliyun_prometheus_agent_job_discovery_status

Alibaba Cloud Prometheus Agent job discovery status

aliyun_prometheus_agent_scrapes_by_target_total

Total scrapes by target for the Alibaba Cloud Prometheus Agent

aliyun_prometheus_agent_target_info

Alibaba Cloud Prometheus Agent target information

apiextensions_apiserver_validation_ratcheting_seconds_bucket

Histogram buckets for the APIServer validation ratcheting duration in seconds

apiextensions_apiserver_validation_ratcheting_seconds_count

Number of observations of the APIServer validation ratcheting duration

apiextensions_apiserver_validation_ratcheting_seconds_sum

Sum of APIServer validation ratcheting durations in seconds

apiextensions_openapi_v2_regeneration_count

Apiextensions OpenAPI V2 regeneration count

apiextensions_openapi_v3_regeneration_count

Apiextensions OpenAPI V3 regeneration count

apiserver_accepted_listall_requests_total

The total number of listall requests accepted by the APIServer.

apiserver_admission_controller_admission_duration_seconds_bucket

The bucket for the APIServer admission controller admission duration, in seconds.

apiserver_admission_controller_admission_duration_seconds_count

The number of admission requests processed by the APIServer admission controller.

apiserver_admission_controller_admission_duration_seconds_sum

Total admission duration for the APIServer admission controller, in seconds

apiserver_admission_step_admission_duration_seconds_bucket

The histogram bucket for the duration of an APIServer admission step in seconds.

apiserver_admission_step_admission_duration_seconds_count

Count of API server admission step durations in seconds.

apiserver_admission_step_admission_duration_seconds_sum

Total duration of API server admission steps in seconds

apiserver_admission_step_admission_duration_seconds_summary

Summary of the APIServer admission step duration in seconds.

apiserver_admission_step_admission_duration_seconds_summary_count

Summary count of the admission duration of an APIServer admission step in seconds.

apiserver_admission_step_admission_duration_seconds_summary_sum

The sum of the summary of the API server admission step duration, in seconds.

apiserver_admission_webhook_admission_duration_seconds_bucket

APIServer admission webhook admission duration seconds bucket

apiserver_admission_webhook_admission_duration_seconds_count

The count of APIServer admission webhook durations in seconds.

apiserver_admission_webhook_admission_duration_seconds_sum

Sum of the admission duration of API server admission webhooks, in seconds.

apiserver_admission_webhook_fail_open_count

API server admission webhook fail open count

apiserver_admission_webhook_rejection_count

The number of rejections from the API server admission webhook.

apiserver_admission_webhook_request_total

Total number of API server admission webhook requests

apiserver_audit_error_total

Total number of API Server audit errors

apiserver_audit_event_total

Total APIServer audit events

apiserver_audit_level_total

Total number of APIServer audit events, by audit policy level

apiserver_audit_requests_rejected_total

Total number of rejected APIServer audit requests.

apiserver_authorization_decisions_total

Total number of API server authorization decisions

apiserver_cache_list_fetched_objects_total

The total number of objects fetched from the APIServer cache list.

apiserver_cache_list_returned_objects_total

Total number of objects returned by the APIServer cache list

apiserver_cache_list_total

Total number of APIServer cache list operations

apiserver_cacher_received_events

Events received by the APIServer cache

apiserver_cacher_sended_events_latency_milliseconds_bucket

The distribution of latency in milliseconds for events sent by the APIServer cacher.

apiserver_cacher_sended_events_latency_milliseconds_count

The count of latency measurements in milliseconds for events sent by the APIServer cacher.

apiserver_cacher_sended_events_latency_milliseconds_sum

The total latency in milliseconds for events sent by the APIServer cacher.

apiserver_cacher_watcher_channel_length

APIServer cacher watcher channel length

apiserver_cel_compilation_duration_seconds_bucket

Distribution of APIServer CEL compilation durations in seconds

apiserver_cel_compilation_duration_seconds_count

Counter of API server CEL compilations

apiserver_cel_compilation_duration_seconds_sum

Total APIServer CEL compilation duration (seconds)

apiserver_cel_evaluation_duration_seconds_bucket

Distribution of APIServer CEL evaluation durations in seconds.

apiserver_cel_evaluation_duration_seconds_count

The number of API server CEL evaluations.

apiserver_cel_evaluation_duration_seconds_sum

Total duration of APIServer CEL evaluation in seconds

apiserver_client_certificate_expiration_seconds_bucket

Distribution of seconds remaining before the API server client certificate expires.

apiserver_client_certificate_expiration_seconds_count

The number of observations of the time remaining before the API server client certificate expires.

apiserver_client_certificate_expiration_seconds_sum

The total number of seconds remaining before the APIServer client certificate expires.

apiserver_clusterip_repair_ip_errors_total

Total number of ClusterIP errors detected by the APIServer repair loop

apiserver_clusterip_repair_reconcile_errors_total

The total number of reconciliation errors for ClusterIP repairs by the APIServer.

apiserver_conversion_webhook_duration_seconds_bucket

The distribution of API server conversion webhook durations in seconds.

apiserver_conversion_webhook_duration_seconds_count

The number of APIServer conversion webhook calls

apiserver_conversion_webhook_duration_seconds_sum

Total duration of API server conversion webhooks in seconds

apiserver_conversion_webhook_request_total

Total number of API server conversion webhook requests

apiserver_crd_conversion_webhook_duration_seconds_bucket

The distribution of API Server CRD conversion webhook durations in seconds.

apiserver_crd_conversion_webhook_duration_seconds_count

Count of calls to the APIServer CRD conversion webhook

apiserver_crd_conversion_webhook_duration_seconds_sum

Total duration of APIServer CRD conversion webhooks in seconds.

apiserver_crd_webhook_conversion_duration_seconds_bucket

Distribution of APIServer CRD webhook conversion duration in seconds.

apiserver_crd_webhook_conversion_duration_seconds_count

The total number of APIServer CRD webhook conversions.

apiserver_crd_webhook_conversion_duration_seconds_sum

Total duration of APIServer CRD webhook conversions in seconds.

apiserver_created_watchers

Number of watchers created by the API server

apiserver_current_inflight_requests

The number of requests the APIServer is currently processing.

apiserver_current_inqueue_requests

The current number of requests in the API server queue.

apiserver_dropped_requests_total

The total number of requests dropped by the APIServer.

apiserver_encryption_config_controller_automatic_reload_failures_total

Number of failed automatic reloads for the APIServer encryption configuration controller

apiserver_encryption_config_controller_automatic_reload_success_total

Number of successful automatic reloads for the APIServer encryption configuration controller

apiserver_envelope_encryption_dek_cache_fill_percent

APIServer envelope encryption DEK cache fill percentage

apiserver_error_watchers

Number of APIServer watchers that encountered errors

apiserver_flowcontrol_current_executing_requests

Number of requests currently being executed by the APIServer throttle

apiserver_flowcontrol_current_executing_seats

Number of seats currently used by the APIServer throttle

apiserver_flowcontrol_current_inqueue_requests

Number of requests in the APIServer throttle queue

apiserver_flowcontrol_current_inqueue_seats

Number of seats in the APIServer throttle queue

apiserver_flowcontrol_current_limit_seats

Current seat limit for the API server throttle

apiserver_flowcontrol_current_r

Current R value of the APIServer throttle

apiserver_flowcontrol_demand_seats_average

Average value of requested seats for APIServer throttling

apiserver_flowcontrol_demand_seats_bucket

Seat distribution for throttled API server requests

apiserver_flowcontrol_demand_seats_count

APIServer throttle request seat count

apiserver_flowcontrol_demand_seats_high_watermark

APIServer throttling request seats high-water mark

apiserver_flowcontrol_demand_seats_smoothed

Smoothed seat demand for APIServer flow control

apiserver_flowcontrol_demand_seats_stdev

Standard deviation of request seats for APIServer throttling

apiserver_flowcontrol_demand_seats_sum

Total requested seats for APIServer throttling

apiserver_flowcontrol_dispatch_r

APIServer throttle scheduling R value

apiserver_flowcontrol_dispatched_requests_total

Total number of requests scheduled by APIServer throttling

apiserver_flowcontrol_latest_s

Recent S value limit for APIServer throttling

apiserver_flowcontrol_lower_limit_seats

Minimum seats for APIServer throttling

apiserver_flowcontrol_next_discounted_s_bounds

Next discounted S-value threshold for the APIServer throttle

apiserver_flowcontrol_next_s_bounds

Next S value threshold for APIServer throttling

apiserver_flowcontrol_nominal_limit_seats

Nominal seat limit for APIServer throttling

apiserver_flowcontrol_priority_level_request_count_samples_bucket

Sample distribution of APIServer requests by throttling priority level

apiserver_flowcontrol_priority_level_request_count_samples_count

Sample count of APIServer requests per throttling priority level

apiserver_flowcontrol_priority_level_request_count_samples_sum

Sum of sampled request counts for the APIServer throttling priority level

apiserver_flowcontrol_priority_level_request_count_watermarks_bucket

Distribution of request count watermarks across APIServer flow control priority levels

apiserver_flowcontrol_priority_level_request_count_watermarks_count

Number of request count watermark observations for APIServer flow control priority levels

apiserver_flowcontrol_priority_level_request_count_watermarks_sum

Sum of request watermarks for APIServer throttling priority levels

apiserver_flowcontrol_priority_level_request_utilization_bucket

Distribution of APIServer request utilization by flow control priority level

apiserver_flowcontrol_priority_level_request_utilization_count

APIServer throttle priority level request utilization count

apiserver_flowcontrol_priority_level_request_utilization_sum

Total request utilization across APIServer throttling priority levels

apiserver_flowcontrol_priority_level_seat_count_samples_bucket

Sample distribution of seats across APIServer throttling priority levels

apiserver_flowcontrol_priority_level_seat_count_samples_count

APIServer throttling priority level seats sample count

apiserver_flowcontrol_priority_level_seat_count_samples_sum

Sum of seat count samples for the APIServer throttle priority level

apiserver_flowcontrol_priority_level_seat_count_watermarks_bucket

Distribution of seat watermarks for API server priority levels

apiserver_flowcontrol_priority_level_seat_count_watermarks_count

Number of seat count watermark observations for APIServer flow control priority levels

apiserver_flowcontrol_priority_level_seat_count_watermarks_sum

Total seats at the watermark for the APIServer throttling priority level

apiserver_flowcontrol_priority_level_seat_utilization_bucket

API server: Seat utilization distribution by throttle priority level

apiserver_flowcontrol_priority_level_seat_utilization_count

APIServer flow control priority level seat utilization count

apiserver_flowcontrol_priority_level_seat_utilization_sum

Total seat utilization across API server throttling priority levels

apiserver_flowcontrol_read_vs_write_current_requests_bucket

Current request count in the APIServer read/write throttle bucket

apiserver_flowcontrol_read_vs_write_current_requests_count

Current read/write request count for APIServer throttling

apiserver_flowcontrol_read_vs_write_current_requests_sum

Sum of current read and write requests throttled by the APIServer

apiserver_flowcontrol_read_vs_write_request_count_samples_bucket

Sample bucket for the read/write request count of the APIServer throttle.

apiserver_flowcontrol_read_vs_write_request_count_samples_count

Number of samples for the APIServer throttled read/write request counter

apiserver_flowcontrol_read_vs_write_request_count_samples_sum

Total count of throttled APIServer read/write requests

apiserver_flowcontrol_read_vs_write_request_count_watermarks_bucket

APIServer throttling read/write request count watermark bucket

apiserver_flowcontrol_read_vs_write_request_count_watermarks_count

APIServer throttled read/write request count watermark

apiserver_flowcontrol_read_vs_write_request_count_watermarks_sum

Total count watermark for APIServer throttled read/write requests

apiserver_flowcontrol_rejected_requests_total

Total requests rejected by APIServer throttling

apiserver_flowcontrol_request_concurrency_in_use

APIServer throttled concurrent requests

apiserver_flowcontrol_request_concurrency_limit

Concurrency limit for APIServer request throttling

apiserver_flowcontrol_request_dispatch_no_accommodation_total

Total number of times the APIServer flow control dispatcher could not accommodate a request because not enough seats were available

apiserver_flowcontrol_request_execution_seconds_bucket

Histogram buckets for the execution time of APIServer flow-controlled requests, in seconds

apiserver_flowcontrol_request_execution_seconds_count

Number of observed executions of APIServer flow-controlled requests

apiserver_flowcontrol_request_execution_seconds_sum

Sum of execution seconds for throttled APIServer requests

apiserver_flowcontrol_request_queue_length_after_enqueue_bucket

Post-enqueue length buckets of the APIServer request throttling queue

apiserver_flowcontrol_request_queue_length_after_enqueue_count

Number of observations of the APIServer flow control queue length after a request is enqueued

apiserver_flowcontrol_request_queue_length_after_enqueue_sum

Sum of observed APIServer flow control queue lengths after a request is enqueued

apiserver_flowcontrol_request_wait_duration_seconds_bucket

APIServer request throttling wait time bucket (seconds)

apiserver_flowcontrol_request_wait_duration_seconds_count

Number of observed waits of APIServer flow-controlled requests in the queue

apiserver_flowcontrol_request_wait_duration_seconds_sum

Total wait time in seconds for throttled APIServer requests

apiserver_flowcontrol_seat_fair_frac

Fair fraction of the APIServer's concurrency to allocate to each priority level that can use it, as computed in the last borrowing adjustment period

apiserver_flowcontrol_target_seats

Target seat count for API server throttling

apiserver_flowcontrol_upper_limit_seats

Maximum number of seats for APIServer throttling

apiserver_flowcontrol_watch_count_samples_bucket

Histogram buckets for sampled watch counts in APIServer flow control

apiserver_flowcontrol_watch_count_samples_count

Number of watch count samples in APIServer flow control

apiserver_flowcontrol_watch_count_samples_sum

Sum of sampled watch counts in APIServer flow control

apiserver_flowcontrol_work_estimated_seats_bucket

APIServer flow control's bucket for estimated work seats

apiserver_flowcontrol_work_estimated_seats_count

APIServer flow control estimated seat count

apiserver_flowcontrol_work_estimated_seats_sum

Total estimated seats for APIServer throttling work

apiserver_init_events_total

Total APIServer initialization events

apiserver_kube_aggregator_x509_insecure_sha1_total

Number of requests using insecure SHA1 signatures

apiserver_kube_aggregator_x509_missing_san_total

Number of kube-aggregator requests to servers whose serving certificate is missing the x509 SAN extension

apiserver_longrunning_gauge

APIServer long-running gauge

apiserver_longrunning_requests

Long-running APIServer requests

apiserver_nodeport_repair_reconcile_errors_total

Total reconciliation errors for APIServer node port repairs

apiserver_realtime_watchers

Number of real-time APIServer watchers

apiserver_registered_watchers

Number of watchers registered in the APIServer

apiserver_request_aborts_total

Total aborted APIServer requests

apiserver_request_body_size_bytes_bucket

APIServer request body size in bytes bucket

apiserver_request_body_size_bytes_count

Number of observations of the APIServer request body size

apiserver_request_body_size_bytes_sum

Total APIServer request body size in bytes

apiserver_request_count

Number of API server requests

apiserver_request_duration_seconds_bucket

Buckets for APIServer request processing time (in seconds)

apiserver_request_duration_seconds_count

Number of observed APIServer requests (count of request duration observations)

apiserver_request_duration_seconds_sum

Total APIServer request duration in seconds

apiserver_request_filter_duration_seconds_bucket

APIServer request filter duration bucket (seconds)

apiserver_request_filter_duration_seconds_count

Count of APIServer request filter durations in seconds.

apiserver_request_filter_duration_seconds_sum

Total duration of APIServer request filters in seconds

apiserver_request_latencies_summary

APIServer request latency distribution summary

apiserver_request_no_resourceversion_list_total

Total number of LIST requests served without a resourceVersion

apiserver_request_post_timeout_total

Total number of request handler activities tracked after the APIServer timed out the associated requests

apiserver_request_sli_duration_seconds_bucket

API request Service Level Indicator (SLI) duration seconds bucket

apiserver_request_sli_duration_seconds_count

Number of observations of the API request SLI duration

apiserver_request_sli_duration_seconds_sum

Total API request SLI duration in seconds

apiserver_request_slo_duration_seconds_bucket

API request SLO duration bucket (seconds)

apiserver_request_slo_duration_seconds_count

Number of observations of the API request SLO duration

apiserver_request_slo_duration_seconds_sum

Total API request SLO duration in seconds

apiserver_request_terminations_total

Total number of requests that the APIServer terminated in self-defense

apiserver_request_timestamp_comparison_time_bucket

Distribution buckets for API request timestamp differences

apiserver_request_timestamp_comparison_time_count

API request timestamp comparison sample count

apiserver_request_timestamp_comparison_time_sum

Total time for API request timestamp comparison

apiserver_request_total

Total API requests

apiserver_requested_deprecated_apis

Number of requests to the API server for deprecated APIs

apiserver_response_sizes_bucket

API response size distribution buckets

apiserver_response_sizes_count

API response size count

apiserver_response_sizes_sum

Total API response size

apiserver_selfrequest_total

Total API server self-requests

apiserver_storage_data_key_generation_duration_seconds_bucket

APIServer storage data key generation duration: seconds buckets

apiserver_storage_data_key_generation_duration_seconds_count

Count of data key generations by APIServer storage

apiserver_storage_data_key_generation_duration_seconds_sum

Total data key generation time for APIServer storage, in seconds

apiserver_storage_data_key_generation_failures_total

Total number of data key generation failures for the APIServer store

apiserver_storage_db_total_size_in_bytes

Total size of the APIServer database (bytes)

apiserver_storage_decode_errors_total

Total APIServer storage decoding errors

apiserver_storage_envelope_transformation_cache_misses_total

Total cache misses for the envelope transform in APIServer storage

apiserver_storage_events_received_total

Total number of events that the APIServer received from storage

apiserver_storage_list_evaluated_objects_total

Total objects evaluated from APIServer storage for list operations

apiserver_storage_list_fetched_objects_total

Total objects retrieved from the APIServer storage list

apiserver_storage_list_returned_objects_total

Total number of objects in a list response from the APIServer

apiserver_storage_list_total

Total APIServer storage list operations

apiserver_storage_objects

Number of APIServer objects

apiserver_storage_size_bytes

APIServer storage size (bytes)

apiserver_terminated_watchers_total

Total number of watchers terminated by the APIServer

apiserver_tls_handshake_errors_total

Total failed TLS handshake requests for the API server

apiserver_too_large_resourceversion_errors

Number of APIServer requests that failed because the requested resourceVersion was too large

apiserver_watch_cache_events_dispatched_total

Total number of events dispatched by the APIServer watch cache

apiserver_watch_cache_events_received_total

Total number of events received by the APIServer watch cache

apiserver_watch_cache_initializations_total

Total APIServer watch cache initializations

apiserver_watch_cache_read_wait_seconds_bucket

APIServer watch cache read wait time bucket (seconds)

apiserver_watch_cache_read_wait_seconds_count

Number of observed APIServer watch cache read waits

apiserver_watch_cache_read_wait_seconds_sum

Sum of wait time in seconds for APIServer watch cache reads

apiserver_watch_cache_watch_cache_initializations_total

Total APIServer watch cache initializations

apiserver_watch_events_sizes_bucket

Histogram buckets for APIServer watch event sizes

apiserver_watch_events_sizes_count

Number of observed APIServer watch event sizes

apiserver_watch_events_sizes_sum

Total size of APIServer watch events

apiserver_watch_events_total

Total APIServer watch events

apiserver_webhooks_x509_insecure_sha1_total

Number of requests that use insecure SHA1 signatures

apiserver_webhooks_x509_missing_san_total

Number of APIServer requests to webhook servers whose serving certificate is missing the x509 SAN extension

authenticated_user_requests

Total number of authenticated user requests

authentication_attempts

Authentication attempts

authentication_duration_seconds_bucket

Authentication procedure duration buckets (seconds)

authentication_duration_seconds_count

Number of observations of the authentication procedure duration

authentication_duration_seconds_sum

Total authentication duration in seconds

authentication_token_cache_active_fetch_count

Authentication token cache proactive fetch count

authentication_token_cache_fetch_total

Total authentication token cache retrievals

authentication_token_cache_request_duration_seconds_bucket

Authentication token cache request latency distribution buckets (seconds)

authentication_token_cache_request_duration_seconds_count

Number of authentication token cache request latency observations

authentication_token_cache_request_duration_seconds_sum

Total duration of authentication token cache requests in seconds

authentication_token_cache_request_total

Total authentication token cache requests

authorization_attempts_total

Total authorization attempts

authorization_duration_seconds_bucket

Distribution buckets for authorization procedure duration (seconds)

authorization_duration_seconds_count

Number of authorization duration observations

authorization_duration_seconds_sum

Total authorization procedure duration in seconds

cardinality_enforcement_unexpected_categorizations_total

Total unexpected categorizations during cardinality enforcement

count

Count

cpu_utilization_core

CPU utilization (core)

disabled_metric_total

Total disabled metrics

disabled_metrics_total

Total disabled metrics

etcd_bookmark_counts

Etcd bookmark count

etcd_db_total_size_in_bytes

Total etcd database size (bytes)

etcd_lease_object_counts_bucket

Histogram buckets for etcd lease object count

etcd_lease_object_counts_count

Number of etcd lease object count observations

etcd_lease_object_counts_sum

Sum of observed etcd lease object counts

etcd_object_counts

ETCD object count

etcd_request_duration_seconds_bucket

Bucket counter for ETCD request processing time (in seconds)

etcd_request_duration_seconds_count

ETCD request duration count (seconds)

etcd_request_duration_seconds_sum

Sum of etcd request durations in seconds

etcd_request_errors_total

Total etcd request errors

etcd_requests_total

Total etcd requests

etcd_watcher_channel_length

etcd watcher channel length

etcd_watcher_received_events

Events received by the etcd watcher

etcd_watcher_sended_events_latency_milliseconds_bucket

Distribution buckets for etcd watcher event send latency (ms)

etcd_watcher_sent_events_latency_milliseconds_count

Number of etcd watcher event send latency observations

etcd_watcher_sent_events_latency_milliseconds_sum

Sum of etcd watcher event send latency in milliseconds

field_validation_request_duration_seconds_bucket

Field validation request duration distribution bucket (seconds)

field_validation_request_duration_seconds_count

Field validation request duration count (seconds)

field_validation_request_duration_seconds_sum

Total field validation request duration in seconds

get_token_count

Get token count

get_token_fail_count

Failed token acquisition count

grpc_client_handled_total

gRPC client: Total RPCs completed

grpc_client_msg_received_total

gRPC client: Total messages received

grpc_client_msg_sent_total

gRPC client: Total messages sent

grpc_client_started_total

gRPC client: Total RPCs started

hidden_metric_total

Hidden metric: Total

hidden_metrics_total

Hidden metric: Total

http_request_duration_microseconds

HTTP request: Duration (microseconds)

http_request_size_bytes

HTTP request: size (bytes)

http_requests_total

HTTP requests: Total

http_response_size_bytes

HTTP response size (bytes)

Job

Job name

job_instance_mode

Job instance mode

kube_apiserver_clusterip_allocator_allocated_ips

Kubernetes APIServer: number of IPs allocated by the ClusterIP allocator

kube_apiserver_clusterip_allocator_allocation_errors_total

Kubernetes API server: Total ClusterIP allocator allocation errors

kube_apiserver_clusterip_allocator_allocation_total

Kubernetes APIServer: Total allocations by the ClusterIP allocator

kube_apiserver_clusterip_allocator_available_ips

Kubernetes API server: Available IP address count for the ClusterIP allocator

kube_apiserver_nodeport_allocator_allocated_ports

Kubernetes APIServer: Number of ports allocated by the NodePort allocator

kube_apiserver_nodeport_allocator_allocation_errors_total

Kubernetes APIServer: Total NodePort allocator allocation errors

kube_apiserver_nodeport_allocator_allocation_total

Kubernetes APIServer: Total allocations by the NodePort allocator

kube_apiserver_nodeport_allocator_available_ports

Kubernetes APIServer: Number of available ports for the NodePort allocator

kube_apiserver_pod_logs_backend_tls_failure_total

Kubernetes APIServer: Total pods/logs requests that failed TLS authentication

kube_apiserver_pod_logs_insecure_backend_total

Kubernetes APIServer: Total insecure pods/logs requests

kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total

Kubernetes API server: Total pods/logs requests that failed TLS authentication

kube_apiserver_pod_logs_pods_logs_insecure_backend_total

Kubernetes API server: Number of insecure pods/logs requests

kubelet_container_log_filesystem_used_bytes

Kubelet: File system usage for container logs in bytes

kubelet_node_name

Kubelet: Node name

kubelet_pleg_relist_duration_seconds_bucket

Kubelet: PLEG relist duration buckets (seconds)

kubelet_pod_worker_duration_seconds_bucket

Kubelet: bucketing of pod worker duration in seconds

kubelet_volume_stats_available_bytes

Kubelet: Available bytes in volume stats

kubelet_volume_stats_capacity_bytes

Kubelet: Capacity in bytes from volume statistics

kubelet_volume_stats_inodes

Kubelet: Total number of inodes in the volume

kubelet_volume_stats_inodes_free

Kubelet: Free inode count on the volume

kubelet_volume_stats_inodes_used

Kubelet: Used inode count for the volume

kubelet_volume_stats_used_bytes

Kubelet: Volume used bytes

kubernetes_build_info

Kubernetes build information

kubernetes_feature_enabled

Kubernetes feature status: Enabled

last_list_all_response_size_in_bytes

Total size of the last list response (bytes)

memory_utilization_byte

Memory utilization: Bytes

node_authorizer_graph_actions_duration_seconds_bucket

Node authorizer: Graph operation duration bucketing in seconds

node_authorizer_graph_actions_duration_seconds_count

Node authorizer: Graph operation duration in seconds

node_authorizer_graph_actions_duration_seconds_sum

Node authorizer: Total duration of graph operations in seconds

pod_security_evaluations_total

Total pod security assessments

pod_security_exemptions_total

Total pod security exemptions

process_cpu_seconds_total

Total process CPU time in seconds

process_max_fds

Maximum number of open file descriptors for the process

process_open_fds

Number of open file descriptors for the process

process_resident_memory_bytes

Process resident memory in bytes

process_start_time_seconds

Process start time as a Unix timestamp (seconds)

process_virtual_memory_bytes

Process virtual memory in bytes

process_virtual_memory_max_bytes

Maximum virtual memory of a process in bytes

registered_metric_total

Registered metrics: Total

registered_metrics_total

Registered metrics: Total

rest_client_exec_plugin_certificate_rotation_age_bucket

REST client plugin: Certificate rotation age bucketing (seconds)

rest_client_exec_plugin_certificate_rotation_age_count

REST client plugin: Number of certificate rotation age observations

rest_client_exec_plugin_certificate_rotation_age_sum

REST client plugin: Sum of certificate rotation age in seconds

rest_client_exec_plugin_ttl_seconds

REST client plugin: Certificate TTL in seconds

rest_client_request_duration_seconds_bucket

REST client: Request duration bucketing in seconds

rest_client_request_duration_seconds_count

REST client: Request duration count in seconds

rest_client_request_duration_seconds_sum

REST client: Total request duration in seconds

rest_client_request_latency_seconds_bucket

REST client: Request latency bucketing in seconds

rest_client_request_size_bytes_bucket

REST client: Request size bucketing (bytes)

rest_client_request_size_bytes_count

REST client: Request byte count

rest_client_request_size_bytes_sum

REST client: Total request size (bytes)

rest_client_requests_total

REST client: Total requests

rest_client_response_size_bytes_bucket

REST client: Response size (bytes) bucketing

rest_client_response_size_bytes_count

REST client: Response byte count

rest_client_response_size_bytes_sum

REST client: Total response size (bytes)

rest_client_transport_cache_entries

REST client: number of transport cache entries

rest_client_transport_create_calls_total

REST client: Total transport creation calls

scheduler_pending_pods

Scheduler: Number of pending pods

scheduler_pod_scheduling_attempts_bucket

Scheduler: pod scheduling attempt count bucketing

scheduler_scheduler_cache_size

Scheduler: Scheduler cache size

scrape_duration_seconds

Scrape duration (seconds)

scrape_samples_post_metric_relabeling

Number of scraped samples (after metric relabeling)

scrape_samples_scraped

Number of scraped samples

scrape_series_added

Number of new series scraped

serviceaccount_invalid_legacy_auto_token_uses_total

Total uses of invalid legacy automated service account tokens

serviceaccount_legacy_auto_token_uses_total

Total usage count of legacy automated service account tokens

serviceaccount_legacy_manual_token_uses_total

Total uses of legacy manual service account tokens

serviceaccount_legacy_tokens_total

Total number of legacy service account tokens

serviceaccount_stale_tokens_total

Total number of stale service account tokens

serviceaccount_valid_tokens_total

Total valid service account tokens

ssh_tunnel_open_count

Open SSH tunnel count

ssh_tunnel_open_fail_count

Number of failed SSH tunnel openings

up

Metric collection connectivity

watch_cache_capacity

Watch cache capacity

watch_cache_capacity_decrease_total

Total decreases in watch cache capacity

watch_cache_capacity_increase_total

Total increases in watch cache capacity

workqueue_adds_total

Total additions to the work queue

workqueue_depth

Work queue depth

workqueue_longest_running_processor_seconds

Longest processor run time in the work queue (seconds)

workqueue_queue_duration_seconds_bucket

Work queue queuing duration (seconds) histogram buckets

workqueue_queue_duration_seconds_count

Number of work queue wait time observations

workqueue_queue_duration_seconds_sum

Sum of work queue wait time (seconds)

workqueue_retries_total

Total work queue retries

workqueue_unfinished_work_seconds

Duration of pending work in the work queue (seconds)

workqueue_work_duration_seconds_bucket

Work queue processing duration (seconds) histogram buckets

workqueue_work_duration_seconds_count

Number of work queue processing time observations

workqueue_work_duration_seconds_sum

Total work queue duration (seconds)
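
Many metrics in the table above appear as _bucket, _count, and _sum triples, which is how Prometheus exposes a histogram. The following Python sketch is illustrative only and is not part of Managed Service for Prometheus. It shows how the three series relate: the mean is the sum divided by the count, and a quantile can be estimated by linear interpolation over the cumulative buckets, similar in spirit to the PromQL histogram_quantile function. The function names, bucket boundaries, and sample values are assumptions made for the example.

```python
import math


def histogram_mean(total_sum, total_count):
    """Mean observation = _sum / _count (NaN when nothing was observed)."""
    return total_sum / total_count if total_count else math.nan


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the _bucket series: sorted by upper bound (the `le`
    label) and ending with the +Inf bucket, whose count equals _count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest
                # finite bound, because no upper limit is known.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical samples for a *_duration_seconds histogram.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
mean = histogram_mean(42.0, 100)
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.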

Node Exporter (Job name: node-exporter)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

Duration of append operations for the Alibaba Cloud Prometheus agent in seconds.

aliyun_prometheus_agent_job_discovery_status

Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

Total number of scrapes by target for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_target_info

Information about the targets of the Alibaba Cloud Prometheus agent.

job

The name of the job.

node_boot_time_seconds

Node boot time as a Unix timestamp, in seconds.

node_context_switches_total

Total number of context switches on the node.

node_cpu_seconds_total

Total CPU time spent by the node.

node_disk_io_now

Number of disk I/O operations currently in progress on the node.

node_disk_io_time_seconds_total

Total time spent on disk I/O on the node, in seconds.

node_disk_io_time_weighted_seconds_total

Total weighted time spent on disk I/O on the node, in seconds.

node_disk_read_bytes_total

Total bytes read from disk on the node.

node_disk_read_time_seconds_total

Total time spent reading from disk on the node, in seconds.

node_disk_reads_completed_total

Total number of completed disk reads on the node.

node_disk_reads_merged_total

Total number of merged disk reads on the node.

node_disk_write_time_seconds_total

Total time spent writing to disk on the node, in seconds.

node_disk_writes_completed_total

Total number of completed disk writes on the node.

node_disk_writes_merged_total

Total number of merged disk writes on the node.

node_disk_written_bytes_total

Total bytes written to disk on the node.

node_exporter_build_info

Build information for Node Exporter.

node_filefd_allocated

Number of allocated file descriptors on the node.

node_filefd_maximum

Maximum number of file descriptors on the node.

node_filesystem_avail_bytes

Number of available bytes in the file system on the node.

node_filesystem_free_bytes

Number of free bytes in the file system on the node.

node_filesystem_size_bytes

Total size of the file system on the node, in bytes.

node_intr_total

Total number of interrupts on the node.

node_load1

1-minute load average on the node.

node_load15

15-minute load average on the node.

node_load5

5-minute load average on the node.

node_memory_MemAvailable_bytes

Available memory on the node, in bytes.

node_memory_MemFree_bytes

Free memory on the node, in bytes.

node_memory_MemTotal_bytes

Total memory on the node, in bytes.

node_memory_Slab_bytes

Slab memory on the node, in bytes.

node_memory_SReclaimable_bytes

Reclaimable slab memory on the node, in bytes.

node_netstat_Tcp_InErrs

Number of TCP receive errors.

node_netstat_Tcp_InSegs

Number of received TCP segments.

node_netstat_Tcp_OutSegs

Number of sent TCP segments.

node_netstat_Tcp_PassiveOpens

Number of passive TCP connection openings.

node_netstat_Tcp_RetransSegs

Number of retransmitted TCP segments.

node_network_receive_bytes_total

Total number of bytes received over the network.

node_network_receive_drop_total

Total number of received packets dropped.

node_network_receive_errs_total

Total number of receive errors.

node_network_receive_packets_total

Total number of packets received.

node_network_transmit_bytes_total

Total number of bytes transmitted over the network.

node_network_transmit_drop_total

Total number of transmitted packets dropped.

node_network_transmit_errs_total

Total number of transmit errors.

node_network_transmit_packets_total

Total number of packets transmitted.

node_network_up

Indicates whether the network interface is up.

node_processes_max_processes

Maximum number of processes.

node_processes_max_threads

Maximum number of threads.

node_processes_pids

Number of process IDs.

node_processes_state

Distribution of process states.

node_processes_threads

Number of threads.

node_schedstat_running_seconds_total

Total seconds spent in the running state according to scheduling statistics.

node_sockstat_TCP_alloc

Number of allocated TCP sockets.

node_sockstat_TCP_inuse

Number of TCP sockets in use.

node_sockstat_TCP_mem

Memory usage of TCP sockets, in pages.

node_sockstat_TCP_mem_bytes

Memory usage of TCP sockets, in bytes.

node_sockstat_TCP_tw

Number of TCP sockets in the TIME_WAIT state.

node_time_zone_offset_seconds

Time zone offset in seconds.

node_timex_offset_seconds

Time offset in seconds.

node_timex_sync_status

Clock synchronization status.

node_uname_info

System information from uname.

node_vmstat_pgfault

Number of page faults from VM statistics.

node_vmstat_pgmajfault

Number of major page faults from VM statistics.

node_vmstat_pgpgin

Number of page-ins from VM statistics.

node_vmstat_pgpgout

Number of page-outs from VM statistics.

up

Connectivity for metric scraping.
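
Metrics whose names end in _total, such as node_cpu_seconds_total and node_network_receive_bytes_total, are cumulative counters, so a dashboard usually needs the per-second rate between two scrapes rather than the raw value. The following Python sketch shows one way to derive such a rate and, from it, an approximate CPU utilization. It is a simplified illustration: the function names and sample values are hypothetical, the idle series is assumed to be the sum of node_cpu_seconds_total samples with mode="idle", and a decrease in value is treated as a counter reset.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a cumulative counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        # Counter reset (for example, after a restart): assume the counter
        # restarted from zero, so the current value is the whole increase.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / interval_seconds


def cpu_utilization(prev_idle, curr_idle, interval_seconds, cpu_count):
    """Fraction of CPU time not spent idle across all cores.

    prev_idle and curr_idle are idle CPU seconds summed over all cores
    (assumed to come from the mode="idle" series).
    """
    idle_rate = counter_rate(prev_idle, curr_idle, interval_seconds)
    return 1.0 - idle_rate / cpu_count


# Hypothetical samples taken 30 seconds apart on a 4-core node.
utilization = cpu_utilization(1000.0, 1090.0, 30, 4)
receive_rate = counter_rate(1000, 1600, 60)
```

In PromQL, the rate() function performs the same kind of calculation over a range of samples, so a sketch like this is mainly useful when processing exported data outside Prometheus.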

kube-state-metrics (Job name: _kube-state-metrics)

Metric

Description

kube_configmap_info

Information about Kubernetes ConfigMaps

kube_cronjob_annotations

Kubernetes CronJob annotations

kube_cronjob_created

The creation time of the Kubernetes CronJob.

kube_cronjob_info

Kubernetes CronJob information

kube_cronjob_labels

Kubernetes CronJob labels

kube_cronjob_metadata_resource_version

The resource version of the Kubernetes CronJob metadata.

kube_cronjob_next_schedule_time

The next scheduled time of a Kubernetes CronJob.

kube_cronjob_spec_failed_job_history_limit

Kubernetes CronJob failed job history limit

kube_cronjob_spec_starting_deadline_seconds

The starting deadline for the Kubernetes CronJob in seconds.

kube_cronjob_spec_successful_job_history_limit

The retention limit for the history of successful jobs in a Kubernetes CronJob.

kube_cronjob_spec_suspend

The suspend status of a Kubernetes CronJob.

kube_cronjob_status_active

Number of currently running jobs of the Kubernetes CronJob

kube_cronjob_status_last_schedule_time

The last schedule time of the Kubernetes CronJob

kube_cronjob_status_last_successful_time

The last successful running time of the Kubernetes CronJob

kube_daemonset_created

The creation time of the Kubernetes DaemonSet.

kube_daemonset_status_current_number_scheduled

The current number of nodes scheduled for the Kubernetes DaemonSet.

kube_daemonset_status_desired_number_scheduled

The desired number of scheduled nodes for a Kubernetes DaemonSet.

kube_daemonset_status_number_available

Number of available nodes in the Kubernetes DaemonSet

kube_daemonset_status_number_misscheduled

Number of nodes incorrectly running a Kubernetes DaemonSet pod

kube_daemonset_status_number_ready

The number of ready nodes in a Kubernetes DaemonSet.

kube_daemonset_status_number_unavailable

Number of unavailable nodes in the Kubernetes DaemonSet

kube_daemonset_status_updated_number_scheduled

The number of nodes scheduled with the updated Kubernetes DaemonSet.

kube_daemonset_updated_number_scheduled

Number of nodes scheduled with the updated Kubernetes DaemonSet.

kube_deployment_created

The creation time of the Kubernetes deployment.

kube_deployment_labels

Kubernetes deployment labels

kube_deployment_metadata_generation

The generation of the Kubernetes deployment metadata.

kube_deployment_spec_replicas

Number of replicas in the Kubernetes deployment specification

kube_deployment_spec_strategy_rollingupdate_max_unavailable

The maximum number of unavailable pods during a rolling update for a Kubernetes deployment

kube_deployment_status_observed_generation

The observed generation of the Kubernetes deployment.

kube_deployment_status_replicas

Total number of replicas in a Kubernetes deployment

kube_deployment_status_replicas_available

Number of available Kubernetes deployment replicas

kube_deployment_status_replicas_ready

Number of ready replicas in a Kubernetes deployment

kube_deployment_status_replicas_unavailable

Number of unavailable replicas in a Kubernetes deployment

kube_deployment_status_replicas_updated

The number of updated replicas in a Kubernetes deployment.

kube_horizontalpodautoscaler_info

Information about the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_labels

Kubernetes HorizontalPodAutoscaler labels

kube_horizontalpodautoscaler_metadata_generation

The metadata generation of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_max_replicas

The maximum number of replicas in the specification for a Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_min_replicas

The minimum number of replicas for a Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_spec_target_metric

The target metric of a Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_condition

The status condition of a Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_current_replicas

The current number of replicas of the Kubernetes HorizontalPodAutoscaler.

kube_horizontalpodautoscaler_status_desired_replicas

Desired number of replicas for the Kubernetes HorizontalPodAutoscaler

kube_hpa_labels

Kubernetes HorizontalPodAutoscaler labels

kube_hpa_metadata_generation

The metadata generation of the Kubernetes HorizontalPodAutoscaler.

kube_hpa_spec_max_replicas

The maximum number of replicas for a Kubernetes HorizontalPodAutoscaler.

kube_hpa_spec_min_replicas

The minimum number of replicas in the Kubernetes HorizontalPodAutoscaler specification.

kube_hpa_spec_target_metric

The target metric for a Kubernetes HorizontalPodAutoscaler.

kube_hpa_status_condition

Kubernetes HorizontalPodAutoscaler status condition

kube_hpa_status_current_replicas

The current number of replicas for the Kubernetes HorizontalPodAutoscaler.

kube_hpa_status_desired_replicas

The desired number of replicas for a Kubernetes HorizontalPodAutoscaler.

kube_ingress_info

Ingress information

kube_job_created

The time when the job was created.

kube_job_failed

Indicates whether the job has failed

kube_job_info

Job information

kube_job_spec_completions

The number of completions specified for the job

kube_job_status_active

Number of actively running pods of the job

kube_job_status_failed

The number of pods of the job that reached the Failed phase.

kube_job_status_succeeded

The number of pods of the job that reached the Succeeded phase.

kube_namespace_created

The creation time of the namespace.

kube_namespace_labels

Namespace labels

kube_namespace_status_phase

Namespace status phase

kube_node_info

Node information

kube_node_labels

Node labels

kube_node_spec_taint

Node taint configuration

kube_node_spec_unschedulable

Flag indicating whether the node is unschedulable.

kube_node_status_allocatable

The amount of allocatable resources on a node.

kube_node_status_allocatable_cpu_cores

Number of allocatable CPU cores on the node.

kube_node_status_allocatable_memory_bytes

Allocatable memory on the node in bytes

kube_node_status_allocatable_pods

Number of allocatable pods on the node

kube_node_status_capacity

Node capacity

kube_node_status_capacity_cpu_cores

The CPU capacity of a node in cores.

kube_node_status_capacity_memory_bytes

Node memory capacity in bytes

kube_node_status_capacity_pods

Node pod capacity

kube_node_status_condition

Node status condition

kube_persistentvolume_status_phase

The status phase of the persistent volume.

kube_persistentvolumeclaim_info

Persistent Volume Claim information

kube_persistentvolumeclaim_resource_requests_storage_bytes

The amount of storage requested by a persistent volume claim

kube_persistentvolumeclaim_status_phase

The status phase of the persistent volume claim.

kube_pod_completion_time

Pod completion time

kube_pod_container_info

Pod container information

kube_pod_container_resource_limits

Pod container resource limits

kube_pod_container_resource_limits_cpu_cores

Pod container CPU core limit

kube_pod_container_resource_limits_memory_bytes

Pod container memory limit in bytes

kube_pod_container_resource_requests

Pod container resource request

kube_pod_container_resource_requests_cpu_cores

Pod container CPU core request

kube_pod_container_resource_requests_memory_bytes

Pod container memory request in bytes

kube_pod_container_status_last_terminated_reason

Last termination reason of the pod container

kube_pod_container_status_ready

Pod container readiness status

kube_pod_container_status_restarts_total

Pod container restart count

kube_pod_container_status_running

Pod container running status

kube_pod_container_status_terminated

Pod container termination status

kube_pod_container_status_terminated_reason

Pod container termination reason

kube_pod_container_status_waiting

Pod container waiting status

kube_pod_container_status_waiting_reason

Pod container wait reason

kube_pod_created

Pod creation time

kube_pod_deletion_timestamp

Pod deletion timestamp

kube_pod_info

Pod information

kube_pod_labels

Pod labels

kube_pod_owner

Pod owner object

kube_pod_start_time

Pod start time

kube_pod_status_container_ready_time

Pod container readiness time

kube_pod_status_initialized_time

Pod status initialization completion time

kube_pod_status_phase

Pod phase

kube_pod_status_ready

Pod readiness status

kube_pod_status_ready_time

Pod readiness time

kube_pod_status_reason

Pod status reason

kube_pod_status_scheduled_time

Pod scheduling time

kube_pod_status_unschedulable

Pod unschedulable status

kube_replicaset_owner

ReplicaSet owner object

kube_replicaset_status_ready_replicas

Number of ready replicas in the ReplicaSet

kube_resource_relationship

Resource relationships

kube_resourcequota

Resource quota

kube_resourcequota_created

Resource quota creation time

kube_secret_info

Secret information

kube_service_info

Service information

kube_service_spec_type

Service type

kube_service_status_load_balancer_ingress

Service load balancer ingress status

kube_statefulset_created

StatefulSet creation time

kube_statefulset_metadata_generation

StatefulSet metadata generation

kube_statefulset_replicas

Desired number of replicas for the StatefulSet

kube_statefulset_status_replicas

Number of replicas in the StatefulSet status

kube_statefulset_status_replicas_available

Number of available StatefulSet replicas

kube_statefulset_status_replicas_ready

Number of ready StatefulSet replicas

kube_statefulset_status_replicas_updated

Number of updated StatefulSet replicas

rest_client_requests_total

Total REST client requests

up

Connectivity for metric collection

workqueue_adds_total

Total work queue additions

workqueue_depth

Work queue depth

workqueue_queue_duration_seconds_bucket

Work queue queuing duration distribution (seconds)

kube-events (Job name: _arms/kube-event)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of an append operation for the Alibaba Cloud Prometheus agent, in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of a scrape job for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrape_custom_error

The number of custom scrape errors for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by target for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_target_info

The target information for the Alibaba Cloud Prometheus agent.

eventer_events_error_total

The total number of event processing errors.

eventer_events_normal_total

The total number of normal events.

eventer_events_warning_total

The total number of warning events.

eventer_exporter_duration_milliseconds_count

The number of samples for the event export duration, in milliseconds.

eventer_exporter_duration_milliseconds_sum

The total event export duration, in milliseconds.

eventer_manager_last_time_seconds

The last operation time of the event manager, in seconds.

eventer_scraper_duration_milliseconds_count

The number of samples for the event scrape duration, in milliseconds.

eventer_scraper_duration_milliseconds_sum

The total event scrape duration, in milliseconds.

eventer_scraper_events_total_number

The total number of events scraped.

eventer_scraper_last_time_seconds

The time of the last event scrape, in seconds.

up

The connectivity for metric collection.

CoreDNS (Job name: arms-ack-coredns)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.

aliyun_prometheus_agent_job_discovery_status

The status of scrape job discovery for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrape_custom_error

Number of custom scrape errors from the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Alibaba Cloud Prometheus agent per target.

aliyun_prometheus_agent_target_info

Target information for the Alibaba Cloud Prometheus agent

coredns_autopath_success_count_total

Total success count for CoreDNS autopath.

coredns_autopath_success_total

Total number of successful CoreDNS autopaths.

coredns_build_info

CoreDNS build information

coredns_cache_drops_total

Total CoreDNS cache drop count

coredns_cache_entries

Number of CoreDNS cache entries

coredns_cache_evictions_total

Total number of CoreDNS cache evictions

coredns_cache_hits_total

Total CoreDNS cache hits

coredns_cache_misses_total

Total number of CoreDNS cache misses

coredns_cache_requests_total

Total CoreDNS cache requests

coredns_cache_size

The size of the CoreDNS cache.

coredns_dns_do_requests_total

Total CoreDNS DNS DO requests

coredns_dns_request_count_total

Total DNS request count for CoreDNS

coredns_dns_request_duration_seconds_bucket

CoreDNS DNS request duration histogram buckets (seconds)

coredns_dns_request_duration_seconds_count

The count of CoreDNS DNS requests

coredns_dns_request_duration_seconds_sum

Total CoreDNS DNS request duration in seconds

coredns_dns_request_size_bytes_bucket

CoreDNS DNS request size histogram buckets (bytes)

coredns_dns_request_size_bytes_count

CoreDNS DNS request size count (bytes)

coredns_dns_request_size_bytes_sum

Sum of CoreDNS DNS request size (bytes)

coredns_dns_request_type_count_total

The total number of DNS requests in CoreDNS, categorized by request type.

coredns_dns_requests_total

Total DNS requests handled by CoreDNS

coredns_dns_response_rcode_count_total

Total number of CoreDNS DNS responses by response code

coredns_dns_response_size_bytes_bucket

CoreDNS DNS response size histogram buckets (bytes)

coredns_dns_response_size_bytes_count

CoreDNS DNS response size (bytes) count

coredns_dns_response_size_bytes_sum

The sum of CoreDNS DNS response sizes in bytes

coredns_dns_responses_total

Total number of CoreDNS DNS responses

coredns_forward_conn_cache_hits_total

Total CoreDNS forward connection cache hits.

coredns_forward_conn_cache_misses_total

Total misses in the CoreDNS forward connection cache.

coredns_forward_healthcheck_broken_total

Total number of times all CoreDNS forward upstreams failed health checks

coredns_forward_healthcheck_failure_count_total

Total count of CoreDNS forwarding health check failures

coredns_forward_healthcheck_failures_total

Total CoreDNS forward health check failures

coredns_forward_max_concurrent_rejects_total

Total number of queries rejected by CoreDNS forwarding because the maximum concurrency was reached

coredns_forward_request_count_total

Total count of requests forwarded by CoreDNS

coredns_forward_request_duration_seconds_bucket

Histogram buckets for CoreDNS forwarded request duration in seconds.

coredns_forward_request_duration_seconds_count

Count of CoreDNS forward request duration (seconds)

coredns_forward_request_duration_seconds_sum

Total duration of CoreDNS forward requests in seconds.

coredns_forward_requests_total

Total number of requests forwarded by CoreDNS

coredns_forward_response_rcode_count_total

Total count of CoreDNS forwarded response codes

coredns_forward_responses_total

Total number of responses forwarded by CoreDNS

coredns_forward_sockets_open

Number of open sockets for CoreDNS forwarding

coredns_health_request_duration_seconds_bucket

Histogram buckets for CoreDNS health check request duration in seconds

coredns_health_request_duration_seconds_count

Number of CoreDNS health check requests.

coredns_health_request_duration_seconds_sum

Total duration of CoreDNS health check requests in seconds.

coredns_health_request_failures_total

Total number of failed CoreDNS health check requests

coredns_hosts_entries

Number of CoreDNS host entries

coredns_hosts_reload_timestamp_seconds

CoreDNS host reload timestamp (seconds)

coredns_kubernetes_dns_programming_duration_seconds_bucket

CoreDNS Kubernetes DNS programming duration histogram buckets (seconds)

coredns_kubernetes_dns_programming_duration_seconds_count

Number of CoreDNS Kubernetes DNS programming duration observations

coredns_kubernetes_dns_programming_duration_seconds_sum

Sum of CoreDNS Kubernetes DNS programming duration in seconds

coredns_local_localhost_requests_total

Total CoreDNS requests to localhost

coredns_panic_count_total

Total CoreDNS panics

coredns_panics_total

Total CoreDNS panic count

coredns_plugin_enabled

CoreDNS plugin status

coredns_reload_failed_total

Total CoreDNS reload failures

coredns_reload_version_info

CoreDNS reload version

coredns_template_matches_total

Total CoreDNS template matches

up

Metric collection connectivity

CSI (cluster dimension) (Job name: k8s-csi-cluster-pv)

Metric

Description

alibaba_cloud_storage_operator_build_info

The build information for the Alibaba Cloud storage operator.

aliyun_prometheus_agent_append_duration_seconds

The duration of the append operation for the Alibaba Cloud Prometheus agent, in seconds.

aliyun_prometheus_agent_job_discovery_status

The discovery status of the scrape job for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrape_custom_error

The number of custom scrape errors for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by target for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_target_info

The target information of the Alibaba Cloud Prometheus agent.

cluster_pv_detail_num_total

The total count of detailed information for cluster PVs.

cluster_pv_status_num_total

The total number of cluster PV statuses.

cluster_pvc_detail_num_total

The total count of detailed information for cluster PVCs.

cluster_pvc_status_num_total

The total number of cluster PVC statuses.

cluster_scrape_collector_duration_seconds

The duration of the cluster scrape collector, in seconds.

cluster_scrape_collector_success

The number of successful attempts by the cluster scrape collector.

up

The connectivity for metric scraping.

CSI (node dimension) (Job name: k8s-csi-node-pv)

Metric

Description

alibaba_cloud_csi_driver_build_info

Alibaba Cloud CSI driver build information

aliyun_prometheus_agent_append_duration_seconds

Alibaba Cloud Prometheus agent append operation duration in seconds

aliyun_prometheus_agent_job_discovery_status

Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_scrape_custom_error

Number of custom scrape errors from the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_scrapes_by_target_total

Total number of scrapes by target from the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_target_info

Target information for the Alibaba Cloud Prometheus agent

cluster_scrape_collector_duration_seconds

Duration of the cluster scrape collector in seconds

cluster_scrape_collector_success

Number of successful cluster scrape collections

container_fs_available_bytes

Available bytes in the container file system

container_fs_inodes_free

Available inodes in the container file system

container_fs_inodes_total

Total inodes in the container file system

container_fs_inodes_used

Used inodes in the container file system

container_fs_limit_bytes

Byte limit for the container file system

container_fs_usage_bytes

Used bytes in the container file system

ephemeral_storage_pod_available_bytes

Available bytes for the ephemeral storage pod

ephemeral_storage_pod_inodes_free

Available inodes for the ephemeral storage pod

ephemeral_storage_pod_inodes_total

Total inodes for the ephemeral storage pod

ephemeral_storage_pod_inodes_used

Used inodes for the ephemeral storage pod

ephemeral_storage_pod_limit_bytes

Byte limit for the ephemeral storage pod

ephemeral_storage_pod_usage_bytes

Used bytes for the ephemeral storage pod

node_volume_backend_posix_access_total_counter

Total POSIX access operations on the node volume backend.

node_volume_backend_posix_getattr_total_counter

Total POSIX getattr calls on the node volume backend.

node_volume_backend_posix_getmode_total_counter

Total POSIX get mode operations on the node volume backend.

node_volume_backend_posix_link_total_counter

Total POSIX link operations on the node volume backend.

node_volume_backend_posix_lookup_total_counter

Total POSIX lookup operations on the node volume backend.

node_volume_backend_posix_mknod_total_counter

Total POSIX mknod operations on the node volume backend.

node_volume_backend_posix_readdir_total_counter

Total POSIX readdir operations on the node volume backend.

node_volume_backend_posix_readlink_total_counter

Total POSIX readlink operations on the node volume backend.

node_volume_backend_posix_remove_total_counter

Total POSIX remove operations on the node volume backend.

node_volume_backend_posix_rename_total_counter

Total POSIX rename operations on the node volume backend.

node_volume_backend_posix_setattr_total_counter

Total POSIX setattr operations on the node volume backend.

node_volume_backend_posix_statfs_total_counter

Total POSIX statfs operations on the node volume backend.

node_volume_backend_read_bytes_total_counter

Total bytes read from the node volume backend.

node_volume_backend_read_completed_total_counter

Total completed read requests on the node volume backend.

node_volume_backend_read_time_milliseconds_total_counter

Total read time in milliseconds on the node volume backend.

node_volume_backend_write_bytes_total_counter

Total bytes written to the node volume backend.

node_volume_backend_write_completed_total_counter

Total completed write requests on the node volume backend.

node_volume_backend_write_time_milliseconds_total_counter

Total write time in milliseconds on the node volume backend.

node_volume_capacity_bytes_available

Available capacity of the node volume in bytes.

node_volume_capacity_bytes_available_counter

Counter for the available capacity of the node volume in bytes.

node_volume_capacity_bytes_total

Total capacity of the node volume in bytes.

node_volume_capacity_bytes_total_counter

Counter for the total capacity of the node volume in bytes.

node_volume_capacity_bytes_used

Used capacity of the node volume in bytes.

node_volume_capacity_bytes_used_counter

Counter for the used capacity of the node volume in bytes.

node_volume_hot_spot_head_file_top

Ranking of hot spot head files on the node volume.

node_volume_hot_spot_read_file_top

Ranking of hot spot read files on the node volume.

node_volume_hot_spot_write_file_top

Ranking of hot spot write files on the node volume.

node_volume_inode_bytes_available_counter

Counter for available bytes for inodes on the node volume.

node_volume_inode_bytes_total_counter

Counter for total bytes for inodes on the node volume.

node_volume_inode_bytes_used_counter

Counter for used bytes for inodes on the node volume.

node_volume_inodes_available

Available inodes on the node volume.

node_volume_inodes_total

Total inodes on the node volume.

node_volume_inodes_used

Used inodes on the node volume.

node_volume_io_now

Current I/O operations on the node volume.

node_volume_io_time_seconds_total

Total I/O time on the node volume in seconds.

node_volume_oss_delete_object_total_counter

Total objects deleted from OSS for the node volume.

node_volume_oss_get_object_total_counter

Total objects retrieved from OSS for the node volume.

node_volume_oss_head_object_total_counter

Total head object operations on OSS for the node volume.

node_volume_oss_post_object_total_counter

Total objects posted to OSS for the node volume.

node_volume_oss_put_object_total_counter

Total objects put to OSS for the node volume.

node_volume_posix_access_total_counter

Total POSIX access operations on the node volume.

node_volume_posix_chmod_total_counter

Total POSIX chmod operations on the node volume.

node_volume_posix_chown_total_counter

Total POSIX chown operations on the node volume.

node_volume_posix_create_total_counter

Total POSIX create operations on the node volume.

node_volume_posix_flush_total_counter

Total POSIX flush operations on the node volume.

node_volume_posix_fsync_total_counter

Total POSIX fsync operations on the node volume.

node_volume_posix_mkdir_total_counter

Total POSIX mkdir operations on the node volume.

node_volume_posix_open_total_counter

Total POSIX open operations on the node volume.

node_volume_posix_opendir_total_counter

Total POSIX opendir operations on the node volume.

node_volume_posix_read_total_counter

Total POSIX read operations on the node volume.

node_volume_posix_readdir_total_counter

Total POSIX readdir operations on the node volume.

node_volume_posix_release_total_counter

Total POSIX release operations on the node volume.

node_volume_posix_rename_total_counter

Total POSIX rename operations on the node volume.

node_volume_posix_rmdir_total_counter

Total POSIX rmdir operations on the node volume.

node_volume_posix_truncate_total_counter

Total POSIX truncate operations on the node volume.

node_volume_posix_write_total_counter

Total POSIX write operations on the node volume.

node_volume_read_bytes_total

Total bytes read from the node volume.

node_volume_read_bytes_total_counter

Counter for the total bytes read from the node volume.

node_volume_read_completed_total

Total completed read operations on the node volume.

node_volume_read_completed_total_counter

Counter for total completed read operations on the node volume.

node_volume_read_merged_total

Total merged read operations on the node volume.

node_volume_read_queue_time_milliseconds_total

Total time spent in the read queue on the node volume, in milliseconds.

node_volume_read_rtt_time_milliseconds_total

Total round trip time for read operations on the node volume, in milliseconds.

node_volume_read_sent_bytes_total

Total bytes sent for read operations on the node volume.

node_volume_read_time_milliseconds_total

Total time for read operations on the node volume, in milliseconds.

node_volume_read_time_milliseconds_total_counter

Counter for the total time for read operations on the node volume, in milliseconds.

node_volume_read_timeouts_total

Total read timeouts on the node volume.

node_volume_read_transmissions_total

Total read transmissions on the node volume.

node_volume_vg_free_bytes

Free bytes in the node volume group (VG).

node_volume_vg_size_bytes

Total size of the node volume group (VG) in bytes.

node_volume_write_bytes_total

Total bytes written to the node volume.

node_volume_write_bytes_total_counter

Counter for the total bytes written to the node volume.

node_volume_write_completed_total

Total completed write operations on the node volume.

node_volume_write_completed_total_counter

Counter for total completed write operations on the node volume.

node_volume_write_merged_total

Total merged write operations on the node volume.

node_volume_write_queue_time_milliseconds_total

Total time spent in the write queue on the node volume, in milliseconds.

node_volume_write_recv_bytes_total

Total bytes received for write operations on the node volume.

node_volume_write_rtt_time_milliseconds_total

Total round trip time for write operations on the node volume, in milliseconds.

node_volume_write_time_milliseconds_total

Total time for write operations on the node volume, in milliseconds.

node_volume_write_time_milliseconds_total_counter

Counter for the total time for write operations on the node volume, in milliseconds.

node_volume_write_timeouts_total

Total write timeouts on the node volume.

node_volume_write_transmissions_total

Total write transmissions on the node volume.

up

Connectivity for metric scraping.

GPU-Exporter (Job name: gpu-exporter)

Metric

Description

DCGM_CUSTOM_ALLOCATE_MODE

The GPU allocation mode of the node. Valid values: 0 (None): no GPU pods are running on the node. 1 (Exclusive): GPU pods on the node run in exclusive mode. 2 (Share): GPU pods on the node run in shared mode.

DCGM_CUSTOM_CONTAINER_CP_ALLOCATED

The ratio of the computing power allocated to a container to the total computing power of the GPU. The value ranges from 0 to 1. For example, if a GPU has 100 units of computing power and 30 units are allocated to a container, the ratio is 30/100 = 0.3. The value is 0 when the container requests only GPU memory, in either exclusive or shared mode. In that case, 0 means that computing power is not limited.

DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED

The GPU memory allocated to the container.

DCGM_CUSTOM_DEV_FB_ALLOCATED

The fraction of total GPU memory that is allocated. The value ranges from 0 to 1.

DCGM_CUSTOM_DEV_FB_TOTAL

The total memory of the GPU.

DCGM_CUSTOM_ILLEGAL_PROCESS_DECODE_UTIL

Illegal process decode utilization

DCGM_CUSTOM_ILLEGAL_PROCESS_ENCODE_UTIL

Illegal process encoding utilization

DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_COPY_UTIL

Illegal process memory copy utilization

DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_USED

Memory used by illegal process

DCGM_CUSTOM_ILLEGAL_PROCESS_SM_UTIL

Illegal process Streaming Multiprocessor (SM) utilization

DCGM_CUSTOM_PROCESS_DECODE_UTIL

Indicates the decoder utilization of the GPU thread.

DCGM_CUSTOM_PROCESS_ENCODE_UTIL

The encoder utilization of the GPU thread.

DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL

Indicates the memory copy utilization of GPU threads.

DCGM_CUSTOM_PROCESS_MEM_USED

The GPU memory currently used by the GPU thread.

DCGM_CUSTOM_PROCESS_SM_UTIL

The SM utilization of GPU threads.

DCGM_FI_DEV_APP_MEM_CLOCK

The application memory clock speed.

DCGM_FI_DEV_APP_SM_CLOCK

The SM application clock frequency.

DCGM_FI_DEV_BAR1_FREE

Indicates the free BAR1 memory.

DCGM_FI_DEV_BAR1_TOTAL

Total size of Base Address Register 1 (BAR1), which maps GPU memory to the system address space.

DCGM_FI_DEV_BAR1_USED

The amount of used BAR1.

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION

Indicates a violation due to the board limit. The value is the duration of the violation.

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS

The reasons for clock throttling.

DCGM_FI_DEV_COUNT

Number of devices

DCGM_FI_DEV_DEC_UTIL

Indicates the decoder utilization.

DCGM_FI_DEV_ENC_UTIL

Indicates the encoder utilization.

DCGM_FI_DEV_FB_FREE

The amount of free framebuffer memory.

DCGM_FI_DEV_FB_USED

The amount of used framebuffer memory. This value corresponds to the used value for Memory-Usage from the nvidia-smi command.

DCGM_FI_DEV_GPU_TEMP

Indicates the GPU temperature.

DCGM_FI_DEV_GPU_UTIL

The GPU utilization: the share of a sample period during which one or more kernel functions are active. The sample period is 1 s or 1/6 s, depending on the GPU product. This metric shows only that a kernel function is using the GPU. It does not show how heavily the GPU is used.

DCGM_FI_DEV_LOW_UTIL_VIOLATION

A violation triggered by the low utilization limit. The value is the duration of the violation.

DCGM_FI_DEV_MEM_CLOCK

The memory clock frequency.

DCGM_FI_DEV_MEM_COPY_UTIL

Indicates the memory bandwidth utilization. For example, an NVIDIA V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_MEMORY_TEMP

Indicates the memory temperature.

DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL

Total NVLINK bandwidth

DCGM_FI_DEV_PCIE_REPLAY_COUNTER

PCIe replay counter (records the number of retries due to data transmission errors)

DCGM_FI_DEV_POWER_USAGE

The power usage of the GPU.

DCGM_FI_DEV_POWER_VIOLATION

Indicates a violation caused by the power limit. The value is the duration of the violation.

DCGM_FI_DEV_PSTATE

Device power state

DCGM_FI_DEV_RELIABILITY_VIOLATION

Indicates a violation caused by the board's reliability limit. The value is the duration of the violation.

DCGM_FI_DEV_RETIRED_DBE

The number of pages retired due to double-bit errors.

DCGM_FI_DEV_RETIRED_PENDING

Number of pages pending retirement (pages in GPU memory marked as unusable due to faults)

DCGM_FI_DEV_RETIRED_SBE

The number of pages retired due to single-bit errors.

DCGM_FI_DEV_SM_CLOCK

Indicates the SM clock frequency.

DCGM_FI_DEV_SYNC_BOOST_VIOLATION

Indicates the duration of a violation caused by a sync boost limit.

DCGM_FI_DEV_THERMAL_VIOLATION

Indicates a thermal violation. The value is the duration of the violation.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

The total energy consumed since the driver was loaded.

DCGM_FI_DEV_VIDEO_CLOCK

Video clock frequency

DCGM_FI_DEV_XID_ERRORS

The error number of the most recent XID error that occurred over a period of time.

DCGM_FI_PROF_DRAM_ACTIVE

The fraction of cycles that the device memory is active sending or receiving data. This metric measures Memory Bandwidth Utilization.

This value is an average over a time interval, not an instantaneous value.

A higher value indicates higher device memory utilization.

A value of 1 (100%) means that one DRAM instruction is executed in every cycle during the time interval. In practice, the maximum achievable peak value is approximately 0.8 (80%).

For example, a value of 0.2 (20%) means that the device memory is read from or written to during 20% of the cycles in the time interval.

DCGM_FI_PROF_GR_ENGINE_ACTIVE

Indicates the percentage of time that a graphics or compute engine is active over a time interval. This value is the average for all graphics and compute engines. An engine is considered active if a graphics or compute Context is attached to a thread and the Context is busy.

DCGM_FI_PROF_NVLINK_RX_BYTES

The rate of data received over NVLink, excluding protocol headers.

This value is an average over a time interval, not an instantaneous value.

The rate is averaged over the time interval. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s. This is true whether the data is transferred at a constant rate or in a burst. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.

DCGM_FI_PROF_NVLINK_TX_BYTES

The rate of data sent over NVLink, excluding protocol headers.

DCGM_FI_PROF_PCIE_RX_BYTES

The rate of data received over the PCIe bus, including protocol headers and data payloads.

This value represents an average over a time interval, not an instantaneous value.

The rate is averaged over the time interval. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer is constant or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per lane.

DCGM_FI_PROF_PCIE_TX_BYTES

The rate of data sent over the PCIe bus, including protocol headers and data payloads.

This value is an average over a time interval, not an instantaneous value.

The rate is averaged over the time interval. For example, if 1 GB of data is sent in 1 second, the rate is 1 GB/s. This is true whether the data is sent at a constant rate or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per lane.

DCGM_FI_PROF_PIPE_FP16_ACTIVE

The fraction of cycles that the FP16 (half-precision) pipeline is active.

This value is an average over a time interval, not an instantaneous value.

A higher value indicates higher utilization of the FP16 Cores.

A value of 1 (100%) means that an FP16 instruction is executed every two cycles for the entire time interval, for example, on a Volta-based GPU.

If the value is 0.2 (20%), the following scenarios are possible:

20% of the Streaming Multiprocessors (SMs) run their FP16 Cores at 100% utilization for the entire time interval.

All SMs run their FP16 Cores at 20% utilization for the entire time interval.

All SMs run their FP16 Cores at 100% utilization for one-fifth of the time interval.

Other combinations.

DCGM_FI_PROF_PIPE_FP32_ACTIVE

Indicates the fraction of cycles where the Fused Multiply-Add (FMA) pipeline is active. FMA operations include both single-precision (FP32) and integer types.

This value is an average over a time interval, not an instantaneous value.

A higher value indicates higher utilization of the FP32 Cores.

A value of 1 (100%) indicates that an FP32 instruction is executed every two cycles over the entire time interval, for example, on a Volta-architecture card.

For example, a value of 0.2 (20%) indicates one of the following scenarios:

20% of the FP32 Cores on the Streaming Multiprocessors (SMs) operate at 100% utilization throughout the interval.

All FP32 Cores on the SMs operate at 20% utilization throughout the interval.

All FP32 Cores on the SMs operate at 100% utilization for 20% of the interval.

Other combinations.

DCGM_FI_PROF_PIPE_FP64_ACTIVE

The fraction of cycles that the FP64 (double-precision) pipe is active.

This value is an average over a time interval, not an instantaneous value.

A higher value means higher utilization of the FP64 Cores.

A value of 1 (100%) means that an FP64 instruction is executed every four cycles for the entire time interval, for example, on a Volta-based GPU.

A value of 0.2 (20%) could mean any of the following:

20% of the Streaming Multiprocessors (SMs) run their FP64 Cores at 100% utilization for the entire interval.

All SMs run their FP64 Cores at 20% utilization for the entire interval.

All SMs run their FP64 Cores at 100% utilization for one-fifth of the interval.

Other combinations.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

The fraction of cycles that the Tensor (HMMA/IMMA) pipe is active.

This value is an average over a time interval and not an instantaneous value.

A higher value indicates higher Tensor Core utilization.

A value of 1 (100%) means a Tensor instruction is issued every other instruction cycle for the entire interval. This is because one instruction takes two cycles to complete.

For example, a value of 0.2 (20%) could mean:

The Tensor Cores on 20% of the Streaming Multiprocessors (SMs) run at 100% utilization for the entire interval.

The Tensor Cores on 100% of the SMs run at 20% utilization for the entire interval.

The Tensor Cores on 100% of the SMs run at 100% utilization for one-fifth of the interval.

Other combinations.

DCGM_FI_PROF_SM_ACTIVE

The percentage of time within an interval that at least one warp is active on a Streaming Multiprocessor (SM). This value is the average across all SMs and is not sensitive to the number of threads per block. A warp is active when it has been scheduled and allocated resources. An active warp can be in a computing or a non-computing state, such as waiting for a memory request. A value below 0.5 indicates that the GPU is underutilized, while a value above 0.8 is necessary for high efficiency. Assume a GPU has N SMs. If a kernel function uses N thread blocks and runs on all N SMs for the entire interval, the value is 1 (100%). If a kernel function runs with N/5 thread blocks during the interval, the value is 0.2. If a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2.

DCGM_FI_PROF_SM_OCCUPANCY

The ratio of active warps to the maximum number of resident warps on a Streaming Multiprocessor (SM). This value is the average across all SMs over a time interval. A higher occupancy does not necessarily mean higher GPU utilization. Higher occupancy indicates more effective GPU utilization only for workloads that are limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE).

nvidia_gpu_allocated_num_devices

The number of allocated GPU devices. Warning: This metric will be deprecated.

nvidia_gpu_memory_allocated_bytes

The allocated memory on the GPU device. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.

nvidia_gpu_sharing_memory

The memory allocated for GPU sharing. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED.

up

Connectivity for metric collection
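The descriptions of the DCGM_FI_PROF_PIPE_*_ACTIVE metrics list several scenarios that all produce the same averaged value of 0.2, and the DCGM_FI_DEV_MEM_COPY_UTIL description gives a bandwidth example of 450 GB/sec out of 900 GB/sec. The following Python sketch works through that arithmetic. It uses a simplified model in which the averaged activity is the product of the fraction of SMs in use, their utilization, and the fraction of the interval they run. That model and the function names are assumptions made for illustration and do not describe how DCGM computes the metrics internally.

```python
def average_pipe_activity(sm_fraction, utilization, time_fraction):
    """Averaged pipe activity under a simplified multiplicative model.

    sm_fraction:   fraction of SMs whose cores are in use (0 to 1)
    utilization:   utilization of those cores while running (0 to 1)
    time_fraction: fraction of the interval during which they run (0 to 1)
    """
    for name, value in (
        ("sm_fraction", sm_fraction),
        ("utilization", utilization),
        ("time_fraction", time_fraction),
    ):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return sm_fraction * utilization * time_fraction


def bandwidth_utilization(current_gb_per_s, peak_gb_per_s):
    """Memory bandwidth utilization as a fraction of the peak bandwidth."""
    if peak_gb_per_s <= 0:
        raise ValueError("peak_gb_per_s must be positive")
    return current_gb_per_s / peak_gb_per_s


# The three scenarios named in the metric descriptions.
scenarios = {
    "20% of SMs at 100% utilization for the entire interval": (0.2, 1.0, 1.0),
    "all SMs at 20% utilization for the entire interval": (1.0, 0.2, 1.0),
    "all SMs at 100% utilization for one-fifth of the interval": (1.0, 1.0, 0.2),
}
for label, args in scenarios.items():
    print(label, "->", average_pipe_activity(*args))

# The V100 example: 450 GB/sec out of a 900 GB/sec peak.
print("memory bandwidth utilization:", bandwidth_utilization(450, 900))
```

Because different workloads can yield the same averaged value, a single reading of these metrics cannot tell the scenarios apart. Comparing them with DCGM_FI_PROF_SM_ACTIVE over the same interval can help narrow down which case applies.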

Cost-Exporter (Job name: alibaba-cloud-cost-exporter)

Metric

Description

deducted_by_cash_coupons

The amount deducted by coupons from a bill for the current instance.

deducted_by_prepaid_card

The amount deducted by a prepaid card from a bill for the current instance.

invoice_discount

The discount amount for a bill of the current instance.

list_price

The unit price for a bill of the current instance.

node_current_price

The actual price of the current node.

node_payAsYouGo_price

The pay-as-you-go price of the current node.

node_payByPeriod_price

The subscription price of the current node.

node_spot_price

The price of the current node, based on the pricing of a Spot Instance with the same specifications.

outstanding_amount

The outstanding amount for a bill of the current instance.

payent_amount

The cash payment amount for a bill of the current instance.

pretax_amount

The amount payable for a bill of the current instance.

pretax_gross_amount

The original amount for a bill of the current instance.

usage

The resource usage for a bill of the current instance.

up

The connectivity for metric collection.

Ingress (Job name: arms-ack-ingress or ingress-ask-default)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of an append operation by the Alibaba Cloud Prometheus agent (in seconds).

aliyun_prometheus_agent_job_discovery_status

Status of scrape job discovery for the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_scrape_custom_error

The number of custom scrape errors for the Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

Total number of scrapes by the Alibaba Cloud Prometheus agent per Target

aliyun_prometheus_agent_target_info

Target information for the Alibaba Cloud Prometheus agent

nginx_ingress_controller_admission_config_size

Nginx Ingress controller - Admission configuration size

nginx_ingress_controller_admission_render_duration

Nginx Ingress controller - Rendering duration

nginx_ingress_controller_admission_render_ingresses

Nginx Ingress controller - Rendered Ingress count

nginx_ingress_controller_admission_roundtrip_duration

Nginx Ingress controller - Roundtrip processing duration

nginx_ingress_controller_admission_tested_duration

Nginx Ingress controller - Test duration

nginx_ingress_controller_admission_tested_ingresses

Nginx Ingress controller - Number of Ingresses tested

nginx_ingress_controller_build_info

Nginx Ingress controller - Build information

nginx_ingress_controller_bytes_sent_bucket

Nginx Ingress controller - Bytes sent distribution (bucket)

nginx_ingress_controller_bytes_sent_count

Nginx Ingress controller - Bytes sent observations (count)

nginx_ingress_controller_bytes_sent_sum

Nginx Ingress controller - Total bytes sent (sum)

nginx_ingress_controller_check_errors

Nginx Ingress controller - Check errors

nginx_ingress_controller_check_success

Nginx Ingress controller - Successful check count

nginx_ingress_controller_config_hash

Nginx Ingress controller - Configuration hash

nginx_ingress_controller_config_last_reload_successful

Nginx Ingress controller - Whether the last configuration reload was successful

nginx_ingress_controller_config_last_reload_successful_timestamp_seconds

Nginx Ingress controller - Timestamp of the last successful configuration reload (seconds)

nginx_ingress_controller_connect_duration_seconds_bucket

Nginx Ingress controller - Connection duration (seconds) - Bucket

nginx_ingress_controller_connect_duration_seconds_count

Nginx Ingress controller - Connection duration (seconds) - Count

nginx_ingress_controller_connect_duration_seconds_sum

Nginx Ingress controller - Connection duration (seconds) - Sum

nginx_ingress_controller_errors

Nginx Ingress controller - Error count

nginx_ingress_controller_header_duration_seconds_bucket

Nginx Ingress controller - Header processing time (seconds) - Bucket

nginx_ingress_controller_header_duration_seconds_count

Nginx Ingress controller - Header processing time (seconds) - Count

nginx_ingress_controller_header_duration_seconds_sum

Total header processing time for the Nginx Ingress controller (seconds)

nginx_ingress_controller_ingress_upstream_latency_seconds

Nginx Ingress controller upstream latency (seconds)

nginx_ingress_controller_ingress_upstream_latency_seconds_count

Nginx Ingress controller upstream latency count

nginx_ingress_controller_ingress_upstream_latency_seconds_sum

Nginx Ingress controller upstream latency sum (seconds)

nginx_ingress_controller_leader_election_status

Nginx Ingress controller leader election status

nginx_ingress_controller_nginx_process_connections

Nginx Ingress controller nginx process connections

nginx_ingress_controller_nginx_process_connections_total

Total connections for the nginx process in the Nginx Ingress controller

nginx_ingress_controller_nginx_process_cpu_seconds_total

Total CPU seconds for the Nginx Ingress controller's nginx process

nginx_ingress_controller_nginx_process_num_procs

Number of Nginx processes for the Nginx Ingress controller

nginx_ingress_controller_nginx_process_oldest_start_time_seconds

Start time of the oldest nginx process in the Nginx Ingress controller (seconds)

nginx_ingress_controller_nginx_process_read_bytes_total

Total bytes read by the nginx process of the Nginx Ingress controller

nginx_ingress_controller_nginx_process_requests_total

Total requests for the Nginx Ingress controller's nginx process

nginx_ingress_controller_nginx_process_resident_memory_bytes

Resident memory size (bytes) of the nginx process for the Nginx Ingress controller

nginx_ingress_controller_nginx_process_virtual_memory_bytes

Virtual memory of the nginx process for the Nginx Ingress controller in bytes

nginx_ingress_controller_nginx_process_write_bytes_total

Total bytes written by the nginx process of the Nginx Ingress controller

nginx_ingress_controller_orphan_ingress

Number of orphan Ingresses for the Nginx Ingress controller

nginx_ingress_controller_request_duration_seconds_bucket

Nginx Ingress controller request latency distribution (seconds)

nginx_ingress_controller_request_duration_seconds_count

Nginx Ingress controller request duration observation count

nginx_ingress_controller_request_duration_seconds_sum

Sum of Nginx Ingress controller request time (seconds)

nginx_ingress_controller_request_size_bucket

Nginx Ingress controller request size distribution

nginx_ingress_controller_request_size_count

Nginx Ingress controller request size count

nginx_ingress_controller_request_size_sum

Nginx Ingress controller total request size

nginx_ingress_controller_requests

Total Nginx Ingress controller requests

nginx_ingress_controller_response_duration_seconds_bucket

Nginx Ingress controller response time distribution (seconds)

nginx_ingress_controller_response_duration_seconds_count

Nginx Ingress controller response time observation count

nginx_ingress_controller_response_duration_seconds_sum

Total Nginx Ingress controller response time (seconds)

nginx_ingress_controller_response_size_bucket

Nginx Ingress controller response size distribution

nginx_ingress_controller_response_size_count

Nginx Ingress controller response size count

nginx_ingress_controller_response_size_sum

Total Nginx Ingress controller response size

nginx_ingress_controller_ssl_certificate_info

Nginx Ingress controller SSL certificate information

nginx_ingress_controller_ssl_expire_time_seconds

Nginx Ingress controller SSL certificate expiration time (seconds)

nginx_ingress_controller_success

Nginx Ingress controller success count

up

Metric collection connectivity

Koordinator (Job names: kube-system/koordlet-metrics-podmonitor, koord-manager-metrics-service)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by the Alibaba Cloud Prometheus agent, per target.

aliyun_prometheus_agent_target_info

The target information for the Alibaba Cloud Prometheus agent.

koord_manager_recommender_recommendation_workload_target

The metric for recommended workload specifications from the resource profiling feature.

koordlet_container_resource_limits

The metric for container resource limits.

koordlet_container_resource_requests

The metric for container resource requests.

koordlet_node_priority_resource_reclaimable

The metric for reclaimable resources on a node, by priority.

koordlet_node_resource_allocatable

The metric for allocatable resources on a node.

slo_manager_recommender_recommendation_workload_target

The metric for recommended workload specifications from the resource profiling feature. (Deprecated)

up

The connectivity for metric scraping.

ACK dedicated etcd component (Job name: etcd)

Metric

Description

aliyun_prometheus_agent_append_duration_seconds

Duration of the append operation for the Alibaba Cloud Prometheus agent (seconds)

aliyun_prometheus_agent_job_discovery_status

Status of scrape job discovery for the Alibaba Cloud Prometheus agent

aliyun_prometheus_agent_scrape_custom_error

The number of errors from custom scrapes by the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

The total number of scrapes by target for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_target_info

Target information for an Alibaba Cloud Prometheus agent

cpu_utilization_core

CPU core utilization

etcd_cluster_version

The version of the etcd cluster.

etcd_debugging_auth_revision

etcd debug authentication revision

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket

Distribution of commit rebalance duration for the etcd debugging disk backend (seconds)

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_count

Count of commit rebalance duration observations for the etcd debugging disk backend.

etcd_debugging_disk_backend_commit_rebalance_duration_seconds_sum

Total commit rebalance duration for the etcd debug disk backend (seconds)

etcd_debugging_disk_backend_commit_spill_duration_seconds_bucket

The distribution of commit spill duration for the etcd debugging disk backend

etcd_debugging_disk_backend_commit_spill_duration_seconds_count

The total number of commit spills for the etcd debug disk backend.

etcd_debugging_disk_backend_commit_spill_duration_seconds_sum

Sum of the commit spill duration for the etcd debugging disk backend (seconds)

etcd_debugging_disk_backend_commit_write_duration_seconds_bucket

Distribution of commit write duration for the etcd debugging disk backend (seconds)

etcd_debugging_disk_backend_commit_write_duration_seconds_count

The total number of write commits to the etcd debug disk backend.

etcd_debugging_disk_backend_commit_write_duration_seconds_sum

The total duration of commit writes to the etcd debug disk backend, in seconds.

etcd_debugging_lease_granted_total

Total number of leases granted for etcd debugging

etcd_debugging_lease_renewed_total

The total number of etcd debugging lease renewals

etcd_debugging_lease_revoked_total

Total number of etcd debugging leases revoked.

etcd_debugging_lease_ttl_total_bucket

Distribution of lease TTLs for etcd debugging

etcd_debugging_lease_ttl_total_count

Total count of etcd debug lease TTLs

etcd_debugging_lease_ttl_total_sum

etcd lease TTL sum (seconds)

etcd_debugging_mvcc_compact_revision

etcd MVCC compaction revision for debugging

etcd_debugging_mvcc_current_revision

Current MVCC revision for etcd debugging

etcd_debugging_mvcc_db_compaction_keys_total

Total keys compacted in the etcd MVCC database for debugging

etcd_debugging_mvcc_db_compaction_last

Last compaction time of the etcd MVCC database for debugging.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket

The bucket for the pause duration in milliseconds during etcd MVCC database compaction for debugging.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count

The count of pause durations (in milliseconds) during MVCC database compaction for etcd debugging.

etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum

Sum of pause durations for etcd MVCC database compaction during debugging (milliseconds).

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket

Distribution of the total duration of MVCC database compaction for etcd debugging (in milliseconds)

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count

Count of total duration observations for etcd debugging MVCC database compaction.

etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sum

Sum of the total duration of etcd MVCC database compaction for debugging (milliseconds)

etcd_debugging_mvcc_db_total_size_in_bytes

Total size of the etcd debug MVCC database in bytes

etcd_debugging_mvcc_delete_total

Total MVCC delete operations for etcd debugging

etcd_debugging_mvcc_events_total

Total number of MVCC events for etcd debugging

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket

The bucket for the etcd debugging MVCC index compaction pause duration in milliseconds.

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_count

Count of etcd debug MVCC index compaction pauses.

etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sum

The sum of pause durations in milliseconds for etcd MVCC index compaction during debugging.

etcd_debugging_mvcc_keys_total

The total number of MVCC keys for etcd debugging.

etcd_debugging_mvcc_pending_events_total

Total number of pending MVCC events for etcd debugging

etcd_debugging_mvcc_put_total

Total number of MVCC put operations for debugging etcd

etcd_debugging_mvcc_range_total

Total etcd MVCC range queries

etcd_debugging_mvcc_slow_watcher_total

Total number of slow watchers for etcd debugging

etcd_debugging_mvcc_total_put_size_in_bytes

Total MVCC put size for etcd debugging (bytes)

etcd_debugging_mvcc_txn_total

Total Multi-Version Concurrency Control (MVCC) transactions for etcd debugging

etcd_debugging_mvcc_watch_stream_total

Total number of MVCC watch streams for etcd debugging

etcd_debugging_mvcc_watcher_total

Total number of MVCC watchers for etcd debugging

etcd_debugging_server_lease_expired_total

Total expired leases for the etcd debugging server.

etcd_debugging_snap_save_marshalling_duration_seconds_bucket

Distribution of marshalling durations when saving etcd debug snapshots

etcd_debugging_snap_save_marshalling_duration_seconds_count

The count of marshalling operations for saving an etcd debug snapshot. The duration is measured in seconds.

etcd_debugging_snap_save_marshalling_duration_seconds_sum

The total time in seconds spent marshalling debugging snapshots for saving.

etcd_debugging_snap_save_total_duration_seconds_bucket

Distribution of the total duration of saving etcd debugging snapshots (seconds)

etcd_debugging_snap_save_total_duration_seconds_count

Total count of etcd debug snapshot save operations (duration in seconds)

etcd_debugging_snap_save_total_duration_seconds_sum

The total time, in seconds, spent saving etcd debug snapshots.

etcd_debugging_store_expires_total

Total number of etcd debugging store expirations.

etcd_debugging_store_reads_total

Total debug store reads in etcd.

etcd_debugging_store_watch_requests_total

The total number of watch requests for the etcd debug store.

etcd_debugging_store_watchers

Number of etcd debugging store watchers

etcd_debugging_store_writes_total

Total etcd debug store writes

etcd_disk_backend_commit_duration_seconds_bucket

etcd disk backend commit duration bucket (seconds)

etcd_disk_backend_commit_duration_seconds_count

The total number of etcd disk backend commits.

etcd_disk_backend_commit_duration_seconds_sum

Total duration of etcd disk backend commits, in seconds.

etcd_disk_backend_defrag_duration_seconds_bucket

Distribution of etcd disk backend defragmentation duration (seconds)

etcd_disk_backend_defrag_duration_seconds_count

Count of etcd disk backend defragmentation operations

etcd_disk_backend_defrag_duration_seconds_sum

The sum of etcd disk backend defragmentation durations, in seconds.

etcd_disk_backend_snapshot_duration_seconds_bucket

Distribution of etcd disk backend snapshot duration (seconds)

etcd_disk_backend_snapshot_duration_seconds_count

The total count of timed etcd disk backend snapshots.

etcd_disk_backend_snapshot_duration_seconds_sum

Total duration of etcd disk backend snapshots in seconds.

etcd_disk_defrag_inflight

etcd disk defragmentation in progress

etcd_disk_wal_fsync_duration_seconds_bucket

Distribution of etcd disk WAL fsync duration (seconds)

etcd_disk_wal_fsync_duration_seconds_count

The total number of etcd disk WAL fsync operations.

etcd_disk_wal_fsync_duration_seconds_sum

Sum of the etcd disk WAL fsync duration in seconds.

etcd_disk_wal_write_bytes_total

Total bytes written to the etcd disk WAL

etcd_grpc_proxy_cache_hits_total

Total number of etcd gRPC proxy cache hits

etcd_grpc_proxy_cache_keys_total

The total number of etcd gRPC proxy cache keys.

etcd_grpc_proxy_cache_misses_total

Total etcd gRPC proxy cache misses

etcd_grpc_proxy_events_coalescing_total

Total number of events merged by the etcd gRPC proxy

etcd_grpc_proxy_watchers_coalescing_total

Total number of coalesced watchers in the etcd gRPC proxy.

etcd_mvcc_db_open_read_transactions

The number of open read transactions in the etcd MVCC database.

etcd_mvcc_db_total_size_in_bytes

Total size of the etcd MVCC database (bytes)

etcd_mvcc_db_total_size_in_use_in_bytes

The total size in use of the etcd MVCC database, in bytes.

etcd_mvcc_delete_total

Total etcd MVCC deletes

etcd_mvcc_hash_duration_seconds_bucket

Bucket for etcd MVCC hash duration in seconds.

etcd_mvcc_hash_duration_seconds_count

Count of etcd MVCC hash durations (seconds)

etcd_mvcc_hash_duration_seconds_sum

Total etcd MVCC hash duration in seconds

etcd_mvcc_hash_rev_duration_seconds_bucket

etcd MVCC hash revision duration distribution (seconds)

etcd_mvcc_hash_rev_duration_seconds_count

The count of etcd MVCC hash revision durations in seconds.

etcd_mvcc_hash_rev_duration_seconds_sum

Sum of etcd MVCC hash revision duration, in seconds

etcd_mvcc_put_total

The total number of etcd MVCC Put operations

etcd_mvcc_range_total

Total number of etcd MVCC range queries

etcd_mvcc_txn_total

Total etcd multiversion concurrency control transactions

etcd_network_active_peers

Number of active etcd network peers

etcd_network_client_grpc_received_bytes_total

Total number of bytes received from gRPC clients by the etcd server

etcd_network_client_grpc_sent_bytes_total

The total number of bytes sent to gRPC clients by the etcd server.

etcd_network_disconnected_peers_total

Total number of disconnected peers in the etcd network

etcd_network_peer_received_bytes_total

Total bytes received by the etcd network peer

etcd_network_peer_received_failures_total

Total number of failed receives from etcd network peers

etcd_network_peer_round_trip_time_seconds_bucket

etcd network peer round-trip time distribution (seconds)

etcd_network_peer_round_trip_time_seconds_count

Count of round trip times in seconds for etcd network peers

etcd_network_peer_round_trip_time_seconds_sum

Total round trip time in seconds for etcd network peers

etcd_network_peer_sent_bytes_total

Total bytes sent to etcd peers

etcd_network_peer_sent_failures_total

Total etcd network peer send failures

etcd_network_server_stream_failures_total

Total number of etcd network server stream failures

etcd_network_snapshot_receive_inflights_total

The number of concurrent requests to receive etcd network snapshots.

etcd_network_snapshot_receive_success

The number of etcd network snapshots received successfully.

etcd_network_snapshot_receive_total_duration_seconds_bucket

Distribution of the total duration of receiving etcd network snapshots (seconds).

etcd_network_snapshot_receive_total_duration_seconds_count

The total count of etcd network snapshot receive operations.

etcd_network_snapshot_receive_total_duration_seconds_sum

Total time spent receiving etcd network snapshots, in seconds.

etcd_network_snapshot_send_inflights_total

The number of concurrent requests for sending etcd network snapshots.

etcd_network_snapshot_send_success

The number of etcd network snapshots sent successfully.

etcd_network_snapshot_send_total_duration_seconds_bucket

Total duration distribution for sending etcd network snapshots (seconds)

etcd_network_snapshot_send_total_duration_seconds_count

Total number of etcd network snapshot send operations.

etcd_network_snapshot_send_total_duration_seconds_sum

Sum of the total duration for sending etcd network snapshots, in seconds.

etcd_server_apply_duration_seconds_bucket

etcd server apply duration distribution (seconds)

etcd_server_apply_duration_seconds_count

Count of apply operations for the etcd server

etcd_server_apply_duration_seconds_sum

The total time, in seconds, that the etcd server has spent applying requests.

etcd_server_client_requests_total

Total number of client requests to the etcd server

etcd_server_go_version

The Go version of the etcd server

etcd_server_has_leader

Whether the etcd server has a leader (1 if a leader exists, 0 otherwise).

etcd_server_health_failures

Number of etcd server health check failures

etcd_server_health_success

Number of successful etcd server health checks

etcd_server_heartbeat_send_failures_total

Total number of failed heartbeat sends from the etcd server

etcd_server_id

etcd server ID

etcd_server_is_leader

Whether the etcd server is the leader

etcd_server_is_learner

Whether the etcd server is a learner

etcd_server_leader_changes_seen_total

The total number of leader changes seen by the etcd server.

etcd_server_learner_promote_successes

The number of successful learner promotions in the etcd server.

etcd_server_proposals_applied_total

Total proposals applied on the etcd server

etcd_server_proposals_committed_total

Total number of proposals committed by the etcd server

etcd_server_proposals_failed_total

Total number of failed etcd server proposals

etcd_server_proposals_pending

Number of pending etcd server proposals

etcd_server_quota_backend_bytes

The backend storage quota for the etcd server in bytes.

etcd_server_read_indexes_failed_total

Total number of failed index reads on the etcd server.

etcd_server_slow_apply_total

Total slow applies on the etcd server

etcd_server_slow_read_indexes_total

The total number of slow read indexes for the etcd server.

etcd_server_snapshot_apply_in_progress_total

Whether the etcd server is applying an incoming snapshot (1 if in progress, 0 otherwise)

etcd_server_version

etcd server version

etcd_snap_db_fsync_duration_seconds_bucket

Distribution of fsync duration for the etcd snapshot database (seconds).

etcd_snap_db_fsync_duration_seconds_count

Total fsync count for the etcd snapshot database

etcd_snap_db_fsync_duration_seconds_sum

Total fsync duration for the etcd snapshot database, in seconds.

etcd_snap_db_save_total_duration_seconds_bucket

The bucket for the total duration, in seconds, to save the etcd snapshot database.

etcd_snap_db_save_total_duration_seconds_count

Count of etcd snapshot database save operations

etcd_snap_db_save_total_duration_seconds_sum

Sum of the total save duration for the etcd snapshot database (seconds)

etcd_snap_fsync_duration_seconds_bucket

Distribution of etcd snapshot fsync duration (seconds)

etcd_snap_fsync_duration_seconds_count

Count of etcd snapshot fsync operations

etcd_snap_fsync_duration_seconds_sum

etcd snapshot fsync total duration (seconds)

grpc_server_handled_total

Total gRPC server requests processed

grpc_server_msg_received_total

Total messages received by the gRPC server

grpc_server_msg_sent_total

Total gRPC server messages sent

grpc_server_started_total

Total number of RPCs started on the gRPC server

memory_utilization_byte

Memory utilization in bytes

os_fd_limit

Operating system file descriptor limit

os_fd_used

Number of operating system file descriptors in use

up

Connectivity for metric collection
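Many metrics in the table above come in groups of three with the suffixes _bucket, _count, and _sum, which together describe one histogram. The following is a minimal sketch of how such a group can be read, assuming the usual Prometheus convention that each bucket holds a cumulative count of observations at or below its upper bound. The function names and sample values are hypothetical. The quantile estimate uses linear interpolation within a bucket, which is similar in spirit to the PromQL histogram_quantile function but is not a reimplementation of it.

```python
import math


def mean_observation(sum_increase, count_increase):
    """Average observed value over a window: increase of *_sum / increase of *_count."""
    if count_increase <= 0:
        return math.nan
    return sum_increase / count_increase


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank,
    assuming the lowest bucket starts at 0.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total <= 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Target falls in the +Inf bucket (or an empty one):
                # report the highest bound seen so far.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative values only, shaped like etcd_disk_wal_fsync_duration_seconds_bucket.
wal_fsync_buckets = [(0.001, 40), (0.002, 80), (0.004, 95), (0.008, 100), (math.inf, 100)]
p99_seconds = estimate_quantile(0.99, wal_fsync_buckets)
avg_seconds = mean_observation(sum_increase=0.18, count_increase=100)
```

The mean needs only the _sum and _count series, whereas any quantile needs the _bucket series, so the two are useful for different questions about the same histogram.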

ACK Dedicated Scheduler (Job name: ack-scheduler)

Metric

Description

aggregator_discovery_aggregation_count_total

Total count of aggregator discovery aggregations.

aliyun_prometheus_agent_append_duration_seconds

Duration of append operations for the Alibaba Cloud Prometheus agent, in seconds.

aliyun_prometheus_agent_job_discovery_status

Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrape_custom_error

Number of custom scrape errors for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_scrapes_by_target_total

Total number of scrapes by target for the Alibaba Cloud Prometheus agent.

aliyun_prometheus_agent_target_info

Target information for the Alibaba Cloud Prometheus agent.

apiserver_audit_event_total

Total number of API server audit events.

apiserver_audit_requests_rejected_total

Total number of rejected API server audit requests.

apiserver_client_certificate_expiration_seconds_bucket

Distribution of remaining seconds until API server client certificate expiration.

apiserver_client_certificate_expiration_seconds_count

Count of remaining seconds until API server client certificate expiration.

apiserver_client_certificate_expiration_seconds_sum

Sum of remaining seconds until API server client certificate expiration.

apiserver_delegated_authn_request_duration_seconds_bucket

Distribution of API server delegated authentication request duration, in seconds.

apiserver_delegated_authn_request_duration_seconds_count

Count of API server delegated authentication request duration.

apiserver_delegated_authn_request_duration_seconds_sum

Sum of API server delegated authentication request duration.

apiserver_delegated_authn_request_total

Total number of API server delegated authentication requests.

apiserver_delegated_authz_request_duration_seconds_bucket

Distribution of API server delegated authorization request duration, in seconds.

apiserver_delegated_authz_request_duration_seconds_count

Count of API server delegated authorization request duration.

apiserver_delegated_authz_request_duration_seconds_sum

Sum of API server delegated authorization request duration, in seconds.

apiserver_delegated_authz_request_total

Total number of API server delegated authorization requests.

apiserver_encryption_config_controller_automatic_reload_failures_total

Total number of automatic reload failures for the API server encryption configuration controller.

apiserver_encryption_config_controller_automatic_reload_success_total

Total number of successful automatic reloads for the API server encryption configuration controller.

apiserver_envelope_encryption_dek_cache_fill_percent

Cache fill percentage for the API server envelope encryption Data Encryption Key (DEK).

apiserver_storage_data_key_generation_duration_seconds_bucket

Distribution of API server storage data key generation duration.

apiserver_storage_data_key_generation_duration_seconds_count

Count of API server storage data key generation duration.

apiserver_storage_data_key_generation_duration_seconds_sum

Sum of API server storage data key generation duration, in seconds.

apiserver_storage_data_key_generation_failures_total

Total number of API server storage data key generation failures.

apiserver_storage_envelope_transformation_cache_misses_total

Total number of cache misses for API server storage envelope transformation.

apiserver_webhooks_x509_insecure_sha1_total

Total number of API server webhook requests to servers whose X.509 certificates use insecure SHA-1 signatures.

apiserver_webhooks_x509_missing_san_total

Total count of API server webhooks with missing Subject Alternative Name (SAN) in X.509 certificates.

authenticated_user_requests

Number of authenticated user requests.

authentication_attempts

Number of authentication attempts.

authentication_duration_seconds_bucket

Distribution of authentication duration.

authentication_duration_seconds_count

Count of authentication duration.

authentication_duration_seconds_sum

Sum of authentication duration, in seconds.

authentication_token_cache_active_fetch_count

Count of active fetches from the authentication token cache.

authentication_token_cache_fetch_total

Total number of fetches from the authentication token cache.

authentication_token_cache_request_duration_seconds_bucket

Distribution of authentication token cache request duration.

authentication_token_cache_request_duration_seconds_count

Count of authentication token cache request duration.

authentication_token_cache_request_duration_seconds_sum

Sum of authentication token cache request duration, in seconds.

authentication_token_cache_request_total

Total number of authentication token cache requests.

authorization_attempts_total

Total number of authorization attempts.

authorization_duration_seconds_bucket

Distribution of authorization duration, in seconds.

authorization_duration_seconds_count

Count of authorization duration.

authorization_duration_seconds_sum

Sum of authorization duration.

cardinality_enforcement_unexpected_categorizations_total

Total number of unexpected categorizations from cardinality enforcement.

kubernetes_build_info

Kubernetes build information.

kubernetes_feature_enabled

Whether a Kubernetes feature is enabled.

leader_election_master_status

Status of the leader election master.

registered_metric_total

Total number of registered metrics.

registered_metrics_total

Total number of registered metrics.

rest_client_exec_plugin_certificate_rotation_age_bucket

Buckets for the age of rotated certificates for the REST client exec plugin.

rest_client_exec_plugin_certificate_rotation_age_count

Count of the age of rotated certificates for the REST client exec plugin.

rest_client_exec_plugin_certificate_rotation_age_sum

Sum of the age of rotated certificates for the REST client exec plugin.

rest_client_rate_limiter_duration_seconds_bucket

Distribution of REST client rate limiter duration.

rest_client_rate_limiter_duration_seconds_count

Count of REST client rate limiter duration, in seconds.

rest_client_rate_limiter_duration_seconds_sum

Sum of REST client rate limiter duration, in seconds.

rest_client_request_duration_seconds_bucket

Buckets for REST client request duration, in seconds.

rest_client_request_duration_seconds_count

Count of REST client request duration.

rest_client_request_duration_seconds_sum

Sum of REST client request duration, in seconds.

rest_client_request_retries_total

Total number of REST client request retries.

rest_client_request_size_bytes_bucket

Distribution of REST client request size, in bytes.

rest_client_request_size_bytes_count

Count of REST client request size, in bytes.

rest_client_request_size_bytes_sum

Sum of REST client request size, in bytes.

rest_client_requests_total

Total number of REST client requests.

rest_client_response_size_bytes_bucket

Buckets for REST client response size, in bytes.

rest_client_response_size_bytes_count

Count of REST client response size, in bytes.

rest_client_response_size_bytes_sum

Sum of REST client response size, in bytes.

rest_client_transport_cache_entries

Number of REST client transport cache entries.

rest_client_transport_create_calls_total

Total number of REST client transport creation calls.

scheduler_binding_duration_seconds_bucket

Buckets for scheduler binding duration, in seconds.

scheduler_binding_duration_seconds_count

Count of binding duration.

scheduler_binding_duration_seconds_sum

Sum of scheduler binding duration, in seconds.

scheduler_e2e_scheduling_duration_seconds_bucket

Distribution of scheduler end-to-end scheduling duration.

scheduler_e2e_scheduling_duration_seconds_count

Count of scheduler end-to-end scheduling duration.

scheduler_e2e_scheduling_duration_seconds_sum

Sum of scheduler end-to-end scheduling duration, in seconds.

scheduler_framework_extension_point_duration_seconds_bucket

Distribution of scheduler framework extension point duration.

scheduler_framework_extension_point_duration_seconds_count

Count of scheduler framework extension point duration.

scheduler_framework_extension_point_duration_seconds_sum

Sum of scheduler framework extension point duration.

scheduler_goroutines

Number of scheduler goroutines.

scheduler_pending_pods

Number of pending pods in the scheduler.

scheduler_plugin_evaluation_total

Total number of scheduler plugin evaluations.

scheduler_plugin_execution_duration_seconds_bucket

Distribution of scheduler plugin execution duration, in seconds.

scheduler_plugin_execution_duration_seconds_count

Count of scheduler plugin execution duration.

scheduler_plugin_execution_duration_seconds_sum

Sum of scheduler plugin execution duration, in seconds.

scheduler_pod_preemption_victims_bucket

Buckets for the number of pod preemption victims in the scheduler.

scheduler_pod_preemption_victims_count

Count of pod preemption victims in the scheduler.

scheduler_pod_preemption_victims_sum

Sum of pod preemption victims in the scheduler.

scheduler_pod_scheduling_attempts_bucket

Buckets for the number of pod scheduling attempts in the scheduler.

scheduler_pod_scheduling_attempts_count

Count of pod scheduling attempts in the scheduler.

scheduler_pod_scheduling_attempts_sum

Sum of pod scheduling attempts in the scheduler.

scheduler_pod_scheduling_duration_seconds_bucket

Buckets for pod scheduling duration in the scheduler, in seconds.

scheduler_pod_scheduling_duration_seconds_count

Count of pod scheduling duration in the scheduler.

scheduler_pod_scheduling_duration_seconds_sum

Sum of pod scheduling duration in the scheduler, in seconds.

scheduler_pod_scheduling_sli_duration_seconds_bucket

Buckets for pod scheduling Service Level Indicator (SLI) duration.

scheduler_pod_scheduling_sli_duration_seconds_count

Count of pod scheduling Service Level Indicator (SLI) duration in the scheduler.

scheduler_pod_scheduling_sli_duration_seconds_sum

Sum of pod scheduling Service Level Indicator (SLI) duration.

scheduler_preemption_attempts_total

Total number of preemption attempts in the scheduler.

scheduler_preemption_victims_bucket

Buckets for the number of preemption victims in the scheduler.

scheduler_preemption_victims_count

Count of preemption victims in the scheduler.

scheduler_preemption_victims_sum

Sum of preemption victims in the scheduler.

scheduler_queue_incoming_pods_total

Total number of incoming pods in the scheduler queue.

scheduler_schedule_attempts_total

Total number of scheduling attempts in the scheduler.

scheduler_scheduler_cache_size

Size of the scheduler cache.

scheduler_scheduler_goroutines

Number of scheduler goroutines.

scheduler_scheduling_algorithm_duration_seconds_bucket

Distribution of scheduler scheduling algorithm duration, in seconds.

scheduler_scheduling_algorithm_duration_seconds_count

Count of scheduler scheduling algorithm duration, in seconds.

scheduler_scheduling_algorithm_duration_seconds_sum

Sum of scheduler scheduling algorithm duration, in seconds.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_bucket

Buckets for scheduler scheduling algorithm predicate evaluation duration, in seconds.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_count

Count of scheduling algorithm predicate evaluation duration, in seconds.

scheduler_scheduling_algorithm_predicate_evaluation_seconds_sum

Sum of scheduling algorithm predicate evaluation duration, in seconds.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_bucket

Buckets for scheduling algorithm preemption evaluation duration, in seconds.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_count

Count of scheduling algorithm preemption evaluation duration, in seconds.

scheduler_scheduling_algorithm_preemption_evaluation_seconds_sum

Sum of scheduling algorithm preemption evaluation duration, in seconds.

scheduler_scheduling_algorithm_priority_evaluation_seconds_bucket

Buckets for scheduler scheduling algorithm priority evaluation duration, in seconds.

scheduler_scheduling_algorithm_priority_evaluation_seconds_count

Count of scheduling algorithm priority evaluation duration, in seconds.

scheduler_scheduling_algorithm_priority_evaluation_seconds_sum

Sum of scheduling algorithm priority evaluation duration, in seconds.

scheduler_scheduling_attempt_duration_seconds_bucket

Distribution of scheduler scheduling attempt duration.

scheduler_scheduling_attempt_duration_seconds_count

Count of scheduler scheduling attempt duration.

scheduler_scheduling_attempt_duration_seconds_sum

Sum of scheduler scheduling attempt duration, in seconds.

scheduler_scheduling_duration_seconds

Scheduler scheduling duration, in seconds.

scheduler_scheduling_duration_seconds_count

Count of scheduling duration.

scheduler_scheduling_duration_seconds_sum

Sum of scheduling duration.

scheduler_total_preemption_attempts

Total number of preemption attempts by the scheduler.

scheduler_unschedulable_pods

Number of unschedulable pods in the scheduler.

scheduler_volume_scheduling_duration_seconds_bucket

Buckets for volume scheduling duration.

scheduler_volume_scheduling_duration_seconds_count

Count of scheduler volume scheduling duration, in seconds.

scheduler_volume_scheduling_duration_seconds_sum

Sum of scheduler volume scheduling duration, in seconds.

scheduler_volume_scheduling_stage_error_total

Total number of errors in the scheduler volume scheduling stage.

scrape_duration_seconds

Scrape duration, in seconds.

scrape_samples_post_metric_relabeling

Number of scraped samples after metric relabeling.

scrape_samples_scraped

Number of scraped samples.

scrape_series_added

Number of new series added from scrapes.

up

Connectivity for metric scraping.

workqueue_adds_total

Total number of additions to the work queue.

workqueue_depth

Depth of the work queue.

workqueue_longest_running_processor_seconds

Longest running processor time in the work queue, in seconds.

workqueue_queue_duration_seconds_bucket

Buckets for the duration items stay in the work queue, in seconds.

workqueue_queue_duration_seconds_count

Count of the duration items stay in the work queue, in seconds.

workqueue_queue_duration_seconds_sum

Sum of the duration items stay in the work queue, in seconds.

workqueue_retries_total

Total number of retries in the work queue.

workqueue_unfinished_work_seconds

Seconds of unfinished work in the work queue.

workqueue_work_duration_seconds_bucket

Distribution of work duration in the work queue.

workqueue_work_duration_seconds_count

Count of work duration in the work queue.

workqueue_work_duration_seconds_sum

Sum of work duration in the work queue, in seconds.
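Because metrics outside the tables in this topic are treated as custom metrics, it can help to check which series a scrape job would report as basic. The following is a minimal sketch, assuming that a metric counts as basic only when both the job name and the metric name appear in one of the tables. The allowlist below is an abbreviated, hand-copied subset of two tables, and the function names are hypothetical.

```python
# Abbreviated allowlist copied from the tables in this topic (not exhaustive).
BASIC_METRICS = {
    "etcd": {
        "etcd_server_has_leader",
        "etcd_disk_wal_fsync_duration_seconds_bucket",
        "up",
    },
    "ack-scheduler": {
        "scheduler_pending_pods",
        "scheduler_schedule_attempts_total",
        "up",
    },
}


def classify(job, metric):
    """Return "basic" if the (job, metric) pair is in the allowlist, else "custom"."""
    return "basic" if metric in BASIC_METRICS.get(job, set()) else "custom"


def split_by_type(series):
    """Group (job, metric) pairs into basic and custom lists, preserving order."""
    result = {"basic": [], "custom": []}
    for job, metric in series:
        result[classify(job, metric)].append((job, metric))
    return result


# Illustrative input: the last entry is a made-up application metric.
grouped = split_by_type([
    ("etcd", "etcd_server_has_leader"),
    ("ack-scheduler", "scheduler_pending_pods"),
    ("ack-scheduler", "my_app_requests_total"),
])
```

A complete check would load the full metric list for every job described in this topic instead of the short subset used here.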

References