This topic describes the basic metrics for container clusters that are supported by Managed Service for Prometheus.
Billing for Managed Service for Prometheus is based on the data write volume or the number of reported data points. Metrics are divided into two types:
Basic metrics: Managed Service for Prometheus provides free data reporting and writing for basic metrics collected from Alibaba Cloud container services, such as Container Service for Kubernetes (ACK), ACS, ASK, ACK One, and ACK Edge. This benefit does not apply to other types of container clusters.
Custom metrics: Any metric that is not a basic metric is a custom metric. Billing for custom metrics started on January 6, 2020.
Starting from 00:00:00 (UTC+8) on November 12, 2024, Managed Service for Prometheus will adjust the scope of basic metrics collected from Alibaba Cloud container service clusters. The adjusted metric scope is described below.
Note that the scope of basic metrics collected by default for container clusters is limited to the metrics described in this topic.
Container cluster metrics outside this scope are custom metrics and are subject to charges. For more information about billing, see Billing of Prometheus instances.
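The basic/custom split described above amounts to a membership check against the metric names listed in this topic. The following is a minimal sketch under stated assumptions, not the service's billing logic: `BASIC_METRICS` is a small hypothetical subset of the names in the tables below, the function names are illustrative, and the check ignores job names and cluster types.

```python
# Hypothetical subset of the basic metric names listed in this topic.
# A real check would need the full list (and, likely, the job name).
BASIC_METRICS = frozenset({
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "DCGM_FI_DEV_GPU_UTIL",
    "apiserver_request_total",
})


def classify_metric(name: str) -> str:
    """Return "basic" if the name is in the basic scope, otherwise "custom"."""
    return "basic" if name in BASIC_METRICS else "custom"


def split_by_billing(names):
    """Group metric names into basic (free) and custom (billed) lists."""
    result = {"basic": [], "custom": []}
    for name in names:
        result[classify_metric(name)].append(name)
    return result
```

For example, `split_by_billing(["apiserver_request_total", "my_app_requests_total"])` places the second, illustrative name in the `custom` list because it is not in the basic scope.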
cAdvisor (Job name: _arms/kubelet/cadvisor)
Metric | Description |
container_cpu_usage_seconds_total | Total container CPU usage time. |
container_fs_usage_bytes | Container file system usage in bytes. |
container_memory_cache | Container memory cache. |
container_memory_usage_bytes | Container memory usage in bytes. |
container_memory_working_set_bytes | Container memory working set in bytes. |
container_network_receive_bytes_total | Total bytes received by the container network. |
container_network_transmit_bytes_total | Total bytes transmitted by the container network. |
container_scrape_error | Container metric scrape error. |
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | The proportion of computing power allocated to a container on a GPU card relative to the total computing power of that GPU. The value ranges from 0 to 1. For exclusive GPUs or shared GPUs that only request GPU memory, this metric is 0, which indicates no limit on computing power. For example, if a GPU card has 100 units of computing power and 30 units are allocated to a container, the allocated computing power ratio for that container is 30/100 = 0.3. |
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | The GPU memory allocated to the container. |
DCGM_CUSTOM_DEV_FB_ALLOCATED | The proportion of allocated GPU memory to the total GPU memory. The value ranges from 0 to 1. |
DCGM_CUSTOM_DEV_FB_TOTAL | The total GPU memory of the GPU card. |
DCGM_CUSTOM_DEV_HEALTH | GPU health status. |
DCGM_CUSTOM_PROCESS_DECODE_UTIL | The decoder utilization of the GPU thread. |
DCGM_CUSTOM_PROCESS_ENCODE_UTIL | The encoder utilization of the GPU thread. |
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | The memory copy utilization of the GPU thread. |
DCGM_CUSTOM_PROCESS_MEM_USED | The GPU memory currently used by the GPU thread. |
DCGM_CUSTOM_PROCESS_SM_UTIL | The SM utilization of the GPU thread. |
DCGM_CUSTOM_PROF_MEM_BANDWIDTH_USED | GPU memory bandwidth usage. |
DCGM_CUSTOM_PROF_TENS_TFPS_USED | The usage of the GPU tensor core. |
DCGM_FI_DEV_DEC_UTIL | Decoder utilization. |
DCGM_FI_DEV_ENC_UTIL | Encoder utilization. |
DCGM_FI_DEV_FB_FREE | The amount of available framebuffer memory. |
DCGM_FI_DEV_FB_USED | The amount of used framebuffer memory. This value corresponds to the used value of Memory-Usage in the nvidia-smi command. |
DCGM_FI_DEV_GPU_TEMP | GPU temperature. |
DCGM_FI_DEV_GPU_UTIL | GPU utilization. This is the percentage of time one or more kernel functions are active on the GPU over a period, such as 1s or 1/6s, depending on the GPU product. This metric only shows that a GPU resource is in use by a kernel function, but does not show the specific usage. |
DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency. |
DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization. For example, for an NVIDIA V100 GPU, the maximum memory bandwidth is 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%. |
DCGM_FI_DEV_POWER_USAGE | Power usage. |
DCGM_FI_DEV_SM_CLOCK | SM clock frequency. |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | The energy consumed since the driver was loaded. |
DCGM_FI_DEV_XID_ERRORS | The last XID error number that occurred within a period of time. |
DCGM_FI_PROF_DRAM_ACTIVE | Memory bandwidth utilization. The fraction of cycles where data is sent to or received from the device memory. This value is an average over the time interval, not an instantaneous value. A higher value indicates higher utilization of the device memory. A value of 1 (100%) means that a DRAM instruction is executed in every cycle within the time interval. In practice, a peak of about 0.8 (80%) is the maximum achievable value. A value of 0.2 (20%) means that 20% of the cycles are used to read from or write to the device memory within the time interval. |
DCGM_FI_PROF_NVLINK_RX_BYTES | The rate of data received over NVLink, excluding protocol headers. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
DCGM_FI_PROF_NVLINK_TX_BYTES | Total bytes transmitted over NVLink (send direction). |
DCGM_FI_PROF_PCIE_RX_BYTES | The rate of data received over the PCIe bus, including protocol headers and data payloads. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per channel. |
DCGM_FI_PROF_PCIE_TX_BYTES | The rate of data transmitted over the PCIe bus, including protocol headers and data payloads. This value is an average over the time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per channel. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The fraction of cycles where the Tensor (HMMA/IMMA) Pipe is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of Tensor Cores. A value of 1 (100%) means that a Tensor instruction is issued every other instruction cycle, because one instruction completes in two cycles. A value of 0.2 (20%) could mean any of the following: 20% of the SMs' Tensor Cores run at 100% utilization throughout the interval; 100% of the SMs' Tensor Cores run at 20% utilization throughout the interval; 100% of the SMs' Tensor Cores run at 100% utilization for 1/5 of the interval; or another combination. |
DCGM_FI_PROF_SM_ACTIVE | The fraction of time that at least one warp is active on a Streaming Multiprocessor (SM) within a time interval. This value is the average for all SMs and is not sensitive to the number of threads per block. A warp is active when it is scheduled and allocated resources. It can be in a computing or non-computing state, such as waiting for a memory request. A value less than 0.5 indicates inefficient GPU utilization, and a value greater than 0.8 is necessary for effective GPU utilization. Assume a GPU has N SMs. If a kernel function runs on all SMs using N thread blocks throughout the interval, the value is 1 (100%). If a kernel function runs N/5 thread blocks throughout the interval, the value is 0.2. If a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2. |
machine_cpu_cores | Number of machine CPU cores. |
node_exporter_build_info | Node exporter build information. |
nvidia_gpu_duty_cycle | NVIDIA GPU duty cycle percentage. |
nvidia_gpu_memory_total_bytes | Total NVIDIA GPU memory in bytes. |
nvidia_gpu_memory_used_bytes | Amount of used NVIDIA GPU memory. |
nvidia_gpu_num_devices | Number of NVIDIA GPU devices. |
nvidia_gpu_power_usage_milliwatts | NVIDIA GPU power consumption in milliwatts. |
nvidia_gpu_temperature_celsius | NVIDIA GPU temperature in Celsius. |
rdma_service_monitor_local_ack_timeout_err | Number of RDMA network timeout errors. |
rdma_service_monitor_out_of_seq | Number of out-of-sequence RDMA network datagrams. |
rdma_service_monitor_packet_seq_err | Number of out-of-sequence RDMA network packet sending errors. |
rdma_service_monitor_rx_bytes | RDMA network receive throughput. |
rdma_service_monitor_rx_packets | Number of received RDMA network packets. |
rdma_service_monitor_tx_bytes | RDMA network send throughput. |
rdma_service_monitor_tx_packets | Number of sent RDMA network packets. |
up | Connectivity of metric scraping. |
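Several descriptions in the table above include worked ratios (30/100 units of computing power, 450 GB/sec out of 900 GB/sec) and rough thresholds for DCGM_FI_PROF_SM_ACTIVE. The sketch below reproduces that arithmetic. The function names and the band labels are assumptions made for illustration, and the "borderline" band for values between 0.5 and 0.8 is not defined by the table.

```python
def ratio(used: float, total: float) -> float:
    """Return used/total as a fraction in the style of the table's examples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def sm_active_band(value: float) -> str:
    """Map a DCGM_FI_PROF_SM_ACTIVE value to a rough band.

    The 0.5 and 0.8 thresholds come from the table; the labels are
    hypothetical, and "borderline" is an assumption for the gap between them.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    if value < 0.5:
        return "inefficient"
    if value > 0.8:
        return "above 0.8 guideline"
    return "borderline"


# Allocated computing power: 30 of 100 units.
print(ratio(30, 100))    # 0.3
# Memory bandwidth utilization: 450 GB/sec of a 900 GB/sec maximum.
print(ratio(450, 900))   # 0.5
```

A value such as 0.2 from the SM_ACTIVE examples falls into the "inefficient" band under these thresholds.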
ACK ControlPlane APIServer (Includes ACK Pro control plane components such as APIServer, etcd, scheduler, KCM, and CCM. ACK Dedicated clusters include only APIServer) (Job name: apiserver)
Metric | Description |
aggregator_discovery_aggregation_count_total | Total count of aggregations from aggregator discovery |
aggregator_openapi_v2_regeneration_count | Aggregator OpenAPI V2 regeneration count |
aggregator_openapi_v2_regeneration_duration | Aggregator OpenAPI V2 regeneration duration |
aggregator_unavailable_apiservice | Unavailable aggregator APIService |
aggregator_unavailable_apiservice_count | The number of unavailable APIServices in the aggregator. |
aggregator_unavailable_apiservice_total | Total number of unavailable API services in the aggregator |
aliyun_prometheus_agent_append_duration_seconds | Alibaba Cloud Prometheus Agent append duration (seconds) |
aliyun_prometheus_agent_job_discovery_status | Alibaba Cloud Prometheus Agent job discovery status |
aliyun_prometheus_agent_scrapes_by_target_total | Total scrapes by target for the Alibaba Cloud Prometheus Agent |
aliyun_prometheus_agent_target_info | Alibaba Cloud Prometheus Agent target information |
apiextensions_apiserver_validation_ratcheting_seconds_bucket | APIServer validation ratcheting seconds bucket |
apiextensions_apiserver_validation_ratcheting_seconds_count | Count of APIServer validation ratcheting seconds |
apiextensions_apiserver_validation_ratcheting_seconds_sum | Sum of APIServer validation ratcheting seconds |
apiextensions_openapi_v2_regeneration_count | Apiextensions OpenAPI V2 regeneration count |
apiextensions_openapi_v3_regeneration_count | Apiextensions OpenAPI V3 regeneration count |
apiserver_accepted_listall_requests_total | The total number of listall requests accepted by the APIServer. |
apiserver_admission_controller_admission_duration_seconds_bucket | The bucket for the APIServer admission controller admission duration, in seconds. |
apiserver_admission_controller_admission_duration_seconds_count | The number of admission requests processed by the APIServer admission controller. |
apiserver_admission_controller_admission_duration_seconds_sum | Total admission duration for the APIServer admission controller, in seconds |
apiserver_admission_step_admission_duration_seconds_bucket | The histogram bucket for the duration of an APIServer admission step in seconds. |
apiserver_admission_step_admission_duration_seconds_count | Count of API server admission step durations in seconds. |
apiserver_admission_step_admission_duration_seconds_sum | Total duration of API server admission steps in seconds |
apiserver_admission_step_admission_duration_seconds_summary | Summary of the APIServer admission step duration in seconds. |
apiserver_admission_step_admission_duration_seconds_summary_count | Summary count of the admission duration of an APIServer admission step in seconds. |
apiserver_admission_step_admission_duration_seconds_summary_sum | The sum of the summary of the API server admission step duration, in seconds. |
apiserver_admission_webhook_admission_duration_seconds_bucket | APIServer admission webhook admission duration seconds bucket |
apiserver_admission_webhook_admission_duration_seconds_count | The count of APIServer admission webhook durations in seconds. |
apiserver_admission_webhook_admission_duration_seconds_sum | Sum of the admission duration of API server admission webhooks, in seconds. |
apiserver_admission_webhook_fail_open_count | API server admission webhook fail open count |
apiserver_admission_webhook_rejection_count | The number of rejections from the API server admission webhook. |
apiserver_admission_webhook_request_total | Total number of API server admission webhook requests |
apiserver_audit_error_total | Total number of API Server audit errors |
apiserver_audit_event_total | Total APIServer audit events |
apiserver_audit_level_total | Total number of APIServer audit events, counted by audit policy level |
apiserver_audit_requests_rejected_total | Total number of rejected APIServer audit requests. |
apiserver_authorization_decisions_total | Total number of API server authorization decisions |
apiserver_cache_list_fetched_objects_total | The total number of objects fetched from the APIServer cache list. |
apiserver_cache_list_returned_objects_total | Total number of objects returned by the APIServer cache list |
apiserver_cache_list_total | Total number of APIServer cache list operations |
apiserver_cacher_received_events | Events received by the APIServer cache |
apiserver_cacher_sended_events_latency_milliseconds_bucket | The distribution of latency in milliseconds for events sent by the APIServer cacher. |
apiserver_cacher_sended_events_latency_milliseconds_count | The count of latency measurements in milliseconds for events sent by the APIServer cacher. |
apiserver_cacher_sended_events_latency_milliseconds_sum | The total latency in milliseconds for events sent by the APIServer cacher. |
apiserver_cacher_watcher_channel_length | APIServer cacher watcher channel length |
apiserver_cel_compilation_duration_seconds_bucket | Distribution of APIServer CEL compilation durations in seconds |
apiserver_cel_compilation_duration_seconds_count | Counter of API server CEL compilations |
apiserver_cel_compilation_duration_seconds_sum | Total APIServer CEL compilation duration (seconds) |
apiserver_cel_evaluation_duration_seconds_bucket | Distribution of APIServer CEL evaluation durations in seconds. |
apiserver_cel_evaluation_duration_seconds_count | The number of API server CEL evaluations. |
apiserver_cel_evaluation_duration_seconds_sum | Total duration of APIServer CEL evaluation in seconds |
apiserver_client_certificate_expiration_seconds_bucket | Distribution of seconds remaining before the API server client certificate expires. |
apiserver_client_certificate_expiration_seconds_count | The number of observations of the seconds remaining before the APIServer client certificate expires. |
apiserver_client_certificate_expiration_seconds_sum | The total number of seconds remaining before the APIServer client certificate expires. |
apiserver_clusterip_repair_ip_errors_total | Total ClusterIP errors repaired by the API server |
apiserver_clusterip_repair_reconcile_errors_total | The total number of reconciliation errors for ClusterIP repairs by the APIServer. |
apiserver_conversion_webhook_duration_seconds_bucket | The distribution of API server conversion webhook durations in seconds. |
apiserver_conversion_webhook_duration_seconds_count | The number of APIServer conversion webhook calls |
apiserver_conversion_webhook_duration_seconds_sum | Total duration of API server conversion webhooks in seconds |
apiserver_conversion_webhook_request_total | Total number of API server conversion webhook requests |
apiserver_crd_conversion_webhook_duration_seconds_bucket | The distribution of API Server CRD conversion webhook durations in seconds. |
apiserver_crd_conversion_webhook_duration_seconds_count | Count of calls to the APIServer CRD conversion webhook |
apiserver_crd_conversion_webhook_duration_seconds_sum | Total duration of APIServer CRD conversion webhooks in seconds. |
apiserver_crd_webhook_conversion_duration_seconds_bucket | Distribution of APIServer CRD webhook conversion duration in seconds. |
apiserver_crd_webhook_conversion_duration_seconds_count | The total number of APIServer CRD webhook conversions. |
apiserver_crd_webhook_conversion_duration_seconds_sum | Total duration of APIServer CRD webhook conversions in seconds. |
apiserver_created_watchers | Number of watchers created by the API server |
apiserver_current_inflight_requests | The number of requests the APIServer is currently processing. |
apiserver_current_inqueue_requests | The current number of requests in the API server queue. |
apiserver_dropped_requests_total | The total number of requests dropped by the APIServer. |
apiserver_encryption_config_controller_automatic_reload_failures_total | Number of failed automatic reloads for the APIServer encryption configuration controller |
apiserver_encryption_config_controller_automatic_reload_success_total | Number of successful automatic reloads for the APIServer encryption configuration controller |
apiserver_envelope_encryption_dek_cache_fill_percent | APIServer envelope encryption DEK cache fill percentage |
apiserver_error_watchers | Number of APIServer watchers in an error state |
apiserver_flowcontrol_current_executing_requests | Number of requests currently being executed by the APIServer throttle |
apiserver_flowcontrol_current_executing_seats | Number of seats currently used by the APIServer throttle |
apiserver_flowcontrol_current_inqueue_requests | Number of requests in the APIServer throttle queue |
apiserver_flowcontrol_current_inqueue_seats | Number of seats in the APIServer throttle queue |
apiserver_flowcontrol_current_limit_seats | Current seat limit for the API server throttle |
apiserver_flowcontrol_current_r | Current R value of the APIServer throttle |
apiserver_flowcontrol_demand_seats_average | Average value of requested seats for APIServer throttling |
apiserver_flowcontrol_demand_seats_bucket | Seat distribution for throttled API server requests |
apiserver_flowcontrol_demand_seats_count | APIServer throttle request seat count |
apiserver_flowcontrol_demand_seats_high_watermark | APIServer throttling request seats high-water mark |
apiserver_flowcontrol_demand_seats_smoothed | Smoothing value for APIServer throttle request seats |
apiserver_flowcontrol_demand_seats_stdev | Standard deviation of request seats for APIServer throttling |
apiserver_flowcontrol_demand_seats_sum | Total requested seats for APIServer throttling |
apiserver_flowcontrol_dispatch_r | APIServer throttle scheduling R value |
apiserver_flowcontrol_dispatched_requests_total | Total number of requests scheduled by APIServer throttling |
apiserver_flowcontrol_latest_s | Recent S value limit for APIServer throttling |
apiserver_flowcontrol_lower_limit_seats | Minimum seats for APIServer throttling |
apiserver_flowcontrol_next_discounted_s_bounds | Next discounted S-value threshold for the APIServer throttle |
apiserver_flowcontrol_next_s_bounds | Next S value threshold for APIServer throttling |
apiserver_flowcontrol_nominal_limit_seats | Nominal seat limit for APIServer throttling |
apiserver_flowcontrol_priority_level_request_count_samples_bucket | Sample distribution of APIServer requests by throttling priority level |
apiserver_flowcontrol_priority_level_request_count_samples_count | Sample count of APIServer requests per throttling priority level |
apiserver_flowcontrol_priority_level_request_count_samples_sum | Sum of sampled request counts for the APIServer throttling priority level |
apiserver_flowcontrol_priority_level_request_count_watermarks_bucket | Distribution of request count watermarks across APIServer flow control priority levels |
apiserver_flowcontrol_priority_level_request_count_watermarks_count | Count of request count watermark observations for APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_request_count_watermarks_sum | Sum of request watermarks for APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_request_utilization_bucket | Distribution of APIServer request utilization by flow control priority level |
apiserver_flowcontrol_priority_level_request_utilization_count | APIServer throttle priority level request utilization count |
apiserver_flowcontrol_priority_level_request_utilization_sum | Total request utilization across APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_seat_count_samples_bucket | Sample distribution of seats across APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_seat_count_samples_count | APIServer throttling priority level seats sample count |
apiserver_flowcontrol_priority_level_seat_count_samples_sum | Sum of seat count samples for the APIServer throttle priority level |
apiserver_flowcontrol_priority_level_seat_count_watermarks_bucket | Distribution of seat watermarks for API server priority levels |
apiserver_flowcontrol_priority_level_seat_count_watermarks_count | Count of seat count watermark observations for APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_seat_count_watermarks_sum | Total seats at the watermark for the APIServer throttling priority level |
apiserver_flowcontrol_priority_level_seat_utilization_bucket | Distribution of seat utilization across APIServer throttling priority levels |
apiserver_flowcontrol_priority_level_seat_utilization_count | APIServer flow control priority level seat utilization count |
apiserver_flowcontrol_priority_level_seat_utilization_sum | Total seat utilization across API server throttling priority levels |
apiserver_flowcontrol_read_vs_write_current_requests_bucket | Current request count in the APIServer read/write throttle bucket |
apiserver_flowcontrol_read_vs_write_current_requests_count | Current read/write request count for APIServer throttling |
apiserver_flowcontrol_read_vs_write_current_requests_sum | Sum of current read and write requests throttled by the APIServer |
apiserver_flowcontrol_read_vs_write_request_count_samples_bucket | Sample bucket for the read/write request count of the APIServer throttle. |
apiserver_flowcontrol_read_vs_write_request_count_samples_count | Number of samples for the APIServer throttled read/write request counter |
apiserver_flowcontrol_read_vs_write_request_count_samples_sum | Total count of throttled APIServer read/write requests |
apiserver_flowcontrol_read_vs_write_request_count_watermarks_bucket | APIServer throttling read/write request count watermark bucket |
apiserver_flowcontrol_read_vs_write_request_count_watermarks_count | APIServer throttled read/write request count watermark |
apiserver_flowcontrol_read_vs_write_request_count_watermarks_sum | Total count watermark for APIServer throttled read/write requests |
apiserver_flowcontrol_rejected_requests_total | Total requests rejected by APIServer throttling |
apiserver_flowcontrol_request_concurrency_in_use | APIServer throttled concurrent requests |
apiserver_flowcontrol_request_concurrency_limit | Concurrency limit for APIServer request throttling |
apiserver_flowcontrol_request_dispatch_no_accommodation_total | Total number of dispatch attempts that APIServer request throttling could not accommodate |
apiserver_flowcontrol_request_execution_seconds_bucket | APIServer throttled request execution time in seconds (buckets) |
apiserver_flowcontrol_request_execution_seconds_count | Count of execution time observations for throttled APIServer requests |
apiserver_flowcontrol_request_execution_seconds_sum | Sum of execution seconds for throttled APIServer requests |
apiserver_flowcontrol_request_queue_length_after_enqueue_bucket | Post-enqueue length buckets of the APIServer request throttling queue |
apiserver_flowcontrol_request_queue_length_after_enqueue_count | Count of requests in the APIServer throttling queue |
apiserver_flowcontrol_request_queue_length_after_enqueue_sum | Total enqueued requests in APIServer throttling queues |
apiserver_flowcontrol_request_wait_duration_seconds_bucket | APIServer request throttling wait time bucket (seconds) |
apiserver_flowcontrol_request_wait_duration_seconds_count | Count of wait time observations for throttled APIServer requests |
apiserver_flowcontrol_request_wait_duration_seconds_sum | Total wait time in seconds for throttled APIServer requests |
apiserver_flowcontrol_seat_fair_frac | Fair allocation fraction of APIServer seats from the previous borrowing adjustment period |
apiserver_flowcontrol_target_seats | Target seat count for API server throttling |
apiserver_flowcontrol_upper_limit_seats | Maximum number of seats for APIServer throttling |
apiserver_flowcontrol_watch_count_samples_bucket | APIServer throttle watch count sample bucket |
apiserver_flowcontrol_watch_count_samples_count | APIServer throttle watch count sample count |
apiserver_flowcontrol_watch_count_samples_sum | Sum of APIServer throttle watch count samples |
apiserver_flowcontrol_work_estimated_seats_bucket | APIServer flow control's bucket for estimated work seats |
apiserver_flowcontrol_work_estimated_seats_count | APIServer flow control estimated seat count |
apiserver_flowcontrol_work_estimated_seats_sum | Total estimated seats for APIServer throttling work |
apiserver_init_events_total | Total APIServer initialization events |
apiserver_kube_aggregator_x509_insecure_sha1_total | Number of requests using insecure SHA1 signatures |
apiserver_kube_aggregator_x509_missing_san_total | APIServer kube-aggregator: Total missing x509 SANs |
apiserver_longrunning_gauge | APIServer long-running gauge |
apiserver_longrunning_requests | Long-running APIServer requests |
apiserver_nodeport_repair_reconcile_errors_total | Total reconciliation errors for APIServer node port repairs |
apiserver_realtime_watchers | Number of real-time APIServer watchers |
apiserver_registered_watchers | Number of watchers registered in the APIServer |
apiserver_request_aborts_total | Total aborted APIServer requests |
apiserver_request_body_size_bytes_bucket | APIServer request body size in bytes bucket |
apiserver_request_body_size_bytes_count | Count of APIServer request body size observations |
apiserver_request_body_size_bytes_sum | Total APIServer request body size in bytes |
apiserver_request_count | Number of API server requests |
apiserver_request_duration_seconds_bucket | Buckets for APIServer request processing time (in seconds) |
apiserver_request_duration_seconds_count | Count of APIServer request duration in seconds |
apiserver_request_duration_seconds_sum | Total APIServer request duration in seconds |
apiserver_request_filter_duration_seconds_bucket | APIServer request filter duration bucket (seconds) |
apiserver_request_filter_duration_seconds_count | Count of APIServer request filter durations in seconds. |
apiserver_request_filter_duration_seconds_sum | Total duration of APIServer request filters in seconds |
apiserver_request_latencies_summary | APIServer request latency distribution summary |
apiserver_request_no_resourceversion_list_total | Total LIST requests that do not specify a resourceVersion |
apiserver_request_post_timeout_total | Total API request handler events recorded after the request timed out |
apiserver_request_sli_duration_seconds_bucket | API request Service Level Indicator (SLI) duration seconds bucket |
apiserver_request_sli_duration_seconds_count | Count of API request SLI duration observations |
apiserver_request_sli_duration_seconds_sum | Total API request SLI duration in seconds |
apiserver_request_slo_duration_seconds_bucket | API request SLO duration bucket (seconds) |
apiserver_request_slo_duration_seconds_count | API request SLO duration seconds count |
apiserver_request_slo_duration_seconds_sum | Total API request SLO duration in seconds |
apiserver_request_terminations_total | Total terminated API requests |
apiserver_request_timestamp_comparison_time_bucket | Distribution buckets for API request timestamp differences |
apiserver_request_timestamp_comparison_time_count | API request timestamp comparison sample count |
apiserver_request_timestamp_comparison_time_sum | Total time for API request timestamp comparison |
apiserver_request_total | Total API requests |
apiserver_requested_deprecated_apis | Number of requests to the API server for deprecated APIs |
apiserver_response_sizes_bucket | API response size distribution buckets |
apiserver_response_sizes_count | API response size count |
apiserver_response_sizes_sum | Total API response size |
apiserver_selfrequest_total | Total API server self-requests |
apiserver_storage_data_key_generation_duration_seconds_bucket | APIServer storage data key generation duration: seconds buckets |
apiserver_storage_data_key_generation_duration_seconds_count | Count of data key generations by APIServer storage |
apiserver_storage_data_key_generation_duration_seconds_sum | Total data key generation time for APIServer storage, in seconds |
apiserver_storage_data_key_generation_failures_total | Total number of data key generation failures for the APIServer store |
apiserver_storage_db_total_size_in_bytes | Total size of the APIServer database (bytes) |
apiserver_storage_decode_errors_total | Total APIServer storage decoding errors |
apiserver_storage_envelope_transformation_cache_misses_total | Total cache misses for the envelope transform in APIServer storage |
apiserver_storage_events_received_total | Total number of events accepted and stored by the APIServer |
apiserver_storage_list_evaluated_objects_total | Total objects evaluated from APIServer storage for list operations |
apiserver_storage_list_fetched_objects_total | Total objects retrieved from the APIServer storage list |
apiserver_storage_list_returned_objects_total | Total number of objects in a list response from the APIServer |
apiserver_storage_list_total | Total APIServer storage list operations |
apiserver_storage_objects | Number of APIServer objects |
apiserver_storage_size_bytes | APIServer storage size (bytes) |
apiserver_terminated_watchers_total | Total number of terminated APIServer watchers |
apiserver_tls_handshake_errors_total | Total failed TLS handshake requests for the API server |
apiserver_too_large_resourceversion_errors | Number of APIServer requests that failed because the resourceVersion was too large |
apiserver_watch_cache_events_dispatched_total | Total number of events dispatched by the APIServer watch cache |
apiserver_watch_cache_events_received_total | Total events received by the APIServer watch cache |
apiserver_watch_cache_initializations_total | Total APIServer watch cache initializations |
apiserver_watch_cache_read_wait_seconds_bucket | APIServer watch cache read wait time bucket (seconds) |
apiserver_watch_cache_read_wait_seconds_count | Count of APIServer watch cache read wait time observations |
apiserver_watch_cache_read_wait_seconds_sum | Sum of wait time in seconds for APIServer watch cache reads |
apiserver_watch_cache_watch_cache_initializations_total | Total APIServer watch cache initializations |
apiserver_watch_events_sizes_bucket | APIServer watch event size distribution buckets |
apiserver_watch_events_sizes_count | APIServer watch event size count |
apiserver_watch_events_sizes_sum | Total size of APIServer watch events |
apiserver_watch_events_total | Total APIServer watch events |
apiserver_webhooks_x509_insecure_sha1_total | Number of requests that use insecure SHA1 signatures |
apiserver_webhooks_x509_missing_san_total | APIServer webhooks: Total missing x509 SANs |
authenticated_user_requests | Total number of authenticated user requests |
authentication_attempts | Authentication attempts |
authentication_duration_seconds_bucket | Authentication procedure duration buckets (seconds) |
authentication_duration_seconds_count | Number of authentication duration observations |
authentication_duration_seconds_sum | Total authentication duration in seconds |
authentication_token_cache_active_fetch_count | Number of active fetches in the authentication token cache |
authentication_token_cache_fetch_total | Total authentication token cache retrievals |
authentication_token_cache_request_duration_seconds_bucket | Authentication token cache request latency distribution buckets (seconds) |
authentication_token_cache_request_duration_seconds_count | Authentication token cache request latency counter (seconds) |
authentication_token_cache_request_duration_seconds_sum | Total duration of authentication token cache requests in seconds |
authentication_token_cache_request_total | Total authentication token cache requests |
authorization_attempts_total | Total authorization attempts |
authorization_duration_seconds_bucket | Distribution buckets for authorization procedure duration (seconds) |
authorization_duration_seconds_count | Number of authorization duration observations |
authorization_duration_seconds_sum | Total authorization procedure duration in seconds |
cardinality_enforcement_unexpected_categorizations_total | Total number of unexpected categorizations during cardinality enforcement |
count | Count |
cpu_utilization_core | CPU utilization (core) |
disabled_metric_total | Total disabled metrics |
disabled_metrics_total | Total disabled metrics |
etcd_bookmark_counts | etcd bookmark count |
etcd_db_total_size_in_bytes | Total etcd database size (bytes) |
etcd_lease_object_counts_bucket | Histogram buckets for etcd lease object count |
etcd_lease_object_counts_count | Number of observations of the etcd lease object count |
etcd_lease_object_counts_sum | Sum of observed etcd lease object counts |
etcd_object_counts | etcd object count |
etcd_request_duration_seconds_bucket | Histogram buckets for etcd request duration (seconds) |
etcd_request_duration_seconds_count | Number of etcd request duration observations |
etcd_request_duration_seconds_sum | Sum of etcd request durations in seconds |
etcd_request_errors_total | Total etcd request errors |
etcd_requests_total | Total etcd requests |
etcd_watcher_channel_length | etcd watcher channel length |
etcd_watcher_received_events | Events received by the etcd watcher |
etcd_watcher_sended_events_latency_milliseconds_bucket | Histogram buckets for etcd watcher event send latency (milliseconds) |
etcd_watcher_sent_events_latency_milliseconds_count | Number of etcd watcher event send latency observations |
etcd_watcher_sent_events_latency_milliseconds_sum | Sum of etcd watcher event send latency in milliseconds |
field_validation_request_duration_seconds_bucket | Field validation request duration distribution bucket (seconds) |
field_validation_request_duration_seconds_count | Field validation request duration count (seconds) |
field_validation_request_duration_seconds_sum | Total field validation request duration in seconds |
get_token_count | Get token count |
get_token_fail_count | Failed token acquisition count |
grpc_client_handled_total | gRPC client: Total RPCs completed |
grpc_client_msg_received_total | gRPC client: Total messages received |
grpc_client_msg_sent_total | gRPC client: Total messages sent |
grpc_client_started_total | gRPC client: Total RPCs started |
hidden_metric_total | Hidden metric: Total |
hidden_metrics_total | Hidden metric: Total |
http_request_duration_microseconds | HTTP request: Duration (microseconds) |
http_request_size_bytes | HTTP request: size (bytes) |
http_requests_total | HTTP requests: Total |
http_response_size_bytes | HTTP response size (bytes) |
Job | Job name |
job_instance_mode | Job instance mode |
kube_apiserver_clusterip_allocator_allocated_ips | Kubernetes APIServer: number of IPs allocated by the ClusterIP allocator |
kube_apiserver_clusterip_allocator_allocation_errors_total | Kubernetes API server: Total ClusterIP allocator allocation errors |
kube_apiserver_clusterip_allocator_allocation_total | Kubernetes APIServer: Total allocations by the ClusterIP allocator |
kube_apiserver_clusterip_allocator_available_ips | Kubernetes API server: Available IP address count for the ClusterIP allocator |
kube_apiserver_nodeport_allocator_allocated_ports | Kubernetes APIServer: Number of ports allocated by the NodePort allocator |
kube_apiserver_nodeport_allocator_allocation_errors_total | Kubernetes APIServer: Total NodePort allocator allocation errors |
kube_apiserver_nodeport_allocator_allocation_total | Kubernetes APIServer: Total allocations by the NodePort allocator |
kube_apiserver_nodeport_allocator_available_ports | Kubernetes APIServer: Number of available ports for the NodePort allocator |
kube_apiserver_pod_logs_backend_tls_failure_total | Kubernetes APIServer: Total pods/logs requests that failed TLS authentication |
kube_apiserver_pod_logs_insecure_backend_total | Kubernetes APIServer: Total insecure pods/logs requests |
kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total | Kubernetes API server: Total pods/logs requests that failed TLS authentication |
kube_apiserver_pod_logs_pods_logs_insecure_backend_total | Kubernetes API server: Number of insecure pods/logs requests |
kubelet_container_log_filesystem_used_bytes | Kubelet: File system usage for container logs in bytes |
kubelet_node_name | Kubelet: Node name |
kubelet_pleg_relist_duration_seconds_bucket | Kubelet: PLEG relist duration buckets (seconds) |
kubelet_pod_worker_duration_seconds_bucket | Kubelet: bucketing of pod worker duration in seconds |
kubelet_volume_stats_available_bytes | Kubelet: Available bytes in volume stats |
kubelet_volume_stats_capacity_bytes | Kubelet: Capacity in bytes from volume statistics |
kubelet_volume_stats_inodes | Kubelet: Total inode count for the volume |
kubelet_volume_stats_inodes_free | Kubelet: Free inode count on the volume |
kubelet_volume_stats_inodes_used | Kubelet: Used inode count for the volume |
kubelet_volume_stats_used_bytes | Kubelet: Volume used bytes |
kubernetes_build_info | Kubernetes build information |
kubernetes_feature_enabled | Kubernetes feature status: Enabled |
last_list_all_response_size_in_bytes | Total size of the last list response (bytes) |
memory_utilization_byte | Memory utilization: Bytes |
node_authorizer_graph_actions_duration_seconds_bucket | Node authorizer: Graph operation duration bucketing in seconds |
node_authorizer_graph_actions_duration_seconds_count | Node authorizer: Number of graph operation duration observations |
node_authorizer_graph_actions_duration_seconds_sum | Node authorizer: Total duration of graph operations in seconds |
pod_security_evaluations_total | Total pod security assessments |
pod_security_exemptions_total | Total pod security exemptions |
process_cpu_seconds_total | Total process CPU time in seconds |
process_max_fds | Maximum number of file descriptors per process |
process_open_fds | Number of open file descriptors for the process |
process_resident_memory_bytes | Process resident memory in bytes |
process_start_time_seconds | Process start time as a Unix timestamp (seconds) |
process_virtual_memory_bytes | Process virtual memory in bytes |
process_virtual_memory_max_bytes | Maximum virtual memory of a process in bytes |
registered_metric_total | Total registered metrics |
registered_metrics_total | Total registered metrics |
rest_client_exec_plugin_certificate_rotation_age_bucket | REST client plugin: Certificate rotation age bucketing (seconds) |
rest_client_exec_plugin_certificate_rotation_age_count | REST client plugin: Number of certificate rotation age observations |
rest_client_exec_plugin_certificate_rotation_age_sum | REST client plugin: Sum of certificate rotation age in seconds |
rest_client_exec_plugin_ttl_seconds | REST client plugin: Certificate TTL in seconds |
rest_client_request_duration_seconds_bucket | REST client: Request duration bucketing in seconds |
rest_client_request_duration_seconds_count | REST client: Request duration count in seconds |
rest_client_request_duration_seconds_sum | REST client: Total request duration in seconds |
rest_client_request_latency_seconds_bucket | REST client: Request latency bucketing in seconds |
rest_client_request_size_bytes_bucket | REST client: Request size bucketing (bytes) |
rest_client_request_size_bytes_count | REST client: Number of request size observations |
rest_client_request_size_bytes_sum | REST client: Total request size (bytes) |
rest_client_requests_total | REST client: Total requests |
rest_client_response_size_bytes_bucket | REST client: Response size (bytes) bucketing |
rest_client_response_size_bytes_count | REST client: Number of response size observations |
rest_client_response_size_bytes_sum | REST client: Total response size (bytes) |
rest_client_transport_cache_entries | REST client: number of transport cache entries |
rest_client_transport_create_calls_total | REST client: Total transport creation calls |
scheduler_pending_pods | Scheduler: Number of pending pods |
scheduler_pod_scheduling_attempts_bucket | Scheduler: pod scheduling attempt count bucketing |
scheduler_scheduler_cache_size | Scheduler: Scheduler cache size |
scrape_duration_seconds | Scrape duration (seconds) |
scrape_samples_post_metric_relabeling | Number of scraped samples (after metric relabeling) |
scrape_samples_scraped | Number of scraped samples |
scrape_series_added | Number of new series scraped |
serviceaccount_invalid_legacy_auto_token_uses_total | Total uses of invalid legacy auto-generated service account tokens |
serviceaccount_legacy_auto_token_uses_total | Total uses of legacy auto-generated service account tokens |
serviceaccount_legacy_manual_token_uses_total | Total uses of legacy manual service account tokens |
serviceaccount_legacy_tokens_total | Total number of legacy service account tokens |
serviceaccount_stale_tokens_total | Total number of stale service account tokens |
serviceaccount_valid_tokens_total | Total valid service account tokens |
ssh_tunnel_open_count | Open SSH tunnel count |
ssh_tunnel_open_fail_count | Number of failed SSH tunnel openings |
up | Metric collection connectivity |
watch_cache_capacity | Watch cache capacity |
watch_cache_capacity_decrease_total | Total decreases in watch cache capacity |
watch_cache_capacity_increase_total | Total increases in watch cache capacity |
workqueue_adds_total | Total additions to the work queue |
workqueue_depth | Work queue depth |
workqueue_longest_running_processor_seconds | Longest processor run time in the work queue (seconds) |
workqueue_queue_duration_seconds_bucket | Histogram buckets for work queue wait time (seconds) |
workqueue_queue_duration_seconds_count | Number of work queue wait time observations |
workqueue_queue_duration_seconds_sum | Sum of work queue wait time (seconds) |
workqueue_retries_total | Total work queue retries |
workqueue_unfinished_work_seconds | Duration of pending work in the work queue (seconds) |
workqueue_work_duration_seconds_bucket | Histogram buckets for work queue processing time (seconds) |
workqueue_work_duration_seconds_count | Number of work queue processing time observations |
workqueue_work_duration_seconds_sum | Sum of work queue processing time (seconds) |
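Many metrics in the preceding table come in _bucket, _count, and _sum triples, which is the standard Prometheus histogram layout. The following Python sketch illustrates how the three series relate: the mean is the sum divided by the count, and a quantile can be estimated by linear interpolation over the cumulative buckets. The function names and the sample values are hypothetical and are used for illustration only. They are not part of Managed Service for Prometheus.

```python
import math
from typing import Sequence, Tuple


def histogram_mean(total_sum: float, total_count: float) -> float:
    """Mean observation: <name>_sum divided by <name>_count."""
    if total_count == 0:
        return math.nan
    return total_sum / total_count


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the <name>_bucket series: sorted by upper bound (the
    `le` label), cumulative counts, and the last bound is +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical samples for a duration histogram, in seconds.
sample_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, sample_buckets)
mean = histogram_mean(12.0, 100.0)
```

The estimate is only as precise as the bucket boundaries, so it is best read as an approximation of the true quantile.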
Node Exporter (Job name: node-exporter)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | Duration of append operations for the Alibaba Cloud Prometheus agent in seconds. |
aliyun_prometheus_agent_job_discovery_status | Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | Total number of scrapes by target for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_target_info | Information about the targets of the Alibaba Cloud Prometheus agent. |
job | The name of the job. |
node_boot_time_seconds | Node boot time as a Unix timestamp, in seconds. |
node_context_switches_total | Total number of context switches on the node. |
node_cpu_seconds_total | Total CPU time spent in each mode on the node, in seconds. |
node_disk_io_now | Number of disk I/O operations currently in progress on the node. |
node_disk_io_time_seconds_total | Total time spent on disk I/O on the node, in seconds. |
node_disk_io_time_weighted_seconds_total | Total weighted time spent on disk I/O on the node, in seconds. |
node_disk_read_bytes_total | Total bytes read from disk on the node. |
node_disk_read_time_seconds_total | Total time spent reading from disk on the node, in seconds. |
node_disk_reads_completed_total | Total number of completed disk reads on the node. |
node_disk_reads_merged_total | Total number of merged disk reads on the node. |
node_disk_write_time_seconds_total | Total time spent writing to disk on the node, in seconds. |
node_disk_writes_completed_total | Total number of completed disk writes on the node. |
node_disk_writes_merged_total | Total number of merged disk writes on the node. |
node_disk_written_bytes_total | Total bytes written to disk on the node. |
node_exporter_build_info | Build information for Node Exporter. |
node_filefd_allocated | Number of allocated file descriptors on the node. |
node_filefd_maximum | Maximum number of file descriptors on the node. |
node_filesystem_avail_bytes | Number of available bytes in the file system on the node. |
node_filesystem_free_bytes | Number of free bytes in the file system on the node. |
node_filesystem_size_bytes | Total size of the file system on the node, in bytes. |
node_intr_total | Total number of interrupts on the node. |
node_load1 | 1-minute load average on the node. |
node_load15 | 15-minute load average on the node. |
node_load5 | 5-minute load average on the node. |
node_memory_MemAvailable_bytes | Available memory on the node, in bytes. |
node_memory_MemFree_bytes | Free memory on the node, in bytes. |
node_memory_MemTotal_bytes | Total memory on the node, in bytes. |
node_memory_Slab_bytes | Slab memory on the node, in bytes. |
node_memory_SReclaimable_bytes | Reclaimable slab memory on the node, in bytes. |
node_netstat_Tcp_InErrs | Number of TCP receive errors. |
node_netstat_Tcp_InSegs | Number of received TCP segments. |
node_netstat_Tcp_OutSegs | Number of sent TCP segments. |
node_netstat_Tcp_PassiveOpens | Number of passive TCP connection openings. |
node_netstat_Tcp_RetransSegs | Number of retransmitted TCP segments. |
node_network_receive_bytes_total | Total number of bytes received over the network. |
node_network_receive_drop_total | Total number of received packets dropped. |
node_network_receive_errs_total | Total number of receive errors. |
node_network_receive_packets_total | Total number of packets received. |
node_network_transmit_bytes_total | Total number of bytes transmitted over the network. |
node_network_transmit_drop_total | Total number of transmitted packets dropped. |
node_network_transmit_errs_total | Total number of transmit errors. |
node_network_transmit_packets_total | Total number of packets transmitted. |
node_network_up | Indicates whether the network interface is up. |
node_processes_max_processes | Maximum number of processes. |
node_processes_max_threads | Maximum number of threads. |
node_processes_pids | Number of process IDs. |
node_processes_state | Distribution of process states. |
node_processes_threads | Number of threads. |
node_schedstat_running_seconds_total | Total seconds spent in the running state according to scheduling statistics. |
node_sockstat_TCP_alloc | Number of allocated TCP sockets. |
node_sockstat_TCP_inuse | Number of TCP sockets in use. |
node_sockstat_TCP_mem | Memory usage of TCP sockets, in pages. |
node_sockstat_TCP_mem_bytes | Memory usage of TCP sockets, in bytes. |
node_sockstat_TCP_tw | Number of TCP sockets in the TIME_WAIT state. |
node_time_zone_offset_seconds | Time zone offset in seconds. |
node_timex_offset_seconds | Time offset in seconds. |
node_timex_sync_status | Clock synchronization status. |
node_uname_info | System information from uname. |
node_vmstat_pgfault | Number of page faults from VM statistics. |
node_vmstat_pgmajfault | Number of major page faults from VM statistics. |
node_vmstat_pgpgin | Number of page-ins from VM statistics. |
node_vmstat_pgpgout | Number of page-outs from VM statistics. |
up | Connectivity for metric scraping. |
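Most of the Node Exporter metrics whose names end in _total are counters that only increase, so they are usually read as a rate between two samples, while gauges such as node_filesystem_avail_bytes can be combined directly. The following Python sketch shows both calculations under simple assumptions. The function names and sample values are hypothetical, and the counter-reset handling is a simplification of what a Prometheus rate query does.

```python
import math


def counter_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Per-second rate of a counter between two samples (value, Unix time)."""
    if t1 <= t0:
        raise ValueError("the second sample must be newer than the first")
    delta = v1 - v0
    if delta < 0:
        # The counter was reset (for example, the node rebooted):
        # assume it restarted from zero, so the increase is the new value.
        delta = v1
    return delta / (t1 - t0)


def filesystem_used_ratio(size_bytes: float, avail_bytes: float) -> float:
    """Used fraction from node_filesystem_size_bytes and node_filesystem_avail_bytes."""
    if size_bytes <= 0:
        return math.nan
    return 1.0 - avail_bytes / size_bytes


# Hypothetical samples of node_network_receive_bytes_total taken 60 seconds apart.
rx_rate = counter_rate(1000.0, 0.0, 1600.0, 60.0)
# Hypothetical file system with 100 bytes in total and 25 bytes available.
used = filesystem_used_ratio(100.0, 25.0)
```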
kube-state-metrics (Job name: _kube-state-metrics)
Metric | Description |
kube_configmap_info | Information about Kubernetes ConfigMaps |
kube_cronjob_annotations | Kubernetes CronJob annotations |
kube_cronjob_created | The creation time of the Kubernetes CronJob. |
kube_cronjob_info | Kubernetes CronJob information |
kube_cronjob_labels | Kubernetes CronJob labels |
kube_cronjob_metadata_resource_version | Shows the resource version of the Kubernetes CronJob metadata. |
kube_cronjob_next_schedule_time | The next scheduled time of a Kubernetes CronJob. |
kube_cronjob_spec_failed_job_history_limit | Kubernetes CronJob failed job history limit |
kube_cronjob_spec_starting_deadline_seconds | The starting deadline for the Kubernetes CronJob in seconds. |
kube_cronjob_spec_successful_job_history_limit | The retention limit for the history of successful jobs in a Kubernetes CronJob. |
kube_cronjob_spec_suspend | The suspend status of a Kubernetes CronJob. |
kube_cronjob_status_active | Number of actively running jobs for the Kubernetes CronJob |
kube_cronjob_status_last_schedule_time | The last schedule time of the Kubernetes CronJob |
kube_cronjob_status_last_successful_time | The last successful running time of the Kubernetes CronJob |
kube_daemonset_created | The creation time of the Kubernetes DaemonSet. |
kube_daemonset_status_current_number_scheduled | The current number of nodes scheduled for the Kubernetes DaemonSet. |
kube_daemonset_status_desired_number_scheduled | The desired number of scheduled nodes for a Kubernetes DaemonSet. |
kube_daemonset_status_number_available | Number of available nodes in the Kubernetes DaemonSet |
kube_daemonset_status_number_misscheduled | Number of nodes incorrectly running a Kubernetes DaemonSet pod |
kube_daemonset_status_number_ready | The number of ready nodes in a Kubernetes DaemonSet. |
kube_daemonset_status_number_unavailable | Number of unavailable nodes in the Kubernetes DaemonSet |
kube_daemonset_status_updated_number_scheduled | The number of nodes scheduled with the updated Kubernetes DaemonSet. |
kube_daemonset_updated_number_scheduled | Number of nodes scheduled with the updated Kubernetes DaemonSet. |
kube_deployment_created | The creation time of the Kubernetes deployment. |
kube_deployment_labels | Kubernetes deployment labels |
kube_deployment_metadata_generation | The generation of the Kubernetes deployment metadata. |
kube_deployment_spec_replicas | Number of replicas in the Kubernetes deployment specification |
kube_deployment_spec_strategy_rollingupdate_max_unavailable | The maximum number of unavailable pods during a rolling update for a Kubernetes deployment |
kube_deployment_status_observed_generation | The observed generation of the Kubernetes deployment. |
kube_deployment_status_replicas | Total number of replicas in a Kubernetes deployment |
kube_deployment_status_replicas_available | Number of available Kubernetes deployment replicas |
kube_deployment_status_replicas_ready | Number of ready replicas in a Kubernetes deployment |
kube_deployment_status_replicas_unavailable | Number of unavailable replicas in a Kubernetes deployment |
kube_deployment_status_replicas_updated | The number of updated replicas in a Kubernetes deployment. |
kube_horizontalpodautoscaler_info | Information about the Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_labels | Kubernetes HorizontalPodAutoscaler labels |
kube_horizontalpodautoscaler_metadata_generation | The metadata generation of the Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_spec_max_replicas | The maximum number of replicas in the specification for a Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_spec_min_replicas | The minimum number of replicas for a Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_spec_target_metric | The target metric of a Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_status_condition | The status condition of a Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_status_current_replicas | The current number of replicas of the Kubernetes HorizontalPodAutoscaler. |
kube_horizontalpodautoscaler_status_desired_replicas | Desired number of replicas for the Kubernetes HorizontalPodAutoscaler |
kube_hpa_labels | Kubernetes HorizontalPodAutoscaler labels |
kube_hpa_metadata_generation | The metadata generation of the Kubernetes HorizontalPodAutoscaler. |
kube_hpa_spec_max_replicas | The maximum number of replicas for a Kubernetes HorizontalPodAutoscaler. |
kube_hpa_spec_min_replicas | The minimum number of replicas in the Kubernetes HorizontalPodAutoscaler specification. |
kube_hpa_spec_target_metric | The target metric for a Kubernetes HorizontalPodAutoscaler. |
kube_hpa_status_condition | Kubernetes HorizontalPodAutoscaler status condition |
kube_hpa_status_current_replicas | The current number of replicas for the Kubernetes HorizontalPodAutoscaler. |
kube_hpa_status_desired_replicas | The desired number of replicas for a Kubernetes HorizontalPodAutoscaler. |
kube_ingress_info | Ingress information |
kube_job_created | The time when the job was created. |
kube_job_failed | Whether the job has failed |
kube_job_info | Job information |
kube_job_spec_completions | The number of completions specified for the job |
kube_job_status_active | Number of actively running pods for the job |
kube_job_status_failed | The number of failed pods for the job. |
kube_job_status_succeeded | The number of succeeded pods for the job. |
kube_namespace_created | The creation time of the namespace. |
kube_namespace_labels | Namespace labels |
kube_namespace_status_phase | Namespace status phase |
kube_node_info | Node information |
kube_node_labels | Node labels |
kube_node_spec_taint | Node taint configuration |
kube_node_spec_unschedulable | Flag indicating whether the node is marked as unschedulable. |
kube_node_status_allocatable | The amount of allocatable resources on a node. |
kube_node_status_allocatable_cpu_cores | Number of allocatable CPU cores on the node. |
kube_node_status_allocatable_memory_bytes | Allocatable memory on the node in bytes |
kube_node_status_allocatable_pods | Number of allocatable pods on the node |
kube_node_status_capacity | Node capacity |
kube_node_status_capacity_cpu_cores | The CPU capacity of a node in cores. |
kube_node_status_capacity_memory_bytes | Node memory capacity in bytes |
kube_node_status_capacity_pods | Node pod capacity |
kube_node_status_condition | Node status condition |
kube_persistentvolume_status_phase | The status phase of the persistent volume. |
kube_persistentvolumeclaim_info | Persistent Volume Claim information |
kube_persistentvolumeclaim_resource_requests_storage_bytes | The amount of storage requested by a persistent volume claim |
kube_persistentvolumeclaim_status_phase | The status phase of the persistent volume claim. |
kube_pod_completion_time | Pod completion time |
kube_pod_container_info | Pod container information |
kube_pod_container_resource_limits | Pod container resource limits |
kube_pod_container_resource_limits_cpu_cores | Pod container CPU core limit |
kube_pod_container_resource_limits_memory_bytes | Pod container memory limit in bytes |
kube_pod_container_resource_requests | Pod container resource request |
kube_pod_container_resource_requests_cpu_cores | Pod container CPU core request |
kube_pod_container_resource_requests_memory_bytes | Pod container memory request in bytes |
kube_pod_container_status_last_terminated_reason | Last termination reason of the pod container |
kube_pod_container_status_ready | Pod container readiness status |
kube_pod_container_status_restarts_total | Pod container restart count |
kube_pod_container_status_running | Pod container running status |
kube_pod_container_status_terminated | Pod container termination status |
kube_pod_container_status_terminated_reason | Pod container termination reason |
kube_pod_container_status_waiting | Pod container waiting status |
kube_pod_container_status_waiting_reason | Pod container wait reason |
kube_pod_created | Pod creation time |
kube_pod_deletion_timestamp | Pod deletion timestamp |
kube_pod_info | Pod information |
kube_pod_labels | Pod labels |
kube_pod_owner | Pod owner object |
kube_pod_start_time | Pod start time |
kube_pod_status_container_ready_time | Pod container readiness time |
kube_pod_status_initialized_time | Pod status initialization completion time |
kube_pod_status_phase | Pod phase |
kube_pod_status_ready | Pod readiness status |
kube_pod_status_ready_time | Pod readiness time |
kube_pod_status_reason | Pod status reason |
kube_pod_status_scheduled_time | Pod scheduling time |
kube_pod_status_unschedulable | Flag indicating whether the pod is unschedulable |
kube_replicaset_owner | ReplicaSet owner object |
kube_replicaset_status_ready_replicas | Number of ready replicas in the ReplicaSet |
kube_resource_relationship | Resource relationships |
kube_resourcequota | Resource quota |
kube_resourcequota_created | Resource quota creation time |
kube_secret_info | Secret information |
kube_service_info | Service information |
kube_service_spec_type | Service type |
kube_service_status_load_balancer_ingress | Service load balancer ingress status |
kube_statefulset_created | StatefulSet creation time |
kube_statefulset_metadata_generation | StatefulSet metadata generation |
kube_statefulset_replicas | Desired number of replicas for the StatefulSet |
kube_statefulset_status_replicas | Number of replicas in the StatefulSet status |
kube_statefulset_status_replicas_available | Number of available StatefulSet replicas |
kube_statefulset_status_replicas_ready | Number of ready StatefulSet replicas |
kube_statefulset_status_replicas_updated | Number of updated StatefulSet replicas |
rest_client_requests_total | Total REST client requests |
up | Connectivity for metric collection |
workqueue_adds_total | Total work queue additions |
workqueue_depth | Work queue depth |
workqueue_queue_duration_seconds_bucket | Work queue queuing duration distribution (seconds) |
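Several kube-state-metrics series are meant to be compared with each other. For example, the difference between kube_deployment_spec_replicas and kube_deployment_status_replicas_available indicates how many replicas a deployment is missing. The following Python sketch shows that comparison on values that have already been fetched. The function name, the dictionary layout keyed by (namespace, deployment), and the sample values are assumptions made for illustration.

```python
from typing import Dict, Tuple

DeploymentKey = Tuple[str, str]  # (namespace, deployment)


def missing_replicas(
    spec_replicas: Dict[DeploymentKey, int],
    available_replicas: Dict[DeploymentKey, int],
) -> Dict[DeploymentKey, int]:
    """Return deployments whose available replicas are below the desired count."""
    result: Dict[DeploymentKey, int] = {}
    for key, desired in spec_replicas.items():
        # A deployment with no availability sample is treated as having zero available replicas.
        gap = desired - available_replicas.get(key, 0)
        if gap > 0:
            result[key] = gap
    return result


# Hypothetical values of kube_deployment_spec_replicas and
# kube_deployment_status_replicas_available.
spec = {("default", "web"): 3, ("default", "api"): 2}
available = {("default", "web"): 3, ("default", "api"): 1}
gaps = missing_replicas(spec, available)
```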
kube-events (Job name: _arms/kube-event)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | The duration of an append operation for the Alibaba Cloud Prometheus agent, in seconds. |
aliyun_prometheus_agent_job_discovery_status | The discovery status of a scrape job for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrape_custom_error | The number of custom scrape errors for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | The total number of scrapes by target for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_target_info | The target information for the Alibaba Cloud Prometheus agent. |
eventer_events_error_total | The total number of event processing errors. |
eventer_events_normal_total | The total number of normal events. |
eventer_events_warning_total | The total number of event warnings. |
eventer_exporter_duration_milliseconds_count | The number of samples for the event export duration, in milliseconds. |
eventer_exporter_duration_milliseconds_sum | The total event export duration, in milliseconds. |
eventer_manager_last_time_seconds | The last operation time of the event manager, in seconds. |
eventer_scraper_duration_milliseconds_count | The count of the event scrape duration, in milliseconds. |
eventer_scraper_duration_milliseconds_sum | The total event scrape duration, in milliseconds. |
eventer_scraper_events_total_number | The total number of events scraped. |
eventer_scraper_last_time_seconds | The last running time of the event scrape, in seconds. |
up | The connectivity for metric collection. |
CoreDNS (Job name: arms-ack-coredns)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds. |
aliyun_prometheus_agent_job_discovery_status | The status of scrape job discovery for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrape_custom_error | Number of custom scrape errors from the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_scrapes_by_target_total | The total number of scrapes by the Alibaba Cloud Prometheus agent per target. |
aliyun_prometheus_agent_target_info | Target information for the Alibaba Cloud Prometheus agent |
coredns_autopath_success_count_total | Total success count for CoreDNS autopath. |
coredns_autopath_success_total | Total number of successful CoreDNS autopaths. |
coredns_build_info | CoreDNS build information |
coredns_cache_drops_total | Total CoreDNS cache drop count |
coredns_cache_entries | Number of CoreDNS cache entries |
coredns_cache_evictions_total | Total number of CoreDNS cache evictions |
coredns_cache_hits_total | Total CoreDNS cache hits |
coredns_cache_misses_total | Total number of CoreDNS cache misses |
coredns_cache_requests_total | Total CoreDNS cache requests |
coredns_cache_size | The size of the CoreDNS cache. |
coredns_dns_do_requests_total | Total CoreDNS DNS requests with the DO bit set |
coredns_dns_request_count_total | Total DNS request count for CoreDNS |
coredns_dns_request_duration_seconds_bucket | CoreDNS DNS request duration histogram buckets (seconds) |
coredns_dns_request_duration_seconds_count | The count of CoreDNS DNS requests |
coredns_dns_request_duration_seconds_sum | Total CoreDNS DNS request duration in seconds |
coredns_dns_request_size_bytes_bucket | CoreDNS DNS request size histogram buckets (bytes) |
coredns_dns_request_size_bytes_count | CoreDNS DNS request size count (bytes) |
coredns_dns_request_size_bytes_sum | Sum of CoreDNS DNS request size (bytes) |
coredns_dns_request_type_count_total | The total number of DNS requests in CoreDNS, categorized by request type. |
coredns_dns_requests_total | Total DNS requests handled by CoreDNS |
coredns_dns_response_rcode_count_total | Total number of CoreDNS DNS responses by response code |
coredns_dns_response_size_bytes_bucket | CoreDNS DNS response size histogram buckets (bytes) |
coredns_dns_response_size_bytes_count | CoreDNS DNS response size (bytes) count |
coredns_dns_response_size_bytes_sum | The sum of CoreDNS DNS response sizes in bytes |
coredns_dns_responses_total | Total number of CoreDNS DNS responses |
coredns_forward_conn_cache_hits_total | Total CoreDNS forward connection cache hits. |
coredns_forward_conn_cache_misses_total | Total misses in the CoreDNS forward connection cache. |
coredns_forward_healthcheck_broken_total | Total number of times all CoreDNS forward upstreams failed their health checks |
coredns_forward_healthcheck_failure_count_total | Total count of CoreDNS forwarding health check failures |
coredns_forward_healthcheck_failures_total | Total CoreDNS forward health check failures |
coredns_forward_max_concurrent_rejects_total | Total number of rejections for CoreDNS forwarding due to maximum concurrency |
coredns_forward_request_count_total | Total count of requests forwarded by CoreDNS |
coredns_forward_request_duration_seconds_bucket | Histogram buckets for CoreDNS forwarded request duration in seconds. |
coredns_forward_request_duration_seconds_count | Count of CoreDNS forward request duration (seconds) |
coredns_forward_request_duration_seconds_sum | Total duration of CoreDNS forward requests in seconds. |
coredns_forward_requests_total | Total number of requests forwarded by CoreDNS |
coredns_forward_response_rcode_count_total | Total count of CoreDNS forwarded response codes |
coredns_forward_responses_total | Total number of responses forwarded by CoreDNS |
coredns_forward_sockets_open | Number of open sockets for CoreDNS forwarding |
coredns_health_request_duration_seconds_bucket | Histogram buckets for CoreDNS health check request duration in seconds |
coredns_health_request_duration_seconds_count | Number of CoreDNS health check requests. |
coredns_health_request_duration_seconds_sum | Total duration of CoreDNS health check requests in seconds. |
coredns_health_request_failures_total | Total number of failed CoreDNS health check requests |
coredns_hosts_entries | Number of CoreDNS host entries |
coredns_hosts_reload_timestamp_seconds | CoreDNS host reload timestamp (seconds) |
coredns_kubernetes_dns_programming_duration_seconds_bucket | Histogram buckets of CoreDNS Kubernetes DNS programming duration (seconds) |
coredns_kubernetes_dns_programming_duration_seconds_count | Count of CoreDNS Kubernetes DNS programming duration observations |
coredns_kubernetes_dns_programming_duration_seconds_sum | Sum of CoreDNS Kubernetes DNS programming duration (seconds) |
coredns_local_localhost_requests_total | Total CoreDNS requests to localhost |
coredns_panic_count_total | Total CoreDNS panics |
coredns_panics_total | Total CoreDNS panic count |
coredns_plugin_enabled | CoreDNS plugin status |
coredns_reload_failed_total | Total CoreDNS reload failures |
coredns_reload_version_info | CoreDNS reload version |
coredns_template_matches_total | Total CoreDNS template matches |
up | Metric collection connectivity |
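The `_bucket`, `_sum`, and `_count` series in the table above follow the standard Prometheus histogram convention: cumulative bucket counters plus a running sum and count of observations. The following Python sketch, which uses made-up sample values, shows how a mean and an approximate quantile could be derived from such series. The function names and the data are hypothetical, and the quantile estimate is a simplified linear interpolation in the spirit of PromQL's `histogram_quantile`, not the exact implementation.

```python
import math
from typing import List, Tuple


def histogram_mean(sum_value: float, count_value: float) -> float:
    """Mean observation: the _sum series divided by the _count series."""
    return sum_value / count_value if count_value else 0.0


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the open-ended bucket: report its lower bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical samples shaped like coredns_forward_request_duration_seconds_*.
sample_buckets = [(0.005, 50.0), (0.01, 80.0), (0.025, 95.0), (math.inf, 100.0)]
print(histogram_mean(0.9, 100.0))
print(estimate_quantile(0.9, sample_buckets))
```

Because the buckets are cumulative, the accuracy of the estimate depends on how finely the bucket boundaries are spaced around the quantile of interest.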
CSI (cluster dimension) (Job name: k8s-csi-cluster-pv)
Metric | Description |
alibaba_cloud_storage_operator_build_info | The build information for the Alibaba Cloud storage operator. |
aliyun_prometheus_agent_append_duration_seconds | The duration of the append operation for the Alibaba Cloud Prometheus agent, in seconds. |
aliyun_prometheus_agent_job_discovery_status | The discovery status of the scrape job for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrape_custom_error | The number of custom scrape errors for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | The total number of scrapes by target for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_target_info | The target information of the Alibaba Cloud Prometheus agent. |
cluster_pv_detail_num_total | The total count of detailed information for cluster PVs. |
cluster_pv_status_num_total | The total number of cluster PV statuses. |
cluster_pvc_detail_num_total | The total count of detailed information for cluster PVCs. |
cluster_pvc_status_num_total | The total number of cluster PVC statuses. |
cluster_scrape_collector_duration_seconds | The duration of the cluster scrape collector, in seconds. |
cluster_scrape_collector_success | The number of successful attempts by the cluster scrape collector. |
up | The connectivity for metric scraping. |
CSI (node dimension) (Job name: k8s-csi-node-pv)
Metric | Description |
alibaba_cloud_csi_driver_build_info | Alibaba Cloud CSI driver build information |
aliyun_prometheus_agent_append_duration_seconds | Alibaba Cloud Prometheus agent append operation duration in seconds |
aliyun_prometheus_agent_job_discovery_status | Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_scrape_custom_error | Number of custom scrape errors from the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_scrapes_by_target_total | Total number of scrapes by target from the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_target_info | Target information for the Alibaba Cloud Prometheus agent |
cluster_scrape_collector_duration_seconds | Duration of the cluster scrape collector in seconds |
cluster_scrape_collector_success | Number of successful cluster scrape collections |
container_fs_available_bytes | Available bytes in the container file system |
container_fs_inodes_free | Available inodes in the container file system |
container_fs_inodes_total | Total inodes in the container file system |
container_fs_inodes_used | Used inodes in the container file system |
container_fs_limit_bytes | Byte limit for the container file system |
container_fs_usage_bytes | Used bytes in the container file system |
ephemeral_storage_pod_available_bytes | Available bytes for the ephemeral storage pod |
ephemeral_storage_pod_inodes_free | Available inodes for the ephemeral storage pod |
ephemeral_storage_pod_inodes_total | Total inodes for the ephemeral storage pod |
ephemeral_storage_pod_inodes_used | Used inodes for the ephemeral storage pod |
ephemeral_storage_pod_limit_bytes | Byte limit for the ephemeral storage pod |
ephemeral_storage_pod_usage_bytes | Used bytes for the ephemeral storage pod |
node_volume_backend_posix_access_total_counter | Total POSIX access operations on the node volume backend. |
node_volume_backend_posix_getattr_total_counter | Total POSIX getattr operations on the node volume backend. |
node_volume_backend_posix_getmode_total_counter | Total POSIX getmode operations on the node volume backend. |
node_volume_backend_posix_link_total_counter | Total POSIX link operations on the node volume backend. |
node_volume_backend_posix_lookup_total_counter | Total POSIX lookup operations on the node volume backend. |
node_volume_backend_posix_mknod_total_counter | Total POSIX mknod operations on the node volume backend. |
node_volume_backend_posix_readdir_total_counter | Total POSIX readdir operations on the node volume backend. |
node_volume_backend_posix_readlink_total_counter | Total POSIX readlink operations on the node volume backend. |
node_volume_backend_posix_remove_total_counter | Total POSIX remove operations on the node volume backend. |
node_volume_backend_posix_rename_total_counter | Total POSIX rename operations on the node volume backend. |
node_volume_backend_posix_setattr_total_counter | Total POSIX setattr operations on the node volume backend. |
node_volume_backend_posix_statfs_total_counter | Total POSIX statfs operations on the node volume backend. |
node_volume_backend_read_bytes_total_counter | Total bytes read from the node volume backend. |
node_volume_backend_read_completed_total_counter | Total completed read requests on the node volume backend. |
node_volume_backend_read_time_milliseconds_total_counter | Total read time in milliseconds on the node volume backend. |
node_volume_backend_write_bytes_total_counter | Total bytes written to the node volume backend. |
node_volume_backend_write_completed_total_counter | Total completed write requests on the node volume backend. |
node_volume_backend_write_time_milliseconds_total_counter | Total write time in milliseconds on the node volume backend. |
node_volume_capacity_bytes_available | Available capacity of the node volume in bytes. |
node_volume_capacity_bytes_available_counter | Counter for the available capacity of the node volume in bytes. |
node_volume_capacity_bytes_total | Total capacity of the node volume in bytes. |
node_volume_capacity_bytes_total_counter | Counter for the total capacity of the node volume in bytes. |
node_volume_capacity_bytes_used | Used capacity of the node volume in bytes. |
node_volume_capacity_bytes_used_counter | Counter for the used capacity of the node volume in bytes. |
node_volume_hot_spot_head_file_top | Ranking of hot spot head files on the node volume. |
node_volume_hot_spot_read_file_top | Ranking of hot spot read files on the node volume. |
node_volume_hot_spot_write_file_top | Ranking of hot spot write files on the node volume. |
node_volume_inode_bytes_available_counter | Counter for available bytes for inodes on the node volume. |
node_volume_inode_bytes_total_counter | Counter for total bytes for inodes on the node volume. |
node_volume_inode_bytes_used_counter | Counter for used bytes for inodes on the node volume. |
node_volume_inodes_available | Available inodes on the node volume. |
node_volume_inodes_total | Total inodes on the node volume. |
node_volume_inodes_used | Used inodes on the node volume. |
node_volume_io_now | Current I/O operations on the node volume. |
node_volume_io_time_seconds_total | Total I/O time on the node volume in seconds. |
node_volume_oss_delete_object_total_counter | Total objects deleted from OSS for the node volume. |
node_volume_oss_get_object_total_counter | Total objects retrieved from OSS for the node volume. |
node_volume_oss_head_object_total_counter | Total head object operations on OSS for the node volume. |
node_volume_oss_post_object_total_counter | Total objects posted to OSS for the node volume. |
node_volume_oss_put_object_total_counter | Total objects put to OSS for the node volume. |
node_volume_posix_access_total_counter | Total POSIX access operations on the node volume. |
node_volume_posix_chmod_total_counter | Total POSIX chmod operations on the node volume. |
node_volume_posix_chown_total_counter | Total POSIX chown operations on the node volume. |
node_volume_posix_create_total_counter | Total POSIX create operations on the node volume. |
node_volume_posix_flush_total_counter | Total POSIX flush operations on the node volume. |
node_volume_posix_fsync_total_counter | Total POSIX fsync operations on the node volume. |
node_volume_posix_mkdir_total_counter | Total POSIX mkdir operations on the node volume. |
node_volume_posix_open_total_counter | Total POSIX open operations on the node volume. |
node_volume_posix_opendir_total_counter | Total POSIX opendir operations on the node volume. |
node_volume_posix_read_total_counter | Total POSIX read operations on the node volume. |
node_volume_posix_readdir_total_counter | Total POSIX readdir operations on the node volume. |
node_volume_posix_release_total_counter | Total POSIX release operations on the node volume. |
node_volume_posix_rename_total_counter | Total POSIX rename operations on the node volume. |
node_volume_posix_rmdir_total_counter | Total POSIX rmdir operations on the node volume. |
node_volume_posix_truncate_total_counter | Total POSIX truncate operations on the node volume. |
node_volume_posix_write_total_counter | Total POSIX write operations on the node volume. |
node_volume_read_bytes_total | Total bytes read from the node volume. |
node_volume_read_bytes_total_counter | Counter for the total bytes read from the node volume. |
node_volume_read_completed_total | Total completed read operations on the node volume. |
node_volume_read_completed_total_counter | Counter for total completed read operations on the node volume. |
node_volume_read_merged_total | Total merged read operations on the node volume. |
node_volume_read_queue_time_milliseconds_total | Total time spent in the read queue on the node volume, in milliseconds. |
node_volume_read_rtt_time_milliseconds_total | Total round trip time for read operations on the node volume, in milliseconds. |
node_volume_read_sent_bytes_total | Total bytes sent for read operations on the node volume. |
node_volume_read_time_milliseconds_total | Total time for read operations on the node volume, in milliseconds. |
node_volume_read_time_milliseconds_total_counter | Counter for the total time for read operations on the node volume, in milliseconds. |
node_volume_read_timeouts_total | Total read timeouts on the node volume. |
node_volume_read_transmissions_total | Total read transmissions on the node volume. |
node_volume_vg_free_bytes | Free bytes in the node volume group (VG). |
node_volume_vg_size_bytes | Total size of the node volume group (VG) in bytes. |
node_volume_write_bytes_total | Total bytes written to the node volume. |
node_volume_write_bytes_total_counter | Counter for the total bytes written to the node volume. |
node_volume_write_completed_total | Total completed write operations on the node volume. |
node_volume_write_completed_total_counter | Counter for total completed write operations on the node volume. |
node_volume_write_merged_total | Total merged write operations on the node volume. |
node_volume_write_queue_time_milliseconds_total | Total time spent in the write queue on the node volume, in milliseconds. |
node_volume_write_recv_bytes_total | Total bytes received for write operations on the node volume. |
node_volume_write_rtt_time_milliseconds_total | Total round trip time for write operations on the node volume, in milliseconds. |
node_volume_write_time_milliseconds_total | Total time for write operations on the node volume, in milliseconds. |
node_volume_write_time_milliseconds_total_counter | Counter for the total time for write operations on the node volume, in milliseconds. |
node_volume_write_timeouts_total | Total write timeouts on the node volume. |
node_volume_write_transmissions_total | Total write transmissions on the node volume. |
up | Connectivity for metric scraping. |
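The capacity gauges and cumulative I/O counters in the table above can be combined into derived values, such as a usage ratio or an average latency per operation. The following Python sketch illustrates that arithmetic on made-up samples. The function names and sample values are hypothetical, and the sketch assumes that the `_total` series are monotonically increasing counters sampled at two points in time.

```python
def usage_ratio(used: float, total: float) -> float:
    """Used share of a resource, for example
    node_volume_capacity_bytes_used / node_volume_capacity_bytes_total."""
    return used / total if total else 0.0


def avg_latency_ms(time_ms_prev: float, time_ms_curr: float,
                   ops_prev: float, ops_curr: float) -> float:
    """Average time per operation between two counter samples, for example
    from node_volume_read_time_milliseconds_total and
    node_volume_read_completed_total (assumed to be cumulative counters)."""
    ops = ops_curr - ops_prev
    if ops <= 0:
        # No completed operations (or a counter reset) in the window.
        return 0.0
    return (time_ms_curr - time_ms_prev) / ops


# Hypothetical samples: 25 GiB used of 100 GiB, and 300 reads taking 600 ms.
print(usage_ratio(25 * 2**30, 100 * 2**30))
print(avg_latency_ms(1000.0, 1600.0, 200.0, 500.0))
```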
GPU-Exporter (Job name: gpu-exporter)
Metric | Description |
DCGM_CUSTOM_ALLOCATE_MODE | The GPU allocation mode of the node. Possible values: 0 (None): no GPU pods are running on the node. 1 (Exclusive): GPU pods on the node run in exclusive mode. 2 (Share): GPU pods on the node run in shared mode. |
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | The ratio of the computing power allocated to a container to the total computing power of the GPU card. The value ranges from 0 to 1. For exclusive GPUs, or shared GPUs that request only GPU memory, the value is 0, which indicates that computing power is not limited. For example, if a GPU card has 100 units of computing power and 30 units are allocated to a container, the allocated computing power ratio is 30/100 = 0.3. |
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | The GPU memory allocated to the container. |
DCGM_CUSTOM_DEV_FB_ALLOCATED | The fraction of the total GPU memory that is allocated. The value ranges from 0 to 1. |
DCGM_CUSTOM_DEV_FB_TOTAL | The total memory of the GPU. |
DCGM_CUSTOM_ILLEGAL_PROCESS_DECODE_UTIL | Illegal process decode utilization |
DCGM_CUSTOM_ILLEGAL_PROCESS_ENCODE_UTIL | Illegal process encoding utilization |
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_COPY_UTIL | Illegal process memory copy utilization |
DCGM_CUSTOM_ILLEGAL_PROCESS_MEM_USED | Memory used by illegal process |
DCGM_CUSTOM_ILLEGAL_PROCESS_SM_UTIL | Illegal process Streaming Multiprocessor (SM) utilization |
DCGM_CUSTOM_PROCESS_DECODE_UTIL | The decoder utilization of the GPU process. |
DCGM_CUSTOM_PROCESS_ENCODE_UTIL | The encoder utilization of the GPU process. |
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | The memory copy utilization of the GPU process. |
DCGM_CUSTOM_PROCESS_MEM_USED | The GPU memory currently used by the GPU process. |
DCGM_CUSTOM_PROCESS_SM_UTIL | The SM utilization of the GPU process. |
DCGM_FI_DEV_APP_MEM_CLOCK | The application memory clock speed. |
DCGM_FI_DEV_APP_SM_CLOCK | The SM application clock frequency. |
DCGM_FI_DEV_BAR1_FREE | The amount of free BAR1 memory. |
DCGM_FI_DEV_BAR1_TOTAL | Total size of Base Address Register 1 (BAR1), which maps GPU memory to the system address space. |
DCGM_FI_DEV_BAR1_USED | The amount of used BAR1 memory. |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Indicates a violation due to the board limit. The value is the duration of the violation. |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | The reasons for clock throttling. |
DCGM_FI_DEV_COUNT | Number of devices |
DCGM_FI_DEV_DEC_UTIL | Indicates the decoder utilization. |
DCGM_FI_DEV_ENC_UTIL | Indicates the encoder utilization. |
DCGM_FI_DEV_FB_FREE | The amount of free framebuffer memory. |
DCGM_FI_DEV_FB_USED | The amount of used framebuffer memory. This value corresponds to the used value for Memory-Usage from the nvidia-smi command. |
DCGM_FI_DEV_GPU_TEMP | Indicates the GPU temperature. |
DCGM_FI_DEV_GPU_UTIL | The GPU utilization: the fraction of a sampling period during which one or more kernel functions are active. The sampling period is 1 s or 1/6 s, depending on the GPU product. This metric shows only that a kernel function is using the GPU. It does not show how the GPU is used. |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | A violation triggered by the low utilization limit. The value is the duration of the violation. |
DCGM_FI_DEV_MEM_CLOCK | The memory clock frequency. |
DCGM_FI_DEV_MEM_COPY_UTIL | Indicates the memory bandwidth utilization. For example, an NVIDIA V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the memory bandwidth utilization is 50%. |
DCGM_FI_DEV_MEMORY_TEMP | Indicates the memory temperature. |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total NVLINK bandwidth |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | PCIe replay counter (records the number of retries due to data transmission errors) |
DCGM_FI_DEV_POWER_USAGE | The power usage of the GPU. |
DCGM_FI_DEV_POWER_VIOLATION | Indicates a violation caused by the power limit. The value is the duration of the violation. |
DCGM_FI_DEV_PSTATE | Device power state |
DCGM_FI_DEV_RELIABILITY_VIOLATION | Indicates a violation caused by the board's reliability limit. The value is the duration of the violation. |
DCGM_FI_DEV_RETIRED_DBE | The number of pages retired due to double-bit errors (DBE). |
DCGM_FI_DEV_RETIRED_PENDING | Number of pages pending retirement (pages in GPU memory marked as unusable due to faults) |
DCGM_FI_DEV_RETIRED_SBE | The number of pages retired due to single-bit errors (SBE). |
DCGM_FI_DEV_SM_CLOCK | Indicates the SM clock frequency. |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Indicates the duration of a violation caused by a sync boost limit. |
DCGM_FI_DEV_THERMAL_VIOLATION | Indicates a thermal violation. The value is the duration of the violation. |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | The total energy consumed since the driver was loaded. |
DCGM_FI_DEV_VIDEO_CLOCK | Video clock frequency |
DCGM_FI_DEV_XID_ERRORS | The error number of the most recent XID error that occurred over a period of time. |
DCGM_FI_PROF_DRAM_ACTIVE | The fraction of cycles that the device memory is active sending or receiving data. This metric measures Memory Bandwidth Utilization. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means that one DRAM instruction is executed in every cycle during the time interval. In practice, the maximum achievable peak value is approximately 0.8 (80%). For example, a value of 0.2 (20%) means that the device memory is read from or written to during 20% of the cycles in the time interval. |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Indicates the percentage of time that a graphics or compute engine is active over a time interval. This value is the average for all graphics and compute engines. An engine is considered active if a graphics or compute Context is attached to a thread and the Context is busy. |
DCGM_FI_PROF_NVLINK_RX_BYTES | The rate of data received over NVLink, excluding protocol headers. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, whether the data is transferred at a constant rate or in a burst. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
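DCGM_FI_PROF_NVLINK_TX_BYTES | The rate of data sent over NVLink, excluding protocol headers. This value is an average over a time interval, not an instantaneous value. |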
DCGM_FI_PROF_PCIE_RX_BYTES | The rate of data received over the PCIe bus, including protocol headers and data payloads. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is received in 1 second, the rate is 1 GB/s, whether the data is received at a constant rate or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per channel. |
DCGM_FI_PROF_PCIE_TX_BYTES | The rate of data sent over the PCIe bus, including protocol headers and data payloads. This value is an average over a time interval, not an instantaneous value. For example, if 1 GB of data is sent in 1 second, the rate is 1 GB/s, whether the data is sent at a constant rate or in a burst. The theoretical maximum bandwidth for PCIe Gen3 is 985 MB/s per channel. |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | The fraction of cycles that the FP16 (half-precision) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP16 Cores. A value of 1 (100%) means that an FP16 instruction is executed every two cycles for the entire time interval, for example, on a Volta-based GPU. A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP16 Cores at 100% utilization for the entire time interval. All SMs run their FP16 Cores at 20% utilization for the entire time interval. All SMs run their FP16 Cores at 100% utilization for one-fifth of the time interval. Other combinations. |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | The fraction of cycles that the Fused Multiply-Add (FMA) pipeline is active. FMA operations include both single-precision (FP32) and integer types. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP32 Cores. A value of 1 (100%) means that an FP32 instruction is executed every two cycles for the entire time interval, for example, on a Volta-based GPU. A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP32 Cores at 100% utilization for the entire time interval. All SMs run their FP32 Cores at 20% utilization for the entire time interval. All SMs run their FP32 Cores at 100% utilization for one-fifth of the time interval. Other combinations. |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | The fraction of cycles that the FP64 (double-precision) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the FP64 Cores. A value of 1 (100%) means that an FP64 instruction is executed every four cycles for the entire time interval, for example, on a Volta-based GPU. A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their FP64 Cores at 100% utilization for the entire time interval. All SMs run their FP64 Cores at 20% utilization for the entire time interval. All SMs run their FP64 Cores at 100% utilization for one-fifth of the time interval. Other combinations. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The fraction of cycles that the Tensor (HMMA/IMMA) pipeline is active. This value is an average over a time interval, not an instantaneous value. A higher value indicates higher utilization of the Tensor Cores. A value of 1 (100%) means that a Tensor instruction is issued every other cycle for the entire time interval, because one instruction takes two cycles to complete. A value of 0.2 (20%) could mean any of the following: 20% of the Streaming Multiprocessors (SMs) run their Tensor Cores at 100% utilization for the entire time interval. All SMs run their Tensor Cores at 20% utilization for the entire time interval. All SMs run their Tensor Cores at 100% utilization for one-fifth of the time interval. Other combinations. |
DCGM_FI_PROF_SM_ACTIVE | The percentage of time within an interval that at least one warp is active on a Streaming Multiprocessor (SM). This value is the average across all SMs and is not sensitive to the number of threads per block. A warp is active when it has been scheduled and allocated resources. An active warp can be in a computing or a non-computing state, such as waiting for a memory request. A value below 0.5 indicates that the GPU is underutilized, while a value above 0.8 is necessary for high efficiency. Assume a GPU has N SMs. If a kernel function uses N thread blocks and runs on all N SMs for the entire interval, the value is 1 (100%). If a kernel function runs with N/5 thread blocks during the interval, the value is 0.2. If a kernel function uses N thread blocks but runs for only 1/5 of the interval, the value is 0.2. |
DCGM_FI_PROF_SM_OCCUPANCY | The ratio of active warps to the maximum number of resident warps on a Streaming Multiprocessor (SM). This value is the average across all SMs over a time interval. A higher occupancy does not necessarily mean higher GPU utilization. Higher occupancy indicates more effective GPU utilization only for workloads that are limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE). |
nvidia_gpu_allocated_num_devices | The number of allocated GPU devices. Warning: This metric will be deprecated. |
nvidia_gpu_memory_allocated_bytes | The allocated memory on the GPU device. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED. |
nvidia_gpu_sharing_memory | The memory allocated for GPU sharing. Warning: This metric will be deprecated and replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED. |
up | Connectivity for metric collection |
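Several descriptions in the table above include worked arithmetic: the allocated computing power ratio (30/100 = 0.3), the memory bandwidth utilization (450 GB/s out of 900 GB/s = 50%), and the DCGM_FI_PROF_SM_ACTIVE scenarios. The following Python sketch reproduces those calculations. The function names are hypothetical, the SM count of 80 is only an illustrative value for N, and the SM model is a simplification in which a given number of SMs are busy for the same share of the interval while the rest stay idle.

```python
def allocated_ratio(allocated_units: float, total_units: float) -> float:
    """Share of a GPU card's computing power allocated to a container
    (the DCGM_CUSTOM_CONTAINER_CP_ALLOCATED example)."""
    return allocated_units / total_units


def bandwidth_utilization(current_gb_per_s: float, max_gb_per_s: float) -> float:
    """Memory bandwidth utilization (the DCGM_FI_DEV_MEM_COPY_UTIL example)."""
    return current_gb_per_s / max_gb_per_s


def sm_active(busy_sms: int, total_sms: int, busy_share_of_interval: float) -> float:
    """Average over all SMs of the share of the interval with an active warp.

    Simplified model: busy_sms SMs are active for busy_share_of_interval,
    and the remaining SMs are idle for the whole interval.
    """
    return (busy_sms / total_sms) * busy_share_of_interval


n = 80  # illustrative number of SMs (an assumption for this sketch)
print(allocated_ratio(30, 100))         # 30 of 100 units allocated
print(bandwidth_utilization(450, 900))  # 450 GB/s of a 900 GB/s maximum
print(sm_active(n, n, 1.0))             # N thread blocks for the whole interval
print(sm_active(n // 5, n, 1.0))        # N/5 thread blocks for the whole interval
print(sm_active(n, n, 1 / 5))           # N thread blocks for 1/5 of the interval
```

The last two calls illustrate why a single DCGM_FI_PROF_SM_ACTIVE value cannot distinguish between a few SMs that are busy all the time and all SMs that are busy for a short time.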
Cost-Exporter (Job name: alibaba-cloud-cost-exporter)
Metric | Description |
deducted_by_cash_coupons | The amount deducted by coupons from a bill for the current instance. |
deducted_by_prepaid_card | The amount deducted by a prepaid card from a bill for the current instance. |
invoice_discount | The discount amount for a bill of the current instance. |
list_price | The unit price for a bill of the current instance. |
node_current_price | The actual price of the current node. |
node_payAsYouGo_price | The pay-as-you-go price of the current node. |
node_payByPeriod_price | The subscription price of the current node. |
node_spot_price | The price of the current node, based on the pricing of a Spot Instance with the same specifications. |
outstanding_amount | The outstanding amount for a bill of the current instance. |
payent_amount | The cash payment amount for a bill of the current instance. |
pretax_amount | The amount payable for a bill of the current instance. |
pretax_gross_amount | The original amount for a bill of the current instance. |
usage | The resource usage for a bill of the current instance. |
up | The connectivity for metric collection. |
Ingress (Job name: arms-ack-ingress or ingress-ask-default)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | The duration of an append operation by the Alibaba Cloud Prometheus agent (in seconds). |
aliyun_prometheus_agent_job_discovery_status | Status of scrape job discovery for the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_scrape_custom_error | The number of custom scrape errors for the Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | Total number of scrapes by the Alibaba Cloud Prometheus agent per Target |
aliyun_prometheus_agent_target_info | Target information for the Alibaba Cloud Prometheus agent |
nginx_ingress_controller_admission_config_size | Nginx Ingress controller - Admission configuration size |
nginx_ingress_controller_admission_render_duration | Nginx Ingress controller - Rendering duration |
nginx_ingress_controller_admission_render_ingresses | Nginx Ingress controller - Rendered Ingress count |
nginx_ingress_controller_admission_roundtrip_duration | Nginx Ingress controller - Roundtrip processing duration |
nginx_ingress_controller_admission_tested_duration | Nginx Ingress controller - Test duration |
nginx_ingress_controller_admission_tested_ingresses | Nginx Ingress controller - Number of Ingresses tested |
nginx_ingress_controller_build_info | Nginx Ingress controller - Build information |
nginx_ingress_controller_bytes_sent_bucket | Nginx Ingress controller - Total bytes sent (bucket) |
nginx_ingress_controller_bytes_sent_count | Nginx Ingress controller - Total bytes sent (count) |
nginx_ingress_controller_bytes_sent_sum | Nginx Ingress controller - Sent bytes total (Sum) |
nginx_ingress_controller_check_errors | Nginx Ingress controller - Check errors |
nginx_ingress_controller_check_success | Nginx Ingress controller - Successful check count |
nginx_ingress_controller_config_hash | Nginx Ingress controller - Configuration hash |
nginx_ingress_controller_config_last_reload_successful | Nginx Ingress controller - Whether the last configuration reload was successful |
nginx_ingress_controller_config_last_reload_successful_timestamp_seconds | Nginx Ingress controller - Timestamp of the last successful configuration reload (seconds) |
nginx_ingress_controller_connect_duration_seconds_bucket | Nginx Ingress controller - Connection duration (seconds) - Bucket |
nginx_ingress_controller_connect_duration_seconds_count | Nginx Ingress controller - connection duration (seconds) - count |
nginx_ingress_controller_connect_duration_seconds_sum | Nginx Ingress controller - Connection duration (seconds) - Sum |
nginx_ingress_controller_errors | Nginx Ingress controller - Error count |
nginx_ingress_controller_header_duration_seconds_bucket | Nginx Ingress controller - Header processing time (s) - Bucket |
nginx_ingress_controller_header_duration_seconds_count | Nginx Ingress controller - Header processing time (seconds) - Count |
nginx_ingress_controller_header_duration_seconds_sum | Total header processing time for the Nginx Ingress controller (seconds) |
nginx_ingress_controller_ingress_upstream_latency_seconds | Nginx Ingress controller upstream latency (seconds) |
nginx_ingress_controller_ingress_upstream_latency_seconds_count | Nginx Ingress controller upstream latency count |
nginx_ingress_controller_ingress_upstream_latency_seconds_sum | Nginx Ingress controller upstream latency sum (seconds) |
nginx_ingress_controller_leader_election_status | Nginx Ingress controller leader election status |
nginx_ingress_controller_nginx_process_connections | Nginx Ingress controller nginx process connections |
nginx_ingress_controller_nginx_process_connections_total | Total connections for the nginx process in the Nginx Ingress controller |
nginx_ingress_controller_nginx_process_cpu_seconds_total | Total CPU seconds for the Nginx Ingress controller's nginx process |
nginx_ingress_controller_nginx_process_num_procs | Number of Nginx processes for the Nginx Ingress controller |
nginx_ingress_controller_nginx_process_oldest_start_time_seconds | Start time of the oldest nginx process in the Nginx Ingress controller (seconds) |
nginx_ingress_controller_nginx_process_read_bytes_total | Total bytes read by the nginx process of the Nginx Ingress controller |
nginx_ingress_controller_nginx_process_requests_total | Total requests for the Nginx Ingress controller's nginx process |
nginx_ingress_controller_nginx_process_resident_memory_bytes | Resident memory size (bytes) of the nginx process for the Nginx Ingress controller |
nginx_ingress_controller_nginx_process_virtual_memory_bytes | Virtual memory of the nginx process for the Nginx Ingress controller in bytes |
nginx_ingress_controller_nginx_process_write_bytes_total | Total bytes written by the nginx process of the Nginx Ingress controller |
nginx_ingress_controller_orphan_ingress | Number of orphan Ingresses for the Nginx Ingress controller |
nginx_ingress_controller_request_duration_seconds_bucket | Nginx Ingress controller request latency distribution (seconds) |
nginx_ingress_controller_request_duration_seconds_count | Nginx Ingress controller request duration count |
nginx_ingress_controller_request_duration_seconds_sum | Sum of Nginx Ingress controller request time (seconds) |
nginx_ingress_controller_request_size_bucket | Nginx Ingress controller request size distribution |
nginx_ingress_controller_request_size_count | Nginx Ingress controller request size count |
nginx_ingress_controller_request_size_sum | Nginx Ingress controller total request size |
nginx_ingress_controller_requests | Total Nginx Ingress controller requests |
nginx_ingress_controller_response_duration_seconds_bucket | Nginx Ingress controller response time distribution (seconds) |
nginx_ingress_controller_response_duration_seconds_count | Nginx Ingress controller response time count |
nginx_ingress_controller_response_duration_seconds_sum | Total Nginx Ingress controller response time (seconds) |
nginx_ingress_controller_response_size_bucket | Nginx Ingress controller response size distribution |
nginx_ingress_controller_response_size_count | Nginx Ingress controller response size count |
nginx_ingress_controller_response_size_sum | Total Nginx Ingress controller response size |
nginx_ingress_controller_ssl_certificate_info | Nginx Ingress controller SSL certificate information |
nginx_ingress_controller_ssl_expire_time_seconds | Nginx Ingress controller SSL certificate expiration time (Unix timestamp in seconds) |
nginx_ingress_controller_success | Cumulative number of Nginx Ingress controller reload operations |
up | Metric collection connectivity |
Koordinator (Job names: kube-system/koordlet-metrics-podmonitor, koord-manager-metrics-service)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | The duration of append operations for the Alibaba Cloud Prometheus agent, in seconds. |
aliyun_prometheus_agent_scrapes_by_target_total | The total number of scrapes by the Alibaba Cloud Prometheus agent, per target. |
aliyun_prometheus_agent_target_info | The target information for the Alibaba Cloud Prometheus agent. |
koord_manager_recommender_recommendation_workload_target | The metric for recommended workload specifications from the resource profiling feature. |
koordlet_container_resource_limits | The metric for container resource limits. |
koordlet_container_resource_requests | The metric for container resource requests. |
koordlet_node_priority_resource_reclaimable | The metric for reclaimable resources on a node, by priority. |
koordlet_node_resource_allocatable | The metric for allocatable resources on a node. |
slo_manager_recommender_recommendation_workload_target | The metric for recommended workload specifications from the resource profiling feature. (Deprecated) |
up | The connectivity for metric scraping. |
ACK dedicated etcd component (Job name: etcd)
Metric | Description |
aliyun_prometheus_agent_append_duration_seconds | Duration of the append operation for the Alibaba Cloud Prometheus agent (seconds) |
aliyun_prometheus_agent_job_discovery_status | Status of scrape job discovery for the Alibaba Cloud Prometheus agent |
aliyun_prometheus_agent_scrape_custom_error | The number of errors from custom scrapes by the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | The total number of scrapes by target for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_target_info | Target information for an Alibaba Cloud Prometheus agent |
cpu_utilization_core | CPU core utilization |
etcd_cluster_version | The version of the etcd cluster. |
etcd_debugging_auth_revision | etcd debug authentication revision |
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket | etcd debugging disk backend commit rebalance duration distribution (seconds) |
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_count | The count of commit rebalance duration observations for the etcd debugging disk backend. |
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_sum | Total commit rebalance duration for the etcd debug disk backend (seconds) |
etcd_debugging_disk_backend_commit_spill_duration_seconds_bucket | The distribution of commit spill duration for the etcd debugging disk backend |
etcd_debugging_disk_backend_commit_spill_duration_seconds_count | The total number of commit spills for the etcd debug disk backend. |
etcd_debugging_disk_backend_commit_spill_duration_seconds_sum | Sum of the commit spill duration for the etcd debugging disk backend (seconds) |
etcd_debugging_disk_backend_commit_write_duration_seconds_bucket | etcd debug disk backend commit write duration distribution (seconds) |
etcd_debugging_disk_backend_commit_write_duration_seconds_count | The total number of write commits to the etcd debug disk backend. |
etcd_debugging_disk_backend_commit_write_duration_seconds_sum | The total duration of commit writes to the etcd debug disk backend, in seconds. |
etcd_debugging_lease_granted_total | Total number of leases granted for etcd debugging |
etcd_debugging_lease_renewed_total | The total number of etcd debugging lease renewals |
etcd_debugging_lease_revoked_total | Total number of etcd debugging leases revoked. |
etcd_debugging_lease_ttl_total_bucket | Distribution of etcd debug lease TTLs |
etcd_debugging_lease_ttl_total_count | Total count of etcd debug lease TTLs |
etcd_debugging_lease_ttl_total_sum | etcd lease TTL sum (seconds) |
etcd_debugging_mvcc_compact_revision | etcd MVCC compaction revision for debugging |
etcd_debugging_mvcc_current_revision | Current MVCC revision for etcd debugging |
etcd_debugging_mvcc_db_compaction_keys_total | Total keys compacted in the etcd MVCC database for debugging |
etcd_debugging_mvcc_db_compaction_last | Last compaction time of the etcd MVCC database for debugging. |
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket | The bucket for the pause duration in milliseconds during etcd MVCC database compaction for debugging. |
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count | The count of pause durations (in milliseconds) during MVCC database compaction for etcd debugging. |
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum | Sum of pause durations for etcd MVCC database compaction during debugging (milliseconds). |
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket | Distribution of the total duration of MVCC database compaction for etcd debugging (in milliseconds) |
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count | The count of total duration observations for etcd debug MVCC database compaction. |
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sum | Sum of the total duration of etcd MVCC database compaction for debugging (milliseconds) |
etcd_debugging_mvcc_db_total_size_in_bytes | Total size of the etcd debug MVCC database in bytes |
etcd_debugging_mvcc_delete_total | Total MVCC delete operations for etcd debugging |
etcd_debugging_mvcc_events_total | Total number of etcd debug events |
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket | The bucket for the etcd debugging MVCC index compaction pause duration in milliseconds. |
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_count | Count of etcd debug MVCC index compaction pauses. |
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sum | The sum of pause durations in milliseconds for etcd MVCC index compaction during debugging. |
etcd_debugging_mvcc_keys_total | The total number of MVCC keys for etcd debugging. |
etcd_debugging_mvcc_pending_events_total | Total number of pending MVCC events for etcd debugging |
etcd_debugging_mvcc_put_total | Total number of MVCC put operations for debugging etcd |
etcd_debugging_mvcc_range_total | Total etcd MVCC range queries |
etcd_debugging_mvcc_slow_watcher_total | Total number of slow watchers for etcd debugging |
etcd_debugging_mvcc_total_put_size_in_bytes | Total MVCC put size for etcd debugging (bytes) |
etcd_debugging_mvcc_txn_total | Total Multi-Version Concurrency Control (MVCC) transactions for etcd debugging |
etcd_debugging_mvcc_watch_stream_total | Total etcd debug watch streams |
etcd_debugging_mvcc_watcher_total | Total number of etcd debug watchers |
etcd_debugging_server_lease_expired_total | Total expired leases for the etcd debugging server. |
etcd_debugging_snap_save_marshalling_duration_seconds_bucket | Distribution of marshalling durations when saving etcd debug snapshots |
etcd_debugging_snap_save_marshalling_duration_seconds_count | The count of marshalling operations for saving an etcd debug snapshot. The duration is measured in seconds. |
etcd_debugging_snap_save_marshalling_duration_seconds_sum | The total time in seconds spent marshalling debugging snapshots for saving. |
etcd_debugging_snap_save_total_duration_seconds_bucket | Distribution of the total time it takes to save an etcd debug snapshot, in seconds. |
etcd_debugging_snap_save_total_duration_seconds_count | Total count of etcd debug snapshot save operations (duration in seconds) |
etcd_debugging_snap_save_total_duration_seconds_sum | The total time, in seconds, spent saving etcd debug snapshots. |
etcd_debugging_store_expires_total | Total number of etcd debugging store expirations. |
etcd_debugging_store_reads_total | Total debug store reads in etcd. |
etcd_debugging_store_watch_requests_total | The total number of watch requests for the etcd debug store. |
etcd_debugging_store_watchers | Number of etcd debugging store watchers |
etcd_debugging_store_writes_total | Total etcd debug store writes |
etcd_disk_backend_commit_duration_seconds_bucket | etcd disk backend commit duration bucket (seconds) |
etcd_disk_backend_commit_duration_seconds_count | The total number of etcd disk backend commits. |
etcd_disk_backend_commit_duration_seconds_sum | Total duration of etcd disk backend commits, in seconds. |
etcd_disk_backend_defrag_duration_seconds_bucket | Distribution of etcd disk backend defragmentation duration (seconds) |
etcd_disk_backend_defrag_duration_seconds_count | Count of etcd disk backend defragmentation operations |
etcd_disk_backend_defrag_duration_seconds_sum | The sum of etcd disk backend defragmentation durations, in seconds. |
etcd_disk_backend_snapshot_duration_seconds_bucket | Distribution of etcd disk backend snapshot duration (seconds) |
etcd_disk_backend_snapshot_duration_seconds_count | The total count of timed etcd disk backend snapshots. |
etcd_disk_backend_snapshot_duration_seconds_sum | Total duration of etcd disk backend snapshots in seconds. |
etcd_disk_defrag_inflight | etcd disk defragmentation in progress |
etcd_disk_wal_fsync_duration_seconds_bucket | etcd disk WAL fsync duration seconds bucket |
etcd_disk_wal_fsync_duration_seconds_count | The total number of etcd disk WAL fsync operations. |
etcd_disk_wal_fsync_duration_seconds_sum | Sum of the etcd disk WAL fsync duration in seconds. |
etcd_disk_wal_write_bytes_total | Total bytes written to the etcd disk WAL |
etcd_grpc_proxy_cache_hits_total | Total number of etcd gRPC proxy cache hits |
etcd_grpc_proxy_cache_keys_total | The total number of etcd gRPC proxy cache keys. |
etcd_grpc_proxy_cache_misses_total | Total etcd gRPC proxy cache misses |
etcd_grpc_proxy_events_coalescing_total | Total number of events merged by the etcd gRPC proxy |
etcd_grpc_proxy_watchers_coalescing_total | Total number of coalesced watchers in the etcd gRPC proxy. |
etcd_mvcc_db_open_read_transactions | The number of open read transactions in the etcd MVCC database. |
etcd_mvcc_db_total_size_in_bytes | Total size of the etcd MVCC database (bytes) |
etcd_mvcc_db_total_size_in_use_in_bytes | The total size in use of the etcd MVCC database, in bytes. |
etcd_mvcc_delete_total | Total etcd MVCC deletes |
etcd_mvcc_hash_duration_seconds_bucket | Bucket for etcd MVCC hash duration in seconds. |
etcd_mvcc_hash_duration_seconds_count | Count of etcd MVCC hash durations (seconds) |
etcd_mvcc_hash_duration_seconds_sum | Total etcd MVCC hash duration in seconds |
etcd_mvcc_hash_rev_duration_seconds_bucket | etcd MVCC hash revision duration distribution (seconds) |
etcd_mvcc_hash_rev_duration_seconds_count | The count of etcd MVCC hash revision durations in seconds. |
etcd_mvcc_hash_rev_duration_seconds_sum | Sum of etcd MVCC hash revision duration, in seconds |
etcd_mvcc_put_total | The total number of etcd MVCC Put operations |
etcd_mvcc_range_total | Total number of etcd MVCC range queries |
etcd_mvcc_txn_total | Total number of etcd MVCC transactions |
etcd_network_active_peers | Number of active etcd network peers |
etcd_network_client_grpc_received_bytes_total | Total number of bytes received by the etcd network client over gRPC |
etcd_network_client_grpc_sent_bytes_total | The total number of bytes sent by the etcd gRPC client. |
etcd_network_disconnected_peers_total | Total number of disconnected peers in the etcd network |
etcd_network_peer_received_bytes_total | Total bytes received by the etcd network peer |
etcd_network_peer_received_failures_total | Total number of failed receives from etcd network peers |
etcd_network_peer_round_trip_time_seconds_bucket | etcd network peer round-trip time distribution (seconds) |
etcd_network_peer_round_trip_time_seconds_count | Count of round trip times in seconds for etcd network peers |
etcd_network_peer_round_trip_time_seconds_sum | Total round trip time in seconds for etcd network peers |
etcd_network_peer_sent_bytes_total | Total bytes sent to etcd peers |
etcd_network_peer_sent_failures_total | Total etcd network peer send failures |
etcd_network_server_stream_failures_total | Total number of etcd network server stream failures |
etcd_network_snapshot_receive_inflights_total | The number of concurrent requests to receive etcd network snapshots. |
etcd_network_snapshot_receive_success | The total number of etcd network snapshots received successfully. |
etcd_network_snapshot_receive_total_duration_seconds_bucket | Distribution bucket for the total duration, in seconds, of accepting etcd network snapshots. |
etcd_network_snapshot_receive_total_duration_seconds_count | The total count of etcd network snapshot receive operations. |
etcd_network_snapshot_receive_total_duration_seconds_sum | Total time spent receiving etcd network snapshots, in seconds. |
etcd_network_snapshot_send_inflights_total | The number of concurrent requests for sending etcd network snapshots. |
etcd_network_snapshot_send_success | The total number of etcd network snapshots sent successfully. |
etcd_network_snapshot_send_total_duration_seconds_bucket | Total duration distribution for sending etcd network snapshots (seconds) |
etcd_network_snapshot_send_total_duration_seconds_count | Total number of etcd network snapshot send operations. |
etcd_network_snapshot_send_total_duration_seconds_sum | Sum of the total duration for sending etcd network snapshots, in seconds. |
etcd_server_apply_duration_seconds_bucket | etcd server apply duration distribution (seconds) |
etcd_server_apply_duration_seconds_count | Count of apply operations for the etcd server |
etcd_server_apply_duration_seconds_sum | The total time, in seconds, that the etcd server has spent applying requests. |
etcd_server_client_requests_total | Total number of client requests to the etcd server |
etcd_server_go_version | The Go version of the etcd server |
etcd_server_has_leader | Whether the etcd server has a leader (1 if a leader exists, 0 otherwise). |
etcd_server_health_failures | Number of etcd server health check failures |
etcd_server_health_success | Number of successful etcd server health checks |
etcd_server_heartbeat_send_failures_total | Total number of failed heartbeat sends from the etcd server |
etcd_server_id | etcd server ID |
etcd_server_is_leader | Whether the etcd server is the leader |
etcd_server_is_learner | Whether the etcd server is a learner |
etcd_server_leader_changes_seen_total | The total number of leader changes seen by the etcd server. |
etcd_server_learner_promote_successes | The number of successful learner promotions in the etcd server. |
etcd_server_proposals_applied_total | Total proposals applied on the etcd server |
etcd_server_proposals_committed_total | Total number of proposals committed by the etcd server |
etcd_server_proposals_failed_total | Total number of failed etcd server proposals |
etcd_server_proposals_pending | Number of pending etcd server proposals |
etcd_server_quota_backend_bytes | The backend storage quota for the etcd server in bytes. |
etcd_server_read_indexes_failed_total | Total number of failed index reads on the etcd server. |
etcd_server_slow_apply_total | Total slow applies on the etcd server |
etcd_server_slow_read_indexes_total | The total number of slow read indexes for the etcd server. |
etcd_server_snapshot_apply_in_progress_total | Total etcd server snapshot applications in progress |
etcd_server_version | etcd server version |
etcd_snap_db_fsync_duration_seconds_bucket | Distribution of fsync duration for the etcd snapshot database (seconds). |
etcd_snap_db_fsync_duration_seconds_count | Total fsync count for the etcd snapshot database |
etcd_snap_db_fsync_duration_seconds_sum | Total fsync duration for the etcd snapshot database, in seconds. |
etcd_snap_db_save_total_duration_seconds_bucket | The bucket for the total duration, in seconds, to save the etcd snapshot database. |
etcd_snap_db_save_total_duration_seconds_count | Count of save operations for the etcd snapshot database |
etcd_snap_db_save_total_duration_seconds_sum | Sum of the total save duration for the etcd snapshot database (seconds) |
etcd_snap_fsync_duration_seconds_bucket | etcd snapshot fsync duration distribution (seconds) |
etcd_snap_fsync_duration_seconds_count | Count of etcd snapshot fsync operations |
etcd_snap_fsync_duration_seconds_sum | etcd snapshot fsync total duration (seconds) |
grpc_server_handled_total | Total gRPC server requests processed |
grpc_server_msg_received_total | Total messages received by the gRPC server |
grpc_server_msg_sent_total | Total gRPC server messages sent |
grpc_server_started_total | Total RPCs started on the gRPC server |
memory_utilization_byte | Memory utilization in bytes |
os_fd_limit | Operating system file descriptor limit |
os_fd_used | Number of operating system file descriptors in use |
up | Connectivity for metric collection |
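The `_bucket`, `_count`, and `_sum` series listed in the table above are the three parts of a Prometheus histogram. As a rough illustration of how cumulative `_bucket` values can be turned into a latency percentile, the following Python sketch uses the same linear interpolation idea as the PromQL `histogram_quantile` function. The helper name `estimate_quantile` and the sample bucket values are hypothetical and are not taken from a real cluster.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket (math.inf).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # The rank falls in the +Inf bucket or in an empty bucket:
                # fall back to the lower boundary.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts, shaped like the samples of
# etcd_disk_wal_fsync_duration_seconds_bucket (upper bounds in seconds).
sample_buckets = [
    (0.001, 10),
    (0.002, 40),
    (0.004, 90),
    (0.008, 100),
    (math.inf, 100),
]

print("estimated p99 (seconds):", estimate_quantile(0.99, sample_buckets))
```

The result is only an estimate: its accuracy depends on how finely the bucket boundaries are spaced around the percentile of interest.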
ACK Dedicated Scheduler (Job name: ack-scheduler)
Metric | Description |
aggregator_discovery_aggregation_count_total | Total count of aggregator discovery aggregations. |
aliyun_prometheus_agent_append_duration_seconds | Duration of append operations for the Alibaba Cloud Prometheus agent, in seconds. |
aliyun_prometheus_agent_job_discovery_status | Discovery status of scrape jobs for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrape_custom_error | Number of custom scrape errors for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_scrapes_by_target_total | Total number of scrapes by target for the Alibaba Cloud Prometheus agent. |
aliyun_prometheus_agent_target_info | Target information for the Alibaba Cloud Prometheus agent. |
apiserver_audit_event_total | Total number of API server audit events. |
apiserver_audit_requests_rejected_total | Total number of rejected API server audit requests. |
apiserver_client_certificate_expiration_seconds_bucket | Distribution of remaining seconds until API server client certificate expiration. |
apiserver_client_certificate_expiration_seconds_count | Count of remaining seconds until API server client certificate expiration. |
apiserver_client_certificate_expiration_seconds_sum | Sum of remaining seconds until API server client certificate expiration. |
apiserver_delegated_authn_request_duration_seconds_bucket | Distribution of API server delegated authentication request duration, in seconds. |
apiserver_delegated_authn_request_duration_seconds_count | Count of API server delegated authentication request duration. |
apiserver_delegated_authn_request_duration_seconds_sum | Sum of API server delegated authentication request duration. |
apiserver_delegated_authn_request_total | Total number of API server delegated authentication requests. |
apiserver_delegated_authz_request_duration_seconds_bucket | Distribution of API server delegated authorization request duration, in seconds. |
apiserver_delegated_authz_request_duration_seconds_count | Count of API server delegated authorization request duration. |
apiserver_delegated_authz_request_duration_seconds_sum | Sum of API server delegated authorization request duration, in seconds. |
apiserver_delegated_authz_request_total | Total number of API server delegated authorization requests. |
apiserver_encryption_config_controller_automatic_reload_failures_total | Total number of automatic reload failures for the API server encryption configuration controller. |
apiserver_encryption_config_controller_automatic_reload_success_total | Total number of successful automatic reloads for the API server encryption configuration controller. |
apiserver_envelope_encryption_dek_cache_fill_percent | Cache fill percentage for the API server envelope encryption Data Encryption Key (DEK). |
apiserver_storage_data_key_generation_duration_seconds_bucket | Distribution of API server storage data key generation duration. |
apiserver_storage_data_key_generation_duration_seconds_count | Count of API server storage data key generation duration. |
apiserver_storage_data_key_generation_duration_seconds_sum | Sum of API server storage data key generation duration, in seconds. |
apiserver_storage_data_key_generation_failures_total | Total number of API server storage data key generation failures. |
apiserver_storage_envelope_transformation_cache_misses_total | Total number of cache misses for API server storage envelope transformation. |
apiserver_webhooks_x509_insecure_sha1_total | Total count of insecure SHA1 in API server webhook X.509 certificates. |
apiserver_webhooks_x509_missing_san_total | Total count of API server webhooks with missing Subject Alternative Name (SAN) in X.509 certificates. |
authenticated_user_requests | Authenticated user requests. |
authentication_attempts | Number of authentication attempts. |
authentication_duration_seconds_bucket | Distribution of authentication duration. |
authentication_duration_seconds_count | Count of authentication duration. |
authentication_duration_seconds_sum | Sum of authentication duration, in seconds. |
authentication_token_cache_active_fetch_count | Count of active fetches from the authentication token cache. |
authentication_token_cache_fetch_total | Total number of fetches from the authentication token cache. |
authentication_token_cache_request_duration_seconds_bucket | Distribution of authentication token cache request duration. |
authentication_token_cache_request_duration_seconds_count | Count of authentication token cache request duration. |
authentication_token_cache_request_duration_seconds_sum | Sum of authentication token cache request duration, in seconds. |
authentication_token_cache_request_total | Total number of authentication token cache requests. |
authorization_attempts_total | Total number of authorization attempts. |
authorization_duration_seconds_bucket | Distribution of authorization duration, in seconds. |
authorization_duration_seconds_count | Count of authorization duration. |
authorization_duration_seconds_sum | Sum of authorization duration. |
cardinality_enforcement_unexpected_categorizations_total | Total number of unexpected categorizations from cardinality enforcement. |
kubernetes_build_info | Kubernetes build information. |
kubernetes_feature_enabled | Enabled Kubernetes feature. |
leader_election_master_status | Status of the leader election master. |
registered_metric_total | Total number of registered metrics. |
registered_metrics_total | Total number of registered metrics. |
rest_client_exec_plugin_certificate_rotation_age_bucket | Buckets for the age of rotated certificates for the REST client exec plugin. |
rest_client_exec_plugin_certificate_rotation_age_count | Count of the age of rotated certificates for the REST client exec plugin. |
rest_client_exec_plugin_certificate_rotation_age_sum | Sum of the age of rotated certificates for the REST client exec plugin. |
rest_client_rate_limiter_duration_seconds_bucket | Distribution of REST client rate limiter duration. |
rest_client_rate_limiter_duration_seconds_count | Count of REST client rate limiter duration, in seconds. |
rest_client_rate_limiter_duration_seconds_sum | Sum of REST client rate limiter duration, in seconds. |
rest_client_request_duration_seconds_bucket | Buckets for REST client request duration, in seconds. |
rest_client_request_duration_seconds_count | Count of REST client request duration. |
rest_client_request_duration_seconds_sum | Sum of REST client request duration, in seconds. |
rest_client_request_retries_total | Total number of REST client request retries. |
rest_client_request_size_bytes_bucket | Distribution of REST client request size, in bytes. |
rest_client_request_size_bytes_count | Count of REST client request size, in bytes. |
rest_client_request_size_bytes_sum | Sum of REST client request size, in bytes. |
rest_client_requests_total | Total number of REST client requests. |
rest_client_response_size_bytes_bucket | Buckets for REST client response size, in bytes. |
rest_client_response_size_bytes_count | Count of REST client response size, in bytes. |
rest_client_response_size_bytes_sum | Sum of REST client response size, in bytes. |
rest_client_transport_cache_entries | Number of REST client transport cache entries. |
rest_client_transport_create_calls_total | Total number of REST client transport creation calls. |
scheduler_binding_duration_seconds_bucket | Buckets for scheduler binding duration, in seconds. |
scheduler_binding_duration_seconds_count | Count of binding duration. |
scheduler_binding_duration_seconds_sum | Sum of scheduler binding duration, in seconds. |
scheduler_e2e_scheduling_duration_seconds_bucket | Distribution of scheduler end-to-end scheduling duration. |
scheduler_e2e_scheduling_duration_seconds_count | Count of scheduler end-to-end scheduling duration. |
scheduler_e2e_scheduling_duration_seconds_sum | Sum of scheduler end-to-end scheduling duration, in seconds. |
scheduler_framework_extension_point_duration_seconds_bucket | Distribution of scheduler framework extension point duration. |
scheduler_framework_extension_point_duration_seconds_count | Count of scheduler framework extension point duration. |
scheduler_framework_extension_point_duration_seconds_sum | Sum of scheduler framework extension point duration. |
scheduler_goroutines | Number of scheduler goroutines. |
scheduler_pending_pods | Number of pending pods in the scheduler. |
scheduler_plugin_evaluation_total | Total number of scheduler plugin evaluations. |
scheduler_plugin_execution_duration_seconds_bucket | Distribution of scheduler plugin execution duration, in seconds. |
scheduler_plugin_execution_duration_seconds_count | Count of scheduler plugin execution duration. |
scheduler_plugin_execution_duration_seconds_sum | Sum of scheduler plugin execution duration, in seconds. |
scheduler_pod_preemption_victims_bucket | Buckets for the number of pod preemption victims in the scheduler. |
scheduler_pod_preemption_victims_count | Count of pod preemption victims in the scheduler. |
scheduler_pod_preemption_victims_sum | Sum of pod preemption victims in the scheduler. |
scheduler_pod_scheduling_attempts_bucket | Buckets for the number of pod scheduling attempts in the scheduler. |
scheduler_pod_scheduling_attempts_count | Count of pod scheduling attempts in the scheduler. |
scheduler_pod_scheduling_attempts_sum | Sum of pod scheduling attempts in the scheduler. |
scheduler_pod_scheduling_duration_seconds_bucket | Buckets for pod scheduling duration in the scheduler, in seconds. |
scheduler_pod_scheduling_duration_seconds_count | Count of pod scheduling duration in the scheduler. |
scheduler_pod_scheduling_duration_seconds_sum | Sum of pod scheduling duration in the scheduler, in seconds. |
scheduler_pod_scheduling_sli_duration_seconds_bucket | Buckets for pod scheduling Service Level Indicator (SLI) duration. |
scheduler_pod_scheduling_sli_duration_seconds_count | Count of pod scheduling Service Level Indicator (SLI) duration in the scheduler. |
scheduler_pod_scheduling_sli_duration_seconds_sum | Sum of pod scheduling Service Level Indicator (SLI) duration. |
scheduler_preemption_attempts_total | Total number of preemption attempts in the scheduler. |
scheduler_preemption_victims_bucket | Buckets for the number of preemption victims in the scheduler. |
scheduler_preemption_victims_count | Count of preemption victims in the scheduler. |
scheduler_preemption_victims_sum | Total number of preemption victims in the scheduler. |
scheduler_queue_incoming_pods_total | Total number of incoming pods in the scheduler queue. |
scheduler_schedule_attempts_total | Total number of scheduling attempts in the scheduler. |
scheduler_scheduler_cache_size | Size of the scheduler cache. |
scheduler_scheduler_goroutines | Number of scheduler goroutines. |
scheduler_scheduling_algorithm_duration_seconds_bucket | Distribution of scheduler scheduling algorithm duration, in seconds. |
scheduler_scheduling_algorithm_duration_seconds_count | Count of scheduler scheduling algorithm duration, in seconds. |
scheduler_scheduling_algorithm_duration_seconds_sum | Sum of scheduler scheduling algorithm duration, in seconds. |
scheduler_scheduling_algorithm_predicate_evaluation_seconds_bucket | Buckets for scheduler scheduling algorithm predicate evaluation duration, in seconds. |
scheduler_scheduling_algorithm_predicate_evaluation_seconds_count | Count of scheduling algorithm predicate evaluation duration, in seconds. |
scheduler_scheduling_algorithm_predicate_evaluation_seconds_sum | Sum of scheduling algorithm predicate evaluation duration, in seconds. |
scheduler_scheduling_algorithm_preemption_evaluation_seconds_bucket | Buckets for scheduling algorithm preemption evaluation duration, in seconds. |
scheduler_scheduling_algorithm_preemption_evaluation_seconds_count | Count of scheduling algorithm preemption evaluation duration, in seconds. |
scheduler_scheduling_algorithm_preemption_evaluation_seconds_sum | Sum of scheduling algorithm preemption evaluation duration, in seconds. |
scheduler_scheduling_algorithm_priority_evaluation_seconds_bucket | Buckets for scheduler scheduling algorithm priority evaluation duration, in seconds. |
scheduler_scheduling_algorithm_priority_evaluation_seconds_count | Count of scheduling algorithm priority evaluation duration, in seconds. |
scheduler_scheduling_algorithm_priority_evaluation_seconds_sum | Sum of scheduling algorithm priority evaluation duration, in seconds. |
scheduler_scheduling_attempt_duration_seconds_bucket | Distribution of scheduler scheduling attempt duration. |
scheduler_scheduling_attempt_duration_seconds_count | Count of scheduler scheduling attempt duration. |
scheduler_scheduling_attempt_duration_seconds_sum | Sum of scheduler scheduling attempt duration, in seconds. |
scheduler_scheduling_duration_seconds | Scheduler scheduling duration, in seconds. |
scheduler_scheduling_duration_seconds_count | Count of scheduling duration. |
scheduler_scheduling_duration_seconds_sum | Sum of scheduling duration. |
scheduler_total_preemption_attempts | Total number of preemption attempts by the scheduler. |
scheduler_unschedulable_pods | Number of unschedulable pods in the scheduler. |
scheduler_volume_scheduling_duration_seconds_bucket | Buckets for volume scheduling duration. |
scheduler_volume_scheduling_duration_seconds_count | Count of scheduler volume scheduling duration, in seconds. |
scheduler_volume_scheduling_duration_seconds_sum | Sum of scheduler volume scheduling duration, in seconds. |
scheduler_volume_scheduling_stage_error_total | Total number of errors in the scheduler volume scheduling stage. |
scrape_duration_seconds | Scrape duration, in seconds. |
scrape_samples_post_metric_relabeling | Number of scraped samples after metric relabeling. |
scrape_samples_scraped | Number of scraped samples. |
scrape_series_added | Number of new series added from scrapes. |
up | Connectivity for metric scraping. |
workqueue_adds_total | Total number of additions to the work queue. |
workqueue_depth | Depth of the work queue. |
workqueue_longest_running_processor_seconds | Longest running processor time in the work queue, in seconds. |
workqueue_queue_duration_seconds_bucket | Buckets for the duration items stay in the work queue, in seconds. |
workqueue_queue_duration_seconds_count | Count of the duration items stay in the work queue, in seconds. |
workqueue_queue_duration_seconds_sum | Sum of the duration items stay in the work queue, in seconds. |
workqueue_retries_total | Total number of retries in the work queue. |
workqueue_unfinished_work_seconds | Seconds of unfinished work in the work queue. |
workqueue_work_duration_seconds_bucket | Distribution of work duration in the work queue. |
workqueue_work_duration_seconds_count | Count of work duration in the work queue. |
workqueue_work_duration_seconds_sum | Sum of work duration in the work queue, in seconds. |
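Many entries in the table above come in `_sum` and `_count` pairs. A common way to use such a pair is to divide the increase of `_sum` by the increase of `_count` over the same time window, which gives the mean duration per observation in that window. The following Python sketch shows the arithmetic. The helper name `mean_duration` and the sample readings are hypothetical.

```python
def mean_duration(sum_start, sum_end, count_start, count_end):
    """Return the mean duration per observation between two readings.

    sum_*   : values of a *_sum series at the start and end of the window
    count_* : values of the matching *_count series at the same times
    Returns None if no new observations were recorded in the window.
    """
    observations = count_end - count_start
    if observations <= 0:
        # Either nothing happened or the counter was reset; a real
        # monitoring system would handle counter resets separately.
        return None
    return (sum_end - sum_start) / observations


# Hypothetical readings of scheduler_scheduling_attempt_duration_seconds_sum
# and scheduler_scheduling_attempt_duration_seconds_count, taken 5 minutes apart.
average = mean_duration(sum_start=12.0, sum_end=15.0,
                        count_start=100, count_end=160)
print("mean scheduling attempt duration (seconds):", average)
```

Comparing raw `_sum` values directly is rarely useful because both series only ever increase. The ratio of the two increases is what reflects recent behavior.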
References
- To view the metrics for ARMS Application Monitoring, see Application Monitoring metrics.