Managed Service for Prometheus: Metrics

Last Updated: Sep 06, 2024

When you use Alibaba Cloud Managed Service for Prometheus, you are charged based on the number of data samples reported for billable metrics. Metrics are classified into basic metrics and custom metrics. Basic metrics are free of charge. Custom metrics have been billable since January 6, 2020.
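
For example, the billable volume depends only on how many samples your custom metrics report. As a hypothetical illustration, a custom metric that exposes 100 time series and is scraped every 30 seconds reports 100 × 2 × 60 × 24 = 288,000 samples per day, or about 8.64 million samples per 30-day month. Basic metrics reported by the same instance do not add to this volume.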

Kubernetes clusters

The following tables describe the basic metrics of Kubernetes clusters that are supported by Managed Service for Prometheus.

Job names and basic metrics related to Prometheus instance status

Job name

Metric type

Metric name

Description

_arms-prom/kubelet/1

Basic metric

promhttp_metric_handler_requests_in_flight

-

go_memstats_mallocs_total

A counter value that shows the number of allocated heap objects. You can call the rate() function to calculate the allocation rate of heap objects.

go_memstats_lookups_total

A counter value that shows the number of dereferenced pointers. You can call the rate() function to calculate the dereferencing rate of pointers.

go_memstats_last_gc_time_seconds

The timestamp when the last garbage collection (GC) was complete.

go_memstats_heap_sys_bytes

The number of memory bytes allocated for the heap from the operating system, including the virtual address space that is reserved but not used.

go_memstats_heap_released_bytes

The number of memory bytes in free spans that have been returned to the operating system.

go_memstats_heap_objects

The number of objects allocated on the heap. The number varies based on GC and the allocation of new objects.

go_memstats_heap_inuse_bytes

The number of bytes occupied by the spans in use.

go_memstats_heap_idle_bytes

The number of memory bytes occupied by free spans.

go_memstats_heap_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

go_memstats_gc_sys_bytes

The amount of memory occupied by GC metadata.

go_memstats_gc_cpu_fraction

The percentage of CPU time consumed by GC since the program was started.

go_memstats_frees_total

A counter value that shows the number of removed heap objects. You can call the rate() function to calculate the removal rate of heap objects. You can use the go_memstats_mallocs_total - go_memstats_frees_total formula to calculate the number of surviving heap objects.

go_memstats_buck_hash_sys_bytes

The amount of memory occupied by the hash tables used for profiling.

go_memstats_alloc_bytes_total

The value of the metric increases as objects are allocated in the heap, but does not decrease when objects are removed. Similar to Prometheus counters, the rate() function can be called to query the memory consumption rate.

go_memstats_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

scrape_duration_seconds

-

go_info

The information about the Go version. The value is obtained by calling the runtime.Version() function.

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

scrape_samples_post_metric_relabeling

-

go_gc_duration_seconds_sum

-

go_gc_duration_seconds_count

-

blackbox_exporter_config_last_reload_successful

-

blackbox_exporter_config_last_reload_success_timestamp_seconds

-

scrape_samples_scraped

-

blackbox_exporter_build_info

-

arms_prometheus_target_scrapes_sample_out_of_order_total

-

arms_prometheus_target_scrapes_sample_out_of_bounds_total

-

arms_prometheus_target_scrapes_sample_duplicate_timestamp_total

-

scrape_series_added

-

arms_prometheus_target_scrapes_exceeded_sample_limit_total

-

arms_prometheus_target_scrapes_cache_flush_forced_total

-

arms_prometheus_target_scrape_pools_total

-

statsd_metric_mapper_cache_gets_total

-

statsd_metric_mapper_cache_hits_total

-

statsd_metric_mapper_cache_length

-

arms_prometheus_target_scrape_pools_failed_total

-

up

-

arms_prometheus_target_scrape_pool_reloads_total

-

arms_prometheus_target_scrape_pool_reloads_failed_total

-
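
The descriptions in this table mention the rate() function and the go_memstats_mallocs_total - go_memstats_frees_total formula. The following PromQL queries are a minimal sketch of how they can be used; the 5-minute window is an example value:

rate(go_memstats_mallocs_total[5m])                     # allocation rate of heap objects
rate(go_memstats_frees_total[5m])                       # removal rate of heap objects
go_memstats_mallocs_total - go_memstats_frees_total     # approximate number of surviving heap objects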

Job names and basic metrics related to API server data collection

Job name

Metric type

Metric name

apiserver

Basic metric

apiserver_request_duration_seconds_bucket (deprecated by default)

apiserver_admission_controller_admission_duration_seconds_bucket

apiserver_request_total

rest_client_requests_total

apiserver_admission_webhook_admission_duration_seconds_bucket

apiserver_current_inflight_requests

up

apiserver_admission_webhook_admission_duration_seconds_count

scrape_samples_post_metric_relabeling

scrape_samples_scraped

scrape_series_added

scrape_duration_seconds
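
As an example of how these metrics can be queried, the following PromQL sketch computes the API server request rate and an approximate request latency percentile. The verb and le labels are assumptions based on common Kubernetes conventions and are not listed in this topic, and apiserver_request_duration_seconds_bucket is available only if it is not dropped (it is deprecated by default):

sum by (verb) (rate(apiserver_request_total[5m]))    # request rate by verb
histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))    # approximate P99 request latency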

Job names and basic metrics related to Ingress data collection

Job name

Metric type

Metric name

Description

arms-ack-ingress

Basic metric

nginx_ingress_controller_request_duration_seconds_bucket

-

nginx_ingress_controller_response_duration_seconds_bucket (deprecated by default)

-

nginx_ingress_controller_response_size_bucket (deprecated by default)

-

nginx_ingress_controller_request_size_bucket

-

nginx_ingress_controller_bytes_sent_bucket

-

go_gc_duration_seconds

The value is obtained by calling the debug.ReadGCStats() function. When the function is called, the PauseQuantile field of the GCStats structure is set to 5. The function will return the minimum percentile, 25%, 50%, 75%, and the maximum percentile of the GC pause time. Then, the Prometheus Go client creates a summary metric based on the returned percentile of the GC pause time, NumGC, and PauseTotal variables.

nginx_ingress_controller_nginx_process_connections

-

nginx_ingress_controller_request_duration_seconds_sum

-

nginx_ingress_controller_request_duration_seconds_count (deprecated by default)

-

nginx_ingress_controller_bytes_sent_sum

-

nginx_ingress_controller_request_size_sum

-

nginx_ingress_controller_response_duration_seconds_count

-

nginx_ingress_controller_response_duration_seconds_sum (deprecated by default)

-

nginx_ingress_controller_response_size_count (deprecated by default)

-

nginx_ingress_controller_bytes_sent_count

-

nginx_ingress_controller_response_size_sum

-

nginx_ingress_controller_request_size_count

-

promhttp_metric_handler_requests_total

-

nginx_ingress_controller_nginx_process_connections_total

-

go_memstats_mcache_sys_bytes

The amount of memory allocated from the operating system for the mcache structure.

go_memstats_lookups_total

A counter value that shows the number of dereferenced pointers. You can call the rate() function to calculate the dereferencing rate of pointers.

go_threads

The value is obtained by calling the runtime.CreateThreadProfile() function based on the global allm variable.

go_memstats_sys_bytes

The number of memory bytes that Go has obtained from the system.

go_memstats_last_gc_time_seconds

The timestamp when the last GC was complete.

go_memstats_heap_sys_bytes

The number of memory bytes allocated for the heap from the operating system, including the virtual address space that is reserved but not used.

go_memstats_heap_objects

The number of objects allocated on the heap. The number varies based on GC and the allocation of new objects.

go_memstats_heap_inuse_bytes

The number of bytes occupied by the spans in use.

go_memstats_heap_idle_bytes

The number of memory bytes occupied by free spans.

go_memstats_heap_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

go_memstats_gc_sys_bytes

The amount of memory occupied by GC metadata.

promhttp_metric_handler_requests_in_flight

-

go_memstats_stack_sys_bytes

The number of stack memory bytes obtained from the operating system. The value is obtained based on the value of the go_memstats_stack_inuse_bytes metric plus the size of the operating system thread stack.

go_memstats_stack_inuse_bytes

The amount of used memory on a stack memory span on which at least one stack object is allocated.

go_memstats_gc_cpu_fraction

The percentage of CPU time consumed by GC since the program was started.

go_memstats_frees_total

A counter value that shows the number of removed heap objects. You can call the rate() function to calculate the removal rate of heap objects. You can use the go_memstats_mallocs_total - go_memstats_frees_total formula to calculate the number of surviving heap objects.

go_memstats_buck_hash_sys_bytes

The amount of memory occupied by the hash tables used for profiling.

go_memstats_alloc_bytes_total

The value of the metric increases as objects are allocated in the heap, but does not decrease when objects are removed. Similar to Prometheus counters, the rate() function can be called to query the memory consumption rate.

go_memstats_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

nginx_ingress_controller_nginx_process_num_procs

-

go_info

The information about the Go version. The value is obtained by calling the runtime.Version() function.

go_memstats_mallocs_total

A counter value that shows the number of allocated heap objects. You can call the rate() function to calculate the allocation rate of heap objects.

go_memstats_other_sys_bytes

The amount of memory used for other runtime allocations.

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

scrape_samples_post_metric_relabeling

-

scrape_samples_scraped

-

process_virtual_memory_max_bytes

-

process_virtual_memory_bytes

The virtual set size (VSS). The value indicates all allocated memory, including the memory that is allocated but not used, and the memory that is shared and swapped out.

scrape_duration_seconds

-

go_memstats_heap_released_bytes

The number of memory bytes in free spans that have been returned to the operating system.

go_gc_duration_seconds_sum

-

go_memstats_next_gc_bytes

The amount of heap memory during the next GC cycle. GC is used to ensure that the value is greater than the value of the go_memstats_heap_alloc_bytes metric.

go_gc_duration_seconds_count

-

nginx_ingress_controller_config_hash

-

nginx_ingress_controller_config_last_reload_successful

-

nginx_ingress_controller_config_last_reload_successful_timestamp_seconds

-

nginx_ingress_controller_ingress_upstream_latency_seconds_count

-

nginx_ingress_controller_ingress_upstream_latency_seconds_sum

-

process_start_time_seconds

The value is obtained based on the start_time parameter. The start_time parameter specifies the time when a process starts. Unit: jiffy. The data comes from the /proc/stat directory. You can divide the value of the start_time parameter by USER_HZ to calculate the value, which is measured in seconds.

nginx_ingress_controller_nginx_process_cpu_seconds_total

-

scrape_series_added

-

nginx_ingress_controller_nginx_process_oldest_start_time_seconds

-

nginx_ingress_controller_nginx_process_read_bytes_total

-

nginx_ingress_controller_nginx_process_requests_total

-

nginx_ingress_controller_nginx_process_resident_memory_bytes

-

nginx_ingress_controller_nginx_process_virtual_memory_bytes

-

nginx_ingress_controller_nginx_process_write_bytes_total

-

nginx_ingress_controller_requests

-

go_memstats_mcache_inuse_bytes

The amount of memory used by the mcache structure.

nginx_ingress_controller_success

-

process_resident_memory_bytes

The resident set size (RSS). The value indicates the actual memory used by processes, including the shared memory. The memory that is allocated but not used, or the memory that is swapped out is not included.

process_open_fds

The value is obtained by calculating the total number of files in the /proc/PID/fd directory. It shows the total number of regular files, sockets, and pseudo-terminals opened by Go processes.

process_max_fds

The value is obtained by reading the value of the Max Open Files row in the /proc/{PID}/limits file. The value is a soft limit. The soft limit is the value that the kernel uses to limit the resources. The hard limit is the maximum value of the soft limit.

process_cpu_seconds_total

The value is obtained based on the utime parameter (the number of ticks executed by the Go process in user mode) and the stime parameter (the number of ticks executed by the Go process in kernel mode or when the system is called). Unit of the parameters: jiffy, which measures the tick time between two system timer interruptions. The value of the process_cpu_seconds_total metric is the sum of utime and stime divided by USER_HZ. The total number of program ticks divided by the tick rate (ticks per second. Unit: Hz) is the total time (unit: seconds) that the operating system has been running the process.

go_memstats_mspan_sys_bytes

The amount of memory allocated from the operating system for the mspan structure.

up

-

go_memstats_mspan_inuse_bytes

The amount of memory used by the mspan structure.

nginx_ingress_controller_ssl_expire_time_seconds

-

nginx_ingress_controller_leader_election_status

-
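
For reference, the following PromQL sketch shows how the Ingress metrics above can be combined. The ingress label is an assumption based on the default labels of the NGINX Ingress controller and is not listed in this topic:

sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))    # request rate per Ingress
histogram_quantile(0.95, sum by (le) (rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))    # approximate P95 request latency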

Job names and basic metrics related to CoreDNS data collection

Job name

Metric type

Metric name

Description

arms-ack-coredns

Basic metric

coredns_forward_request_duration_seconds_bucket

-

coredns_dns_request_size_bytes_bucket

-

coredns_dns_response_size_bytes_bucket

-

coredns_kubernetes_dns_programming_duration_seconds_bucket

-

coredns_dns_request_duration_seconds_bucket

-

coredns_plugin_enabled

-

coredns_health_request_duration_seconds_bucket

-

go_gc_duration_seconds

The value is obtained by calling the debug.ReadGCStats() function. When the function is called, the PauseQuantile field of the GCStats structure is set to 5. The function will return the minimum percentile, 25%, 50%, 75%, and the maximum percentile of the GC pause time. Then, the Prometheus Go client creates a summary metric based on the returned percentile of the GC pause time, NumGC, and PauseTotal variables.

coredns_forward_responses_total

-

coredns_forward_request_duration_seconds_sum

-

coredns_forward_request_duration_seconds_count

-

coredns_dns_requests_total

-

coredns_forward_conn_cache_misses_total

-

coredns_dns_responses_total

-

coredns_cache_entries

-

coredns_cache_hits_total

-

coredns_forward_conn_cache_hits_total

-

coredns_forward_requests_total

-

coredns_dns_request_size_bytes_sum

-

coredns_dns_response_size_bytes_count

-

coredns_dns_response_size_bytes_sum

-

coredns_dns_request_size_bytes_count

-

scrape_duration_seconds

-

scrape_samples_scraped

-

scrape_series_added

-

up

-

scrape_samples_post_metric_relabeling

-

go_memstats_lookups_total

A counter value that shows the number of dereferenced pointers. You can call the rate() function to calculate the dereferencing rate of pointers.

go_memstats_last_gc_time_seconds

The timestamp when the last GC was complete.

go_memstats_heap_sys_bytes

The number of memory bytes allocated for the heap from the operating system, including the virtual address space that is reserved but not used.

coredns_build_info

-

go_memstats_heap_released_bytes

The number of memory bytes in free spans that have been returned to the operating system.

go_memstats_heap_objects

The number of objects allocated on the heap. The number varies based on GC and the allocation of new objects.

go_memstats_heap_inuse_bytes

The number of bytes occupied by the spans in use.

go_memstats_heap_idle_bytes

The number of memory bytes occupied by free spans.

go_memstats_heap_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

go_memstats_gc_sys_bytes

The amount of memory occupied by GC metadata.

go_memstats_sys_bytes

The number of memory bytes that Go has obtained from the system.

go_memstats_stack_sys_bytes

The number of stack memory bytes obtained from the operating system. The value is obtained based on the value of the go_memstats_stack_inuse_bytes metric plus the size of the operating system thread stack.

go_memstats_mallocs_total

A counter value that shows the number of allocated heap objects. You can call the rate() function to calculate the allocation rate of heap objects.

go_memstats_gc_cpu_fraction

The percentage of CPU time consumed by GC since the program was started.

go_memstats_stack_inuse_bytes

The amount of used memory on a stack memory span on which at least one stack object is allocated.

go_memstats_frees_total

A counter value that shows the number of removed heap objects. You can call the rate() function to calculate the removal rate of heap objects. You can use the go_memstats_mallocs_total - go_memstats_frees_total formula to calculate the number of surviving heap objects.

go_memstats_buck_hash_sys_bytes

The amount of memory occupied by the hash tables used for profiling.

go_memstats_alloc_bytes_total

The value of the metric increases as objects are allocated in the heap, but does not decrease when objects are removed. Similar to Prometheus counters, the rate() function can be called to query the memory consumption rate.

go_memstats_alloc_bytes

The number of memory bytes allocated for heap objects. The value is the same as the value of the go_memstats_heap_alloc_bytes metric. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

coredns_cache_misses_total

-

go_memstats_other_sys_bytes

The amount of memory used for other runtime allocations.

go_memstats_mcache_inuse_bytes

The amount of memory used by the mcache structure.

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

process_virtual_memory_max_bytes

-

process_virtual_memory_bytes

The VSS. The value indicates all allocated memory, including the memory that is allocated but not used, and the memory that is shared and swapped out.

go_gc_duration_seconds_sum

-

go_gc_duration_seconds_count

-

go_memstats_next_gc_bytes

The amount of heap memory during the next GC cycle. GC is used to ensure that the value is greater than the value of the go_memstats_heap_alloc_bytes metric.

coredns_dns_request_duration_seconds_count

-

coredns_reload_failed_total

-

coredns_panics_total

-

coredns_local_localhost_requests_total

-

coredns_kubernetes_dns_programming_duration_seconds_sum

-

coredns_kubernetes_dns_programming_duration_seconds_count

-

coredns_dns_request_duration_seconds_sum

-

coredns_hosts_reload_timestamp_seconds

-

coredns_health_request_failures_total

-

process_start_time_seconds

The value is obtained based on the start_time parameter. The start_time parameter specifies the time when a process starts. Unit: jiffy. The data comes from the /proc/stat directory. You can divide the value of the start_time parameter by USER_HZ to calculate the value, which is measured in seconds.

process_resident_memory_bytes

The RSS. The value indicates the actual memory used by processes, including the shared memory. The memory that is allocated but not used, or the memory that is swapped out is not included.

process_open_fds

The value is obtained by calculating the total number of files in the /proc/PID/fd directory. It shows the total number of regular files, sockets, and pseudo-terminals opened by Go processes.

process_max_fds

The value is obtained by reading the value of the Max Open Files row in the /proc/{PID}/limits file. The value is a soft limit. The soft limit is the value that the kernel uses to limit the resources. The hard limit is the maximum value of the soft limit.

process_cpu_seconds_total

The value is obtained based on the utime parameter (the number of ticks executed by the Go process in user mode) and the stime parameter (the number of ticks executed by the Go process in kernel mode or when the system is called). Unit of the parameters: jiffy, which measures the tick time between two system timer interruptions. The value of the process_cpu_seconds_total metric is the sum of utime and stime divided by USER_HZ. The total number of program ticks divided by the tick rate (ticks per second. Unit: Hz) is the total time (unit: seconds) that the operating system has been running the process.

coredns_health_request_duration_seconds_sum

-

coredns_health_request_duration_seconds_count

-

go_memstats_mspan_sys_bytes

The amount of memory allocated from the operating system for the mspan structure.

coredns_forward_max_concurrent_rejects_total

-

coredns_forward_healthcheck_broken_total

-

go_memstats_mcache_sys_bytes

The amount of memory allocated from the operating system for the mcache structure.

go_memstats_mspan_inuse_bytes

The amount of memory used by the mspan structure.

go_threads

The value is obtained by calling the runtime.CreateThreadProfile() function based on the global allm variable.

go_info

The information about the Go version. The value is obtained by calling the runtime.Version() function.
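
For reference, the following PromQL sketch shows how the CoreDNS metrics above can be combined; the 5-minute window is an example value:

sum(rate(coredns_dns_requests_total[5m]))    # DNS request rate
sum(rate(coredns_cache_hits_total[5m])) / (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))    # cache hit ratio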

Job names and basic metrics related to Kube-State-Metrics data collection

Job name

Metric type

Metric name

_kube-state-metrics

Basic metric

kube_pod_container_status_waiting_reason

kube_pod_status_phase

kube_pod_container_status_last_terminated_reason

kube_pod_container_status_terminated_reason

kube_pod_status_ready

kube_node_status_condition

kube_pod_container_status_running

kube_pod_container_status_restarts_total

kube_pod_container_info

kube_pod_container_status_waiting

kube_pod_container_status_terminated

kube_pod_labels

kube_pod_owner

kube_pod_info

kube_pod_container_resource_limits

kube_persistentvolume_status_phase

kube_pod_container_resource_requests_memory_bytes

kube_pod_container_resource_requests_cpu_cores

kube_pod_container_resource_limits_memory_bytes

kube_node_status_capacity

kube_service_info

kube_pod_container_resource_limits_cpu_cores

kube_deployment_status_replicas_updated

kube_deployment_status_replicas_unavailable

kube_deployment_spec_replicas

kube_deployment_created

kube_deployment_metadata_generation

kube_deployment_status_replicas

kube_deployment_labels

kube_deployment_status_observed_generation

kube_deployment_status_replicas_available

kube_deployment_spec_strategy_rollingupdate_max_unavailable

kube_daemonset_status_desired_number_scheduled

kube_daemonset_updated_number_scheduled

kube_daemonset_status_number_ready

kube_daemonset_status_number_misscheduled

kube_daemonset_status_number_available

kube_daemonset_status_current_number_scheduled

kube_daemonset_created

kube_node_status_allocatable_cpu_cores

kube_node_status_capacity_memory_bytes

kube_node_spec_unschedulable

kube_node_status_allocatable_memory_bytes

kube_node_labels

kube_node_info

kube_namespace_labels

kube_node_status_capacity_cpu_cores

kube_node_status_capacity_pods

kube_node_status_allocatable_pods

kube_node_spec_taint

kube_statefulset_status_replicas

kube_statefulset_replicas

kube_statefulset_created

up

scrape_samples_scraped

scrape_duration_seconds

scrape_samples_post_metric_relabeling

scrape_series_added
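
For reference, the following PromQL sketch shows how the Kube-State-Metrics metrics above can be queried. The phase, namespace, and pod labels are assumptions based on the standard kube-state-metrics labels and are not listed in this topic:

sum by (phase) (kube_pod_status_phase)    # number of pods in each phase
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))    # container restarts in the last hour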

Job names and basic metrics related to Kubelet data collection

Job name

Metric type

Metric name

Description

_arms/kubelet/metric

Basic metric

rest_client_request_duration_seconds_bucket

-

apiserver_client_certificate_expiration_seconds_bucket

-

kubelet_pod_worker_duration_seconds_bucket

-

kubelet_pleg_relist_duration_seconds_bucket

-

workqueue_queue_duration_seconds_bucket

-

rest_client_requests_total

-

go_gc_duration_seconds

The value is obtained by calling the debug.ReadGCStats() function. When the function is called, the PauseQuantile field of the GCStats structure is set to 5. The function will return the minimum percentile, 25%, 50%, 75%, and the maximum percentile of the GC pause time. Then, the Prometheus Go client creates a summary metric based on the returned percentile of the GC pause time, NumGC, and PauseTotal variables.

process_cpu_seconds_total

The value is obtained based on the utime parameter (the number of ticks executed by the Go process in user mode) and the stime parameter (the number of ticks executed by the Go process in kernel mode or when the system is called). Unit of the parameters: jiffy, which measures the tick time between two system timer interruptions. The value of the process_cpu_seconds_total metric is the sum of utime and stime divided by USER_HZ. The total number of program ticks divided by the tick rate (ticks per second. Unit: Hz) is the total time (unit: seconds) that the operating system has been running the process.

process_resident_memory_bytes

The RSS. The value indicates the actual memory used by processes, including the shared memory. The memory that is allocated but not used, or the memory that is swapped out is not included.

kubernetes_build_info

-

kubelet_node_name

-

kubelet_certificate_manager_client_ttl_seconds

-

kubelet_certificate_manager_client_expiration_renew_errors

-

scrape_duration_seconds

-

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

scrape_samples_post_metric_relabeling

-

scrape_samples_scraped

-

scrape_series_added

-

up

-

apiserver_client_certificate_expiration_seconds_count

-

workqueue_adds_total

-

workqueue_depth

-
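
For reference, the following PromQL sketch shows how the kubelet metrics above can be queried; the window values are examples:

histogram_quantile(0.99, sum by (le) (rate(kubelet_pod_worker_duration_seconds_bucket[5m])))    # approximate P99 pod worker duration
sum(rate(rest_client_requests_total[5m]))    # request rate of the kubelet REST client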

Job names and basic metrics related to cAdvisor data collection

Job name

Metric type

Metric name

_arms/kubelet/cadvisor

Basic metric

container_memory_failures_total (deprecated by default)

container_memory_rss

container_spec_memory_limit_bytes

container_memory_failcnt

container_memory_cache

container_memory_swap

container_memory_usage_bytes

container_memory_max_usage_bytes

container_cpu_load_average_10s

container_fs_reads_total (deprecated by default)

container_fs_writes_total (deprecated by default)

container_network_transmit_errors_total

container_network_receive_bytes_total

container_network_transmit_packets_total

container_network_receive_errors_total

container_memory_working_set_bytes

container_cpu_usage_seconds_total

container_fs_reads_bytes_total

container_fs_writes_bytes_total

container_spec_cpu_quota

container_cpu_cfs_periods_total

container_cpu_cfs_throttled_periods_total

container_cpu_cfs_throttled_seconds_total

container_fs_inodes_free

container_fs_io_time_seconds_total

container_fs_io_time_weighted_seconds_total

container_fs_limit_bytes

container_tasks_state (deprecated by default)

container_fs_read_seconds_total (deprecated by default)

container_fs_write_seconds_total (deprecated by default)

container_fs_usage_bytes

container_fs_inodes_total

container_fs_io_current

scrape_duration_seconds

scrape_samples_scraped

machine_cpu_cores

machine_memory_bytes

scrape_samples_post_metric_relabeling

scrape_series_added

up

_arms-prom/kube-apiserver/cadvisor

Basic metric

scrape_duration_seconds

up

scrape_samples_scraped

scrape_samples_post_metric_relabeling

scrape_series_added
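
For reference, the following PromQL sketch shows how the cAdvisor metrics above can be combined. The namespace and pod labels are assumptions based on common Kubernetes relabeling conventions and are not listed in this topic:

sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))    # CPU cores used per pod
sum by (namespace, pod) (container_memory_working_set_bytes)             # working set memory per pod
rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])    # CPU throttling ratio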

Job names and basic metrics related to ACK Scheduler data collection

Job name

Metric type

Metric name

ack-scheduler

Basic metric

rest_client_request_duration_seconds_bucket

scheduler_pod_scheduling_attempts_bucket

rest_client_requests_total

scheduler_pending_pods

scheduler_scheduler_cache_size

up

Job names and basic metrics related to etcd data collection

Job name

Metric type

Metric name

etcd

Basic metric

etcd_disk_backend_commit_duration_seconds_bucket

up

etcd_server_has_leader

etcd_debugging_mvcc_keys_total

etcd_debugging_mvcc_db_total_size_in_bytes

etcd_server_leader_changes_seen_total

Job names and basic metrics related to node data collection

Job name

Metric type

Metric name

Description

node-exporter

Basic metric

node_filesystem_size_bytes

-

node_filesystem_readonly

-

node_filesystem_free_bytes

-

node_filesystem_avail_bytes

-

node_cpu_seconds_total

-

node_network_receive_bytes_total

-

node_network_receive_errs_total

-

node_network_transmit_bytes_total

-

node_network_receive_packets_total

-

node_network_transmit_drop_total

-

node_network_transmit_errs_total

-

node_network_up

-

node_network_transmit_packets_total

-

node_network_receive_drop_total

-

go_gc_duration_seconds

The value is obtained by calling the debug.ReadGCStats() function. When the function is called, the PauseQuantile field of the GCStats structure is set to 5. The function will return the minimum percentile, 25%, 50%, 75%, and the maximum percentile of the GC pause time. Then, the Prometheus Go client creates a summary metric based on the returned percentile of the GC pause time, NumGC, and PauseTotal variables.

node_load5

-

node_filefd_allocated

-

node_exporter_build_info

-

node_disk_written_bytes_total

-

node_disk_writes_completed_total

-

node_disk_write_time_seconds_total

-

node_nf_conntrack_entries

-

node_nf_conntrack_entries_limit

-

node_processes_max_processes

-

node_processes_pids

-

node_sockstat_TCP_alloc

-

node_sockstat_TCP_inuse

-

node_sockstat_TCP_tw

-

node_timex_offset_seconds

-

node_timex_sync_status

-

node_uname_info

-

node_vmstat_pgfault

-

node_vmstat_pgmajfault

-

node_vmstat_pgpgin

-

node_vmstat_pgpgout

-

node_disk_reads_completed_total

-

node_disk_read_time_seconds_total

-

process_cpu_seconds_total

The value is obtained based on the utime parameter (the number of ticks executed by the Go process in user mode) and the stime parameter (the number of ticks executed by the Go process in kernel mode or when the system is called). Unit of the parameters: jiffy, which measures the tick time between two system timer interruptions. The value of the process_cpu_seconds_total metric is the sum of utime and stime divided by USER_HZ. The total number of program ticks divided by the tick rate (ticks per second. Unit: Hz) is the total time (unit: seconds) that the operating system has been running the process.

node_disk_read_bytes_total

-

node_disk_io_time_weighted_seconds_total

-

node_disk_io_time_seconds_total

-

node_disk_io_now

-

node_context_switches_total

-

node_boot_time_seconds

-

process_resident_memory_bytes

The RSS. The value indicates the actual memory used by processes, including the shared memory. The memory that is allocated but not used, or the memory that is swapped out is not included.

node_intr_total

-

node_load1

-

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

scrape_duration_seconds

-

node_load15

-

scrape_samples_post_metric_relabeling

-

node_netstat_Tcp_PassiveOpens

-

scrape_samples_scraped

-

node_netstat_Tcp_CurrEstab

-

scrape_series_added

-

node_netstat_Tcp_ActiveOpens

-

node_memory_MemTotal_bytes

-

node_memory_MemFree_bytes

-

node_memory_MemAvailable_bytes

-

node_memory_Cached_bytes

-

up

-

node_memory_Buffers_bytes

-
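
For reference, the following PromQL sketch shows how the node metrics above can be combined. The mode and instance labels are assumptions based on the standard node_exporter labels and are not listed in this topic:

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))    # node CPU utilization
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes              # fraction of memory that is available
1 - node_filesystem_avail_bytes / node_filesystem_size_bytes             # filesystem usage ratio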

Job names and basic metrics related to GPU data collection

Job name

Metric type

Metric name

Description

gpu-exporter

Basic metric

go_gc_duration_seconds

The value is obtained by calling the debug.ReadGCStats() function. When the function is called, the PauseQuantile field of the GCStats structure is set to 5. The function will return the minimum percentile, 25%, 50%, 75%, and the maximum percentile of the GC pause time. Then, the Prometheus Go client creates a summary metric based on the returned percentile of the GC pause time, NumGC, and PauseTotal variables.

promhttp_metric_handler_requests_total

-

scrape_series_added

-

up

-

scrape_duration_seconds

-

scrape_samples_scraped

-

scrape_samples_post_metric_relabeling

-

go_memstats_mcache_inuse_bytes

The amount of memory used by the mcache structure.

process_virtual_memory_max_bytes

-

process_virtual_memory_bytes

The VSS. The value indicates all allocated memory, including the memory that is allocated but not used, and the memory that is shared and swapped out.

process_start_time_seconds

The value is obtained based on the start_time parameter. The start_time parameter specifies the time when a process starts. Unit: jiffy. The data comes from the /proc/stat directory. You can divide the value of the start_time parameter by USER_HZ to calculate the value, which is measured in seconds.

go_memstats_next_gc_bytes

The amount of heap memory during the next GC cycle. GC is used to ensure that the value is greater than the value of the go_memstats_heap_alloc_bytes metric.

go_memstats_heap_objects

The number of objects allocated on the heap. The number varies based on GC and the allocation of new objects.

process_resident_memory_bytes

The RSS. The value indicates the actual memory used by processes, including the shared memory. The memory that is allocated but not used, or the memory that is swapped out is not included.

process_open_fds

The value is obtained by calculating the total number of files in the /proc/PID/fd directory. It shows the total number of regular files, sockets, and pseudo-terminals opened by Go processes.

process_max_fds

The value is obtained by reading the value of the Max Open Files row in the /proc/{PID}/limits file. The value is a soft limit. The soft limit is the value that the kernel uses to limit the resources. The hard limit is the maximum value of the soft limit.

go_memstats_other_sys_bytes

The amount of memory used for other runtime allocations.

go_gc_duration_seconds_count

-

go_memstats_heap_alloc_bytes

The number of memory bytes allocated for heap objects. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

process_cpu_seconds_total

The value is obtained based on the utime parameter (the number of ticks executed by the Go process in user mode) and the stime parameter (the number of ticks executed by the Go process in kernel mode or when the system is called). Unit of the parameters: jiffy, which measures the tick time between two system timer interruptions. The value of the process_cpu_seconds_total metric is the sum of utime and stime divided by USER_HZ. The total number of program ticks divided by the tick rate (ticks per second. Unit: Hz) is the total time (unit: seconds) that the operating system has been running the process.

nvidia_gpu_temperature_celsius (deprecated by default)

-

go_memstats_stack_inuse_bytes

The amount of used memory on a stack memory span on which at least one stack object is allocated.

nvidia_gpu_power_usage_milliwatts (deprecated by default)

-

nvidia_gpu_num_devices (deprecated by default)

-

nvidia_gpu_memory_used_bytes (deprecated by default)

-

nvidia_gpu_memory_total_bytes (deprecated by default)

-

go_memstats_stack_sys_bytes

The number of stack memory bytes obtained from the operating system. The value is obtained based on the value of the go_memstats_stack_inuse_bytes metric plus the size of the operating system thread stack.

nvidia_gpu_memory_allocated_bytes (deprecated by default)

-

nvidia_gpu_duty_cycle (deprecated by default)

-

nvidia_gpu_allocated_num_devices (deprecated by default)

-

promhttp_metric_handler_requests_in_flight

-

go_memstats_sys_bytes

The number of memory bytes that Go has obtained from the system.

go_memstats_gc_sys_bytes

The amount of memory occupied by GC metadata.

go_memstats_gc_cpu_fraction

The percentage of CPU time consumed by GC since the program was started.

go_memstats_heap_released_bytes

The number of memory bytes in free spans that have been returned to the operating system.

go_memstats_frees_total

A counter value that shows the number of removed heap objects. You can call the rate() function to calculate the removal rate of heap objects. You can use the go_memstats_mallocs_total - go_memstats_frees_total formula to calculate the number of surviving heap objects.

go_threads

The value is obtained by calling the runtime.CreateThreadProfile() function based on the global allm variable.

go_memstats_mspan_sys_bytes

The amount of memory allocated from the operating system for the mspan structure.

go_memstats_buck_hash_sys_bytes

The amount of memory occupied by the hash tables used for profiling.

go_memstats_alloc_bytes_total

The value of the metric increases as objects are allocated in the heap, but does not decrease when objects are removed. Similar to Prometheus counters, the rate() function can be called to query the memory consumption rate.

go_memstats_heap_sys_bytes

The number of memory bytes allocated for the heap from the operating system, including the virtual address space that is reserved but not used.

go_memstats_mspan_inuse_bytes

The amount of memory used by the mspan structure.

go_memstats_alloc_bytes

The number of memory bytes allocated for heap objects. The value is the same as the value of the go_memstats_heap_alloc_bytes metric. The heap objects include all reachable heap objects and the unreachable objects that are not removed during GC.

go_info

The information about the Go version. The value is obtained by calling the runtime.Version() function.

go_memstats_last_gc_time_seconds

The timestamp when the last GC was complete.

go_memstats_heap_inuse_bytes

The number of bytes occupied by the spans in use.

go_memstats_mcache_sys_bytes

The amount of memory allocated from the operating system for the mcache structure.

go_memstats_lookups_total

A counter value that shows the number of dereferenced pointers. You can call the rate() function to calculate the dereferencing rate of pointers.

go_memstats_mallocs_total

A counter value that shows the number of allocated heap objects. You can call the rate() function to calculate the allocation rate of heap objects.

go_gc_duration_seconds_sum

-

go_goroutines

The value is obtained by calling the runtime.NumGoroutine() function based on the sched scheduler structure and the global allglen variable. All fields in the sched structure may concurrently change. Therefore, the system checks whether the value is less than 1. If the value is less than 1, 1 is returned.

go_memstats_heap_idle_bytes

The number of memory bytes occupied by free spans.
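
For reference, the process_* metrics listed above can be combined in PromQL, for example:

rate(process_cpu_seconds_total[5m])    # average number of CPU cores used by the exporter process
process_open_fds / process_max_fds     # file descriptor usage ratio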

Job names and basic metrics related to PV data collection

Job name

Metric type

Metric name

k8s-csi-cluster-pv

Basic metric

cluster_pvc_detail_num_total

cluster_pv_detail_num_total

cluster_pv_status_num_total

cluster_scrape_collector_success

cluster_scrape_collector_duration_seconds

alibaba_cloud_storage_operator_build_info

cluster_pvc_status_num_total

scrape_duration_seconds

scrape_samples_post_metric_relabeling

scrape_samples_scraped

scrape_series_added

up

k8s-csi-node-pv

Basic metric

cluster_scrape_collector_duration_seconds

cluster_scrape_collector_success

alibaba_cloud_csi_driver_build_info

up

scrape_series_added

scrape_samples_post_metric_relabeling

scrape_samples_scraped

scrape_duration_seconds

Hybrid Cloud Monitoring

The following table describes the metrics of Hybrid Cloud Monitoring that are supported by Managed Service for Prometheus.

Category

Metric type

Metric name

Description

ECS

Custom metric

cpu_util_lization

The CPU utilization of an Elastic Compute Service (ECS) instance.

internet_in_rate

The average rate of inbound traffic from the Internet to an ECS instance.

internet_out_rate

The average rate of outbound traffic from an ECS instance to the Internet.

disk_read_bps

The bit rate of reads to all disks of an ECS instance.

disk_write_bps

The bit rate of writes to all disks of an ECS instance.

vpc_public_ip_internet_in_Rate

The average rate of inbound traffic from the Internet to the IP address of an ECS instance.

vpc_public_ip_internet_out_Rate

The average rate of outbound traffic from the IP address of an ECS instance to the Internet.

cpu_total

(Agent) cpu.total

memory_totalspace

(Agent) memory.total.space

memory_usedutilization

(Agent) memory.used.utilization

diskusage_utilization

(Agent) disk.usage.utilization_device

RDS

Custom metric

cpu_usage_average

The CPU utilization.

disk_usage

The disk usage.

iops_usage

The IOPS usage.

connection_usage

The connection usage.

data_delay

The latency of read-only instances.

memory_usage

The memory usage.

mysql_network_in_new

The inbound bandwidth of an ApsaraDB RDS for MySQL instance.

mysql_network_out_new

The outbound bandwidth of an ApsaraDB RDS for MySQL instance.

mysql_active_sessions

MySQL_ActiveSessions

sqlserver_network_in_new

The inbound bandwidth of an ApsaraDB RDS for SQL Server instance.

sqlserver_network_out_new

The outbound bandwidth of an ApsaraDB RDS for SQL Server instance.

NAT

Custom metric

snat_connection

The number of SNAT connections.

snat_connection_drop_limit

The cumulative number of SNAT connections dropped due to the limit on the number of concurrent connections.

snat_connection_drop_rate_limit

The cumulative number of SNAT connections dropped due to the limit on the number of new connections.

net_rx_rate

The inbound bandwidth.

net_tx_rate

The outbound bandwidth.

net_rx_pkgs

The rate of inbound packets.

net_tx_pkgs

The rate of outbound packets.

RocketMQ

Custom metric

consumer_lag_gid

The number of accumulated messages.

receive_message_count_gid

The number of messages received per minute by a consumer group.

send_message_count_gid

The number of messages sent per minute by a producer group.

consumer_lag_topic

The number of accumulated messages of a topic or group.

receive_message_count_topic

The number of messages of a topic received per minute by a consumer group.

send_message_count_topic

The number of messages of a topic sent per minute by a producer group.

receive_message_count

The number of messages received per minute.

send_message_count

The number of messages sent per minute.

SLB

Custom metric

healthy_server_count

The number of healthy backend ECS instances.

unhealthy_server_count

The number of unhealthy backend ECS instances.

packet_tx

The number of outbound packets per second.

packet_rx

The number of inbound packets per second.

traffic_rx_new

The inbound bandwidth.

traffic_tx_new

The outbound bandwidth.

active_connection

The number of active connections over TCP.

inactive_connection

The number of inactive connections on a port.

new_connection

The number of new connections over TCP.

max_connection

The maximum number of concurrent connections on a port.

instance_active_connection

The number of active connections established to an instance.

instance_new_connection

The number of new connections established to an instance per second.

instance_max_connection

The maximum number of concurrent connections established to an instance per second.

instance_drop_connection

The number of connections that are dropped per second on an instance.

instance_traffic_rx

The inbound traffic per second of an instance. Unit: bit.

instance_traffic_tx

The outbound traffic per second of an instance. Unit: bit.

E-MapReduce (EMR)

Custom metric

active_applications

The number of active jobs.

active_users

The number of active users.

aggregate_containers_allocated

The total number of allocated containers.

aggregate_containers_released

The total number of released containers.

allocated_containers

The number of allocated containers.

apps_completed

The number of completed jobs.

apps_failed

The number of failed jobs.

apps_killed

The number of terminated jobs.

apps_pending

The number of pending jobs.

apps_running

The number of jobs that are running.

apps_submitted

The number of submitted jobs.

available_mb

The amount of memory available to the current queue.

available_vcores

The number of vCores available to the current queue.

pending_containers

The number of pending containers.

reserved_containers

The number of reserved containers.

EIP

Custom metric

net_rx_rate

The inbound bandwidth.

net_tx_rate

The outbound bandwidth.

net_rx_pkgs_rate

The rate of inbound packets.

net_tx_pkgs_rate

The rate of outbound packets.

out_ratelimit_drop_speed

The rate at which packets are dropped due to throttling.

OSS

Custom metric

availability

The availability.

request_valid_rate

The ratio of valid requests.

success_rate

The ratio of successful requests.

network_error_rate

The ratio of failed requests due to network issues.

total_request_count

The total number of requests.

valid_count

The number of valid requests.

internet_send

The outbound traffic over the Internet.

internet_recv

The inbound traffic over the Internet.

intranet_send

The outbound traffic over the internal network.

intranet_recv

The inbound traffic over the internal network.

success_count

The total number of successful requests.

network_error_count

The total number of failed requests due to network issues.

client_timeout_count

The total number of failed requests due to client timeouts.

Elasticsearch

Custom metric

node_cpu_utilization

The CPU utilization of a node.

node_heap_memory_utilization

The heap memory usage of a node.

node_stats_exception_log_count

The number of exceptions.

node_stats_full_gc_collection_count

The number of full heap garbage collections (full GCs).

node_disk_utilization

The disk usage of a node.

node_load_1m

The average load of a node over the last 1 minute.

cluster_query_qps

The queries per second (QPS) of a cluster.

cluster_index_qps

ClusterIndexQPS

Logstash

Custom metric

cpu_percent

The CPU utilization of a node.

node_heap_memory

The memory usage of a node.

node_disk_usage

The disk usage of a node.

DRDS

Custom metric

cpu_utilization

The CPU utilization.

connection_count

The number of connections.

logic_qps

The logical QPS.

logic_rt

The logical response time (RT).

memory_utilization

The memory usage.

network_input_traffic

The inbound bandwidth.

network_output_traffic

The outbound bandwidth.

physics_qps

The physical QPS.

physics_rt

The physical RT.

thread_count

The number of active threads.

com_insert_select

The number of INSERT and SELECT statements that are executed per second on a private ApsaraDB RDS for MySQL instance.

com_replace

The number of REPLACE statements that are executed per second on a private ApsaraDB RDS for MySQL instance.

com_replace_select

The number of REPLACE and SELECT statements that are executed per second on a private ApsaraDB RDS for MySQL instance.

com_select

The number of SELECT statements that are executed per second on a private ApsaraDB RDS for MySQL instance.

com_update

The number of UPDATE statements that are executed per second on a private ApsaraDB RDS for MySQL instance.

conn_usage

The connection usage of a private ApsaraDB RDS for MySQL instance.

cpu_usage

The CPU utilization of a private ApsaraDB RDS for MySQL instance.

disk_usage

The disk usage of a private ApsaraDB RDS for MySQL instance.

ibuf_dirty_ratio

The dirty page ratio of the buffer pool of a private ApsaraDB RDS for MySQL instance.

ibuf_pool_reads

The number of physical reads per second on a private ApsaraDB RDS for MySQL instance.

ibuf_read_hit

The read hit ratio of the buffer pool of a private ApsaraDB RDS for MySQL instance.

ibuf_request_r

The number of logical reads per second on a private ApsaraDB RDS for MySQL instance.

ibuf_request_w

The number of logical writes per second on a private ApsaraDB RDS for MySQL instance.

ibuf_use_ratio

The utilization of the buffer pool of a private ApsaraDB RDS for MySQL instance.

inno_data_read

The amount of data read per second on a private ApsaraDB RDS for MySQL instance that uses InnoDB.

inno_data_written

The amount of data written per second to a private ApsaraDB RDS for MySQL instance that uses InnoDB.

inno_row_delete

The number of rows deleted per second from a private ApsaraDB RDS for MySQL instance that uses InnoDB.

inno_row_insert

The number of rows inserted per second to a private ApsaraDB RDS for MySQL instance that uses InnoDB.

inno_row_readed

The number of rows read per second on a private ApsaraDB RDS for MySQL instance that uses InnoDB.

inno_row_update

The number of rows updated per second on a private ApsaraDB RDS for MySQL instance that uses InnoDB.

innodb_log_write_requests

The number of write requests per second to the logs of a private ApsaraDB RDS for MySQL instance that uses InnoDB.

innodb_log_writes

The number of logical writes per second to the logs of a private ApsaraDB RDS for MySQL instance that uses InnoDB.

innodb_os_log_fsyncs

The number of times fsync is called per second to write data to the logs of a private ApsaraDB RDS for MySQL instance that uses InnoDB.

input_traffic_ps

The inbound bandwidth of a private ApsaraDB RDS for MySQL instance.

iops_usage

The IOPS usage of a private ApsaraDB RDS for MySQL instance.

mem_usage

The memory usage of a private ApsaraDB RDS for MySQL instance.

output_traffic_ps

The outbound bandwidth of a private ApsaraDB RDS for MySQL instance.

qps

The QPS of a private ApsaraDB RDS for MySQL instance.

slave_lag

The latency of a private read-only ApsaraDB RDS for MySQL instance.

slow_queries

The slow queries per second of a private ApsaraDB RDS for MySQL instance.

tb_tmp_disk

The number of temporary tables created per second on a private ApsaraDB RDS for MySQL instance.

Kafka

Custom metric

instance_disk_capacity

The disk usage of an instance.

instance_message_input

The number of messages produced on an instance.

instance_message_output

The number of messages consumed on an instance.

topic_message_input

The number of messages produced in a topic.

topic_message_output

The number of messages consumed in a topic.

MongoDB

Custom metric

cpu_utilization

The CPU utilization.

memory_utilization

The memory usage.

disk_utilization

The disk usage.

iops_utilization

The IOPS usage.

qps

The QPS.

connect_amount

The number of used connections.

instance_disk_amount

The disk space occupied by an instance.

data_disk_amount

The disk space occupied by data.

log_disk_amount

The disk space occupied by logs.

intranet_in

The inbound traffic over the internal network.

intranet_out

The outbound traffic over the internal network.

number_requests

The number of requests.

op_insert

The number of insert operations.

op_query

The number of query operations.

op_update

The number of update operations.

op_delete

The number of delete operations.

op_getmore

The number of getMore operations.

op_command

The number of operations performed by running commands.

PolarDB

Custom metric

active_connections

The number of active connections.

blks_read_delta

The number of reads to a data block.

cluster_active_sessions

The number of active connections.

cluster_connection_utilization

The connection usage.

cluster_cpu_utilization

The CPU utilization.

cluster_data_io

The I/O throughput per second of a storage engine.

cluster_data_iops

The IOPS of a storage engine.

cluster_mem_hit_ratio

The cache hit ratio.

cluster_memory_utilization

The memory usage.

cluster_qps

The QPS.

cluster_slow_queries_ps

The number of slow queries per second.

cluster_tps

The number of transactions per second.

conn_usage

The connection usage.

cpu_total

The CPU utilization.

db_age

The maximum database age.

instance_connection_utilization

The connection usage of an instance.

instance_cpu_utilization

The CPU utilization of an instance.

instance_input_bandwidth

The inbound bandwidth of an instance.

instance_memory_utilization

The memory usage of an instance.

instance_output_bandwidth

The outbound bandwidth of an instance.

mem_usage

The memory usage.

pls_data_size

The disk data size of a PolarDB for PostgreSQL cluster.

pls_iops

pg IOPS

pls_iops_read

The read IOPS of a PolarDB for PostgreSQL cluster.

pls_iops_write

The write IOPS of a PolarDB for PostgreSQL cluster.

pls_pg_wal_dir_size

The size of write-ahead logging (WAL) logs of a PolarDB for PostgreSQL cluster.

pls_throughput

The I/O throughput of a PolarDB for PostgreSQL cluster.

pls_throughput_read

The read I/O throughput of a PolarDB for PostgreSQL cluster.

pls_throughput_write

The write I/O throughput of a PolarDB for PostgreSQL cluster.

swell_time

The point in time at which data bloat occurs in a PolarDB for PostgreSQL cluster.

tps

pg TPS

cluster_iops

The IOPS.

Redis

Custom metric

intranet_in_ratio

The bandwidth utilization of writes.

intranet_out_ratio

The bandwidth utilization of reads.

failed_count

The number of failed operations.

cpu_usage

The CPU utilization.

used_memory

The memory usage.

used_connection

The number of used connections.

used_qps

The number of used QPS.

Cloud service monitoring

The following table describes the metrics of cloud service monitoring that are supported by Managed Service for Prometheus.

ApsaraMQ for RocketMQ

Category

Metric type

Metric name

Description

Producer

Custom metric

rocketmq_producer_requests

The number of API calls that are made to send messages.

rocketmq_producer_messages

The number of sent messages.

rocketmq_producer_message_size_bytes

The total size of sent messages.

rocketmq_producer_send_success_rate

The success rate of message sending.

rocketmq_producer_failure_api_calls

The number of failed API calls that are made to send messages.

rocketmq_producer_send_rt_milliseconds_avg

The average time required to send messages.

rocketmq_producer_send_rt_milliseconds_min

The minimum time required to send messages.

rocketmq_producer_send_rt_milliseconds_max

The maximum time required to send messages.

rocketmq_producer_send_rt_milliseconds_p95

The 95th percentile of the time required to send messages.

rocketmq_producer_send_rt_milliseconds_p99

The 99th percentile of the time required to send messages.

Consumer

Custom metric

rocketmq_consumer_requests

The number of API calls that are made to consume messages.

rocketmq_consumer_send_back_requests

The number of API calls that are made to return messages after consumers fail to consume messages.

rocketmq_consumer_send_back_messages

The messages returned from consumers after consumers fail to consume messages.

rocketmq_consumer_messages

The number of consumed messages.

rocketmq_consumer_message_size_bytes

The total size of messages consumed within 1 minute.

rocketmq_consumer_ready_and_inflight_messages

The number of lagging messages, including ready messages and inflight messages.

rocketmq_consumer_ready_messages

The number of ready messages.

rocketmq_consumer_inflight_messages

The number of inflight messages.

rocketmq_consumer_queue_time_milliseconds

The queuing duration of messages.

rocketmq_consumer_message_await_time_milliseconds_avg

The average time required for consumer clients to allocate resources to process messages.

rocketmq_consumer_message_await_time_milliseconds_min

The minimum time required for consumer clients to allocate resources to process messages.

rocketmq_consumer_message_await_time_milliseconds_max

The maximum time required for consumer clients to allocate resources to process messages.

rocketmq_consumer_message_await_time_milliseconds_p95

The 95th percentile of the time required for consumer clients to allocate resources to process messages.

rocketmq_consumer_message_await_time_milliseconds_p99

The 99th percentile of the time required for consumer clients to allocate resources to process messages.

rocketmq_consumer_message_process_time_milliseconds_avg

The average time required for consumers to process messages.

rocketmq_consumer_message_process_time_milliseconds_min

The minimum time required for consumers to process messages.

rocketmq_consumer_message_process_time_milliseconds_max

The maximum time required for consumers to process messages.

rocketmq_consumer_message_process_time_milliseconds_p95

The 95th percentile of the time required for consumers to process messages.

rocketmq_consumer_message_process_time_milliseconds_p99

The 99th percentile of the time required for consumers to process messages.

rocketmq_consumer_consume_success_rate

The success rate of message consumption.

rocketmq_consumer_failure_api_calls

The number of failed API calls that are made to consume messages.

rocketmq_consumer_to_dlq_messages

The number of dead-letter messages.

ApsaraMQ for RabbitMQ

Category

Metric type

Metric name

Description

Overview

Custom metric

rabbitmq_instance_api_total

The number of instance-level API calls that are initiated per second.

rabbitmq_connections_opened_total

The total number of opened connections.

rabbitmq_connections_closed_total

The total number of closed connections.

rabbitmq_channels_opened_total

The total number of opened channels.

rabbitmq_channels_closed_total

The total number of closed channels.

rabbitmq_queues_declared_total

The total number of declared queues.

rabbitmq_queues_deleted_total

The total number of deleted queues.

rabbitmq_exchange_declared_total

-

rabbitmq_exchange_deleted_total

-

rabbitmq_exchange_bind_total

-

rabbitmq_exchange_unbind_total

-

rabbitmq_queue_bind_total

-

rabbitmq_queue_unbind_total

-

rabbitmq_connections

The number of connections that are currently open.

rabbitmq_channels

The number of channels that are currently open.

Connections

Custom metric

rabbitmq_connection_channels

The number of channels on connections.

Exchange

Custom metric

rabbitmq_exchange_messages_published_in_total

The number of inbound messages.

rabbitmq_exchange_messages_published_out_total

The number of outbound messages.

Queues

Custom metric

rabbitmq_queue_messages_published_total

The total number of messages published to queues.

rabbitmq_queue_messages_ready

The number of messages that are ready to be delivered to consumers.

rabbitmq_queue_messages_unacked

The number of messages that have been delivered to consumers but not yet acknowledged.

rabbitmq_queue_deliver_total

The total number of messages that have been delivered to consumers.

rabbitmq_queue_get_total

-

rabbitmq_queue_ack_total

-

rabbitmq_queue_uack_total

-

rabbitmq_queue_recover_total

-

rabbitmq_queue_reject_total

-

rabbitmq_queue_consumers

The number of consumers in queues.

MongoDB

Metric type

Metric name

Description

Custom metric

avg_rt

The average response time of an instance.

bytes_in

The inbound traffic of an instance.

bytes_out

The outbound traffic of an instance.

bytes_read_into_cache

The amount of data read from the WiredTiger cache.

bytes_written_from_cache

The amount of data written into the WiredTiger cache.

command

The QPS of protocol command operations.

conn_usage

The connection usage of an instance. The value is generated by dividing the number of current connections by the maximum number of connections.
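
For example, a minimal PromQL sketch that alerts on sustained high connection usage; the 80 threshold assumes the metric is reported as a percentage, which you should verify.

  # Alert when average connection usage over 5 minutes exceeds 80% (assumed percentage scale).
  avg_over_time(conn_usage[5m]) > 80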

connections_active

The number of active connections of an instance.

cpu_usage

The CPU utilization of an instance.

current_conn

The total number of current connections of an instance.

data_iops

The IOPS usage of the data disk.

data_size

The used data disk space of an instance.

delete

The QPS of delete operations.

disk_usage

The disk usage of an instance. The value is generated by dividing the used space by the maximum space.

document_deleted_ps

The number of documents deleted from an instance.

document_inserted_ps

The number of documents inserted into an instance.

document_returned_ps

The number of documents returned by an instance.

document_updated_ps

The number of documents updated by an instance.

getmore

The QPS of read operations.

gl_ac_readers

The number of global read locks currently used by an instance.

gl_ac_writers

The number of global write locks currently used by an instance.

gl_cq_readers

The length of the queue waiting for the global read locks.

gl_cq_total

The length of the queue waiting for the global locks.

gl_cq_writers

The length of the queue waiting for global write locks.

ins_size

The used disk space of an instance.

insert

The QPS of insert operations.

iocheck_cost

The I/O latency. The value indicates the I/O performance.

iops_usage

The IOPS usage.

job_cursors_closed

The number of cursors that are closed with closed sessions.

log_iops

The IOPS usage of the log disk.

log_size

The used log disk space of an instance.

maximum_bytes_configured

The maximum size of the WiredTiger disk.

mem_usage

The memory usage.

moveChunk_donor_started_ps

The number of times that the current node is used as the moveChunk source shard.

moveChunk_recip_stared_ps

The number of times that the current node is used as the moveChunk destination shard.

noTimeout_open

The number of opened cursors without a timeout period.

operation_exactIDCount_ps

The number of requests that need to be broadcasted to obtain information about the matched IDs.

operation_scanAndOrder_ps

The number of requests for which indexes cannot be used for sorting.

operation_writeConflicts_ps

The number of write conflicts.

pinned_open

The number of pinned open cursors.

query

The QPS of query operations.

queryExecutor_scannedObject_ps

The number of queried documents.

queryExecutor_scanned_ps

The number of queried indexes.

read_concurrent_trans_available

The number of concurrent read requests available in a WiredTiger request queue.

read_concurrent_trans_out

The number of concurrent read requests sent from a WiredTiger request queue.

repl_lag

The data synchronization latency of the primary and secondary nodes of an instance.
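
A hedged PromQL sketch for replication-lag alerting; the 10-second threshold and the assumption that the metric is reported in seconds are illustrative.

  # Alert when primary/secondary replication lag exceeds 10 seconds (assumed unit: seconds).
  max_over_time(repl_lag[5m]) > 10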

timed_out

The number of cursors that are closed due to timeout.

total_open

The number of cursors that are currently open.

ttl_deletedDocuments_ps

The number of documents that are deleted due to time-to-live (TTL) indexes.

ttl_passes_ps

The number of delete operations that the background TTL threads perform.

update

The QPS of update operations.

write_concurrent_trans_available

The number of concurrent write requests available in a WiredTiger request queue.

write_concurrent_trans_out

The number of concurrent write requests sent from a WiredTiger request queue.

wt_cache_dirty_usage

The dirty cache usage of the WiredTiger storage engine of an instance.

wt_cache_usage

The cache usage of the WiredTiger storage engine of an instance.

Flink

Flink metrics

Metric name

Definition

Description

Unit

Metric type

flink_jobmanager_job_numRestarts

The number of times that a job is restarted when a job failover occurs.

This metric indicates the number of times that a job is restarted when a job failover occurs. The number of times that the job is restarted when a JobManager failover occurs is not included.

Count

Custom metric
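
For example, a minimal PromQL alerting sketch on this metric; the 10-minute window is an illustrative assumption.

  # Fire when the job has restarted at least once within the last 10 minutes.
  increase(flink_jobmanager_job_numRestarts[10m]) > 0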

flink_taskmanager_job_task_operator_currentEmitEventTimeLag

The processing latency.

If the value of this metric is large, data latency may occur in the job when the system pulls or processes data.

Milliseconds

Custom metric

flink_taskmanager_job_task_operator_currentFetchEventTimeLag

The transmission latency.

If the value of this metric is large, data latency may occur in the job when the system pulls data.

Milliseconds

Custom metric
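
Because the emit lag covers both pulling and in-job processing while the fetch lag covers only pulling, subtracting the two gives a rough view of the in-job processing time. A hedged PromQL sketch; it assumes both metrics are exposed with matching label sets.

  # Approximate in-job processing latency (milliseconds), assuming matching labels on both series.
  flink_taskmanager_job_task_operator_currentEmitEventTimeLag
    - flink_taskmanager_job_task_operator_currentFetchEventTimeLag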

flink_taskmanager_job_task_numRecordsIn

The total number of input data records of all operators.

If the value of this metric does not increase for an extended period of time for an operator, data may be missing from the source and may fail to be transmitted downstream. In this case, you must check the data of the source.

Count

Custom metric

flink_taskmanager_job_task_numRecordsOut

The total number of output data records.

If the value of this metric does not increase for an extended period of time for an operator, the code logic of the job may be faulty and data may be missing or fail to be transmitted downstream. In this case, you must check the code logic of the job.

Count

Custom metric

flink_taskmanager_job_task_operator_numBytesIn

The total number of input bytes.

This metric measures the size of the input data records of the source. This helps observe the job throughput.

Byte

Custom metric

flink_taskmanager_job_task_operator_numBytesOut

The total number of output bytes.

This metric measures the size of the output data records of the source. This helps observe the job throughput.

Byte

Custom metric

flink_taskmanager_job_task_operator_numRecordsIn

The total number of input data records of all operators.

If the value of this metric does not increase for an extended period of time for an operator, data may be missing from the source and may fail to be transmitted downstream. In this case, you must check the data of the source.

Count

Custom metric

flink_taskmanager_job_task_operator_numRecordsInPerSecond

The number of input data records per second for all data streams.

This metric measures the overall processing speed of data streams.

For example, the value of this metric helps determine whether the overall processing speed of data streams meets the expected requirements and how the job performance changes under different input data loads.

Count/s

Custom metric

flink_taskmanager_job_task_operator_numRecordsOut

The total number of output data records.

If the value of this metric does not increase for an extended period of time for an operator, the code logic of the job may be faulty and data may be missing or fail to be transmitted downstream. In this case, you must check the code logic of the job.

Count

Custom metric

flink_taskmanager_job_task_operator_numRecordsOutPerSecond

The number of output data records per second for all data streams.

This metric measures the overall output speed of data streams. The speed indicates the number of output data records per second for all data streams.

For example, the value of this metric helps determine whether the overall output speed of data streams meets the expected requirements and how the job performance changes under different output data loads.

Count/s

Custom metric

flink_taskmanager_job_task_operator_source_numRecordsIn

The total number of data records that flow into the source operator.

This metric measures the number of data records that flow into the source.

Count

Custom metric

flink_taskmanager_job_task_operator_sink_numRecordsOut

The total number of output data records in a sink.

This metric measures the number of data records that are exported by the sink.

Count

Custom metric

flink_taskmanager_job_task_numRecordsInPerSecond

The number of input data records per second for all data streams.

This metric measures the overall processing speed of data streams.

For example, the value of this metric helps determine whether the overall processing speed of data streams meets the expected requirements and how the job performance changes under different input data loads.

Count/s

Custom metric

flink_taskmanager_job_task_numRecordsOutPerSecond

The number of output data records per second for all data streams.

This metric measures the overall output speed of data streams. The speed indicates the number of output data records per second for all data streams.

For example, the value of this metric helps determine whether the overall output speed of data streams meets the expected requirements and how the job performance changes under different output data loads.

Count/s

Custom metric

flink_taskmanager_job_task_operator_source_numRecordsInPerSecond

The number of input data records per second in a source.

This metric measures the speed at which data records are generated in a source. The speed indicates the number of input data records per second in the source.

For example, the number of data records that can be generated varies based on the type of each source in a data stream. The value of this metric helps determine the speed at which data records are generated in a source and adjust data streams to improve performance.

This metric is also used for monitoring and alerting. If the value of this metric is 0, data may be missing from the source. In this case, you must check whether data output is blocked because the data of the source is not consumed.

Count/s

Custom metric

flink_taskmanager_job_task_operator_sink_numRecordsOutPerSecond

The number of output data records per second in a sink.

This metric measures the speed at which data records are exported from a sink. The speed indicates the number of output data records per second in the sink.

For example, the number of data records that can be exported varies based on the type of each sink in a data stream. The value of this metric helps determine the speed at which data records are exported from a sink and adjust data streams to improve performance.

This metric is also used for monitoring and alerting. If the value of this metric is 0, the code logic of the job may be invalid and all data is filtered out. In this case, you must check the code logic of the job.

Count/s

Custom metric

flink_taskmanager_job_task_numBuffersInLocalPerSecond

The number of locally consumed data buffers per second.

If the value of this metric is large, inter-task communication is frequent on the local node.

Count/s

Custom metric

flink_taskmanager_job_task_numBuffersInRemotePerSecond

The number of buffers received from the remote TaskManager per second.

This metric indicates the frequency of inter-TaskManager communication.

Count/s

Custom metric

flink_taskmanager_job_task_numBuffersOutPerSecond

The number of buffers sent to other tasks per second.

This metric helps understand the output pressure of tasks and the usage of network bandwidth.

Count/s

Custom metric

flink_taskmanager_job_task_numBytesInLocalPerSecond

The total number of input bytes per second.

This metric measures the rate at which data flows into the source. This helps observe the job throughput.

Byte/s

Custom metric

flink_taskmanager_job_task_operator_numBytesOutPerSecond

The total number of output bytes per second.

This metric measures the rate at which data is exported by the source. This helps observe the job throughput.

Byte/s

Custom metric

flink_taskmanager_job_task_operator_pendingRecords

The number of data records that are not read by the source.

This metric measures the number of data records that are not pulled by the source from the external system.

Count

Custom metric
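
A hedged PromQL sketch that flags a growing source backlog; the 15-minute window and the absolute threshold are illustrative assumptions.

  # Alert when the unread backlog of the source keeps growing over the last 15 minutes.
  delta(flink_taskmanager_job_task_operator_pendingRecords[15m]) > 0
  # Or alert on an absolute backlog size (threshold is an assumption).
  flink_taskmanager_job_task_operator_pendingRecords > 100000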

flink_taskmanager_job_task_operator_sourceIdleTime

The duration for which data is not processed in the source.

This metric specifies whether the source is idle. If the value of this metric is large, your data is generated at a low speed in the external system.

Milliseconds

Custom metric

flink_taskmanager_job_task_operator_numBytesInPerSecond

The total number of input bytes per second.

None.

Byte/s

Custom metric

flink_taskmanager_job_task_numBytesOutPerSecond

The total number of output bytes per second.

None.

Byte/s

Custom metric

flink_taskmanager_job_task_operator_currentSendTime

The time consumed to send the latest record.

None.

Milliseconds

Custom metric

flink_jobmanager_job_totalNumberOfCheckpoints

The total number of checkpoints.

None.

Count

Custom metric

flink_jobmanager_job_numberOfFailedCheckpoints

The number of failed checkpoints.

None.

Count

Custom metric
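
For example, a hedged PromQL sketch that watches for checkpoint failures; the 30-minute window is an illustrative assumption.

  # Fire when any checkpoint failed in the last 30 minutes.
  increase(flink_jobmanager_job_numberOfFailedCheckpoints[30m]) > 0
  # Optional: failure ratio relative to all checkpoints attempted in the same window.
  increase(flink_jobmanager_job_numberOfFailedCheckpoints[30m])
    / increase(flink_jobmanager_job_totalNumberOfCheckpoints[30m])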

flink_jobmanager_job_numberOfCompletedCheckpoints

The number of completed checkpoints.

None.

Count

Custom metric

flink_jobmanager_job_numberOfInProgressCheckpoints

The number of checkpoints that are in progress.

None.

Count

Custom metric

flink_jobmanager_job_lastCheckpointDuration

The duration of the last checkpoint.

If the checkpoint takes an extended period of time or times out, the possible cause is that the storage space occupied by state data is excessively large, a temporary network error occurs, barriers are not aligned, or data backpressure exists.

Milliseconds

Custom metric

flink_jobmanager_job_lastCheckpointSize

The size of the last checkpoint.

This metric measures the size of the last checkpoint that is uploaded. This metric helps analyze the checkpoint performance when a bottleneck occurs.

Byte

Custom metric

flink_taskmanager_job_task_operator_state_name_stateClearLatency

The maximum latency of a Clear operation on state data.

This metric measures the performance of a Clear operation on state data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_valueStateGetLatency

The maximum latency of a Get operation on ValueState data.

This metric measures the performance of accessing ValueState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_valueStateUpdateLatency

The maximum latency of an Update operation on ValueState data.

This metric measures the performance of an Update operation on ValueState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_aggregatingStateGetLatency

The maximum latency of a Get operation on AggregatingState data.

This metric measures the performance of accessing AggregatingState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_aggregatingStateAddLatency

The maximum latency of an Add operation on AggregatingState data.

This metric measures the performance of an Add operation on AggregatingState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_aggregatingStateMergeNamespacesLatency

The maximum latency of a Merge Namespace operation on AggregatingState data.

This metric measures the performance of a Merge Namespace operation on AggregatingState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_reducingStateGetLatency

The maximum latency of a Get operation on ReducingState data.

This metric measures the performance of accessing ReducingState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_reducingStateAddLatency

The maximum latency of an Add operation on ReducingState data.

This metric measures the performance of an Add operation on ReducingState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_reducingStateMergeNamespacesLatency

The maximum latency of a Merge Namespace operation on ReducingState data.

This metric measures the performance of a Merge Namespace operation on ReducingState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateGetLatency

The maximum latency of a Get operation on MapState data.

This metric measures the performance of accessing MapState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStatePutLatency

The maximum latency of a Put operation on MapState data.

This metric measures the performance of a Put operation on MapState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStatePutAllLatency

The maximum latency of a PutAll operation on MapState data.

This metric measures the performance of a PutAll operation on MapState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateRemoveLatency

The maximum latency of a Remove operation on MapState data.

This metric measures the performance of a Remove operation on MapState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateContainsLatency

The maximum latency of a Contains operation on MapState data.

This metric measures the performance of a Contains operation on MapState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateEntriesInitLatency

The maximum latency of an Init operation on MapState entries.

This metric measures the performance of an Init operation on MapState entries.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateKeysInitLatency

The maximum latency of an Init operation on MapState keys.

This metric measures the performance of an Init operation on MapState keys.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateValuesInitLatency

The maximum latency of an Init operation on MapState values.

This metric measures the performance of an Init operation on MapState values.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateIteratorInitLatency

The maximum latency of an Init operation on MapState Iterator.

This metric measures the performance of an Init operation on MapState Iterator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateIsEmptyLatency

The maximum latency of an Empty operation on MapState data.

This metric measures the performance of an Empty operation on MapState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateIteratorHasNextLatency

The maximum latency of a HasNext operation on MapState Iterator.

This metric measures the performance of a HasNext operation on MapState Iterator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateIteratorNextLatency

The maximum latency of a Next operation on MapState Iterator.

This metric measures the performance of a Next operation on MapState Iterator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_mapStateIteratorRemoveLatency

The maximum latency of a Remove operation on MapState Iterator.

This metric measures the performance of a Remove operation on MapState Iterator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_listStateGetLatency

The maximum latency of a Get operation on ListState data.

This metric measures the performance of accessing ListState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_listStateAddLatency

The maximum latency of an Add operation on ListState data.

This metric measures the performance of an Add operation on ListState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_listStateAddAllLatency

The maximum latency of an AddAll operation on ListState data.

This metric measures the performance of an AddAll operation on ListState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_listStateUpdateLatency

The maximum latency of an Update operation on ListState data.

This metric measures the performance of an Update operation on ListState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_listStateMergeNamespacesLatency

The maximum latency of a Merge Namespace operation on ListState data.

This metric measures the performance of a Merge Namespace operation on ListState data.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_sortedMapStateFirstEntryLatency

The maximum latency of accessing the first entry of SortedMapState data.

This metric measures the performance of accessing SortedMapState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_state_name_sortedMapStateLastEntryLatency

The maximum latency of accessing the last entry of SortedMapState data.

This metric measures the performance of accessing SortedMapState data by an operator.

Nanoseconds

Custom metric

flink_taskmanager_job_task_operator_geminiDB_total_size

The size of the state data.

This metric helps you perform the following operations:

  • Identify nodes in which state data bottlenecks occur, or identify such nodes in advance.

  • Check whether the TTL of state data takes effect.

Byte

Custom metric

flink_taskmanager_job_task_operator_geminiDB_total_filesize

The size of the state data file.

This metric helps you perform the following operations:

  • Check the size of the state data file in the local disk. You can take actions in advance if the size is large.

  • Determine whether the state data is excessively large if the local disk space is insufficient.

Byte

Custom metric

flink_taskmanager_job_task_currentInputWatermark

The time when each task receives the latest watermark.

This metric measures the latency of data receiving by the TaskManager.

N/A

Custom metric

flink_taskmanager_job_task_operator_watermarkLag

The latency of watermarks.

This metric measures the latency of subtasks.

Milliseconds

Custom metric

flink_jobmanager_Status_JVM_CPU_Load

The CPU load of the JobManager.

If the value of this metric is greater than 100% for an extended period of time, the CPU is busy and the CPU load is high. This may affect the system performance. As a result, issues such as system stuttering and slow response occur.

N/A

Basic metric

flink_jobmanager_Status_JVM_Memory_Heap_Used

The amount of heap memory of the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Memory_Heap_Committed

The amount of heap memory committed by the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Memory_Heap_Max

The maximum amount of heap memory of the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Memory_NonHeap_Used

The amount of non-heap memory of the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Memory_NonHeap_Committed

The amount of non-heap memory committed by the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Memory_NonHeap_Max

The maximum amount of non-heap memory of the JobManager.

None.

Byte

Basic metric

flink_jobmanager_Status_JVM_Threads_Count

The number of threads of the JobManager.

A large number of threads of the JobManager occupies excessive memory space. This reduces the job stability.

Count

Basic metric

flink_jobmanager_Status_JVM_GarbageCollector_ParNew_Count

The number of GCs performed within the JobManager.

Frequent GCs can lead to excessive memory consumption and negatively affect job performance. This metric helps diagnose job issues and identify the causes of job failures.

Count

Basic metric

flink_jobmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count

The number of young-generation GCs performed by the G1 garbage collector of the JobManager.

None.

Count

Custom metric

flink_jobmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count

The number of old-generation GCs performed by the G1 garbage collector of the JobManager.

None.

Count

Custom metric

flink_jobmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time

The time consumed by the G1 garbage collector of the JobManager to perform a young-generation GC.

None.

Milliseconds

Custom metric

flink_jobmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time

The time consumed by the G1 garbage collector of the JobManager to perform an old-generation GC.

None.

Milliseconds

Custom metric

flink_jobmanager_Status_JVM_GarbageCollector_ConcurrentMarkSweep_Count

The number of GCs performed by the Concurrent Mark Sweep (CMS) garbage collector of the JobManager.

None.

Count

Basic metric

flink_jobmanager_Status_JVM_GarbageCollector_ParNew_Time

The duration for which each GC of the JobManager lasts.

If GC of the JobManager lasts for an extended period of time, excessive memory space is occupied. This affects the job performance. This metric helps diagnose job issues and identify the causes of job failures.

Milliseconds

Basic metric

flink_jobmanager_Status_JVM_GarbageCollector_ConcurrentMarkSweep_Time

The time consumed by the CMS garbage collector of the JobManager to perform a GC.

None.

Milliseconds

Basic metric

flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded

The total number of classes that are loaded after the Java Virtual Machine (JVM) in which the JobManager resides is created.

If the total number of classes that are loaded is excessively large after the JVM in which the JobManager resides is created, excessive memory space is occupied. This affects the job performance.

N/A

Basic metric

flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded

The total number of classes that are unloaded after the JVM in which the JobManager resides is created.

If the total number of classes that are unloaded is excessively large after the JVM in which the JobManager resides is created, excessive memory space is occupied. This affects the job performance.

N/A

Basic metric

flink_taskmanager_Status_JVM_CPU_Load

The CPU load of the TaskManager.

This metric indicates the total number of processes that are running on the CPU and processes that are waiting for the CPU. In most cases, this metric indicates how busy the CPU is.

The value of this metric is related to the number of CPU cores that are used. The CPU load in Flink is calculated by using the following formula: CPU load = CPU utilization/Number of CPU cores. If the value of the flink_taskmanager_Status_JVM_CPU_Load metric is greater than the CPU load in Flink, CPU processing may be blocked.

N/A

Basic metric
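
For example, under the formula above, a TaskManager that reports a CPU utilization of 200% while running on 4 CPU cores corresponds to a CPU load of about 200%/4 = 0.5. The figures are illustrative only.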

flink_jobmanager_Status_ProcessTree_CPU_Usage

The CPU utilization of the JobManager.

This metric indicates the utilization of CPU time slices that are occupied by Flink.

  • If the value of this metric is 100%, one CPU core is used.

  • If the value of this metric is 400%, four CPU cores are used.

If the value of this metric is greater than 100% for an extended period of time, the CPU is busy.

If the CPU load is high but the CPU utilization is low, a large number of processes that are in the uninterruptible sleep state may be running due to frequent read and write operations.

N/A

Basic metric

flink_taskmanager_Status_ProcessTree_CPU_Usage

The CPU utilization of the TaskManager.

This metric indicates the utilization of CPU time slices that are occupied by Flink.

  • If the value of this metric is 100%, one CPU core is used.

  • If the value of this metric is 400%, four CPU cores are used.

If the value of this metric is greater than 100% for an extended period of time, the CPU is busy.

If the CPU load is high but the CPU utilization is low, a large number of processes that are in the uninterruptible sleep state may be running due to frequent read and write operations.

N/A

Basic metric

flink_taskmanager_Status_JVM_Memory_Heap_Used

The amount of heap memory of the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_JVM_Memory_Heap_Committed

The amount of heap memory committed by the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_JVM_Memory_Heap_Max

The maximum amount of heap memory of the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_JVM_Memory_NonHeap_Used

The amount of non-heap memory of the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_JVM_Memory_NonHeap_Committed

The amount of non-heap memory committed by the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_JVM_Memory_NonHeap_Max

The maximum amount of non-heap memory of the TaskManager.

None.

Byte

Basic metric

flink_taskmanager_Status_ProcessTree_Memory_RSS

The amount of memory consumed by the entire process on Linux.

This metric tracks changes in memory consumption of the process.

Byte

Basic metric

flink_taskmanager_Status_JVM_Threads_Count

The number of threads of the TaskManager.

A large number of threads of the TaskManager occupies excessive memory space. This reduces the job stability.

Count

Basic metric

flink_taskmanager_Status_JVM_GarbageCollector_ParNew_Count

The number of GCs performed within the TaskManager.

Frequent GCs can lead to excessive memory consumption and negatively affect job performance. This metric helps diagnose job issues and identify the causes of job failures.

Count

Basic metric

flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count

The number of young-generation GCs performed by the G1 garbage collector of the TaskManager.

None.

Count

Custom metric

flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count

The number of old-generation GCs performed by the G1 garbage collector of the TaskManager.

None.

Count

Custom metric

flink_taskmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time

The time consumed by the G1 garbage collector of the TaskManager to perform a young-generation GC.

None.

Milliseconds

Custom metric

flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time

The time consumed by the G1 garbage collector of the TaskManager to perform an old-generation GC.

None.

Milliseconds

Custom metric

flink_taskmanager_Status_JVM_GarbageCollector_ConcurrentMarkSweep_Count

The number of GCs performed by the CMS garbage collector of the TaskManager.

None.

Count

Basic metric

flink_taskmanager_Status_JVM_GarbageCollector_ParNew_Time

The duration for which each GC of the TaskManager lasts.

If GC of the TaskManager lasts for an extended period of time, excessive memory space is occupied. This affects the job performance. This metric helps diagnose job issues and identify the causes of job failures.

Milliseconds

Basic metric

flink_taskmanager_Status_JVM_GarbageCollector_ConcurrentMarkSweep_Time

The time consumed by the CMS garbage collector of the TaskManager to perform a GC.

None.

Milliseconds

Basic metric

flink_taskmanager_Status_JVM_ClassLoader_ClassesLoaded

The total number of classes that are loaded after the JVM in which the TaskManager resides is created.

If the total number of classes that are loaded is excessively large after the JVM in which the TaskManager resides is created, excessive memory space is occupied. This affects the job performance.

None.

Basic metric

flink_taskmanager_Status_JVM_ClassLoader_ClassesUnloaded

The total number of classes that are unloaded after the JVM in which the TaskManager resides is created.

If the total number of classes that are unloaded is excessively large after the JVM in which the TaskManager resides is created, excessive memory space is occupied. This affects the job performance.

None.

Basic metric

flink_jobmanager_job_uptime

The period during which the job runs.

None.

Milliseconds

Custom metric

flink_jobmanager_numRunningJobs

The number of jobs that are running.

None.

None.

Custom metric

flink_jobmanager_taskSlotsAvailable

The number of available task slots.

None.

None.

Custom metric

flink_jobmanager_taskSlotsTotal

The total number of task slots.

None.

None.

Custom metric

flink_jobmanager_numRegisteredTaskManagers

The number of registered TaskManagers.

None.

None.

Custom metric

flink_taskmanager_job_task_numBytesInRemotePerSecond

The number of bytes read from the remote source per second.

None.

Byte/s

Custom metric

flink_taskmanager_job_task_operator_numLateRecordsDropped

The number of data records that are dropped because they arrive too late for their windows.

None.

Count

Custom metric

flink_taskmanager_job_task_operator_lateRecordsDroppedRate

The rate at which late data records are dropped.

None.

None.

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_isSnapshotting

Specifies whether the job is in the snapshot phase.

This metric indicates the job processing phase.

None.

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_isBinlogReading

Specifies whether the job is in the incremental phase.

This metric indicates the job processing phase.

None.

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_numTablesRemaining

The number of tables that are waiting to be processed in the snapshot phase.

This metric measures the number of unprocessed tables.

Count

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_numTablesSnapshotted

The number of tables that have been processed in the snapshot phase.

This metric measures the number of processed tables.

Count

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_numSnapshotSplitsProcessed

The number of processed shards in the snapshot phase.

This metric measures the number of processed shards.

Count

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_namespace_schema_table_numSnapshotSplitsProcessed

The number of processed shards in the snapshot phase.

This metric measures the number of processed shards.

Count

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_numSnapshotSplitsRemaining

The number of shards that are waiting to be processed in the snapshot phase.

This metric measures the number of unprocessed shards.

Count

Custom metric

flink_jobmanager_job_operator_coordinator_enumerator_namespace_schema_table_numSnapshotSplitsRemaining

The number of shards that are waiting to be processed in the snapshot phase.

This metric measures the number of unprocessed shards.

Count

Custom metric

flink_taskmanager_job_task_operator_currentReadTimestampMs

The timestamp of the latest data record that is read.

This metric measures the time of the latest binary log data.

Milliseconds

Custom metric

flink_taskmanager_job_task_operator_numSnapshotRecords

The number of processed data records in the snapshot phase.

This metric measures the number of processed data records in the snapshot phase.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numRecordsIn

The number of data records that are read from each table.

This metric measures the total number of processed data records in each table.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numSnapshotRecords

The number of processed data records in each table in the snapshot phase.

This metric measures the number of processed data records in each table in the snapshot phase.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numInsertDMLRecords

The number of executed INSERT DML statements for each table in the incremental phase.

This metric measures the number of executed INSERT statements for each table.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numUpdateDMLRecords

The number of executed UPDATE DML statements for each table in the incremental phase.

This metric measures the number of executed UPDATE statements for each table.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numDeleteDMLRecords

The number of executed DELETE DML statements for each table in the incremental phase.

This metric measures the number of executed DELETE statements for each table.

Count

Custom metric

flink_taskmanager_job_task_operator_namespace_schema_table_numDDLRecords

The number of executed DDL statements for each table in the incremental phase.

This metric measures the number of executed DDL statements for each table.

Count

Custom metric

flink_taskmanager_job_task_operator_numInsertDMLRecords

The number of executed INSERT DML statements in the incremental phase.

This metric measures the number of executed INSERT statements.

Count

Custom metric

flink_taskmanager_job_task_operator_numUpdateDMLRecords

The number of executed UPDATE DML statements in the incremental phase.

This metric measures the number of executed UPDATE statements.

Count

Custom metric

flink_taskmanager_job_task_operator_numDeleteDMLRecords

The number of executed DELETE DML statements in the incremental phase.

This metric measures the number of executed DELETE statements.

Count

Custom metric

flink_taskmanager_job_task_operator_numDDLRecords

The number of executed DDL statements in the incremental phase.

This metric measures the number of executed DDL statements.

Count

Custom metric

Common metric labels

Label

Description

vvpNamespace

The name of the namespace.

deploymentName

The name of the deployment.

deploymentId

The deployment ID.

jobId

The job ID.
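
These labels can be used to scope PromQL queries to a single deployment. A minimal sketch; the label values are placeholders, and whether every Flink metric carries all of these labels is an assumption to verify.

  # Restart count for one deployment in one namespace (placeholder label values).
  flink_jobmanager_job_numRestarts{vvpNamespace="default", deploymentName="my-deployment"}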

Others

For more information about the metrics of Application Real-Time Monitoring Service (ARMS) Application Monitoring, see Application Monitoring metrics.