This topic describes how to use Managed Service for Prometheus to monitor Alibaba Cloud E-MapReduce (EMR).

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Limit

You can install the component only for Prometheus instances for ECS.

Step 1: Enable the Prometheus port for taihao-exporter

After you create an EMR cluster, the system automatically installs taihao-exporter on the Elastic Compute Service (ECS) instances of the cluster. You must manually enable the Prometheus port.

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS. On the EMR on ECS page, obtain the ID of the EMR cluster, and then click the name of the cluster.
  2. Click the Nodes tab. Find the master node and core node, and click Details. In the Basic Information section of the Instance Details tab, click Connect to remotely log on to the ECS instance.
  3. Run the following command to query the exporter process.
    ps -ef | grep taihao_exporter
  4. Run the following command to add prom_sink_enable=true to the taihao_exporter.yaml file and restart the service:
    sed -i 's/prom_sink_enable:\s*false/prom_sink_enable: true/g' /usr/local/taihao_exporter/taihao_exporter.yaml
    service taihao_exporter restart
    Note You need to modify the configurations of all nodes.
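  5. Optional. Run the following command to check whether the exporter exposes Prometheus metrics. This is a minimal check that assumes the default exporter port 9712 and metrics path /metrics_preget described later in this topic. Adjust the values if your cluster uses different settings:
    # Print the first lines of the exporter output. The port and path are the
    # defaults used by the EMR integration and may differ in your environment.
    curl -s http://localhost:9712/metrics_preget | head -n 20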

Step 2: Integrate EMR into Managed Service for Prometheus

Entry points

Entry point 1

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.

Entry point 2

  1. Log on to the Application Real-time Monitoring Service (ARMS) console.
  2. In the left-side navigation pane, click Integration Center. In the Application Components section, find the E-MapReduce component and click Add. In the panel that appears, follow the instructions to add the component.

Configure the EMR component

This section describes how to configure the EMR component in the integration center of the Prometheus instance. Perform the following steps:

  1. Add the EMR component.
    • If you install the EMR component for the first time, perform the following operation:

      In the Uninstalled section of the Integration Center page, find the E-MapReduce component and click Install.

      Note You can click the component to view the common EMR metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. After you install the EMR component, you can view the actual metrics. For more information, see E-MapReduce monitoring metrics.
    • If you have already installed the EMR component, perform the following operation to add it again:

      In the Installed section of the Integration Center page, find the E-MapReduce component and click Add.

  2. On the Configurations tab in the STEP2 section, configure the parameters and click OK. The following table describes the parameters.
    Parameter | Description
    EMR Cluster ID | The ID of the EMR cluster obtained in Step 1: Enable the Prometheus port for taihao-exporter.
    EMR Cluster Name | The name of the EMR cluster.
    Exporter Name | The name of the current exporter.
      • The name can contain only lowercase letters, digits, and hyphens (-), and cannot start or end with a hyphen (-).
      • The name must be unique.
    Exporter Port Number | The port on which the exporter exposes metrics. Managed Service for Prometheus accesses this port to obtain metric data. Default value: 9712.
    Metrics Path | The HTTP path from which Managed Service for Prometheus collects metric data from the exporter. Default value: /metrics_preget.
    Metrics Collection Interval (Seconds) | The interval at which EMR metrics are collected. Default value: 30.
    ECS Tag (Service Discovery) | The ECS tag that is used to deploy the exporter. Managed Service for Prometheus uses this tag for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType.
    ECS Tag Value | The values of the ECS tag. Default values: CORE,MASTER. Separate multiple values with commas (,).
    Note You can view the monitoring metrics on the Metrics tab in the STEP2 section.

    The installed components are displayed in the Installed section of the Integration Center page. Click the component. In the panel that appears, you can view information such as targets, metrics, dashboard, alerts, service discovery configurations, and exporters. For more information, see Integration center.
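For reference, the parameters in the preceding table correspond to the fields of a standard Prometheus scrape job. The following sketch is for illustration only: Managed Service for Prometheus generates the scrape configuration automatically after you click OK, so you do not need to create this file. The job name and target IP address are placeholders, and the other values are the defaults from the table.

# Hypothetical self-managed equivalent of the EMR integration, shown only to
# clarify how the parameters map to a Prometheus scrape job.
cat <<'EOF' > emr-scrape-job-example.yaml
scrape_configs:
  - job_name: emr-taihao-exporter        # placeholder name
    metrics_path: /metrics_preget        # Metrics Path
    scrape_interval: 30s                 # Metrics Collection Interval (Seconds)
    static_configs:
      - targets: ['192.168.0.10:9712']   # <node IP address>:<Exporter Port Number>
EOF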

Step 3: View monitoring data

Managed Service for Prometheus provides more than 20 Grafana dashboards for E-MapReduce, such as HOST, HDFS, Hive, YARN, Impala, ZooKeeper, Spark, Flink, and ClickHouse.

On the Integration Center page, click the E-MapReduce component in the Installed section. In the panel that appears, click the Dashboards tab to view the thumbnails and hyperlinks of EMR dashboards. Click a hyperlink to go to the Grafana page and view the dashboard. This section describes the monitoring metrics of common dashboards.
  • HOST dashboard: displays the CPU utilization, memory usage, disk space, load, network, and sockets of the ECS instance.
  • HDFS dashboard
    • HDFS-HOME
    • HDFS-NameNodes
    • HDFS-DataNodes
    • HDFS-JournalNodes
  • Hive dashboard
    • HiveServer2: the HiveQL query server that receives SQL requests from JDBC clients.
    • HiveMetaStore: the metadata management module that is used to store metadata such as database and table information.
  • YARN dashboard
    • HOME: displays the cluster status, memory, tasks, nodes, and containers.
    • NodeManager: manages and monitors node resources and executes jobs on nodes.
    • ResourceManager: manages and schedules cluster resources and allocates resources for various types of jobs that are running on YARN.
    • TimeLineServer: collects the metrics of a job and displays the job execution status.
    • JobHistory
  • Kafka dashboard
    • KAFKA-HOME
    • KAFKA-Broker
    • KAFKA-Topic
  • ClickHouse dashboard
  • Flink dashboard
  • Impala dashboard
  • ZooKeeper dashboard
  • Spark dashboard

E-MapReduce monitoring metrics

Metrics

Managed Service for Prometheus provides various Grafana dashboards for E-MapReduce, such as HOST, HDFS, Hive, YARN, Kafka, ZooKeeper, Flink, and ClickHouse.

HOST metrics

HOST metrics include the CPU utilization, memory usage, disk space, load, network, and sockets of the ECS instance.

HDFS metrics

Hadoop Distributed File System (HDFS) is suitable for distributed reading and writing of large-scale data, especially in scenarios with many reads and few writes. HDFS metrics include HOME, NameNodes, DataNodes, and JournalNodes.

YARN metrics

YARN is the core component of the Hadoop system. YARN manages resources in Hadoop clusters, and schedules and monitors jobs in the clusters. YARN metrics include HOME, Queue, ResourceManager, NodeManager, TimeLineServer, and JobHistory.

Hive metrics

Hive is a Hadoop-based data warehouse framework. It is used to extract, transform, and load data and manage metadata in big data scenarios. Hive consists of HiveServer2, Hive MetaStore, and Hive Client. HiveServer2 is the HiveQL query server and HiveMetaStore is the metadata management module. Hive metrics include HiveMetaStore and HiveServer2.
  • HiveMetaStore
    Metric | Description
    hive_memory_heap_max | The maximum available heap memory of the JVM. Unit: bytes.
    hive_memory_heap_used | The heap memory used by the JVM. Unit: bytes.
    hive_memory_non_heap_used | The amount of off-heap memory used by the JVM. Unit: bytes.
    hive_active_calls_api_alter_table | The number of active alter table requests.
    hive_active_calls_api_create_table | The number of active create table requests.
    hive_active_calls_api_drop_table | The number of active drop table requests.
    hive_api_alter_table | The average duration of alter table requests. Unit: milliseconds.
    hive_api_alter_table_with_environment_context | The average duration of alter table with env context requests. Unit: milliseconds.
    hive_api_create_table | The average duration of create table requests. Unit: milliseconds.
    hive_api_create_table_with_environment_context | The average duration of create table with env context requests. Unit: milliseconds.
    api_drop_table | The average duration of drop table requests. Unit: milliseconds.
    hive_api_drop_table_with_environment_context | The average duration of drop table with env context requests. Unit: milliseconds.
    hive_api_get_all_databases | The average duration of get all databases requests. Unit: milliseconds.
    hive_api_get_all_functions | The average duration of get all functions requests. Unit: milliseconds.
    hive_api_get_database | The average duration of get database requests. Unit: milliseconds.
    hive_api_get_multi_table | The average duration of get multi table requests. Unit: milliseconds.
    hive_api_get_tables_by_type | The average duration of get table requests. Unit: milliseconds.
    hive_api_get_table_objects_by_name_req | The average duration of get table objects by name requests. Unit: milliseconds.
    hive_api_get_table_req | The average duration of get table req requests. Unit: milliseconds.
    hive_api_get_table_statistics_req | The average duration of get table statistics requests. Unit: milliseconds.
    hive_api_get_tables | The average duration of get tables requests. Unit: milliseconds.
    hive_api_get_tables_by_type | The average duration of get tables by type requests. Unit: milliseconds.
  • HiveServer2
    Metric | Description
    hive_metrics_hs2_active_sessions | The number of active sessions.
    hive_metrics_memory_total_init | The total initialized memory of the JVM. Unit: bytes.
    hive_metrics_memory_total_committed | The total memory reserved by the JVM. Unit: bytes.
    hive_metrics_memory_total_max | The maximum available memory of the JVM. Unit: bytes.
    hive_metrics_memory_heap_committed | The heap memory reserved by the JVM. Unit: bytes.
    hive_metrics_memory_heap_init | The heap memory initialized by the JVM. Unit: bytes.
    hive_metrics_memory_non_heap_committed | The off-heap memory reserved by the JVM. Unit: bytes.
    hive_metrics_memory_non_heap_init | The off-heap memory initialized by the JVM. Unit: bytes.
    hive_metrics_memory_non_heap_max | The maximum available off-heap memory of the JVM. Unit: bytes.
    hive_metrics_gc_PS_MarkSweep_count | The number of PS MarkSweep GCs in the JVM.
    hive_metrics_gc_PS_MarkSweep_time | The time consumed by PS MarkSweep GCs in the JVM. Unit: milliseconds.
    hive_metrics_gc_PS_Scavenge_time | The time consumed by PS Scavenge GCs in the JVM. Unit: milliseconds.
    hive_metrics_threads_daemon_count | The number of JVM daemon threads.
    hive_metrics_threads_count | The total number of JVM threads.
    hive_metrics_threads_blocked_count | The number of blocked JVM threads.
    hive_metrics_threads_deadlock_count | The number of deadlocked JVM threads.
    hive_metrics_threads_new_count | The number of new JVM threads.
    hive_metrics_threads_runnable_count | The number of runnable JVM threads.
    hive_metrics_threads_terminated_count | The number of terminated JVM threads.
    hive_metrics_threads_waiting_count | The number of waiting JVM threads.
    hive_metrics_threads_timed_waiting_count | The number of timed_waiting JVM threads.
    hive_metrics_memory_heap_max | The maximum available heap memory of the JVM. Unit: bytes.
    hive_metrics_memory_heap_used | The heap memory used by the JVM. Unit: bytes.
    hive_metrics_memory_non_heap_used | The amount of off-heap memory used by the JVM. Unit: bytes.
    hive_metrics_hs2_open_sessions | The number of opened sessions.
    hive_metrics_hive_mapred_tasks | The total number of submitted Hive on MR jobs.
    hive_metrics_hive_tez_tasks | The total number of submitted Hive on Tez jobs.
    hive_metrics_cumulative_connection_count | The cumulative number of connections.
    hive_metrics_active_calls_api_runTasks | The number of current runTasks requests.
    hive_metrics_hs2_completed_sql_operation_FINISHED | The total number of completed SQL statements.
    hive_metrics_hs2_sql_operation_active_user | The number of active users.
    hive_metrics_open_connections | The number of opened connections.
    hive_metrics_api_PostHook_com_aliyun_emr_meta_hive_hook_LineageLoggerHook | The average time to execute the LineageLoggerHook. Unit: milliseconds.
    hive_metrics_api_hs2_sql_operation_PENDING | The average time that SQL tasks are in the PENDING state. Unit: milliseconds.
    hive_metrics_api_hs2_sql_operation_RUNNING | The average time that SQL tasks are in the RUNNING state. Unit: milliseconds.
    hive_metrics_hs2_submitted_queries | The average time taken to submit a query. Unit: milliseconds.
    hive_metrics_hs2_executing_queries | The average time taken to execute a query. Unit: milliseconds.
    hive_metrics_hs2_succeeded_queries | The number of queries that succeed after the service is started.
    hive_metrics_hs2_failed_queries | The number of queries that fail after the service is started.

ZooKeeper metrics

ZooKeeper is a distributed and highly available coordination service. ZooKeeper provides features such as distributed configuration, synchronization, naming, and registration.
Metric | Description
zk_packets_received | The number of packets received by ZooKeeper.
zk_packets_sent | The number of packets sent by ZooKeeper.
zk_avg_latency | The average latency of ZooKeeper requests. Unit: milliseconds.
zk_min_latency | The minimum latency of ZooKeeper requests. Unit: milliseconds.
zk_max_latency | The maximum latency of ZooKeeper requests. Unit: milliseconds.
zk_watch_count | The number of ZooKeeper watches.
zk_znode_count | The number of ZooKeeper znodes.
zk_num_alive_connections | The number of alive ZooKeeper connections.
zk_outstanding_requests | The number of queued ZooKeeper requests. A larger value indicates that ZooKeeper has more difficulty keeping up with incoming requests.
zk_approximate_data_size | The approximate size of the ZooKeeper data. Unit: bytes.
zk_open_file_descriptor_count | The number of file descriptors opened by ZooKeeper.
zk_max_file_descriptor_count | The maximum number of file descriptors that ZooKeeper can open.
zk_node_status | The status of a ZooKeeper node. Valid values:
  • -1: The node is unavailable.
  • 0: The node serves as a follower node.
  • 1: The node serves as a leader node.
zk_synced_followers | The number of synchronized ZooKeeper followers.
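
Most of the zk_* metrics in this table mirror the values that ZooKeeper reports through its mntr four-letter-word command, so you can cross-check a dashboard value directly on a ZooKeeper node. The following sketch assumes the default client port 2181 and that mntr is allowed by the 4lw.commands.whitelist setting of your ZooKeeper deployment.

# Cross-check a few ZooKeeper metrics on a node. Assumes the default client
# port 2181 and that the "mntr" four-letter word is whitelisted.
echo mntr | nc localhost 2181 | grep -E 'zk_(avg_latency|outstanding_requests|num_alive_connections)'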

Kafka metrics

ApsaraMQ for Kafka is a distributed, high-throughput, and scalable message queue service provided by Alibaba Cloud. It is used in big data scenarios such as log collection, monitoring data aggregation, streaming data processing, and online and offline analysis, and is an important part of the big data ecosystem.

Impala metrics

Impala provides high-performance and low-latency SQL queries for data stored in Apache Hadoop.
Metric | Description
impala_impala_server_resultset_cache_total_bytes | The size of the result set cache. Unit: bytes.
impala_num_executing_queries | The number of queries that are being executed.
impala_num_waiting_queries | The number of waiting queries.
impala_impala_server_query_durations_ms_95th | The query duration at the 95th percentile. Unit: milliseconds.
impala_num_in_flight_queries | The number of in-flight queries in the cluster.
impala_impala_server_query_durations_ms_75th | The query duration at the 75th percentile. Unit: milliseconds.
impala_impala_thrift_server_CatalogService_svc_thread_wait_time_99_9th | The amount of time that the Catalog Service client waits for a service thread, at the 99.9th percentile. Unit: milliseconds.
impala_impala_thrift_server_CatalogService_connection_setup_time_99_9th | The amount of time that the Catalog Service client spends waiting to establish a connection, at the 99.9th percentile. Unit: milliseconds.
impala_impala_server_query_durations_ms_99_9th | The query duration at the 99.9th percentile. Unit: milliseconds.
impala_impala_server_ddl_durations_ms_99_9th | The duration of Data Definition Language (DDL) operations at the 99.9th percentile. Unit: milliseconds.
impala_impala_server_query_durations_ms_90th | The query duration at the 90th percentile. Unit: milliseconds.
impala_impala_server_ddl_durations_ms_90th | The duration of DDL operations at the 90th percentile. Unit: milliseconds.
impala_impala_server_query_durations_ms_50th | The query duration at the 50th percentile. Unit: milliseconds.
impala_impala_server_ddl_durations_ms_50th | The duration of DDL operations at the 50th percentile. Unit: milliseconds.
impala_impala_server_ddl_durations_ms_95th | The duration of DDL operations at the 95th percentile. Unit: milliseconds.
impala_impala_server_scan_ranges_num_missing_volume_id | The total number of scan ranges with missing volume IDs during the process lifecycle.
impala_impala_server_ddl_durations_ms_75th | The duration of DDL operations at the 75th percentile. Unit: milliseconds.
impala_impala_server_num_queries_spilled | The number of queries in which operators spilled data.
impala_impala_server_scan_ranges_total | The total number of scan ranges read during the process lifecycle.
impala_impala_server_num_queries_expired | The number of queries that expire due to inactivity.
impala_impala_server_resultset_cache_total_num_rows | The number of cached rows in the result set.
impala_impala_server_num_open_hiveserver2_sessions | The number of open HiveServer2 sessions.
impala_impala_server_num_sessions_expired | The number of sessions that expire due to inactivity.
impala_impala_server_num_fragments_in_flight | The number of fragment instances that are being executed.
impala_impala_server_num_queries_registered | The total number of queries registered on the Impala server instance, including queries that are in progress and waiting to be closed.
impala_impala_server_num_files_open_for_insert | The number of HDFS files opened for writing.
impala_impala_server_num_queries | The total number of queries processed during the process lifecycle.
impala_impala_server_hedged_read_ops | The total number of hedged reads attempted during the process lifecycle.
impala_impala_server_num_open_beeswax_sessions | The number of open Beeswax sessions.
impala_impala_server_backend_num_queries_executed | The total number of queries executed on the backend during the process lifecycle.
impala_impala_server_num_fragments | The total number of fragments executed during the process lifecycle.
impala_rpc_impala_ControlService_rpcs_queue_overflow | The total number of incoming ControlService RPCs rejected due to service queue overflow.
impala_impala_server_hedged_read_ops_win | The total number of times that a hedged read was faster than a regular read operation.
impala_mem_tracker_DataStreamService_current_usage_bytes | The number of bytes currently used by the DataStreamService MemTracker.
impala_impala_server_backend_num_queries_executing | The number of queries that are being executed on the backend.
impala_cluster_membership_executor_groups_total_healthy | The total number of healthy executor groups.
impala_rpc_impala_DataStreamService_rpcs_queue_overflow | The total number of incoming DataStreamService remote procedure calls (RPCs) rejected due to service queue overflow.
impala_cluster_membership_backends_total | The total number of backends registered with the statestore.
impala_mem_tracker_DataStreamService_peak_usage_bytes | The peak number of bytes used by the DataStreamService MemTracker.
impala_total_senders_blocked_on_recvr_creation | The total number of senders that have been blocked waiting for the receiving fragment to be initialized.
impala_mem_tracker_ControlService_peak_usage_bytes | The peak number of bytes used by the ControlService MemTracker.
impala_simple_scheduler_local_assignments_total | The number of local assignments.
impala_mem_tracker_ControlService_current_usage_bytes | The number of bytes currently used by the ControlService MemTracker.
impala_memory_total_used | The total memory used. Unit: bytes.
impala_cluster_membership_executor_groups_total | The total number of executor groups with at least one executor.
impala_memory_rss | The resident set size (RSS) of the process, including TCMalloc, the buffer pool, and the JVM. Unit: bytes.
impala_total_senders_timedout_waiting_for_recvr_creation | The total number of senders that timed out while waiting for the receiving fragment to be initialized.
impala_senders_blocked_on_recvr_creation | The number of senders that are currently blocked waiting for the receiving fragment to be initialized.
impala_simple_scheduler_assignments_total | The total number of assignments.
impala_memory_mapped_bytes | The virtual memory size of the process. Unit: bytes.
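
Impala daemons also expose these counters on their built-in debug web UI, which is convenient when you want to compare a dashboard value with what the daemon itself reports. The following command is a sketch that assumes the default impalad web UI port 25000; the port may differ in your deployment.

# Inspect raw query counters on a node that runs impalad. The debug web UI
# port 25000 is an assumption based on the Impala default.
curl -s http://localhost:25000/metrics | grep -i 'num-queries'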

HUE metrics

Metric | Description
hue_requests_response_time_avg | The average response duration of requests.
hue_requests_response_time_95_percentile | The response duration of requests at the 95th percentile.
hue_requests_response_time_std_dev | The standard deviation of the request response duration.
hue_requests_response_time_median | The response duration of requests at the 50th percentile.
hue_requests_response_time_75_percentile | The response duration of requests at the 75th percentile.
hue_requests_response_time_count | The number of request response durations.
hue_requests_response_time_5m_rate | The request response rate in the last 5 minutes.
hue_requests_response_time_min | The minimum request response duration.
hue_requests_response_time_sum | The total response duration of the requests.
hue_requests_response_time_max | The maximum request response duration.
hue_requests_response_time_mean_rate | The average request response rate.
hue_requests_response_time_99_percentile | The response duration of requests in the last hour at the 99th percentile.
hue_requests_response_time_15m_rate | The request response rate in the last 15 minutes.
hue_requests_response_time_999_percentile | The response duration of requests at the 99.9th percentile.
hue_requests_response_time_1m_rate | The request response rate in the last 1 minute.
hue_users_active_total | The total number of active users.
hue_users_active | The number of active users in the last hour.
hue_users | The total number of users.
hue_threads_total | The total number of threads.
hue_threads_daemon | The number of daemon threads.
hue_queries_number | The total number of queries.
hue_requests_exceptions | The number of requests with exceptions.
hue_requests_active | The number of active requests.

Kudu metrics

Parameter | Metric | Description
op_apply_queue_length (99) | kudu_op_apply_queue_length_percentile_99 | The length of the operation queue at the 99th percentile.
op_apply_queue_length (75) | kudu_op_apply_queue_length_percentile_75 | The length of the operation queue at the 75th percentile.
op_apply_queue_length (mean) | kudu_op_apply_queue_length_mean | The average length of the operation queue.
rpc_incoming_queue_time (99) | kudu_rpc_incoming_queue_time_percentile_99 | The waiting duration of the RPC queue at the 99th percentile. Unit: μs.
rpc_incoming_queue_time (75) | kudu_rpc_incoming_queue_time_percentile_75 | The waiting duration of the RPC queue at the 75th percentile. Unit: μs.
rpc_incoming_queue_time (mean) | kudu_rpc_incoming_queue_time_mean | The average waiting duration of the RPC queue. Unit: μs.
reactor_load_percent (99) | kudu_reactor_load_percent_percentile_99 | The load of the Reactor thread at the 99th percentile.
reactor_load_percent (75) | kudu_reactor_load_percent_percentile_75 | The load of the Reactor thread at the 75th percentile.
reactor_load_percent (mean) | kudu_reactor_load_percent_mean | The average load of the Reactor thread.
op_apply_run_time (99) | kudu_op_apply_run_time_percentile_99 | The execution duration at the 99th percentile. Unit: μs.
op_apply_run_time (75) | kudu_op_apply_run_time_percentile_75 | The execution duration at the 75th percentile. Unit: μs.
op_apply_run_time (mean) | kudu_op_apply_run_time_mean | The average execution duration. Unit: μs.
op_prepare_run_time (99) | kudu_op_prepare_run_time_percentile_99 | The preparation duration at the 99th percentile. Unit: μs.
op_prepare_run_time (75) | kudu_op_prepare_run_time_percentile_75 | The preparation duration at the 75th percentile. Unit: μs.
op_prepare_run_time (mean) | kudu_op_prepare_run_time_mean | The average preparation duration. Unit: μs.
flush_mrs_duration (99) | kudu_flush_mrs_duration_percentile_99 | The MemRowSet flush time at the 99th percentile. Unit: milliseconds.
flush_mrs_duration (75) | kudu_flush_mrs_duration_percentile_75 | The MemRowSet flush time at the 75th percentile. Unit: milliseconds.
flush_mrs_duration (mean) | kudu_flush_mrs_duration_mean | The average MemRowSet flush time. Unit: milliseconds.
log_append_latency (99) | kudu_log_append_latency_percentile_99 | The append time of the logs at the 99th percentile. Unit: μs.
log_append_latency (75) | kudu_log_append_latency_percentile_75 | The append time of the logs at the 75th percentile. Unit: μs.
log_append_latency (mean) | kudu_log_append_latency_mean | The average append time of the logs. Unit: μs.
flush_dms_duration (99) | kudu_flush_dms_duration_percentile_99 | The DeltaMemStore flush time at the 99th percentile. Unit: milliseconds.
flush_dms_duration (75) | kudu_flush_dms_duration_percentile_75 | The DeltaMemStore flush time at the 75th percentile. Unit: milliseconds.
flush_dms_duration (mean) | kudu_flush_dms_duration_mean | The average DeltaMemStore flush time. Unit: milliseconds.
op_prepare_queue_length (99) | kudu_op_prepare_queue_length_percentile_99 | The length of the preparation queue at the 99th percentile.
op_prepare_queue_length (75) | kudu_op_prepare_queue_length_percentile_75 | The length of the preparation queue at the 75th percentile.
op_prepare_queue_length (mean) | kudu_op_prepare_queue_length_mean | The average length of the preparation queue.
log_gc_duration (99) | kudu_log_gc_duration_percentile_99 | The GC duration of the logs at the 99th percentile. Unit: milliseconds.
log_gc_duration (75) | kudu_log_gc_duration_percentile_75 | The GC duration of the logs at the 75th percentile. Unit: milliseconds.
log_gc_duration (mean) | kudu_log_gc_duration_mean | The average GC duration of the logs. Unit: milliseconds.
log_sync_latency (99) | kudu_log_sync_latency_percentile_99 | The sync duration of the logs at the 99th percentile. Unit: μs.
log_sync_latency (75) | kudu_log_sync_latency_percentile_75 | The sync duration of the logs at the 75th percentile. Unit: μs.
log_sync_latency (mean) | kudu_log_sync_latency_mean | The average sync duration of the logs. Unit: μs.
prepare_queue_time (99) | kudu_op_prepare_queue_time_percentile_99 | The waiting duration of the preparation queue at the 99th percentile. Unit: μs.
prepare_queue_time (75) | kudu_op_prepare_queue_time_percentile_75 | The waiting duration of the preparation queue at the 75th percentile. Unit: μs.
prepare_queue_time (mean) | kudu_op_prepare_queue_time_mean | The average waiting duration of the preparation queue. Unit: μs.
rpc_connections_accepted | kudu_rpc_connections_accepted | The number of accepted RPC connections.
block_cache_usage | kudu_block_cache_usage | The cache usage of the TServer blocks. Unit: bytes.
active_scanners | kudu_active_scanners | The number of active scanners.
data_dirs_full | kudu_data_dirs_full | The number of data directories in the full state.
rpcs_queue_overflow | kudu_rpcs_queue_overflow | The number of times that the RPC queue overflows.
cluster_replica_skew | kudu_cluster_replica_skew | The difference between the maximum number of tablets and the minimum number of tablets hosted on the server.
log_gc_running | kudu_log_gc_running | The number of logs during GCs.
data_dirs_failed | kudu_data_dirs_failed | The number of invalid data directories.
leader_memory_pressure_rejections | kudu_leader_memory_pressure_rejections | The number of requests rejected due to memory pressure.
transaction_memory_pressure_rejections | kudu_transaction_memory_pressure_rejections | The number of transactions rejected due to memory pressure.

ClickHouse metrics

EMR ClickHouse is compatible with open source ClickHouse. It optimizes read and write performance and makes it easier to integrate ClickHouse with other EMR components.
Metric | Description
clickhouse_server_events_ReplicatedPartFailedFetches | The number of times that data cannot be obtained from replicas in the Replicated*MergeTree table.
clickhouse_server_events_ReplicatedPartChecksFailed | The number of times that data in the Replicated*MergeTree table fails to be checked.
clickhouse_server_events_ReplicatedDataLoss | The number of times that data in the Replicated*MergeTree table is not in a replica.
clickhouse_server_events_ReplicatedMetaDataChecksFailed | The number of times that the metadata of the Replicated*MergeTree table fails to be checked.
clickhouse_server_events_ReplicatedMetaDataLoss | The number of times that metadata is lost in the Replicated*MergeTree table.
clickhouse_server_events_DuplicatedInsertedBlocks | The number of duplicate blocks written to the Replicated*MergeTree table.
clickhouse_server_events_ZooKeeperUserExceptions | The number of times that errors related to the ClickHouse status occur in ZooKeeper.
clickhouse_server_events_ZooKeeperHardwareExceptions | The number of ZooKeeper network errors and other errors.
clickhouse_server_events_ZooKeeperOtherExceptions | The number of ZooKeeper errors other than hardware or status errors.
clickhouse_server_events_DistributedConnectionFailTry | The number of retry errors of the distributed connection.
clickhouse_server_events_DistributedConnectionMissingTable | The number of times that the distributed connection fails to find the table.
clickhouse_server_events_DistributedConnectionStaleReplica | The number of times that the replicas obtained by the distributed connection are not fresh.
clickhouse_server_events_DistributedConnectionFailAtAll | The number of times that the distributed connection fails after all retries.
clickhouse_server_events_SlowRead | The number of slow reads.
clickhouse_server_events_ReadBackoff | The number of threads reduced due to slow reads.
clickhouse_server_metrics_BackgroundPoolTask | The number of tasks in the background_pool.
clickhouse_server_metrics_BackgroundMovePoolTask | The number of tasks in the background_move_pool.
clickhouse_server_metrics_BackgroundSchedulePoolTask | The number of tasks in the schedule_pool.
clickhouse_server_metrics_BackgroundBufferFlushSchedulePoolTask | The number of tasks in the buffer_flush_schedule_pool.
clickhouse_server_metrics_BackgroundDistributedSchedulePoolTask | The number of tasks in the distributed_schedule_pool.
clickhouse_server_metrics_BackgroundTrivialSchedulePoolTask | The number of tasks in the trivial_schedule_pool.
clickhouse_server_metrics_TCPConnection | The number of TCP connections.
clickhouse_server_metrics_HTTPConnection | The number of HTTP connections.
clickhouse_server_metrics_InterserverConnection | The number of connections used to obtain data from other replicas.
clickhouse_server_metrics_MemoryTracking | The total memory used by the server. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundProcessingPool | The memory used for task execution in the background_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundMoveProcessingPool | The memory used for task execution in the background_move_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundBufferFlushSchedulePool | The memory used for task execution in the buffer_flush_schedule_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundSchedulePool | The memory used for task execution in the schedule_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundDistributedSchedulePool | The memory used for task execution in the distributed_schedule_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingInBackgroundTrivialSchedulePool | The memory used for task execution in the trivial_schedule_pool. Unit: bytes.
clickhouse_server_metrics_MemoryTrackingForMerges | The memory used by the background merge operation. Unit: bytes.

Flink metrics

Flink is an execution engine for streaming data. It provides data distribution, data communication, and fault tolerance mechanisms for distributed computing over data streams.
  • Overview
    Parameter | Metric | Description
    Num Of RunningJobs | numRunningJobs | The number of jobs running in the JM.
    Job Uptime | job_uptime | The uptime of the job. Unit: milliseconds. Only single series or tables can be returned.
    TaskSlots Available | taskSlotsAvailable | The number of available TaskSlots.
    TaskSlots Total | taskSlotsTotal | The total number of TaskSlots.
    Num of TM | numRegisteredTaskManagers | The number of registered TMs.
    sourceIdleTime | sourceIdleTime | The duration during which the source does not process records. Unit: milliseconds.
    currentFetchEventTimeLag | currentFetchEventTimeLag | The difference between the time when the data is generated and the time when the Flink Source fetches the data.
    currentEmitEventTimeLag | currentEmitEventTimeLag | The difference between the time when the data is generated and the time when the Flink Source emits the data.
  • Checkpoint
    Parameter | Metric | Description
    Num of Checkpoints | totalNumberOfCheckpoints | The total number of checkpoints.
     | numberOfFailedCheckpoints | The number of checkpoints that fail.
     | numberOfCompletedCheckpoints | The number of completed checkpoints.
     | numberOfInProgressCheckpoints | The number of checkpoints in progress.
    lastCheckpointDuration | lastCheckpointDuration | The time taken to complete the last checkpoint. Unit: milliseconds.
    lastCheckpointSize | lastCheckpointSize | The size of the last checkpoint. Unit: bytes.
    lastCheckpointRestoreTimestamp | lastCheckpointRestoreTimestamp | The recovery time of the last checkpoint on the coordinator. Unit: milliseconds.
  • Network
    Parameter | Metric | Description
    InPool Usage | inPoolUsage | The size of the used input buffer.
    OutPool Usage | outPoolUsage | The size of the used output buffer.
    OutputQueue Length | outputQueueLength | The number of output queues.
    InputQueue Length | inputQueueLength | The number of input queues.
  • IO
    Parameter | Metric | Description
    numBytesIn PerSecond | numBytesInLocalPerSecond | The number of bytes read from the local server per second.
     | numBytesInRemotePerSecond | The number of bytes read from the remote server per second.
     | numBuffersInLocalPerSecond | The number of buffers read from the local server per second.
     | numBuffersInRemotePerSecond | The number of buffers read from the remote server per second.
    numBytesOut PerSecond | numBytesOutPerSecond | The number of bytes sent per second.
     | numBuffersOutPerSecond | The number of outgoing buffers per second.
    Task numRecords I/O PerSecond | numRecordsInPerSecond | The number of records received per second.
     | numRecordsOutPerSecond | The number of records sent per second.
    Task numRecords I/O | numRecordsIn | The number of records received.
     | numRecordsOut | The number of records sent.
    Operator CurrentSendTime | currentSendTime | The time consumed to send the last record. Unit: milliseconds.
  • Watermark
    Parameter | Metric | Description
    Task InputWatermark | currentInputWatermark | The time when the task receives the last watermark. Unit: milliseconds.
    Operator In/Out Watermark | currentInputWatermark | The time when the operator receives the last watermark. Unit: milliseconds.
     | currentOutputWatermark | The time when the operator sends the last watermark. Unit: milliseconds.
    watermarkLag | watermarkLag | The latency of the watermark. Unit: milliseconds.
  • CPU
    Parameter | Metric | Description
    JM CPU Load | CPU_Load | The JM CPU utilization.
    TM CPU Load | CPU_Load | The TM CPU utilization.
    CPU Usage | CPU_Usage | The TM CPU utilization that is calculated based on the ProcessTree.
  • Memory
    Parameter | Metric | Description
    JM Heap Memory | Memory_Heap_Used | The used JM Heap Memory. Unit: bytes.
     | Memory_Heap_Committed | The requested JM Heap Memory. Unit: bytes.
     | Memory_Heap_Max | The maximum JM Heap Memory that can be used. Unit: bytes.
    JM NonHeap Memory | Memory_NonHeap_Used | The used JM NonHeap Memory. Unit: bytes.
     | Memory_NonHeap_Committed | The requested JM NonHeap Memory. Unit: bytes.
     | Memory_NonHeap_Max | The maximum JM NonHeap Memory that can be used. Unit: bytes.
    TM Heap Memory | Memory_Heap_Used | The used TM Heap Memory. Unit: bytes.
     | Memory_Heap_Committed | The requested TM Heap Memory. Unit: bytes.
     | Memory_Heap_Max | The maximum TM Heap Memory that can be used. Unit: bytes.
    TM NonHeap Memory | Memory_NonHeap_Used | The used TM NonHeap Memory. Unit: bytes.
     | Memory_NonHeap_Committed | The requested TM NonHeap Memory. Unit: bytes.
     | Memory_NonHeap_Max | The maximum TM NonHeap Memory that can be used. Unit: bytes.
    Memory RSS | Memory_RSS | The heap memory used by the TM. Unit: bytes.
  • JVM
    Parameter | Metric | Description
    JM Threads | Threads_Count | The total number of active JM threads.
    TM Threads | Threads_Count | The total number of active TM threads.
    JM GC Time | GarbageCollector_PS_Scavenge_Time | The GC duration of the JM young generation.
     | GarbageCollector_PS_MarkSweep_Time | The mark-and-sweep GC duration of the JM old generation.
    JM GC Count | GarbageCollector_PS_Scavenge_Count | The number of GCs of the JM young generation.
     | GarbageCollector_PS_MarkSweep_Count | The number of mark-and-sweep GCs of the JM old generation.
    TM GC Count | GarbageCollector_PS_Scavenge_Count | The number of GCs of the TM young generation.
     | GarbageCollector_PS_MarkSweep_Count | The number of mark-and-sweep GCs of the TM old generation.
    TM GC Time | GarbageCollector_PS_Scavenge_Time | The GC duration of the TM young generation.
     | GarbageCollector_PS_MarkSweep_Time | The mark-and-sweep GC duration of the TM old generation.
    TM ClassLoader | ClassLoader_ClassesLoaded | The total number of classes that the TM has loaded since the JVM was started.
     | ClassLoader_ClassesUnloaded | The total number of classes that the TM has unloaded since the JVM was started.
    JM ClassLoader | ClassLoader_ClassesLoaded | The total number of classes that the JM has loaded since the JVM was started.
     | ClassLoader_ClassesUnloaded | The total number of classes that the JM has unloaded since the JVM was started.