Observability | Key Metrics to Focus On When Using Prometheus to Monitor E-MapReduce

By Hong Wen

E-MapReduce (EMR) is a cloud-native open-source platform that integrates various big data computing and storage engines like Hadoop, Hive, Spark, Flink, Presto, ClickHouse, StarRocks, Delta, and Hudi. This article explains how to monitor big data in EMR using Prometheus Service.

Introduction to EMR

EMR is increasingly adopted by enterprises as a big data processing solution. Built on Alibaba Cloud's ECS, EMR leverages open-source Apache Hadoop and Apache Spark ecosystems to easily analyze and process data. It can also integrate with cloud data storage systems and databases like Alibaba Cloud OSS and RDS, enabling quick setup of open-source big data services such as Hadoop, Spark, Flink, Kafka, and HBase.

The core of EMR is the cluster, which can be a Hadoop, Flink, Druid, or ZooKeeper cluster comprising one or more ECS instances. For instance, a Hadoop cluster consists of daemon processes like NameNode, DataNode, ResourceManager, and NodeManager running on ECS instances. Many big data components have numerous metrics that need to be monitored, posing challenges for O&M and SRE engineers. Hence, it is important to understand which metrics to focus on for different EMR components.

Interpretation of EMR Observation Metrics

Metric Collection

EMR exposes metrics for HOST, HDFS, YARN, Hive, ZooKeeper, Kafka, Impala, HUE, Kudu, ClickHouse, and Flink. The following sections introduce them one by one.

HOST Metrics [1]

HOST metrics cover the ECS nodes themselves: CPU, memory, disk, load, network, and socket.

HDFS Metrics [2]

Hadoop Distributed File System (HDFS) is suitable for distributed reading and writing of large-scale data, especially in read-heavy scenarios with few writes. HDFS metrics include HOME, NameNodes, DataNodes, and JournalNodes.

HDFS-HOME
HDFS-NameNodes
HDFS-DataNodes
HDFS-JournalNodes

YARN Metrics [3]

YARN is a core component of the Hadoop system. It manages resources in Hadoop clusters, and schedules and monitors the jobs that run on them. YARN metrics include HOME, Queues, ResourceManager, NodeManagers, TimeLineServer, and JobHistory.

YARN-HOME
YARN-Queues
YARN-ResourceManager
YARN-NodeManagers
YARN-TimeLineServer
YARN-JobHistory

Hive Metrics [4]

Hive is a Hadoop-based data warehouse framework. It is used to extract, transform, and load data and manage metadata in big data scenarios. Hive consists of HiveServer2 (HiveQL query server), Hive MetaStore (metadata management module), and Hive Client. Its metrics include HiveMetaStore and HiveServer2.

HiveMetaStore

table1

HiveServer2

table2

ZooKeeper Metrics [5]

ZooKeeper is a distributed and highly available coordination service. ZooKeeper provides features such as distributed configuration, synchronization, naming, and registration.

table3

Kafka Metrics [6]

ApsaraMQ for Kafka is a distributed, high-throughput, and scalable message queue service provided by Alibaba Cloud. It is used in big data scenarios such as log collection, monitoring data aggregation, streaming data processing, and online and offline analysis, and is an important part of the big data ecosystem.

Kafka-HOME
Kafka-Broker
- Status
- Throughput
- Performance
- Storage
- Request Rate
- Request Time
- MessageConversion
- ZK session
- JVM
Kafka-Topic
- Status
- Throughput
- Request Rate
- MessageConversion
- Storage

Impala Metrics [7]

Impala provides high-performance and low-latency SQL queries for data stored in Apache Hadoop.

table4

HUE Metrics [8]

table5

Kudu Metrics [9]

table6

ClickHouse Metrics [10]

Compatible with the features of open source ClickHouse, EMR ClickHouse optimizes the read and write performance and improves the ability to quickly integrate ClickHouse with other EMR components.

table7

Flink Metrics [11]

Flink is a stream processing engine that provides data distribution, data communication, and fault tolerance mechanisms for distributed computing over data streams.

Overview

table8

Checkpoint

table9

Network

table10

table11

Watermark

table12

table13

Memory

table14

table15

Use Prometheus Service to Monitor EMR

The following section describes how to use Alibaba Cloud Prometheus Service to monitor EMR. It covers three aspects: integrating EMR configurations, viewing monitoring dashboards, and configuring alert rules.

Integrate EMR Configurations

Enable the Prometheus Port for taihao-exporter

After you create an EMR cluster, the system automatically installs taihao-exporter in the corresponding Elastic Compute Service (ECS) instance. You must manually enable the Prometheus port.

1. Log on to the EMR console [12] and find the ID and name of the cluster.

2. Click the Nodes tab. Find the master node and core node, and click Details. In the Basic Information section of the Instance Details tab, click Connect to remotely log on to the ECS instance.

3. Run ps -ef | grep taihao_exporter to confirm that the exporter process is running. Then run the following commands to set prom_sink_enable to true in the taihao_exporter.yaml file and restart the service. You need to modify the configuration on all nodes.

sed -i 's/prom_sink_enable:\s*false/prom_sink_enable: true/g' /usr/local/taihao_exporter/taihao_exporter.yaml
service taihao_exporter restart
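Because the change has to be repeated on every node, it can help to confirm that the substitution actually took effect before restarting the service. The following is a minimal sketch, not part of the EMR tooling: the function name enable_prom_sink is hypothetical, and it assumes GNU sed and that the flag appears in the file as prom_sink_enable: false, as the command above implies.

```shell
#!/usr/bin/env bash
# Sketch: enable the Prometheus sink in a taihao_exporter config file and
# confirm that the change took effect. The function name is hypothetical.

enable_prom_sink() {
  local conf="$1"
  # Same substitution as the command above (GNU sed; \s matches whitespace).
  sed -i 's/prom_sink_enable:\s*false/prom_sink_enable: true/g' "$conf"
  # Succeed only if the flag now reads "true"; fail if the key is missing.
  grep -Eq '^[[:space:]]*prom_sink_enable:[[:space:]]*true' "$conf"
}

# Assumed usage on a node:
#   enable_prom_sink /usr/local/taihao_exporter/taihao_exporter.yaml \
#     && service taihao_exporter restart
```

The function returns a non-zero status when the flag is absent from the file, so the restart in the usage line runs only when the setting is really in place.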

Integrate EMR into Managed Service for Prometheus

Log on to the Alibaba Cloud Prometheus [13] console. Click Integration Center. In the Application Components section, find the E-MapReduce component and click Add.

Select an ECS environment and a Prometheus instance, and set the following parameters:

EMR cluster ID: Go to the EMR console to find it.
EMR cluster name: The name of the EMR cluster.
Exporter name: The job name. We recommend the default value plus the cluster name.
Exporter port number: Default value: 9712
Collection path: The HTTP path used by Managed Service for Prometheus to collect metric data from the exporter. Default value: /metrics_preget
Collection interval (seconds): The interval at which metrics are collected.
ECS tag key: The key of the ECS tag that identifies the instances on which the exporter is deployed. Managed Service for Prometheus uses this tag and its value for service discovery. Set it according to the ECS tags shown in the picture above. Valid values: acs:emr:nodeGroupType or acs:emr:hostGroupType.
ECS tag value: Default values: CORE,MASTER. Separate multiple values with commas (,).
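Before adding the integration, it can be useful to request the collection path by hand and see what a scrape would return. The sketch below builds the target URL from the default port and collection path listed above. The function names and the example IP address are hypothetical, and the check assumes it runs from a machine that can reach the node inside the VPC.

```shell
#!/usr/bin/env bash
# Sketch: build the URL that a scrape of one node would target, using the
# defaults listed above. Function names and the example IP are hypothetical.

scrape_url() {
  local host="$1"
  local port="${2:-9712}"              # default exporter port
  local path="${3:-/metrics_preget}"   # default collection path
  printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Fetch the endpoint by hand and show the first few lines of the response.
preview_metrics() {
  curl --silent --show-error --max-time 10 "$(scrape_url "$@")" | head -n 5
}

# Assumed usage, from a machine in the same VPC as the node:
#   preview_metrics 192.168.0.10
```

If the preview prints metric lines, the exporter port is enabled and reachable from that machine, and the same port and path can be entered in the integration form.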

FAQ

If the error context deadline exceeded is reported, add the ECS instances of the EMR cluster to the VPC security group. A security group prompt appears when you add the ECS instances.
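A context deadline exceeded error means the scrape did not finish within its time limit, which is what typically happens when a security group silently drops the traffic. One way to tell this apart from an exporter that is simply not listening is to probe the port with a hard timeout and look at how the request fails. The sketch below assumes curl is available. The function names are hypothetical, and the suggested causes are common interpretations rather than a definitive diagnosis. Exit codes 7 and 28 are curl's documented codes for a failed connection and a timed-out operation.

```shell
#!/usr/bin/env bash
# Sketch: probe the exporter port with a time limit and interpret the result.
# Function names are hypothetical; the hints are likely causes, not certainties.

explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connection failed: check that taihao_exporter is running and the Prometheus port is enabled" ;;
    28) echo "timed out: check the security group rules for the exporter port" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

probe_exporter() {
  local host="$1" port="${2:-9712}"
  # Discard the body; only the exit status matters here.
  curl --silent --output /dev/null --max-time 5 "http://${host}:${port}/metrics_preget"
  explain_curl_exit "$?"
}

# Assumed usage:
#   probe_exporter 192.168.0.10
```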

View Monitoring Dashboard

Alibaba Cloud Prometheus Service provides 24 dashboards, including HOST, HDFS, Hive, YARN, Impala, ZooKeeper, Spark, Flink, and ClickHouse.

1. HOST dashboard: displays the CPU utilization, memory usage, disk space, load, network, and socket of the ECS instance.

2. HDFS dashboard: HDFS-HOME, HDFS-NameNodes, HDFS-DataNodes, and HDFS-JournalNodes

3. Hive dashboard:

HiveServer2: the HiveQL query server that receives SQL requests from JDBC clients.
HiveMetaStore: the metadata management module that is used to store metadata such as database and table data.

4. YARN dashboard:

HOME: displays the cluster status, memory, tasks, nodes, and containers.
NodeManager: manages and monitors node resources and executes jobs on nodes.
ResourceManager: manages and schedules cluster resources and allocates resources for various types of jobs that are running on YARN.
TimeLineServer: collects the metrics of a job and displays the job execution status.
JobHistory:

5. ClickHouse dashboard

6. Flink dashboard

7. Impala dashboard

8. ZooKeeper dashboard

9. Spark dashboard

To open a dashboard, go to the page of the Prometheus instance that is integrated with EMR and click the E-MapReduce tab. On the page that appears, click the Dashboards tab, and then click the thumbnail of a dashboard to view it in Grafana.