E-MapReduce (EMR) lets you create alert rules to monitor service resource metrics in your clusters. When a metric breaches a configured threshold, CloudMonitor sends an alert notification so you can identify and respond to cluster issues promptly.
Background information: Alerting is provided by CloudMonitor. You can manage alert rules and use additional monitoring and alerting features in the CloudMonitor console. For more information, see What is CloudMonitor?
Prerequisites
Before you begin, ensure that you have:
An EMR cluster. See Create a cluster.
(RAM users only) The following CloudMonitor permissions granted to your RAM user. See Grant permissions to a RAM user.
{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "cms:DescribeContactGroupList",
        "cms:DescribeMetricMetaList",
        "cms:PutResourceMetricRules",
        "cms:DescribeMetricRuleList"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

Create alert rules
EMR provides two ways to create alert rules: using a built-in template or defining a custom rule. Use a template to quickly configure pre-built rules for common services. Use a custom rule when you need to monitor a specific metric not covered by the templates.
Both methods start from the Alert Management subtab. Navigate there first:
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group.
On the EMR on ECS page, click the ID of the target cluster.
Click the Monitoring and Diagnostics tab, then click the Alert Management subtab.
Create alert rules from a template
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, find the target service and click Create Alert Rules in the Actions column.
Configure the following parameters and click Create.
| Parameter | Description |
|---|---|
| Rule Description | The alert rules included in the template. Review the metric names and adjust the default thresholds as needed. For details on which services and metrics are covered, see Services in alert rule templates. |
| Mute Period | How long to wait before resending an alert notification while the alert remains active. |
| Validity Period | The time window during which the alert rules are active. The system evaluates metrics only within this window. |
| Alert Contact Group | The contact groups that receive alert notifications. |
| Alert Notification Method | The channels used to send notifications. Supported combinations: Phone Call, Text Message, Email, and DingTalk Chatbot; Text Message, Email, and DingTalk Chatbot; Email and DingTalk Chatbot. |
| Alert Email Subject | (Optional) If you specify an email subject, the specified remarks are included in the alert notification email. |
| Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires. |
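The callback URL must point at a service that accepts POST requests from CloudMonitor. The following is a minimal sketch of such a receiver using only the Python standard library; the port and the assumption that the payload arrives as form-encoded key/value pairs are illustrative, since the exact payload fields depend on your CloudMonitor configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Receives CloudMonitor alert notifications as HTTP POST requests."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        # Assumed form-encoded payload; field names vary, so log everything.
        fields = {k: v[0] for k, v in parse_qs(body).items()}
        print("alert received:", fields)
        # Acknowledge quickly so the sender does not treat the delivery as failed.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def serve(port=8080):
    """Block and serve alert callbacks on the given port."""
    HTTPServer(("0.0.0.0", port), AlertCallbackHandler).serve_forever()
```

Expose the endpoint over public HTTP (for example, behind a load balancer) and enter its URL in the Callback URL field.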
After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
Create a custom alert rule
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, click Create Custom Rule.
Configure the following parameters and click Create.
| Parameter | Description |
|---|---|
| Alert Rule | The rule name and the condition that triggers the alert. Click Add Alert Rule to add multiple conditions. For available EMR metrics, see CloudMonitor metrics. |
| Mute Period | How long to wait before resending an alert notification while the alert remains active. |
| Validity Period | The time window during which the alert rule is active. |
| Alert Contact Group | The contact groups that receive alert notifications. |
| Alert Notification Method | The channels used to send notifications. Supported combinations: Phone Call, Text Message, Email, and DingTalk Chatbot; Text Message, Email, and DingTalk Chatbot; Email and DingTalk Chatbot. |
| Alert Email Subject | (Optional) If you specify an email subject, the specified remarks are included in the alert notification email. |
| Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires. |
After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
View alert rules
All alert rules are listed on the Alert Management subtab.
| Column | Description |
|---|---|
| Rule Name | The name of the alert rule. |
| Status | The current state of the rule in CloudMonitor: OK (the metric is within the configured threshold), Alert (the metric has breached the threshold and an alert is active), No Data (no metric data has been reported, which may indicate a monitoring gap or a stopped component), Disabled (the rule exists but is not evaluating metrics), or Enabled (the rule is active and evaluating metrics). |
| Rule Description | The conditions that trigger the alert. |
| Alert Contact Group | The contact groups configured to receive notifications for this rule. |
| Actions | Details: go to the CloudMonitor console to view alert contact groups, alert history, and affected resources. Edit Rule: go to the CloudMonitor console to modify the rule parameters. |
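If you retrieve rules programmatically through the cms:DescribeMetricRuleList permission in the prerequisite policy, you can filter the response for rules that are currently in the Alert state. The following sketch assumes a parsed JSON response; the field names (Alarms, Alarm, RuleName, AlertState) are assumptions for illustration and should be checked against the API reference.

```python
def rules_in_alert(describe_response):
    """Return the names of alert rules whose metric has breached its threshold.

    `describe_response` is assumed to be the parsed JSON body of a
    DescribeMetricRuleList call; the field names below are illustrative.
    """
    alarms = describe_response.get("Alarms", {}).get("Alarm", [])
    return [rule["RuleName"] for rule in alarms if rule.get("AlertState") == "ALARM"]
```

A script like this can feed a periodic report or a dashboard instead of checking the Alert Management subtab manually.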
Services in alert rule templates
The following table lists the services, components, and metrics covered by the built-in alert rule templates, along with the default alert conditions. You can adjust thresholds when creating a rule to match your cluster's workload.
| Service | Component | Metric | Default alert condition |
|---|---|---|---|
| Node (Host) | Disk | emr_node_part_max_used | Average > 80%, 2 consecutive checks, every 1 min |
| Node (Host) | CPU | emr_node_cpu_idle | Average < 5%, 5 consecutive checks, every 1 min |
| Node (Host) | Memory | emr_node_mem_used_percent | Average > 90%, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_jvm_MemHeapUsedM / hdfs_namenode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| HDFS | NameNode | hdfs_namenode_rpc_service_activity_CallQueueLength | Average > 1000, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_fsnamesystem_CorruptBlocks | Average > 1, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_safemode_status | NameNode enters safe mode, every 1 min |
| HDFS | DataNode | hdfs_datanode_jvm_MemHeapUsedM / hdfs_datanode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| Spark | SparkHistoryServer | spark_history_jvm_old_space_utilization | Average > 95%, 2 consecutive checks, every 1 min |
| Spark | SparkThriftServer | spark_thrift_driver_jvm_heap_used / spark_thrift_driver_jvm_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveMetaStore | hive_metastore_memory_heap_used / hive_metastore_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveMetaStore | hive_metastore_threads_blocked_count | Average > 50, 2 consecutive checks, every 1 min |
| Hive | HiveServer2 | hive_server_memory_heap_used / hive_server_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveServer2 | hive_server_threads_deadlock_count | Average > 50, 2 consecutive checks, every 1 min |
| YARN | ResourceManager | yarn_cluster_status | Within the previous 5 minutes: 2 or more HA switchovers, a node status of 1, or a node status that remains -1 |
| YARN | ResourceManager | yarn_resourcemanager_jvm_MemHeapUsedM / yarn_resourcemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | NodeManager | yarn_cluster_unhealthyNodes | Average > 1, 2 consecutive checks, every 1 min |
| YARN | NodeManager | yarn_nodemanager_jvm_MemHeapUsedM / yarn_nodemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | TimelineServer | yarn_timelineserver_jvm_MemHeapUsedM / yarn_timelineserver_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | MRHistoryServer | yarn_jobhistory_jvm_MemHeapUsedM / yarn_jobhistory_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| ZooKeeper | ZooKeeper | zk_znode_count | Average >= 10000, 2 consecutive checks, every 1 min |
| ZooKeeper | ZooKeeper | zk_watch_count | Average >= 1000, 2 consecutive checks, every 1 min |
| Kafka | KafkaBroker | Kafka_Broker_kafka_log_LogManager_OfflineLogDirectoryCount | Average > 0, 2 consecutive checks, every 1 min |
| Kafka | KafkaBroker | Kafka_Broker_kafka_server_ReplicaManager_UnderReplicatedPartitions | Average > 0, 2 consecutive checks, every 1 min |
| Presto/Trino | Trino | trino_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min |
| Presto/Trino | Trino | trino_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min |
| Presto/Trino | Presto | presto_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min |
| Presto/Trino | Presto | presto_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min |
| Impala | Impalad | num_waiting_queries | Average >= 10, 2 consecutive checks, every 1 min. Adjust the threshold based on the number of concurrent queries your cluster supports. |
| Kudu | kudu-master | kudu_cluster_replica_skew | Average >= 1000, 2 consecutive checks, every 1 min. Adjust the threshold based on your workload. |
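All of the default conditions above follow the same pattern: the alert fires only after the metric breaches its threshold for N consecutive checks, which filters out single-sample spikes. The helper below is a small sketch of that evaluation semantics (not CloudMonitor's actual implementation), parameterized the same way as the table's conditions.

```python
import operator

# Comparison operators matching the condition column, e.g. "Average > 90%".
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def should_alert(samples, threshold, comparator=">", consecutive=2):
    """Return True if the last `consecutive` samples all breach the threshold.

    Mirrors the "N consecutive checks" semantics of the default conditions:
    a single transient spike does not fire the alert.
    """
    breach = COMPARATORS[comparator]
    if len(samples) < consecutive:
        return False
    return all(breach(s, threshold) for s in samples[-consecutive:])
```

For example, under the node memory rule (Average > 90%, 2 consecutive checks), emr_node_mem_used_percent readings of 85, 92, 94 fire the alert, while 92, 85, 94 do not, because the breach was interrupted.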