E-MapReduce: Manage alert rules

Last Updated: Mar 26, 2026

E-MapReduce (EMR) lets you create alert rules to monitor service resource metrics in your clusters. When a metric breaches a configured threshold, CloudMonitor sends an alert notification so you can identify and respond to cluster issues promptly.

Background information: The alerting feature is provided by CloudMonitor. You can also manage alert rules and use additional monitoring and alerting features in the CloudMonitor console. For more information, see What is CloudMonitor?

Prerequisites

Before you begin, ensure that the account you use is granted the following CloudMonitor permissions:

{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "cms:DescribeContactGroupList",
                "cms:DescribeMetricMetaList",
                "cms:PutResourceMetricRules",
                "cms:DescribeMetricRuleList"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
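
To confirm that an existing policy document covers the four actions listed above, a small check like the following can help. This is a minimal sketch, not an official tool: the function name `missing_actions` is hypothetical, `*` in a granted action is treated as a wildcard, and the check ignores `Resource` scoping and any `Deny` statements, so it is only a first-pass sanity test.

```python
import json
from fnmatch import fnmatchcase

# Actions listed in the policy above.
REQUIRED_ACTIONS = {
    "cms:DescribeContactGroupList",
    "cms:DescribeMetricMetaList",
    "cms:PutResourceMetricRules",
    "cms:DescribeMetricRuleList",
}


def missing_actions(policy_json: str) -> set[str]:
    """Return the required actions that no Allow statement in the policy covers."""
    policy = json.loads(policy_json)
    granted: list[str] = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        if isinstance(actions, str):
            actions = [actions]
        granted.extend(actions)
    # Assumption for this sketch: "*" in a granted action acts as a wildcard.
    return {
        required
        for required in REQUIRED_ACTIONS
        if not any(fnmatchcase(required, pattern) for pattern in granted)
    }
```

An empty result means every required action is matched by at least one Allow statement; any returned names are the actions still to be granted.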

Create alert rules

EMR provides two ways to create alert rules: using a built-in template or defining a custom rule. Use a template to quickly configure pre-built rules for common services. Use a custom rule when you need to monitor a specific metric not covered by the templates.

Both methods start from the Alert Management subtab. Navigate there first:

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. In the top navigation bar, select a region and a resource group.

  3. On the EMR on ECS page, click the ID of the target cluster.

  4. Click the Monitoring and Diagnostics tab, then click the Alert Management subtab.

Create alert rules from a template

  1. On the Alert Management subtab, click Create Alert Rules.

  2. In the Create Alert Rules panel, find the target service and click Create Alert Rules in the Actions column.

  3. Configure the following parameters and click Create.

    Parameter | Description
    Rule Description | The alert rules included in the template. Review metric names and adjust default thresholds as needed. For details on which services and metrics are covered, see Services in alert rule templates.
    Mute Period | How long to wait before resending an alert notification while the alert remains active.
    Validity Period | The time window during which the alert rules are active. The system evaluates metrics only within this window.
    Alert Contact Group | The contact groups that receive alert notifications.
    Alert notification method | The channels used to send notifications. Supported combinations: (1) Phone Call, Text Message, Email, and DingTalk Chatbot; (2) Text Message, Email, and DingTalk Chatbot; (3) Email and DingTalk Chatbot. Alert Email Subject (optional): the text you specify is included in the alert notification email.
    Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires.

After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
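
For the Callback URL parameter, the receiving endpoint only needs to accept an HTTP POST. The following is a minimal sketch of such a receiver using the Python standard library. The payload format is not described here, so the handler logs the raw body instead of assuming any field names. `AlertCallbackHandler`, `make_server`, and port 8080 are hypothetical choices for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and logs the raw body without assuming its schema."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        content_type = self.headers.get("Content-Type", "unknown")
        # Log the payload as received; parse it only once its format is confirmed.
        text = body.decode("utf-8", errors="replace")
        print(f"alert callback ({content_type}): {text}")
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build a server bound to host:port; the caller decides when to serve."""
    return HTTPServer((host, port), AlertCallbackHandler)
```

Calling `make_server().serve_forever()` starts the receiver. Because the endpoint must be publicly accessible, a real deployment would typically sit behind a reverse proxy and add request validation.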

Create a custom alert rule

  1. On the Alert Management subtab, click Create Alert Rules.

  2. In the Create Alert Rules panel, click Create Custom Rule.

  3. Configure the following parameters and click Create.

    Parameter | Description
    Alert Rule | The rule name and condition that triggers the alert. Click Add Alert Rule to add multiple conditions. For available EMR metrics, see CloudMonitor metrics.
    Mute Period | How long to wait before resending an alert notification while the alert remains active.
    Validity Period | The time window during which the alert rule is active.
    Alert Contact Group | The contact groups that receive alert notifications.
    Alert notification method | The channels used to send notifications. Supported combinations: (1) Phone Call, Text Message, Email, and DingTalk Chatbot; (2) Text Message, Email, and DingTalk Chatbot; (3) Email and DingTalk Chatbot. Alert Email Subject (optional): the text you specify is included in the alert notification email.
    Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires.

After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
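
To illustrate how Mute Period and Validity Period interact, the sketch below models when a notification would be sent for an alert that stays active. It is an illustration of the parameter descriptions above, not CloudMonitor's actual implementation. It assumes the validity period is a daily time-of-day window, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_validity_period(now: datetime, start: time, end: time) -> bool:
    """True when `now` falls inside a daily [start, end] window."""
    current = now.time()
    if start <= end:
        return start <= current <= end
    # Window wraps past midnight, for example 22:00 to 06:00.
    return current >= start or current <= end


def should_notify(
    now: datetime,
    last_sent: Optional[datetime],
    mute_period: timedelta,
    start: time,
    end: time,
) -> bool:
    """Decide whether an active alert produces a notification at `now`."""
    if not in_validity_period(now, start, end):
        return False  # Outside the validity period the rule is not evaluated.
    if last_sent is None:
        return True  # First notification for this alert.
    # Resend only after the mute period has fully elapsed.
    return now - last_sent >= mute_period
```

With a one-hour mute period and a 09:00 to 18:00 window, an alert that fires at 10:00 and stays active would notify at 10:00, stay silent at 10:30, and notify again at 11:00.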

View alert rules

All alert rules are listed on the Alert Management subtab.

Column | Description
Rule Name | The name of the alert rule.
Status | The current state of the rule in CloudMonitor. OK: the metric is within the configured threshold. Alert: the metric has breached the threshold and an alert is active. No Data: no metric data has been reported, which may indicate a monitoring gap or a stopped component. Disabled: the rule exists but is not evaluating metrics. Enabled: the rule is active and evaluating metrics.
Rule Description | The conditions that trigger the alert.
Alert Contact Group | The contact groups configured to receive notifications for this rule.
Actions | Details: go to the CloudMonitor console to view alert contact groups, alert history, and affected resources. Edit Rule: go to the CloudMonitor console to modify the rule parameters.
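
When reviewing a long rule list, the Alert and No Data states are usually the ones to look at first. The sketch below groups rules by those two states. The input is a hypothetical list of dictionaries with `name` and `status` keys transcribed from the subtab. It is not the output format of any API.

```python
from collections import defaultdict

# Status values from the Status column that usually warrant follow-up.
NEEDS_ATTENTION = ("Alert", "No Data")


def rules_needing_attention(rules: list[dict]) -> dict[str, list[str]]:
    """Group rule names by status, keeping only Alert and No Data."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for rule in rules:
        if rule["status"] in NEEDS_ATTENTION:
            grouped[rule["status"]].append(rule["name"])
    return dict(grouped)
```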

Services in alert rule templates

The following table lists the services, components, and metrics covered by the built-in alert rule templates, along with the default alert conditions. You can adjust thresholds when creating a rule to match your cluster's workload.

Service | Component | Metric | Default alert condition
Node (Host) | Disk | emr_node_part_max_used | Average > 80%, 2 consecutive checks, every 1 min
Node (Host) | CPU | emr_node_cpu_idle | Average < 5%, 5 consecutive checks, every 1 min
Node (Host) | Memory | emr_node_mem_used_percent | Average > 90%, 2 consecutive checks, every 1 min
HDFS | NameNode | hdfs_namenode_jvm_MemHeapUsedM / hdfs_namenode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
HDFS | NameNode | hdfs_namenode_rpc_service_activity_CallQueueLength | Average > 1000, 2 consecutive checks, every 1 min
HDFS | NameNode | hdfs_namenode_fsnamesystem_CorruptBlocks | Average > 1, 2 consecutive checks, every 1 min
HDFS | NameNode | hdfs_namenode_safemode_status | NameNode enters safe mode, every 1 min
HDFS | DataNode | hdfs_datanode_jvm_MemHeapUsedM / hdfs_datanode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
Spark | SparkHistoryServer | spark_history_jvm_old_space_utilization | Average > 95%, 2 consecutive checks, every 1 min
Spark | SparkThriftServer | spark_thrift_driver_jvm_heap_used / spark_thrift_driver_jvm_heap_max | Average > 95%, 2 consecutive checks, every 1 min
Hive | HiveMetaStore | hive_metastore_memory_heap_used / hive_metastore_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min
Hive | HiveMetaStore | hive_metastore_threads_blocked_count | Average > 50%, 2 consecutive checks, every 1 min
Hive | HiveServer2 | hive_server_memory_heap_used / hive_server_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min
Hive | HiveServer2 | hive_server_threads_deadlock_count | Average > 50%, 2 consecutive checks, every 1 min
YARN | ResourceManager | yarn_cluster_status | Within the previous 5 minutes: 2 or more HA switchovers, a node status of 1, or a node status that remains -1
YARN | ResourceManager | yarn_resourcemanager_jvm_MemHeapUsedM / yarn_resourcemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
YARN | NodeManager | yarn_cluster_unhealthyNodes | Average > 1, 2 consecutive checks, every 1 min
YARN | NodeManager | yarn_nodemanager_jvm_MemHeapUsedM / yarn_nodemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
YARN | TimelineServer | yarn_timelineserver_jvm_MemHeapUsedM / yarn_timelineserver_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
YARN | MRHistoryServer | yarn_jobhistory_jvm_MemHeapUsedM / yarn_jobhistory_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min
Zookeeper | Zookeeper | zk_znode_count | Average >= 10000, 2 consecutive checks, every 1 min
Zookeeper | Zookeeper | zk_watch_count | Average >= 1000, 2 consecutive checks, every 1 min
Kafka | KafkaBroker | Kafka_Broker_kafka_log_LogManager_OfflineLogDirectoryCount | Average > 0, 2 consecutive checks, every 1 min
Kafka | KafkaBroker | Kafka_Broker_kafka_server_ReplicaManager_UnderReplicatedPartitions | Average > 0, 2 consecutive checks, every 1 min
Presto/Trino | Trino | trino_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min
Presto/Trino | Trino | trino_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min
Presto/Trino | Presto | presto_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min
Presto/Trino | Presto | presto_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min
Impala | Impalad | num_waiting_queries | Average >= 10, 2 consecutive checks, every 1 min. Adjust the threshold based on the number of concurrent queries your cluster supports.
Kudu | kudu-master | kudu_cluster_replica_skew | Average >= 1000, 2 consecutive checks, every 1 min. Adjust the threshold based on your workload.
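
A condition such as "Average > 80%, 2 consecutive checks, every 1 min" can be read as: the per-minute average must satisfy the comparison for the stated number of checks in a row. The sketch below models that reading so you can try candidate thresholds against sample data. It is an interpretation of the table, not CloudMonitor's evaluation code. The helper names are hypothetical, and the heap rows are assumed to compare used/max expressed as a percentage.

```python
import operator
from typing import Sequence

# Comparison operators that appear in the default alert conditions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def condition_met(
    samples: Sequence[float], op: str, threshold: float, consecutive: int
) -> bool:
    """True if the most recent `consecutive` samples all satisfy `op threshold`.

    `samples` holds one averaged value per check interval, oldest first.
    """
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    compare = OPS[op]
    return all(compare(value, threshold) for value in samples[-consecutive:])


def heap_usage_percent(used_mb: float, max_mb: float) -> float:
    """Express a used/max heap pair as a percentage."""
    return 100.0 * used_mb / max_mb
```

For example, disk usage samples of 70, 85, and 91 satisfy the default disk condition (`>`, 80, 2 checks), while 85, 70, and 91 do not, because the breach is not consecutive.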