E-MapReduce (EMR) lets you create alert rules to monitor service resource metrics in your clusters. When a metric breaches a configured threshold, CloudMonitor sends an alert notification so you can identify and respond to cluster issues promptly.
Background information: Alerting is provided by CloudMonitor. You can manage alert rules and use additional monitoring and alerting features in the CloudMonitor console. For more information, see What is CloudMonitor?
Prerequisites
Before you begin, ensure that you have:
An EMR cluster. See Create a cluster.
(RAM users only) The following CloudMonitor permissions granted to your RAM user. See Grant permissions to a RAM user.
{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "cms:DescribeContactGroupList",
        "cms:DescribeMetricMetaList",
        "cms:PutResourceMetricRules",
        "cms:DescribeMetricRuleList"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

Create alert rules
EMR provides two ways to create alert rules: using a built-in template or defining a custom rule. Use a template to quickly configure pre-built rules for common services. Use a custom rule when you need to monitor a specific metric not covered by the templates.
Both methods start from the Alert Management subtab. Navigate there first:
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group.
On the EMR on ECS page, click the ID of the target cluster.
Click the Monitoring and Diagnostics tab, then click the Alert Management subtab.
Create alert rules from a template
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, find the target service and click Create Alert Rules in the Actions column.
Configure the following parameters and click Create.
| Parameter | Description |
|---|---|
| Rule Description | The alert rules included in the template. Review the metric names and adjust the default thresholds as needed. For details on which services and metrics are covered, see Services in alert rule templates. |
| Mute Period | How long to wait before resending an alert notification while the alert remains active. |
| Validity Period | The time window during which the alert rules are active. The system evaluates metrics only within this window. |
| Alert Contact Group | The contact groups that receive alert notifications. |
| Alert Notification Method | The channels used to send notifications. Supported combinations: Phone Call, Text Message, Email, and DingTalk Chatbot; Text Message, Email, and DingTalk Chatbot; Email and DingTalk Chatbot. |
| Alert Email Subject | (Optional) If you specify an email subject, the specified remarks are included in the alert notification email. |
| Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires. |
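The callback URL must point at a service that accepts POST requests from CloudMonitor. The following is a minimal sketch of such a receiver using only the Python standard library; the port and the assumption that the payload arrives as form-encoded key/value pairs are illustrative, since the exact payload fields depend on your CloudMonitor configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Receives CloudMonitor alert notifications as HTTP POST requests."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        # Assumed form-encoded payload; field names vary, so log everything.
        fields = {k: v[0] for k, v in parse_qs(body).items()}
        print("alert received:", fields)
        # Acknowledge quickly so the sender does not treat the delivery as failed.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def serve(port=8080):
    """Block and serve alert callbacks on the given port."""
    HTTPServer(("0.0.0.0", port), AlertCallbackHandler).serve_forever()
```

Expose the endpoint over public HTTP (for example, behind a load balancer) and enter its URL in the Callback URL field.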
After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
Create a custom alert rule
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, click Create Custom Rule.
Configure the following parameters and click Create.
| Parameter | Description |
|---|---|
| Alert Rule | The rule name and the condition that triggers the alert. Click Add Alert Rule to add multiple conditions. For available EMR metrics, see CloudMonitor metrics. |
| Mute Period | How long to wait before resending an alert notification while the alert remains active. |
| Validity Period | The time window during which the alert rule is active. |
| Alert Contact Group | The contact groups that receive alert notifications. |
| Alert Notification Method | The channels used to send notifications. Supported combinations: Phone Call, Text Message, Email, and DingTalk Chatbot; Text Message, Email, and DingTalk Chatbot; Email and DingTalk Chatbot. |
| Alert Email Subject | (Optional) If you specify an email subject, the specified remarks are included in the alert notification email. |
| Callback URL | (Optional) A publicly accessible HTTP endpoint. CloudMonitor sends a POST request to this URL when an alert fires. |
After the rule is created, it takes effect on all instances in the cluster. View it on the Alert Management subtab, or click Manage Alert Rules to open the CloudMonitor console for additional management options.
View alert rules
All alert rules are listed on the Alert Management subtab.
| Column | Description |
|---|---|
| Rule Name | The name of the alert rule. |
| Status | The current state of the rule in CloudMonitor: OK (the metric is within the configured threshold), Alert (the metric has breached the threshold and an alert is active), No Data (no metric data has been reported, which may indicate a monitoring gap or a stopped component), Disabled (the rule exists but is not evaluating metrics), or Enabled (the rule is active and evaluating metrics). |
| Rule Description | The conditions that trigger the alert. |
| Alert Contact Group | The contact groups configured to receive notifications for this rule. |
| Actions | Details: go to the CloudMonitor console to view alert contact groups, alert history, and affected resources. Edit Rule: go to the CloudMonitor console to modify the rule parameters. |
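If you retrieve rules programmatically through the cms:DescribeMetricRuleList permission in the prerequisite policy, you can filter the response for rules that are currently in the Alert state. The following sketch assumes a parsed JSON response; the field names (Alarms, Alarm, RuleName, AlertState) are assumptions for illustration and should be checked against the API reference.

```python
def rules_in_alert(describe_response):
    """Return the names of alert rules whose metric has breached its threshold.

    `describe_response` is assumed to be the parsed JSON body of a
    DescribeMetricRuleList call; the field names below are illustrative.
    """
    alarms = describe_response.get("Alarms", {}).get("Alarm", [])
    return [rule["RuleName"] for rule in alarms if rule.get("AlertState") == "ALARM"]
```

A script like this can feed a periodic report or a dashboard instead of checking the Alert Management subtab manually.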
Services in alert rule templates
The following table lists the services, components, and metrics covered by the built-in alert rule templates, along with the default alert conditions. You can adjust thresholds when creating a rule to match your cluster's workload.
| Service | Component | Metric | Default alert condition |
|---|---|---|---|
| Node (Host) | Disk | emr_node_part_max_used | Average > 80%, 2 consecutive checks, every 1 min |
| Node (Host) | CPU | emr_node_cpu_idle | Average < 5%, 5 consecutive checks, every 1 min |
| Node (Host) | Memory | emr_node_mem_used_percent | Average > 90%, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_jvm_MemHeapUsedM / hdfs_namenode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| HDFS | NameNode | hdfs_namenode_rpc_service_activity_CallQueueLength | Average > 1000, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_fsnamesystem_CorruptBlocks | Average > 1, 2 consecutive checks, every 1 min |
| HDFS | NameNode | hdfs_namenode_safemode_status | NameNode enters safe mode, every 1 min |
| HDFS | DataNode | hdfs_datanode_jvm_MemHeapUsedM / hdfs_datanode_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| Spark | SparkHistoryServer | spark_history_jvm_old_space_utilization | Average > 95%, 2 consecutive checks, every 1 min |
| Spark | SparkThriftServer | spark_thrift_driver_jvm_heap_used / spark_thrift_driver_jvm_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveMetaStore | hive_metastore_memory_heap_used / hive_metastore_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveMetaStore | hive_metastore_threads_blocked_count | Average > 50, 2 consecutive checks, every 1 min |
| Hive | HiveServer2 | hive_server_memory_heap_used / hive_server_memory_heap_max | Average > 95%, 2 consecutive checks, every 1 min |
| Hive | HiveServer2 | hive_server_threads_deadlock_count | Average > 50, 2 consecutive checks, every 1 min |
| YARN | ResourceManager | yarn_cluster_status | Within the previous 5 minutes: 2 or more HA switchovers, a node status of 1, or a node status that remains -1 |
| YARN | ResourceManager | yarn_resourcemanager_jvm_MemHeapUsedM / yarn_resourcemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | NodeManager | yarn_cluster_unhealthyNodes | Average > 1, 2 consecutive checks, every 1 min |
| YARN | NodeManager | yarn_nodemanager_jvm_MemHeapUsedM / yarn_nodemanager_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | TimelineServer | yarn_timelineserver_jvm_MemHeapUsedM / yarn_timelineserver_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| YARN | MRHistoryServer | yarn_jobhistory_jvm_MemHeapUsedM / yarn_jobhistory_jvm_MemHeapMaxM | Average > 95% for 2 consecutive checks, or no metric data reported, every 1 min |
| ZooKeeper | ZooKeeper | zk_znode_count | Average >= 10000, 2 consecutive checks, every 1 min |
| ZooKeeper | ZooKeeper | zk_watch_count | Average >= 1000, 2 consecutive checks, every 1 min |
| Kafka | KafkaBroker | Kafka_Broker_kafka_log_LogManager_OfflineLogDirectoryCount | Average > 0, 2 consecutive checks, every 1 min |
| Kafka | KafkaBroker | Kafka_Broker_kafka_server_ReplicaManager_UnderReplicatedPartitions | Average > 0, 2 consecutive checks, every 1 min |
| Presto/Trino | Trino | trino_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min |
| Presto/Trino | Trino | trino_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min |
| Presto/Trino | Presto | presto_QueryManager_FailedQueries_OneMinute_Count | Average >= 1, 2 consecutive checks, every 1 min |
| Presto/Trino | Presto | presto_ClusterMemoryPool_name_general_BlockedNodes | Average > 0, 2 consecutive checks, every 1 min |
| Impala | Impalad | num_waiting_queries | Average >= 10, 2 consecutive checks, every 1 min. Adjust the threshold based on the number of concurrent queries your cluster supports. |
| Kudu | kudu-master | kudu_cluster_replica_skew | Average >= 1000, 2 consecutive checks, every 1 min. Adjust the threshold based on your workload. |
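All of the default conditions above follow the same pattern: the alert fires only after the metric breaches its threshold for N consecutive checks, which filters out single-sample spikes. The helper below is a small sketch of that evaluation semantics (not CloudMonitor's actual implementation), parameterized the same way as the table's conditions.

```python
import operator

# Comparison operators matching the condition column, e.g. "Average > 90%".
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def should_alert(samples, threshold, comparator=">", consecutive=2):
    """Return True if the last `consecutive` samples all breach the threshold.

    Mirrors the "N consecutive checks" semantics of the default conditions:
    a single transient spike does not fire the alert.
    """
    breach = COMPARATORS[comparator]
    if len(samples) < consecutive:
        return False
    return all(breach(s, threshold) for s in samples[-consecutive:])
```

For example, under the node memory rule (Average > 90%, 2 consecutive checks), emr_node_mem_used_percent readings of 85, 92, 94 fire the alert, while 92, 85, 94 do not, because the breach was interrupted.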