This topic describes how to create an event-triggered alert rule and test a system event-triggered alert rule. Event-triggered alert rules allow you to receive alert notifications and handle exceptions immediately after the specified events occur in E-MapReduce (EMR).

Prerequisites

An application group is created, and resources are added to the application group. Otherwise, you cannot apply alert rules to the instances in the group. For more information, see Create an application group and Add resources to an application group.

Create an event-triggered alert rule

  1. Log on to the CloudMonitor console.
  2. In the left-side navigation pane, choose Alerts > Alert Rules.
  3. On the Alert Rules page, click the Event Alert tab.
  4. On the Event Alert tab, click Create Event Alert.
  5. In the Create / Modify Event Alert panel, configure an alert rule and an alert notification method.
    1. In the Basic Information section, set Alert Rule Name.
    2. In the Event alert section, select an event type.
      • System events
        If you set the Event Type parameter to System Event, you must specify Product Type, Event Type, Event Level, Event Name, Resource Range, and Alert Type based on your business requirements.
        Note: For more information about event names, see the System events section of this topic.

        The alert types supported by system events include Alert Notification, MNS queue, Function service, URL callback, and Log Service.

      • Custom events

        If you set the Event Type parameter to Custom Event, you must specify Application Groups, Event Name, Rule Description, and Notification Method based on your business requirements.

  6. Click OK.
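The console steps above can also be scripted by calling the CloudMonitor API. The following sketch builds the flattened parameter map for the PutEventRule operation; the EventPattern field names follow the CloudMonitor API reference, while the rule name and event names are illustrative placeholders, not values from this topic.

```python
# Sketch: build PutEventRule-style request parameters that mirror the
# console settings above. The EventPattern.N.* field names are based on the
# CloudMonitor API reference; the rule name and event names are placeholders.

def build_put_event_rule_params(rule_name, product, levels, event_names):
    """Flatten one event pattern into PutEventRule request parameters."""
    params = {
        "RuleName": rule_name,
        "EventType": "SYSTEM",   # system event-triggered alert rule
        "State": "ENABLED",
    }
    params["EventPattern.1.Product"] = product
    for i, level in enumerate(levels, start=1):
        params[f"EventPattern.1.LevelList.{i}"] = level
    for i, name in enumerate(event_names, start=1):
        params[f"EventPattern.1.NameList.{i}"] = name
    return params

params = build_put_event_rule_params(
    rule_name="emr-namenode-events",      # placeholder rule name
    product="EMR",
    levels=["CRITICAL", "WARN"],
    event_names=["Maintenance:HDFS.NameNode.OOM"],
)
print(params["EventPattern.1.NameList.1"])  # Maintenance:HDFS.NameNode.OOM
```

Passing the resulting map to an SDK or signed HTTP request is left out here; check the PutEventRule API reference for the full parameter list before use.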

Test a system event-triggered alert rule

After you create a system event-triggered alert rule, you can test the rule to check whether alert notifications are received, or whether events are pushed to Message Service (MNS), Function Compute, Log Service, or the specified callback URL as configured.

  1. Log on to the CloudMonitor console.
  2. In the left-side navigation pane, choose Alerts > Alert Rules.
  3. On the Alert Rules page, click the Event Alert tab.
  4. On the Event Alert tab, find the system event-triggered alert rule that you want to test and click test in the Actions column.
  5. In the Create event test panel, select a system event and modify the event content as required.
  6. Click OK.
    CloudMonitor triggers the selected system event. Check whether an alert notification is received or the event is pushed to MNS, Function Compute, Log Service, or a specific callback URL as configured.
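When a test event is delivered to a callback URL or an MNS queue, it arrives as a JSON document. The following sketch parses such a payload; the field names (product, name, level, regionId, content) are assumptions based on the common shape of CloudMonitor system event notifications, so verify them against an actual test delivery before relying on them.

```python
import json

# Sample system-event payload as it might arrive at a callback URL or in an
# MNS queue. The field names below are assumptions based on CloudMonitor's
# system event notification format; verify against a real test delivery.
raw = json.dumps({
    "product": "EMR",
    "name": "Maintenance:HDFS.NameNode.OOM",
    "level": "CRITICAL",
    "regionId": "cn-hangzhou",
    "content": {"clusterId": "C-1234567890ABCDEF"},  # illustrative content
})

def summarize_event(payload: str) -> str:
    """Return a one-line summary of a system event notification."""
    event = json.loads(payload)
    return f'[{event["level"]}] {event["product"]}: {event["name"]}'

print(summarize_event(raw))  # [CRITICAL] EMR: Maintenance:HDFS.NameNode.OOM
```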

System events

The following sections list the system events of each EMR service. For each event, the event ID, APM event name, description, and processing method are provided.

HDFS

  • EMR-330200049 Maintenance:HDFS.NameNode.ActiveStandbySwitch
    A failover from the active NameNode to the standby NameNode is triggered.
    Processing method: Identify the cause of a service time-out on the master node. Possible causes include garbage collection issues and block report storms.
  • EMR-350202005, EMR-350200012 Maintenance:HDFS.NameNode.OOM
    An out-of-memory (OOM) exception occurs in the NameNode.
    Processing method: On the configuration page of the HDFS service in the EMR console, set the hadoop_namenode_heapsize parameter to a larger value to increase the memory.
  • EMR-350200005 Maintenance:HDFS.NameNode.DirectoryFormatted
    The metadata directory of the NameNode is deleted.
    Processing method: Submit a ticket.
    Warning: Do not restart the HDFS service. Otherwise, data may be lost.
  • EMR-350200006 Maintenance:HDFS.NameNode.LoadFsImageException
    An exception occurs when the NameNode reads data from the FsImage file.
    Processing method: The exception may be caused by abnormal cloud disk data. Submit a ticket.
  • EMR-350200008 Maintenance:HDFS.NameNode.ExitUnexpectely
    The NameNode exits unexpectedly.
    Processing method: View the logs of the unexpected exit and address the issue based on the logs.
  • EMR-350200015 Maintenance:HDFS.NameNode.WriteToJournalNodeTimeout
    A time-out occurs when the NameNode writes data to JournalNodes.
    Processing method: Check whether the JournalNode service is normal, and check whether abnormal traffic is generated on the machine or network where the JournalNode service runs.
  • EMR-350200014 Maintenance:HDFS.NameNode.ResourceLow
    The disk space of the NameNode is insufficient. As a result, the NameNode enters safe mode.
    Processing method: Check the disk space occupied by the /mnt/disk1 directory of the master node. If the directory contains overlarge log files, process the log files. If the sizes of all the log files are normal, resize the disk.
  • EMR-350200007 Maintenance:HDFS.NameNode.SyncJournalFailed
    A time-out occurs when the NameNode synchronizes data by using JournalNodes.
    Processing method: Check whether the JournalNode service is normal, and check whether abnormal traffic is generated on the machine or network where the JournalNode service runs.
  • EMR-350201008 Maintenance:HDFS.DataNode.OOM.UnableToCreateNewNativeThread
    A DataNode cannot create new threads.
    Processing method: A possible cause is that an abnormal task has created a large number of threads on the DataNode and no more threads can be created. You can run the ps -eo nlwp,pid,args --sort nlwp | tail -n 5 command on the master node as the root user to find the five processes with the largest numbers of threads. Then, analyze the processes and address the issue.
  • EMR-350202006 Maintenance:HDFS.ZKFC.TransportLevelExceptionInMonitorHealth
    A time-out occurs when ZKFailoverController (ZKFC) checks the health of the NameNode.
    Processing method: The exception may be caused by insufficient memory of the NameNode, garbage collection issues, or block report storms. If all the preceding causes are excluded and the cluster is large, set the ha.health-monitor.rpc-timeout.ms parameter to a larger value in the EMR console to prevent the time-out.
  • EMR-350202002 Maintenance:HDFS.ZKFC.UnableToConnectToQuorum
    ZKFC cannot connect to ZooKeeper.
    Processing method: Check the existing connections to ZooKeeper. If a large number of connections exist, ZKFC may fail to connect to ZooKeeper. Close abnormal connections.
  • EMR-350202001 Maintenance:HDFS.ZKFC.UnableToStartZKFC
    ZKFC cannot be started.
    Processing method: View the ZKFC logs and address the issue based on the logs.
  • EMR-250201009 Maintenance:HDFS.DataNode.OomForJavaHeapSpace
    An OOM exception occurs in a DataNode.
    Processing method: On the configuration page of the HDFS service in the EMR console, set the hadoop_datanode_heapsize parameter to a larger value to increase the memory.
  • EMR-250201004 Maintenance:HDFS.DataNode.ExitUnexpected
    A DataNode exits unexpectedly.
    Processing method: View the logs of the unexpected exit and address the issue based on the logs.

YARN

  • EMR-330300053 Maintenance:YARN.ResourceManager.ActiveStandbySwitch
    A failover from the active ResourceManager to the standby ResourceManager is triggered.
    Processing method: The failover may be caused by a garbage collection issue, insufficient resources on the master node, or a kernel error.
  • EMR-350300015 Maintenance:YARN.ResourceManager.ZKRMStateStoreCannotConnectZK
    The ResourceManager cannot use ZkClient to connect to ZooKeeper.
    Processing method: Check the existing connections to ZooKeeper. If abnormal connections exist, close them first.
  • EMR-350300011 Maintenance:YARN.ResourceManager.ErrorInStarting
    The ResourceManager fails to start.
    Processing method: Check whether the failure is caused by invalid configurations.
  • EMR-350300010 Maintenance:YARN.ResourceManager.ErrorInTransitionToActiveMode
    An error occurs when the ResourceManager switches to active mode.
    Processing method: View the logs of the switchover failure and address the issue based on the logs.
  • EMR-350300004 Maintenance:YARN.ResourceManager.InvalidConf.CannotFoundRM_HA_ID
    RM_HA_ID cannot be found because some configurations of the ResourceManager are invalid.
    Processing method: Check the parameters on the configuration page and correct the invalid configurations.
  • EMR-350300013 Maintenance:YARN.ResourceManager.ExitUnexpected
    The ResourceManager exits unexpectedly.
    Processing method: View the logs of the unexpected exit and address the issue based on the logs.
  • EMR-350302004 Maintenance:YARN.JobHistory.StartingError
    The Job History service fails to start.
    Processing method: View the logs of the startup failure and address the issue based on the logs.
  • EMR-350302003 Maintenance:YARN.JobHistory.ExitUnExpectedly
    The Job History service exits unexpectedly.
    Processing method: View the logs of the unexpected exit and address the issue based on the logs.
  • EMR-350303004 Maintenance:YARN.TimelineServer.ErrorInStarting
    Timeline Server fails to start.
    Processing method: View the logs of the startup failure and address the issue based on the logs.
  • EMR-350303003 Maintenance:YARN.TimelineServer.ExitUnexpectedly
    Timeline Server exits unexpectedly.
    Processing method: View the logs of the unexpected exit and address the issue based on the logs.
  • EMR-250300001 Maintenance:YARN.ResourceManager.UnkownHostException
    The ResourceManager cannot resolve the hostnames of some nodes.
    Processing method: Check whether a Domain Name System (DNS) service is used and whether the exception is caused by unstable performance of the DNS service. If no DNS service is used, check whether valid host information is configured in the hosts file in the /etc/ directory.
  • EMR-250301010 Maintenance:YARN.NodeManager.UnHealthyForDiskFailed
    The NodeManager becomes unhealthy due to disk errors.
    Processing method: Check whether the disks of the NodeManager are full. If a disk is full, submit a ticket.
  • EMR-250301006 Maintenance:YARN.NodeManager.ErrorRebootingNodeStatusUpdater
    The NodeManager fails to restart NodeStatusUpdater.
    Processing method: Check the NodeManager reconnection logs and the ResourceManager service.
  • EMR-250301015 Maintenance:YARN.NodeManager.OOM
    An OOM exception occurs in the NodeManager.
    Processing method: Check whether the exception is caused by invalid configurations.

Hive

  • EMR-350400016 Maintenance:HIVE.HiveMetaStore.JdbcCommunicationException
    A Java Database Connectivity (JDBC) exception occurs in Hive Metastore.
    Processing method: Check the metadata store connection and use the local MySQL client to perform a connectivity test.
  • EMR-350400010 Maintenance:HIVE.HiveMetaStore.DataBaseCommunicationLinkFailure
    A communication link exception occurs in Hive Metastore.
    Processing method: Check the metadata store connection and use the local MySQL client to perform a connectivity test.
  • EMR-350400009 Maintenance:HIVE.HiveMetaStore.DataBaseConnectionFailed
    A database connection failure occurs in Hive Metastore.
    Processing method: Check the metadata store connection and use the local MySQL client to perform a connectivity test.
  • EMR-350400014 Maintenance:HIVE.HiveMetaStore.OomOccured
    An OOM exception occurs in Hive Metastore.
    Processing method: On the configuration page of the Hive service in the EMR console, set the hive_metastore_heapsize parameter to a larger value to increase the memory.
  • EMR-350400015 Maintenance:HIVE.HiveMetaStore.DataBaseDiskQuotaUsedup
    The number of metadata store connections exceeds the upper limit.
    Processing method: If you use an on-premises MySQL database or ApsaraDB RDS to store metadata, you can manually increase the upper limit. If you manage metadata for multiple engines in a centralized manner, submit a ticket.
  • EMR-350400006 Maintenance:HIVE.HiveMetaStore.MaxQuestionsExceeded
    The number of metadata store connections exceeds the upper limit.
    Processing method: If you use an on-premises MySQL database or ApsaraDB RDS to store metadata, you can manually increase the upper limit. If you manage metadata for multiple engines in a centralized manner, submit a ticket.
  • EMR-350400007 Maintenance:HIVE.HiveMetaStore.MaxUpdatesExceeded
    The number of metadata store connections exceeds the upper limit.
    Processing method: If you use an on-premises MySQL database or ApsaraDB RDS to store metadata, you can manually increase the upper limit. If you manage metadata for multiple engines in a centralized manner, submit a ticket.
  • EMR-350400005 Maintenance:HIVE.HiveMetaStore.MaxUserConnectionExceeded
    The number of metadata store connections exceeds the upper limit.
    Processing method: If you use an on-premises MySQL database or ApsaraDB RDS to store metadata, you can manually increase the upper limit. If you manage metadata for multiple engines in a centralized manner, submit a ticket.
  • EMR-350400002 Maintenance:HIVE.HiveMetaStore.ParseConfError
    An error occurs when the system parses the configuration file of Hive Metastore.
    Processing method: In most cases, the error is caused by invalid configurations. Check the configurations in the hivemetastore-site.xml file and fix the error.
  • EMR-350400008 Maintenance:HIVE.HiveMetaStore.RequiredTableMissing
    A table required by Hive Metastore is missing.
    Processing method: Check whether the Metastore version and the Hive version are consistent, and check whether the metadata is damaged.
  • EMR-350401013 Maintenance:HIVE.HiveServer2.HiveServer2OOM
    An OOM exception occurs in HiveServer2.
    Processing method: On the configuration page of the Hive service in the EMR console, set the hive_server2_heapsize parameter to a larger value to increase the memory.
  • EMR-350401007 Maintenance:HIVE.HiveServer2.CannotConnectByAnyURIsProvided
    HiveServer2 cannot connect to Hive Metastore.
    Processing method: Check the Metastore service.
  • EMR-350401004 Maintenance:HIVE.HiveServer2.ConnectToZkTimeout
    A time-out occurs when HiveServer2 connects to ZooKeeper.
    Processing method: Check the ZooKeeper service.
  • EMR-350401008 Maintenance:HIVE.HiveServer2.ErrorStartingHiveServer
    HiveServer2 fails to start.
    Processing method: View the logs of the startup failure and address the issue based on the logs.
  • EMR-350401006 Maintenance:HIVE.HiveServer2.FailedInitMetaStoreClient
    HiveServer2 fails to initialize the Metastore client.
    Processing method: View the logs of the initialization failure and address the issue based on the logs.
  • EMR-350401005 Maintenance:HIVE.HiveServer2.FailedToConnectToMetaStoreServer
    HiveServer2 cannot connect to the Metastore server.
    Processing method: Check the Metastore service.

Host

  • EMR-350100001 Maintenance:HOST.OomFoundInVarLogMessage
    An OOM exception occurs, and logs of the exception are written to the /var/log/messages file on the host.
    Processing method: A process may be terminated due to insufficient memory on the host. Check whether memory allocation is appropriate. If the memory is insufficient, increase the memory.
  • EMR-250100006 Maintenance:HOST.CpuStuck
    An alert is reported due to high CPU utilization of the Linux kernel.
    Processing method: Check whether CPU resources are occupied by abnormal tasks for a long time.

Spark History

  • EMR-350900001 Maintenance:SPARK.SparkHistory.OomOccured
    An OOM exception occurs in the Spark History service.
    Processing method: On the configuration page of the Spark service in the EMR console, set the spark_history_daemon_memory parameter to a larger value to increase the memory.

ZooKeeper

  • EMR-350500001 Maintenance:ZOOKEEPER.UnableToRunQuorumServer
    ZooKeeper fails to start.
    Processing method: Check the logs of the startup failure and the configuration file of ZooKeeper. Check whether an OOM exception has occurred due to a large number of znodes.
  • EMR-230500059 Maintenance:ZOOKEEPER.LeaderFollowerSwitch
    A failover occurs in ZooKeeper.
    Processing method: Identify the cause of the failover and determine whether to restart ZooKeeper.
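The APM event names above follow a hierarchical pattern (Maintenance:<Service>.<Component>.<Event>), which makes automated routing of alert notifications straightforward. The following sketch splits an event name into its parts so that, for example, HDFS events can be routed to the team that operates HDFS. The parsing logic is an assumption based only on the naming pattern shown in this table.

```python
# Sketch: split an EMR APM event name of the form
# "Maintenance:<Service>.<Component>...<Event>" into its parts, based on the
# naming pattern in the table above.

def parse_apm_event_name(name: str) -> dict:
    """Split an APM event name into category, service, component, and event."""
    category, _, path = name.partition(":")
    parts = path.split(".")
    return {
        "category": category,                               # e.g. "Maintenance"
        "service": parts[0],                                # e.g. "HDFS"
        "component": parts[1] if len(parts) > 2 else None,  # e.g. "NameNode"
        "event": parts[-1],                                 # e.g. "OOM"
    }

info = parse_apm_event_name("Maintenance:HDFS.NameNode.OOM")
print(info["service"], info["event"])  # HDFS OOM
```

Note that some event names, such as Maintenance:ZOOKEEPER.LeaderFollowerSwitch, have no component segment, which the sketch reports as None.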