CloudMonitor is the monitoring and alerting service of Alibaba Cloud. You can create threshold-triggered alert rules in the CloudMonitor console to monitor the usage of E-MapReduce (EMR) resources. If a metric meets the alert condition specified in a rule, CloudMonitor automatically sends an alert notification so that you can handle the exception promptly.

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Procedure

  1. Log on to the CloudMonitor console.
  2. In the left-side navigation pane, choose Alerts > Alert Rules.
  3. On the Threshold Value Alert tab, click Create Alert Rule.
  4. On the Create Alert Rule page, set the parameters for an alert rule.
    Create cluster rules
    Parameter: Description
    Product: Select E-MapReduce from the drop-down list.
    Resource Range: The resources to which the alert rule applies. Valid values:
    • All Resources: The alert rule applies to all EMR clusters of the current Alibaba Cloud account.
    • Cluster: The alert rule applies only to a specific cluster.
    Region: The region to associate with the alert rule. The drop-down list contains all Alibaba Cloud regions supported by EMR.
    Note: This parameter appears only if you select Cluster for Resource Range.
    Cluster: The cluster to associate with the alert rule. The drop-down list contains all existing clusters of the current Alibaba Cloud account.
    Note: This parameter appears only if you select Cluster for Resource Range.
    Alert Rule: The name of the alert rule.
    Rule Description: The condition that triggers an alert. For example, if you specify that the average CPU utilization over 5 minutes is greater than or equal to 90% for three consecutive cycles, CloudMonitor checks the metric every 5 minutes and generates an alert only if the condition is met in three consecutive checks.
    Note: For more information, see Metrics.
    Mute for: The mute period of an alert. This is the interval at which the alert notification is resent to the specified contacts if the alert is not cleared.
    Effective Period: The period during which the alert rule is in effect. The system monitors the metric and generates alerts only during this period.
    Notification Contact: The contact groups to which alert notifications are sent. You can select an existing contact group or create one. For more information about how to create a contact group, see Create an alert contact or alert group.
    Notification Methods: Valid value: Email + DingTalk (Info).
    Auto Scaling: If you select Auto Scaling, a specific scaling rule is triggered when an alert is generated. You must set the Region, ESS Group, and ESS Rule parameters.
    Log Service: If you select Log Service, the alert information is written to Log Service when an alert is generated. You must set the Region, Project, and Logstore parameters. For more information about how to create a project and a Logstore, see Quick start.
    Email Remark: The additional information that you want to include in the alert notification email.
    HTTP CallBack: A URL that can be accessed from the Internet. CloudMonitor sends the alert information to this URL in a POST request. Only HTTP is supported.
  5. Click Confirm.
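
The way a rule condition and the mute period interact can be easier to follow in code. The following Python sketch is only an illustration of the behavior described for the Rule Description and Mute for parameters, not CloudMonitor's actual implementation. The function names, the default values, and the assumption that a cleared alert notifies again immediately when it recurs are all hypothetical.

```python
def should_alert(cycle_averages, threshold=90.0, consecutive=3):
    """Return True if the most recent `consecutive` cycle averages all meet the threshold.

    cycle_averages: one average value per check cycle (for example, per 5 minutes),
    ordered from oldest to newest.
    """
    if len(cycle_averages) < consecutive:
        return False  # Not enough cycles observed yet.
    return all(value >= threshold for value in cycle_averages[-consecutive:])


def notification_minutes(alert_active, cycle_minutes=5, mute_minutes=15):
    """Return the minute offsets at which a notification would be sent.

    alert_active: one boolean per check cycle, True while the alert is not cleared.
    A notification is sent when the alert starts, then again each time the mute
    period has elapsed while the alert is still active.
    """
    sent = []
    last_sent = None
    for index, active in enumerate(alert_active):
        now = index * cycle_minutes
        if not active:
            # Assumption: once the alert clears, a new alert notifies immediately.
            last_sent = None
            continue
        if last_sent is None or now - last_sent >= mute_minutes:
            sent.append(now)
            last_sent = now
    return sent
```

For example, should_alert([85.0, 91.0, 92.5, 95.0]) returns True because the last three averages are all at least 90, whereas should_alert([91.0, 92.0, 89.9]) returns False because the most recent cycle falls below the threshold.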

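If you set the HTTP CallBack parameter, you need an endpoint that accepts the POST request. The following is a minimal sketch of such a receiver that uses only the Python standard library. The format of the alert information is not described in this topic, so the sketch stores the raw request body without parsing it. The names AlertCallbackHandler, received_alerts, and start_callback_server are hypothetical, and the loopback address is used only so that the sketch can be tried locally.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store for received alert bodies (illustration only).
received_alerts = []


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and records the raw request body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        # The payload layout is an unknown here, so the body is kept unparsed.
        received_alerts.append(body)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # Silence the default per-request logging in this sketch.


def start_callback_server(host="127.0.0.1", port=0):
    """Start the receiver in a background thread. Port 0 selects a free port."""
    server = HTTPServer((host, port), AlertCallbackHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

In a real deployment, the server would have to listen on an address that is reachable from the Internet over HTTP, and the URL of that address is the value to enter for HTTP CallBack.
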
Metrics

Metrics are grouped by service. Each entry gives the APM metric name followed by its description.

Service: HDFS
NameNodeIpcPortOpen: The availability of the IPC port of the NameNode.
  • 1: available
  • 0: unavailable
TotalDFSUsedPercent: The total HDFS capacity usage of a cluster.
DataNodeDfsUsedPercent: The HDFS capacity usage of a DataNode.
DataNodeIpcPortOpen: The availability of the IPC port of a DataNode.
  • 1: available
  • 0: unavailable
JournalNodeRpcPortOpen: The availability of the RPC port of a JournalNode.
  • 1: available
  • 0: unavailable
ZKFCPortOpen: The availability of the ZKFailoverController (ZKFC) port.
  • 1: available
  • 0: unavailable
dfs.FSNamesystem.MissingBlocks: The number of missing blocks.
dfs.datanode.VolumeFailures: The number of damaged disks detected by HDFS.

Service: YARN
ResourceManagerPortOpen: The availability of the service port of the ResourceManager.
  • 1: available
  • 0: unavailable
JobHistoryPortOpen: The availability of the service port of Job History.
  • 1: available
  • 0: unavailable
yarn.ClusterMetrics.NumUnhealthyNM: The number of unhealthy NodeManagers.
ProxyServerPortOpen: The availability of the WebAppProxy port.
  • 1: available
  • 0: unavailable
TimelineServerPortOpen: The availability of the service port of Timeline Server.
  • 1: available
  • 0: unavailable

Service: Hive
MetastorePortOpen: The availability of the Hive Metastore port.
  • 1: available
  • 0: unavailable
HiveServer2PortOpen: The availability of the service port of HiveServer2.
  • 1: available
  • 0: unavailable
ThriftServerPortOpen: The availability of the service port of Thrift Server.
  • 1: available
  • 0: unavailable

Service: HBase
HMasterIpcPortOpen: The availability of the IPC port of HMaster.
  • 1: available
  • 0: unavailable
HRegionServerIpcPortOpen: The availability of the IPC port of HRegionServer.
  • 1: available
  • 0: unavailable

Service: ZooKeeper
ZKClientPortOpen: The availability of the listening port of the ZooKeeper client.
  • 1: available
  • 0: unavailable

Service: Hue
HuePortOpen: The availability of the Hue port.
  • 1: available
  • 0: unavailable

Service: Storm
StormNimbusThriftPortOpen: The availability of the Thrift port of Storm Nimbus.
  • 1: available
  • 0: unavailable

Service: Host
proc_total: The total number of processes.
part_max_used: The maximum usage of a disk partition.
disk_free_percent_mnt_disk1: The percentage of disk space occupied by the /mnt/disk1 directory.
disk_free_percent_rootfs: The percentage of disk space occupied by the root file system.
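
All metrics whose names end in PortOpen use the same encoding: 1 means available and 0 means unavailable. The following Python sketch shows one way to turn a set of metric samples into a list of unavailable ports per service. The metric names are copied from the list above. The samples dictionary and the unavailable_ports function are hypothetical, and how the samples are obtained is outside the scope of this sketch.

```python
# Port availability metrics from the list above, grouped by service.
PORT_METRICS = {
    "HDFS": ["NameNodeIpcPortOpen", "DataNodeIpcPortOpen",
             "JournalNodeRpcPortOpen", "ZKFCPortOpen"],
    "YARN": ["ResourceManagerPortOpen", "JobHistoryPortOpen",
             "ProxyServerPortOpen", "TimelineServerPortOpen"],
    "Hive": ["MetastorePortOpen", "HiveServer2PortOpen", "ThriftServerPortOpen"],
    "HBase": ["HMasterIpcPortOpen", "HRegionServerIpcPortOpen"],
    "ZooKeeper": ["ZKClientPortOpen"],
    "Hue": ["HuePortOpen"],
    "Storm": ["StormNimbusThriftPortOpen"],
}


def unavailable_ports(samples):
    """Return {service: [metric, ...]} for port metrics whose value is 0.

    samples: hypothetical mapping of metric name to its latest value.
    Metrics that are missing from `samples` are skipped, not treated as down.
    """
    down = {}
    for service, metrics in PORT_METRICS.items():
        failed = [name for name in metrics if samples.get(name) == 0]
        if failed:
            down[service] = failed
    return down


# Example input: one DataNode IPC port and the Hue port report 0 (unavailable).
samples = {
    "NameNodeIpcPortOpen": 1,
    "DataNodeIpcPortOpen": 0,
    "HuePortOpen": 0,
    "proc_total": 312,
}
result = unavailable_ports(samples)
```

With the example input, result contains DataNodeIpcPortOpen under HDFS and HuePortOpen under Hue. The proc_total value is ignored because it is not a port availability metric.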