CloudMonitor is a monitoring and alerting service provided by Alibaba Cloud. You can create threshold-triggered alert rules in the CloudMonitor console to monitor the usage of E-MapReduce (EMR) resources. If the value of a metric meets the condition that is specified in a rule, CloudMonitor automatically sends an alert notification so that you can handle the exception at the earliest opportunity.

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Procedure

  1. Log on to the CloudMonitor console.
  2. In the left-side navigation pane, choose Alerts > Alert Rules.
  3. On the Alert Rules page, click Create Alert Rule.
  4. In the Create Alert Rule panel, configure the parameters for an alert rule. For more information, see Create an alert rule. The following list describes the parameters in the format Parameter: Description.
    Product: Select E-MapReduce from the drop-down list.
    Resource Range: The resources to which the alert rule is applied. Valid values:
      • All Resources: The alert rule is applied to all EMR clusters of the current Alibaba Cloud account.
      • Application Group: The alert rule is applied to all EMR clusters in a specific application group of the current Alibaba Cloud account.
      • Instances: The alert rule is applied only to a specific EMR cluster.
    Rule Description: The content of the alert rule. This parameter specifies the condition that triggers an alert. For example, if you specify a condition in which the average CPU utilization every 5 minutes is greater than or equal to 90% for three consecutive cycles, CloudMonitor checks the metric every 5 minutes and triggers an alert only if the average CPU utilization is greater than or equal to 90% in three consecutive checks.
      Note: For more information about the supported metrics, see Metrics.
      To configure the content of the alert rule, perform the following steps:
      1. Click Add Rules.
      2. In the Add Rule Description panel, configure the Alert Rule, Metric Type, Metric, and Threshold and Alert Level parameters.
      3. Click OK.
    Mute For: The period during which an alert is muted. This is the interval at which an alert notification is sent to the specified contacts again if the alert is not cleared.
      Note: You can click Advanced Settings to configure this parameter.
    Effective Time: The period during which the alert rule is effective. The system monitors the metric and generates an alert only if the alert rule is in effect.
      Note: You can click Advanced Settings to configure this parameter.
    Alert Contact Group: The contact groups to which alert notifications are sent. You can select one or more existing contact groups or create a contact group. For more information about how to create a contact group, see Create an alert contact or alert contact group.
    Alert Callback: A URL that can be accessed from the Internet. CloudMonitor sends alert information to this URL in a POST request. Only HTTP requests are supported.
    Auto Scaling: If you turn on Auto Scaling, a specific scaling rule is triggered when an alert is generated. You must configure the Region, ESS Group, and ESS Rule parameters.
    Log Service: If you turn on Log Service, the alert information is written to Log Service when an alert is generated. You must configure the Region, ProjectName, and Logstore parameters. For more information about how to create a project and a Logstore, see Getting Started.
    Message Service - topic: If you turn on Message Service - topic, the alert information is written to a specific topic in Message Service (MNS) when an alert is triggered. You must configure the Region and topicName parameters. For information about how to create a topic, see Create a topic.

  5. Click OK.
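
The Alert Callback parameter requires an HTTP endpoint that accepts POST requests. The following Python sketch shows one way such an endpoint might look. It is an illustration under stated assumptions, not the official callback contract: the body encoding (JSON or URL-encoded form data), port 8080, and the names parse_alert_body, AlertCallbackHandler, and run are hypothetical. Check the CloudMonitor documentation for the fields that are actually sent.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def parse_alert_body(raw: bytes, content_type: str = "") -> dict:
    """Decode a POST body into a flat dict of strings.

    Assumption: the body is JSON or URL-encoded form data. The real
    payload format is not described on this page.
    """
    text = raw.decode("utf-8", errors="replace")
    if "json" in content_type.lower():
        data = json.loads(text)
        if isinstance(data, dict):
            return {str(k): str(v) for k, v in data.items()}
        # Unexpected JSON shape: keep the raw text for inspection.
        return {"payload": text}
    # parse_qs returns a list per key; keep the first value only.
    return {k: v[0] for k, v in parse_qs(text).items()}


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and prints the decoded alert fields."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        alert = parse_alert_body(raw, self.headers.get("Content-Type", ""))
        print("alert received:", alert)
        # Reply with 200 so the sender knows the alert was accepted.
        self.send_response(200)
        self.end_headers()


def run(port: int = 8080) -> None:
    """Serve plain HTTP until interrupted. Nothing starts on import."""
    HTTPServer(("0.0.0.0", port), AlertCallbackHandler).serve_forever()
```

To try the sketch, call run() on a host that is reachable from the Internet and enter the URL of that host in the Alert Callback field. The server uses plain HTTP because only HTTP requests are supported for callbacks.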

Metrics

The following metrics are listed by service in the format Metric name: Description.

HDFS
  NameNodeIpcPortOpen: The availability of the IPC port of the NameNode.
    • 1: available
    • 0: unavailable
  TotalDFSUsedPercent: The total Hadoop Distributed File System (HDFS) capacity usage of a cluster.
  DataNodeDfsUsedPercent: The HDFS capacity usage of a DataNode.
  DataNodeIpcPortOpen: The availability of the IPC port of a DataNode.
    • 1: available
    • 0: unavailable
  JournalNodeRpcPortOpen: The availability of the RPC port of a JournalNode.
    • 1: available
    • 0: unavailable
  ZKFCPortOpen: The availability of the ZKFailoverController (ZKFC) port.
    • 1: available
    • 0: unavailable
  dfs.FSNamesystem.MissingBlocks: The number of missing blocks.
  dfs.datanode.VolumeFailures: The number of damaged disks detected by HDFS.
YARN
  ResourceManagerPortOpen: The availability of the service port of the ResourceManager.
    • 1: available
    • 0: unavailable
  JobHistoryPortOpen: The availability of the service port of Job History.
    • 1: available
    • 0: unavailable
  yarn.ClusterMetrics.NumUnhealthyNM: The number of unhealthy NodeManagers.
  ProxyServerPortOpen: The availability of the WebAppProxy port.
    • 1: available
    • 0: unavailable
  TimelineServerPortOpen: The availability of the service port of Timeline Server.
    • 1: available
    • 0: unavailable
Hive
  MetastorePortOpen: The availability of the Hive Metastore port.
    • 1: available
    • 0: unavailable
  HiveServer2PortOpen: The availability of the service port of HiveServer2.
    • 1: available
    • 0: unavailable
  ThriftServerPortOpen: The availability of the service port of Thrift Server.
    • 1: available
    • 0: unavailable
HBase
  HMasterIpcPortOpen: The availability of the IPC port of HMaster.
    • 1: available
    • 0: unavailable
  HRegionServerIpcPortOpen: The availability of the IPC port of HRegionServer.
    • 1: available
    • 0: unavailable
ZooKeeper
  ZKClientPortOpen: The availability of the listening port of the ZooKeeper client.
    • 1: available
    • 0: unavailable
Hue
  HuePortOpen: The availability of the Hue port.
    • 1: available
    • 0: unavailable
Storm
  StormNimbusThriftPortOpen: The availability of the Thrift port of Storm Nimbus.
    • 1: available
    • 0: unavailable
HOST
  proc_total: The total number of processes.
  part_max_used: The maximum usage of a disk partition.
  disk_free_percent_mnt_disk1: The percentage of available space on the disk that is mounted at /mnt/disk1.
  disk_free_percent_rootfs: The percentage of available disk space in the root file system.
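
To make the Rule Description semantics concrete, the following Python sketch evaluates a threshold condition over consecutive check cycles by using metric names from the preceding list. It only illustrates the "condition met for N consecutive cycles" logic. It is not how CloudMonitor is implemented, and the names ThresholdRule and should_alert are hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparison operators that a rule can use (an assumed, illustrative set).
_OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


@dataclass
class ThresholdRule:
    metric: str                  # for example, "TotalDFSUsedPercent"
    comparison: str              # one of the keys in _OPS
    threshold: float
    consecutive_cycles: int = 3  # checks that must all meet the condition


def should_alert(rule: ThresholdRule, samples: list) -> bool:
    """Return True if the most recent N samples all meet the condition.

    samples holds one aggregated value per check cycle, oldest first.
    """
    if len(samples) < rule.consecutive_cycles:
        return False  # not enough cycles observed yet
    compare = _OPS[rule.comparison]
    recent = samples[-rule.consecutive_cycles:]
    return all(compare(value, rule.threshold) for value in recent)


# HDFS usage is at least 90% in each of the last three cycles.
hdfs_rule = ThresholdRule("TotalDFSUsedPercent", ">=", 90.0)
print(should_alert(hdfs_rule, [85.0, 91.0, 92.5, 95.0]))  # True
print(should_alert(hdfs_rule, [91.0, 92.5, 88.0]))        # False

# A port metric reports 0 (unavailable) in the latest cycle.
port_rule = ThresholdRule("NameNodeIpcPortOpen", "==", 0, consecutive_cycles=1)
print(should_alert(port_rule, [1, 1, 0]))                 # True
```

Port availability metrics report only 1 or 0, so an equality check against 0 is a natural condition for them, whereas percentage metrics such as TotalDFSUsedPercent are typically compared against a threshold.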