Cluster monitoring is an important feature of EMR-APM. It provides detailed monitoring data and visualizations for the services, components, and instances of each cluster.

Enter the cluster monitoring page

  1. Log on to the Alibaba Cloud E-MapReduce console.
  2. Click the Monitor tab.
  3. In the left-side navigation pane, click Cluster Monitoring.

Cluster Status page

The Cluster Status page lists all clusters under your account in a specific region, along with the type of each cluster and the number of alerts issued for it on the current day. In the Action column of a cluster, click Details to access the Cluster Overview page, or click Cluster Management to access the Clusters and Services page.

You can click Details in the Action column that corresponds to a specific cluster to view the cluster overview, instance monitoring, and service monitoring information. For more information, see Cluster overview, Instance monitoring, and Service Monitoring.

Cluster overview

The Cluster Overview page provides an overview of the monitoring data for a selected cluster. It is similar to the overview on the dashboard page, except that the dashboard covers all clusters under your account in a specific region.

The Cluster Overview page includes the following sections:

  • Basic cluster information. This section includes the following statistical charts: Alerts (Today), Job Information (Today), YARN Compute Resource Usage, and HDFS Storage Usage.
  • Alerts and Details. This section displays a list of the latest critical exception events that occurred in a specific cluster.
  • Service Status. You can click View Details in the Action column that corresponds to a specific service to view the status details about this service.

Instance monitoring

  • Charts of metric statistics. These charts show the metric statistics of all instances in a specific cluster.
    • Alerts (Today)

      Displays the number of alerts issued in each instance of a cluster. Alerts include those triggered by CPUs, memory, networks, disks, and loads. These alerts do not affect the services that run on the cluster.

    • CPU Usage

      Displays the CPU utilization of all instances in a specific cluster on the current day. CPU utilization metrics include cpu_system and cpu_user. You can view a larger version of the chart and select a time range and time granularity. The default time interval is one hour.

    • Memory Usage

      Displays the memory usage of all instances in a specific cluster on the current day. You can view a larger version of the chart and select a time range and time granularity. The default time interval is one hour.

    • Disk Space Usage

      Displays the total disk space usage of all instances in a specific cluster on the current day. You can view a larger version of the chart and select a time range and time granularity. The default time interval is one hour.

    • Workload Within Five Minutes

      Displays the statistics of the load_five metric for all instances in a specific cluster on the current day. You can view a larger version of the chart and select a time range and time granularity. The default time interval is five minutes.

    • Network Traffic

      Displays the average network traffic of all instances in a specific cluster on the current day. You can view a larger version of the chart and select a time range and time granularity. The default time interval is one minute and data within the most recent two hours is displayed.

  • Alerts and Details.

    This section displays the instance-related alerts that are not related to specific services.

  • Status.
    This section displays a list of all the instances in a specific cluster and snapshots of their performance metrics at the current time. These performance metrics include CPU, Memory, Load (5-minute load by default), Inbound Packet Rate, and Outbound Packet Rate. This section supports pagination. You can click View Details in the Action column that corresponds to a specific instance to access the Instance Details page.
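
The time granularity setting on the charts above can be thought of as grouping raw samples into fixed-width intervals. The following Python sketch illustrates that idea only. The `bucket_average` function, the epoch-second timestamps, the sample values, and the choice of averaging as the aggregation method are all assumptions made for illustration; this page does not state how EMR-APM aggregates chart data.

```python
from collections import defaultdict


def bucket_average(samples, granularity_seconds):
    """Average (epoch_seconds, value) samples into fixed-width time buckets.

    Each bucket is labeled with its start time. Averaging is an assumption
    made for this sketch; other aggregations (max, last) are equally possible.
    """
    if granularity_seconds <= 0:
        raise ValueError("granularity_seconds must be positive")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        start = ts - ts % granularity_seconds  # align to the bucket boundary
        sums[start] += value
        counts[start] += 1
    return [(start, sums[start] / counts[start]) for start in sorted(sums)]


# Hypothetical cpu_user samples taken every 30 minutes (timestamps in seconds).
SAMPLES = [(0, 10.0), (1800, 20.0), (3600, 30.0), (5400, 50.0)]

hourly = bucket_average(SAMPLES, 3600)       # one-hour granularity
half_hourly = bucket_average(SAMPLES, 1800)  # 30-minute granularity
print(hourly)
print(half_hourly)
```

A coarser granularity smooths short spikes, while a finer one keeps them visible, which is why it can help to narrow both the time range and the granularity when you investigate a specific alert.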

Monitoring details for a single instance

You can click View Details in the Action column that corresponds to a specific instance in the Status section to access the Instance Details page.

The Instance Details page includes the following sections:
  • Charts of metric statistics

    See Instance monitoring.

  • Alerts and Details section

    See Instance monitoring.

  • Basic Information section
    This section displays the basic information of an instance, such as the instance name, internal IP address, ECS instance status, ECS instance type, hardware settings, expiration date, and disk information.
  • Instance Snapshot Info section
    This section stores and displays snapshot information about core instance metrics at a fixed interval. These metrics are described in the following table.
    Metric                Description                                    Command
    Uptime                Running time and load averages                 uptime
    Last Logged in Users  Most recent user logins                        last -w -n 25
    Top CPU Processes     Processes that consume the most CPU resources  top -b -w 20480 -c -o %CPU -n 1 | head -20
    Top Memory Processes  Processes that consume the most memory         top -b -w 20480 -c -o %MEM -n 1 | head -20
    Memory Usage          Memory usage of the instance                   free -m
    Disk Space Usage      Disk space usage of the instance               df -h
    Network Statistics    Network statistics of the instance             netstat -s -e
    Dmesg                 Recent output of the kernel message buffer     dmesg -dT | tail -n 25
    Iostat                I/O statistics of the instance                 iostat -x 1 5
    Vmstat                Virtual memory statistics of the instance      vmstat 1 5
    Network Connections   Network connections of the instance            netstat -ap
    Process List          All processes of the instance                  ps auxwwwf
    /etc/hosts            Contents of the /etc/hosts file                cat /etc/hosts

    The system takes a snapshot of Network Connections, Process List, and /etc/hosts every hour. The system takes a snapshot of other metrics every five minutes.

    The Instance Snapshot Info section supports playback. You can select a point in time to view the snapshot taken at that time. This lets you reconstruct the state of an instance at the time of an incident when you troubleshoot instance issues.

    Typical scenarios of instance snapshots are as follows:

    • Find the process that is terminated by OOM killer.

      Suppose that the log analysis and detection feature of EMR-APM analyzes the /var/log/messages log and finds that a Java process was terminated by the OOM killer. An alert with the code EMR-350100001 is reported in both the APM event list and the DingTalk chatbot.

      However, the log records only brief information about the process, such as the process ID 887, and not its complete command-line parameters. The log alone therefore does not identify the terminated process. You can use an instance snapshot to view the process list at the point in time when the issue occurred. The process list records the details of each process and helps you identify the terminated process.
    • Troubleshoot high CPU utilization.

      If the security settings of your instances are improperly configured, the instances may be compromised by cryptocurrency mining software, which causes their CPU utilization to keep increasing. In this case, you can view the Top CPU Processes snapshot to find the process that continuously consumes CPU resources.

    • Troubleshoot high memory usage.

      If the memory usage of an instance continues to increase, you can click the Top Memory Processes tab in the Instance Snapshot Info section to view the list of processes with the highest memory usage at a specific point in time.

    • View port occupation.

      You can view the occupation of an instance port on the Network Connections tab.

    • View disk damage and other kernel issues.

      You can view the recent output of the dmesg command on an instance to identify disk damage or other kernel issues.

    • View disk usage.

      When a disk alert is received, you can click the Disk Space Usage tab in the Instance Snapshot Info section to view the usage of disk space.
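
The OOM killer scenario above comes down to two steps: pick the snapshot taken closest before the incident, then look up the process ID in that snapshot's process list. The following Python sketch illustrates both steps offline. The function names, the snapshot times, and the sample process list are hypothetical and are not part of EMR-APM; the parser only assumes the usual 11-column layout of `ps aux` output.

```python
from bisect import bisect_right
from datetime import datetime


def latest_snapshot_at(snapshot_times, incident_time):
    """Return the newest snapshot time <= incident_time, or None if there is none.

    `snapshot_times` must be sorted in ascending order.
    """
    idx = bisect_right(snapshot_times, incident_time)
    return snapshot_times[idx - 1] if idx else None


def find_process(ps_output, pid):
    """Return the full command line of `pid` from `ps aux`-style text, or None."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # 10 fixed columns, then COMMAND
        if len(fields) == 11 and fields[1] == str(pid):
            # Drop the tree prefix that the `f` (forest) option adds.
            return fields[10].lstrip(" |\\_")
    return None


# Hypothetical hourly Process List snapshots and a made-up process list.
SNAPSHOT_TIMES = [datetime(2024, 1, 1, hour) for hour in (8, 9, 10)]
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  43520  3820 ?        Ss   Jan01   0:05 /usr/lib/systemd/systemd
hadoop     887 12.5 30.2 8123456 4934567 ?     Sl   07:30   3:41  \\_ java -Xmx4g ExampleMain
"""

print(latest_snapshot_at(SNAPSHOT_TIMES, datetime(2024, 1, 1, 9, 47)))
print(find_process(SAMPLE_PS, 887))
```

Because the process list is captured once an hour, a process that starts and is terminated between two snapshots does not appear in either of them. In that case, the Top CPU Processes and Top Memory Processes snapshots, which are taken every five minutes, may be a better place to look.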

  • Basic instance metrics
    • CPU
      • cpu_system: percentage of CPU time spent in system (kernel) mode
      • cpu_user: percentage of CPU time spent in user mode
      • cpu_idle: percentage of CPU time spent idle
      • cpu_wio: percentage of CPU time spent waiting for I/O
    • MEM
      • mem_free: free memory
      • mem_used_percent: percentage of memory in use
      • mem_total: total memory
    • Traffic
      • pkts_in: inbound packets
      • pkts_out: outbound packets
      • bytes_in: inbound bytes
      • bytes_out: outbound bytes
    • Disk
      • disk_total: total disk space
      • disk_free: free disk space
      • disk_free_percent_rootfs: percentage of free disk space on the root file system
    • Other
      • proc_run: number of running processes
      • proc_total: total number of processes
    You can view larger versions of all the basic metric charts. When you view the larger version of a chart, you can select a time range and time granularity.
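
Several of the basic metrics are easier to interpret when combined. The following Python sketch shows one way to derive usage percentages from total and free values. It rests on assumptions that this page does not confirm: that `cpu_idle` is reported as a percentage from 0 to 100, and that each total and free pair uses the same unit. The function names and sample values are hypothetical, and the result for memory may differ from the `mem_used_percent` value that EMR-APM reports, because free memory on Linux usually excludes buffers and cache.

```python
def used_percent(total, free):
    """Percentage of a resource in use, given total and free in the same unit."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - free) / total


def cpu_busy_percent(cpu_idle):
    """CPU time not spent idle, assuming cpu_idle is a percentage (0-100)."""
    return 100.0 - cpu_idle


# Hypothetical readings for one instance; units are illustrative only.
sample = {
    "mem_total": 16000, "mem_free": 4000,
    "disk_total": 500, "disk_free": 125,
    "cpu_idle": 82.5,
}

mem_used = used_percent(sample["mem_total"], sample["mem_free"])
disk_used = used_percent(sample["disk_total"], sample["disk_free"])
cpu_busy = cpu_busy_percent(sample["cpu_idle"])
print(mem_used, disk_used, cpu_busy)
```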