The cluster monitoring feature monitors the services, components, and instances of clusters and visualizes the monitoring results.

Go to the cluster monitoring page

  1. Log on to the Alibaba Cloud EMR console.
  2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  3. Click the Monitor tab.
  4. In the left-side navigation pane, click Cluster Monitoring.

View cluster status

The overview information of clusters is displayed in the Cluster Status list.

  • Click Details in the Action column that corresponds to a specific cluster to view the cluster overview, instance monitoring, and service monitoring information.

    For more information, see Cluster overview, Instance monitoring, and Service monitoring.

  • Click Cluster Management in the Action column that corresponds to a specific cluster.

    The Clusters and Services page appears.

Cluster overview

The Cluster Overview page displays the following monitoring data of a cluster:
  • Four basic statistical charts: Alerts (Today), Job Information (Today), YARN Compute Resource Usage, and HDFS Storage Usage.
  • Alerts and Details: displays the latest critical exception events that occurred in the cluster.
  • Service Status: displays the services of the cluster and their status.

Instance monitoring

  • Statistical charts of metrics: The Instance Monitoring page displays the charts of all instance-related metrics for a cluster. You can click the icon in the upper-right corner of each chart to expand it and customize a time range and an interval.
    • Alerts (Today)

      Displays the number of alerts issued for each instance of the cluster. Alerts can be triggered by CPU, memory, network, disk, and load conditions. These alerts do not affect the services that run on the cluster.

    • CPU Usage

      Displays the CPU utilization of all instances in the cluster on the current day. The default interval is 1 hour.

    • Memory Usage

      Displays the memory usage of all instances in the cluster on the current day. The default interval is 1 hour.

    • Disk Space Usage

      Displays the total disk space usage of all instances in the cluster on the current day. The default interval is 1 hour.

    • Workload Within Five Minutes

      Displays 5-minute load statistics of all instances in the cluster on the current day. The default interval is 5 minutes.

    • Network Traffic

      Displays the average network traffic of all instances in the cluster on the current day. The default interval is 1 minute, and the data within the latest 2 hours is displayed.

  • Alerts and Details

    This section displays the instance-related alerts.

  • Status
    This section displays all the instances in the cluster and snapshots of their performance metrics at the current time. The performance metrics include CPU, Memory, Load (5-minute load by default), Inbound Packet Rate, and Outbound Packet Rate.

Monitoring details for a single instance

You can click View Details in the Action column that corresponds to a specific instance in the Status section to access the Instance Details page.
  • Statistical charts of metrics

    For more information, see Instance monitoring.

  • Alerts and Details

    For more information, see Instance monitoring.

  • Basic Information
    This section displays the basic information of instances, such as the instance name, internal IP address, ECS instance status, ECS instance type, hardware settings, expiration date, and disk information.
  • Instance Snapshot Info
    This section stores and displays snapshot information about core instance metrics at a fixed interval. These metrics are described in the following table.
    Metric                Description                                      Command
    Uptime                How long the instance has been running           uptime
    Last Logged in Users  The most recently logged-on users                last -w -n 25
    Top CPU Processes     Processes that consume the most CPU resources    top -b -w 20480 -c -o %CPU -n 1 | head -20
    Top Memory Processes  Processes that consume the most memory           top -b -w 20480 -c -o %MEM -n 1 | head -20
    Memory Usage          Memory usage of the instance                     free -m
    Disk Space Usage      Disk space usage of the instance                 df -h
    Network Statistics    Network statistics of the instance               netstat -s -e
    Dmesg                 Recent output of the dmesg command               dmesg -dT | tail -n 25
    Iostat                I/O statistics of the instance                   iostat -x 1 5
    Vmstat                Output of the vmstat command                     vmstat 1 5
    Network Connections   Network connections of the instance              netstat -ap
    Process List          All processes of the instance                    ps auxwwwf
    /etc/hosts            Contents of the /etc/hosts file of the instance  cat /etc/hosts

    The system takes a snapshot of Network Connections, Process List, and /etc/hosts every hour. The system takes a snapshot of other metrics every 5 minutes.
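
The two snapshot intervals can be sketched as follows. This is only an illustration of the schedule, not how EMR implements collection: the function name `commands_due` and the alignment of snapshots to minute boundaries are assumptions. Only the commands and the two intervals come from the table and the note above.

```python
# Metrics snapshotted every hour, with the command from the table above.
HOURLY = {
    "Network Connections": "netstat -ap",
    "Process List": "ps auxwwwf",
    "/etc/hosts": "cat /etc/hosts",
}

# Metrics snapshotted every 5 minutes.
FIVE_MINUTE = {
    "Uptime": "uptime",
    "Last Logged in Users": "last -w -n 25",
    "Top CPU Processes": "top -b -w 20480 -c -o %CPU -n 1 | head -20",
    "Top Memory Processes": "top -b -w 20480 -c -o %MEM -n 1 | head -20",
    "Memory Usage": "free -m",
    "Disk Space Usage": "df -h",
    "Network Statistics": "netstat -s -e",
    "Dmesg": "dmesg -dT | tail -n 25",
    "Iostat": "iostat -x 1 5",
    "Vmstat": "vmstat 1 5",
}


def commands_due(minute_of_day: int) -> dict:
    """Return the metric -> command map that is due at a minute (0-1439).

    Assumption: snapshots align to minute 0 of each hour and to
    multiples of 5 minutes. The real alignment is not documented here.
    """
    due = {}
    if minute_of_day % 5 == 0:
        due.update(FIVE_MINUTE)
    if minute_of_day % 60 == 0:
        due.update(HOURLY)
    return due
```

Under these assumptions, `commands_due(0)` returns all 13 metrics, while `commands_due(5)` returns only the 10 metrics that are collected every 5 minutes.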

    The Instance Snapshot Info section supports playback. You can select a point in time to view the snapshot information recorded at that time. This lets you reconstruct the state of an instance at the time an issue occurred when you troubleshoot instance issues.
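
One plausible way to think about playback is as a lookup of the most recent snapshot taken at or before the selected time. The sketch below assumes snapshots are identified by a sorted list of timestamps. The storage layout and the name `snapshot_at` are hypothetical and not part of EMR.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional


def snapshot_at(timestamps: List[datetime], when: datetime) -> Optional[datetime]:
    """Return the latest snapshot time that is <= `when`, or None.

    `timestamps` must be sorted in ascending order.
    """
    # bisect_right gives the index of the first timestamp after `when`.
    i = bisect_right(timestamps, when)
    return timestamps[i - 1] if i > 0 else None
```

With snapshots every 5 minutes, a request for 10:07 would resolve to the 10:05 snapshot under this reading.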

    Typical scenarios of instance snapshots:
    • Find a process that was terminated by the OOM killer.

      Suppose that the log analysis and detection feature of EMR APM analyzes a /var/log/message log and finds that a Java process was terminated by the OOM killer. An alert with code **EMR-350100001** is then reported in both the APM event list and the DingTalk chatbot.

      However, the log records only brief information for process ID 887, not the complete command-line parameters, so the terminated process cannot be identified from the log alone. You can use an instance snapshot to open the process list for the point in time at which the issue occurred. The process list records the details of each process, from which you can identify the terminated process.
    • Troubleshoot high CPU utilization.

      If your instances are configured with improper security settings, they are vulnerable to attacks by mining software, which can cause their CPU utilization to increase continuously. If this happens, find the process that continuously consumes CPU resources, for example, on the Top CPU Processes tab in the Instance Snapshot Info section.

    • Troubleshoot high memory usage.

      If the memory usage of an instance continues to increase, you can click the Top Memory Processes tab in the Instance Snapshot Info section. Then, you can view the list of processes with the highest memory usage at a specific point in time.

    • View port occupation.

      You can view the occupation of an instance port on the Network Connections tab.

    • View disk damage and other kernel issues.

      You can view the recent output of the dmesg command on an instance to identify disk damage or other kernel issues.

    • View disk usage.

      When a disk alert is received, you can click the Disk Space Usage tab in the Instance Snapshot Info section to view the usage of disk space.
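
For the OOM killer scenario in the list above, once the text of a Process List snapshot is available, the terminated process can be looked up by its PID. The sketch below assumes the standard `ps aux` column layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND). The sample lines are made up for illustration, and `find_process` is a hypothetical helper, not an EMR API.

```python
from typing import Optional


def find_process(ps_output: str, pid: int) -> Optional[str]:
    """Return the full command line of `pid` from `ps auxwwwf` output."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # The first 10 columns are fixed; the remainder is the command line.
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[1] == str(pid):
            return fields[10]
    return None


# Made-up sample. A real snapshot comes from the Process List tab.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 125000  3900 ?        Ss   Jan01   0:05 /usr/lib/systemd/systemd
hadoop     887 12.5 40.2 9000000 6500000 ?     Sl   Jan01  90:10 java -Xmx6g -cp /opt/app example.Main
"""

print(find_process(SAMPLE, 887))
```

Because the `f` option of `ps` draws a process tree, the returned command line of a child process may start with tree characters such as `\_`.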

  • Basic instance metrics
    • CPU
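      • cpu_system: percentage of CPU time spent in system (kernel) mode
      • cpu_user: percentage of CPU time spent in user mode
      • cpu_idle: percentage of time the CPU is idle
      • cpu_wio: percentage of CPU time spent waiting for I/O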
    • MEM
      • mem_free: free memory
      • mem_used_percent: percentage of memory used
      • mem_total: total memory
    • Traffic
      • pkts_in: inbound packets
      • pkts_out: outbound packets
      • bytes_in: inbound bytes
      • bytes_out: outbound bytes
    • Disk
      • disk_total: total disk space
      • disk_free: free disk space
      • disk_free_percent_rootfs: percentage of free disk space on the root file system (rootfs)
    • Other
      • proc_run: number of running processes
      • proc_total: number of total processes
    You can expand all the charts of basic instance metrics. After you expand a chart, you can select a time range and an interval.
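
To make the effect of a time range and an interval concrete, the sketch below groups raw samples into fixed-size intervals and averages each group. Averaging is an assumption: the document does not state how the console aggregates samples within an interval. The name `bucket_average` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_average(samples: List[Tuple[int, float]], start: int, end: int,
                   interval: int) -> Dict[int, float]:
    """Average (timestamp, value) samples per interval within [start, end).

    Timestamps, `start`, `end`, and `interval` are in seconds. The keys of
    the result are the start times of the non-empty intervals.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if start <= ts < end:
            # Snap the timestamp down to the start of its interval.
            buckets[start + (ts - start) // interval * interval].append(value)
    return {t: sum(v) / len(v) for t, v in sorted(buckets.items())}
```

A larger interval yields fewer, smoother points for the same time range, and samples outside the selected range are ignored.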