E-MapReduce: View the health status of nodes

Last Updated: Nov 24, 2023

You can use the health status of a node to check whether the node is running as expected. The health status is derived from the results of multiple health check items. This topic describes how to view the health status of a node and the related health check items.

Prerequisites

An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Limits

This topic is applicable only to DataLake, Dataflow, online analytical processing (OLAP), DataServing, and custom clusters.

View the latest health status of nodes

  1. Go to the Nodes tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.

    3. On the EMR on ECS page, find the desired cluster and click Nodes in the Actions column.

  2. On the Nodes tab, view the health status of nodes in each node group.

    • Green number in the Health Status column: indicates the number of nodes in the Good state in the current node group.

    • Yellow number in the Health Status column: indicates the number of nodes in the Warning state in the current node group.

    • Red number in the Health Status column: indicates the number of nodes in the Abnormal state in the current node group.

    • Gray number in the Health Status column: indicates the total number of nodes in the Unknown or Stateless state in the current node group.

    On the Nodes tab, click the icon to the left of the name of a node group. In the node list that appears, you can view the health status of each node in the Health Status column.

    A node can be in one of the following states: Good, Warning, Abnormal, Unknown, or Stateless. Each state is indicated by a different icon.

    | Health status | Description |
    | --- | --- |
    | Good | The node is running as expected. |
    | Warning | The node is running as expected, but the health check items of the node indicate potential risks. Pay attention to these risks. |
    | Abnormal | The node is unavailable. The health check items of the node indicate serious issues. Troubleshoot the issues at the earliest opportunity. |
    | Stateless | No health check is performed on the node after an installation process or a manual stop. No action is required for nodes in this state. |
    | Unknown | The results of the health check items of the node cannot be obtained. If your business is not affected, no action is required for nodes in this state. |
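
The relationship between node states and the colored numbers in the Health Status column can be sketched as follows. This is only an illustration of the counting rule described above, not EMR code: the function name `summarize_node_group` and the input list of state names are hypothetical, and the sketch assumes you already have the per-node states from the console.

```python
# Color shown in the Health Status column for each node state.
# Unknown and Stateless nodes are counted together under gray.
STATE_TO_COLOR = {
    "Good": "green",
    "Warning": "yellow",
    "Abnormal": "red",
    "Unknown": "gray",
    "Stateless": "gray",
}


def summarize_node_group(node_states):
    """Count the nodes of one node group per color (hypothetical helper)."""
    counts = {"green": 0, "yellow": 0, "red": 0, "gray": 0}
    for state in node_states:
        # A KeyError here means the state name is not one of the five states.
        counts[STATE_TO_COLOR[state]] += 1
    return counts


if __name__ == "__main__":
    states = ["Good", "Good", "Warning", "Unknown", "Stateless"]
    print(summarize_node_group(states))
    # {'green': 2, 'yellow': 1, 'red': 0, 'gray': 2}
```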

View health check items of a node

  1. On the Nodes tab, find the desired node group and click the icon to the left of the name of the node group.

  2. Find the desired node and click View Check Items to the right of the health status in the Health Status column.

  3. In the panel that appears, view the latest results of health check items and the health check history of the current node.

    The following table describes the health check items. In the thresholds, u denotes the value of the check item.

    | Name | Description | Threshold | Unit |
    | --- | --- | --- | --- |
    | host_memory_utilization_check | Checks the average memory usage in the past 3 minutes. | Good: 0 ≤ u < 85; Warning: 85 ≤ u < 95; Abnormal: 95 ≤ u < 100 | Percentage |
    | host_cpu_utilization_check | Checks the average CPU utilization in the past 3 minutes. | Good: 0 ≤ u < 85; Warning: 85 ≤ u < 95; Abnormal: 95 ≤ u < 100 | Percentage |
    | host_cpu_load5_check | Checks the average CPU load in the past 5 minutes. | Good: u < Number of vCPU cores × 1.5; Warning: u ≥ Number of vCPU cores × 1.5 | - |
    | host_network_transmission_check | Checks the packet loss rate or packet error rate during network transmission in the past 3 minutes. | Good: u < 1; Abnormal: u ≥ 1 | Percentage |
    | host_disk_space_check | Checks the disk usage. | Good: 0 ≤ u < 90; Warning: 90 ≤ u < 95; Abnormal: 95 ≤ u < 100 | Percentage |
    | host_system_environment_check | Checks important configuration items of the system environment, such as the /etc/hostname and /etc/resolv.conf files, the Java version, and the Python version. | No threshold is specified. The current node is considered abnormal if one of the configuration items is detected as abnormal. | - |
    | host_application_environment_check | Checks the configuration items of the execution environment of each application that is installed on the current node, such as the installation package version, symbolic links, and log directory. | No threshold is specified. The current node is considered abnormal if one of the configuration items is detected as abnormal. | - |
    | host_user_permission_check | Checks the permissions of important users, such as the hadoop user and the hdfs user. | No threshold is specified. The current node is considered abnormal if the permissions of one of the important users are detected as abnormal. | - |
    | host_fault_compensation_check | Checks whether fault compensation occurs. | No threshold is specified. The current node is considered abnormal if fault compensation occurs. | - |
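
To make the threshold rules for the numeric check items concrete, the following minimal sketch maps a value u to a state. It is an illustration only and does not reflect how EMR implements the checks: the function names and the `vcpu_cores` parameter are hypothetical, and treating a value of exactly 100 as Abnormal is an assumption, because the documented range stops just below 100.

```python
# (warning_from, abnormal_from) in percent for the percentage-based checks.
PERCENT_BANDS = {
    "host_memory_utilization_check": (85, 95),
    "host_cpu_utilization_check": (85, 95),
    "host_disk_space_check": (90, 95),
}


def classify_percentage(u, warning_from, abnormal_from):
    """Map a percentage value u to Good, Warning, or Abnormal."""
    if u < 0 or u > 100:
        raise ValueError(f"percentage out of range: {u}")
    if u < warning_from:
        return "Good"
    if u < abnormal_from:
        return "Warning"
    # Assumption: u == 100 is also Abnormal; the documented band is 95 <= u < 100.
    return "Abnormal"


def classify(check_name, u, vcpu_cores=None):
    """Classify the value u of a numeric check item (hypothetical helper)."""
    if check_name in PERCENT_BANDS:
        return classify_percentage(u, *PERCENT_BANDS[check_name])
    if check_name == "host_network_transmission_check":
        # Packet loss or error rate in percent: only Good and Abnormal exist.
        return "Good" if u < 1 else "Abnormal"
    if check_name == "host_cpu_load5_check":
        if vcpu_cores is None:
            raise ValueError("vcpu_cores is required for host_cpu_load5_check")
        # 5-minute load average: only Good and Warning exist.
        return "Good" if u < vcpu_cores * 1.5 else "Warning"
    raise KeyError(f"no numeric threshold for check item: {check_name}")


if __name__ == "__main__":
    print(classify("host_disk_space_check", 92))                # Warning
    print(classify("host_cpu_load5_check", 6.0, vcpu_cores=4))  # Warning
```

For example, on a node with 4 vCPU cores, the load threshold is 4 × 1.5 = 6, so a 5-minute load average of 5.9 is Good and 6.0 is Warning. Check items without a numeric threshold, such as host_system_environment_check, are intentionally not covered by this sketch.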