Node health status shows whether a node is running as expected. Each node's status is derived from the results of multiple health check items. Nodes in the Warning or Abnormal state have specific check items you can inspect to diagnose and resolve issues.
Prerequisites
Before you begin, ensure that you have:
- An E-MapReduce (EMR) cluster. See Create a cluster.
Limitations
This feature applies only to DataLake, Dataflow, online analytical processing (OLAP), DataServing, and custom clusters.
View the health status of nodes
- Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
- In the top navigation bar, select the region where your cluster resides and select a resource group.
- On the EMR on ECS page, find your cluster and click Nodes in the Actions column.
- On the Nodes tab, check the Health Status column for each node group. The column shows colored numbers that indicate how many nodes are in each state:

  | Color | State | Meaning | Recommended action |
  |---|---|---|---|
  | Green | Good | The node is running as expected. | No action needed. |
  | Yellow | Warning | The node is running, but hidden risks are detected. | Click View Check Items to review the check items and monitor the node. |
  | Red | Abnormal | The node is unavailable due to serious issues. | Click View Check Items and troubleshoot immediately. |
  | Gray | Unknown | Health check results cannot be retrieved. | No action needed if your workloads are unaffected. |
  | Gray | Stateless | No health check has run since the node was installed or manually stopped. | No action needed. |

- To see the health status of individual nodes within a group, click the icon to the left of the node group name. Each node's status appears in the Health Status column.
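
The per-group numbers in the Health Status column are counts of nodes per state. As a minimal sketch of that tally, the snippet below counts states for a handful of nodes and looks up the recommended action for each state. The node names and the `summarize` helper are hypothetical and exist only for illustration; the snippet does not call any EMR API.

```python
from collections import Counter

# Recommended actions, copied from the state table above.
RECOMMENDED_ACTION = {
    "Good": "No action needed.",
    "Warning": "Click View Check Items to review the check items and monitor the node.",
    "Abnormal": "Click View Check Items and troubleshoot immediately.",
    "Unknown": "No action needed if your workloads are unaffected.",
    "Stateless": "No action needed.",
}


def summarize(node_states: dict[str, str]) -> Counter:
    """Count how many nodes are in each health state."""
    return Counter(node_states.values())


# Hypothetical node names and states, for illustration only.
nodes = {"core-1": "Good", "core-2": "Warning", "core-3": "Good"}
counts = summarize(nodes)

for state, n in counts.items():
    print(f"{state}: {n} node(s) -> {RECOMMENDED_ACTION[state]}")
```

A `Counter` returns 0 for states that no node is in, so a group with no Abnormal nodes needs no special handling.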
View health check items of a node
When a node is in the Warning or Abnormal state, inspect its check items to identify what triggered the status.
- On the Nodes tab, click the icon to the left of the node group name.
- Find the node, then click View Check Items to the right of its health status.
- In the panel that appears, review the latest check item results and the health check history for that node.
Health check items reference
In the tables below, u denotes the measured value of a check item, which is compared against the listed thresholds. Items with no threshold perform a binary pass/fail check.
Items that require immediate action when Abnormal:
| Name | Description | Unit |
|---|---|---|
| status_alive | Whether the node status is normal | — |
| host_disk_fault | Whether a disk exception exists at the underlying layer | — |
| host_system_fault | Whether a system exception exists at the underlying layer | — |
| host_system_env | Availability of important configuration files, Java, and Python | — |
| host_service_env | Availability of storage directories and package files that cluster services depend on | — |
Items with Warning and Abnormal thresholds:
| Name | Description | Warning | Abnormal | Unit |
|---|---|---|---|---|
| host_cpu_usage | CPU load | 95 ≤ u < 99 | u ≥ 99 | % |
| host_mem_usage | Memory usage | 95 ≤ u < 99 | u ≥ 99 | % |
| host_disk_space_usage | Disk usage | 90 ≤ u < 99 | u ≥ 99 | % |
| host_disk_inode_usage | Index node (inode) usage of disks | 90 ≤ u < 99 | u ≥ 99 | % |
| host_disk_io_latency | Average disk read/write latency | 400 ≤ u < 800 | u ≥ 800 | ms |
| host_fd_usage | File descriptor usage | 95 ≤ u < 99 | u ≥ 99 | % |
| host_network_transmit_drop_rate | Outbound packet loss rate | 1.0 ≤ u < 2.5 | u ≥ 2.5 | % |
| host_network_receive_error_rate | Inbound packet error rate | 0.1 ≤ u < 0.5 | u ≥ 0.5 | % |
| host_network_transmit_error_rate | Outbound packet error rate | 0.1 ≤ u < 0.5 | u ≥ 0.5 | % |
| host_network_receive_drop_rate | Inbound packet loss rate | 1.0 ≤ u < 2.5 | u ≥ 2.5 | % |
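
To make the threshold ranges concrete, here is a minimal sketch that classifies a measured value u for a few of the items above. The `classify` function and the `THRESHOLDS` mapping are hypothetical names, not part of EMR, and treating any value below the Warning range as Good is an assumption drawn from the state table rather than something the thresholds state explicitly.

```python
# A few rows copied from the threshold table: name -> (warning_min, abnormal_min).
# Units follow the table (% for usage and rates, ms for latency).
THRESHOLDS = {
    "host_cpu_usage": (95, 99),
    "host_disk_space_usage": (90, 99),
    "host_disk_io_latency": (400, 800),
    "host_network_transmit_drop_rate": (1.0, 2.5),
}


def classify(item: str, u: float) -> str:
    """Map a measured value u to a state using the table's ranges."""
    warning_min, abnormal_min = THRESHOLDS[item]
    if u >= abnormal_min:      # u >= Abnormal threshold
        return "Abnormal"
    if u >= warning_min:       # Warning threshold <= u < Abnormal threshold
        return "Warning"
    return "Good"              # assumption: below the Warning range means Good


print(classify("host_cpu_usage", 97))
print(classify("host_disk_io_latency", 800))
print(classify("host_disk_space_usage", 89.9))
```

Checking the Abnormal bound first keeps the ranges half-open in the same way as the table: a value exactly at the Abnormal threshold is Abnormal, not Warning.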