You can use the health check feature of an E-MapReduce (EMR) cluster to learn the health status of the cluster and resolve issues in the cluster based on suggestions. This can help ensure that the cluster remains in a healthy state.

Precautions

You must activate EMR Doctor before you can use health check in the EMR console. For information about how to activate EMR Doctor, see Activate EMR Doctor (Hadoop clusters).

View daily cluster reports

  1. Go to the Basic Information tab.
    1. Log on to the EMR on ECS console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. On the EMR on ECS page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.
  2. On the page that appears, click the Health Check tab.
  3. On the Health Check tab, view the list of health check reports of the cluster.
    In the Daily Cluster Report section, the Health Status column displays the health status of the cluster.
    The following table describes the health status that corresponds to each score range.
    Score range | Description
    0 <= x <= 60 | The cluster is in an unhealthy state. Resolve issues in the cluster at the earliest opportunity.
    60 < x <= 80 | The cluster is in a sub-healthy state. We recommend that you optimize the cluster.
    80 < x <= 100 | The cluster is in a healthy state and no issues need to be resolved.
    Note: The score indicates the health status of the cluster. Valid values range from 0 to 100.
  4. View the details of a daily cluster report.
    Click View Report in the Actions column of a report to view the details of the report for the cluster.

    This page displays an overview of the cluster health status and basic information about the report, such as the health score, cluster ID, report ID, and diagnosis time. The diagnostic items and their analysis overview vary based on the cluster type. The analysis overview summarizes the issues identified in the cluster. For the analysis results of a specific issue, see the details of the corresponding diagnostic item.
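
The score ranges in the table of step 3 can be expressed as a small lookup. The following Python sketch is illustrative only: the function name health_status and the returned labels are assumptions made for this example and are not part of EMR Doctor.

```python
def health_status(score: float) -> str:
    """Map a health score (0 to 100) to the status listed in the score range table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 60:
        return "unhealthy"      # 0 <= x <= 60
    if score <= 80:
        return "sub-healthy"    # 60 < x <= 80
    return "healthy"            # 80 < x <= 100


for s in (60, 60.5, 80, 100):
    print(s, health_status(s))
```

Note that the boundaries are inclusive on the upper end of each range, so a score of exactly 60 is still unhealthy and a score of exactly 80 is still sub-healthy.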

Analysis of computing resources

Analysis details

This tab displays analysis details of computing resources. You can learn the basic information about the computing resource usage of a cluster, such as the computing score, the number of scanned jobs, and the health status distribution of jobs. This tab also displays the identified issues such as low memory usage. You can check the information about the job in which the issue is identified to resolve the issue.

Basic computing information

This section displays the trend charts of cluster computing scores, cluster memory consumed by different types of engines (GB*Sec), and cluster vCPUs consumed by different types of engines (VCore*Sec).

The following table provides information about cluster memory and cluster vCPUs.
Metric | Description
Cluster Memory (GB*Sec) | The total cluster memory that is consumed by jobs in the cluster. It is an accumulated value and is calculated by using the following formula: Memory allocated to the jobs in the cluster in GB × Number of seconds that the execution of the jobs lasts.
Cluster vCPU (VCore*Sec) | The total number of cluster vCPUs that are consumed by jobs in the cluster. It is an accumulated value and is calculated by using the following formula: Number of vCPUs allocated to the jobs in the cluster × Number of seconds that the execution of the jobs lasts.
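
As a worked illustration of the two formulas, the following Python sketch accumulates both metrics over a list of jobs. It assumes that each job holds a fixed allocation for its whole run time, which is a simplification. The JobRun class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    memory_gb: float     # memory allocated to the job, in GB (assumed constant)
    vcores: int          # vCPUs allocated to the job (assumed constant)
    duration_sec: float  # how long the job ran, in seconds


def cluster_memory_gb_sec(jobs):
    """Accumulated memory consumption: sum of GB allocated x seconds run."""
    return sum(j.memory_gb * j.duration_sec for j in jobs)


def cluster_vcore_sec(jobs):
    """Accumulated vCPU consumption: sum of vCPUs allocated x seconds run."""
    return sum(j.vcores * j.duration_sec for j in jobs)


jobs = [JobRun(memory_gb=8, vcores=4, duration_sec=600),
        JobRun(memory_gb=16, vcores=8, duration_sec=120)]
print(cluster_memory_gb_sec(jobs))  # 8*600 + 16*120 = 6720 GB*Sec
print(cluster_vcore_sec(jobs))      # 4*600 + 8*120 = 3360 VCore*Sec
```

Because both values are products of an allocation and a duration, a short job with a large allocation can consume as much as a long job with a small one.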

Computing information analysis

This section displays the following charts:
  • Trend chart of compute engine scores
  • Trend chart of the number of compute engine jobs
  • Pie chart for memory consumed by different types of engines
  • Pie chart for vCPUs consumed by different types of engines
  • Pie chart for memory consumed by jobs that are submitted by different users

Job information

EMR Doctor collects, processes, and analyzes jobs, and then displays the key jobs that affect cluster execution based on the analysis results. Resolving the issues that are identified in these jobs can improve the computing efficiency of the cluster and increase its utilization rate and cost-effectiveness.

This section displays the top 50 jobs that consume the most memory (GB*Sec) and the top 50 jobs sorted by scores in ascending order. The following table describes the information in each data record.

Parameter | Description
Job Name | The name of the job.
Engine Type | The type of the compute engine. Compute engines include MapReduce, Tez, and Spark.
SQL Statement | The SQL statement of the job. This parameter applies only to SQL-type jobs.
APP IDS | The application IDs of the job. For Hive on MapReduce jobs, an SQL statement may correspond to multiple application IDs.
Username | The user who submitted the job.
Score | The score of the job.
Health Status | Indicates whether the job is marked for governance.
Suggestion | The optimization suggestion for the job.
Memory (GB*Sec) | The total cluster memory consumed by the job.
Memory Usage | The average memory usage of the job.
CPU (vCore*Sec) | The total cluster vCPUs consumed by the job.
CPU Utilization | The average CPU utilization of the job.
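
The two rankings in this section amount to two different sorts over the same job records. The following Python sketch shows the idea on a reduced record that keeps only a few of the fields from the table. The JobRecord type, the function names, and the sample jobs are hypothetical and do not reflect how EMR Doctor stores or exposes this data.

```python
from typing import NamedTuple


class JobRecord(NamedTuple):
    job_name: str
    engine_type: str      # for example MapReduce, Tez, or Spark
    score: float
    memory_gb_sec: float  # total cluster memory consumed by the job


def top_by_memory(jobs, n=50):
    """Jobs that consume the most memory, largest first."""
    return sorted(jobs, key=lambda j: j.memory_gb_sec, reverse=True)[:n]


def lowest_scored(jobs, n=50):
    """Jobs sorted by score in ascending order, worst first."""
    return sorted(jobs, key=lambda j: j.score)[:n]


jobs = [
    JobRecord("etl_daily", "Spark", 92.0, 150000.0),
    JobRecord("report_hourly", "Tez", 55.0, 42000.0),
    JobRecord("adhoc_query", "MapReduce", 71.5, 98000.0),
]
print([j.job_name for j in top_by_memory(jobs, n=2)])
print([j.job_name for j in lowest_scored(jobs, n=2)])
```

A job that appears near the top of both lists consumes a lot of memory and scores poorly, which makes it a natural first candidate for optimization.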

Analysis of HDFS storage resources

By default, EMR Doctor does not collect information about storage resources. To analyze Hadoop Distributed File System (HDFS) or Hive storage resources, turn on Collect Information About Storage Resources on the Daily Cluster Report tab of the Health Check tab, or change the collection settings for storage resources as described in the Configuration topic.

Analysis details

This tab displays analysis details of HDFS storage resources. The analysis details describe basic information about the cluster resources, such as the total number of files and the total volume of stored data. This tab also displays the identified issues such as high proportion of small files and high proportion of stored cold data. In the issue details section, you can view the directory in which a specific issue is identified and the method to resolve the issue.

Basic HDFS information

In the Basic HDFS Information section, you can view the following information in charts:
  • Trend chart of the volume of stored data
  • Trend chart of the number of files
  • Trend chart of HDFS storage scores
  • Total number of files, total volume of stored data, number of small files, number of very small files, and volume of stored cold data

HDFS usage analysis

In the HDFS Usage Analysis section, you can view the following information in charts:
  • Pie chart for storage resources consumed by different HDFS users
  • Pie chart for the number of files used by different HDFS users
  • Pie chart for storage resources consumed by different HDFS groups
  • Pie chart for the number of files used by different HDFS groups
  • Pie chart for the distribution of HDFS files of different sizes
  • Pie chart for the distribution of cold data and hot data in HDFS
  • Distribution of data stored in level-1 HDFS directories

Distribution of files of different sizes stored in HDFS directories

Small files in HDFS put pressure on the NameNode and can cause shard issues, so the number of small files is an important metric. In the Directory File Size Distribution section, you can view the distribution of empty files, very small files, small files, medium files, and large files at each directory level. You can drill down through up to four levels of directories.

The following table describes the file definitions.
File type | Description
Empty file | Files whose size is 0.
Very small file | Files whose size is less than 1 MB.
Small file | Files whose size is less than 128 MB.
Medium file | Files whose size is greater than or equal to 128 MB and is less than or equal to 1 GB.
Large file | Files whose size is greater than 1 GB.
The Directory File Size Distribution section displays the following information:
  • Top N directories at a specific level that store the maximum number of empty files
  • Top N directories at a specific level that store the maximum number of very small files
  • Top N directories at a specific level that store the maximum number of small files
  • Top N directories at a specific level that store the maximum number of medium files
  • Top N directories at a specific level that store the maximum number of large files

Each table displays the information about the top N directories, such as the specific path, volume of stored data, day-to-day comparison, and daily increment.
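
To make the file categories and the per-level ranking concrete, the following Python sketch classifies files by size and counts the files of one category under each directory at a chosen level. It rests on several assumptions that the tables above do not state: the categories do not overlap (a very small file is larger than 0, and a small file is at least 1 MB), 1 MB is 1024 × 1024 bytes, and a directory's count includes files in its subdirectories. All function names and paths are hypothetical.

```python
from collections import Counter

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def file_type(size_bytes: int) -> str:
    """Classify a file by size, treating the categories as non-overlapping."""
    if size_bytes < 0:
        raise ValueError("size must not be negative")
    if size_bytes == 0:
        return "empty"
    if size_bytes < 1 * MB:
        return "very small"
    if size_bytes < 128 * MB:
        return "small"
    if size_bytes <= 1 * GB:
        return "medium"
    return "large"


def directory_at_level(path: str, level: int) -> str:
    """Return the ancestor directory of a file path at the given depth."""
    dirs = [p for p in path.split("/") if p][:-1]  # drop the file name
    return "/" + "/".join(dirs[:level])


def top_directories(files, category, level, n):
    """Top n directories at a level by number of files in one category."""
    counts = Counter(directory_at_level(path, level)
                     for path, size in files if file_type(size) == category)
    return counts.most_common(n)


files = [
    ("/warehouse/db1/t1/part-0", 0),
    ("/warehouse/db1/t1/part-1", 512 * 1024),
    ("/warehouse/db1/t2/part-0", 10 * MB),
    ("/warehouse/db2/t1/part-0", 200 * MB),
    ("/tmp/logs/a.log", 2 * GB),
    ("/tmp/logs/b.log", 100),
    ("/tmp/logs/c.log", 200),
]
print(top_directories(files, "very small", level=2, n=2))
```

Changing the level argument from 1 to 4 corresponds to drilling down through the four directory levels mentioned above.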

Distribution of cold data and hot data in directories

Cold data is data that is not accessed for a long period of time. We recommend that you store cold data in cold standby storage mode, such as the Cold Archive storage class in Object Storage Service (OSS). The distribution of cold data and hot data in directories can help you understand cluster usage and reduce costs. In the Directory Cold Data and Hot Data Distribution section, you can view the distribution of very cold data, cold data, warm data, and hot data at each directory level. You can drill down through up to four levels of directories.

The following table describes the data types.
Data type | Description
Very cold data | Data that has not been accessed for more than three months.
Cold data | Data that has not been accessed for more than one month but was accessed within the last three months.
Warm data | Data that has not been accessed for more than seven days but was accessed within the last month.
Hot data | Data that was accessed within the last seven days.
The Directory Cold Data and Hot Data Distribution section displays the following information:
  • Top N directories at a specific level that store the maximum volume of very cold data
  • Top N directories at a specific level that store the maximum volume of cold data
  • Top N directories at a specific level that store the maximum volume of warm data
  • Top N directories at a specific level that store the maximum volume of hot data

Each table displays the information about the top N directories, such as the specific path, volume of stored data, day-to-day comparison, and daily increment.
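
The data types above can be read as thresholds on the time since the last access. The following Python sketch applies them, assuming that one month is 30 days and three months are 90 days. The table does not define the exact day counts, so these values and the function name data_temperature are assumptions for illustration.

```python
from datetime import datetime, timedelta


def data_temperature(last_access: datetime, now: datetime) -> str:
    """Classify data by how long ago it was last accessed."""
    idle = now - last_access
    if idle > timedelta(days=90):   # assumption: three months = 90 days
        return "very cold"
    if idle > timedelta(days=30):   # assumption: one month = 30 days
        return "cold"
    if idle > timedelta(days=7):
        return "warm"
    return "hot"


now = datetime(2024, 1, 31)
for last_access in (datetime(2024, 1, 30), datetime(2024, 1, 20),
                    datetime(2023, 12, 15), datetime(2023, 10, 1)):
    print(last_access.date(), data_temperature(last_access, now))
```

Data that lands in the very cold category is the most likely candidate for a lower-cost storage class.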

Analysis of HBase storage resources

Analysis details

This tab displays analysis details of HBase storage resources. The analysis details describe basic information about HBase usage, such as the average cluster load, cluster partition balancing degree, and the health status of RegionServers and user tables. This tab also displays the identified issues such as high average cluster load, low cluster partition balancing degree, and abnormal health status of RegionServers and user tables. In the issue details section, you can view the information such as the RegionServer, table, or partition in which a specific issue is identified and the method to resolve the issue.

Cluster overview analysis

In the Cluster Overview section, you can view the following information in charts:
  • Trend chart of cluster health scores
  • Trend chart of cluster partition balancing degrees
  • Pie chart for the number of partitions in the cluster for different RegionServers
  • Trend chart of the number of cluster requests
  • Total number of tables, total number of partitions, total number of nodes, average load, total volume of data, total number of read requests, total number of write requests, and total number of requests

RegionServer-related information

The RegionServer Related Information section displays detailed information such as the cache hit ratio, average GC duration, and number of daily read/write requests of a RegionServer.

  • Ranking of RegionServers sorted by the cache hit ratio in ascending order (table headers: RegionServer and Cache Hit Ratio)
  • Ranking of RegionServers sorted by the average GC duration (table headers: RegionServer and Average GC Duration)
  • Ranking of RegionServers sorted by the number of daily read requests (table headers: RegionServer and Number of Daily Read Requests)
  • Ranking of RegionServers sorted by the day-to-day daily read request increment (table headers: RegionServer and Day-to-Day Daily Read Request Increment)
  • Ranking of RegionServers sorted by the number of daily write requests (table headers: RegionServer and Number of Daily Write Requests)
  • Ranking of RegionServers sorted by the day-to-day daily write request increment (table headers: RegionServer and Day-to-Day Daily Write Request Increment)

Table-related information

The Table Related Information section displays detailed information such as the hot partitions in a table, volume of data in a table, number of partitions in a table, and number of read/write requests in a table.
  • Details of tables that contain hot partitions
  • Top N tables sorted by the partition balancing degree in ascending order
  • Top N tables sorted by the average data volume in partitions in ascending order
  • Top N tables sorted by the volume of stored data
  • Top N tables sorted by the day-to-day data storage increment
  • Top N tables sorted by the number of partitions
  • Top N tables sorted by the day-to-day partition increment
  • Top N tables sorted by the number of read requests
  • Top N tables sorted by the day-to-day read request increment
  • Top N tables sorted by the number of write requests
  • Top N tables sorted by the day-to-day write request increment

Analysis of Hive storage resources

Analysis details

This tab displays analysis details of Hive storage resources. The analysis details describe basic information about Hive usage, such as the total number of Hive databases, total number of Hive tables, total number of files in Hive tables, and total volume of data stored in Hive. This tab also displays the identified issues such as high proportion of small files, high proportion of stored cold data, and uneven distribution of storage formats. In the issue details section, you can view the database or table in which a specific issue is identified and the method to resolve the issue.

Basic Hive information

This section displays multiple common storage metrics for the usage of Hive storage resources, including the storage resource usage trend, file quantity trend, and score trend.

Hive usage analysis

In the Hive Usage Analysis section, you can view the following information in charts:
  • Distribution chart for consumed storage resources in different Hive databases
  • Distribution chart for total volume of data stored by different Hive users
  • Pie chart for the distribution of files of different sizes in Hive tables
  • Pie chart for the distribution of cold data and hot data in Hive tables
  • Pie chart for the distribution of storage formats of Hive tables

Hive details

The Hive Information section displays details of Hive databases and Hive tables.

Hive database information

The Hive Database Information section displays the following information:
  • Hive database details
  • Top N Hive databases sorted by the distribution of files of different sizes
  • Top N Hive databases sorted by the distribution of cold data and hot data
  • Top N Hive databases sorted by distribution of storage formats
The Hive Database Details section displays the following data:
  • Ranking of Hive databases sorted by storage resource consumption: name, consumed storage resources, day-to-day comparison, and daily increment
  • Ranking of Hive databases sorted by the number of files: name, number of files, day-to-day comparison, and daily increment
  • Score ranking: scores of the Hive databases
  • Ranking of Hive databases sorted by the number of partitions: name, number of partitions, day-to-day comparison, and daily increment
You can obtain the following information based on the top N Hive databases sorted by the distribution of files of different sizes:
  • Top N Hive databases that store the maximum number of empty files
  • Top N Hive databases that store the maximum number of very small files
  • Top N Hive databases that store the maximum number of small files
  • Top N Hive databases that store the maximum number of medium files
  • Top N Hive databases that store the maximum number of large files
Note: Small files in Hive put pressure on the NameNode and can cause shard issues. A large number of small files may also slow down the computing process. The number of small files in Hive is therefore an important metric.
You can obtain the following information based on the top N Hive databases sorted by the distribution of cold data and hot data:
  • Top N Hive databases that store the maximum volume of very cold data
  • Top N Hive databases that store the maximum volume of cold data
  • Top N Hive databases that store the maximum volume of warm data
  • Top N Hive databases that store the maximum volume of hot data
Note: Cold data is data that is not accessed for a long period of time. We recommend that you store cold data in cold standby storage mode, such as the Cold Archive storage class in OSS. The distribution of cold data and hot data can help you understand cluster usage and reduce costs.

Hive supports different storage formats, and each format is suitable for different scenarios. In most cases, mainstream columnar formats such as ORC and Parquet reduce storage costs and improve query efficiency.

You can obtain the following information based on the top N Hive databases sorted by the distribution of storage formats:
  • Top N Hive databases that store the maximum volume of TextFile-formatted data
  • Top N Hive databases that store the maximum volume of Parquet-formatted data
  • Top N Hive databases that store the maximum volume of ORC-formatted data
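
A ranking of this kind can be reproduced by summing table sizes per database for one storage format and sorting the totals. The following Python sketch shows the aggregation on made-up input. The tuple layout, the function name, and the sample databases are hypothetical and are not an EMR Doctor interface.

```python
from collections import defaultdict


def top_databases_by_format(tables, storage_format, n):
    """Top n databases by volume of data stored in one format.

    `tables` is an iterable of (database, format, size_in_bytes) tuples.
    """
    volume = defaultdict(int)
    for database, fmt, size_bytes in tables:
        if fmt == storage_format:
            volume[database] += size_bytes
    return sorted(volume.items(), key=lambda kv: kv[1], reverse=True)[:n]


tables = [
    ("sales", "TextFile", 500),
    ("sales", "ORC", 200),
    ("logs", "TextFile", 900),
    ("logs", "Parquet", 300),
    ("dim", "TextFile", 50),
]
print(top_databases_by_format(tables, "TextFile", n=2))
```

Databases that rank high for TextFile-formatted data are the ones where a conversion to a columnar format is most likely to pay off.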

Hive table information

The Hive Table Information section displays the following information:
  • Hive table details
  • Top N Hive tables sorted by the distribution of files of different sizes
  • Top N Hive tables sorted by the distribution of cold data and hot data
  • Top N Hive tables sorted by distribution of storage formats
Note: For more information, see Hive database information.