E-MapReduce: FAQ about EMR Doctor

Last Updated: May 23, 2023

This topic provides answers to some frequently asked questions about E-MapReduce (EMR) Doctor.

Which types of clusters support EMR Doctor (health check feature in the EMR console)?

Only DataLake and Hadoop clusters support the health check feature. After you create an EMR cluster, click the name of the cluster, and then click the Health Check tab on the page that appears to use the health check feature.

If you create an EMR Hadoop cluster, you must activate EMR Doctor before you can use the health check feature in the cluster. For more information, see Activate EMR Doctor (Hadoop clusters).

Does installing or upgrading EMR Doctor affect the services in an EMR cluster or the jobs that run on the cluster?

During the installation or upgrade of EMR Doctor, no services in the EMR cluster are restarted, and existing jobs that run on the cluster are not affected. After EMR Doctor is installed, the parameters that it requires are automatically configured for the cluster. You do not need to perform any manual configuration.

During the installation or upgrade of EMR Doctor, EMR delivers the configurations of services such as YARN, Spark, Tez, and Hive to the cluster. Before you install or upgrade EMR Doctor, we recommend that you check whether any service configurations have been modified and saved but not yet delivered, and evaluate the impact of delivering those configurations to the cluster.

What type of data does EMR Doctor collect?

EMR Doctor does not collect your actual data or scan your actual files or file content.

EMR Doctor collects only necessary event data, such as the start time, end time, metrics, and counters of a job.

Am I charged for EMR Doctor?

EMR Doctor is free of charge.

What are the impacts of job collection on job execution?

The storage metadata collection feature of EMR Doctor dynamically adjusts the amount of resources that it uses for collection based on the amount of resources that you have. This prevents collection from occupying excess resources.

The job collection feature of EMR Doctor is based on Java probe technology. The feature does not start a separate Java process for monitoring. Job data is collected asynchronously, so collection does not block the main process of a job. If the collection load is heavy, collected data is automatically discarded. You can also adjust the collection frequency by configuring parameters.
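The non-blocking, discard-under-load behavior described above can be illustrated with a minimal sketch. This is not EMR Doctor's implementation: the class name `AsyncCollector`, its methods, and the `capacity` parameter are hypothetical and chosen only for illustration. The sketch buffers events in a bounded queue, drops new events when the queue is full instead of making the caller wait, and drains the queue on a background thread.

```python
import queue
import threading


class AsyncCollector:
    """Hypothetical collector: bounded buffer, drop-on-full, background drain."""

    def __init__(self, capacity=1000):
        # A bounded queue caps the memory that collection can occupy.
        self._events = queue.Queue(maxsize=capacity)
        self.dropped = 0      # number of events discarded under load
        self.delivered = []   # stand-in for sending events to a backend
        self._worker = threading.Thread(target=self._drain, daemon=True)

    def start(self):
        """Start the background thread that drains the buffer."""
        self._worker.start()

    def record(self, event):
        """Called from the job's thread; returns immediately and never blocks."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            # Under heavy load the event is discarded rather than delaying the job.
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            event = self._events.get()
            if event is None:  # sentinel: stop draining
                break
            self.delivered.append(event)

    def close(self):
        """Flush remaining events and stop the worker. Call start() first."""
        self._events.put(None)
        self._worker.join()
```

In this sketch, a smaller `capacity` means that events are discarded sooner under load, which trades report completeness for a lower footprint on the job.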

The following table lists the results of some TPC-DS tests.

SQL and engine         Collection duration when EMR Doctor     Collection duration when EMR Doctor
                       is used (average over 10 rounds)        is not used (average over 10 rounds)
query7 (Spark)         21.0s                                   21.2s
query71 (Tez)          50.8s                                   49.8s
query19 (MapReduce)    68.6s                                   68.2s

Note

In this example, a test based on the TPC-DS benchmark is performed, but the test does not meet all requirements of a TPC-DS benchmark test. As a result, the test results may not match the published results of the TPC-DS benchmark test.
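To put the table values in perspective, the relative difference between the two columns can be computed as shown in the following sketch. The durations are copied from the table. The function name `relative_difference_percent` is hypothetical, and the calculation is plain arithmetic rather than an official overhead metric.

```python
# Durations in seconds from the table: (EMR Doctor used, EMR Doctor not used).
RESULTS = {
    "query7 (Spark)": (21.0, 21.2),
    "query71 (Tez)": (50.8, 49.8),
    "query19 (MapReduce)": (68.6, 68.2),
}


def relative_difference_percent(with_doctor, without_doctor):
    """Difference relative to the run without EMR Doctor, as a percentage."""
    return (with_doctor - without_doctor) / without_doctor * 100.0


for name, (with_doctor, without_doctor) in RESULTS.items():
    diff = relative_difference_percent(with_doctor, without_doctor)
    print(f"{name}: {diff:+.2f}%")
```

For these three queries, the difference is roughly -0.94%, +2.01%, and +0.59%. The negative value for query7 suggests that differences of this size fall within normal run-to-run variation.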

When can I see the collection report?

After EMR Doctor is installed or upgraded in an EMR cluster, the daily cluster report feature performs an analysis based on the jobs that users run and on whether the storage metadata collection feature is enabled. Reports can be generated only if the EMR cluster contains jobs.

  • Computing jobs: After data about the computing jobs in an EMR cluster is collected, you can view the latest reports for the jobs on the next day. The reports provide an overall evaluation of the cluster based on the execution status of the jobs in the cluster.

  • Storage analysis: The Collect Information About Storage Resources feature of EMR Doctor is disabled by default. You can manually enable this feature. After you enable the feature, the related information is collected at 10:00 on the current day. The collected data is analyzed in the morning of the next day, and reports are generated based on the analysis results. If data is collected in the afternoon of the current day, you can view reports on the day after the next day.
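The timing rules in the two items above can be summarized in a short sketch. The function names are hypothetical. The sketch assumes that 12:00 separates morning from afternoon, which the source does not state. It returns the earliest calendar date on which a report can be viewed, not an exact time.

```python
from datetime import date, datetime, time, timedelta

# Assumption: noon is the boundary between "morning" and "afternoon".
NOON = time(12, 0)


def compute_report_date(collected_on: date) -> date:
    """Computing jobs: reports can be viewed on the day after collection."""
    return collected_on + timedelta(days=1)


def storage_report_date(collected_at: datetime) -> date:
    """Storage analysis: morning collection -> next day,
    afternoon collection -> the day after the next day."""
    days = 1 if collected_at.time() < NOON else 2
    return collected_at.date() + timedelta(days=days)
```

For example, under these assumptions, storage data collected at 10:00 on May 23 yields a report on May 24, whereas data collected at 15:00 on May 23 yields a report on May 25.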

Does EMR Doctor provide specific values for parameters?

The optimization suggestions that EMR Doctor provides are directional. For example, EMR Doctor may recommend that you decrease the amount of memory or modify the garbage collection parameters, but it does not provide specific parameter values. To avoid affecting your programs, EMR Doctor collects job data by recording and sampling. You need to adjust parameters based on the suggestions and check whether the resulting configuration is suitable.