
E-MapReduce:FAQ

Last Updated:Mar 04, 2024

This topic provides answers to some frequently asked questions about YARN.

What does a stateful cluster restart involve?

A stateful cluster restart involves restarting both ResourceManager and NodeManager. ResourceManager maintains the basic information about and status of applications. NodeManager maintains the information about and status of running containers. ResourceManager and NodeManager constantly synchronize their status to external storage systems such as ZooKeeper, LevelDB, and Hadoop Distributed File System (HDFS), so their status can be automatically reloaded and recovered after a restart. This ensures that the status of applications and containers is automatically recovered after the cluster is upgraded or restarted. In most cases, the upgrade or restart of a cluster is imperceptible to applications and containers.

How do I enable ResourceManager high availability (HA)?

Check or configure the following parameters on the Configure tab of the YARN service page of a cluster in the E-MapReduce (EMR) console:

  • yarn.resourcemanager.ha.enabled: Specifies whether to enable ResourceManager HA. Set the value to true to enable ResourceManager HA. Default value: false.

  • yarn.resourcemanager.ha.automatic-failover.enabled: Specifies whether to enable automatic failover for ResourceManager. Default value: true.

  • yarn.resourcemanager.ha.automatic-failover.embedded: Specifies whether to enable embedded automatic failover for ResourceManager. Default value: true.

  • yarn.resourcemanager.ha.curator-leader-elector.enabled: Specifies whether to use Curator for leader election. Set the value to true to use Curator. Default value: false.

  • yarn.resourcemanager.ha.automatic-failover.zk-base-path: The ZooKeeper path in which information about the leader is stored. Use the default value /yarn-leader-election.
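
After you enable HA, you can verify the status of the ResourceManager processes from the command line. This is a minimal sketch; <rm-address> is a placeholder for the address of the ResourceManager web UI, and the grep filter is only illustrative.

  # Query the HA state of all ResourceManagers; exactly one should report "active".
  yarn rmadmin -getAllServiceState

  # Check the haState and haZooKeeperConnectionState fields through the RESTful API.
  curl -s http://<rm-address>/ws/v1/cluster/info | grep -Eo '"haState":"[A-Z]+"|"haZooKeeperConnectionState":"[A-Z]+"'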

How do I configure a hot update?

Important

You can perform the operations in this section only in Hadoop 3.2.0 or later versions.

  1. Configure key parameters.

    You can check or configure the parameters related to a hot update on the Configure tab of the YARN service page of a cluster in the EMR console:

    • yarn.scheduler.configuration.store.class: The type of the backing store. If you set this parameter to fs, a file system is used as the backing store. Recommended value: fs.

    • yarn.scheduler.configuration.max.version: The maximum number of configuration files that can be stored in the file system. Excess configuration files are automatically deleted when the number of configuration files exceeds this value. Recommended value: 100.

    • yarn.scheduler.configuration.fs.path: The path in which the capacity-scheduler.xml file is stored. If you do not configure this parameter, a storage path is automatically created. If no prefix is specified, the relative path of the default file system is used as the storage path. Recommended value: /yarn/<Cluster name>/scheduler/conf.

    Important

    Replace <Cluster name> with a specific cluster name. Multiple clusters for which the YARN service is deployed may use the same distributed storage.

  2. View configurations of the capacity-scheduler.xml file.

    • Method 1 (RESTful API): Access a URL in the following format: http://<rm-address>/ws/v1/cluster/scheduler-conf.

    • Method 2 (HDFS): Access the configuration path ${yarn.scheduler.configuration.fs.path}/capacity-scheduler.xml.<timestamp> to view configurations of the capacity-scheduler.xml file. <timestamp> indicates the time at which the capacity-scheduler.xml file is generated. The capacity-scheduler.xml file that has the largest timestamp value is the latest configuration file.

  3. Update the configurations.

    For example, you can modify the yarn.scheduler.capacity.maximum-am-resource-percent parameter and delete the yarn.scheduler.capacity.xxx parameter. To delete a parameter, you only need to omit its value field, as shown in the following example:

    curl -X PUT -H "Content-type: application/json" 'http://<rm-address>/ws/v1/cluster/scheduler-conf' -d '
    {
      "global-updates": [
        {
          "entry": [{
            "key":"yarn.scheduler.capacity.maximum-am-resource-percent",
            "value":"0.2"
          },{
            "key":"yarn.scheduler.capacity.xxx"
          }]
        }
      ]
    }'
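
After you update the configurations (step 3), you can verify the result by using the methods from step 2. This is a minimal sketch; <rm-address> is a placeholder, and the HDFS path assumes the yarn.scheduler.configuration.fs.path value recommended above.

  # View the current scheduler configuration through the RESTful API (step 2, method 1).
  curl -s http://<rm-address>/ws/v1/cluster/scheduler-conf

  # List the versioned configuration files in the backing store (step 2, method 2).
  # The file with the largest timestamp suffix is the latest configuration.
  hadoop fs -ls /yarn/<Cluster name>/scheduler/conf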

How do I handle the uneven distribution of resources among applications in a queue?

Important

You can perform the operations in this section only in Hadoop 2.8.0 or later versions.

In most cases, resources in a queue are occupied by large jobs, and small jobs fail to obtain sufficient resources. To ensure an even distribution of resources among jobs, perform the following steps:

  1. Change the value of the yarn.scheduler.capacity.<queue-path>.ordering-policy parameter from the default value fifo to fair for a queue.

    Note

    First in, first out (FIFO) scheduler and fair scheduler are two types of schedulers in YARN.

    You can also modify the parameter yarn.scheduler.capacity.<queue-path>.ordering-policy.fair.enable-size-based-weight. The default value of this parameter is false, which specifies that jobs are sorted by resource usage in ascending order. If you set the parameter to true, jobs are sorted by the quotient of resource usage divided by resource demand in ascending order. A sketch that applies these ordering-policy settings by using the scheduler configuration API from the previous section is shown after these steps.

  2. Enable intra-queue resource preemption.

    The following parameters are used to control intra-queue resource preemption:

    • yarn.resourcemanager.scheduler.monitor.enable: Specifies whether to enable preemption. This parameter is configured on the yarn-site tab. Other parameters related to queue resource preemption are configured on the capacity-scheduler tab. Recommended value: true.

    • yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled: Specifies whether to enable intra-queue resource preemption. Inter-queue resource preemption is enabled by default and cannot be disabled. Recommended value: true.

    • yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.preemption-order-policy: The policy based on which intra-queue resource preemption is performed. Default value: userlimit_first. Recommended value: priority_first.

    • yarn.scheduler.capacity.<queue-path>.disable_preemption: Specifies whether to disable resource preemption for the specified queue. Default value: false. If you set this parameter to true, the resources of the specified queue cannot be preempted. If this parameter is not configured for a child queue, the child queue inherits the configuration from its parent queue. Recommended value: true.

    • yarn.scheduler.capacity.<queue-path>.intra-queue-preemption.disable_preemption: Specifies whether to disable intra-queue resource preemption for the specified queue. Default value: false. If you set this parameter to true, intra-queue resource preemption is disabled for the queue. If this parameter is not configured for a child queue, the child queue inherits the configuration from its parent queue. Recommended value: true.
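
The following sketch shows how the ordering policy from step 1 could be applied by using the scheduler configuration API that is described in the hot update section. It assumes that the hot update backing store is enabled and that the target queue is root.default; replace the queue path and <rm-address> with values from your environment.

  # Hypothetical example: switch the root.default queue to the fair ordering policy.
  curl -X PUT -H "Content-type: application/json" 'http://<rm-address>/ws/v1/cluster/scheduler-conf' -d '
  {
    "global-updates": [
      {
        "entry": [{
          "key":"yarn.scheduler.capacity.root.default.ordering-policy",
          "value":"fair"
        },{
          "key":"yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight",
          "value":"true"
        }]
      }
    ]
  }'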

How do I view the resource usage of a queue?

To view the resource usage of a queue, check the value of the Used Capacity parameter on the web UI of YARN. Used Capacity indicates the percentage of resources used by a queue relative to all resources that are allocated to the queue. The percentage of used memory resources and the percentage of used vCores are calculated separately, and the greater value is displayed as Used Capacity. To view the resource usage of a queue, perform the following operations (a sketch of querying the same information through the RESTful API follows the steps):

  1. Access the web UI of YARN. For more information, see Access the web UIs of open source components.

  2. On the All Applications page, click the ID of a specific job.

  3. Click the desired queue in the Queue row.

    In the Application Queues section, view the resource usage of the queue.

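As a sketch of the same check without the web UI, you can query the scheduler RESTful API and inspect the usedCapacity field of the queue. <rm-address> is a placeholder, and the grep filter is only illustrative.

  # Dump the scheduler state; each queue object contains queueName, capacity, usedCapacity, and related fields.
  curl -s http://<rm-address>/ws/v1/cluster/scheduler | grep -Eo '"queueName":"[^"]+"|"usedCapacity":[0-9.]+'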

A large number of .out log files are generated by YARN service components and cannot be automatically cleared. Why?

  • Cause: Specific dependency libraries of Hadoop components call Java logging APIs to generate log records and do not support Log4j-based log rotation. The standard error (stderr) output of the daemon components is redirected to .out log files. Because no automatic cleanup mechanism is provided, logs can accumulate over time and fill up the data disk.

  • Solution: Use the head and tail commands together with the timestamps in the logs to check whether the logs that are generated by Java logging APIs occupy a large amount of storage space (see the disk-usage sketch after the steps below). In most cases, the dependency libraries generate INFO-level logs, which do not affect the use of components. To prevent the data disks from filling up, you can modify configurations to disable the generation of these INFO-level logs.

    For example, perform the following steps to disable log generation of Jersey dependency libraries of Timeline Server:

    1. Run the following command to monitor .out log files that are relevant to timelineserver- in the log storage path of YARN. In a DataLake cluster, the path is /var/log/emr/yarn/. In a Hadoop cluster, the path is /mnt/disk1/log/hadoop-yarn.

      tail /var/log/emr/yarn/*-hadoop-timelineserver-*.out

      The output contains records generated by the com.sun.jersey package.

    2. On the node on which Timeline Server runs, run the following command as the root user to create a logging configuration file that disables log output from the com.sun.jersey package. This prevents INFO-level logs generated by Jersey dependency libraries from being written to .out log files.

      sudo su root -c "echo 'com.sun.jersey.level = OFF' > $HADOOP_CONF_DIR/off-logging.properties"
    3. On the Configure tab of the YARN service page in the EMR console, find the YARN_TIMELINESERVER_OPTS parameter (corresponding to the yarn_timelineserver_opts parameter in a Hadoop cluster), and add -Djava.util.logging.config.file=off-logging.properties to the value of the parameter to specify the location of the created configuration file.

    4. Save the configuration and restart Timeline Server for the configuration to take effect. If Timeline Server can start as expected, and .out log files no longer contain log information that is relevant to com.sun.jersey, log generation of Jersey dependency libraries is disabled.
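
To check whether .out files are the ones consuming the disk space, a quick sketch is to sort them by size. The path below is for a DataLake cluster; use /mnt/disk1/log/hadoop-yarn for a Hadoop cluster.

  # List the largest .out files in the YARN log directory.
  ls -lhS /var/log/emr/yarn/*.out | head
  # Total size of all .out files in the directory.
  du -ch /var/log/emr/yarn/*.out | tail -n 1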

How do I check whether the ResourceManager service is normal?

You can use one of the following methods to check whether the service is normal:

  • Check the ResourceManager HA status. In an HA cluster, make sure that only one ResourceManager process is in the Active state. The status of ResourceManager HA is determined by the haState field, whose value is ACTIVE or STANDBY, and the haZooKeeperConnectionState field, whose expected value is CONNECTED. You can use one of the following methods to check the values of these fields:

    • Command-line interface (CLI): Run the yarn rmadmin -getAllServiceState command.

    • RESTful API: Access a URL in the http://<rmAddress>/ws/v1/cluster/info format.

  • Check the status of YARN applications.

    Run the following command to check whether applications are stuck in the submitted or accepted state:

    yarn application -list
  • Check whether a newly submitted application can run and finish as expected. Sample command:

    hadoop jar <hadoop_home>/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar sleep -m 1 -mt 1000 -r 0

    You can add -Dmapreduce.job.queuename=<queue name> between sleep and -m to specify a queue. If you do not specify a queue, the default queue is used. The sketch after this list shows the complete command with a queue specified.
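
The following sketch shows the complete test command with a queue specified. The queue name emr-test is hypothetical; replace it and <hadoop_home> with values from your environment.

  # Submit a short sleep job to the specified queue and verify that it finishes.
  hadoop jar <hadoop_home>/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar sleep -Dmapreduce.job.queuename=emr-test -m 1 -mt 1000 -r 0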

How do I obtain the status of an application?

You can view the following information about an application to obtain the status of the application.

Basic information

The basic information about an application includes ID, User, Name, Application Type, State, Queue, App-Priority, StartTime, FinishTime, FinalStatus, Running Containers, Allocated CPU VCores, Allocated Memory MB, and Diagnostics. You can view the basic information about an application on one of the following pages or by using one of the following methods:

  • Applications page. The address of the applications page is in the http://<rmAddress>/cluster/apps format.

  • Details page of an application. You can click the ID of an application on the applications page to go to the details page of the application. The address of the details page of an application is in the http://<rmAddress>/cluster/app/<applicationId> format.

  • Details page of an application attempt. You can click the ID of an application attempt on the details page of the application to go to the details page of the application attempt. The address of the details page of an application attempt is in the http://<rmAddress>/cluster/appattempt/<appAttemptId> format.

  • RESTful API for an application. Access a URL in the http://<rmAddress>/ws/v1/cluster/apps/<applicationId> format.

  • RESTful API for an application attempt. Access a URL in the http://<rmAddress>/ws/v1/cluster/apps/<applicationId>/appattempts format.

Queue information

You can view the queue information about an application on one of the following pages or by using one of the following methods:

  • Schedulers page, which is displayed after a leaf node is expanded. The address of the schedulers page is in the http://<rmAddress>/cluster/scheduler format.

  • RESTful API for a scheduler. Access a URL in the http://<rmAddress>/ws/v1/cluster/scheduler format.

Container logs

  • Running applications

    • You can view the logs of a container on the NodeManager Log page. Address: http://<nmHost>:8042/node/containerlogs/<containerId>/<user>.

    • You can view the logs of a container in a subdirectory: search for the <containerId> subdirectory under the ${yarn.nodemanager.local-dirs} directory on the node where the container runs.

  • Finished applications

    • You can view the logs of a container by running the following command: yarn logs -applicationId <applicationId> -appOwner <user> -containerId <containerId>.

    • You can view the logs of a container by running the following HDFS command: hadoop fs -ls /logs/<user>/logs/<applicationId>.

How do I troubleshoot issues for an application?

  1. Check the status of the application. You can go to the details page of the application or call the RESTful API for the application to view the status of the application. The application may be in one of the following states:

    • The state of the application is unknown. Possible causes:

      • The application fails and exits before the application is submitted to YARN. In this case, view the application submission logs to check whether issues occur on client components, such as BRS and FlowAgent.

      • The application cannot connect to ResourceManager. In this case, check whether the endpoint of ResourceManager is correct and whether the network connection is normal. If the network connection is abnormal, the following error may be reported in the client: com.aliyun.emr.flow.agent.common.exceptions.EmrFlowException: ###[E40001,RESOURCE_MANAGER]: Failed to access to resource manager, cause: The stream is closed.

    • NEW_SAVING: The information about the application is being written to the ZooKeeper state store. If the application is stuck in the NEW_SAVING state, the write to the state store is typically blocked or failing. Check the ResourceManager logs for ZooKeeper-related exceptions, such as the buffer limit issues that are described later in this topic.

    • SUBMITTED: In most cases, an application does not stay in this state. An application may be stuck in the SUBMITTED state if Capacity Scheduler is blocked because excessive node update requests contend for the scheduler lock. This typically occurs in large-scale clusters, and you must optimize the request processing procedure to resolve the issue. For more information, see YARN-9618.

    • ACCEPTED: If the application is in this state, check the diagnostic information. You can select a solution based on the error message that appears:

      • Error message: Queue's AM resource limit exceeded.

        • Possible cause: The sum of used ApplicationMaster (AM) resources and requested AM resources exceeds the upper limit of AM resources in the queue. AM resource usage must satisfy the following inequality: ${Used Application Master Resources} + ${AM Resource Request} < ${Max Application Master Resources}.

        • Solution: Increase the upper limit of AM resources in a queue. For example, set the yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent parameter to 0.5.

      • Error message: User's AM resource limit exceeded.

        • Possible cause: The sum of used AM resources and requested AM resources exceeds the upper limit of AM resources in a queue for a specific user.

        • Solution: Increase the upper limit of AM resources in a queue for a specific user. For example, modify the yarn.scheduler.capacity.<queue-path>.user-limit-factor and yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent parameters.

      • Error message: AM container is launched, waiting for AM container to Register with RM.

        • Possible cause: The AM is started but the AM initialization is not complete. For example, ZooKeeper connection times out.

        • Solution: Further troubleshoot the issue based on the AM logs.

      • Error message: Application is Activated, waiting for resources to be assigned for AM.

        Perform Step 3 to analyze the cause of inadequate AM resources.

    • RUNNING: If the application is in this state, perform Step 2 to check whether the container resource request is complete.

    • FAILED: If the application is in this state, check the diagnostic information. You can select a solution based on the error message that appears:

      • Error message: Maximum system application limit reached, cannot accept submission of application.

        • Possible cause: The number of running applications in the cluster exceeds the upper limit specified by the yarn.scheduler.capacity.maximum-applications parameter. The default value of this parameter is 10000.

        • Solution: View the Java Management Extensions (JMX) metrics to check whether the number of running applications in each queue is as expected. Troubleshoot the applications that are repeatedly submitted. If all applications run properly, you can set the parameter to a greater value based on the cluster usage.

      • Error message: Application XXX submitted by user YYY to unknown queue: ZZZ.

        • Possible cause: The application is submitted to a queue that does not exist.

        • Solution: Submit the application to an existing leaf queue.

      • Error message: Application XXX submitted by user YYY to non-leaf queue: ZZZ.

        • Possible cause: The application is submitted to a parent queue.

        • Solution: Submit the application to an existing leaf queue.

      • Error message: Queue XXX is STOPPED. Cannot accept submission of application: YYY.

        • Possible cause: The application is submitted to a queue that is in the STOPPED or DRAINING state.

        • Solution: Submit the application to a queue in the RUNNING state.

      • Error message: Queue XXX already has YYY applications, cannot accept submission of application: ZZZ.

        • Possible cause: The number of applications in the queue reaches the upper limit.

        • Solution:

          1. Check whether a large number of applications are repeatedly submitted to the queue.

          2. Modify the yarn.scheduler.capacity.<queue-path>.maximum-applications parameter.

      • Error message: Queue XXX already has YYY applications from user ZZZ cannot accept submission of application: AAA.

        • Possible cause: The number of applications in a queue for a user reaches the upper limit.

        • Solution:

          1. Check whether a large number of applications are repeatedly submitted to the queue for the user.

          2. Modify the yarn.scheduler.capacity.<queue-path>.maximum-applications, yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent, and yarn.scheduler.capacity.<queue-path>.user-limit-factor parameters.

  2. Check whether YARN resource allocation is complete.

    1. On the page that displays the list of applications, click the ID of the application to go to the details page of the application.

    2. Click the ID of an application attempt on the details page of the application to go to the details page of the application attempt.

    3. Check whether resources in the Pending state exist in the Total Outstanding Resource Requests list. You can also query pending resources by calling the RESTful API that returns outstanding resource requests.

      • If no resources are in the Pending state, YARN resource allocation is complete. Exit this troubleshooting procedure and check the resource allocation within the AM on the application attempt page.

      • If resources that are in the Pending state exist, the YARN resource allocation is not complete. Proceed to the next step.

  3. Check limits on resources.

    Check resources in a cluster or in a queue. For example, check the values of the Effective Max Resource and Used Resources parameters.

    1. Check whether the following resources are fully consumed: resources in a cluster, resources in a queue, or resources in the parent queue.

    2. Check whether the resources of a dimension in a leaf queue are close to or reach the upper limit.

    3. If the resource usage of a cluster is close to 100%, such as more than 85%, the speed at which resources are allocated to applications may decrease. One reason is that most machines have no free resources: if a machine has no resources to allocate, the machine is reserved, and when the number of reserved machines reaches a certain level, resource allocation slows down. Another possible reason is that memory resources are out of proportion to CPU resources. For example, some machines have idle memory resources but no idle CPU resources, whereas other machines have idle CPU resources but no idle memory resources.

  4. Check whether resources are successfully allocated to a container but the container fails to start. On the App Attempt page of the web UI of YARN, you can view the number of containers to which resources are allocated and the changes on the number of containers over a short period of time. If a container fails to start, troubleshoot the failure based on NodeManager logs or the logs of the container.

  5. Dynamically adjust the log level. On the Log Level page of the web UI of YARN at http://RM_IP:8088/logLevel, change the log level of org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity, which is the package of Capacity Scheduler, to DEBUG. You can also change the log level from the command line, as shown in the sketch after the following note.

    Important

    Adjust the log level only while you reproduce the issue, and change it back to INFO after tens of seconds, because a large number of logs can be generated in a short period of time.
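
As an alternative to the web UI in step 5, the log level can also be changed from the command line with the daemonlog tool. This is a minimal sketch; <rm-address> is a placeholder for the ResourceManager host, and 8088 is the default web UI port.

  # Check the current log level of the Capacity Scheduler package.
  hadoop daemonlog -getlevel <rm-address>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity
  # Raise the level to DEBUG while reproducing the issue, then set it back to INFO.
  hadoop daemonlog -setlevel <rm-address>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity DEBUG
  hadoop daemonlog -setlevel <rm-address>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity INFO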

What parameters determine the maximum number of available resources for a single task or container?

The maximum available resources for a single task or container depend on the following scheduler-level and queue-level parameters:

  • yarn.scheduler.maximum-allocation-mb: The maximum memory resources that can be scheduled at the cluster level. Unit: MB. Default value: the available memory resources of the core or task node group whose memory size is the largest. The memory size of a node group is specified when a cluster is created. This value is the same as the value of the yarn.nodemanager.resource.memory-mb parameter of the node group whose memory size is the largest.

  • yarn.scheduler.maximum-allocation-vcores: The maximum CPU resources that can be scheduled at the cluster level. Unit: vCore. Default value: 32.

  • yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb: The maximum memory resources that can be scheduled for a specified queue. Unit: MB. This parameter is not configured by default. If you configure this parameter, it overrides the cluster-level setting and takes effect only for the specified queue.

  • yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores: The maximum CPU resources that can be scheduled for a specified queue. Unit: vCore. This parameter is not configured by default. If you configure this parameter, it overrides the cluster-level setting and takes effect only for the specified queue.

Note

If the requested resources exceed the maximum available resources for a single task or container, the exception InvalidResourceRequestException: Invalid resource request… is recorded in application logs.
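
One way to check which maximum allocation values are in effect is the configuration endpoint that Hadoop daemons expose. This is a sketch under the assumption that the endpoint is reachable at the ResourceManager web address (<rm-address> is a placeholder); the text filter is only illustrative and may need adjustment for your output format.

  # Dump the effective ResourceManager configuration and search for the maximum allocation settings.
  curl -s "http://<rm-address>/conf?format=json" | tr ',' '\n' | grep -A 1 "maximum-allocation"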

What do I do if the modification of YARN configurations does not take effect?

  • Possible causes:

    • For configurations that require a component restart, the associated component is not restarted after the modification.

    • For hot update configurations, the operation required for the configurations to take effect is not performed.

  • Solution: Make sure that the subsequent operations required for the configurations to take effect are correctly performed.

    • Scheduler configurations (capacity-scheduler.xml and fair-scheduler.xml): These are hot update configurations. Perform the refresh_queues operation on ResourceManager (see the sketch after this list).

    • Configurations of YARN components (yarn-env.sh, yarn-site.xml, mapred-env.sh, and mapred-site.xml): Restart the component that is associated with the configurations. Examples:

      • Restart ResourceManager after you modify the YARN_RESOURCEMANAGER_HEAPSIZE parameter in the yarn-env.sh file or the yarn.resourcemanager.nodes.exclude-path parameter in the yarn-site.xml file.

      • Restart NodeManager after you modify the YARN_NODEMANAGER_HEAPSIZE parameter in the yarn-env.sh file or the yarn.nodemanager.log-dirs parameter in the yarn-site.xml file.

      • Restart MRHistoryServer after you modify the MAPRED_HISTORYSERVER_OPTS parameter in the mapred-env.sh file or the mapreduce.jobhistory.http.policy parameter in the mapred-site.xml file.
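
For scheduler configurations, the refresh can also be triggered from the command line. This is a minimal sketch; run it on a node where the YARN client is configured.

  # Reload capacity-scheduler.xml or fair-scheduler.xml without restarting ResourceManager.
  yarn rmadmin -refreshQueues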

What do I do if the following error message is reported for an exception that occurs on an application client: Exception while invoking getClusterNodes of class ApplicationClientProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over immediately?

  • Problem description: The active ResourceManager cannot be accessed. The ResourceManager logs contain the following exception information: WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.**.**:53144 got version 6 expected version 9.

  • Cause: An early version of Hadoop is used. The version of remote procedure call (RPC) used by the application client is incompatible with the Hadoop version.

  • Solution: Use a Hadoop version whose RPC version is compatible with the application client.

What do I do if ResourceManager cannot automatically switch from the Standby state to the Active state?

You can use one of the following methods to troubleshoot the issue:

  1. Check whether the parameters that are used to enable automatic failover are correctly configured. The parameters must be configured as follows:

    • yarn.resourcemanager.ha.enabled: Set the value to true.

    • yarn.resourcemanager.ha.automatic-failover.enabled: Set the value to true.

    • yarn.resourcemanager.ha.automatic-failover.embedded: Set the value to true.

  2. If the issue persists after you set the preceding parameters to true, use one of the following methods to further troubleshoot the issue:

    • Check whether the ZooKeeper service is normal.

    • Check whether the data that is read by the ZooKeeper client (ResourceManager) exceeds the upper limit of the ResourceManager buffer.

      • Problem description: The ResourceManager logs contain the following exception information: Zookeeper error len*** is out of range! or Unreasonable length = ***.

      • Solution: On the Configure tab of the YARN service page of a cluster in the EMR console, click the yarn-env tab and set the yarn_resourcemanager_opts parameter to -Djute.maxbuffer=4194304. Then, restart ResourceManager.

    • Check whether the data that is written by the ZooKeeper server exceeds the upper limit of the ZooKeeper server buffer.

      • Problem description: The ZooKeeper logs contain the following exception information: Exception causing close of session 0x1000004d5701b6a: Len error ***.

      • Solution: Add or update the -Djute.maxbuffer= setting for each node of the ZooKeeper service to increase the upper limit of the buffer. Unit: bytes.

    • Check whether the ephemeral ZooKeeper node that ResourceManagers use for leader election is held by another session and not released. Perform this check if no exception information is found in the ResourceManager or ZooKeeper logs. Run the stat command on the leader-election node in the ZooKeeper CLI to check whether the node is held by another session and not released (see the sketch after this list). The path of the node is ${yarn.resourcemanager.zk-state-store.parent-path}/${yarn.resourcemanager.cluster-id}/ActiveStandbyElectorLock. The issue may be caused by unknown issues of the default leader election method or by an issue in ZooKeeper.

      We recommend that you modify the leader election method. On the yarn-site tab, you can add a parameter whose name is yarn.resourcemanager.ha.curator-leader-elector.enabled and set the parameter to true. If the parameter is already configured, make sure that the parameter value is true. Then, restart the ResourceManager.
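
The following sketch shows the ZooKeeper CLI check described above. The node path is an example that assumes the default value of yarn.resourcemanager.zk-state-store.parent-path (/rmstore) and a placeholder cluster ID; build the actual path from your configuration and replace <zk-address> with a ZooKeeper server address.

  # Connect to ZooKeeper.
  zkCli.sh -server <zk-address>
  # Inside the ZooKeeper CLI: the ephemeralOwner field of the output shows the session that holds the node.
  stat /rmstore/<cluster-id>/ActiveStandbyElectorLock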

What do I do if an OOM issue occurs in ResourceManager?

Out of memory (OOM) issues can be divided into various types. You need to determine the issue type based on ResourceManager logs. This section describes the types, possible causes, and solutions of OOM issues.

  • Error message: Java heap space, GC overhead limit exceeded, or FullGC (repeatedly reported)

    • Cause:

      • Direct cause: The heap memory of the Java virtual machine (JVM) is insufficient, and the internal objects in the ResourceManager process cannot obtain sufficient memory. One or more rounds of full garbage collection (GC) are performed to reclaim invalid objects before the OOM exception is thrown, but the objects still cannot obtain enough heap memory.

      • Analysis: ResourceManager has many resident objects that cannot be reclaimed by the JVM, including clusters, queues, applications, containers, and nodes. The heap memory occupied by these objects increases as the cluster size increases. Therefore, ResourceManager requires more memory in larger clusters. In addition, the memory occupied by historical application data increases over time. Even a cluster with only one node requires a certain amount of memory for storing historical application data. We recommend that you configure a heap memory of at least 2 GB for ResourceManager.

    • Solution:

      • If the master node has sufficient resources, we recommend that you increase the size of heap memory for ResourceManager based on your business requirements. You can increase the size of heap memory by modifying the YARN_RESOURCEMANAGER_HEAPSIZE parameter in the yarn-env.sh file.

      • For a small-scale cluster, we recommend that you reduce the number of historical applications whose data needs to be stored. You can reduce this number by modifying the yarn.resourcemanager.max-completed-applications parameter in the yarn-site.xml file. The default value of this parameter is 10000.

  • Error message: unable to create new native thread

    • Cause: The number of threads used on the node on which ResourceManager resides reaches the system upper limit, and no new threads can be created.

      The maximum number of threads depends on the maximum number of available threads for a user and the maximum number of available process identifiers (PIDs). You can run the ulimit -a | grep "max user processes" and cat /proc/sys/kernel/pid_max commands to view the maximum number of available threads and PIDs.

    • Solution:

      • If the number of available threads does not meet your business requirements, increase the number to a greater value in system settings. For nodes of small specifications, tens of thousands of threads must be configured. For nodes of large specifications, hundreds of thousands of threads must be configured.

      • If the number of available threads is properly configured, some processes may occupy too many threads. You can further identify such processes.

        Run the ps -eLf | awk '{print $2}' | uniq -c | awk '{print $2"\t"$1}' | sort -nrk2 | head command to view the top 10 processes that occupy the most threads. The number of threads is displayed in the format of [Process ID] [Number of threads].

Why does localization fail when a node starts to run a job, and why are job logs unable to be collected or deleted?

  • Problem description: The NodeManager logs contain the following exception information: java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

  • Cause: The HDFS configurations are incorrect.

  • Solution:

    1. The exception message is a wrapped exception and does not reveal the root cause of the issue. To identify the root cause, you must check the debug-level logs.

      • In the CLI for a Hadoop client, you can run a command such as hadoop fs -ls / to access Hadoop. Then, run the following command to enable debugging:

        export HADOOP_LOGLEVEL=DEBUG
      • In a runtime environment with the Log4j configuration, add log4j.logger.org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider=DEBUG to the end of the Log4j configuration.

    2. In this example, the root cause is that a user changed the NameServices configuration from emr-cluster to hadoop-emr-cluster, but a node that was added during a scale-out still used the original NameServices configuration.

    3. On the Configure tab of the HDFS service page of a cluster in the EMR console, check whether the parameters are correctly configured.

How do I handle a resource localization exception?

  • Problem description:

    • The AM container that is used to process jobs fails to start, and the following error message, which is the diagnostic information about the failed job, is reported:

      Application application_1412960082388_788293 failed 2 times due to AM Container for appattempt_1412960082388_788293_000002 exited with exitCode: -1000 due to: EPERM: Operation not permitted
    • An error is reported when an application resource package is decompressed after it is downloaded for resource localization. You can check the following NodeManager logs:

      INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://hadoopnnvip.cm6:9000/user/heyuan.lhy/apv/small_apv_20141128.tar.gz, 1417144849604, ARCHIVE, null },pending,[(container_1412960082388_788293_01_000001)],14170282104675332,DOWNLOADING}
      EPERM: Operation not permitted
              at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
              at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:226)
              at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:629)
              at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:186)
              at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:235)
              at org.apache.hadoop.fs.FileContext$10.next(FileContext.java:949)
              at org.apache.hadoop.fs.FileContext$10.next(FileContext.java:945)
              at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
              at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:945)
              at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:398)
              at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:412)
              at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:412)
              at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:352)
              at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:744)
  • Cause: The application resource package contains soft links, which result in the resource localization exception.

  • Solution: Delete the soft links that are contained in the application resource package.

What do I do if the error message "No space left on device" is displayed and a container fails to start or run?

Possible causes and solutions:

  • Check whether no space is available on the disk.

  • Check the CGroups configuration for cgroup.clone_children in the /sys/fs/cgroup/cpu/hadoop-yarn/ and /sys/fs/cgroup/cpu/ directories.

    1. If the value of cgroup.clone_children is 0, change the value to 1 by running the echo 1 > /sys/fs/cgroup/cpu/cgroup.clone_children command. Also add the command to the startup items so that the setting persists after a restart.

    2. If the issue persists, check the cpuset.mems and cpuset.cpus files at the same directory level. The values of these files in the hadoop-yarn directory must be the same as the values in the upper-level directory.

  • Check whether the number of subdirectories in the CGroups directory exceeds the upper limit of 65,535. If it does, check the configuration of the yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms or yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms parameter in the YARN configuration file. A sketch of these checks follows this list.
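
A minimal sketch of the checks described above, run as the root user on the affected node; the paths assume the default CGroups hierarchy that YARN uses:

  # The clone_children flag should be 1 at both levels.
  cat /sys/fs/cgroup/cpu/cgroup.clone_children /sys/fs/cgroup/cpu/hadoop-yarn/cgroup.clone_children
  echo 1 > /sys/fs/cgroup/cpu/cgroup.clone_children
  # Count the subdirectories under the YARN CGroups directory (upper limit: 65,535).
  find /sys/fs/cgroup/cpu/hadoop-yarn -mindepth 1 -maxdepth 1 -type d | wc -l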

Domain names fail to be resolved in NodeManager or during the running of jobs. What do I do?

  • Problem description: The following error message is displayed: java.net.UnknownHostException: Invalid host name: local host is: (unknown).

  • Possible causes and solutions:

    • Check whether the Domain Name System (DNS) server is correctly configured.

      Run the following command to check the configuration of the DNS server:

      cat /etc/resolv.conf
    • Check whether firewall rules are configured for port 53.

      If such rules are configured, disable the firewall.

    • Check whether the Name Service Cache Daemon (NSCD) service is enabled for the DNS server.

      Run the following command to check the status of the NSCD service:

      systemctl status nscd

      If the NSCD service is enabled for the DNS server, run the following command to disable the NSCD service:

      systemctl stop nscd

What do I do if an OOM issue occurs in NodeManager?

OOM issues can be divided into various types. You need to determine the issue type based on NodeManager logs. This section describes the types, possible causes, and solutions of OOM issues.

  • Error message: Java heap space, GC overhead limit exceeded, or FullGC (repeatedly reported)

    • Cause:

      • Direct cause: The heap memory of the JVM is insufficient, and the objects in the NodeManager process cannot obtain sufficient memory. One or more rounds of full GC are performed to reclaim invalid objects before the OOM exception is thrown, but the objects still cannot obtain enough heap memory.

      • Analysis: NodeManager has few resident objects, including the current node, applications, and containers, and these objects do not occupy a large amount of heap memory. The cache and buffer of an external shuffle service may occupy a large amount of heap memory. The heap memory occupied by an external shuffle service is determined by the configurations of shuffle services, such as the spark.shuffle.service.index.cache.size or spark.shuffle.file.buffer parameter for Spark and the mapreduce.shuffle.ssl.file.buffer.size or mapreduce.shuffle.transfer.buffer.size parameter for MapReduce. The heap memory occupied by an external shuffle service is also proportional to the number of applications or containers that use the service. Therefore, NodeManager requires more memory on nodes with larger specifications that run more tasks. We recommend that you configure a heap memory of at least 1 GB for NodeManager.

    • Solution:

      • If the node on which NodeManager resides has sufficient resources, we recommend that you increase the size of heap memory for NodeManager based on your business requirements. You can increase the size of heap memory by modifying the YARN_NODEMANAGER_HEAPSIZE parameter in the yarn-env.sh file.

      • Check whether the configurations of shuffle services are reasonable. For example, the cache of Spark should not occupy most of the heap memory.

  • Error message: Direct buffer memory

    • Cause:

      • Direct cause: The OOM issue is caused by the overflow of off-heap memory and is generally related to external shuffle services. For example, off-heap memory is used if remote procedure calls (RPCs) of a shuffle service use NIO DirectByteBuffer.

      • Analysis: The off-heap memory occupied is proportional to the number of applications or containers that use shuffle services. For a node on which a large number of tasks use shuffle services, you need to check whether the off-heap memory size for NodeManager is excessively small.

    • Solution:

      Check whether the off-heap memory size is too small. You can view the value of -XX:MaxDirectMemorySize in the YARN_NODEMANAGER_OPTS parameter in the yarn-env.sh file. If this option is not configured, the off-heap memory size is the same as the heap memory size by default. If the off-heap memory size is too small, increase it.

  • Error message: unable to create new native thread

    For more information, see the solution for the error "unable to create new native thread" in the What do I do if an OOM issue occurs in ResourceManager? section in this topic.

After an ECS instance is restarted, NodeManager fails to be started because the cgroups directory is missing. What do I do?

  • Error message: ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount point:/proc/mounts User:hadoop Path:/sys/fs/cgroup/cpu/hadoop-yarn

  • Cause: The abnormal restart of the ECS instance may be caused by a kernel defect of the instance. This is a known issue in kernel version 4.19.91-21.2.al7.x86_64. After the ECS instance is restarted, the cgroups configuration becomes invalid because the cgroups data held in memory is lost during the restart.

  • Solution: Modify the bootstrap scripts of the node groups to create the cgroups directory on the cluster nodes and to register the commands in the rc.local script so that the directory is re-created each time the ECS instance starts. The following code provides an example:

    # enable cgroups
    mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
    chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn
    # enable cgroups after reboot
    echo "mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn" >> /etc/rc.d/rc.local
    echo "chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn" >> /etc/rc.d/rc.local
    chmod +x /etc/rc.d/rc.local

What do I do if the resource configurations of NodeManager do not take effect after I save the configurations and restart NodeManager?

  • Description: The yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb parameters are modified and saved. However, the configurations do not take effect after NodeManager is restarted.

  • Cause: The CPU cores and memory size of instances in a node group may vary. You need to modify the yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb parameters for a node group for the settings to take effect.

  • Solution: Log on to the E-MapReduce (EMR) console. On the Configure tab of the YARN service page, select Node Group Configuration from the drop-down list to the right of the search box, and select the node group to which NodeManager belongs. Then, modify the yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb parameters. For more information, see Manage configuration items.

What do I do if a node is marked as unhealthy?

  • Causes:

    • Exceptions are detected during the disk health check: If the ratio of the number of normal directories to the total number of directories on a node is lower than the value of the yarn.nodemanager.disk-health-checker.min-healthy-disks parameter in the yarn-site.xml file (default value: 0.25), the node is marked as unhealthy. For example, if four disks are deployed on a node on which NodeManager resides, the node is marked as unhealthy only if directories on all four disks are abnormal. Otherwise, the message "local-dirs are bad" or "log-dirs are bad" is recorded in the NodeManager status report. For more information, see What do I do if the message "local-dirs are bad" or "log-dirs are bad" appears?

    • Exceptions are detected by the health check script of NodeManager: By default, the script is disabled. You need to configure the yarn.nodemanager.health-checker.script.path parameter in the yarn-site.xml file to enable the health check script.

  • Solutions:

    • If the node is marked as unhealthy because of the disk health check, see What do I do if the message "local-dirs are bad" or "log-dirs are bad" appears?

    • If the node is marked as unhealthy by the health check script of NodeManager, troubleshoot the issue based on the output of the script.
What do I do if the message "local-dirs are bad" or "log-dirs are bad" appears?

  • Cause: An exception is detected in the disk health check. By default, the feature of disk health check is enabled. This feature periodically checks whether local-dirs and log-dirs meet certain conditions. local-dirs is the cache directory of tasks and stores the files and intermediate data that are required for running tasks. log-dirs is the log directory of tasks and stores all the logs of tasks. If one of the following conditions is not met, local-dirs or log-dirs is marked as bad:

    • Readable.

    • Writable.

    • Executable.

    • The disk usage is lower than the threshold specified by the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage parameter in the yarn-site.xml file. The default value of this parameter is 90%.

    • The remaining available disk space is larger than the minimum available disk space that is specified by the yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb parameter in the yarn-site.xml file. The default value of this parameter is 0.

  • Solutions:

    • In most cases, the issue is caused by insufficient disk space (see the disk-usage sketch after this list). If one of the following conditions is met, we recommend that you scale out disks:

      • The specifications of the node on which NodeManager resides are high, and a large number of tasks are running on the node.

      • The disk space is small.

      • The size of data or files required for running tasks is large.

      • A large amount of intermediate data is generated when tasks are running.

      • The size of task logs is large, and task logs occupy a high proportion of disk space.

    • Check the yarn.nodemanager.localizer.cache.target-size-mb parameter in the yarn-site.xml file. If the ratio of the cache size to the disk size is too large, most disk space is occupied by cached task data. This is because the cache can be automatically cleared only if the specified threshold is exceeded.

    • Repair the damaged disks. For more information about how to repair a damaged disk, see Replace a damaged local disk for an EMR cluster.
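
To confirm that disk space is the cause, a quick sketch is to check the overall disk utilization and the space consumed by the YARN local and log directories. The paths below are examples; use the values of yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in your cluster.

  # Compare disk utilization with the 90% threshold of the disk health checker.
  df -h
  # Space used by the YARN local and log directories (example paths).
  du -sh /mnt/disk*/yarn /mnt/disk*/log/hadoop-yarn 2>/dev/null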

What do I do if the error message "User [dr.who] is not authorized to view the logs for application ***" is displayed?

  • Problem description: When you open the log page, an error message similar to "User [dr.who] is not authorized to view the logs for application ***" is displayed.

  • Cause: The access control list (ACL) rules are checked when you access the NodeManager Log page. If ACL rules are enabled and a remote user wants to view the logs of an application, the remote user must meet one of the following conditions:

    • The remote user is the admin user.

    • The remote user is the owner of the application.

    • The remote user meets the ACL rules that are customized for the application.

  • Solution: Check whether the remote user meets one of the preceding conditions.

What do I do if the error message "HTTP ERROR 401 Authentication required" or "HTTP ERROR 403 Unauthenticated users are not authorized to access this page" is displayed?

  • Problem description: The error HTTP ERROR 401 or HTTP ERROR 403 is reported when you access a web UI or call a RESTful API.

  • Cause: YARN uses the simple authentication method and does not allow anonymous access. For more information, see Authentication for Hadoop HTTP web-consoles in the official documentation of Apache Hadoop.

  • Solutions:

    • Method 1: Configure a URL parameter to specify a remote user, such as user.name=***, as shown in the sketch after this list.

    • Method 2: In the Configuration Filter section on the Configure tab of the HDFS service page of a cluster in the EMR console, search for the hadoop.http.authentication.simple.anonymous.allowed parameter and change the value of the parameter to true to allow anonymous access. For more information about the parameter, see Authentication for Hadoop HTTP web-consoles in the official documentation of Apache Hadoop. Then, restart the HDFS service. For more information, see Restart a service.
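
For Method 1, the following sketch shows a request that passes a remote user as a URL parameter. The user name hadoop is only an example; use a user that satisfies the ACL rules, and replace <rm-address> with the address of the web UI or API that you want to access.

  # Access the RESTful API as the specified user instead of the anonymous dr.who user.
  curl -s "http://<rm-address>/ws/v1/cluster/apps?user.name=hadoop"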

Why is the display value of TotalVcore incorrect?

The value of TotalVcore that is displayed in the cluster or metrics RESTful API section in the upper-right corner of the web UI of YARN is incorrect. This is a computing logic issue for TotalVcore in Apache Hadoop versions earlier than 2.9.2. For more information, see the Apache Hadoop issue Total #VCores in cluster metrics is wrong when CapacityScheduler reserved some containers.

The issue is fixed in EMR V3.37.x, EMR V5.3.x, and their later minor versions.

What do I do if the information about an application that is displayed on the web UI of TEZ is incomplete?

Open the Developer tools of a browser and troubleshoot the issue.

  1. If you identify the issue when you access an address in the http://<rmAddress>/ws/v1/cluster/apps/APPID format, a possible cause is that the application has been cleared by ResourceManager. By default, ResourceManager in YARN retains information about a maximum of 1,000 applications. When this maximum is exceeded, applications are cleared in the order in which they were started.

  2. If you identify the issue when you access an address in the http://<tsAddress>/ws/v1/applicationhistory/... format, and the error code 500 which indicates that the application is not found is returned, a possible cause is that the information about the application fails to be stored or the application information is cleared by the Timeline store. You can check the configuration of the yarn.resourcemanager.system-metrics-publisher.enabled parameter to determine whether the application information fails to be stored. You can check the time to live (TTL) of LevelDB to determine whether the application information is cleared by the Timeline store.

  3. If you identify the issue when you access an address in the http://<tsAddress>/ws/v1/timeline/... format, and the HTTP code 200 is returned but NotFound is displayed in the response, perform the following operations:

    • View the information that is printed in the AM syslog at startup. The following records indicate normal initialization:

       [INFO] [main] |history.HistoryEventHandler|: Initializing HistoryEventHandler withrecoveryEnabled=true, historyServiceClassName=org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
       [INFO] [main] |ats.ATSHistoryLoggingService|: Initializing ATSHistoryLoggingService with maxEventsPerBatch=5, maxPollingTime(ms)=10, waitTimeForShutdown(ms)=-1, TimelineACLManagerClass=org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
    • If the following exception occurs, the yarn.timeline-service.enabled parameter is incorrectly configured for the running AM. A possible cause is an issue in FlowAgent: a Hive job can be run in FlowAgent by using a Hive command or a Beeline command, and FlowAgent sets the yarn.timeline-service.enabled parameter to false by default.

      [WARN] [main] |ats.ATSHistoryLoggingService|: org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService is disabled due to Timeline Service being disabled, yarn.timeline-service.enabled set to false

Why is a Spark Thrift JDBC/ODBC Server task running on the web UI of YARN?

If you select Spark when you create a cluster, a default Spark Thrift Server service is automatically started. The service occupies the resources of one YARN driver. By default, tasks that run in Spark Thrift Server apply for required resources from YARN by using this driver.

Can the yarn.timeline-service.leveldb-timeline-store.path parameter be set to an OSS bucket URL?

No, the yarn.timeline-service.leveldb-timeline-store.path parameter cannot be set to an OSS bucket URL.

If the cluster that you create is a Hadoop cluster, the default value of the yarn.timeline-service.leveldb-timeline-store.path parameter is the same as the value of the hadoop.tmp.dir parameter. Do not modify the hadoop.tmp.dir parameter, because doing so affects the setting of the yarn.timeline-service.leveldb-timeline-store.path parameter.

What do I do if the connection to Timeline Server times out or the CPU load or memory usage is extremely high?

In the case of a large number of Tez jobs, the connection to Timeline Server of YARN may time out when data is written to Timeline Server. This is because the Timeline Server process occupies a large amount of CPU resources and the CPU load of the node on which Timeline Server runs reaches the upper limit. As a result, job development and other non-core services, such as report generation, are affected. In this case, you can temporarily stop the Timeline Server process to reduce the CPU load of the node. You can add or modify the following configuration items to resolve the issue.

Add configurations to or modify configurations of the Tez and YARN services in the EMR console. For more information, see Manage configuration items.

  1. Go to the tez-site.xml tab on the Configure tab of the Tez service page in the EMR console. Add a configuration item whose name is tez.yarn.ats.event.flush.timeout.millis and value is 60000. This configuration item specifies the timeout period that is allowed when a Tez job is run to write events to Timeline Server of YARN.

  2. Go to the yarn-site.xml tab on the Configure tab of the YARN service page in the EMR console and add or modify the following configuration items. After the addition or modification is complete, you must restart Timeline Server on the Status tab of the YARN service page. A sketch for verifying that Timeline Server responds after the restart is shown after this list.

    • yarn.timeline-service.store-class: Set the value to org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore. This configuration item specifies the event storage class of YARN Timeline Server.

    • yarn.timeline-service.rolling-period: Set the value to daily. This configuration item specifies the event rolling period of YARN Timeline Server.

    • yarn.timeline-service.leveldb-timeline-store.read-cache-size: Set the value to 4194304. This configuration item specifies the size of the read cache of the YARN Timeline Server LevelDB store.

    • yarn.timeline-service.leveldb-timeline-store.write-buffer-size: Set the value to 4194304. This configuration item specifies the size of the write buffer of the YARN Timeline Server LevelDB store.

    • yarn.timeline-service.leveldb-timeline-store.max-open-files: Set the value to 500. This configuration item specifies the maximum number of files that can be opened in the YARN Timeline Server LevelDB store.
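
After the restart, you can verify that Timeline Server responds. This is a minimal sketch; <tsAddress> is the Timeline Server web address that is used in the earlier examples.

  # The endpoint returns basic information about Timeline Server if the service is healthy.
  curl -s "http://<tsAddress>/ws/v1/timeline/"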