This topic describes the impact of the defect caused by the YARN-4946 issue and provides a solution to the defect.

Background information

The defect is caused by the YARN-4946 issue. The YARN ResourceManager (RM) considers an application to be complete, and removes it from its history, only after log aggregation for the application has entered a completed state. If log aggregation for an application completes but the completed state is not synchronized to the RM state store, the RM considers the application incomplete after the RM restarts and reloads the application, and the RM can no longer remove the application. As a result, such applications accumulate in the RM. If the number of these applications reaches the value specified by the yarn.resourcemanager.state-store.max-completed-applications or yarn.resourcemanager.max-completed-applications parameter, job scheduling of the RM is affected. The default value of both parameters is 10000.
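To check whether either limit is overridden in your cluster, you can search yarn-site.xml for the two parameters. The following command is a minimal sketch; the configuration path is an assumption and may differ on your cluster. If neither parameter is set, the default value of 10000 applies.
    # Search yarn-site.xml for the two limits. The path below assumes that the
    # HADOOP_CONF_DIR environment variable points to the Hadoop configuration directory.
    grep -A 1 "max-completed-applications" $HADOOP_CONF_DIR/yarn-site.xml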

For more information about the YARN-4946 issue, see YARN-4946.

The change introduced by the YARN-4946 issue can be reverted by applying the YARN-9571 issue. For more information about the YARN-9571 issue, see YARN-9571.

Impact of the defect

  • Affected service: YARN service deployed in a high-availability (HA) E-MapReduce (EMR) Hadoop cluster that contains the ZooKeeper service.
  • Severity level: Critical. We recommend that you fix the defect. An EMR cluster that has this defect may fail to work when it is restarted after a long period of running.
  • Defect description: The RM log keeps displaying the following error information: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, but not removing app XXX from memory as log aggregation have not finished yet. The defect causes the RM to become unavailable, or to remain unavailable for a long period of time after a restart. You can search the RM log for this message to confirm the symptom, as shown in the example after this list.
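To confirm that a cluster shows this symptom, you can search the RM log for the error message. The following command is a minimal sketch; the log path is an assumption and may differ on your cluster.
    # Count how often the symptom appears in the RM log. Replace the path with the
    # actual YARN ResourceManager log file on your cluster.
    grep -c "as log aggregation have not finished yet" /var/log/hadoop-yarn/*resourcemanager*.log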

Solution

Replace the defective JAR packages of the RMs of the YARN service in your Hadoop cluster. Then, restart the RMs. You must restart the standby RM first and then the active RM.

Take note of the following items when you use this solution:
  • This solution is suitable for the following EMR versions: V4.6.0, V4.7.0, V4.8.0, V4.9.0, V5.1.0, V5.2.0, and V5.2.1.
    Note Hadoop 3.2.1 is used in clusters of the preceding EMR versions. You can confirm the Hadoop version of your cluster as shown in the example after this list.
  • After you implement this solution, you must restart the affected service in your cluster. A restart of the affected service may cause jobs to fail. We recommend that you restart the service during off-peak hours.
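To confirm that your cluster runs Hadoop 3.2.1 before you proceed, run the following command on the master node:
    # Print the Hadoop version. The output should contain "Hadoop 3.2.1" for the
    # EMR versions listed above.
    hadoop version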

Fixing procedure

Fix the defect by replacing the JAR packages of the RMs. You must perform the following operations on the standby RM first and then on the active RM. You can identify the standby RM and the active RM from the command line, as shown in the example after the following notice.
Notice
  • During the JAR package replacement, restart one RM at a time. Restart the other RM only after the first one runs normally.
  • If you have not enabled HA for your Hadoop cluster, you do not need to perform the following operations.
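To identify which RM is the standby RM and which is the active RM, you can query the high-availability state of each RM. The following commands are a minimal sketch; the RM IDs rm1 and rm2 are assumptions and must match the values of the yarn.resourcemanager.ha.rm-ids parameter in the yarn-site.xml file of your cluster.
    # Query the HA state of each RM. One command returns "active" and the other
    # returns "standby". Replace rm1 and rm2 with the RM IDs of your cluster.
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2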
  1. Click hadoop-yarn-server-resourcemanager-3.2.1.jar to download a YARN RM JAR package.
  2. Log on to the master node of your Hadoop cluster and store the downloaded JAR package in a directory on the node. Do not store the package directly in the YARN library directory, because this would overwrite the original JAR package before it is backed up in the next step.
    In this example, the /usr/lib/hadoop-current/share/hadoop/yarn/ directory is the YARN library directory to which the JAR package is copied.
  3. Back up the original JAR package, and then copy the downloaded JAR package to the YARN library directory of the Hadoop software installation directory.
    # Move the original JAR package to a backup directory.
    mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-3.2.1.jar /tmp/
    # Copy the downloaded JAR package to the YARN library directory. Run this command
    # from the directory in which the downloaded JAR package is stored.
    cp hadoop-yarn-server-resourcemanager-3.2.1.jar $HADOOP_HOME/share/hadoop/yarn/

    In the commands, $HADOOP_HOME indicates the Hadoop software installation directory. In this example, the Hadoop software installation directory is /usr/lib/hadoop-current.

  4. Restart the standby YARN RM.
    Observe the status of the standby YARN RM after the restart. If the RM log no longer displays the error information "but not removing app XXX from memory as log aggregation have not finished yet." and jobs can be submitted, the defect is fixed. For example, you can submit a test job as shown in the example after this procedure.
  5. Search for the active RM in the EMR console and restart the active RM.
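To verify that jobs can be submitted after the restarts, you can submit a small test job. The following command is a minimal sketch; the path of the examples JAR package is an assumption based on a standard Hadoop 3.2.1 layout.
    # Submit a trivial MapReduce job to verify that the RM schedules jobs again.
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10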

Rollback procedure

If you encounter a problem and want to perform a rollback, run the following command to restore the original JAR package from the backup directory, and then restart the RM deployed on the corresponding node:
# Restore the original JAR package from the backup directory to the YARN library directory.
cp /tmp/hadoop-yarn-server-resourcemanager-3.2.1.jar $HADOOP_HOME/share/hadoop/yarn/