This topic describes the impact of the defect caused by the YARN-4946 issue and provides a solution to the defect.
Background information
The defect is caused by the YARN-4946 issue. YARN ResourceManager (RM) considers an application complete and removes it from its history only after log aggregation for the application enters a completed state. If log aggregation for an application completes but the completed state is not synchronized to the RM state store, the RM considers the application incomplete after a restart and cannot delete it when it reloads the application. As a result, such applications accumulate in the RM. If the number of these applications reaches the value specified by the ${yarn.resourcemanager.state-store.max-completed-applications} or ${yarn.resourcemanager.max-completed-applications} parameter, job scheduling of the RM is affected. The default value of both parameters is 10000.
For more information about the defect, see YARN-4946.
The change introduced by the YARN-4946 issue can be reverted by using the YARN-9571 issue. For more information, see YARN-9571.
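You can check which limit applies to your cluster by inspecting the two parameters in yarn-site.xml. The following sketch uses a sample file in /tmp to stand in for $HADOOP_CONF_DIR/yarn-site.xml on a real cluster; the property value shown is only the documented default.

```shell
# Sample configuration file that stands in for $HADOOP_CONF_DIR/yarn-site.xml.
cat > /tmp/yarn-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.max-completed-applications</name>
    <value>10000</value>
  </property>
</configuration>
EOF

# Extract the configured limit. If the property is absent, the default of
# 10000 applies.
limit=$(grep -A1 'max-completed-applications' /tmp/yarn-site-sample.xml \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
echo "max-completed-applications limit: ${limit:-10000}"
```

On a real cluster, point the grep at your actual yarn-site.xml and also check yarn.resourcemanager.state-store.max-completed-applications, because the smaller applicable limit determines when application cleanup is triggered.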
Impact of the defect
- Affected service: YARN service deployed in a high-availability (HA) E-MapReduce (EMR) Hadoop cluster that contains the ZooKeeper service.
- Severity level: Critical. We recommend that you fix the defect. An EMR cluster that has this defect may fail to work during a restart after the cluster has run for a long period of time.
- Defect description: The RM log keeps displaying the following error information: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, but not removing app XXX from memory as log aggregation have not finished yet. The defect can cause the RM to become unavailable, or to remain unavailable for a long period of time after a restart.
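To confirm that a cluster is affected, you can search the RM log for the error message above. The following sketch writes a sample log line to /tmp as a stand-in for a real ResourceManager log file; the application ID in the sample line is hypothetical.

```shell
# Sample RM log line that stands in for a real ResourceManager log file.
cat > /tmp/rm-sample.log <<'EOF'
WARN RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, but not removing app application_1600000000000_0001 from memory as log aggregation have not finished yet.
EOF

# Count occurrences of the symptom. A steadily growing count across restarts
# indicates that completed applications are piling up in RM memory.
count=$(grep -c 'not removing app .* log aggregation have not finished' /tmp/rm-sample.log)
echo "matching error lines: $count"
```

On a real cluster, run the grep against the ResourceManager log files under your Hadoop log directory instead of the sample file.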
Solution
Replace the defective JAR package on the RMs of the YARN service in your Hadoop cluster. Then, restart the RMs. You must restart the standby RM first and then the active RM.
- This solution is suitable for the following EMR versions: V4.6.0, V4.7.0, V4.8.0, V4.9.0, V5.1.0, V5.2.0, and V5.2.1.
Note Hadoop 3.2.1 is used in clusters of the preceding EMR versions.
- After you implement this solution, you must restart the affected service in your cluster. A restart of the affected service may cause jobs to fail. We recommend that you restart the service during off-peak hours.
Fixing procedure
- During the JAR package replacement, restart one RM first. After the first RM runs normally, restart the other RM.
- If you have not enabled HA for your Hadoop cluster, you do not need to perform the following operations.
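The replacement flow can be sketched as follows. The paths follow the rollback command in this topic; the HADOOP_HOME default, the stand-in JAR, and the commented-out patched-JAR path and rmadmin check are assumptions for illustration, not the exact EMR procedure.

```shell
# Demo default so the sketch is self-contained; on a cluster, HADOOP_HOME is
# already set by the environment.
HADOOP_HOME=${HADOOP_HOME:-/tmp/hadoop-demo}
YARN_LIB="$HADOOP_HOME/share/hadoop/yarn"
mkdir -p "$YARN_LIB"
: > "$YARN_LIB/hadoop-yarn-server-resourcemanager-3.2.1.jar"  # stand-in JAR for the demo

# 1. Back up the current JAR to /tmp so the rollback procedure can restore it.
cp "$YARN_LIB/hadoop-yarn-server-resourcemanager-3.2.1.jar" /tmp/

# 2. Copy the patched JAR into place (the patched JAR path is hypothetical).
# cp /path/to/patched/hadoop-yarn-server-resourcemanager-3.2.1.jar "$YARN_LIB/"

# 3. Restart the standby RM, wait until it runs normally, then restart the
#    active RM. You can check RM roles with, for example:
# yarn rmadmin -getServiceState rm1

test -f /tmp/hadoop-yarn-server-resourcemanager-3.2.1.jar && echo "backup created"
```

Backing up to /tmp first is what makes the single-command rollback below possible.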
Rollback procedure
Run the following command to restore the original JAR package, and then restart the RMs in the same order as in the fixing procedure: the standby RM first, and then the active RM.
cp /tmp/hadoop-yarn-server-resourcemanager-3.2.1.jar $HADOOP_HOME/share/hadoop/yarn/