Hadoop YARN keeps applications and containers running through three complementary high availability (HA) mechanisms: ResourceManager High Availability (RM HA), ResourceManager restarts, and NodeManager restarts. Together, they eliminate single points of failure (SPOFs) and ensure continuity across common disruption scenarios—ResourceManager SPOFs, and ResourceManager or NodeManager upgrades and restarts.
How it works
YARN is a distributed cluster resource management system built on a master-slave architecture. The ResourceManager (master) schedules jobs and manages cluster resources. The NodeManager (slave) manages and monitors jobs on individual nodes.
All three HA mechanisms depend on ZooKeeper for distributed leader election and state persistence, which guarantees strong consistency across the cluster.
| Configuration file | Configuration item | Recommended value | Description |
|---|---|---|---|
| core-site.xml | hadoop.zk.address | <zk1-host>:<zk1-port>,<zk2-host>:<zk2-port>,<zk3-host>:<zk3-port> | ZooKeeper endpoints used for leader election and application state storage. Separate multiple endpoints with commas (,). |
RM HA
RM HA runs multiple ResourceManager processes on different nodes. At any time, only one active ResourceManager records and synchronizes application state to ZooKeeper. If the active ResourceManager or its host node fails, the standby ResourceManagers hold a new election using ZooKeeper's distributed locking mechanism. The newly elected active ResourceManager restores all application state from ZooKeeper and resumes resource management and scheduling.

Configuration
All RM HA settings go in yarn-site.xml.
| Configuration item | Recommended value | Default | Description |
|---|---|---|---|
yarn.resourcemanager.ha.enabled | true | false | Enables RM HA. |
yarn.resourcemanager.ha.automatic-failover.enabled | true or leave blank | true | Enables automatic failover. |
yarn.resourcemanager.ha.automatic-failover.embedded | true or leave blank | true | Uses the embedded leader elector to elect the active ResourceManager. |
yarn.resourcemanager.ha.curator-leader-elector.enabled | true | false | Uses non-curator components for leader election. |
yarn.resourcemanager.ha.automatic-failover.zk-base-path | Leave blank | /yarn-leader-election | Root directory in ZooKeeper where the leader elector state is stored. |
yarn.resourcemanager.ha.rm-ids | rm1,rm2,rm3 | — | IDs of all ResourceManagers. Separate multiple IDs with commas (,). |
yarn.resourcemanager.cluster-id | <cluster-id> | — | Cluster ID. The RM HA storage path in ZooKeeper is derived from this value. |
yarn.resourcemanager.hostname.<rm-id> | Leave blank | — | Hostname of a ResourceManager instance. Configure one entry per ResourceManager. |
yarn.resourcemanager.address.<rm-id> | Leave blank | — | Remote procedure call (RPC) address for YARN clients to submit jobs. Configure one entry per ResourceManager. |
yarn.resourcemanager.scheduler.address.<rm-id> | Leave blank | — | RPC address for ApplicationMasters to request resources. Configure one entry per ResourceManager. |
yarn.resourcemanager.resource-tracker.address.<rm-id> | Leave blank | — | RPC address for NodeManagers to report resource and container status. Configure one entry per ResourceManager. |
yarn.resourcemanager.admin.address.<rm-id> | Leave blank | — | RPC address for admin commands. Configure one entry per ResourceManager. |
yarn.resourcemanager.webapp.address.<rm-id> | Leave blank | — | HTTP address for the ResourceManager web UI. Configure one entry per ResourceManager. |
yarn.resourcemanager.webapp.https.address.<rm-id> | Leave blank | — | HTTPS address for the ResourceManager web UI. Required only when yarn.http.policy is set to HTTPS_ONLY. Configure one entry per ResourceManager. |
ResourceManager restarts
A ResourceManager restart continuously synchronizes application state to ZooKeeper. When the ResourceManager restarts, it reloads that state so applications can resume automatically after an EMR cluster upgrade or restart.
Two restart modes are available:
| Restart mode | Behavior | Impact on running applications |
|---|---|---|
| Work-preserving | The ResourceManager takes over all running applications from restored state. | Minimal—applications continue without interruption. |
| Non-work-preserving | All running applications are stopped and resubmitted based on restored state. | Significant—all running applications are interrupted. |
Configuration
All ResourceManager restart settings go in yarn-site.xml.
| Configuration item | Recommended value | Default | Description |
|---|---|---|---|
yarn.resourcemanager.recovery.enabled | true | false | Enables the ResourceManager restart feature. |
yarn.resourcemanager.work-preserving-recovery.enabled | true or leave blank | true | Enables work-preserving restart. |
yarn.resourcemanager.store.class | org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore | — | State store implementation class. Only ZooKeeper supports RM HA. |
yarn.resourcemanager.zk-state-store.parent-path | Leave blank | /rmstore | ZooKeeper path where application state is stored. |
NodeManager restarts
A NodeManager restart continuously synchronizes running container state to local storage (such as LevelDB). When the NodeManager restarts, it reloads that state so running containers are not affected. Applications are not interrupted during node upgrades or restarts.
Configuration
All NodeManager restart settings go in yarn-site.xml.
| Configuration item | Recommended value | Default | Description |
|---|---|---|---|
yarn.nodemanager.recovery.enabled | true | false | Enables the NodeManager restart feature. |
yarn.nodemanager.recovery.dir | /home/hadoop/yarn-nm-recovery | ${hadoop.tmp.dir}/yarn-nm-recovery | Local directory for container state storage. Use a system disk path other than /tmp, and make sure the hadoop user has read and write permissions. The /home/hadoop/yarn-nm-recovery path is recommended to avoid data loss from /tmp cleanup or disk failures. |
yarn.nodemanager.recovery.supervised | true | false | Retains local state when the NodeManager exits. When set to true, state is restorable after a restart. |
yarn.nodemanager.address | ${yarn.nodemanager.hostname}:8041 or 0.0.0.0:8041 | — | RPC address of the NodeManager, also used as its ID. Use a fixed port. If the port is set to 0, a random port is assigned on each restart, changing the NodeManager's ID and making recovery invalid. |