All Products
Search
Document Center

E-MapReduce:High availability feature of YARN

Last Updated:Mar 26, 2026

Hadoop YARN keeps applications and containers running through three complementary high availability (HA) mechanisms: ResourceManager High Availability (RM HA), ResourceManager restarts, and NodeManager restarts. Together, they eliminate single points of failure (SPOFs) and ensure continuity across common disruption scenarios—ResourceManager SPOFs, and ResourceManager or NodeManager upgrades and restarts.

How it works

YARN is a distributed cluster resource management system built on a master-slave architecture. The ResourceManager (master) schedules jobs and manages cluster resources. The NodeManager (slave) manages and monitors jobs on individual nodes.

All three HA mechanisms depend on ZooKeeper for distributed leader election and state persistence, which guarantees strong consistency across the cluster.

Configuration fileConfiguration itemRecommended valueDescription
core-site.xmlhadoop.zk.address<zk1-host>:<zk1-port>,<zk2-host>:<zk2-port>,<zk3-host>:<zk3-port>ZooKeeper endpoints used for leader election and application state storage. Separate multiple endpoints with commas (,).

RM HA

RM HA runs multiple ResourceManager processes on different nodes. At any time, only one active ResourceManager records and synchronizes application state to ZooKeeper. If the active ResourceManager or its host node fails, the standby ResourceManagers hold a new election using ZooKeeper's distributed locking mechanism. The newly elected active ResourceManager restores all application state from ZooKeeper and resumes resource management and scheduling.

1

Configuration

All RM HA settings go in yarn-site.xml.

Configuration itemRecommended valueDefaultDescription
yarn.resourcemanager.ha.enabledtruefalseEnables RM HA.
yarn.resourcemanager.ha.automatic-failover.enabledtrue or leave blanktrueEnables automatic failover.
yarn.resourcemanager.ha.automatic-failover.embeddedtrue or leave blanktrueUses the embedded leader elector to elect the active ResourceManager.
yarn.resourcemanager.ha.curator-leader-elector.enabledtruefalseUses non-curator components for leader election.
yarn.resourcemanager.ha.automatic-failover.zk-base-pathLeave blank/yarn-leader-electionRoot directory in ZooKeeper where the leader elector state is stored.
yarn.resourcemanager.ha.rm-idsrm1,rm2,rm3IDs of all ResourceManagers. Separate multiple IDs with commas (,).
yarn.resourcemanager.cluster-id<cluster-id>Cluster ID. The RM HA storage path in ZooKeeper is derived from this value.
yarn.resourcemanager.hostname.<rm-id>Leave blankHostname of a ResourceManager instance. Configure one entry per ResourceManager.
yarn.resourcemanager.address.<rm-id>Leave blankRemote procedure call (RPC) address for YARN clients to submit jobs. Configure one entry per ResourceManager.
yarn.resourcemanager.scheduler.address.<rm-id>Leave blankRPC address for ApplicationMasters to request resources. Configure one entry per ResourceManager.
yarn.resourcemanager.resource-tracker.address.<rm-id>Leave blankRPC address for NodeManagers to report resource and container status. Configure one entry per ResourceManager.
yarn.resourcemanager.admin.address.<rm-id>Leave blankRPC address for admin commands. Configure one entry per ResourceManager.
yarn.resourcemanager.webapp.address.<rm-id>Leave blankHTTP address for the ResourceManager web UI. Configure one entry per ResourceManager.
yarn.resourcemanager.webapp.https.address.<rm-id>Leave blankHTTPS address for the ResourceManager web UI. Required only when yarn.http.policy is set to HTTPS_ONLY. Configure one entry per ResourceManager.

ResourceManager restarts

A ResourceManager restart continuously synchronizes application state to ZooKeeper. When the ResourceManager restarts, it reloads that state so applications can resume automatically after an EMR cluster upgrade or restart.

Two restart modes are available:

Restart modeBehaviorImpact on running applications
Work-preservingThe ResourceManager takes over all running applications from restored state.Minimal—applications continue without interruption.
Non-work-preservingAll running applications are stopped and resubmitted based on restored state.Significant—all running applications are interrupted.

Configuration

All ResourceManager restart settings go in yarn-site.xml.

Configuration itemRecommended valueDefaultDescription
yarn.resourcemanager.recovery.enabledtruefalseEnables the ResourceManager restart feature.
yarn.resourcemanager.work-preserving-recovery.enabledtrue or leave blanktrueEnables work-preserving restart.
yarn.resourcemanager.store.classorg.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStoreState store implementation class. Only ZooKeeper supports RM HA.
yarn.resourcemanager.zk-state-store.parent-pathLeave blank/rmstoreZooKeeper path where application state is stored.

NodeManager restarts

A NodeManager restart continuously synchronizes running container state to local storage (such as LevelDB). When the NodeManager restarts, it reloads that state so running containers are not affected. Applications are not interrupted during node upgrades or restarts.

Configuration

All NodeManager restart settings go in yarn-site.xml.

Configuration itemRecommended valueDefaultDescription
yarn.nodemanager.recovery.enabledtruefalseEnables the NodeManager restart feature.
yarn.nodemanager.recovery.dir/home/hadoop/yarn-nm-recovery${hadoop.tmp.dir}/yarn-nm-recoveryLocal directory for container state storage. Use a system disk path other than /tmp, and make sure the hadoop user has read and write permissions. The /home/hadoop/yarn-nm-recovery path is recommended to avoid data loss from /tmp cleanup or disk failures.
yarn.nodemanager.recovery.supervisedtruefalseRetains local state when the NodeManager exits. When set to true, state is restorable after a restart.
yarn.nodemanager.address${yarn.nodemanager.hostname}:8041 or 0.0.0.0:8041RPC address of the NodeManager, also used as its ID. Use a fixed port. If the port is set to 0, a random port is assigned on each restart, changing the NodeManager's ID and making recovery invalid.

What's next