- Why is my ZooKeeper service unstable or restarting unexpectedly?
- How do I migrate the ZooKeeper data directory without interrupting the service?
Why is my ZooKeeper service unstable or restarting unexpectedly?
The most common cause is too many znodes or snapshots that are too large. ZooKeeper keeps all znodes in memory and replicates the full data tree across every server in the ensemble, so if either limit is exceeded, memory pressure makes the service unstable or causes it to crash.
ZooKeeper is a distributed coordination service, not a file system. If your znode count is climbing into the hundreds of thousands, check whether upstream applications are writing to ZooKeeper beyond its intended purpose.
Keep within the following limits:
| Resource | Recommended limit |
|---|---|
| Znode count | Fewer than 100,000 |
| Snapshot size | Smaller than 800 MB per snapshot |
To check the znode count, go to the Monitoring tab on the cluster details page in the E-MapReduce (EMR) console.
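If you have shell access, the znode count is also exposed by ZooKeeper's `mntr` four-letter command (on ZooKeeper 3.5 and later it must be allowed via `4lw.commands.whitelist`). The following sketch parses the `zk_znode_count` field from sample output; in practice, capture the output with `echo mntr | nc <host> 2181`:

```shell
# Sample mntr output (illustrative); in practice capture it with:
#   mntr_output=$(echo mntr | nc localhost 2181)
mntr_output="zk_version	3.6.3
zk_avg_latency	0
zk_znode_count	120000"

# Pull out the znode count field.
znode_count=$(printf '%s\n' "$mntr_output" | awk '$1=="zk_znode_count" {print $2}')
echo "znode count: $znode_count"

# Warn when the recommended limit is exceeded.
if [ "$znode_count" -gt 100000 ]; then
  echo "WARN: znode count exceeds 100,000"
fi
```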
To check snapshot sizes:
1. On the Configure tab of the ZooKeeper service page, search for `dataDir` to find the data directory path.
2. Run the following command to list snapshot files and their sizes:

   ```shell
   ls -lrt /mnt/disk1/zookeeper/data/version-2/snapshot*
   ```
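To flag oversized snapshots automatically rather than reading the listing by eye, `find -size` can do the comparison. This is a self-contained sketch that uses sparse files in a temporary directory; in practice, point `snap_dir` at the `version-2` directory under your real `dataDir`:

```shell
# Demo directory with sparse files; replace snap_dir with the version-2
# directory under your dataDir in practice.
snap_dir=$(mktemp -d)
truncate -s 900M "$snap_dir/snapshot.1a0"   # oversized example
truncate -s 10M  "$snap_dir/snapshot.1b0"   # within the limit

# -size +800M matches files strictly larger than 800 MB.
oversized=$(find "$snap_dir" -name 'snapshot.*' -size +800M | wc -l)
echo "$oversized snapshot(s) exceed 800 MB"

rm -rf "$snap_dir"
```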
If either limit is exceeded, inspect the distribution of znodes to identify the paths that are growing, then stop the upstream applications that are writing excessively to them.
How do I migrate the ZooKeeper data directory without interrupting the service?
If disk space runs out or disk performance degrades, migrate the ZooKeeper data directory to a new path. Migrate the followers first and the leader last; this keeps the ZooKeeper ensemble available throughout the migration.
The following example migrates from /mnt/disk1/zookeeper to /mnt/disk2/zookeeper. In this cluster, master-1-2 is the leader and master-1-1 and master-1-3 are followers.
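To confirm which node is currently the leader before you start, the `srvr` four-letter command reports each server's mode (like `mntr`, it must be allowed via `4lw.commands.whitelist`). A sketch that parses the `Mode` line from sample output; in practice, capture the output per node with `echo srvr | nc <node-host> 2181`:

```shell
# Sample srvr output (illustrative); in practice capture it per node with:
#   srvr_output=$(echo srvr | nc <node-host> 2181)
srvr_output="Zookeeper version: 3.6.3
Latency min/avg/max: 0/0/5
Mode: follower"

# Extract the role (leader, follower, or standalone).
role=$(printf '%s\n' "$srvr_output" | awk -F': ' '/^Mode/ {print $2}')
echo "role: $role"
```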
Step 1: Update the data directory configuration
1. On the Configure tab of the ZooKeeper service page, search for `dataDir` and change its value to `/mnt/disk2/zookeeper`.
2. Click Save.
3. In the Save dialog, fill in Execution Reason and click Save.
Step 2: Deploy the updated configuration
1. In the upper-right corner of the Configure tab, click Deploy Client Configuration.
2. In the dialog, fill in Execution Reason and click OK.
3. In the confirmation message, click OK.
Step 3: (Optional) Verify the new data directory
1. Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
2. Run the following command and confirm that `dataDir` points to `/mnt/disk2/zookeeper`:

   ```shell
   cat /etc/emr/zookeeper-conf/zoo.cfg
   ```
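To check only the `dataDir` setting instead of reading the whole file, you can filter for that key. This self-contained sketch parses a sample config written to a temporary file; in practice, run the same `awk` against /etc/emr/zookeeper-conf/zoo.cfg directly:

```shell
# Sample zoo.cfg written to a temp file for illustration; in practice read
# /etc/emr/zookeeper-conf/zoo.cfg directly.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
tickTime=2000
dataDir=/mnt/disk2/zookeeper
clientPort=2181
EOF

# Split on '=' and print the value for the dataDir key.
data_dir=$(awk -F= '$1=="dataDir" {print $2}' "$cfg")
echo "dataDir: $data_dir"

rm -f "$cfg"
```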
Step 4: Migrate data on each node
Perform the following on master-1-1 and master-1-3 (followers) first, then on master-1-2 (leader).
For each node:
1. On the Status tab of the ZooKeeper service page, find the node and click Stop in the Actions column. Fill in Execution Reason and click OK, then OK again.
2. Log on to the node in SSH mode and run the following command to copy the data directory and set the correct ownership:

   ```shell
   sudo rm -rf /mnt/disk2/zookeeper && \
   sudo cp -rf /mnt/disk1/zookeeper /mnt/disk2/zookeeper && \
   sudo chown -R hadoop:hadoop /mnt/disk2/zookeeper
   ```

3. On the Status tab, find the node and click Start in the Actions column. Fill in Execution Reason and click OK, then OK again.
4. Refresh the page until Health Status shows Healthy for the node before moving on to the next one.
Migration is complete when all nodes show Health Status as Healthy.
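While waiting for the console to report Healthy, you can also poll the node directly: the `ruok` four-letter command answers `imok` once the server is serving requests (again subject to `4lw.commands.whitelist`). A sketch of the polling loop with a mocked check; in practice, replace `check_node` with `echo ruok | nc <node-host> 2181`:

```shell
# check_node stands in for: echo ruok | nc <node-host> 2181
check_node() { echo imok; }   # mock answer for illustration

status=unhealthy
for attempt in 1 2 3 4 5; do
  if [ "$(check_node)" = "imok" ]; then
    status=healthy
    break
  fi
  sleep 5   # wait before retrying
done
echo "node is $status"
```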