This topic describes the disaster recovery of data and services in EMR clusters.

Disaster recovery of data

In HDFS, each file is divided into blocks, and each block is stored as multiple replicas that are distributed across different racks. The number of replicas is configurable; by default, the replication factor is 3. With this default, one replica is placed on a node in the local rack, another on a different node in the same rack, and the last on a node in a remote rack.
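The cluster-wide default replication factor is set with the standard `dfs.replication` property in `hdfs-site.xml`; a minimal sketch (3 is already the HDFS default):

```xml
<!-- hdfs-site.xml: default number of replicas per block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

The replication factor can also be changed per path after the fact with `hdfs dfs -setrep`, for example `hdfs dfs -setrep -w 2 /user/data`.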

HDFS scans replicas on a regular basis. If a replica is lost, HDFS copies an existing replica to restore the configured replication factor. If a node fails, HDFS re-replicates all blocks that were stored on that node from the surviving replicas. In addition, cloud disks on Alibaba Cloud maintain three replicas of each disk at the block storage layer; if one of these replicas becomes faulty, data is copied from a healthy replica to ensure data reliability.
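The re-replication behavior can be illustrated with a small simulation. All names below are hypothetical; real HDFS performs this bookkeeping inside the NameNode, so this is only a sketch of the idea:

```python
# Sketch of HDFS-style re-replication after a node failure (illustrative only).
REPLICATION_FACTOR = 3

# block ID -> set of nodes currently holding a replica
blocks = {
    "blk_1": {"node-a", "node-b", "node-c"},
    "blk_2": {"node-a", "node-d", "node-e"},
}

def handle_node_failure(blocks, failed, live_nodes, factor=REPLICATION_FACTOR):
    """Drop the failed node's replicas, then copy each under-replicated
    block from a surviving replica onto a new node."""
    for block_id, holders in blocks.items():
        holders.discard(failed)
        while len(holders) < factor:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break  # not enough live nodes to restore the factor
            holders.add(candidates[0])  # "copy" from a surviving replica

live = ["node-b", "node-c", "node-d", "node-e", "node-f"]
handle_node_failure(blocks, "node-a", live)
print(sorted(blocks["blk_1"]))  # every block is back to 3 replicas
```

The real placement decision also weighs rack locality and node load; the sketch only restores the replica count.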

HDFS is a proven, reliable storage system for large volumes of data. You can also back up HDFS data to Object Storage Service (OSS) to further improve data reliability.
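One common way to back up HDFS data to OSS is Hadoop's DistCp tool. A hedged sketch, assuming the cluster's OSS connector is configured and using placeholder bucket and path names:

```shell
# Copy an HDFS directory to an OSS bucket (bucket and paths are placeholders).
# -update copies only files that differ from the target, so repeated runs
# behave like an incremental backup.
hadoop distcp -update \
  hdfs:///user/hive/warehouse \
  oss://my-backup-bucket/emr-backup/warehouse
```

Because DistCp runs as a MapReduce job, the copy is parallelized across the cluster rather than funneled through a single node.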

Disaster recovery for services

Core services in EMR clusters, such as YARN, HDFS, Hive Server, and Hive Metastore, are deployed in high availability (HA) mode. In HA mode, each service runs on at least two nodes to support disaster recovery. If one node fails, the service fails over to another node, so service availability is not affected.
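For HDFS, HA is typically configured as a logical nameservice backed by two NameNodes with automatic failover. A minimal `hdfs-site.xml` sketch using standard Hadoop properties; the nameservice and host names are placeholders:

```xml
<!-- Two NameNodes behind one logical nameservice (placeholder names) -->
<property>
  <name>dfs.nameservices</name>
  <value>emr-cluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.emr-cluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.emr-cluster.nn1</name>
  <value>master-1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.emr-cluster.nn2</name>
  <value>master-2:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

With this layout, clients address the nameservice rather than a specific host, and you can check which NameNode is active with `hdfs haadmin -getServiceState nn1`.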