By Li Weiwei from Alibaba Cloud Storage Team
Geo-disaster recovery for data is a universal requirement for enterprise customers, especially for customers in the government and finance sectors. Data is the core asset and lifeline of an enterprise in the big data era. Disasters occur from time to time in the real world. When they happen, disaster tolerance becomes the key to the survival of enterprises.
Traditional disaster recovery solutions often require enterprises to build recovery centers, purchase private lines, and invest in an O&M workforce, all of which are costly. With the rapid development of cloud computing, more enterprise customers are considering cloud disaster recovery. Cloud disaster recovery services, commonly known as Disaster Recovery as a Service (DRaaS), save the cost of building a disaster recovery center and the subsequent O&M costs, and help customers quickly establish cross-region disaster recovery solutions. Because the service can be provisioned and released on demand, it also gives users great flexibility. The following table compares DRaaS with traditional disaster recovery solutions. Compared with traditional disaster recovery, DRaaS requires no infrastructure and less O&M, and it offers high elasticity. As a result, DRaaS has become the trend for disaster recovery in the cloud computing era.
ESSD, the flagship product of Alibaba Cloud EBS, is a world-leading offering that has gradually matured. To serve enterprise customers better and meet their needs for disaster recovery on the cloud, Alibaba Cloud EBS has also launched its own DRaaS product, which enables asynchronous replication of cloud disks across regions. This article describes how to select an appropriate cloud disaster recovery product and analyzes the similarities and differences of different disaster recovery architectures from a technical perspective. It then discusses the disaster recovery architecture chosen for ESSD and the technical principles behind asynchronous cloud disk replication.
When choosing a disaster recovery solution, an enterprise should first determine its disaster recovery level according to its business characteristics. In the disaster recovery industry, the Recovery Point Objective (RPO) is usually used to measure the maximum amount of data that a disaster recovery system may lose, expressed as a window of time before the disaster. The Recovery Time Objective (RTO) measures the maximum duration from the moment the disaster happens until the entire system is recovered.
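To make the two metrics concrete, the following is a minimal sketch that derives the achieved RPO and RTO from three timestamps and checks them against targets. The function names and the timeline are illustrative assumptions and do not come from any Alibaba Cloud API.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_consistent_copy: datetime, disaster_time: datetime) -> timedelta:
    """Data written after the last consistent copy is lost, so this gap is the achieved RPO."""
    return disaster_time - last_consistent_copy


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Time from the disaster until the service is usable again."""
    return service_restored - disaster_time


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """Both metrics must stay within their targets."""
    return rpo <= rpo_target and rto <= rto_target


# Made-up timeline for illustration only.
last_copy = datetime(2022, 4, 25, 10, 0)
disaster = datetime(2022, 4, 25, 10, 12)
restored = datetime(2022, 4, 25, 10, 13)

rpo = achieved_rpo(last_copy, disaster)   # 12 minutes of writes are lost
rto = achieved_rto(disaster, restored)    # service is back after 1 minute
ok = meets_objectives(rpo, rto, timedelta(minutes=15), timedelta(minutes=5))
```

In this example the last consistent copy is 12 minutes old when the disaster happens, so a 15-minute RPO target is met.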
China has issued relevant standards, dividing disaster tolerance capabilities into six levels, as shown in the figure below:
For enterprises, the higher the level, the lower the risk of data loss, but the higher the cost of disaster recovery construction. Usually, data backup and archiving products can meet level 1 and 2 disaster recovery requirements in the traditional storage industry. The backup function of common storage arrays can meet level 3 to 5 requirements. The asynchronous replication function of high-end storage arrays can meet level 4 to 5 requirements. High-end storage synchronous replication, active-active function, and application-based replication can meet level 5 to 6 requirements.
Major cloud vendors also provide a wide range of cloud products to meet the requirements of different disaster recovery levels. The cloud disaster recovery center usually provides cross-region or cross-availability zone cloud disaster recovery services, which can meet the requirements of levels 1 to 4. Asynchronous replication and synchronous replication products can meet the disaster recovery requirements of levels 5 to 6. Mainstream applications, such as database services, usually have their own disaster recovery products, which can achieve the I/O-level disaster recovery granularity.
Asynchronous replication can meet level 4 to 5 disaster recovery requirements. This is exactly what customers such as banks, other financial institutions, and government agencies need.
From the perspective of implementation, the existing disaster recovery solutions of cloud vendors are roughly divided into three categories: application-based, instance-based, and block storage-based.
Application-based disaster recovery is usually a solution for a specific application service, such as a cloud database, a message queue, or Object Storage Service (OSS). Users of these cloud services can choose the disaster recovery service of the corresponding product according to their needs. The advantage of this kind of disaster recovery service is that it can often achieve application-level consistency by working with the business logic. The disadvantage is that it is not general-purpose: only businesses built on those specific applications can use it.
Some customers have purchased only IaaS services or run their own customized businesses, so application-level disaster recovery services may not meet their requirements. They can choose a cloud host-based (instance-based) disaster recovery solution. This solution protects data consistency for a whole machine or across instances. The disaster recovery end usually restores the host, network, and data storage together, which is convenient to use. This kind of disaster recovery service has the advantages of simple operation and strong universality. The disadvantage is that you also need to buy host resources at the disaster recovery end, so the cost is high.
The core of disaster recovery is data disaster recovery. Therefore, some vendors have also launched cross-region replication products for cloud disks. This product form is relatively flexible and generally has no restrictions on applications. You do not have to purchase hosts for the disaster recovery end during replication. It can reduce user costs and can be seamlessly used with other cloud services to provide similar effects of application-level disaster recovery. Similarly, a consistency group can be generated based on cloud disks, and the replication data of a group of cloud disks meet the crash consistency semantics.
The asynchronous replication capability of the cloud disk supports asynchronous replication of ESSD disk data across regions, with an RPO of 15 minutes. Users need only three simple steps to create a disaster recovery pair. First, select the cloud disk to be replicated. Second, select the disaster recovery site and create a secondary disk. Third, create a disaster recovery pair and enable it. After the disaster recovery pair is enabled, the disk data is periodically replicated to the corresponding secondary disk at the disaster recovery site. When users want to temporarily stop replication, they can use the stop function of the disaster recovery pair.
When a fault occurs, users can use the failover function to switch over between the primary and secondary sites. The failover disconnects the replication process and restores the secondary disk to the consistency point of the last completed replication, giving the user read and write access to it.
If the user wants to move the service back to the original production site after the disaster has been resolved, the reverse recovery function can be used to replicate the incremental data generated at the disaster recovery site back to the primary site.
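The lifecycle described above (enable, stop, failover, reverse recovery) can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the class, state names, and action names are hypothetical and are not the Alibaba Cloud API, and the set of allowed transitions is inferred from the prose rather than documented.

```python
from enum import Enum


class PairState(Enum):
    CREATED = "created"                        # pair exists, replication not yet enabled
    REPLICATING = "replicating"                # periodic copy to the secondary disk
    STOPPED = "stopped"                        # replication stopped by the user
    FAILED_OVER = "failed_over"                # secondary disk is readable and writable
    REVERSE_RECOVERING = "reverse_recovering"  # increments flow back to the primary site


# (current state, action) -> next state; any other combination is rejected.
TRANSITIONS = {
    (PairState.CREATED, "enable"): PairState.REPLICATING,
    (PairState.REPLICATING, "stop"): PairState.STOPPED,
    (PairState.STOPPED, "enable"): PairState.REPLICATING,
    (PairState.REPLICATING, "failover"): PairState.FAILED_OVER,
    (PairState.STOPPED, "failover"): PairState.FAILED_OVER,
    (PairState.FAILED_OVER, "reverse_recover"): PairState.REVERSE_RECOVERING,
}


class ReplicaPair:
    """Hypothetical model of a disaster recovery pair (primary disk + secondary disk)."""

    def __init__(self, primary_disk: str, secondary_disk: str):
        self.primary_disk = primary_disk
        self.secondary_disk = secondary_disk
        self.state = PairState.CREATED

    def apply(self, action: str) -> PairState:
        """Move to the next state, or raise if the action is not allowed right now."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} while {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state

    def secondary_writable(self) -> bool:
        """In this sketch the secondary disk accepts user I/O only after failover."""
        return self.state is PairState.FAILED_OVER


pair = ReplicaPair("disk-primary", "disk-secondary")
pair.apply("enable")           # step 3 of the setup: enable the pair
pair.apply("failover")         # a fault occurs at the production site
pair.apply("reverse_recover")  # move increments back once the primary site is healthy
```

Encoding the transitions as a table keeps invalid operations, such as a failover on a pair that was never enabled, out of the happy path.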
This part explores the implementation principles of asynchronous replication widely used in disaster recovery products and its architecture similarities and differences with traditional storage. The core of disaster recovery is data disaster recovery. Block storage disaster recovery is the most common disaster recovery solution. Therefore, the following part focuses on the asynchronous replication of cloud disks.
There are roughly three ways to implement asynchronous replication of traditional storage:
Storage Gateway-Based: The storage gateway sits between servers and storage devices. It is a storage service technology built on SAN networks that can apply flexible storage services to the I/O flow. Because the gateway is separate from both the host server and the array, it does not consume host or storage-end resources, and it can support replication between heterogeneous systems. However, because all I/O passes through the gateway, performance is affected, so this approach is not well suited to businesses that require high performance.
Host-Based: This way is usually implemented on the Initiator end in SAN storage, and data diversion is performed on the host end according to I/O replication requirements. A typical implementation is DRBD. This architecture does not have requirements on the backend storage array, and corresponding software needs to be installed on the host end. Third-party vendors engaged in disaster recovery services usually adopt this architecture.
Storage Array-Based: Most storage array vendors implement a storage array-based replication architecture according to their array features. Under this architecture, vendors perform data tracking and double write at the Target end based on their array I/O architecture.
Cloud vendors usually take their product features into consideration and adopt two technical architectures:
Implementation Architecture with Agent: This method usually requires a plug-in to be installed in the user virtual host as an I/O agent. The plug-in intercepts user I/O requests and forwards them for replication. The advantage of this solution is that it can provide application-level consistency semantics. This architecture has no special requirements on cloud disk vendors, and it is easy to achieve disaster recovery for heterogeneous systems. The disadvantage is that users need to deploy plug-ins before use, and it may have restrictions on the version of the operating system. Third-party service providers of cloud products typically adopt this architecture.
Implementation Architecture without Agent: This implementation method is often based on the underlying storage system, relying on the consistency points provided by the storage system and obtaining data difference bitmaps for full or incremental data replication. It can provide users with data crash consistency semantics. The advantage of this method is that it can be combined with the storage system for efficient differential data replication. Besides, there is no intrusion into the user host system. The business usage method is simpler. The disadvantage is that it cannot achieve application-level data consistency. Mainstream cloud vendors usually have in-house block storage services. They usually combine the architecture features of block storage services to implement an agentless replication architecture.
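The agentless approach can be illustrated with a small sketch of a difference bitmap between two consistency points. Note the stated assumption: this sketch derives the bitmap by comparing two in-memory snapshots block by block, purely for demonstration. A real storage system would obtain the bitmap from its internal index without reading the data, and all names here are hypothetical.

```python
from typing import List

BLOCK_SIZE = 4  # bytes per block; deliberately tiny for the demo


def diff_bitmap(prev: bytes, curr: bytes, block_size: int = BLOCK_SIZE) -> List[bool]:
    """Return one flag per block: True if the block changed between the two consistency points."""
    if len(prev) != len(curr):
        raise ValueError("snapshots must have the same size")
    return [
        prev[off:off + block_size] != curr[off:off + block_size]
        for off in range(0, len(curr), block_size)
    ]


def replicate_incremental(curr: bytes, secondary: bytearray,
                          bitmap: List[bool], block_size: int = BLOCK_SIZE) -> int:
    """Copy only the changed blocks to the secondary disk; return the number of blocks copied."""
    copied = 0
    for idx, dirty in enumerate(bitmap):
        if dirty:
            off = idx * block_size
            secondary[off:off + block_size] = curr[off:off + block_size]
            copied += 1
    return copied


previous_point = b"AAAABBBBCCCCDDDD"      # consistency point already on the secondary disk
current_point = b"AAAAXXXXCCCCYYYY"       # new consistency point on the primary disk
secondary_disk = bytearray(previous_point)

bitmap = diff_bitmap(previous_point, current_point)
blocks_copied = replicate_incremental(current_point, secondary_disk, bitmap)
```

Only two of the four blocks are transferred, after which the secondary disk matches the new consistency point.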
Alibaba Cloud EBS also launched its own asynchronous replication product. This chapter will describe how Alibaba Cloud combines its architecture to implement asynchronous replication products.
The asynchronous replication feature of Alibaba Cloud EBS uses the agentless implementation method. The system architecture is shown in the figure.
The disaster recovery management software is deployed at the production site and the disaster recovery site, respectively. The disaster recovery management system periodically initiates replication tasks for the asynchronous replication I/O component to perform. The replication component obtains data changes from the backend of the disk storage system and replicates the changed data to the target area. Currently, the RPO for cross-region replication is designed as 15 minutes.
Asynchronous replication is built on a high-availability architecture. The disaster recovery management components are deployed at both the production site and the disaster recovery site, rather than at only one site or in a third zone, so that the system remains available if a site fails. The metadata of disaster recovery management is synchronized between the primary and secondary sites of the disaster recovery pair, which guarantees that the disaster recovery management functionality of the secondary site is still available if a disaster strikes the primary site. In addition, the disaster recovery management software and the replication process each adopt a high-availability architecture. All management nodes are deployed in one-primary-two-secondary mode, ensuring the continuity of the disaster recovery service.
Asynchronous replication is incremental, which minimizes the amount of data to be replicated and transmitted and improves efficiency. Relying on the efficient internal consistency point acquisition technology of the underlying storage system, a crash-consistent view of the cloud disk data can be obtained efficiently. The incremental data changes can also be obtained efficiently through the internal index technology of the storage system. The following figure shows how a storage system obtains the difference bitmap of consistency points. The obtained difference bitmap is serialized as a data change log (DCL), which is then sent to the replication component. Based on the difference bitmap, the replication component reads the consistency point data of the corresponding areas and writes it to the secondary disk.
The replication component also automatically shards the work and replicates concurrently according to the size, bandwidth, and other characteristics of the cloud disk. This improves replication efficiency and helps meet the RPO. The following figure shows how the replication component obtains the difference bitmap and executes the replication task.
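As an illustration of serializing a difference bitmap into a change log, here is a minimal sketch. The article does not describe the actual DCL format, so the layout below (a count followed by run-length extents of changed blocks) is purely an assumption, and all function names are hypothetical.

```python
import struct
from typing import List, Tuple

Extent = Tuple[int, int]  # (first changed block, number of consecutive changed blocks)


def bitmap_to_extents(bitmap: List[bool]) -> List[Extent]:
    """Collapse runs of changed blocks into (start, count) extents."""
    extents: List[Extent] = []
    start = None
    for idx, dirty in enumerate(bitmap):
        if dirty and start is None:
            start = idx
        elif not dirty and start is not None:
            extents.append((start, idx - start))
            start = None
    if start is not None:
        extents.append((start, len(bitmap) - start))
    return extents


def serialize_dcl(extents: List[Extent]) -> bytes:
    """Assumed layout: little-endian uint32 extent count, then (uint64 start, uint64 count) pairs."""
    payload = struct.pack("<I", len(extents))
    for start, count in extents:
        payload += struct.pack("<QQ", start, count)
    return payload


def deserialize_dcl(payload: bytes) -> List[Extent]:
    """Inverse of serialize_dcl, as the replication component would run it."""
    (n,) = struct.unpack_from("<I", payload, 0)
    extents: List[Extent] = []
    offset = 4
    for _ in range(n):
        start, count = struct.unpack_from("<QQ", payload, offset)
        extents.append((start, count))
        offset += 16
    return extents


bitmap = [False, True, True, False, False, True]
extents = bitmap_to_extents(bitmap)   # [(1, 2), (5, 1)]
dcl = serialize_dcl(extents)
restored = deserialize_dcl(dcl)
```

Run-length extents keep the log small when changes are clustered, which is one plausible reason to serialize the bitmap before shipping it to the replication component.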
The cloud disk is split into multiple data shards stored on different data servers. The replication component obtains the DCL of the consistent view from the storage server. It also determines how many subtasks a cloud disk is split into for replication based on the cloud disk size and bandwidth to accommodate the replication bandwidth better. The following figure shows how the replication I/O component works.
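The subtask planning and concurrent copy described above can be sketched as follows. The sizing rule (one subtask per 50 MiB/s of bandwidth, capped at 16 and at the disk size in GiB) and every name in the code are assumptions made for illustration; the article does not state the actual policy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Extent = Tuple[int, int]  # (first changed block, number of consecutive changed blocks)


def plan_subtasks(disk_size_gib: int, bandwidth_mibps: int,
                  per_task_mibps: int = 50, max_tasks: int = 16) -> int:
    """Assumed policy: enough subtasks to fill the bandwidth, bounded by a cap and the disk size."""
    by_bandwidth = max(1, bandwidth_mibps // per_task_mibps)
    return min(by_bandwidth, max_tasks, max(1, disk_size_gib))


def split_extents(extents: List[Extent], n_tasks: int) -> List[List[Extent]]:
    """Distribute changed extents round-robin across subtasks, dropping empty subtasks."""
    buckets: List[List[Extent]] = [[] for _ in range(n_tasks)]
    for i, extent in enumerate(extents):
        buckets[i % n_tasks].append(extent)
    return [b for b in buckets if b]


def copy_extents(source: bytes, target: bytearray,
                 extents: List[Extent], block_size: int) -> int:
    """Copy the given extents from the consistency point to the secondary disk."""
    copied = 0
    for start, count in extents:
        lo = start * block_size
        hi = lo + count * block_size
        target[lo:hi] = source[lo:hi]  # extents are disjoint, so subtasks never overlap
        copied += count
    return copied


def replicate_concurrently(source: bytes, target: bytearray, extents: List[Extent],
                           n_tasks: int, block_size: int) -> int:
    """Run one worker per subtask and return the total number of blocks copied."""
    parts = split_extents(extents, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        return sum(pool.map(lambda p: copy_extents(source, target, p, block_size), parts))


source = bytes(range(32))            # stand-in for the consistency point data
target = bytearray(32)               # stand-in for the secondary disk
changed = [(0, 1), (2, 2), (6, 1)]   # changed extents taken from the change log
n = plan_subtasks(disk_size_gib=100, bandwidth_mibps=200)
total = replicate_concurrently(source, target, changed, n, block_size=4)
```

Threads stand in for the replication workers here; in a real system each subtask would read from a data server and write across regions, so the number of subtasks would be tuned against network bandwidth rather than CPU.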
Relying on the efficient indexing system and the high-performance consistency point generation technology of the storage system, asynchronous replication has little impact on the performance of the cloud disk at the primary site. The primary disk still meets its officially published performance specifications.
Traditional backup services usually store data on external systems such as OSS. To use the data, you must first create a disk from the snapshot stored in OSS and load the data. This leads to a long RTO, usually minutes or longer. Asynchronous replication instead writes periodically to the secondary cloud disk. The secondary cloud disk cannot be read or written during the replication phase, but it becomes available immediately after failover, so the RTO can reach the level of seconds. This benefits from the design that keeps the disk data always online and from the storage system's ability to quickly restore to a consistency point.
This article introduces the forms of cloud disaster recovery products. Enterprises can choose the appropriate disaster recovery solutions based on their needs. It also analyzes the technical architectures of traditional and cloud disaster recovery of asynchronous replication products. Alibaba Cloud EBS also implements asynchronous replication products of block storage based on its architecture features. Compared with traditional geo-disaster recovery solutions, EBS disaster recovery has the following advantages:
Low-Cost: You do not need to bind virtual machines to use it. You only need to purchase cloud disks at the disaster recovery site instead of a backup virtual machine. You can purchase virtual machines as needed during disaster recovery. This significantly reduces the operation cost.
Ease of Use: You do not need to install an agent plug-in on the virtual machine, so replication is transparent to applications and imposes no operating system version requirements. It supports on-demand purchases, is simple to operate, and works out of the box. It also supports one-click switchover and one-click generation of disaster recovery drill disks.
High Availability: All disaster recovery components are designed for high availability to ensure the disaster recovery system can perform switchover operations in disaster scenarios.
Fast Service Recovery: It provides a short service recovery time with an RTO in seconds.
Extremely Low Performance Overhead: The performance of the primary disk is almost unaffected during replication.
Asynchronous replication products of EBS are committed to providing users with simple, efficient, easy-to-use, and low-cost geo-disaster recovery solutions. We will continue to add new product features, such as consistent replication groups, shared disk support, data process compression, and deduplication. We will enrich the use scenarios and reduce user costs to build reliable, easy-to-use, and low-cost DRaaS services. Stay tuned!