Alibaba Cloud Storage Enterprise Features: Asynchronous Replication

Data is the lifeblood of the enterprise

Remote disaster recovery for data is a near-universal requirement among enterprise customers, especially large customers in government and finance. In the age of big data, data is an enterprise's core asset and its lifeline. Disasters do happen, and when they do, disaster recovery capability can decide whether the enterprise survives.

In the September 11, 2001 attacks in the United States, the collapse of the Twin Towers destroyed the data centers of several banks. Deutsche Bank, which had backed up its data dozens of kilometers away, quickly resumed business and was praised by its customers, while Bank of New York suffered severe and prolonged disruption to its operations.

In March 2021, a fire broke out at a data center of OVHcloud, the largest data center operator in France, affecting more than 3.5 million websites.

During the "7.20" flood in Zhengzhou in July 2021, the River Hospital Area of the First Affiliated Hospital of Zhengzhou University was hit by continuous heavy rain. The entire hospital area was flooded and lost power. The hospital activated its remote disaster recovery mechanism and took only 15 minutes to switch its core business to the core data center in the east hospital area, keeping the other two hospital areas operating normally.

Disaster recovery on cloud becomes a trend

These real cases are a wake-up call, and enterprises keep increasing their investment in data protection and disaster recovery. Traditional disaster recovery schemes often require an enterprise to build its own disaster recovery center, lease dedicated lines, and assign staff to operation and maintenance, all of which is expensive. With the rapid development of cloud computing, more and more enterprise customers are considering disaster recovery on the cloud. A cloud disaster recovery service, usually called DRaaS (disaster recovery as a service), saves both the cost of a self-built disaster recovery center and the subsequent operation and maintenance costs. It helps customers quickly set up a cross-region disaster recovery scheme that can be provisioned on demand and released when no longer needed, which gives users great flexibility. The following table compares DRaaS with traditional disaster recovery solutions. Compared with traditional solutions, DRaaS requires no infrastructure, needs less operation and maintenance, and is highly flexible. For these reasons, DRaaS has become the trend in disaster recovery.



Asynchronous replication of ESSD cloud disk

ESSD is the flagship product of Alibaba Cloud Block Storage and has steadily matured. To better serve enterprise customers and meet their cloud disaster recovery needs, Alibaba Cloud Storage has launched its own DRaaS product, cloud disk asynchronous replication, which replicates cloud disks asynchronously across regions. This article explains how users can choose an appropriate cloud disaster recovery product, analyzes the similarities and differences between disaster recovery architectures from a technical perspective, and then describes how we chose a disaster recovery architecture for ESSD and the technical principles behind cloud disk asynchronous replication.

How do enterprises choose cloud disaster recovery solutions

Select the appropriate disaster recovery type according to RPO and RTO

When selecting a disaster recovery scheme, an enterprise should first determine the disaster recovery level it needs according to its business characteristics. In the disaster recovery field, RPO (Recovery Point Objective) measures the maximum window of data, expressed as a length of time, that the system may lose in a disaster, and RTO (Recovery Time Objective) measures the maximum length of time from the disaster to the recovery of the entire system.
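To make the two metrics concrete, the following minimal Python sketch measures one incident against its targets. The function name, the field names, and the timestamps are illustrative assumptions and do not correspond to any Alibaba Cloud API.

```python
from datetime import datetime, timedelta

def measure_incident(last_consistent_copy, disaster_time, service_restored,
                     rpo_target, rto_target):
    """Compare one incident against RPO/RTO targets (all names are illustrative)."""
    # Data written after the last consistent copy is lost.
    data_loss_window = disaster_time - last_consistent_copy
    # Downtime runs from the disaster until the service is usable again.
    downtime = service_restored - disaster_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "meets_rpo": data_loss_window <= rpo_target,
        "meets_rto": downtime <= rto_target,
    }

# Hypothetical incident: last copy at 10:00, disaster at 10:12, service back at 10:13.
result = measure_incident(
    last_consistent_copy=datetime(2022, 1, 1, 10, 0),
    disaster_time=datetime(2022, 1, 1, 10, 12),
    service_restored=datetime(2022, 1, 1, 10, 13),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=5),
)
```

In this made-up incident, 12 minutes of data are lost and the service is down for 1 minute, so both targets are met.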

China has issued a national standard that divides disaster recovery capability into six levels, as shown in the figure below.

For enterprises, the higher the level from Level 1 to Level 6, the lower the risk of data loss, but the higher the cost of disaster recovery. In the traditional storage industry, data backup and archiving products usually meet level one to two requirements, the backup function of ordinary storage arrays meets level three to five requirements, the asynchronous replication function of high-end storage arrays meets level four to five requirements, and the synchronous replication and active-active functions of high-end storage, together with application-based replication, meet level five to six requirements.

On the cloud, the major cloud vendors also offer a wide range of products for the different disaster recovery levels. Cloud disaster recovery centers usually provide cross-region or cross-availability-zone disaster recovery services that meet level one to four requirements, while asynchronous replication and synchronous replication products meet level five to six requirements. Mainstream applications, such as databases, usually have their own disaster recovery products, which achieve the finest disaster recovery granularity, at the IO level.

As the levels above show, asynchronous replication meets level four to five disaster recovery requirements, and it is widely needed by financial customers such as banks and by government agencies.

Select appropriate disaster recovery service according to system characteristics

In terms of implementation, the disaster recovery schemes offered by cloud vendors fall roughly into three categories: application based, virtual machine (instance) based, and block storage based:

Application based

This type of disaster recovery scheme usually targets a specific application service, such as a cloud database, a message queue, or object storage. Users of such a cloud service can choose the disaster recovery service of the corresponding product according to their needs. The advantage is that, because the service is integrated with the business, it can achieve application-level consistency. The disadvantage is that it is not general purpose: only business built on that specific application can use it.

Virtual machine based

Users who purchase only IaaS services, run their own customized services, or find that application-level disaster recovery services do not meet their needs can choose a virtual machine based disaster recovery scheme. This scheme provides data consistency protection for a whole machine, or across instances. In addition to the storage data, the disaster recovery site usually also restores the host network, which makes the scheme relatively convenient to use. The advantages of this disaster recovery service are simple operation and broad applicability. The disadvantage is that host resources must be purchased at the disaster recovery site, so the cost is relatively high.

Block based storage (cloud disk)

The core of disaster recovery is data disaster recovery. Some vendors have therefore introduced cross-region replication products for the cloud disk itself. Such a product is flexible and generally places no restrictions on applications. During replication it does not require a host to be purchased at the disaster recovery site, which reduces user costs. It can also be combined seamlessly with other cloud services to achieve an effect similar to application-level disaster recovery. Cloud disks can also be placed in a consistency group, in which case the replicated data of the whole group of cloud disks is crash consistent.

Asynchronous replication of ESSD cloud disk, simple steps to help you with business recovery

The cloud disk asynchronous replication function replicates cloud disk data asynchronously across regions with an RPO of 15 minutes. Users can create a disaster recovery pair in three simple steps: first, select the cloud disk to be replicated; second, select a disaster recovery site and create a slave disk; third, create and activate the disaster recovery pair. Once the disaster recovery pair is activated, the cloud disk data is periodically replicated to the corresponding slave disk at the disaster recovery site. To pause replication temporarily, users can use the stop function of the disaster recovery pair.

When a failure occurs, the user can use the failover function to switch between the primary and standby sites. The failover disconnects the replication link and restores the slave disk to the last replicated consistency point, giving the user read-write access to it.

After the disaster is over, if you want to move the business back to the original production site, you can use the reverse recovery function to replicate the incremental data generated at the disaster recovery site back to the primary site.
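The operations described above (create, activate, stop, failover, reverse recovery) can be pictured as a small state machine. The sketch below is only an illustration of that life cycle. The class, state, and operation names are hypothetical and are not the product's API, and which operations are allowed in which state (for example, failover from a stopped pair) is an assumption.

```python
from enum import Enum

class PairState(Enum):
    CREATED = "created"
    REPLICATING = "replicating"
    STOPPED = "stopped"
    FAILED_OVER = "failed_over"
    REVERSE_REPLICATING = "reverse_replicating"

# Allowed operations per state: operation -> next state (assumed, for illustration).
TRANSITIONS = {
    PairState.CREATED: {"activate": PairState.REPLICATING},
    PairState.REPLICATING: {"stop": PairState.STOPPED,
                            "failover": PairState.FAILED_OVER},
    PairState.STOPPED: {"activate": PairState.REPLICATING,
                        "failover": PairState.FAILED_OVER},
    PairState.FAILED_OVER: {"reverse_recover": PairState.REVERSE_REPLICATING},
    PairState.REVERSE_REPLICATING: {},
}

class DisasterRecoveryPair:
    """Hypothetical model of a primary disk paired with a slave disk."""

    def __init__(self, primary_disk, slave_disk):
        self.primary_disk = primary_disk
        self.slave_disk = slave_disk
        self.state = PairState.CREATED
        self.slave_writable = False  # the slave disk is not usable while replicating

    def apply(self, operation):
        allowed = TRANSITIONS[self.state]
        if operation not in allowed:
            raise ValueError(f"{operation!r} is not allowed in state {self.state.name}")
        self.state = allowed[operation]
        if operation == "failover":
            # Failover cuts the replication link and opens the slave disk for read-write.
            self.slave_writable = True
        return self.state

pair = DisasterRecoveryPair("disk-primary", "disk-slave")
history = [pair.apply(op) for op in
           ("activate", "stop", "activate", "failover", "reverse_recover")]
```

Modeling the pair this way makes invalid sequences, such as a failover before the pair has ever been activated, fail loudly instead of silently.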

Technology behind asynchronous replication on the cloud

This section discusses the implementation principles of the asynchronous replication technology widely used in disaster recovery products, and how it compares with traditional storage architectures. The core of disaster recovery is data disaster recovery, and block storage disaster recovery is the most common scheme. The discussion below therefore focuses on asynchronous replication of cloud disks.

Replication architecture for traditional storage

There are roughly three ways to implement traditional storage asynchronous replication:

Storage gateway based: The storage gateway sits between the server and the storage device. It is a storage service technology built on the SAN network, and it can provide flexible and diverse storage services for the IO streams that pass through it. Because the gateway is separate from the host server and the array, it does not consume host or storage resources and can easily support replication between heterogeneous systems. However, the gateway adds an extra hop to the IO path, which causes some performance loss, so this approach is not suitable for businesses with high performance requirements.

Host based: In SAN storage, this is usually implemented on the initiator side. IO is duplicated on the host according to the replication requirements. A typical implementation is DRBD. This architecture places no requirements on the back-end storage array, but the corresponding software must be installed on the host. Most third-party vendors that provide disaster recovery services use this architecture.

Storage array based: Most storage array vendors implement a replication architecture tailored to the characteristics of their own arrays. Under this architecture, the vendor tracks and double-writes data at the target end according to the IO architecture of the array.

Two technical architectures for asynchronous replication on the cloud

Cloud vendors usually build on the characteristics of their own products, which leads to two technical architectures:

Agent-based architecture: A plug-in is installed in the user's virtual host as an IO agent. The plug-in replicates data by intercepting and forwarding user IO requests. The advantage of this scheme is that it can provide application-level consistency semantics. It places no special requirements on the cloud disk vendor and makes disaster recovery between heterogeneous systems easy to achieve. The disadvantage is that users must deploy the plug-in, which may restrict the operating system versions they can use. Third-party service providers for cloud products usually adopt this architecture.

Agentless architecture: This approach is usually built on the underlying storage system. It relies on the consistency points provided by the storage system and on techniques such as data difference bitmaps to perform full or incremental replication, and it provides users with crash consistency semantics. The advantage is that replication of the changed data is efficient because it is integrated with the storage system, and that nothing intrudes on the user's host system, so the service is simpler to use. The disadvantage is that it cannot achieve application-level data consistency. Mainstream cloud vendors usually develop their own block storage services and build an agentless replication architecture on the architectural characteristics of that block storage.

Alibaba Cloud ESSD Cloud Disk Asynchronous Replication Architecture

Alibaba Cloud block storage has also introduced its own asynchronous replication product. This section describes how Alibaba Cloud implements asynchronous replication on top of its own architecture.

The asynchronous replication function of Alibaba Cloud storage uses the agentless approach. The system architecture is shown in the figure. The disaster recovery management software is deployed at both the production site and the disaster recovery site. The disaster recovery management and control system periodically issues replication tasks to the asynchronous replication IO components. The replication components obtain the data differences from the back end of the cloud disk storage system and replicate the changed data to the target region. At present, the RPO design goal for cross-region replication is 15 minutes.

Highly available architecture

Asynchronous replication is built on a highly available architecture. So that the system remains usable in failure scenarios, the disaster recovery management component is deployed at both the production site and the disaster recovery site, rather than at one site only or in a third region. The disaster recovery management metadata is synchronized between the primary and secondary sites of the disaster recovery pair, which ensures that the management function at the secondary site is still available if a disaster strikes the primary site. In addition, the disaster recovery management software and the replication link each adopt a high availability architecture: all control nodes are deployed with one active node and two standby nodes, ensuring the continuity of the disaster recovery service itself.
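The article does not say how the active control node is chosen. As a rough illustration of the one-active, two-standby idea only, the sketch below picks the first healthy node in a fixed priority order. The node names and the static-priority rule are assumptions, and a production system would normally rely on a lease or consensus mechanism instead.

```python
def elect_active(nodes, healthy):
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if node in healthy:
            return node
    return None

# Hypothetical control nodes, listed in priority order: one active, two standby.
control_nodes = ["ctrl-1", "ctrl-2", "ctrl-3"]

active = elect_active(control_nodes, healthy={"ctrl-1", "ctrl-2", "ctrl-3"})
# If ctrl-1 fails, a standby takes over and the service keeps running.
after_failure = elect_active(control_nodes, healthy={"ctrl-2", "ctrl-3"})
```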

Efficient replication

Asynchronous replication is incremental, which minimizes the amount of data that must be replicated and transmitted and thereby improves replication efficiency. Relying on the underlying storage system's efficient internal technique for acquiring consistency points, we can efficiently obtain a crash-consistent view of the cloud disk's data, and the storage system's internal index lets us efficiently obtain the incremental difference between views. The following figure shows how the storage system obtains the difference bitmap between consistency points. The difference bitmap is serialized into a data change log (DCL) and sent to the replication component, which reads the consistency-point data of the corresponding regions according to the difference bitmap and writes it to the slave disk.
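The flow from difference bitmap to DCL to slave disk can be sketched in a few lines of Python. This is a toy model under stated assumptions: the real system derives the bitmap from the storage system's internal index rather than by comparing data, and the block size, the function names, and the (offset, length) DCL format are all invented here for illustration.

```python
BLOCK_SIZE = 4  # bytes per block; tiny so the example is easy to follow

def diff_bitmap(prev_point, curr_point, block_size=BLOCK_SIZE):
    """Mark each block that differs between two consistency points.

    Stand-in only: comparing data replaces the index lookup a real system uses.
    """
    n_blocks = (len(curr_point) + block_size - 1) // block_size
    bitmap = []
    for i in range(n_blocks):
        lo, hi = i * block_size, (i + 1) * block_size
        bitmap.append(prev_point[lo:hi] != curr_point[lo:hi])
    return bitmap

def to_dcl(bitmap, block_size=BLOCK_SIZE):
    """Serialize the bitmap into (offset, length) extents, merging adjacent blocks."""
    dcl = []
    for i, dirty in enumerate(bitmap):
        if not dirty:
            continue
        offset = i * block_size
        if dcl and dcl[-1][0] + dcl[-1][1] == offset:
            dcl[-1] = (dcl[-1][0], dcl[-1][1] + block_size)
        else:
            dcl.append((offset, block_size))
    return dcl

def replicate(dcl, curr_point, slave_disk):
    """Copy only the changed extents from the consistency point to the slave disk."""
    for offset, length in dcl:
        slave_disk[offset:offset + length] = curr_point[offset:offset + length]

prev_point = b"AAAABBBBCCCCDDDD"      # previous consistency point
curr_point = b"AAAAXXXXYYYYDDDD"      # current consistency point
slave_disk = bytearray(prev_point)    # slave disk holds the previous point

bitmap = diff_bitmap(prev_point, curr_point)
dcl = to_dcl(bitmap)
replicate(dcl, curr_point, slave_disk)
```

Only the two changed blocks in the middle are transferred, and the slave disk ends up identical to the current consistency point.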

The replication link also automatically shards the replication process according to the size and bandwidth of the cloud disk and replicates the shards concurrently, which improves replication efficiency and minimizes RPO. The following figure shows how the replication component obtains the difference bitmap and the replication task. The cloud disk is split into multiple data segments according to its size, and the segments are stored on different data servers. The replication component obtains the DCL of the consistency view from the storage servers. Based on the size and bandwidth of the cloud disk, it decides how many subtasks the replication of a cloud disk is split into, so that the replication bandwidth is used well. The following figure shows how the replication IO components work.
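A minimal sketch of splitting one replication task into concurrent subtasks follows. The sizing rule (one subtask per 100 Mbps of bandwidth, capped at 8 and at the number of extents), the greedy balancing, and all names are assumptions made for illustration. The article does not describe the actual policy.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_subtasks(dcl, bandwidth_mbps, per_task_mbps=100, max_tasks=8):
    """Split DCL extents into subtasks; the sizing rule is a made-up heuristic."""
    if not dcl:
        return []
    n_tasks = max(1, min(max_tasks, bandwidth_mbps // per_task_mbps, len(dcl)))
    # Greedy balancing: largest extents first, each to the least-loaded subtask.
    tasks = [[] for _ in range(n_tasks)]
    loads = [0] * n_tasks
    for offset, length in sorted(dcl, key=lambda e: e[1], reverse=True):
        i = loads.index(min(loads))
        tasks[i].append((offset, length))
        loads[i] += length
    return tasks

def copy_extents(extents, source, target):
    """Copy the given extents and return the number of bytes copied."""
    for offset, length in extents:
        target[offset:offset + length] = source[offset:offset + length]
    return sum(length for _, length in extents)

source = bytes(range(256)) * 4       # 1024-byte consistency view
target = bytearray(len(source))      # empty slave disk
dcl = [(0, 128), (256, 64), (512, 256), (896, 128)]  # changed extents

tasks = plan_subtasks(dcl, bandwidth_mbps=300)
# The extents are disjoint, so the subtasks can write concurrently.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    copied = sum(pool.map(lambda t: copy_extents(t, source, target), tasks))
```

With 300 Mbps available, the four extents are spread over three subtasks of roughly equal size, so no single subtask holds up the replication cycle.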

No impact on primary disk performance

Relying on the storage system's efficient index and high-performance consistency point generation, asynchronous replication has a negligible impact on the performance of the cloud disk at the primary site. The performance of the primary disk fully meets the officially published specifications.

Second-level RTO

Traditional backup services usually store data on an external system such as OSS. To use a snapshot stored in OSS, the user must create a disk and load the data onto it, which results in a longer RTO, usually minutes or more. Cloud disk asynchronous replication instead writes periodically to the slave cloud disk. The slave cloud disk cannot be read or written during replication, but it is available immediately after a failover, so the RTO can reach the level of seconds. This is possible because the disk data is always online by design and because the storage system's architecture can quickly restore the disk to a consistency point.

Summary and outlook

This article introduced the forms that cloud disaster recovery products take, so that enterprises can choose a disaster recovery solution that fits their own characteristics. For asynchronous replication products, it analyzed the technical architectures of traditional disaster recovery and cloud disaster recovery. Alibaba Cloud block storage has implemented its own asynchronous replication product for block storage based on the features of its architecture. Compared with traditional remote disaster recovery solutions, this block storage disaster recovery solution has the following advantages:

Low cost: The service is not tied to virtual machines. Users only need to purchase cloud disks at the disaster recovery site, not standby virtual machines. Virtual machines can be purchased as needed during disaster recovery, which greatly reduces operating costs.

Ease of use: No agent plug-in needs to be installed in the user's virtual machine, so the application is unaware of replication and there is no version requirement for the host operating system. The service can be purchased and used on demand, and it supports one-click switchover and one-click creation of disaster recovery drill disks.

High availability: The disaster recovery components are designed for high availability, ensuring that the disaster recovery system can still perform switchover operations in disaster scenarios.

Rapid service recovery: Service recovery time is short, and the RTO can reach the level of seconds.

Very low performance overhead: The performance of the primary disk is almost unaffected during replication.

The block storage asynchronous replication product is committed to providing users with a simple, efficient, easy-to-use, and low-cost remote disaster recovery solution. In the future, we will continue to enrich the product by introducing consistent replication groups, shared disk support, data link compression and deduplication, and other features, so as to cover more usage scenarios, reduce users' costs, and build a reliable, easy-to-use, and low-cost DRaaS service. Stay tuned.
