By Fanjun from Alibaba Cloud Storage Team
This article introduces how the snapshot service of Alibaba Cloud Block Storage (EBS) improves its performance with high-performance ESSD in the cloud-native era, providing a lightweight and real-time user experience. The technologies used in this process will also be introduced. Based on the industry development and cloud data protection scenarios, the snapshot service of Alibaba Cloud EBS provides enterprise users and backup vendors with technical solutions for data protection based on the advanced features of snapshots. These solutions can meet the urgent needs of data protection for cloud users and ensure business continuity for cloud enterprises.
In July 2021, Gartner released the Magic Quadrant report on public cloud Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) platforms. With its leading technical capabilities, Alibaba Cloud was recognized as a Visionary by Gartner for the first time as a public cloud vendor. Alibaba Cloud EBS won first place in a single item, and Alibaba Cloud scored first in computing, storage, network, and security worldwide. High-performance cloud disk products, such as ESSD, provide users with high-availability, high-reliability, and high-performance block-level random access services and native snapshot data protection capabilities.
With the development of cloud-native technology, more enterprises have built large-scale distributed business scenarios that scale elastically. These scenarios rely on the virtualization and elastic expansion of cloud computing, the booming distributed frameworks of cloud-native technology, containers, orchestration systems, continuous delivery, and rapid iteration. As the deployment scale of enterprise applications grows, the demand for storage, computing, and other resources increases exponentially. As a result, traditional data protection solutions cannot handle the new technological changes on the cloud. Users face an increasingly competitive market environment. They urgently need cloud data protection solutions that can adapt to their business scale and development to increase competitiveness and meet business development needs. Cloud computing and cloud-native technology have changed the business background and scenarios of data protection, but users' demands for data protection have not changed. The measures are still Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
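As a rough illustration of the two metrics, the sketch below computes the data-loss window (the RPO actually achieved) and the recovery time (the RTO actually achieved) for one incident. The timeline and the function names are hypothetical and are not taken from any Alibaba Cloud API.

```python
from datetime import datetime, timedelta


def achieved_rpo(snapshot_times, failure_time):
    """Data-loss window: time between the last usable snapshot and the failure."""
    usable = [t for t in snapshot_times if t <= failure_time]
    if not usable:
        raise ValueError("no snapshot precedes the failure")
    return failure_time - max(usable)


def achieved_rto(failure_time, restored_time):
    """Downtime: time between the failure and the moment service is restored."""
    return restored_time - failure_time


# Hypothetical timeline: one snapshot every 10 minutes starting at 12:00.
base = datetime(2022, 1, 1, 12, 0, 0)
snapshots = [base + timedelta(minutes=10 * i) for i in range(4)]
failure = base + timedelta(minutes=27)
restored = base + timedelta(minutes=29)

rpo = achieved_rpo(snapshots, failure)   # last snapshot was taken at 12:20
rto = achieved_rto(failure, restored)
print(f"RPO achieved: {rpo}, RTO achieved: {rto}")
```

Shortening the snapshot interval bounds the first number, and faster cloning or rollback bounds the second.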
The primary goal pursued by users is still business continuity, referring to the rapid business recovery from business interruption and rapid business expansion when the business is growing. Users have the following urgent requirements for data protection and snapshot services on the cloud for their diverse business scenarios.
According to the Storage Networking Industry Association (SNIA), a snapshot is a fully usable copy of a specified data set. The copy includes an image of the corresponding data at a certain time point (the time point when the copy starts). Snapshots of Alibaba Cloud EBS provide consistent data images of ESSD at a certain point in time. The snapshot service continues to discover users' needs and scenarios in line with the development trend of the industry. Our team has worked continuously to develop new functions, drive iteration and evolution, and optimize the advanced enterprise features of ESSD snapshots. These features include rapid snapshot availability, application consistency snapshots, consistency group snapshots suited to distributed application architectures, and cross-region replication of snapshots for geo-disaster recovery. Through both standalone delivery and integration with other products, we met the needs of cloud enterprise users and served big data, games, artificial intelligence, finance, and other industries. We also received feedback from users and other Alibaba Cloud teams, such as the ApsaraDB RDS Team, Hybrid Backup Recovery (HBR) Team, Elastic Container Instance (ECI) Team, and Alibaba Cloud Container Service for Kubernetes (ACK) Team.
Lightweight, real-time snapshots with the extremely quick availability feature, advanced features of consistency group snapshots, and application consistency snapshots help enterprise users and third-party backup vendors quickly build application scenarios. These scenarios include fast backup and recovery, disaster recovery testing, copy utilization, and Copy Data Management (CDM) for disaster recovery transition. Container backup, cloud data backup, and CDM are listed as industry trends for data protection in the next few years. The prediction is made in the analysis of technology trends (Hype Cycle) on storage and data protection released by Gartner in July 2021. The definition of CDM by Gartner is that the primary storage snapshot creates Golden Image in the secondary storage based on application consistency and uses it for backup, disaster recovery, and testing. Heterogeneous storage is the foundation of the capability. The advanced snapshot service features of Alibaba Cloud ESSD fully meet the CDM building requirements and can help users achieve the typical scenario of native data protection: cloud copy data management.
Backup Recovery: A combination of fast backup and standard backup provides flexible backup choices. Rapidly available snapshots are created regularly to protect whole ECS instances on the cloud and container applications in the Kubernetes environment. After the consistency group snapshot feature and the rapid availability feature are enabled, the generation interval of local instant snapshots can be shortened to several seconds. Instant copies of snapshots are retained locally and serve as high-speed backups that can be recovered within seconds without loss of I/O performance. Upper-layer enterprise applications periodically generate application consistency snapshots of the whole machine. At the same time, the local snapshot copy is uploaded to Object Storage Service (OSS) over the network as a standard backup. After the backup data is uploaded, the standard backup is visible in all availability zones of the region, which makes it suitable for storing historical data for a long time.
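The two-tier idea (frequent local copies for fast recovery, fewer uploaded copies for long-term retention) can be sketched as a simple planning function. This is only an illustration: the `Snapshot` type, the `plan_tiers` function, and the "keep the newest N locally, archive every k-th" rule are assumptions made for the example, not the policy the service applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    seq: int          # creation order, starting at 1
    created_at: int   # seconds since the schedule started


def plan_tiers(snapshots, keep_local, archive_every):
    """Split snapshots into a local high-speed tier and a standard (uploaded) tier.

    Hypothetical rule: keep the newest `keep_local` snapshots as local copies,
    and upload every `archive_every`-th snapshot as a standard backup.
    """
    ordered = sorted(snapshots, key=lambda s: s.seq)
    local = ordered[-keep_local:] if keep_local > 0 else []
    standard = [s for s in ordered if s.seq % archive_every == 0]
    return local, standard


# One snapshot every 10 seconds, 12 snapshots in total.
snaps = [Snapshot(seq=i, created_at=i * 10) for i in range(1, 13)]
local, standard = plan_tiers(snaps, keep_local=3, archive_every=6)

print("local tier:   ", [s.seq for s in local])
print("standard tier:", [s.seq for s in standard])
```

A snapshot can appear in both tiers: the newest one is kept locally for fast rollback and is also uploaded for long-term retention.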
Disaster Recovery Test: Based on High-Speed Backup. CDM requires periodic testing of the disaster recovery environment. Periodic testing improves the reliability of the disaster recovery environment. It catches improper configurations and environmental changes that would otherwise block the disaster recovery transition when a disaster occurs and prevent the business from recovering quickly on the disaster recovery system. Based on the fast cloning technology of local snapshot copies, disaster recovery instances and container applications are periodically brought up from the backups for testing and validation. In traditional solutions based on replication technology, you need to wait for the snapshot to be replicated and available on the disaster recovery side before you can perform test drills. With the high-speed backup method, cloning, mounting, and startup testing on the disaster recovery side can be completed within seconds.
Copy Utilization: Data Analysis Based on High-Speed Backup. Without affecting the production environment, container applications are started in the disaster recovery environment based on the rapid cloning technology, and big data computing and analysis are performed on the copy to explore the value of the data. Copy utilization is also seen in practice: a MySQL database application can start read-only secondary databases from high-speed backups for offline data analysis.
Disaster Recovery Transition: Business is switched from the production environment to the disaster recovery environment. When production is interrupted by a disaster and cannot be resumed in a short time, the business switches from the production center to the disaster recovery center. After the production center is recovered, the business is switched back from the disaster recovery center.
Compared with the traditional CDM solutions, the cloud computing environment and the cloud-native environment have a large-scale flexible and homogeneous computing environment. Enterprise users do not need to invest in equipment resources and software. High-speed backup and cloning technology significantly reduce the RTO in copy development, testing, and disaster recovery transition. The unified backup data format of the cloud snapshot service reduces the number of copies required in various management processes, thus eliminating data format compatibility issues between backup software.
We have made a lot of optimizations for distributed snapshot algorithms and implementations, allowing users to put aside concerns about performance and perform lightweight, real-time data protection at any time.
Instant access can meet users' needs in real-time data protection and DevOps quick orchestration.
The snapshot service with instant availability can be used for data backup, compliance scenarios, and long-term archiving, and it also allows cloud disk data to be backed up to Alibaba Cloud OSS with one click. Combined with local snapshot copies taken at intervals of seconds, it forms a flexible snapshot protection strategy that enables lightweight snapshot creation, fast cloning available in real time, and lossless rollback within seconds.
Instant Cloning: In a cross-availability zone disaster recovery environment isolated from production, cloning new disks from snapshots provides writable copies for application testing and validation and for service recovery preparation. This relieves pressure on production business in the cloud and enables horizontal business expansion. For example, when a MySQL database application scales out horizontally or builds secondary databases, instances for read/write splitting need to be created and started within seconds. Through lazy loading, instant cloning makes local snapshot copies available within seconds in the local region and across clusters, so new disks can be cloned quickly and instances started within seconds.
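The following toy model shows the general idea of lazy loading: a cloned disk is usable immediately, and a block is fetched from the snapshot only the first time it is read. The `LazyClone` class and the dictionary-backed snapshot are assumptions for illustration and say nothing about how ESSD implements cloning internally.

```python
class LazyClone:
    """A new disk cloned from a snapshot; blocks are fetched on first access."""

    def __init__(self, snapshot_blocks):
        self._source = snapshot_blocks  # stands in for the snapshot's data
        self._local = {}                # blocks already present on the new disk
        self.fetches = 0                # number of blocks pulled from the snapshot

    def read(self, block_no):
        if block_no not in self._local:
            # First access: load the block from the snapshot (zero-filled if absent).
            self._local[block_no] = self._source.get(block_no, b"\x00")
            self.fetches += 1
        return self._local[block_no]

    def write(self, block_no, data):
        # A whole-block overwrite never needs the old contents, so nothing is fetched.
        self._local[block_no] = data


snapshot = {0: b"boot", 1: b"data"}
clone = LazyClone(snapshot)      # available at once, no bulk copy

clone.write(1, b"new")           # the clone is writable
first = clone.read(0)            # fetched from the snapshot on demand
second = clone.read(0)           # served locally the second time
print(first, clone.read(1), clone.fetches)
```

The snapshot itself is never modified, so many clones can be created from the same copy.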
Rollback within Seconds: The data of the local snapshot copy is stored in the cloud disk, implementing I/O lossless rollback and recovery within seconds. The snapshot generation process is based on the improved ROW technology and holographic indexing. As the data blocks written to the ESSD change, the cloud disk read performance is optimized based on the optimal mode of ESSD I/O read performance. There is no need to pull data from the remote OSS instances. The I/O performance of rollback within seconds is lossless.
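For readers unfamiliar with redirect-on-write, here is a minimal textbook-style sketch. New writes go to fresh locations, a snapshot freezes only the block index, and rollback swaps the index back without moving data. The article does not describe the improved ROW variant or the indexing scheme in detail, so the `RowVolume` class below is a generic illustration and not the ESSD implementation.

```python
class RowVolume:
    """Toy redirect-on-write volume: data is append-only, snapshots copy the index."""

    def __init__(self):
        self._store = []       # append-only physical blocks
        self._map = {}         # logical block number -> position in _store
        self._snapshots = {}   # snapshot id -> frozen copy of the index

    def write(self, block_no, data):
        self._store.append(data)                    # redirect to a new location
        self._map[block_no] = len(self._store) - 1  # old data stays untouched

    def read(self, block_no):
        pos = self._map.get(block_no)
        return None if pos is None else self._store[pos]

    def snapshot(self, snap_id):
        self._snapshots[snap_id] = dict(self._map)  # index only, no data copy

    def read_snapshot(self, snap_id, block_no):
        pos = self._snapshots[snap_id].get(block_no)
        return None if pos is None else self._store[pos]

    def rollback(self, snap_id):
        self._map = dict(self._snapshots[snap_id])  # swap the index back


vol = RowVolume()
vol.write(0, b"v1")
vol.snapshot("s1")
vol.write(0, b"v2")                      # does not overwrite the snapshot's block
before = vol.read(0)
frozen = vol.read_snapshot("s1", 0)
vol.rollback("s1")
after = vol.read(0)
print(before, frozen, after)
```

Because rollback only changes the index and the data is already on the disk, no data has to be pulled from remote storage, which is the property the paragraph above relies on.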
After multiple rapidly available snapshots are created and a rollback is initiated, the read performance of the cloud disk is not affected. We also tested the cloud disk product of one of our competitors: after multiple local snapshots were kept on the cloud disk, its I/O read performance showed varying degrees of latency jitter.
Container environments and ECS instances need to protect stateful applications associated with multiple disks. The major problems with single-disk snapshots are listed below:
Stateful applications use LVM, Windows dynamic disks, and file systems as persistent storage media, and backups based on single-disk snapshots produce data errors. For performance and data security, database applications place log files (WAL, write-ahead log) and data files on different storage devices. As a result, system backup and disaster recovery cannot be carried out regularly.
Besides stateful applications deployed in a Kubernetes Pod or on a single ECS instance, there are distributed application deployment architectures, application high-availability clusters (such as Windows Failover Cluster), highly available architectures with primary and secondary application servers, and shared storage-based application architectures such as Oracle RAC. These distributed architectures also require data consistency protection across cloud disks and across nodes.
The cloud computing storage backend often uses a distributed storage architecture. The lack of a global logical clock in a distributed environment makes it difficult to implement consistency group snapshots for a single ECS instance, multiple ECS instances, a single Pod in the Kubernetes environment, and multiple cloud disks across nodes. Keeping the impact of snapshots on I/O performance to a minimum is even more technically challenging. The implementation technology in the industry for multi-disk crash consistency snapshots is divided into two major categories.
Consistency group snapshot takes the second approach. It pursues lossless I/O performance and minimal impact of snapshots on application performance.
Implementation Principle: An I/O-based sequencing algorithm is adopted, so snapshot creation does not require blocking write I/O. Many users worry that creating snapshots will affect I/O performance, so they only perform snapshot data protection during business troughs. The optimized multi-disk consistency group snapshot algorithm made a breakthrough in snapshot I/O performance. Based on the write order preservation mechanism, we mark and sequence write I/O according to the order in which it reaches the underlying storage. The I/O data sets to include in the snapshot are then determined from the snapshot completion time point and the I/O sequence. Unlike the traditional approach, the sequencing process does not block write I/O. In addition, snapshot generation uses the redirect-on-write (ROW) method rather than the traditional copy-on-write (COW) method, and the background process that generates data set references has no impact on the I/O path. The impact of snapshots on I/O performance is reduced to a minimum, realizing lossless I/O performance in the read-write scenarios of database services.
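The sequencing idea can be sketched with a toy model: every write is tagged with a sequence number in arrival order, taking a snapshot only records a cut point, and each disk's image contains exactly the writes at or before that point. The `ConsistencyGroup` class is hypothetical, and the single in-process counter is a deliberate simplification. As noted earlier, a real distributed backend has no global logical clock, and this sketch does not show how that problem is solved.

```python
import itertools


class ConsistencyGroup:
    """Toy multi-disk group: writes are sequenced, snapshots never block writes."""

    def __init__(self, disk_names):
        self._counter = itertools.count(1)
        self._last_seq = 0
        # Per-disk list of (sequence number, block number, data) in arrival order.
        self.log = {name: [] for name in disk_names}

    def write(self, disk, block_no, data):
        seq = next(self._counter)          # mark the write with its arrival order
        self._last_seq = seq
        self.log[disk].append((seq, block_no, data))
        return seq

    def mark_snapshot(self):
        """Record the cut point only; writers are not paused."""
        return self._last_seq

    def materialize(self, cut):
        """Build each disk's image from the writes sequenced at or before the cut."""
        image = {}
        for disk, entries in self.log.items():
            blocks = {}
            for seq, block_no, data in entries:
                if seq <= cut:
                    blocks[block_no] = data   # later writes overwrite earlier ones
            image[disk] = blocks
        return image


group = ConsistencyGroup(["wal", "data"])
group.write("wal", 0, b"txn1-log")
group.write("data", 7, b"txn1-row")
group.write("wal", 1, b"txn2-log")
cut = group.mark_snapshot()
group.write("data", 8, b"txn2-row")   # arrives after the cut, so it is excluded
image = group.materialize(cut)
print(cut, image)
```

Because the cut respects write order, the image can contain a log record without its data page, which crash recovery can replay, but never a data page whose log record is missing.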
For the database application test in a high-IOPS scenario, we used two disks, two clients, 4 TB of capacity, random writes, iodepth=16, jobs=1, and a block size of 16 KB. We tested the influence of the snapshot creation process on I/O performance. For competitor 1 and competitor 2, the impact on I/O performance was higher by almost 1-3 times.
The consistency types of ESSD snapshot data mainly include crash consistency and application consistency. Crash consistency requires the file system and the application to have crash recovery capability. It is characterized by a low RPO and low business impact. However, an RTO within seconds and highly reliable data backup cannot be achieved in the following scenarios.
Implementation Principle: Compared with traditional backup methods, the value of application consistency snapshots to users lies in providing cloud-native, agentless application consistency snapshots. This reduces the costs that customers incur with traditional backup methods, including resource consumption, release complexity, software compatibility, kernel development, and software maintenance costs. The feature combines cross-platform plug-ins and proprietary consistency components to quiesce I/O and application transactions during snapshots, based on the file system kernel and the VSS mechanism on Windows. This meets the data consistency requirements of enterprise applications for storage snapshots. The creation protocol automatically restores I/O based on the impact duration. The snapshot consistency type depends on the commit result of the creation protocol and the application state. The link from the upper-layer application to the underlying storage is shortened and the performance of the consistency components is optimized, reducing the I/O impact duration to within seconds. Depending on business requirements, file system consistency snapshots can be created at intervals of seconds, and application consistency snapshots at intervals of minutes.
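One way to picture such a protocol is the sketch below: quiesce the application, take the snapshot, always resume I/O, and report the consistency level based on whether quiescing succeeded and the freeze stayed within a time budget. The callables, the `max_freeze_s` budget, and the downgrade rule are all assumptions made for illustration. The article does not specify the actual protocol, and a real implementation would need a watchdog that resumes I/O by itself rather than a check after the fact.

```python
import time


def create_snapshot(quiesce_app, snapshot_disk, resume_app, max_freeze_s,
                    clock=time.monotonic):
    """Take a snapshot and return the consistency level that was achieved.

    quiesce_app   -- hypothetical hook; returns True if the application
                     flushed its state and paused writes
    snapshot_disk -- hypothetical hook that triggers the storage snapshot
    resume_app    -- hypothetical hook that resumes application I/O
    """
    app_ok = quiesce_app()
    deadline = clock() + max_freeze_s
    try:
        snapshot_disk()
        in_time = clock() <= deadline   # did the freeze stay within budget?
    finally:
        resume_app()                    # I/O is resumed whatever happens
    return "application" if (app_ok and in_time) else "crash"


events = []
level = create_snapshot(
    quiesce_app=lambda: events.append("quiesce") or True,
    snapshot_disk=lambda: events.append("snapshot"),
    resume_app=lambda: events.append("resume"),
    max_freeze_s=1.0,
)
print(level, events)
```

Passing a fake `clock` makes the timeout path easy to exercise in tests without waiting in real time.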
The ESSD snapshot consistency classification covers the full range of snapshot consistency types offered by public cloud block storage in the industry, including crash consistency, application consistency, single-disk consistency snapshots, and multi-disk group snapshots. In terms of security risks and application support scalability, the advantages of native agentless snapshots over competitors include no resident service, no risk from public IP addresses or open ports, secure role authorization, no additional kernel driver, and support for dynamic discovery of logical volumes and enterprise applications. Because the feature is based on ESSD storage snapshots, no backup agent is needed, no kernel driver has to be maintained, and no data is read or transmitted inside the virtual machine.
We tested the snapshot creation duration and I/O impact duration of major cloud vendors in and outside China. In these tests, SQL Server database applications on ESSD system disks and data disks achieved write I/O blocking of only seconds and snapshot intervals of minutes. The creation duration of application consistency snapshots is 2-3 times shorter than that of our competitors. Whole machine recovery from an application consistency snapshot avoids the log replay process required when recovering from a crash consistency snapshot, improving the startup speed of database applications.
Compared with the snapshot features of other public cloud providers in the industry, ESSD is currently the only one that fully supports the features of rapidly available snapshots and consistency group snapshots. It can meet the requirements of snapshot RTO and RPO in the data protection scenario of enterprise core applications migrating to the cloud.
Data protection is not remedial work but a precautionary action. With the vigorous development of cloud-native technologies, especially the evolution of container technology, enterprise users have increasingly higher requirements for the RPO and RTO of cloud protection. In the future, we will also introduce more new functions in ESSD, such as high-density snapshots, continuous data protection, and application consistency protection capability based on multiple ECS instances. We will continue to provide users with the light, fast, and elastic features of snapshots, reduce RTO and RPO of enterprise data protection, provide more advanced features of native snapshot services, and facilitate enterprise data protection.