Alibaba Cloud storage ESSD snapshot service

Introduction: Against the backdrop of cloud native, this article describes how the Alibaba Cloud storage snapshot service builds on the high-performance ESSD cloud disk to improve snapshot performance and deliver a lightweight, real-time user experience, and it explains the technical principles behind that. Following industry trends and cloud data protection scenarios, it also presents data protection solutions built on advanced snapshot features for enterprise users and backup vendors, to meet cloud users' urgent data protection needs and to safeguard the business continuity of enterprises on the cloud.

In July 2021, Gartner, an internationally renowned consulting firm, released its Magic Quadrant for public cloud IaaS (infrastructure as a service) and PaaS (platform as a service) platforms. Alibaba Cloud entered the "Visionaries" quadrant for the first time on the strength of its technical capabilities. In that evaluation, Alibaba Cloud block storage achieved the highest score in its individual category, and Alibaba Cloud's scores for compute, storage, network, and security ranked first in the world. Behind this storage leadership, the high-performance ESSD cloud disk provides users with highly available, reliable, high-performance block-level random access and native snapshot data protection.

New requirements for primary business

With the development of cloud native technology, more and more enterprises are building large-scale, elastic, and varied distributed business scenarios on the cloud, drawing on cloud virtualization and elastic scaling as well as the distributed frameworks, container technology, orchestration systems, continuous delivery, and rapid iteration of the cloud native ecosystem. The deployment scale of enterprise applications and their demand for storage, compute, and other resources have grown exponentially, and traditional data protection schemes can no longer keep up with these changes. Users also face fiercer market competition, so they urgently need cloud data protection schemes that match their business scale and growth. Although cloud computing and cloud native have changed the business background and scenarios of data protection, users' demands have not changed, and the measurement criteria are still the recovery time objective (RTO) and the recovery point objective (RPO).

The primary goal pursued by users is still business continuity, that is, recovering quickly when the business is threatened with interruption, and scaling out quickly when it is under growth pressure. Based on their business scenarios, users raise the following urgent requirements for cloud data protection and snapshot services:

• Short creation time: Snapshots complete quickly, so key businesses can be backed up immediately.

• Fast availability: Snapshots become available quickly, so cloud disks can be rolled back to handle emergencies.

• Business expansion: A sudden increase in business volume requires rapid scale-out.

• Whole-machine protection: All disks of a single ECS instance, and disks attached to multiple ECS instances, are protected together so that the data is consistent.

• Test verification: Data can be tested, verified, and recovered outside the production environment.

• Fast recovery: The file system and application data are backed up in an application-consistent state, which avoids the application's crash recovery process.

• Container backup: Container environments iterate and release rapidly, so both metadata and application data urgently need protection.

According to the definition by the Storage Networking Industry Association (SNIA), a snapshot is a fully usable copy of a defined collection of data that contains an image of the data as it appeared at the point in time at which the copy was initiated. An Alibaba Cloud storage snapshot provides a consistent data image of an ESSD cloud disk at a given point in time. To keep pace with industry trends, the snapshot service continues to identify new user needs and scenarios and to iterate on new features. The advanced enterprise features of ESSD cloud disk snapshots now include extremely fast snapshot availability, application-consistent snapshots, consistency group snapshots suited to distributed application architectures, and remote disaster recovery through cross-region snapshot replication. Offered both on their own and integrated into other products, these features serve enterprise users in big data, gaming, artificial intelligence, finance, and other fields, and they have drawn feedback from other Alibaba Cloud teams, such as RDS, hybrid cloud backup, elastic container instance ECI, and container service ACK, and from their users:

• RDS customers of the cloud database team report that the second-level backup product of RDS effectively reduces the instance resources consumed by the original physical file backup and lowers data protection risk.

• Tucson, a customer that benefits from container acceleration on elastic container instance ECI, commented that the extremely fast cache acceleration feature sped up container application releases, cut the computing time of its simulation platform so that computing tasks take less than 5 minutes on average, and greatly shortened its product release cycle.

• According to hybrid cloud backup customers, the application-consistent whole-machine backup capability fully matches the snapshot function of the VMware virtualization platform.

• The consistency group snapshot and application consistency capabilities provided by the snapshot service fully met the criteria in Gartner's 2021 evaluation of Alibaba Cloud storage services, and they supported the container service ACK team in Forrester's 2021 evaluation of container backup capability.

Typical Scenarios

Lightweight, real-time snapshots with extremely fast availability, together with the advanced consistency group snapshot and application-consistent snapshot features, let enterprise users and third-party backup vendors quickly build copy data management scenarios, including fast backup and recovery, disaster recovery testing, copy utilization, and disaster recovery switching. In its Hype Cycle analysis of storage and data protection technology trends released in July 2021, Gartner lists container backup, cloud data backup, and copy data management (CDM) as data protection trends for the next few years. Gartner's basic definition of copy data management is that an application-consistent snapshot of primary storage is used to generate a "golden image" on secondary storage, which is then used for backup, disaster recovery, and testing; support for heterogeneous storage is a basic prerequisite for these capabilities. The advanced snapshot features of Alibaba Cloud ESSD fully meet the requirements for building CDM and help users implement the following typical scenarios of native data protection for copy data management on the cloud:

Backup and recovery: Fast backup combined with standard backup provides recovery points that are dense for recent times and sparse for older ones. Fast-available snapshots are created regularly to protect whole ECS instances on the cloud and containerized applications in K8S environments. With the consistency group snapshot feature and the extremely fast availability feature enabled, local instant snapshots can be generated within seconds. The instant snapshot copy is retained locally and serves as a fast backup for second-level recovery without loss of IO performance. Application-consistent whole-machine snapshots are generated periodically for the enterprise applications running on top. The local snapshot copy is also uploaded over the network to Object Storage Service (OSS) as a standard backup. Once the upload completes, the standard backup is visible in every availability zone of the local region, which suits historical data with a long retention time.
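The idea of recovery points that are dense nearby and sparse further back can be sketched as a pruning function. This is only an illustration: the tier values and the name select_retained are assumptions made for this example and do not describe the snapshot service's actual policy or API.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: (maximum age, minimum spacing between kept recovery points).
TIERS = [
    (timedelta(hours=1), timedelta(minutes=1)),  # last hour: one point per minute
    (timedelta(days=1), timedelta(hours=1)),     # last day: one point per hour
    (timedelta(days=30), timedelta(days=1)),     # last 30 days: one point per day
]


def select_retained(snapshot_times, now):
    """Return the snapshot timestamps to keep, newest first."""
    kept = []
    for ts in sorted(snapshot_times, reverse=True):
        age = now - ts
        # Pick the spacing of the first tier that still covers this age.
        spacing = next((gap for max_age, gap in TIERS if age <= max_age), None)
        if spacing is None:
            continue  # older than the last tier: let it expire
        if not kept or kept[-1] - ts >= spacing:
            kept.append(ts)
    return kept
```

With one snapshot per minute over three hours, this keeps every point from the last hour and only hourly points before that, which is the shape the fast backup plus standard backup combination aims for.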

Disaster recovery testing: disaster recovery tests based on fast backup. Copy data management requires the disaster recovery environment to be tested regularly. Regular testing improves the reliability of that environment and catches configuration problems and environment changes that would otherwise prevent a correct disaster recovery switch, and therefore a quick business recovery, when a real disaster occurs. Using rapid cloning based on local snapshot copies, disaster recovery instances and container applications are started, and the backup data is mounted and verified periodically. A traditional replication-based scheme has to wait until the snapshot has been replicated and is available on the disaster recovery side before testing and drills can start. With fast backup, cloning, mounting, and startup tests on the disaster recovery side all complete within seconds.

Copy utilization: data analysis based on fast backup. Without affecting the production environment, container applications are started regularly in the disaster recovery environment using rapid cloning, and the copies are analyzed with big data tools to mine the value of the data. In practice, copy utilization also applies to MySQL database applications, where a read-only standby database is started instantly from a fast backup for offline data analysis.

Disaster recovery switching: the business is switched from the production environment to the disaster recovery environment. When a major disaster strikes production and the business cannot be recovered within a short time, the business is switched from the production center to the disaster recovery center. After the production center is restored, the business is switched back from the disaster recovery center.

Compared with traditional copy data management (CDM) schemes, cloud computing and cloud native environments offer large-scale, elastic, homogeneous compute, so enterprise users do not need to invest in equipment and software. Rapid backup and cloning greatly reduce the recovery time objective (RTO) of development and testing on copies and of disaster recovery switching. The unified backup data format of the cloud snapshot service reduces the number of copies needed across management processes and eliminates data format compatibility problems between backup software.

Technical principles

We have heavily optimized the distributed snapshot algorithm and its implementation so that users can set aside concerns about performance impact and protect data in a lightweight, real-time way at any time. "Light": snapshot creation does not affect IO read and write performance. "Fast": ESSD cloud disk snapshots can be created, rolled back, and cloned within seconds. This extremely fast availability meets users' needs for real-time data protection and rapid DevOps orchestration.

Extreme availability

With the extremely fast availability feature, the snapshot service still covers data backup, compliance, and long-term archiving, where cloud disk data is backed up to Alibaba Cloud Object Storage Service with one click. In addition, it retains local snapshot copies at second-level intervals. Together these form a near and far snapshot protection strategy and enable lightweight snapshot creation, fast cloning that is available in real time, and second-level lossless rollback.

Rapid cloning: In a disaster recovery environment that is isolated from production and spans availability zones, snapshot cloning creates new writable disks from a snapshot for application test verification and business recovery preparation. It also relieves pressure on cloud businesses by enabling horizontal scale-out. For example, scaling out MySQL database applications, building standby databases, creating instances, and setting up read/write separation all need to start within seconds. Rapid cloning uses lazy loading so that the data of a local snapshot copy becomes available within seconds, both within the local region and across clusters. New disks can be cloned quickly, and instances start within seconds.
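Lazy loading is what makes a clone usable before its data has been copied. The toy model below illustrates the general technique only; the class LazyClone and its in-memory dictionaries are assumptions for this example, not the ESSD implementation.

```python
class LazyClone:
    """New disk cloned from a snapshot; blocks are fetched on first access."""

    def __init__(self, snapshot_blocks):
        self._source = snapshot_blocks  # read-only snapshot data: block number -> bytes
        self._local = {}                # blocks already loaded or written on the clone
        self.fetches = 0                # number of blocks pulled from the snapshot

    def read(self, block_no):
        if block_no not in self._local:
            # First touch: load the block from the snapshot on demand.
            self._local[block_no] = self._source.get(block_no, b"\x00")
            self.fetches += 1
        return self._local[block_no]

    def write(self, block_no, data):
        # Whole-block write: the old content does not need to be fetched first.
        self._local[block_no] = data
```

Creating the clone copies nothing, so it is ready immediately, and only the blocks that the application actually reads are ever loaded from the snapshot.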

Second-level rollback: local snapshot copy data is kept in the same local storage as the cloud disk, which enables second-level rollback without loss of IO performance. Snapshot generation is based on improved redirect-on-write (ROW) technology and holographic indexing technology. As changed data blocks are written to the ESSD cloud disk, the read path is organized for the best ESSD read performance. No data has to be pulled from remote object storage, so rollback completes within seconds with no loss of IO performance.
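The reason a redirect-on-write snapshot can be created and rolled back quickly is that both operations touch only the block index, never the data. The following minimal sketch shows plain ROW under that assumption; the class RowDisk is hypothetical, and the improved ROW and holographic indexing used by ESSD are not described in enough detail here to reproduce.

```python
class RowDisk:
    """Toy redirect-on-write disk: snapshots copy the index, never the data."""

    def __init__(self):
        self._store = []      # append-only list of block contents
        self._index = {}      # logical block number -> position in _store
        self._snapshots = {}  # snapshot name -> frozen copy of the index

    def write(self, block_no, data):
        # Redirect the write to a new location; old data stays for snapshots.
        self._store.append(data)
        self._index[block_no] = len(self._store) - 1

    def read(self, block_no):
        pos = self._index.get(block_no)
        return None if pos is None else self._store[pos]

    def snapshot(self, name):
        self._snapshots[name] = dict(self._index)

    def rollback(self, name):
        # Restoring the index is enough; no block data is copied or fetched.
        self._index = dict(self._snapshots[name])
```

After a rollback, reads go straight to the blocks that were already stored locally, which is why read performance does not have to degrade.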

In a test where a cloud disk has multiple extremely fast available snapshots and a rollback is initiated, the read performance of the cloud disk remains essentially unchanged. By contrast, after a competitor's cloud disk retains multiple local snapshots, its IO read performance shows varying degrees of latency jitter.

Consistency Group Snapshot

Container environments and ECS instances need to protect stateful applications that span multiple associated disks. The biggest problem with single-disk snapshots is that such applications use LVM, Windows dynamic disks, or file systems built on multiple cloud disks as persistent storage, so backups made from separate single-disk snapshots are inconsistent. Database applications balance performance and data safety by placing the write-ahead log (WAL) and the data files on different storage devices, which makes regular whole-system backup and disaster recovery impossible with single-disk snapshots.

In addition to stateful applications deployed in a Pod under K8S or on a single ECS instance, there are distributed deployment architectures: high-availability application clusters such as Windows Failover Cluster, active/standby application server architectures, and RAC architectures based on shared storage. These distributed architectures also require data consistency protection across cloud disks and nodes.

Cloud storage backends often adopt a distributed storage architecture. The lack of a global logical clock in a distributed environment makes it difficult to take consistency group snapshots across the cloud disks of a single ECS instance or of multiple ECS instances, and of a single Pod or of multiple nodes in K8S environments. Minimizing the impact of snapshots on IO performance is technically challenging as well. The industry's techniques for multi-disk crash-consistent snapshots fall into two main categories:

• Block write IO during the snapshot to achieve point-in-time crash consistency across multiple disks.

• Use a logical clock sequencing algorithm, which depends on the distributed storage implementation and is difficult to achieve.

Consistency group snapshots adopt the second method, which aims to make snapshots lossless for IO performance and to minimize their impact on application performance.

Implementation principle: An IO sequencing algorithm is used, so creating a snapshot requires no write IO blocking. Many users worry that creating snapshots will affect IO performance and therefore take snapshots only during off-peak hours. Our optimized multi-disk consistency group snapshot algorithm changes that impression. Based on a write-order preservation mechanism, we mark and sequence write IOs in the order in which they reach the underlying storage. The set of IOs to include in the snapshot is determined from the snapshot completion point and the IO sequence. Unlike the traditional method, the sequencing process does not block IO writes. And unlike the traditional copy-on-write (COW) mode, snapshot generation uses redirect-on-write (ROW): references to the data set are generated in the background with no impact on the IO path, which keeps the impact of snapshots on IO performance to a minimum. In database read and write scenarios, IO performance is not degraded.
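The sequencing idea can be illustrated with a small single-process model: every write is stamped with a sequence number in arrival order, and a group snapshot is simply a cut in that sequence. This is a sketch under strong assumptions. A single in-process counter stands in for whatever ordering mechanism the distributed storage layer actually uses, and the names Sequencer, Disk, take_cut, and group_image are invented for this example.

```python
import itertools


class Sequencer:
    """Hands out increasing sequence numbers in write arrival order."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.last = 0

    def next(self):
        self.last = next(self._counter)
        return self.last


class Disk:
    def __init__(self, sequencer):
        self._seq = sequencer
        self.log = []  # (sequence number, block number, data), in write order

    def write(self, block_no, data):
        self.log.append((self._seq.next(), block_no, data))

    def image_at(self, cut):
        """Disk content made of every write stamped at or before the cut."""
        image = {}
        for seq, block_no, data in self.log:
            if seq <= cut:
                image[block_no] = data
        return image


def take_cut(sequencer):
    # Choosing the snapshot point only reads a number; no write is blocked.
    return sequencer.last


def group_image(disks, cut):
    # Every disk applies the same cut, so write order across disks is preserved.
    return [disk.image_at(cut) for disk in disks]
```

If a database writes a WAL record to one disk before the matching data page on another, any cut contains either both writes, only the WAL record, or neither. It never contains the data page without its log record, which is what crash consistency across disks requires.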

The test uses a database application with two disks, two clients, 4TB of capacity, random writes, iodepth=16, jobs=1, and a 16KB block size. It measures the impact on IO during snapshot creation in a high-IOPS scenario. For Vendor 1 and Vendor 2, the impact on IO performance during snapshot creation is roughly 1 to 3 times greater.

Application-consistent snapshots

The consistency types of ESSD cloud disk snapshot data are mainly crash consistency and application consistency. Crash consistency requires file systems and applications to be able to recover from a crash. It is characterized by a low recovery point objective (RPO) and little business impact. However, in the following scenarios it cannot meet requirements for highly reliable backup data and a second-level recovery time objective (RTO):

• Atomicity defect risk: It is difficult for file systems and database applications to implement transaction atomicity, and their implementations may contain defects. The paper "All File Systems Are Not Created Equal", published at USENIX OSDI, a top systems conference, explains that the application and kernel code meant to ensure atomicity may contain implementation defects.

• Data loss risk: Mainstream file systems favor performance by default, so crash-consistent backups carry a risk of data loss. The default data mode of the ext4 file system on Linux is ordered mode, and data can be lost during file system checking and repair. Database applications configured to favor performance put business data at risk of loss as well.

• Long generation time and large impact: Traditional file-level physical backup and backup agents rely on logical volume snapshots, which take a long time to generate and have a large impact on the system. A backup agent needs to install a kernel driver, which brings poor compatibility and high maintenance costs. The file backup process has to read data, which consumes system CPU and IO resources. An application-consistent snapshot, by contrast, communicates with the application only at the moment the consistency point is generated, with no incremental data generation and no read/write operations for backup.

Implementation principle: Compared with traditional backup methods, the value of application-consistent snapshots lies in being cloud native and agentless, which spares customers the resource consumption, release complexity, software compatibility issues, kernel development, and software maintenance costs of traditional backup. Cross-platform plug-ins are combined with proprietary consistency components, which rely on the kernel's file system layer and, on Windows, the VSS mechanism to quiesce IO and application transactions during the snapshot, so that stored snapshots meet the data consistency requirements of enterprise applications. The creation protocol automatically resumes IO based on how long IO has been affected. The resulting consistency type of the snapshot depends on the outcome of the creation protocol and on the application state. The path from the upper-layer application to the underlying storage and the performance of the consistency components have been optimized, reducing the duration of the IO impact to seconds. Depending on business requirements, file-system-consistent snapshots can be created at second-level intervals and application-consistent snapshots at minute-level intervals.
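The creation protocol described above (quiesce, snapshot, resume, then decide the consistency type from the outcome) can be sketched as follows. This is a hedged illustration, not the service's protocol: the callbacks freeze, take_snapshot, and thaw, the 10-second budget, and the returned labels are all assumptions. Exceeding the budget is treated as if IO had been resumed automatically, so the result is downgraded.

```python
import time


def create_snapshot(freeze, take_snapshot, thaw,
                    max_freeze_seconds=10.0, clock=time.monotonic):
    """Quiesce, snapshot, resume; report which consistency level was achieved."""
    start = clock()
    frozen = False
    try:
        frozen = freeze()   # e.g. flush application buffers and hold new writes
        take_snapshot()     # the snapshot is taken even if quiescing failed
    finally:
        if frozen:
            thaw()          # always resume IO, even if the snapshot step failed
    elapsed = clock() - start
    if frozen and elapsed <= max_freeze_seconds:
        return "application-consistent"
    # Quiescing failed or IO was held too long: only crash consistency is claimed.
    return "crash-consistent"
```

Reporting a weaker level instead of failing means a recovery point is still produced, while the application is never held frozen beyond the budget.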

From crash consistency to application consistency, and from single-disk snapshots to multi-disk consistency group snapshots, ESSD snapshots cover every snapshot consistency type offered by mainstream public cloud block storage. In terms of security risk and extensibility of application support, the native agentless snapshot has the following advantages over other vendors' implementations: no resident service, no risk from exposing a public IP address or opening ports, role-based security authorization, and no additional kernel drivers. It also supports dynamic discovery of logical volumes and enterprise applications. Because snapshots are taken on the ESSD cloud disk itself, no backup agent is needed, no kernel driver has to be maintained, and no data is read or moved inside the virtual machine.

In actual tests of snapshot creation duration and IO impact duration across major cloud vendors at home and abroad, a SQL Server database application on an ESSD system disk and data disk achieved second-level write IO blocking and minute-level snapshot intervals, and its application-consistent snapshot creation duration was 2 to 3 times shorter than that of other vendors. Application-consistent whole-machine recovery avoids the log replay that recovery from a crash-consistent snapshot requires, which speeds up database application startup.

Industry Function Comparison

Compared with the snapshot features of other public cloud vendors, Alibaba Cloud with its ESSD cloud disk is the only cloud vendor that fully supports both extremely fast snapshot availability and consistency group snapshots, meeting the snapshot RTO and RPO requirements of cloud data protection for enterprise core applications.

Future outlook

Data protection is about preparing for a rainy day, not mending things after the damage is done. With the vigorous development of cloud native technology, and especially the evolution of container technology, enterprise users have increasingly demanding recovery point objectives (RPOs) and recovery time objectives (RTOs) for data protection on the cloud. We will introduce more new features based on the ESSD cloud disk, such as high-density snapshots, continuous data protection, and application consistency protection across multiple ECS instances. We will continue to deliver "light", "fast", and "snappy" snapshots, reduce the RTO and RPO of enterprise data protection, and provide more advanced native snapshot features to support enterprise data protection.
