
Cloud Enterprise-Level Storage: New Dimensions of Storage to Promote Business Innovation

Part 5 of this 5-part series explains how ESSD combines the features of cloud and enterprise-level storage to provide a more intelligent storage service experience.

By Mangong from Alibaba Cloud Storage Team

1. Preface

When it comes to enterprise-level storage, we may think of high stability, high performance, and rich enterprise-level features. When it comes to cloud computing, we talk about unique features, such as large scale, global deployment, elasticity, service orientation, intelligence, instant activation, and pay-as-you-go. If the two are combined, what new storage forms will be produced? The goal of enterprise-level storage on the cloud is to perfectly integrate the features of enterprise-level storage and the cloud and explore new possibilities of storage. This makes business sustainable for users and facilitates their business innovations.


Let's take block storage as an example. A common enterprise-level solution is the Storage Area Network (SAN), which connects storage arrays and business hosts through a private network. SAN provides unified storage management and sharing and delivers high-performance, low-latency data access. However, SAN has shortcomings, such as high cost, complex operation and maintenance (O&M), and poor scalability. These are exactly the problems cloud technology was designed to solve. Thus, Alibaba Cloud launched its enterprise-level cloud disk (ESSD) to help users meet the needs of digital transformation and innovation.

2. ESSD: Enterprise-Level Cloud Disk

ESSD provides users with high-availability, high-reliability, and high-performance block-level random access services. It also provides rich enterprise features, such as native snapshot data protection and cross-region disaster recovery. The ESSD project started in 2016 and is built on the Pangu 2.0 distributed storage foundation. It uses RDMA and NVMe SSDs with a full user-space I/O stack and draws on more than a decade of Alibaba's experience with in-house distributed storage technology. ESSD made its debut during the Double 11 Global Shopping Festival in 2017, where it carried part of the peak traffic of core businesses, such as databases and middleware, and performed very well. As a result, it was widely adopted inside Alibaba from 2018 and was opened to some external customers, who gave us very positive feedback. In 2019, ESSD was commercialized on a large scale, bringing cloud disks into the microsecond era. In 2020, the economical ESSD PL0 was launched so small and medium-sized customers could also benefit from ESSD all-flash technology. By September 2021, ESSD was available in 59 availability zones (AZs), and 95% of Alibaba Cloud's top customers had chosen ESSD. It has become the most popular cloud disk product.


As a cloud product, ESSD provides service-oriented, secure, and intelligent O&M management and control services. It shields the underlying complex hardware and system O&M for users and uses declarative and open APIs to help users build upper-layer business systems. At the same time, the ESSD service is deployed and operated globally along with the development of the cloud infrastructure. Whether it is a central region, a local cloud, or an edge cloud, it aims to help meet the diverse deployment needs of users.

ESSD provides users with three major data services: a high-stability, high-performance, and high-elasticity data access service, a lightweight, real-time, and elastic native snapshot data protection service, and a multi-active disaster recovery service that works anytime and anywhere.

In terms of data access, ESSD provides data reliability of 99.9999999% and availability of 99.999%. It also provides end-to-end data protection, latency as low as 100 microseconds, and up to 1 million IOPS. It supports encryption with custom keys, online scale-out, and performance configuration changes that take effect within seconds. In addition, ESSD Auto PL, which scales performance automatically with the business load, was recently released. ESSD also supports the standard NVMe protocol, shared access, and exclusive clusters that meet security compliance and physical isolation requirements.
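To make these figures more tangible, here is a small back-of-the-envelope sketch. It only applies standard arithmetic (a yearly downtime budget and Little's law) to the numbers quoted above. The function names are ours and purely illustrative, and the results are not service-level commitments.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def yearly_downtime_minutes(availability: float) -> float:
    """Downtime budget per year implied by an availability ratio (0..1)."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def in_flight_ios(target_iops: float, latency_seconds: float) -> float:
    """Little's law: average outstanding I/Os = throughput x latency."""
    return target_iops * latency_seconds


print(f"99.999% availability -> {yearly_downtime_minutes(0.99999):.2f} min/year")
print(f"1M IOPS at 100 us    -> {in_flight_ios(1_000_000, 100e-6):.0f} I/Os in flight")
```

In other words, 99.999% availability leaves a budget of a little over five minutes of downtime per year, and reaching 1 million IOPS at 100 microseconds per I/O requires roughly 100 I/Os in flight on average, so an application has to issue requests with enough concurrency to see the headline IOPS.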

Apart from basic data access services, ESSD provides a native snapshot service that helps users protect data more conveniently. It offers flexible snapshot policies and does not affect frontend I/O read/write performance while a snapshot is taken. Snapshots can be created, rolled back, and cloned within seconds. ESSD supports consistency group snapshots across multiple disks as well as application-consistent snapshots. It also provides cross-region snapshot replication. For cloud-native and container scenarios, it supports creating cloud disks from snapshots in large batches, with the new disks accessible in real time.

In addition to snapshot data protection, ESSD has launched an asynchronous replication service to meet users' needs for multi-region, multi-active disaster recovery. Users can get started with a zero threshold and rely on the infrastructure and private lines that Alibaba Cloud has deployed worldwide to implement a geo-disaster recovery architecture. More disaster recovery services, such as synchronous replication and cross-region multi-active disaster recovery, will be provided in the future.

ESSD is service-centric, combining the features of cloud and enterprise-level storage to build enterprise-level storage services on the cloud. The following sections explain the latest ESSD products and features in detail.


2.1 High-Elasticity I/O of ESSD Auto PL

ESSD Auto PL cloud disks address a problem many users face: they cannot accurately estimate business peaks, so they cannot plan performance configuration precisely. If too much performance is reserved, a large amount of resources is wasted. If too little is reserved, sudden peaks affect the business. ESSD Auto PL is designed to resolve this dilemma. It supports both provisioned performance configuration and performance auto scaling based on the business load. The performance of a single disk can scale automatically up to 1 million IOPS, which provides safe and convenient automatic performance configuration for unexpected bursts. When performance auto scaling is enabled, users only pay for the read and write operations that exceed the pre-configured performance. This ensures stable business operation and minimizes resource configuration overhead.
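The "pay only for what exceeds the pre-configured performance" idea can be sketched in a few lines. This is a minimal illustration under our own assumptions: load is sampled once per second, `provisioned_iops` is the pre-configured baseline, and the function name is hypothetical. It is not the actual ESSD metering or pricing logic.

```python
from typing import Iterable

MAX_DISK_IOPS = 1_000_000  # single-disk ceiling quoted in the article


def billable_burst_ios(iops_per_second: Iterable[int], provisioned_iops: int) -> int:
    """Count I/Os served above the provisioned baseline (assumed 1-second samples)."""
    total = 0
    for observed in iops_per_second:
        served = min(observed, MAX_DISK_IOPS)        # the disk cannot exceed its ceiling
        total += max(0, served - provisioned_iops)   # only the excess counts as burst
    return total


# Hypothetical load trace: two quiet seconds and two seconds above the baseline.
samples = [8_000, 12_000, 50_000, 9_000]
print(billable_burst_ios(samples, provisioned_iops=10_000))
```

With this model, a disk that never exceeds its baseline produces zero billable burst I/Os, which is the property that lets users provision for the typical load instead of the peak.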


ESSD Auto PL is the first cloud disk that supports performance/capacity decoupling and performance auto scaling based on the load. This required solving many technical challenges, including quickly perceiving business load changes, dynamically requesting and releasing resources on demand to support performance scaling, and quickly rebalancing the load. After repeated refinements, ESSD Auto PL can perceive and predict the business load on the order of ten milliseconds and complete dynamic queue scheduling and concurrency adjustment within seconds. Fine-grained segmentation of a single disk allows it to use the resources of the entire backend storage cluster and to adjust quickly and dynamically. It also solves two other problems to eliminate users' concerns:

  1. When the load on a large number of users' cloud disks increases simultaneously, the total demand may exceed the performance ceiling of a single cluster. ESSD Auto PL handles this through real-time monitoring and prediction of cluster capacity and performance levels, combined with minute-level cross-cluster scheduling and balancing.
  2. In multi-tenant scenarios, elastic performance increases for one tenant must not interfere with other tenants. ESSD Auto PL avoids this interference through multi-level QoS isolation and priority management, including hardware-offloaded dynamic queue distribution, I/O tagging, and execution cost evaluation and reordering.
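To illustrate how I/O tagging and cost-based reordering can keep a bursting tenant from starving others, here is a simplified weighted fair queuing sketch. It is a textbook-style technique that we substitute for illustration, not the ESSD scheduler. The class name, tenant names, weights, and cost units are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class _Queued:
    finish_tag: float                 # virtual finish time used for ordering
    seq: int                          # tie-breaker that preserves arrival order
    tenant: str = field(compare=False)
    cost: int = field(compare=False)


class FairIoScheduler:
    """Simplified weighted fair queuing over tenant-tagged I/O requests."""

    def __init__(self, weights):
        self._weights = dict(weights)
        self._last_finish = {tenant: 0.0 for tenant in weights}
        self._heap = []
        self._seq = count()
        self._virtual_time = 0.0

    def submit(self, tenant, cost):
        # Tag the request: expensive requests and low weights push the tag later.
        start = max(self._virtual_time, self._last_finish[tenant])
        finish = start + cost / self._weights[tenant]
        self._last_finish[tenant] = finish
        heapq.heappush(self._heap, _Queued(finish, next(self._seq), tenant, cost))

    def dispatch(self):
        # Serve the request with the smallest virtual finish tag.
        item = heapq.heappop(self._heap)
        self._virtual_time = max(self._virtual_time, item.finish_tag)
        return item.tenant, item.cost


scheduler = FairIoScheduler({"burst": 1, "steady": 1})
for _ in range(4):
    scheduler.submit("burst", cost=8)    # large I/Os from a bursting tenant
for _ in range(2):
    scheduler.submit("steady", cost=1)   # small I/Os that arrive later
order = [scheduler.dispatch()[0] for _ in range(6)]
print(order)
```

Even though the bursting tenant submitted first, the small requests from the steady tenant are dispatched ahead of the backlog, because each request is ordered by its estimated cost relative to the tenant's weight rather than by arrival time.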

We hope these technologies will help ESSD Auto PL simplify the performance configuration for users and help users ensure smooth operation during business peaks.

2.2 NVMe and Shared Access

With the rapid development and popularization of flash memory technology, storage media are no longer the bottleneck of storage. Instead, the software stack on top of the media has become the biggest bottleneck. NVMe is a newer data access protocol designed for high-performance devices. Compared with the traditional SCSI protocol, NVMe is simpler and lighter and provides rich extension features. With this release, ESSD allows users to access data more efficiently over the NVMe protocol. At the same time, shared access to cloud disks is implemented based on the NVMe Persistent Reservation standard.

Many mainstream commercial databases (such as Oracle RAC and SAP HANA) rely on shared disk access to achieve high availability. NVMe Persistent Reservation provides secure and lightweight support for shared access and permission management, which significantly shortens failover time. ESSD also uses hardware offloading to reduce NVMe virtualization latency by 30%. In addition, the in-house Solar-RDMA network protocol supports efficient data transmission for ESSD, and network multipath failover completes within seconds.
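The way a reservation fences a failed node during failover can be shown with a toy model. The sketch below is loosely modeled on the register, acquire, and preempt operations of a write-exclusive reservation. The class, method names, and keys are hypothetical, and it is neither an NVMe driver nor the ESSD implementation.

```python
class SharedDisk:
    """Toy model of a write-exclusive persistent reservation on a shared disk."""

    def __init__(self):
        self._keys = {}       # host -> registered reservation key
        self._holder = None   # host that currently holds the reservation

    def register(self, host, key):
        self._keys[host] = key

    def acquire(self, host):
        if host not in self._keys:
            raise PermissionError(f"{host} is not registered")
        if self._holder not in (None, host):
            return False      # another host already holds the reservation
        self._holder = host
        return True

    def preempt(self, host, holder_key):
        """Take over the reservation by naming the current holder's key."""
        if host not in self._keys:
            raise PermissionError(f"{host} is not registered")
        if self._holder is None or self._keys[self._holder] != holder_key:
            return False
        del self._keys[self._holder]   # the old holder loses its registration
        self._holder = host
        return True

    def can_write(self, host):
        return self._holder is None or self._holder == host


disk = SharedDisk()
disk.register("node-a", key=0xA)
disk.register("node-b", key=0xB)
disk.acquire("node-a")
print(disk.can_write("node-b"))                            # standby is blocked
disk.preempt("node-b", holder_key=0xA)                     # failover
print(disk.can_write("node-b"), disk.can_write("node-a"))  # old primary is fenced
```

The point of the design is that fencing happens at the disk itself: once the standby preempts the reservation, late writes from the old primary are rejected without any extra coordination service.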


2.3 Lightweight, Real-Time, and Elastic Native Snapshot Data Protection

ESSD provides native snapshots to users for convenient data protection. In addition to adding multi-disk consistency snapshot groups and application consistency snapshots, the snapshot experience is upgraded and optimized in three aspects: lightweight, fast, and elastic.

Lightweight: I/O read and write performance is not affected during snapshot creation. Many users worry that snapshot creation will affect I/O performance, so they only take snapshots during business troughs. We have optimized the distributed snapshot algorithms and their implementation extensively so users can set aside performance concerns and protect data at any time. The measured data in the following figure shows that when a consistent snapshot is created for two ESSD disks under heavy write load, the foreground write latency remains unchanged. We also measured the snapshot performance of two competing products and found that their I/O latency increased by roughly one to three times.

Fast: ESSD snapshots can be created, rolled back, and cloned within seconds, meeting users' needs for real-time data protection and DevOps quick orchestration.

Elastic: With the spread of cloud-native and container technologies, users want to start a large number of container Pods in a short time. We have heavily optimized batch cloning of cloud disks from snapshots while keeping the cloned disks accessible in real time, enabling users to start thousands of Pods within minutes.
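One general way to make snapshot creation and cloning cheap is to freeze the current set of written blocks and redirect new writes to a fresh layer, so no block data is copied. The sketch below illustrates that generic redirect-on-write idea. It is our own simplification with hypothetical names and should not be read as the ESSD snapshot algorithm.

```python
class Disk:
    """Toy layered disk: frozen layers are shared, writes go to an active layer."""

    def __init__(self, base_layers=()):
        self._frozen = tuple(base_layers)  # immutable shared layers, oldest first
        self._active = {}                  # block number -> data since last snapshot

    def write(self, block, data):
        self._active[block] = data

    def read(self, block):
        if block in self._active:
            return self._active[block]
        for layer in reversed(self._frozen):   # newest frozen layer wins
            if block in layer:
                return layer[block]
        return None                            # block was never written

    def snapshot(self):
        """Freeze the active layer; no block data is copied."""
        self._frozen = self._frozen + (self._active,)
        self._active = {}
        return self._frozen

    @classmethod
    def clone_from(cls, snapshot):
        """Create a writable disk that shares the snapshot's layers."""
        return cls(snapshot)


source = Disk()
source.write(0, "image-v1")
snap = source.snapshot()
source.write(0, "image-v2")                          # does not change the snapshot
clones = [Disk.clone_from(snap) for _ in range(1000)]
clones[0].write(1, "pod-0 scratch data")             # private to that clone
print(source.read(0), clones[0].read(0), clones[1].read(1))
```

Because every clone only holds a reference to the shared layers plus its own empty write layer, creating a thousand clones costs almost nothing up front, and each clone is readable immediately.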


2.4 Asynchronous Replication and Cross-Region Disaster Recovery

Data is the core asset of an enterprise. In the real world, serious disasters do happen, causing large-scale data center outages and even data loss, so data disaster recovery is a universal requirement of enterprise customers. Traditional disaster recovery solutions often require users to build disaster recovery centers, purchase private lines, and invest a lot of time and effort in O&M, testing, and verification. The cost is huge, and the cycle is long. The infrastructure that cloud computing services deploy worldwide gives users a natural foundation for disaster recovery anytime and anywhere. With this release, ESSD launches the asynchronous replication service to help users start cross-region data disaster recovery on demand, at any time, with a zero threshold.

In the design and implementation of ESSD asynchronous replication, we have made many innovative optimizations to the replication algorithm for cloud disk consistency groups. The algorithm ensures that the primary and secondary disk groups are strongly consistent in time order and applies multiple cross-checks, while the foreground read and write performance of the primary disks is not affected. In the data transmission path, only the minimum incremental data is replicated. We use multi-channel concurrent scheduling to shorten the replication cycle and detect network health in real time so that transmission can switch paths when needed. Users can activate the asynchronous replication service at any time with a few clicks in the console and only pay for what they use.
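The combination of "replicate only incremental data" and "keep a group of disks consistent" can be sketched as follows. This is a minimal model under our own assumptions: changed blocks are tracked per disk, one point-in-time cut is taken across the whole group, and the batch is applied to the secondary as a unit. The class and disk names are hypothetical and the real service is far more involved.

```python
class ReplicatedGroup:
    """Toy consistency group with dirty-block tracking and periodic replication."""

    def __init__(self, disk_names):
        self.primary = {name: {} for name in disk_names}
        self.secondary = {name: {} for name in disk_names}
        self._dirty = {name: set() for name in disk_names}

    def write(self, disk, block, data):
        # Foreground writes only touch the primary and mark the block dirty.
        self.primary[disk][block] = data
        self._dirty[disk].add(block)

    def replicate_cycle(self):
        """Ship blocks changed since the last cycle; return how many were sent."""
        # One cut across all disks, so the batch reflects a single point in time.
        batch = {
            name: {block: self.primary[name][block] for block in blocks}
            for name, blocks in self._dirty.items()
        }
        self._dirty = {name: set() for name in self.primary}
        # Apply the whole batch to the secondary group.
        for name, blocks in batch.items():
            self.secondary[name].update(blocks)
        return sum(len(blocks) for blocks in batch.values())


group = ReplicatedGroup(["data", "log"])
group.write("data", 0, "row-1")
group.write("data", 0, "row-1-updated")   # same block overwritten before the cycle
group.write("log", 0, "commit-1")
print(group.replicate_cycle())            # blocks sent in the first cycle
print(group.replicate_cycle())            # nothing changed, nothing sent
```

Tracking dirty blocks instead of individual writes means a block overwritten many times within one cycle is transferred only once, which is one reason asynchronous replication can keep bandwidth use low.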


2.5 ESSD Exclusive Cluster

Some cloud users need physical isolation of their data to meet industry standards. ESSD exclusive clusters provide physical resource isolation and customization while still letting users benefit from unified O&M on the cloud and continuous iteration of software and hardware.


2.6 New Generation of High-Performance ESSD PL-X

The high performance and rich enterprise features of ESSD are favored by many users, and we have learned a lot from their feedback. We will continue refining and iterating our products to provide a better experience. Many users have told us they hope ESSD can go even further in performance to serve their most demanding scenarios, and we have been working hard in this direction. Here is some good news ahead of time: we will invite users to try out the new generation of high-performance ESSD PL-X before its official release.

Compared with ESSD PL-3, previously the highest-performance ESSD disk, ESSD PL-X reduces the end-to-end latency of 4 KB writes by 70%, to only 30 μs. IOPS is tripled, up to 3 million, and throughput increases from 4 GB/s to 15 GB/s. Compared with other high-performance cloud disks on the market, ESSD PL-X has a clear performance advantage.


ESSD PL-X uses the latest high-speed RDMA network and persistent memory technology to deeply optimize the data path. It uses an innovative high-concurrency read/write consistency protocol to minimize protocol serialization overhead. At the same time, because the unit cost of persistent memory is much higher than that of SSD, ESSD PL-X uses both persistent memory and NVMe SSD as storage media and adopts intelligent tiered data storage and management. This gives users the best cost-effectiveness.
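As a rough illustration of tiering between a small fast medium and a larger slower one, the sketch below keeps recently read blocks in a fast tier with least-recently-used eviction. The policy, class name, and capacities are our own assumptions for illustration. The article does not describe the actual ESSD PL-X placement algorithm.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small LRU fast tier in front of a large slow tier."""

    def __init__(self, fast_capacity):
        self._fast = OrderedDict()   # stands in for persistent memory
        self._slow = {}              # stands in for NVMe SSD
        self._capacity = fast_capacity
        self.hits = 0
        self.misses = 0

    def write(self, block, data):
        self._slow[block] = data
        if block in self._fast:      # keep a cached copy coherent
            self._fast[block] = data

    def read(self, block):
        if block in self._fast:
            self._fast.move_to_end(block)      # mark as most recently used
            self.hits += 1
            return self._fast[block]
        self.misses += 1
        data = self._slow[block]
        self._fast[block] = data               # promote into the fast tier
        if len(self._fast) > self._capacity:
            self._fast.popitem(last=False)     # evict the least recently used block
        return data

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


store = TieredStore(fast_capacity=2)
for block in range(3):
    store.write(block, f"data-{block}")
for block in [0, 0, 1, 0]:
    store.read(block)
print(store.hit_rate())
```

The fraction of reads served from the fast tier is the quantity a tiering policy tries to maximize: the more of the hot working set that fits in the fast medium, the less often a read has to go to the slower one.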

According to our current FIO measurements, the end-to-end latency of a single 4 KB write on ESSD PL-X is only 25.44 μs. This includes 10.6 μs of host-side virtualization latency, 13 μs of RDMA network transmission, and 1.8 μs of storage backend processing.
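The breakdown is easier to read as shares of the total. The short script below only does arithmetic on the three component values quoted above, and the dictionary labels are ours. The components add up to 25.4 μs, slightly below the quoted 25.44 μs, which we assume is due to rounding of the individual values.

```python
# Component latencies of one 4 KB write, in microseconds, as quoted in the article.
components_us = {
    "host-side virtualization": 10.6,
    "RDMA network transmission": 13.0,
    "storage backend processing": 1.8,
}

total_us = sum(components_us.values())

# Print each component with its share of the summed latency.
for name, value in components_us.items():
    print(f"{name:28s} {value:5.1f} us  {value / total_us:6.1%}")
print(f"{'sum of components':28s} {total_us:5.1f} us")
```

Seen this way, network transmission and host-side virtualization account for more than 90% of the write latency, while backend processing is only a small fraction of it.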


We also tested the performance of ESSD PL-X in database scenarios. We deployed MySQL 8.0.18 Community Edition on a cloud server with a 32-core CPU and 64 GB of memory and used sysbench to test multiple local disks and cloud disks. The following figure shows that ESSD PL-X outperforms the other local disks and cloud disks in both write-only and read-only scenarios. ESSD also supports 16 KB atomic writes, which makes it possible to disable the MySQL doublewrite buffer to improve performance. We expect to improve performance further by continuing to optimize the elastic cache algorithm for persistent memory. The lower-right figure shows that MySQL read performance keeps increasing as the hit rate of persistent memory used as a read cache rises.


3. Summary

ESSD innovatively combines the features of cloud and enterprise-level storage to provide users with a more convenient and intelligent storage service experience. We believe that, in the future, storage will no longer be the cumbersome iron box that people picture today. Enterprise-level storage on the cloud is centered on services and will explore more possibilities, making storage more flexible and intelligent. The release of these new ESSD features is a big step in this direction. We will continue working toward storage that is stable, secure, high-performance, economical, and intelligent.

ESSD Technology Interpretation Series
