ESSD Auto PL Specifications, Leading A New Direction of I/O Performance Elasticity

Part 4 of this 5-part series introduces the new features of Auto PL and the technical principles behind it based on the typical business scenarios in block storage.

By Xi Jian from Alibaba Cloud Storage Team

1. Preface

As one of the core components of IaaS, Alibaba Cloud ESSD provides low-latency, highly durable, and highly reliable block storage for Elastic Compute Service (ECS). It has become the industry benchmark for all-flash block storage among cloud vendors. As more enterprises and core applications migrate to the cloud, and as container and Serverless architectures continue to grow, block storage faces new requirements for I/O performance elasticity. Against this backdrop, the Alibaba Cloud Storage Team launched a new cloud disk specification, ESSD Auto PL, which decouples performance from capacity and offers two key features: on-demand performance configuration and on-demand performance burst. This article introduces the new features of Auto PL and the technical principles behind them, based on typical business scenarios in block storage.

2. I/O Elasticity Requirements and Business Pain Points for Cloud Storage

As cloud-native technologies mature, more enterprises are combining the virtualization and elastic scaling of cloud computing with distributed frameworks, containers, orchestration systems, continuous delivery, and rapid iteration. On this foundation, they have built large-scale, elastically scalable, and diverse distributed workloads on the cloud. Computing is becoming increasingly short-lived and lightweight, which demands greater elasticity from block storage I/O performance. That performance is usually described in terms of input/output operations per second (IOPS) and throughput in bytes per second (BPS). Common business pain points are listed below:

  • VM/Container Batch Startup: When a computing instance starts, its system disk consumes a large amount of IOPS and throughput (BPS) in a short period.
  • Business Peak: The customer's business encounters unexpected bursts, so cloud disks and VMs need to scale elastically to meet short-term burst performance requirements.
  • Periodic Task Processing: OLAP and batch processing periodically submit a large number of tasks within a foreseeable period, so cloud disks need elastic burst capability to handle them.

Traditional block storage products couple performance to capacity: users raise the IOPS/BPS upper limit by buying more cloud disk capacity, so scaling out a cloud disk increases both its capacity and its I/O performance. ESSD also offers multiple performance levels (PLs), including PL0/1/2/3, each with a different I/O performance cap. Customers can upgrade the PL through the cloud disk configuration change function to obtain a higher IOPS/BPS cap.

Cloud-native businesses make full use of the elasticity of the cloud. Because these businesses run for a long time, some storage performance margin is usually reserved. In addition, a considerable share of cloud business traffic has pronounced peaks and troughs: the load is low most of the time, and both the timing and the height of the peaks are difficult to predict accurately. A typical bursty workload may see one or more I/O bursts within a given period, each short in duration and high in peak. Such bursts are common in scenarios such as flash sales, and they pose new challenges for performance planning. If too much margin is reserved, a large amount of resources is wasted. If too little is reserved, burst traffic affects the business. All in all, it is very difficult to plan performance accurately through disk scale-out or configuration changes.


3. ESSD Auto PL

In response to the pain points above, Alibaba Cloud has introduced a new product specification: ESSD Auto PL. It supports two performance modes, on-demand configuration and on-demand burst, and a very high performance cap per unit of capacity, up to 1,000 IOPS/GB. On-demand performance configuration mainly targets predictable, periodic I/O traffic. In addition to selecting the storage capacity when creating an ESSD Auto PL disk, users can configure an additional I/O performance upper limit, which decouples I/O performance from capacity. Users can then adjust I/O performance ahead of predictable I/O peaks according to business requirements, so that the disk responds predictably.
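
To make the decoupling concrete, here is a minimal Python sketch of how a configured IOPS ceiling could be derived. Only the 1,000 IOPS/GB figure comes from the text above. The function name, the baseline value used in the example, and the choice to apply the per-GB figure as a bound on baseline plus provisioned IOPS are illustrative assumptions, not the actual Alibaba Cloud formula.

```python
MAX_IOPS_PER_GB = 1_000  # unit-capacity cap quoted in the article


def configured_iops(capacity_gb: int, baseline_iops: int, provisioned_iops: int = 0) -> int:
    """Return the IOPS ceiling of a disk before any burst is applied.

    Assumption: the ceiling is baseline + provisioned IOPS, bounded by
    capacity_gb * 1,000 IOPS/GB. The real product rule may differ.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    if baseline_iops < 0 or provisioned_iops < 0:
        raise ValueError("IOPS values must be non-negative")
    per_gb_cap = capacity_gb * MAX_IOPS_PER_GB
    return min(baseline_iops + provisioned_iops, per_gb_cap)


# Hypothetical 100 GB disk with an assumed baseline of 5,000 IOPS.
print(configured_iops(100, 5_000, 20_000))   # 25000: capacity unchanged, performance raised
print(configured_iops(100, 5_000, 200_000))  # 100000: bounded by 1,000 IOPS/GB
```

The point of the sketch is that the second argument pair changes while the capacity stays at 100 GB, which is what a capacity-coupled disk cannot do.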


For bursty workloads whose peaks are hard to predict, Auto PL supports the on-demand burst performance mode. It provides up to 1,000,000 IOPS per disk and up to 4 GB/s of throughput. The cloud disk's performance adjusts automatically to actual demand, with no need to predict or plan I/O performance. This mode makes full use of the elasticity of ESSD distributed storage and solves the performance planning problem under burst traffic. The feature is billed on a pay-as-you-go basis: users pay only for the reads and writes that exceed the pre-configured performance. This keeps the service running stably while minimizing the user's resource configuration overhead.

Here is an example of a burst traffic scenario at a large Internet e-commerce company. Initially, the business used ESSD PL1, with a maximum performance of 50,000 IOPS and 350 MB/s. Under burst traffic, 2.3% of the disks reached the PL1 performance cap, which affected the business. The business peaks are short, and the peak traffic cannot be estimated accurately. Traditionally, ESSD PL2 would be needed to absorb the burst traffic. With ESSD Auto PL and on-demand burst mode enabled, the storage TCO of the service is reduced by 49%.
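
The pay-as-you-go rule for burst mode, paying only for the I/O above the pre-configured performance, can be illustrated with a small Python sketch. The one-sample-per-second granularity, the function name, and the sample trace are assumptions made for illustration. The article does not describe the real metering interval or prices, so none are modeled here.

```python
BURST_MAX_IOPS = 1_000_000  # per-disk burst cap quoted in the article


def billable_burst_ios(iops_per_second: list[int], configured_iops: int) -> int:
    """Count the I/Os served above the configured ceiling.

    Assumption: one IOPS sample per second, each clipped to the burst cap,
    and only the part above configured_iops counts as burst I/O.
    """
    total = 0
    for sample in iops_per_second:
        served = min(sample, BURST_MAX_IOPS)
        total += max(0, served - configured_iops)
    return total


# Hypothetical five-second trace on a disk configured for 50,000 IOPS.
trace = [30_000, 50_000, 80_000, 120_000, 40_000]
print(billable_burst_ios(trace, 50_000))  # 100000 burst I/Os (30,000 + 70,000)
```

Seconds at or below the configured ceiling contribute nothing, which is why a short, high peak costs far less than keeping a higher performance level reserved all the time.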

Auto PL is still compatible with the baseline performance of ESSD PL1. The performance of the standard Auto PL cloud disk is the same as ESSD PL1. This enables seamless transition for existing customers and business scenarios. Moreover, for the first time in the industry, ESSD Auto PL supports both on-demand performance configuration and on-demand burst performance. They can be used in combination, and users can flexibly configure them according to the actual I/O traffic model.


4. Auto PL Technology Interpretation

ESSD Auto PL is the first cloud disk that supports both performance-capacity decoupling and automatic performance scaling according to load. This requires solving several technical challenges: quickly perceiving changes in business load, dynamically acquiring and releasing resources on demand to support performance scaling, and quickly rebalancing load. After repeated refinement, ESSD Auto PL adopted a fine-grained cloud disk segmentation mechanism, which lets a disk draw on the resources of the entire backend storage cluster and adjust quickly and dynamically. Issues introduced by I/O performance bursts, such as traffic impact and multi-tenant I/O interference, are guarded against through real-time monitoring and scheduling of cluster capacity and performance levels, multi-level Quality of Service (QoS) isolation, and other measures.

4.1 Fine-Grained Sharding of Cloud Disks

ESSD Auto PL supports a maximum of 1,000 IOPS/GB, which far exceeds the IOPS per unit of capacity of NAND SSDs. The LBA address space of each ESSD disk is divided into multiple strip groups. The I/O of each strip group is scattered by a distributed algorithm and processed by different storage nodes, which makes full use of the RDMA network and high-performance storage. ESSD Auto PL manages the address space at a fine granularity, so even small-capacity cloud disks are scattered across many storage nodes and benefit from a wider I/O scheduling range. This wider scheduling range also reduces single-node hot spots in the storage cluster and some long-tail I/O latency.
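
The effect of granularity on how widely a small disk is spread can be sketched in Python. This is not the ESSD placement algorithm, which the article does not disclose. The hash-based placement, the strip group sizes, the node count, and all names below are hypothetical.

```python
import hashlib


def strip_group_of(offset_bytes: int, strip_group_bytes: int) -> int:
    """Index of the strip group that contains a byte offset in the LBA space."""
    return offset_bytes // strip_group_bytes


def node_for(disk_id: str, group: int, num_nodes: int) -> int:
    """Pick a storage node for a strip group (assumed: a plain hash placement)."""
    digest = hashlib.sha256(f"{disk_id}:{group}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def nodes_touched(disk_id: str, disk_bytes: int, strip_group_bytes: int, num_nodes: int) -> set[int]:
    """Set of storage nodes that serve at least one strip group of the disk."""
    groups = -(-disk_bytes // strip_group_bytes)  # ceiling division
    return {node_for(disk_id, g, num_nodes) for g in range(groups)}


GIB, MIB = 1 << 30, 1 << 20
disk_size = 40 * GIB  # a small disk
coarse = nodes_touched("disk-demo", disk_size, 64 * GIB, num_nodes=16)
fine = nodes_touched("disk-demo", disk_size, 4 * MIB, num_nodes=16)
print(len(coarse), len(fine))
```

With the coarse (hypothetical) 64 GiB granularity the whole 40 GiB disk falls into one strip group and therefore one node. With the fine 4 MiB granularity the same disk has 10,240 strip groups, so its I/O can be served by many nodes in parallel.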


4.2 Multi-Tenant Isolation and I/O Priority Management

EBS is a typical multi-tenant service, in which bursts of high-throughput or high-IOPS traffic can affect the I/O latency of lightly loaded tenants. Bursting up to 1,000,000 IOPS places even higher requirements on isolation. ESSD supports QoS at both the instance level and the cloud disk level. Instance QoS provides I/O isolation among virtual machines. Its upper limit is closely tied to the number of vCPU cores of the instance the user purchased. Some small instance specifications support storage credit bursting, which accumulates unused I/O quota to sustain a performance burst for up to 30 minutes. Cloud disk QoS sets the performance upper limit of each cloud disk within the instance, which depends on the cloud disk specification. I/O sent from the VM passes through the two QoS levels in turn: first the cloud disk, then the instance. Burst I/O is marked along the way so that it can be identified accurately along the entire I/O path when traffic is congested, and non-burst traffic is given priority. To handle the local hot spots and I/O blocking that burst I/O traffic can cause, the system perceives business load and predicts I/O traffic at the ten-millisecond level, and completes dynamic queue scheduling and concurrency adjustment at the second level. Together with a hardware-offloaded dynamic queue distribution mechanism, this avoids performance interference caused by elastic scaling in multi-tenant scenarios.
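
The two-level path with burst marking can be sketched as follows. This is a simplified model under stated assumptions, not the ESSD implementation: time is divided into abstract ticks, the disk level marks every request beyond its baseline as burst, and the instance level serves non-burst requests first when it has to defer some. All class names and limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IORequest:
    disk_id: str
    burst: bool = False  # set by the disk-level QoS stage


class DiskQoS:
    """Disk level: requests beyond the per-tick baseline are marked as burst."""

    def __init__(self, baseline_per_tick: int) -> None:
        self.baseline_per_tick = baseline_per_tick
        self.seen_this_tick = 0

    def admit(self, req: IORequest) -> IORequest:
        self.seen_this_tick += 1
        req.burst = self.seen_this_tick > self.baseline_per_tick
        return req


class InstanceQoS:
    """Instance level: caps I/O per tick and prefers non-burst requests."""

    def __init__(self, limit_per_tick: int) -> None:
        self.limit_per_tick = limit_per_tick

    def schedule(self, requests: list[IORequest]) -> tuple[list[IORequest], list[IORequest]]:
        # sorted() is stable and False sorts before True, so non-burst
        # requests keep their arrival order and come first.
        ordered = sorted(requests, key=lambda r: r.burst)
        return ordered[: self.limit_per_tick], ordered[self.limit_per_tick :]


# Disk A bursts (5 requests against a baseline of 2); disk B stays within baseline.
disk_a, disk_b = DiskQoS(2), DiskQoS(2)
incoming = [disk_a.admit(IORequest("A")) for _ in range(5)]
incoming += [disk_b.admit(IORequest("B")) for _ in range(2)]
served, deferred = InstanceQoS(5).schedule(incoming)
print([r.disk_id for r in served], [r.disk_id for r in deferred])
```

In this run both of disk B's requests are served even though they arrived after disk A's burst, and the two deferred requests are burst I/O from disk A. That is the property the marking is meant to protect.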


4.3 Load Balancing for Multi-Cluster Performance Level

Extreme I/O performance elasticity introduces new challenges for the performance SLA. In particular, an I/O burst cap of 1,000,000 IOPS increases the risk of traffic congestion. For this reason, ESSD has a new load balancing mechanism based on performance levels across clusters. The scheduler works at multiple levels: clusters, storage nodes, and I/O threads. The I/O load of each component is monitored in real time against the configured disk performance, which enables second-level I/O load balancing and minute-level traffic scheduling within a cluster. When the load differs significantly between clusters or storage nodes, disk hot migration is triggered in real time to resolve the performance contention caused when the load of many users' cloud disks rises at the same time.
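
The migration trigger can be illustrated with a small Python sketch at the storage node level. The imbalance threshold, the greedy choice of which disk to move, and all names are assumptions made for illustration. The article does not describe the real scheduler's policy.

```python
from typing import Optional


def plan_migration(
    node_disks: dict[str, dict[str, int]], imbalance_ratio: float = 1.5
) -> Optional[tuple[str, str, str]]:
    """Return (disk, source_node, target_node), or None if load is balanced.

    node_disks maps node name -> {disk name -> current IOPS load}.
    Assumption: migrate only when the hottest node carries more than
    imbalance_ratio times the load of the coolest node.
    """
    loads = {node: sum(disks.values()) for node, disks in node_disks.items()}
    hot = max(loads, key=loads.get)
    cold = min(loads, key=loads.get)
    if loads[hot] <= imbalance_ratio * max(loads[cold], 1):
        return None
    gap = loads[hot] - loads[cold]
    candidates = node_disks[hot]
    if not candidates:
        return None
    # Moving a disk with load x changes the gap to |gap - 2x|; pick the best x.
    disk = min(candidates, key=lambda d: abs(gap - 2 * candidates[d]))
    if abs(gap - 2 * candidates[disk]) >= gap:
        return None  # no single move would narrow the gap
    return disk, hot, cold


cluster = {
    "n1": {"d1": 80_000, "d2": 30_000},
    "n2": {"d3": 20_000},
    "n3": {"d4": 40_000},
}
print(plan_migration(cluster))  # ('d2', 'n1', 'n2')
```

Moving d2 narrows the gap between n1 and n2 from 90,000 to 30,000 IOPS, whereas moving the larger d1 would only shift the hot spot to n2. A production scheduler would also weigh migration cost and cluster-level capacity, which this sketch leaves out.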


5. Summary

As the main ESSD product going forward, ESSD Auto PL targets all industries and customers that need elastic computing. Its flexibility and elasticity reduce the difficulty of IT scale planning and the risk of improper planning, which will appeal to O&M engineers and IT resource buyers. New and existing Alibaba Cloud customers can purchase ESSD Auto PL as an alternative to ESSD PL1. Auto PL gives users an economical, simple, and convenient way to handle burst business growth. We welcome you to use Auto PL and to share your feedback to help us do better. We will continue to improve the performance and service quality assurance of ESSD through technological innovation, keep improving the user experience, and provide customers with continuous and reliable computing services.

ESSD Technology Interpretation Series
