Leading a New Direction in IO Performance Elasticity

Preface

As one of the core components of IaaS, Alibaba Cloud ESSD provides ECS with low-latency, persistent, and highly reliable block storage, and has become an industry benchmark for all-flash block storage among cloud vendors. As more enterprises and core applications move to the cloud, and as container and serverless architectures develop rapidly, new challenges and requirements arise for the elasticity of block storage IO performance. Against this background, the Alibaba Cloud storage team has introduced a new cloud disk specification, ESSD Auto PL, which decouples performance from capacity and provides two key features: on-demand IO performance configuration and on-demand IO performance burst. Based on typical block storage business scenarios, this article introduces the features of the new Auto PL product and the technical principles behind it.

IO elasticity requirements and business pain points of cloud storage

With the development of cloud native technology, more and more enterprises are building large-scale, elastic, and feature-rich distributed businesses on the cloud, based on virtualization, elastic scaling, and the cloud native ecosystem of distributed frameworks, containers, orchestration systems, continuous delivery, and rapid iteration. New forms of computing are becoming shorter-lived and more lightweight, which places more demands on the elasticity of block storage IO performance. (Performance is usually described by IOPS, input/output operations per second, and by throughput in BPS, bytes per second.) The following are common business pain points:

• VM/container batch startup: when compute instances start, the system disk consumes a large amount of IOPS and throughput (BPS) in a short time.

• Business peaks: when the customer's business faces unexpected surges, the cloud disk and VM need to scale elastically to meet short-term burst performance demands.

• Periodic task processing: OLAP and batch processing workloads periodically submit massive numbers of tasks at predictable times, which requires the cloud disk to scale its performance elastically for those windows.

Traditional block storage products couple performance to capacity. Users obtain a given IOPS/BPS upper limit by purchasing cloud disk capacity, and increase both capacity and IO performance by expanding the disk. ESSD supports multiple performance levels (PL): PL0, PL1, PL2, and PL3. Each level has a different IO performance upper limit, and customers can raise the level through the cloud disk configuration change function to obtain a higher IOPS/BPS upper limit.

Cloud native businesses make full use of the elasticity of the cloud, but performance is typically planned over a long time horizon, so some storage performance headroom is usually reserved. In addition, a considerable share of cloud traffic shows pronounced peaks and valleys: the business runs at low load most of the time, and its peaks are difficult to predict accurately. A typical bursty IO workload may see one or more IO bursts within a given period, each short in duration and high in peak performance. This is common in flash sales (seckill) and other bursty Internet business scenarios, and it poses new challenges for performance planning: if too much performance is reserved, resources sit idle and are wasted day to day; if too little is reserved, the business suffers when a sudden traffic peak arrives. In short, it is very difficult to plan performance accurately through cloud disk expansion or configuration changes alone.

ESSD Auto PL

In response to these pain points, Alibaba Cloud has introduced the ESSD Auto PL product specification. It supports two modes, on-demand performance configuration and on-demand performance burst, and a performance ceiling per unit capacity of up to 1000 IOPS/GB. On-demand performance configuration mainly targets predictable, periodic IO traffic. When creating an ESSD Auto PL disk, users select the storage capacity and can separately configure an additional IO performance upper limit, which decouples IO performance from capacity. For predictable IO peaks, users can adjust IO performance to match business requirements and get predictable response capability.
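
The decoupling described above can be illustrated with a minimal sketch. The article only states that extra performance is configured separately from capacity and that the ceiling is 1000 IOPS/GB; the rule "baseline plus provisioned, capped by capacity × 1000" and the names `AutoPLDisk`, `baseline_iops`, and `provisioned_iops` are assumptions made for illustration, not the product's API or exact formula.

```python
from dataclasses import dataclass, replace

MAX_IOPS_PER_GB = 1000  # per-unit-capacity ceiling quoted in the article


@dataclass(frozen=True)
class AutoPLDisk:
    capacity_gb: int
    baseline_iops: int         # hypothetical input: IOPS granted by capacity alone
    provisioned_iops: int = 0  # extra IOPS configured independently of capacity

    def iops_limit(self) -> int:
        """Effective IOPS upper limit (assumed rule: baseline + provisioned, capped)."""
        ceiling = self.capacity_gb * MAX_IOPS_PER_GB
        return min(self.baseline_iops + self.provisioned_iops, ceiling)


# Raise the limit ahead of a predictable peak without touching capacity.
disk = AutoPLDisk(capacity_gb=100, baseline_iops=5_000)
peak = replace(disk, provisioned_iops=20_000)
print(disk.iops_limit(), peak.iops_limit())
```

The point of the sketch is that the capacity field never changes between `disk` and `peak`; only the separately configured performance does.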

For unexpected business peaks, Auto PL supports the on-demand performance burst mode, which provides up to 1 million IOPS and 4 GB/s per disk. The cloud disk adjusts automatically to actual performance needs, so IO performance no longer has to be predicted and planned. This mode makes full use of the elasticity of ESSD distributed storage and solves the problem of performance planning under sudden traffic. It is billed on a pay-as-you-go (postpaid) basis: users pay only for the actual reads and writes that exceed the preconfigured performance, which keeps the business running stably while minimizing resource configuration cost. Take the burst traffic scenario of a large Internet e-commerce company as an example. The business originally used ESSD PL1, with a performance upper limit of 50,000 IOPS and 350 MB/s. During traffic bursts, 2.3% of its cloud disks hit the PL1 upper limit, which affected the business, and because the peaks were short, the traffic peak could not be estimated accurately. The traditional approach would be to upgrade to ESSD PL2 to absorb the bursts. With ESSD Auto PL in on-demand burst mode, the storage TCO of the service decreased by 49%.
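
As a rough illustration of billing only for IO above the preconfigured performance, the sketch below counts the IOs that exceed a configured limit. The per-second sampling granularity, the function name `billable_burst_ios`, and the sample values are assumptions for illustration; the article does not describe the actual metering method.

```python
BURST_CAP_IOPS = 1_000_000  # maximum single-disk burst IOPS quoted in the article


def billable_burst_ios(iops_per_second, configured_iops, burst_cap_iops=BURST_CAP_IOPS):
    """Count IOs served above the preconfigured limit (assumed 1-second samples)."""
    total = 0
    for observed in iops_per_second:
        served = min(observed, burst_cap_iops)      # the disk cannot exceed the burst cap
        total += max(0, served - configured_iops)   # only the excess counts as burst IO
    return total


# Five seconds of traffic against a 50,000 IOPS preconfigured limit.
samples = [30_000, 50_000, 80_000, 120_000, 40_000]
burst_ios = billable_burst_ios(samples, configured_iops=50_000)
print(burst_ios)
```

Seconds at or below the configured limit contribute nothing, which is why a workload with short, rare peaks pays far less than one provisioned for the peak all day.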

Auto PL remains compatible with the baseline performance of ESSD PL1. The baseline performance of a standard Auto PL cloud disk is fully consistent with that of ESSD PL1, so existing customers and business scenarios can switch over seamlessly. In addition, ESSD Auto PL is the first in the industry to support both on-demand performance configuration and on-demand performance burst, and the two can be used together. Users can configure them flexibly according to their actual IO traffic model.

Auto PL Technology Analysis

As the first cloud disk to support decoupling of performance from capacity and load-based elastic performance scaling, ESSD Auto PL has to solve many technical challenges, such as how to quickly perceive changes in business load, how to dynamically release resources on demand to support performance scaling, and how to quickly rebalance load. After repeated refinement, ESSD Auto PL adopts a fine-grained cloud disk segmentation mechanism that lets a disk use the resources of the entire back-end storage cluster in a balanced way and adjust them quickly and dynamically. Real-time monitoring and scheduling of cluster capacity and performance water levels, together with multi-level QoS isolation, guard against the traffic impact and multi-tenant IO interference caused by IO performance bursts.

Cloud disk fine-grained segmentation

ESSD Auto PL supports up to 1000 IOPS/GB, far exceeding the IOPS per unit capacity of NAND SSDs. The LBA address space of each ESSD cloud disk is divided into multiple stripe groups. IO to the stripe groups is dispersed by distributed algorithms and processed by different storage nodes, making full use of the RDMA network and high-performance storage. ESSD Auto PL uses a fine-grained address space management mechanism that allows even small-capacity cloud disks to be spread across many storage nodes, which widens the range over which IO can be scheduled. This wider scheduling range also reduces single-node hotspots in the storage cluster and some IO long-tail latency.
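
The effect of stripe granularity can be sketched as follows. The article does not give stripe sizes or the placement algorithm, so the round-robin placement with a per-disk starting node, the stripe sizes, and the names `stripe_group`, `node_for`, and `nodes_touched` are all hypothetical stand-ins for the "distributed algorithms" mentioned above.

```python
import zlib

MiB = 1024 ** 2
GiB = 1024 ** 3


def stripe_group(offset_bytes: int, stripe_size: int) -> int:
    """Index of the stripe group that contains a byte offset of the disk's LBA space."""
    return offset_bytes // stripe_size


def node_for(disk_id: str, group: int, nodes: list) -> str:
    """Place a stripe group on a node: round-robin from a per-disk starting node."""
    start = zlib.crc32(disk_id.encode())
    return nodes[(start + group) % len(nodes)]


def nodes_touched(disk_id: str, capacity_bytes: int, stripe_size: int, nodes: list) -> set:
    """Set of storage nodes that serve at least one stripe group of the disk."""
    groups = -(-capacity_bytes // stripe_size)  # ceiling division
    return {node_for(disk_id, g, nodes) for g in range(groups)}


nodes = [f"node-{i}" for i in range(8)]
coarse = nodes_touched("d-example", 20 * GiB, 32 * GiB, nodes)  # one huge stripe group
fine = nodes_touched("d-example", 20 * GiB, 4 * MiB, nodes)     # many small stripe groups
print(len(coarse), len(fine))
```

With a coarse stripe, a small disk lands on a single node and is bounded by that node's performance; with a fine stripe, the same disk is spread over every node in the sketch.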

Multi-tenant isolation and IO priority management

EBS is a typical multi-tenant service, and bursts of high-throughput or high-IOPS traffic can affect the IO latency of lightly loaded tenants. IO bursts of up to 1 million IOPS place even higher demands on isolation. ESSD supports QoS at both the instance level and the cloud disk level. Instance QoS isolates IO between virtual machines; its upper limit is closely tied to the number of vCPU cores of the instance the user purchased. Some small instance types support storage burst credits: they accumulate unused IO quota and can spend it on up to 30 minutes of burst performance. Cloud disk QoS sets the performance upper limit of each cloud disk attached to the instance, which depends on the cloud disk specification. IO sent from the VM passes through cloud disk QoS and then instance QoS in turn, and burst IO traffic is marked so that the whole IO path can accurately identify it under congestion and give priority to non-burst traffic. To handle local hotspots and IO congestion caused by burst IO traffic, the system performs load awareness and prediction of IO traffic at 10-millisecond granularity, completes dynamic queue scheduling and concurrency adjustment within seconds, and combines these with a hardware-offloaded dynamic queue distribution mechanism, so that elastic performance increases do not cause interference between tenants.
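
The marking and prioritization idea can be sketched with token buckets and a priority queue. This is an illustrative model only: the article does not say how QoS is implemented, so `TokenBucket`, `classify`, `Dispatcher`, and all rates and capacities below are hypothetical. IO covered by both the disk and instance baselines is marked normal, anything beyond is marked burst, and normal IO is dispatched first.

```python
import heapq
from itertools import count

NORMAL, BURST = 0, 1  # lower value is dispatched first


class TokenBucket:
    """Baseline IO quota; unused tokens accumulate up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity

    def refill(self, elapsed_s: float) -> None:
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed_s)


def classify(disk_qos: TokenBucket, instance_qos: TokenBucket) -> int:
    """Mark an IO NORMAL if both the disk and instance baselines cover it, else BURST."""
    if disk_qos.tokens >= 1 and instance_qos.tokens >= 1:
        disk_qos.tokens -= 1
        instance_qos.tokens -= 1
        return NORMAL
    return BURST


class Dispatcher:
    """Under congestion, serve NORMAL IO before BURST IO; FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, io_id: str, mark: int) -> None:
        heapq.heappush(self._heap, (mark, next(self._seq), io_id))

    def drain(self) -> list:
        order = []
        while self._heap:
            _, _, io_id = heapq.heappop(self._heap)
            order.append(io_id)
        return order


instance = TokenBucket(rate_per_s=3, capacity=3)
disk_a = TokenBucket(rate_per_s=2, capacity=2)
disk_b = TokenBucket(rate_per_s=2, capacity=2)

dispatcher = Dispatcher()
for io_id, disk in [("a0", disk_a), ("a1", disk_a), ("a2", disk_a),
                    ("a3", disk_a), ("b0", disk_b), ("b1", disk_b)]:
    dispatcher.submit(io_id, classify(disk, instance))
order = dispatcher.drain()
print(order)
```

In this run, disk A exceeds its baseline after two IOs, so its remaining IOs are marked burst and queue behind disk B's in-baseline IO even though they arrived earlier.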

Multi-cluster performance water level load balancing

Extreme IO performance elasticity introduces new challenges for the performance SLA; in particular, an IO burst limit of 1 million IOPS carries a greater risk of traffic congestion. For this reason, ESSD uses a new multi-cluster performance water level load balancing mechanism. This scheduling mechanism works at multiple levels: cluster, storage node, and IO thread. Based on each cloud disk's performance configuration, it monitors the IO load of every component in real time, achieving second-level IO load balancing within a cluster and minute-level traffic scheduling between clusters. When the performance water levels of clusters or storage nodes differ significantly, cloud disk hot migration is triggered in real time to resolve contention when many heavily loaded cloud disks compete for performance at the same time.
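
A water-level check that triggers migration might look like the following minimal sketch. The greedy selection rule, the 20% threshold, and the names `plan_migration` and `node_capacity_iops` are assumptions for illustration; the article does not describe the actual scheduling policy.

```python
def plan_migration(node_loads, node_capacity_iops, threshold=0.2):
    """Pick one disk to move from the fullest node to the emptiest node.

    node_loads maps node -> {disk: current IOPS}. Returns (disk, src, dst),
    or None when the water-level gap is within the (hypothetical) threshold.
    """
    levels = {n: sum(d.values()) / node_capacity_iops for n, d in node_loads.items()}
    src = max(levels, key=levels.get)
    dst = min(levels, key=levels.get)
    gap = levels[src] - levels[dst]
    if gap <= threshold or not node_loads[src]:
        return None
    # Moving half of the gap would equalize the two nodes; pick the closest disk.
    target_iops = gap * node_capacity_iops / 2
    disk = min(node_loads[src], key=lambda d: abs(node_loads[src][d] - target_iops))
    return disk, src, dst


loads = {
    "n1": {"d1": 60_000, "d2": 20_000},  # water level 0.8
    "n2": {"d3": 10_000},                # water level 0.1
    "n3": {"d4": 30_000},                # water level 0.3
}
plan = plan_migration(loads, node_capacity_iops=100_000)
print(plan)
```

Here moving `d2` narrows the gap between `n1` and `n2`, whereas moving the larger `d1` would simply swap which node is overloaded.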

Summary

As the main ESSD product going forward, ESSD Auto PL serves all industries and customers that use elastic computing. Its elasticity and flexibility reduce the difficulty of IT capacity planning and the risk of planning mistakes, which should appeal to operations and maintenance staff and to IT resource procurement staff. Both new and existing Alibaba Cloud customers can purchase ESSD Auto PL as an alternative to ESSD PL1. Auto PL gives customers an affordable, simple, and convenient experience as their business grows. We look forward to your extensive use of Auto PL and welcome your feedback to help us do better. We will continue to improve the performance and service quality assurance of ESSD through technological innovation, improve the user experience, and provide customers with always-on computing services.
