In June 2020, Alibaba Cloud Object Storage Service (OSS) improved the guaranteed availability in its service level agreement (SLA) tenfold. We were able to do this because of the technical expertise we have accumulated over more than a decade. Our new guaranteed availability marks a global first and is 10 to 20 times higher than that of other cloud vendors, as shown in the following figure:
For Standard Zone-Redundant Storage (ZRS) in OSS, the availability guaranteed in the SLA was increased from 99.95% to 99.995%. This means we pay compensation if server errors are returned for more than five out of every 100,000 requests.
Annual failure time is a common way to describe availability in the industry. Data centers are assigned different levels, T1 to T4, which have the following availability metrics:
The availability of a network service is usually represented by the service unavailability duration. For example, 99.999% availability indicates an annual failure time of about five minutes. The following table describes the annual failure time for each availability level.
For an instance-based cloud service that provides compute instances, such as Alibaba Cloud Elastic Compute Service (ECS), availability is directly related to the time the instance is available. Therefore, the availability of such a service is also defined by the annual failure time.
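The link between an availability percentage and an annual failure time is simple arithmetic. The following minimal Python sketch shows it, assuming a 365-day year. The function name and the list of availability levels are illustrative choices and are not taken from the original table.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def annual_failure_minutes(availability_percent: float) -> float:
    """Return the yearly failure time, in minutes, implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative availability levels (an assumption, not the article's table).
    for level in (99.9, 99.95, 99.99, 99.995, 99.999):
        print(f"{level}% -> {annual_failure_minutes(level):.1f} minutes per year")
```

For 99.999% this gives roughly 5.3 minutes per year, which matches the "about five minutes" figure above.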
As a cloud-based resource access service, OSS provides serverless API calls instead of instances. Therefore, we cannot calculate the service availability of OSS by the annual failure time. Alibaba Cloud OSS uses the error rate (the number of failed requests as a proportion of the total number of requests) to calculate the service availability.
Error rate per five minutes = Number of failed requests in five minutes/Total number of valid requests in five minutes × 100%
Using a longer time interval to calculate a request error rate makes cloud services look better, because a longer interval includes more requests and usually results in a lower error rate. Therefore, we calculate the error rate every five minutes to hold ourselves to a higher standard. Redundancy is the key to the design of a high-availability system, and five minutes is the typical time the industry allows for troubleshooting a machine. Restoring machines within this window keeps the system error rate low.
Service availability = (1 - Σ Error rates per five minutes in a service cycle/Total number of five-minute periods in the service cycle) × 100%
OSS charges monthly fees, so the service cycle is one calendar month. The service availability in a month is obtained by summing the error rates per five minutes in this service cycle, dividing the sum by the total number of five-minute periods in the service cycle (30 × 24 × 60/5 = 8640, for a month containing 30 days), and then subtracting the average error rate from 1.
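The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the OSS billing implementation. The function names and the input shape (one `(failed, valid_total)` pair per five-minute period) are assumptions, as is the choice to treat a period with no valid requests as error-free.

```python
from typing import Sequence, Tuple

PERIODS_PER_30_DAY_MONTH = 30 * 24 * 60 // 5  # 8640 five-minute periods


def error_rate(failed: int, valid_total: int) -> float:
    """Error rate of one five-minute period, as a fraction between 0.0 and 1.0."""
    if valid_total == 0:
        # Assumption: a period with no valid requests counts as error-free.
        return 0.0
    return failed / valid_total


def service_availability(periods: Sequence[Tuple[int, int]]) -> float:
    """Return the service availability (in percent) for one service cycle.

    `periods` holds one (failed, valid_total) pair per five-minute period.
    """
    if not periods:
        raise ValueError("a service cycle must contain at least one period")
    rates = [error_rate(failed, total) for failed, total in periods]
    # Average the per-period error rates, then subtract the average from 1.
    return (1 - sum(rates) / len(rates)) * 100


if __name__ == "__main__":
    # A 30-day month in which a single period failed completely.
    month = [(0, 1000)] * (PERIODS_PER_30_DAY_MONTH - 1) + [(1000, 1000)]
    print(f"{service_availability(month):.4f}%")
```

Because every period carries the same weight regardless of how many requests it contains, one fully failed period costs 1/8640 of the month's availability.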
According to this formula, if the error rate per five minutes is too high, the service availability declines. Therefore, improving the request success rate per five minutes is crucial to increasing availability.
If the annual failure time is 26 minutes, the resulting time-based service availability is 99.995%. Now apply the request error rate approach used for OSS, with the error rate calculated every five minutes, and assume these 26 minutes fall within a single 30-day service cycle. The error rate is 100% for at least the five full five-minute periods that make up these 26 minutes, and all other five-minute periods are assumed to be free of errors. Here, the maximum availability is 1 - 5 × 100%/8640 = 1 - 0.058% = 99.942%. Therefore, the request error rate approach for OSS is stricter.
Using the preceding formula, we can calculate the actual availability for one calendar month. According to the OSS SLA, if the service availability commitment is not met, we pay the promised compensation to the affected customers.
An analysis of the SLAs of cloud vendors such as AWS, Azure, GCS, Alibaba Cloud, Tencent Cloud, and Huawei Cloud shows that the availability guaranteed by Alibaba Cloud OSS is 10 to 20 times higher than that of its competitors. In addition, Alibaba Cloud OSS adopts the most rigorous approach, calculating error rates per five minutes. This exemplifies our "customer first" philosophy. Among the foregoing vendors, one public cloud vendor with roots in the traditional storage industry still calculates availability by available time, just as for traditional offline storage.
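The comparison above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text: a 365-day year for the time-based figure, a 30-day month for the request-based figure, and all 26 failure minutes falling inside that month. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600 minutes
PERIODS_PER_MONTH = 30 * 24 * 60 // 5       # 8640 five-minute periods


def time_based_availability(failure_minutes: float) -> float:
    """Availability (in percent) measured by annual failure time."""
    return (1 - failure_minutes / MINUTES_PER_YEAR) * 100


def request_based_max_availability(failure_minutes: int) -> float:
    """Upper bound on monthly availability (in percent) with five-minute error rates.

    Every full five-minute period inside the outage has a 100% error rate;
    all other periods are assumed to be error-free.
    """
    fully_failed_periods = failure_minutes // 5
    return (1 - fully_failed_periods / PERIODS_PER_MONTH) * 100


if __name__ == "__main__":
    print(f"{time_based_availability(26):.3f}%")         # 99.995%
    print(f"{request_based_max_availability(26):.3f}%")  # 99.942%
```

The same 26 minutes of failure satisfy a 99.995% time-based target but fall well short of it under the five-minute error rate calculation.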
Alibaba Cloud OSS is a cloud service based on the R&D we have done for more than a decade. We formulated the following availability system after making availability our core trait.
Alibaba Cloud OSS provides Locally Redundant Storage (LRS), which is deployed in a single zone, and ZRS, which is deployed across three zones. The two storage types share the same logical architecture, which mainly includes the following modules: Apsara Name Service and Distributed Lock Synchronization System, Apsara Distributed File System, OSS metadata (Youchao distributed key-value (KV) indexing), OSS servers, and network load balancing.
In terms of the physical architecture, ZRS offers disaster recovery at the data center level by distributing replicas of user data to multiple zones in a single region. In this way, when a data center is unavailable due to a fire, typhoon, flood, power failure, or network disconnection, we can still provide services with high consistency. In Alibaba Cloud OSS, neither service interruption nor data loss occurs during failover. This meets the strict demands of zero recovery time objectives (RTOs) and zero recovery point objectives (RPOs) for critical service systems. ZRS can provide 99.9999999999% data durability and 99.995% service availability.
To achieve higher availability, we need a sound redundancy design at the physical layer. The following technologies are used:
Established in 2009 as one of the core modules at the underlying layer of Apsara, Apsara Name Service and Distributed Lock Synchronization System provides services including consistency, distributed locking, and message notifications. In performance, scalability, and O&M, it is superior to open-source software with similar functions, such as ZooKeeper and etcd.
Apsara Name Service and Distributed Lock Synchronization System has a two-layer architecture in which the backend module maintains consistency and the frontend hosts distribute traffic.
This architecture enables fast failover through multi-VIP redundancy, transparent switching at the frontend, and Paxos groups for redundancy consensus arbitration, providing high availability during consensus collaboration.
As a second-generation distributed storage system developed by Alibaba, Apsara Distributed File System 2.0 has reached its full potential in performance, scale, and cost-effectiveness and can be accessed in more ways. It further enhances the automation and intelligence of system deployment and O&M, while inheriting the high reliability, high availability, and strong data consistency of Apsara Distributed File System 1.0.
The Youchao distributed KV system provides distributed KV metadata for OSS. As the earliest system developed by Alibaba Cloud, it has gained years of experience in large-scale clusters by serving OSS. In 2014, it incorporated multi-instance redundancy by dividing KV pairs into partition groups, each composed of multiple replicas. In a partition group, a leader node is elected by using a consensus protocol to provide services to external entities. When the leader node fails or a network partition occurs, a new leader node can be quickly elected to take over the services for the partition. This feature improves the availability of OSS metadata, as shown in the following figure.
The service layer of OSS focuses on data organization and function implementation. Because the underlying Apsara Distributed File System and Youchao provide the distributed capabilities, the OSS service layer is designed to be stateless, so failover is fast and availability improves. However, because OSS is multi-tenant, quality of service (QoS) monitoring and isolation are the keys to ensuring availability for each tenant.
OSS is subject to a large number of access requests. Therefore, load balancing is implemented at the access layer. In load balancing, VIPs are bound to provide high-availability services and connect to frontend clusters of OSS. This enables quick failover upon the failure of any module to ensure availability. OSS offers high-traffic and high-performance access based on load balancing expertise from the Alibaba Cloud network product team.
Because OSS serves data over HTTP and HTTPS, it is exposed to attacks from the Internet and VPC networks, such as distributed denial-of-service (DDoS) attacks. One purpose of such attacks is to compromise the services of OSS, which reduces overall service availability. Protection against attacks is therefore crucial to ensuring OSS availability.
Attackers can target OSS by trying to use up its bandwidth (bandwidth congestion) or exhaust its computing resources (resource exhaustion). Potential attacks include network traffic-based attacks (L3 and L4 DDoS attacks) for bandwidth congestion, and L4 CC attacks (link resources) and L7 attacks (application resources) for resource exhaustion. The following table classifies potential attack types.
Storage Operations and Maintenance System, a management and control platform of Alibaba Cloud OSS, is intended for internal development, O&M, and operation users. It has been made available for five major services: OSS, File Storage NAS, Tablestore, Log Service (SLS), and Function Compute. For these services, it provides features such as real-time data monitoring, intelligent O&M management, rapid alert response, and security auditing. It also strives to empower security services with accuracy, efficiency, and intelligence.
To better manage OSS availability metrics and improve O&M capabilities, Storage Operations and Maintenance System is designed for monitoring and alerting, analysis and diagnosis, and problem resolution based on fault identification, location, and troubleshooting. It also provides monthly SLA management to monitor a monthly list of underperforming SLA metrics and determine the reasons for this underperformance. This allows us to continuously improve our SLA metrics.
OSS Brain is a smart O&M platform that aims to leverage data and algorithms to ensure OSS stability and enable online O&M and operation. It analyzes online data to provide intelligent decision-making services, including machine isolation, active online warning, user profiling, anomaly detection, resource scheduling, and user isolation. It implements agile intelligent O&M and fast error isolation to improve availability.
OSS is a regional service and may be unavailable due to a regional fault. To offer higher service availability, OSS provides a high-availability solution for active geo-redundancy, as shown below:
This enables quick failover in different regional fault scenarios, offering an RPO in seconds and ensuring service application continuity.
OSS also provides the following management mechanisms to improve service availability:
OSS supports Double 11 and has handled its traffic peaks for many years, continuously improving its product architecture, features, and stability along the way. OSS also serves millions of users of Alibaba Cloud's public cloud, handling loads from various industries. Based on these years of experience, we have formulated a mechanism for continuous availability improvement.
Although OSS has improved its SLA availability tenfold, we must continue to improve our availability in scenarios such as abnormal upgrades, super hotspots, and highly frequent attacks.
Learn more about Alibaba Cloud Object Storage Service by visiting the product page.