By Yang Zeqiang, nicknamed Zhujian at Alibaba.
The successful conclusion of last year's Double 11 Shopping Festival, the world's largest online shopping event, once again threw a bright spotlight on Alibaba's technological innovations and cloud computing capabilities, which supported 268.4 billion CNY (more than 38 billion USD) in sales in a single 24-hour span while still maintaining an excellent customer experience for its several million online users. When it comes to the technology behind that Double 11, we cannot forget to mention that Alibaba Group also moved all of its core systems to the cloud in 2019.
As the underlying product that made Alibaba Group's migration to the cloud possible, Elastic Compute Service (ECS) is the core infrastructure of that migration. Ensuring superior stability and performance during the cloud migration process required long-term effort from the ECS team. As one of the team's site reliability engineers, this post's author, Yang Zeqiang, participated in this revolution. In this post, Yang shares some insights about the site reliability system that was put into place at Alibaba Cloud and sheds light on why so many big tech firms are going all in on site reliability engineering, or SRE.
Site reliability engineering (SRE) refers to the engineering involved in website reliability. It was first proposed and applied by Google more than a decade ago, and in recent years it has been widely adopted by leading Internet companies inside and outside China. As we see it, Google and Netflix have the two most successful implementations of SRE in the industry. Google created a strong system and became the global authority in the field, and Netflix has taken SRE to new heights in terms of practice. It is reported that, with fewer than 10 core site reliability engineers, Netflix is able to support the O&M needed for a service that is available in 190 countries, serves hundreds of millions of paying customers, and involves tens of thousands of microservice instances. With the development of DevOps in recent years, SRE has become a familiar concept in the industry. Leading Internet companies in China, like Baidu, Alibaba, Tencent, and Meituan, have all gradually incorporated SRE into their organizational structures and recruitment. In Alibaba Group, for example, different business units have set up their own SRE teams. However, SRE responsibilities differ from department to department. So what are site reliability engineers ultimately doing for a company?
At Google, site reliability engineers are primarily responsible for service availability, system performance, and capacity-related matters for all of Google's core business systems. Based on the content of the company's book "Site Reliability Engineering", the work of Google's SRE teams includes but is not limited to:
In China, many SRE departments have responsibilities similar to those of traditional O&M departments. In essence, they are responsible for the technical O&M work that supports Internet services behind the scenes. Unlike such O&M-style SRE teams, we have been exploring and practicing SRE inside a business R&D team for more than a year. I think the core of business team SRE is the redefinition of R&D and O&M through a software engineering methodology to drive and empower business evolution. In what follows, I will describe some practices for implementing SRE in Alibaba Cloud Elastic Compute Service and the thinking behind them.
As the central product of Alibaba Cloud, Elastic Compute Service (ECS) is the main architecture behind many of Alibaba's business and e-commerce platforms. It runs the Group's internal cloud services and cloud products. As the largest cloud-computing vendor in Asia, Alibaba Cloud serves large, medium, and small enterprise customers all over the world, including various private domains and private cloud systems. ECS is the core scheduling brain, so managing it successfully is of immense importance. With the acceleration of Alibaba Group's cloud migration and the deployment of cloud products on ECS, hundreds of millions of API calls are made each day, and millions of ECS instances are created every day. Because of all of this, the ECS management and scheduling system needs to overcome several challenges in terms of capacity, system performance, and service availability:
Before we created the ECS SRE team, one major challenge for my team was the question of how we could build a highly available and stable system while also allowing for rapid growth and support for business development over the next three to five years. At that time, the ECS team was divided by business domain into instances, storage, images, networks, executive support systems, and ROS. With this organizational structure, the R&D team was able to dig deep into different vertical channels of development. However, the team as a whole lacked the perspective needed to understand all of our systems, and so it was difficult to see the overall picture.
Conway's law holds that organizations that design systems tend to produce designs that mirror their own communication structures. Simply put, an organization's structure shows through in its system architecture, and the two look more or less the same when boiled down to their fundamental parts. Following this logic, we needed to build a stability system from an overall perspective that spans all of our different business systems and teams, and a matching organizational structure is the best guarantee that such a system actually gets implemented. This is how the ECS SRE team came into being.
As discussed earlier, the responsibilities of Google's SRE team include capacity planning, distributed system monitoring, load balancing, service fault tolerance, on-call, firefighting, and business collaboration support. We also briefly described Chinese SRE teams that focus on system O&M. While exploring the implementation of ECS SRE, we learned from past successes in the industry and formed our own methodology and system of practices based on the business and team characteristics of the ECS team. My personal opinion is that there is no universal standard. We need to constantly explore solutions that fit the present situation, business characteristics, and team features. The following section describes how the ECS SRE team has worked to build a stability system.
Given that ECS can receive up to hundreds of millions of API calls every day, and the peak number of ECS instances created in a single day can be as high as a few million, the capacity and performance of management services face severe challenges. For example, the database capacity could be exhausted, and the system must deal with frequent long-tail requests. With Alibaba Group's cloud migration and its deployment of cloud products on ECS, along with the rapid development of its cloud-native environment, we urgently needed to take measures to prepare for future problems. Consider the workflow engine, which is central to ECS control. With the rapid growth of business volumes, the data in a workflow task table can exceed 3 TB in just one month. This means that even high-configuration databases cannot support several months of business development. In addition to workflows, the core order, purchase, and resource tables all face the same problem. Above all, in periods of rapid business development, ensuring business continuity is the most pressing issue we face. To solve the current capacity and performance problems and prepare for further expansion in the future, we have upgraded and renovated the basic components developed by ECS, including the workflow engine, as well as the idempotence, cache, and data cleansing frameworks. In order to empower other cloud products or teams in the future, all basic components are output in a standard manner as second-party packages.
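The idempotence framework is named above but not described. As a rough illustration of the general idea only, the following minimal sketch runs an operation at most once per client-supplied token and returns the stored result on retries. The class name `IdempotentExecutor`, the `client_token` parameter, and the in-memory dictionary (standing in for a persistent store) are all assumptions made for this example, not the ECS team's design.

```python
import threading
from typing import Any, Callable, Dict, List


class IdempotentExecutor:
    """Runs an operation at most once per client token (hypothetical sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def run(self, client_token: str, operation: Callable[[], Any]) -> Any:
        # Holding one lock during the call keeps the sketch simple; a real
        # framework would persist tokens and avoid a single global lock.
        with self._lock:
            if client_token in self._results:
                return self._results[client_token]
            result = operation()
            self._results[client_token] = result
            return result


created: List[str] = []


def create_instance() -> str:
    """Stand-in for an expensive, non-repeatable call such as creating a VM."""
    instance_id = f"i-{len(created) + 1}"
    created.append(instance_id)
    return instance_id


executor = IdempotentExecutor()
first = executor.run("token-abc", create_instance)
retry = executor.run("token-abc", create_instance)  # same token: no second instance
```

A retried request with the same token gets the original result back, so a client that times out and retries does not create a duplicate resource.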
Basic component upgrade: We upgraded the architecture of the basic business components developed by the ECS team to cope with future large-scale business growth. More specifically, we did the following:
Performance optimization: We started to use multidimensional performance optimization policies to improve the performance metrics of control services.
Systematic stability construction has been the most important aspect of our efforts over the past year. I believe that, in the area of stability governance, we must adopt an overall, end-to-end perspective with precise subdivisions from top to bottom. The following section will briefly introduce the stability governance system for ECS control.
Databases are the vital lifelines of applications. For ECS control, all core businesses run on ApsaraDB for RDS. If a database fails, the damage to an application is critical on both the control plane and the data plane. Therefore, the first thing SRE does is to maintain these vital lifelines by exercising comprehensive control over database stability. First, let's take a look at the problems faced by databases for the ECS team in large-scale business scenarios:
When faced with database problems, our strategy is to focus on both databases and businesses. It is not enough to simply optimize the database or perform business optimization alone. For example, in a typical large table problem, a great deal of space is occupied and queries are slow. If we simply provide more space at the database level and perform index optimization, this can solve the problem in the short term. However, when the business scale is large enough, database optimization can only go so far, so business optimization is required. I will now briefly introduce some ideas for optimization:
The following figure illustrates several approaches for managing database stability in ECS.
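To make the large table problem above concrete, here is a minimal sketch of one common business-side measure: purging finished records in small batches so that no single statement holds locks for long. It uses Python's built-in `sqlite3` module purely for illustration. The table name `workflow_task`, its columns, and the batch size are assumptions, and a production system on ApsaraDB for RDS would differ in syntax and tooling.

```python
import sqlite3


def purge_finished_tasks(conn: sqlite3.Connection, cutoff: int, batch_size: int = 1000) -> int:
    """Delete FINISHED tasks older than `cutoff`, one small batch per transaction."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM workflow_task WHERE id IN ("
            " SELECT id FROM workflow_task"
            " WHERE status = 'FINISHED' AND finished_at < ?"
            " LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # short transactions keep lock time low
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Hypothetical data: 100 tasks, even ids are FINISHED, finished_at equals the id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_task (id INTEGER PRIMARY KEY, status TEXT, finished_at INTEGER)"
)
rows = [(i, "FINISHED" if i % 2 == 0 else "RUNNING", i) for i in range(1, 101)]
conn.executemany("INSERT INTO workflow_task VALUES (?, ?, ?)", rows)
conn.commit()

deleted = purge_finished_tasks(conn, cutoff=51, batch_size=10)
remaining = conn.execute("SELECT COUNT(*) FROM workflow_task").fetchone()[0]
```

Running tasks and recently finished tasks are left untouched, so the cleanup can run alongside normal traffic.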
Alert monitoring and governance are critical for the discovery of problems and faults. Especially in large-scale distributed systems, precise and timely monitoring can help R&D personnel identify problems as soon as possible and mitigate or even avoid faults. However, invalid, redundant, and inaccurate low-quality alerts not only waste time, but also affect the satisfaction of SRE on-call personnel and seriously affect fault diagnosis. The main causes of the low quality of ECS alerts include:
In response to the preceding issues, our strategy is to systematically organize alerts to ensure their authenticity, accuracy, precision, and high quality. Our approach involves the following steps:
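As one illustration of reducing redundant alerts (not necessarily a step the ECS team uses), the sketch below keeps a single alert per source and rule within a suppression window. The `Alert` fields and the five-minute window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str       # emitting service (hypothetical field)
    rule: str         # name of the alert rule that fired
    timestamp: float  # seconds since some reference point


def suppress_duplicates(alerts: List[Alert], window_s: float = 300.0) -> List[Alert]:
    """Keep one alert per (source, rule) within each suppression window."""
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.rule)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_s:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


raw = [
    Alert("api", "high_latency", 0),
    Alert("db", "slow_sql", 30),
    Alert("api", "high_latency", 60),   # repeat inside the window: dropped
    Alert("api", "high_latency", 120),  # repeat inside the window: dropped
    Alert("api", "high_latency", 400),  # window elapsed: kept
]
kept = suppress_duplicates(raw)
```

On-call staff then see three alerts instead of five, while a problem that persists beyond the window still pages again.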
We adopt a 1-5-10 model for fault recovery, which means that problems are detected in 1 minute, located in 5 minutes, and resolved in 10 minutes. One-minute detection depends on the high-quality monitoring and alert systems mentioned previously, while five-minute problem locating depends on the system's fault diagnosis capabilities. To quickly diagnose faults based on existing alert information, we must confront a series of challenges:
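As a small illustration of the 1-5-10 targets, the sketch below checks a fault timeline against them. It assumes that all three targets are measured from the moment the fault started, which the text does not specify. The `FaultTimeline` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FaultTimeline:
    """Timestamps in seconds; all names here are illustrative."""
    started_at: float
    detected_at: float
    located_at: float
    resolved_at: float


TARGET_MINUTES = {"detect": 1, "locate": 5, "resolve": 10}


def check_1_5_10(t: FaultTimeline) -> Dict[str, bool]:
    """Report which of the 1-5-10 targets a fault met (measured from fault start)."""
    elapsed_minutes = {
        "detect": (t.detected_at - t.started_at) / 60,
        "locate": (t.located_at - t.started_at) / 60,
        "resolve": (t.resolved_at - t.started_at) / 60,
    }
    return {phase: elapsed_minutes[phase] <= limit for phase, limit in TARGET_MINUTES.items()}


# Detected after 45 s, located after 4 min, resolved after 12 min.
report = check_1_5_10(FaultTimeline(0, 45, 240, 720))
```

Here detection and locating meet their targets, while resolution misses the 10-minute target, which points a review toward the recovery step rather than monitoring.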
At first, we divided the construction of the fault diagnosis system into three phases:
No system has 100% reliability, but there is no essential difference between 99.999% and 100% availability for end users. Our goal is to ensure a 99.999% availability service experience through continuous iteration and optimization. However, the behaviors initiated by end users go through a series of intermediate systems, so reliability problems in any one system will affect the overall customer experience. We must therefore find a way to measure the stability of all nodes. To this end, we have started the construction of an end-to-end SLO system. The main strategies are as follows:
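Two pieces of arithmetic sit behind the reasoning above: how much downtime a 99.999% target allows, and why a chain of intermediate systems lowers end-to-end availability. The sketch below works through both, under the simplifying assumption that failures in different systems are independent. The function names are illustrative.

```python
from math import prod
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request path that needs every system to be up.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(availabilities)


five_nines_budget = allowed_downtime_minutes(0.99999)  # about 5.26 minutes per year
chain = serial_availability([0.9999, 0.9999, 0.9999])  # about 0.9997, below each link
```

Three systems at 99.99% each yield roughly 99.97% end to end, which is why an SLO has to be defined and measured per node as well as for the whole path.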
According to the CAP theorem, a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. As a large-scale distributed system, the ECS control system also faces a resource consistency problem. Specifically, data inconsistencies can arise for resources such as ECS instances, disks, and bandwidth across systems such as ECS control, orders, and metering. To ensure system availability, distributed systems usually implement eventual consistency at the expense of the real-time consistency of some data. Based on the technical architecture and business characteristics of ECS, we implement the following policies to ensure resource consistency:
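As an illustration of how inconsistencies between two systems can be detected, the sketch below reconciles two views of the same resources. It is a generic example rather than the ECS team's policy. The `control` and `billing` mappings from resource ID to status are hypothetical.

```python
from typing import Dict, Set


def reconcile(control: Dict[str, str], billing: Dict[str, str]) -> Dict[str, Set[str]]:
    """Compare two views of resource status and report the differences."""
    control_ids, billing_ids = set(control), set(billing)
    shared = control_ids & billing_ids
    return {
        "missing_in_billing": control_ids - billing_ids,
        "missing_in_control": billing_ids - control_ids,
        "status_mismatch": {rid for rid in shared if control[rid] != billing[rid]},
    }


# Hypothetical snapshots taken from the control system and the billing system.
control_view = {"i-1": "running", "i-2": "stopped", "i-3": "running"}
billing_view = {"i-1": "running", "i-2": "running", "i-4": "running"}
diff = reconcile(control_view, billing_view)
```

A periodic job could feed such a report into alerting or automatic repair, so that temporarily diverging data converges again.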
With nearly 100 people doing R&D concurrently, core applications released daily, and several thousand releases over the year, a complete set of R&D and change process safeguards is one of the key factors that has allowed ECS to reduce its failure rate each year. The following section briefly describes some explorations the ECS SRE team has made concerning the R&D and process change systems.
R&D process: Increase the standardization of the R&D process throughout the software lifecycle.
From the perspective of software engineering, the earlier problems are addressed, the less labor is required and the smaller the economic losses. The subsequent maintenance costs of poorly designed systems are much higher than the costs of implementing a better design beforehand. To control the design quality, we have explored the following design processes and specifications:
Previously, ECS Code Review was mainly deployed on GitLab. The main problem was that the integration of GitLab with the continuous integration of relevant internal Alibaba components was not stable, and scheduled admission could not be set. The Aone Code Review platform solves the integration problem with the Aone lab and provides a scheduled configuration function for code integration. In addition, we have defined a code review process for ECS control, as shown below:
We have migrated all the core ECS control applications to the standard CI platform in a unified manner and in accordance with a standard mode. This greatly improved the CI success rate and reduced the waste of human resources due to the need for frequent manual intervention. Our solution is as follows:
The UT parallel running mode was transformed to improve the UT running efficiency.
The ECS deployment environment is extremely complex. Specifically, the deployment architecture is complex and there are many deployment tools and dependencies. The environment depends on all the core middleware and applications of Alibaba Group and Alibaba Cloud. With nearly 100 people performing R&D in parallel, a stable and reliable end-to-end daily environment is the basic guarantee of R&D efficiency and quality. The transformation of the end-to-end daily environment cannot be achieved overnight. Our current construction approaches are roughly as follows:
The staging environment and production environment use the same database, so staging testing can easily affect the stability of production services. Given that data cannot be isolated between the staging and production environments, our short-term solution was to improve the quality of the staging code through standardized processes to minimize or avoid such problems.
Staging is equivalent to production. Staging can only be deployed after passing CI and basic daily verification.
DDL and large table queries can be staged only after being reviewed. This prevents slow staging SQL statements from impacting RDS stability and affecting the production environment.
We verify the stability of staging code by running API-based functional test cases early each morning. This is the last line of defense for daily release access. The 100% FVT pass rate goes a long way to ensure the success rate of the daily releases for ECS core control.
In the current release mode, the on-duty staff cuts a release branch from the Develop branch and deploys it to staging the night before the release. On the day of the release, we check that the FVT success rate is 100% and then release the branch in batches through Aone. The staff observes business monitoring indicators, alerts, and error logs for each batch. In this mode, core applications are released every day, and the process takes about half a person-day of work. To further improve the efficiency of human resources, we have looked into automated release processes:
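The manual flow described above (FVT gate, batched rollout, observation after each batch) can be sketched as a small loop. This is an illustrative outline, not Aone's API. The `deploy` and `batch_is_healthy` callbacks, the host names, and the batch size are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def release_in_batches(
    hosts: Sequence[str],
    batch_size: int,
    fvt_pass_rate: float,
    deploy: Callable[[str], None],
    batch_is_healthy: Callable[[List[str]], bool],
) -> Tuple[List[str], bool]:
    """Deploy batch by batch; return (hosts deployed, whether the release completed)."""
    if fvt_pass_rate < 1.0:
        # Gate from the text: release only when the FVT pass rate is 100%.
        return [], False
    deployed: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            deploy(host)
        deployed.extend(batch)
        if not batch_is_healthy(batch):
            # Stop here so that later batches never receive a bad build.
            return deployed, False
    return deployed, True


hosts = [f"host-{i}" for i in range(1, 7)]
log: List[str] = []
# Pretend that monitoring flags a problem once host-4 has been deployed.
done, ok = release_in_batches(hosts, 2, 1.0, log.append, lambda b: "host-4" not in b)
# A 98% FVT pass rate blocks the release before any host is touched.
blocked, blocked_ok = release_in_batches(hosts, 2, 0.98, log.append, lambda b: True)
```

In an automated version, `batch_is_healthy` would query the same monitoring indicators, alerts, and error logs that the on-duty staff checks by hand today.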
Change process: We improve change efficiency while ensuring change quality by standardizing the change process, connecting to GOC for strong control, moving changes to a graphical ("white screen") console, and automating the process.
We ensure that all changes can be monitored, pre-released, and rolled back by restricting existing control change behaviors such as hot upgrades, configuration changes, DDL changes, constraint configuration changes, data correction, and CLI operations.
By connecting to the group's strong control system, we can ensure that all changes can be traced and reviewed. We also hope that connecting the platform to the strong control system will eliminate cumbersome manual change work.
The integration of all ECS resources, management systems, diagnostics, performance, O&M, visualization, and the ECS Operations and Maintenance System capabilities creates a unified, secure, and efficient O&M platform for elastic computing.
We will automate all the cumbersome tasks that require manual intervention.
During the construction of the stability system, the capacity and performance optimization of basic components, the construction of the end-to-end stability system, and the upgrade of the R&D and change processes provide the foundation for stable operations. The establishment of a culture and continuous operation are indispensable for long-term planning and efficient operation. The following are some of the approaches the ECS SRE team has taken to a stability operation system.
On-call rotation: Google SRE adopts a 24/7 on-call rotation system, which is responsible for monitoring and handling production system alerts and performing firefighting. SREs are essentially software engineers. In the ECS control team, each SRE must handle online alerts, respond to emergencies, and participate in the troubleshooting of difficult problems while performing R&D work. In order to ensure that the core R&D work of SREs is not interrupted, we are trying to implement an on-call rotation mechanism.
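The text does not describe how the rotation is scheduled. As one simple possibility, the sketch below assigns engineers to weekly shifts in round-robin order. The weekly shift length, the engineer names, and the start date are all assumptions for illustration.

```python
from datetime import date, timedelta
from itertools import cycle, islice
from typing import List, Sequence, Tuple


def weekly_rotation(engineers: Sequence[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one engineer per week, cycling through the list in order."""
    people = islice(cycle(engineers), weeks)
    return [(start + timedelta(weeks=i), person) for i, person in enumerate(people)]


# Hypothetical team of three, starting on a Monday.
schedule = weekly_rotation(["alice", "bob", "carol"], date(2020, 1, 6), 4)
```

With a fixed schedule, engineers who are not on call in a given week can focus on R&D work without interruption.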
The fault replay mechanism reviews, after the event, issues that occurred in production or affected internal stability. In ECS, problems affecting production stability are uniformly defined as "internal faults". Our view is that all "internal faults" have the potential to become real faults, so they must receive sufficient attention. For this reason, we often communicate and cooperate with the group's fault team, and have studied internal fault replays and management modes. The following describes some basic concepts of fault replay and some ECS control practices in fault replay. Fault replay is not intended to assign blame, but to discover the underlying technical and management problems behind the fault symptoms.
The summaries of fault replays are an important knowledge asset. Internally, we produce an in-depth summary of each fault replay and form the internal knowledge base "Learn From Failure".
Stability itself is a product that requires daily and continuous operation. The main ECS control modes include daily stability reports and biweekly stability reports.
The previous content briefly introduced some practical experience of the ECS SRE team. As an SRE, I have participated in ECS stability governance and R&D work since 2018. Next, I will share some of my thoughts about SRE practices over the past year. These are simply my own personal opinions.
SRE is more than just O&M. It is true that, in some companies, the responsibilities of SREs are similar to traditional O&M or system engineers. However, generally and certainly in the future, SRE is a position that demands a wide range of skills beyond O&M capabilities, such as software engineering, technical architecture, coding, project management, and team collaboration.
If you lack a business architecture, you lack a soul! SREs are unqualified if they do not understand the business. SREs must participate in the optimization and future planning of the technology and O&M architecture. At the same time, they should coordinate with the business team to perform troubleshooting and solve difficult problems. These tasks cannot be performed well without a clear understanding of the business.
When addressing the misconception that SRE is nothing more than O&M, I mentioned that SREs require a wide range of capabilities. I now want to present my idea of an SRE capability model for the future. This is only a preliminary idea to be used for reference.
Business team SREs must first possess R&D capabilities. For instance, elastic computing SREs need to develop common middleware components, such as workflow frameworks, idempotence frameworks, cache frameworks, and data cleansing frameworks. R&D capabilities are the most necessary skills of SREs.
SRE evolved from O&M in the DevOps development process. For both manual and automated O&M, SREs must possess comprehensive O&M capabilities. In the elastic computing team, SREs are responsible for ensuring the stability of the production environment (networks, servers, databases, middleware, and so on). During daily on-call and fault emergency response work, O&M capabilities are essential.
SREs should not only focus on the current stability and performance of the business, but also plan the capacity and performance of the business from a future perspective. This requires a familiarity with the business system architecture and excellent architectural design capabilities. As an SRE for elastic computing, one important task is to take the technical architecture as a future plan and provide an executable roadmap.
Here, engineering capabilities mainly refer to the ability to implement software engineering and reverse engineering. First, SREs must be able to think like a software engineer and implement large-scale software engineering tasks. In addition, one of the core daily tasks of an SRE is the handling of stability problems and other difficult problems. Reverse engineering capabilities play a critical role in troubleshooting abnormal problems in a large-scale distributed system, especially when handling unfamiliar problems.
An SRE who does not understand the business is not qualified to be an SRE. In particular, business team SREs can better carry out architecture planning and troubleshooting only when they are familiar with the business' technical architecture, development status, and even the details of the business modules. For instance, an elastic computing SRE must be familiar with the current elastic computing business dashboard, future development plans, and even the business logic of the core modules.
For an engineer, there is no doubt that communication skills are essential. Most of the work done by SREs involves different teams and even different business units, so communication skills are particularly important. In the elastic computing team, SREs must communicate and cooperate closely with multiple business teams to ensure business stability. Externally, we must cooperate with the group's unified R&D platform, basic O&M, monitoring platform, middleware, and network platform teams. Sometimes we even have to directly interact with external customers. Therefore, I cannot stress the importance of communication skills enough.
SREs must appreciate the importance of team collaboration, especially in the case of fault emergencies, when we must cooperate closely with multiple teams to reduce the mean time to recovery (MTTR). During daily work, SREs must actively coordinate with the business team and external dependent teams to guide and promote the performance of stability-related work.
The work of an SRE is technically complex and the day-to-day tasks are tedious. When the daily on-call and firefighting responsibilities are added, project management becomes very important from the team perspective. It ensures that all the work can be carried out in an orderly and healthy manner. From a personal point of view, time management is extremely valuable. In my own elastic computing SRE team, we carried out several small projects over the past year to ensure the rapid implementation of the stability system, and the results were very good. Currently, we are managing virtual organizations and long-term projects.
As mentioned earlier, SREs require team collaboration and engineering capabilities. At the same time, SRE personnel must upgrade how they think. For example, they should be able to think in reverse, have an awareness of cooperation, show empathy, and quickly adapt to new situations.
From my own experience, I think that these are the core concepts of SRE:
We are the last line of defense. Site reliability engineers must have a strong sense of responsibility and mission. As the guardians of stability, we should be fearless and determined in the process of team collaboration.
In this era of explosive information growth, technology is developing rapidly. Technicians must not only maintain their enthusiasm for technology, but also have the ability to think. There are no universal solutions, only provisional solutions tailored to local conditions and constraints. We can expect the road ahead to be difficult and full of twists and turns.