How to build your own SRE system from 0 to 1

Why does ECS build its own SRE system?

The ECS team builds its own SRE system for two reasons: the characteristics of the product and business, and the organizational background.

The following figure shows the business features of ECS:

First, from a product perspective, ECS is the largest and most central cloud product of Alibaba Cloud. It is the base on which the Alibaba economy and other cloud products are deployed, and it supports a large amount of business in China and abroad.

Alibaba Cloud's global share ranks third, and ECS is undoubtedly the largest contributor. As an infrastructure base, ECS has particularly high stability requirements.

Second, as Alibaba's internal businesses move to the cloud and cloud computing becomes more widespread, the number of external ECS OpenAPI calls has multiplied every year, which means the capacity of the system faces new challenges every year.

At the same time, Alibaba Cloud Elastic Computing went through an organizational adjustment that removed the PE role: the business team no longer has full-time operations and maintenance engineers or system engineers. Operations and maintenance, system architecture planning, and horizontal products therefore have to be taken on by the team itself.

Exploration and Practice of Building the ECS SRE System

Since construction began in 2018, the ECS SRE system has drawn on the practices of Google and Netflix and adapted them to the characteristics of the team and the business. The resulting SRE system can be summarized in the following five levels:

Lay the foundation. A saying in Alibaba Cloud's culture goes, "if the foundation is weak, the earth will shake." For the team, the foundation is the full-link stability governance system and performance and capacity engineering.

Set standards. This is very important in a large R&D team. The ECS team looks at the complete software life cycle (design->coding->CR->testing->deployment->operation and maintenance->offline) and defines a standard for each stage. We first raise awareness through regular technical training and ongoing operation, and at the same time trial the standard in a small team. If the results meet expectations, we find a way to automate it.

Build a platform. Continuously free SREs from manual work by building an automation platform.

Do empowerment. Besides doing a good job on horizontal basic components and the automation platform, the SRE of a business team also needs to do a lot of promotion and assist business R&D. At the same time, the SRE has to deal with many alert responses, online troubleshooting, and fault responses every day. To maximize the value of SRE, I think the core is empowerment.

Build a team. Finally, I will take Elastic Computing as an example to introduce the responsibilities of the SRE team, its culture, and how to become a qualified SRE.

Lay the foundation

Basic framework construction and performance tuning

The core business of Elastic Computing runs on the Java technology stack, with a small amount of Golang and Python. It is essentially a large-scale distributed system written in Java. To support the business scale and abstract common concerns as much as possible, we have developed a series of basic frameworks for business developers to use, including a lightweight BPM framework, an idempotency framework, a caching framework, and a data cleaning framework. Each framework's abstraction was designed with both large-scale capacity and miniaturized output in mind. Taking the workflow framework as an example, we support the creation and execution of hundreds of millions of workflows every day, and the framework's scheduling overhead is at the 5 ms level.

Beyond the basic frameworks, we have made a series of JVM performance optimizations, such as enabling Wisp coroutines for IO-intensive applications and tuning the JVM of each core application to reduce stop-the-world (STW) time.

In terms of service performance, at the data layer we have optimized slow SQL statements that take more than 100 ms across the entire network. At the application layer we provide a multi-level caching solution for core links, so that the hottest data is returned from memory as fast as possible. At the business layer we provide batch APIs and asynchronous transformation.
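
As a rough illustration of the multi-level caching idea, the following Python sketch checks in-process memory first, then a shared cache, and only then the source of truth. The class name, the dictionary standing in for the shared cache, and the loader function are all assumptions made for this example; the article does not describe the actual framework, and a real one would also need expiry and eviction.

```python
from typing import Callable, Dict


class TwoLevelCache:
    """Look up local memory first, then a shared cache, then the source of truth."""

    def __init__(self, remote: Dict[str, str], load: Callable[[str], str]):
        self.local: Dict[str, str] = {}   # level 1: in-process memory
        self.remote = remote              # level 2: stand-in for a shared cache
        self.load = load                  # fallback, for example a database query

    def get(self, key: str) -> str:
        if key in self.local:
            return self.local[key]
        value = self.remote.get(key)
        if value is None:
            value = self.load(key)
            self.remote[key] = value
        self.local[key] = value           # promote hot data into memory
        return value


calls = []


def load_from_db(key: str) -> str:
    """Hypothetical loader that records how often the database is hit."""
    calls.append(key)
    return f"value-of-{key}"


cache = TwoLevelCache(remote={}, load=load_from_db)
first = cache.get("instance-1")    # misses both levels and calls the loader
second = cache.get("instance-1")   # served from local memory
```

The second lookup never reaches the loader, which is the property the text asks for on core links.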

Full Link Stability Governance

The above figure lists several typical points, such as alert governance. The common problem is that the volume of alerts is too large and the signal-to-noise ratio is low. An alert carries very little information, so it is of limited help in troubleshooting.

In the early years we faced the same problem. Here are two real stories from alert governance:

The database of a core application went down at night, but the alert was configured to notify by email and Wangwang, and the recipient was not the current application owner. As a result, by the time the owner noticed the fault, it took a very long time to trace it to the database.

We once had a chain-reaction failure across the whole link, triggered by one of the services. It took us two hours to locate the problem and recover. Only in the review afterwards did we find that a related alert had fired 5 minutes before the failure started, but the colleague who received it was getting too many alerts and missed the important one.

Therefore, alert governance is a very important part of stability governance. The main methods are layered monitoring, a unified alert configuration platform, and unified alert configuration policies, such as who receives an alert and how they are notified.

Database Stability Governance

The database is the lifeblood of an application, and database failures are often fatal. Whether data accuracy or data availability is compromised, the result for the business is usually devastating.

Two Dilemmas: Slow SQL and Large Tables

When MySQL is used for data storage, the two most frequent problems are slow SQL and large tables. Slow SQL slows down the business and can even set off a chain reaction across the whole link, leading to an avalanche. The large-table problem is usually inseparable from slow SQL: when a table holds enough data, the MySQL optimizer sometimes runs into strange problems during index selection. Database governance therefore basically revolves around slow SQL and large tables.

Governance solution for slow SQL

The general solution is to synchronize the collected slow SQL to SLS, analyze it there in near real time, and then use the database and table information to assign each slow statement to the responsible team for repair. During this process, SREs provide optimization advice and general-purpose basic components, such as bigcache and nexttoken solutions for queries over many pages, a common cache for hot data, and read-write separation and degradation capabilities.
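
The article does not show how its nexttoken solution works. One common reading, assumed here, is keyset pagination: the token carries the last primary key the caller saw, so each page is an index range scan rather than a growing OFFSET. The sketch below uses Python's built-in sqlite3 module with a hypothetical `instance` table in place of MySQL.

```python
import sqlite3
from typing import List, Optional, Tuple


def list_page(conn: sqlite3.Connection, next_token: Optional[int], limit: int
              ) -> Tuple[List[Tuple[int, str]], Optional[int]]:
    """Return one page of rows and the token for the next page (None when done)."""
    last_id = next_token or 0
    rows = conn.execute(
        "SELECT id, name FROM instance WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit + 1),   # fetch one extra row to detect a further page
    ).fetchall()
    has_more = len(rows) > limit
    page = rows[:limit]
    token = page[-1][0] if has_more else None
    return page, token


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO instance VALUES (?, ?)",
                 [(i, f"i-{i}") for i in range(1, 6)])

page1, token1 = list_page(conn, None, 2)
page2, token2 = list_page(conn, token1, 2)
page3, token3 = list_page(conn, token2, 2)
```

Because every page starts from `id > last_id`, the cost of a page does not grow with how deep the caller has paged, which is the usual reason to prefer a token over an offset on large tables.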

Governance solution for large tables

For large tables, the usual industry solution is to shard databases and tables, but the R&D and operations costs this brings are very high. For most of our internal businesses, the more common approach is historical data archiving. Here too, SRE provides a unified basic framework: the business side only needs to supply the archiving condition and trigger frequency, and the framework automatically archives the historical data to the offline database and physically deletes it from the business database. The holes left by the moved data allow a certain amount of space to be reused, so that the limited data space supports business growth without expanding.
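
The archiving flow described above (copy rows that match a condition to an offline store, then physically delete them) can be sketched as follows. This is a minimal illustration using sqlite3, not the team's framework: the `task` table, the date cutoff used as the archiving condition, and the batch size are all assumptions.

```python
import sqlite3


def archive(business: sqlite3.Connection, offline: sqlite3.Connection,
            cutoff: str, batch_size: int = 1000) -> int:
    """Copy rows older than cutoff to the offline DB, then delete them in batches."""
    moved = 0
    while True:
        rows = business.execute(
            "SELECT id, created_at, payload FROM task "
            "WHERE created_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        offline.executemany("INSERT OR IGNORE INTO task VALUES (?, ?, ?)", rows)
        offline.commit()   # make the copy durable before deleting the source
        business.executemany("DELETE FROM task WHERE id = ?",
                             [(row[0],) for row in rows])
        business.commit()
        moved += len(rows)


schema = "CREATE TABLE task (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
business_db = sqlite3.connect(":memory:")
offline_db = sqlite3.connect(":memory:")
business_db.execute(schema)
offline_db.execute(schema)
business_db.executemany("INSERT INTO task VALUES (?, ?, ?)", [
    (1, "2019-01-01", "old"), (2, "2019-06-01", "old"), (3, "2021-01-01", "new"),
])

moved = archive(business_db, offline_db, cutoff="2020-01-01", batch_size=1)
```

Working in small batches keeps each delete short, and committing the copy before the delete means a crash between the two steps leaves a duplicate in the offline store rather than lost data.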

High availability system

The high availability of a distributed system can be viewed at four levels. From bottom to top these are the deployment layer, the runtime layer, the data layer, and the business layer. Our high-availability system is divided into the same four layers.

For the deployment layer, our best practice as the team that develops ECS is to recommend multi-availability-zone deployment, for the simple reason that disaster tolerance is better and more flexible.

For the data layer, as mentioned under database stability governance, part of our work is the continuous governance of slow SQL, hot SQL, and large tables. In addition, at the architecture level we built an automatic read-write downgrade framework. It can automatically downgrade some large-table queries to the read-only database, and when the read-write database fails, the core APIs can still be served from the read-only database, which keeps the database highly available as a whole.

At the business layer, the system is a complex distributed system, and one of the problems is handling the complexity of dependencies. How can we guarantee our own core business, possibly in a degraded form, when a dependency is unstable? For a business system this is very challenging and meaningful.
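
To make the read downgrade idea concrete, here is a small Python sketch of a router that sends selected queries to a read-only replica and falls back to the replica when the read-write primary is unavailable. The class, the query names, and the use of `ConnectionError` as the failure signal are assumptions for illustration; the article does not describe how the real framework detects failures or selects queries.

```python
from typing import Callable, Set


class ReadRouter:
    """Route reads to the read-write primary, falling back to a read-only replica."""

    def __init__(self, primary: Callable[[str], str], replica: Callable[[str], str]):
        self.primary = primary
        self.replica = replica
        self.force_replica: Set[str] = set()   # queries downgraded ahead of time

    def read(self, query_name: str) -> str:
        if query_name in self.force_replica:
            return self.replica(query_name)
        try:
            return self.primary(query_name)
        except ConnectionError:
            # Primary unavailable: serve possibly stale data instead of failing.
            return self.replica(query_name)


def broken_primary(query_name: str) -> str:
    raise ConnectionError("primary is down")


def healthy_primary(query_name: str) -> str:
    return f"{query_name} from primary"


def replica(query_name: str) -> str:
    return f"{query_name} from replica"


failover = ReadRouter(broken_primary, replica)
failover_result = failover.read("describe_instances")

planned = ReadRouter(healthy_primary, replica)
planned.force_replica.add("list_history")
planned_result = planned.read("list_history")
normal_result = planned.read("describe_instances")
```

The trade-off is replication lag: a read served from the replica may be slightly stale, so this kind of downgrade suits queries that can tolerate that.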

Failure case

We once had a failure in which a core system suffered an avalanche triggered by a very inconspicuous edge business scenario. The core system had introduced a dependency on a third-party system. When the response time (RT) of that dependency slowed down, our HTTP requests blocked because the timeout was set unreasonably, so all threads were blocked waiting for the service to return. As a result, the RT of all services grew longer until the system stopped responding.

In a huge distributed system there is no absolutely reliable chain of trust. Our design philosophy is Design for Failure, and Failure as a Service.

You can refer to the following ideas:

Define the nature of the dependency during the design phase: is it a strong dependency or a weak dependency?
What SLO/SLA does the other party provide, what problems may occur on the dependency side, and what is the impact on our service?

What is our strategy if an expected or unexpected exception occurs on the dependency side?
How do we ensure maximum availability of our service? Maximum availability means the response may be degraded. For example, if the dependency is weak, we may downgrade directly and return a mock result. If the dependency is strong, we may use an isolation or circuit-breaking strategy to fail some requests fast, and record as much information as possible to make later offline compensation easier.
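
The weak and strong dependency strategies in the list above can be sketched with a small circuit breaker. This is a simplified illustration, not the team's implementation: the class, its thresholds, and the injectable clock are assumptions, and a production version would also bound each call with a timeout and record the failed requests for later compensation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: Optional[str] = None) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self._degrade(fallback, RuntimeError("circuit open"))
            self.opened_at = None                  # cool-down over: one trial call
            self.failures = self.max_failures - 1  # one more failure re-opens
        try:
            result = func()
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self._degrade(fallback, exc)
        self.failures = 0
        return result

    @staticmethod
    def _degrade(fallback: Optional[str], exc: Exception) -> str:
        if fallback is not None:   # weak dependency: return a mock result
            return fallback
        raise exc                  # strong dependency: fail fast


attempts = []


def slow_dependency() -> str:
    """Hypothetical dependency that always times out."""
    attempts.append(1)
    raise TimeoutError("dependency timed out")


now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: now[0])
results = [breaker.call(slow_dependency, fallback="mock") for _ in range(5)]
```

In this run only the first two calls reach the dependency. The remaining three return the mock result immediately, so threads are not held waiting on a service that is already known to be unhealthy.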

Set standards

Elastic Computing has more than 100 R&D staff, and some sister teams and outsourced staff also take part in R&D. Since the first day of SRE construction, we have gradually established R&D standards and processes.

Take the UT (unit test) standard as an example. The problem was that the share of failures caused by missing UT was high. The difficulty was that the code volume was large and R&D did not pay attention to UT, so the problem could not converge. The solution was to improve continuously by establishing UT standards, a CI automation system, and quantitative indicators. As a result, failures caused by missing UT dropped sharply, from 30% to less than 0.3%.

R&D process system

We define a set of standards for almost every stage from design to release.

1. Design stage. We defined a unified design document template to guide and constrain developers. From a software engineering perspective, the earlier a problem is found, the lower the cost of fixing it, so one of our R&D principles is to put heavy emphasis on design. A good design should clearly define the problem from the business point of view, with the user stories, use cases, and constraints spelled out. From the technical point of view, it should explain the trade-offs of the technical solution, how Design for Failure is implemented, and how to handle grayscale release, monitoring, and rollback. We want R&D to spend more effort in the design stage and less on rework or patching after release. Our review mechanism has also changed from offline reviews with a large team to small-team offline reviews plus live broadcasts for the large team, with as few meetings as possible to save everyone's time.

2. Development stage. Our R&D process is similar to git-flow: multiple features are developed in parallel and, after development and testing, merged into the develop branch, from which a unified daily deployment is made. We extended the Alibaba Group Java coding guidelines with custom static scanning rules and unified the UT framework. After a CR is submitted, the guideline scan runs automatically as a static check, and CI runs at the same time to produce a UT report. An MR is merged and enters the next release list only when the static scan, the CI results (mainly UT success rate and incremental line coverage), and the code review approvals all meet the requirements.

3. Testing stage. Testing is divided into daily testing, pre-release testing, functional testing before going online, and regression testing during the grayscale period. A problem every large R&D team faces is how to handle environments: how can they be replicated quickly and isolated? Previously we had only one daily environment and one pre-release environment, and one person's code problem or conflicts between several people's code often made testing particularly time-consuming. Later we built project environments: for simple features, a full-link project environment can be replicated quickly with containers. For cases that need full-link joint debugging, we added multiple pre-release environments so that each business R&D team has its own and teams do not compete with each other, which solved the pre-release problem.

The functional test before going live targets the daily deploy. Every night at 11 o'clock we automatically pull a branch from develop and deploy it to the pre-release environment, then automatically run the full set of functional test cases (FVT) to ensure the reliability of the release. If the FVT fails unexpectedly during the daily release, the release is cancelled.

The last test step is the automatic regression of the core FVT during the grayscale period. We release in rolling mode: usually one region is released first for grayscale verification, and the core FVT performs that verification. Once the core FVT passes, the subsequent batches can be released; otherwise we decide whether to roll back.
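
The grayscale step of the rolling release can be pictured as a simple gate. In the Python sketch below, the function name, the region names, and the `deploy` and `run_core_fvt` callbacks are hypothetical stand-ins for the real release tooling.

```python
from typing import Callable, List, Tuple


def rolling_release(regions: List[str],
                    deploy: Callable[[str], None],
                    run_core_fvt: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Deploy the first region as grayscale; continue only if the core FVT passes."""
    if not regions:
        return [], True
    grayscale, rest = regions[0], regions[1:]
    deploy(grayscale)
    if not run_core_fvt(grayscale):
        return [grayscale], False      # stop here; decide whether to roll back
    for region in rest:
        deploy(region)
    return list(regions), True


all_regions = ["region-a", "region-b", "region-c"]

blocked_log: List[str] = []
blocked, blocked_ok = rolling_release(all_regions, blocked_log.append,
                                      run_core_fvt=lambda region: False)

passed_log: List[str] = []
passed, passed_ok = rolling_release(all_regions, passed_log.append,
                                    run_core_fvt=lambda region: True)
```

When the core FVT fails, only the grayscale region has been touched, which limits the blast radius of a bad release to that one region.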

Change process system

When building SRE, we deliberately planned the change control process. We set different standards for each type of change, such as a checklist plus review mechanism for database changes, batches and pause durations for daily releases, hotfixes, and rollbacks, and specifications for middleware configuration and black-screen changes.

Having a change process and specifications is only the first step. We then built tooling for high-frequency operations tasks. Some of it works with the existing DevOps platform, and where the current DevOps platform offers no support, such as log cleaning and automatic process restart, we developed our own automation tools to clean up large files and restart faulty processes automatically.

One example is data correction. Our data reaches eventual consistency through asynchronous orchestration, so data correction is a particularly frequent change. The power of a simple correction SQL statement sometimes exceeds our imagination: we once had a failure caused by a data correction error, with very serious impact and a very difficult troubleshooting process.

Build a platform
SRE Automation Platform

The purpose of our SRE automation platform is to implement standards through automation. For example, for the read-write degradation and rate limiting in the high-availability system, we provide framework capabilities during the R&D stage, along with operations interfaces and white-screen tools, to help R&D build automated or semi-automated high-availability capabilities.


1-5-10 Metrics for Elastic Computing Teams

The stages that follow (change, monitoring, alerting, diagnosis, and recovery) correspond to the sub-stages of MTTR. Alibaba Group has a 1-5-10 indicator: a problem is detected within 1 minute, located within 5 minutes, and recovered within 10 minutes. This is a very high standard that is very difficult to meet.

To improve our 1-5-10 indicators, the Elastic Computing team built its own monitoring platform and alert platform. What we do is secondary consumption of alerts. We layer all the basic monitoring, including system indicators such as CPU and memory, JVM monitoring, and middleware monitoring, as well as a large amount of business monitoring. Each alert carries metadata such as the owning team, importance, associated diagnostic scenarios, fast recovery strategy, and recommended changes.

This closes the loop across change, alerting, diagnosis, and fast recovery. When an alert fires, the platform can automatically analyze the whole link by TraceID to find the scene of the problem, recommend related changes, and generate the impact surface, such as how many APIs and which users are affected and which solution the alert recommends. It can also provide a hook to perform fast recovery actions.
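
One way to picture an alert that carries metadata and recommends related changes is the Python sketch below. The field names, the rule table, and the ten-minute correlation window are assumptions made for illustration; the real platform's data model is not described in the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Change:
    change_id: str
    target: str          # the resource the change touched, e.g. a database name
    finished_at: float   # seconds on the same clock as Alert.fired_at


@dataclass
class Alert:
    name: str
    target: str
    fired_at: float
    team: str = "unknown"
    severity: str = "P4"
    recovery_action: str = ""
    related_changes: List[str] = field(default_factory=list)


def enrich(alert: Alert, rules: Dict[str, Dict[str, str]],
           changes: List[Change], window: float = 600.0) -> Alert:
    """Attach ownership metadata and recent changes made to the same target."""
    meta = rules.get(alert.name, {})
    alert.team = meta.get("team", alert.team)
    alert.severity = meta.get("severity", alert.severity)
    alert.recovery_action = meta.get("recovery_action", alert.recovery_action)
    alert.related_changes = [
        c.change_id for c in changes
        if c.target == alert.target and 0 <= alert.fired_at - c.finished_at <= window
    ]
    return alert


rules = {"api_error_rate": {"team": "ecs-core", "severity": "P1",
                            "recovery_action": "rollback_last_change"}}
changes = [Change("chg-1", "order_db", finished_at=1000.0),
           Change("chg-2", "other_db", finished_at=1010.0),
           Change("chg-3", "order_db", finished_at=100.0)]
alert = enrich(Alert("api_error_rate", "order_db", fired_at=1020.0), rules, changes)
```

Here only `chg-1` is recommended: it touched the same target shortly before the alert, while the other two changes are either on a different target or too old.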

For example, a colleague ran a data correction for a business, and a mistake in the SQL caused service exceptions for several large online customers. Our monitoring system recognized the business exception within a few seconds of the SQL being executed, sent the alert, generated the alert stack and impact surface analysis, and recommended the associated database change. Combining this information, we located the problem within 1 minute and rolled back immediately, and the business recovered very quickly. Of course, the platform still has limitations. We plan to do more intelligent alerting and diagnosis this year.

Finally, it is important to mention drills, that is, chaos engineering, first proposed by Netflix. Over the past two years we have replayed historical faults many times through fault drills and fault injection, which has also been very helpful in finding online problems. Fault drills are now part of our daily routine.

Having covered our SRE automation platform, the next very important thing once a platform exists is empowerment.

I think the greatest value of a business team's SRE is empowerment: getting everyone to act through empowerment.


Do empowerment
Full link SLO quantification system

ECS has many upstream and downstream dependencies, and instability in any link affects the stability of the services ECS provides.

For example, ECS connects to virtualization and block storage. If virtualization or block storage is slow, the user sees slow ECS instance startup. But how do we judge what is slow? For those of us building distributed services, 5 ms may already be very slow, while the virtualization team may consider 1 s normal for its interface. This is where an SLA is needed.

In the second year of SRE construction, we sorted out dozens of dependencies across the whole link and hundreds of core APIs, selected the indicators each business party cares about most (the SLIs), set SLO values for different confidence intervals, and built a unified quantification platform to follow up continuously through real-time and offline methods. From the establishment of the SLO system through more than a year of continuous operation, the SLO compliance rate for the availability and latency of our dependencies rose from just over 40% at the beginning to more than 98%. The direct business result is a higher success rate for user APIs and a noticeably better user experience.

The Four Stages of Landing an SLO

Select the SLIs, that is, agree with the party you depend on which indicators matter, such as the commonly tracked availability and latency;

Agree on the SLO with that party, that is, clarify the target values, such as P99 and P999, for a given SLI of a given API;

Calculate the SLO. The usual way is to record logs, collect them in SLS, and compute the indicator values through data cleaning and reprocessing;

Visualize. Follow up continuously by turning the SLOs into real-time and offline reports.
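
The calculation stage can be illustrated with a small, self-contained Python sketch. It assumes each log record has already been cleaned into a (success, latency in ms) pair, and it uses the nearest-rank definition of a percentile; the function names and the target values are hypothetical and say nothing about how SLS itself computes these figures.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slo(records: List[Tuple[bool, float]],
                 availability_target: float,
                 p99_target_ms: float) -> Dict[str, object]:
    """records holds (success, latency_ms) pairs parsed from request logs."""
    availability = sum(1 for ok, _ in records if ok) / len(records)
    p99 = percentile([latency for _, latency in records], 99)
    return {
        "availability": availability,
        "p99_ms": p99,
        "met": availability >= availability_target and p99 <= p99_target_ms,
    }


# 98 fast successes, one slow success, and one failed request.
records = [(True, 10.0)] * 98 + [(True, 200.0), (False, 900.0)]
report = evaluate_slo(records, availability_target=0.99, p99_target_ms=250.0)
strict_report = evaluate_slo(records, availability_target=0.999, p99_target_ms=250.0)
```

The same records meet a 99% availability target but miss a 99.9% one, which is why the target values have to be agreed with the dependency before the numbers are reported.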

Knowledge base

A large part of SRE's responsibility is troubleshooting. We have also developed a series of frameworks and tools, along with many operations manuals and fault review materials. We have accumulated all of these under a unified template, so that they can guide R&D colleagues in troubleshooting everyday problems and making changes, and we have shared some of the documents with other Alibaba Cloud product teams.

Through the knowledge base we have also empowered many sister teams. In addition, many of the basic business components from our R&D process, such as workflow, idempotency, caching, downgrading, and Dataclean, are also used by other Alibaba Cloud teams.

Stability Culture Construction

SREs are the custodians and evangelists of stability. Only when everyone realizes the importance of stability can our system be truly stable. We started building the process on the first day of the SRE team. The team promoted the stability culture through daily, weekly, and monthly reports, as well as occasional online live broadcasts and offline training, and gradually formed its own stability culture, including a Code Review culture, a safe production culture, and a post-incident fault review culture.

Fault review practice

In the early days of SRE, we started rolling out fault reviews. We define a failure as any abnormal case that damages the business or people's efficiency. At first, fault reviews were led by SRE with the cooperation of the business team, and the whole process was very unpleasant.
As SRE's automated tools and processes later really helped R&D avoid failures, and as SRE's insights gave R&D positive feedback during fault reviews, the review culture gradually came to be accepted by the team. Each business now writes its own fault review report and shares it in a live broadcast, and colleagues from other teams give feedback very actively. Once the culture has truly taken root, it produces a virtuous circle.

Stability is a product, and stability is not a one-off deal. What SRE has to do is give the stability product a soul, and nurture, iterate, and evolve it in the same way as any other product.

Use the methodology of software engineering to solve the stability problems of production systems. A big difference between SRE and business development or operations is that SRE solves the stability problems of the production environment, and redefines the operations model, through the methodology of software engineering.
Automate everything that costs your team. The essence of SRE is defining operations through software engineering and maximizing value through automation and business empowerment.

Review and Personal Prospect of SRE Construction

The four-year SRE history of the Elastic Computing team can be summarized as: system building, quantification, automation, intelligence.

First Year: Systematic Exploration

The main work this year was to build the SRE system from 0 to 1 based on the state of the business and the team, such as the unified construction of the basic frameworks and the stability governance system.

Second Year: SLO System

The focus of the second year was to define the SLO (Service Level Objective) quantification system for the dozens of system dependencies across the whole link and for the core functions of the internal systems, completing a cross-BU SLO quantification system for the whole link. At the same time, we began building a stability operation system and our own data operation platform to quantify the availability and latency of the core APIs that internal and external parties depend on, and to keep following up on and governing them.

Third Year: Automation

We automated as much as possible of the work that needed manual participation in the R&D process, from design, coding, testing, and deployment to alert response after launch. For example, a UT coverage checkpoint was added at the CR stage, and CRs that do not meet the standard are blocked automatically. Unattended release was introduced in the release stage, with releases intercepted based on the error logs during the release. A better approach here may be to judge by comprehensive indicators such as the service SLA of the grayscale machines. In addition, during the pause in the grayscale region's release, we automatically run the core FVT to regress the core test cases and verify the reliability of the release. In abnormal cases the grayscale release is blocked automatically and a one-click rollback is recommended.
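
The CR-stage checkpoint can be sketched as a function that lists the reasons a merge request is blocked. The data class, the 80% coverage threshold, and the single required approval are assumptions chosen for the example; the article does not state the actual thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MergeRequest:
    static_scan_violations: int
    ut_pass_rate: float          # 0.0 to 1.0
    incremental_coverage: float  # line coverage of the changed lines, 0.0 to 1.0
    approvals: int


def blocking_reasons(mr: MergeRequest, min_coverage: float = 0.8,
                     min_approvals: int = 1) -> List[str]:
    """Return the reasons an MR is blocked; an empty list means it may merge."""
    reasons = []
    if mr.static_scan_violations > 0:
        reasons.append("static scan violations")
    if mr.ut_pass_rate < 1.0:
        reasons.append("failing unit tests")
    if mr.incremental_coverage < min_coverage:
        reasons.append("incremental coverage below threshold")
    if mr.approvals < min_approvals:
        reasons.append("not enough approvals")
    return reasons


good = MergeRequest(static_scan_violations=0, ut_pass_rate=1.0,
                    incremental_coverage=0.92, approvals=2)
bad = MergeRequest(static_scan_violations=0, ut_pass_rate=1.0,
                   incremental_coverage=0.40, approvals=0)

good_reasons = blocking_reasons(good)
bad_reasons = blocking_reasons(bad)
```

Returning the reasons rather than a bare yes or no lets the CI system tell the author exactly what to fix before the MR can enter the release list.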

Fourth Year: Intelligent Mind

Some of what we are doing this year is highly automated, such as unattended release and automated root cause analysis for alerts. When a distributed system is large enough and its applications are complex enough, relying on human judgment becomes very difficult. The automation platform that SRE builds must therefore have an intelligent mind, and replace or even surpass humans through systematic capabilities.

Personal views on the future development of SRE:

Stability is a product. We really need to treat stability as a product. Making a product means that we must clearly define the problem, produce a solution, and keep iterating and evolving. This is not a one-shot deal; it needs to be nurtured.

I think that as cloud computing spreads, SRE skills will lean toward R&D skills. System engineering capabilities remain necessary, of course, but cloud computing as infrastructure shields us from many problems at the data center, network, and OS levels. The focus of SRE thus becomes using software engineering methodology to redefine operations and using automation to improve efficiency.

Netflix proposed the concept of CORE SRE. Netflix interprets CORE this way: C is Cloud. We all know that Netflix runs on AWS, and the cloud is the foundation.

O is Operations, and R and E are Reliability and Engineering, which need no further explanation.

Another interpretation of CORE SRE is that a small number of core SREs support and ensure the stability of large-scale services.

I think the core idea behind a small number of SREs supporting large-scale services is automation: automate everything that can be automated, and make the automation intelligent.

To sum up: stability is a product + the proportion of Dev will increase + CORE SRE + automate everything.
