By Zhou Bo
For a long time, comprehensive stress testing has been performed on the production links for each year's Double 11 Global Shopping Festival. In practice, we found that stress testing in production is closely tied to the structure, maturity, and processes of an IT organization. Therefore, we no longer treat it as a standalone production activity but as part of the overall business continuity plan.
This article will elaborate on four aspects:
The preceding figure shows three typical questions during the testing and communications between different IT organizations.
What is fragility? Fragility is like glass: everyone knows that glass is brittle and breaks easily. What is the opposite of fragility? It is not toughness but, arguably, anti-fragility. What, then, is anti-fragility? Take table tennis as an example. A table tennis ball lying on the ground can be crushed with very little force. In high-speed play, however, the harder we hit the ball, the more strongly it rebounds. In other words, anti-fragility shows itself while the object is in motion.
Our IT systems are the same. No code is 100% free of problems. Our infrastructure may also be fragile: servers and databases always have limits, and our frameworks are fragile as well. We hope to address all of these problems with measures such as pre-planning, risk identification, fault tolerance, and circuit breaking, so that the IT system as a whole becomes anti-fragile. In short, we hope to give IT systems sufficient redundancy and plenty of plans for sudden, uncertain risks.
How can we equip IT systems with anti-fragility? We hope to use online stress testing to introduce uncertainty and to identify the resulting problems through real-time monitoring and pre-planning capabilities. We then address these problems through a review after each online production stress test. Next, we can apply the same means to run routine stress tests of stability in the production environment, so that the system stays stable over the long term. Eventually, we achieve what anti-fragility requires, such as overall monitoring, operation protection, and routing control capabilities. This equips the whole IT system with anti-fragility.
How can comprehensive-procedure stress testing be conducted in the production environment? What technical means does it need?
In general, how does the test evolve from offline to online? I divide it into four stages:
1. Currently, most IT organizations can do offline stress testing for a single system, a single interface, or a single scenario, as well as system analysis and performance analysis. However, in complex business scenarios, this may not fully identify problems, and much of the work is carried out by developers or testers.
2. We set up a body similar to a testing laboratory or testing organization. Such a large department may construct a batch of performance testing environments similar to the production environment, in which we can do more work. For example, we can perform comprehensive-procedure stress testing in the offline environment and do some offline work based on our previous experience, including performance diagnosis. This stage is a step forward for testing. The procedures in the test environment need to be analyzed and combined with evolving capabilities, such as risk control.
3. Currently, the vast majority of IT and Internet enterprises are willing to try business stress testing in the online production environment. This stage is similar to the second stage, but in practice it is divided into two layers. In the first layer, many IT companies, for fear of data pollution, only conduct comprehensive-procedure stress testing for read-only services in non-production links. In the next layer, some organizations go further and perform comprehensive-procedure stress testing during normal production hours, which requires higher capabilities of the organization.
For example, we need to color all stress testing traffic to distinguish it from normal business traffic. Some organizations may also need some environmental isolation. For stress testing during business production hours, we need to consider offsetting and throttling the traffic, as well as a circuit breaking mechanism. No matter how carefully the testing is done, some impact on the production business is unavoidable, so a quick circuit breaking mechanism is required in case a problem arises.
4. Once stress testing, traffic coloring, and the circuit breaking mechanism are in place, the last stage is comprehensive-procedure stress testing of the entire production procedure, including both read and write services. Here, we introduce technical means at the database and table level to perform comprehensive-procedure stress testing in production. We also have the capabilities of system failure drills and production change drills, so we eventually have the capabilities of data isolation, monitoring isolation, and log isolation.
For comprehensive procedure stress testing, we need several key technologies:
Stress testing traffic can be identified by marking it on the stress testing machine, for example by adding a suffix, or by using other identifiers that allow it to be routed to the relevant table. At the same time, the traffic must remain identifiable as it passes through the whole procedure. We hope to accurately identify every middleware and every service that the stress testing traffic passes through. This is the first step: knowing whether a request comes from the stress testing machine or is normal traffic.
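The marking and propagation described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the header name `X-Stress-Test`, the value `"1"`, and the function names are all assumptions made for the example. The idea is that an interceptor reads the marker once at the entry point, stores it in a per-request context, and every downstream call copies it forward.

```python
import contextvars
from typing import Optional

# Hypothetical header name and value; the article only says that a suffix or
# some other identifier is added on the stress testing machine.
STRESS_HEADER = "X-Stress-Test"

# Per-request flag: True when the current request is stress testing traffic.
_is_stress = contextvars.ContextVar("is_stress", default=False)


def intercept(headers: dict) -> None:
    """Server-side interceptor: read the marker and record it in the context."""
    _is_stress.set(headers.get(STRESS_HEADER) == "1")


def outgoing_headers(base: Optional[dict] = None) -> dict:
    """Copy the marker onto a downstream call so the next service sees it too."""
    headers = dict(base or {})
    if _is_stress.get():
        headers[STRESS_HEADER] = "1"
    return headers


def handle_request(headers: dict) -> dict:
    """Handle one request in its own context so flags never leak between requests."""
    def _work() -> dict:
        intercept(headers)
        # A real service would call middleware or other services here.
        return outgoing_headers({"Accept": "application/json"})

    return contextvars.copy_context().run(_work)
```

Running each request in a copied context keeps the flag from one request from bleeding into the next, which matters when stress testing traffic and normal traffic share the same process.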
We need methods such as a shadow library to perform data isolation. The O&M personnel create a shadow library identical to the one in the production environment and then switch to it. Alternatively, they can create matching shadow tables in the production database. The first method is more secure, but its disadvantage is that the production environment is unavailable while the shadow library is in use, and a shadow library cannot completely simulate the online situation. The shadow table method requires stronger technical capabilities to ensure that the entire link can be traced, including the ability to recover all data if an error occurs.
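The shadow table variant can be illustrated with a small sketch. It assumes a naming convention in which each shadow table is the production table name with a `shadow_` prefix; that convention, the `orders` table, and the function names are hypothetical and chosen only for the example. SQLite stands in for the production database.

```python
import sqlite3

SHADOW_PREFIX = "shadow_"   # hypothetical naming convention
KNOWN_TABLES = {"orders"}   # allowlist, because table names cannot be bound as SQL parameters


def route_table(table: str, is_stress: bool) -> str:
    """Return the shadow table for stress testing traffic, else the real table."""
    if table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {table}")
    return SHADOW_PREFIX + table if is_stress else table


def save_order(conn: sqlite3.Connection, item: str, is_stress: bool) -> None:
    """Write an order to whichever table the traffic type is routed to."""
    table = route_table("orders", is_stress)
    conn.execute(f"INSERT INTO {table} (item) VALUES (?)", (item,))


def count(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a table (table name comes from trusted code only)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
for name in ("orders", "shadow_orders"):
    # The shadow table has the same schema as the production table.
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, item TEXT)")

save_order(conn, "lipstick", is_stress=False)
save_order(conn, "load-test-item-1", is_stress=True)
save_order(conn, "load-test-item-2", is_stress=True)
```

After these three writes, the production table holds only the real order, and both stress testing rows sit in the shadow table.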
Another key technology is also called the risk circuit breaker mechanism. Once the system detects that online stress testing in the production environment has affected our business, we need rules or metrics that trigger the risk circuit breaker mechanism automatically. Whether it cuts off the traffic from the stress testing machine or isolates the damaged part of the production system from the business, such a mechanism is necessary for comprehensive-procedure stress testing in production.
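A rule-based trigger of this kind might look like the sketch below. The class name, the error-rate metric, and the thresholds are assumptions for illustration; the article does not say which metrics the platform actually watches. The breaker observes the outcome of normal business requests and, once the error rate crosses the threshold, refuses further stress testing traffic.

```python
class StressCircuitBreaker:
    """Stop stress testing traffic when business requests start failing.

    The thresholds are illustrative. For simplicity the error rate is
    cumulative; a production version would more likely use a sliding window.
    """

    def __init__(self, max_error_rate: float = 0.05, min_samples: int = 20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.total = 0
        self.errors = 0
        self.tripped = False

    def record(self, ok: bool) -> None:
        """Record the outcome of one normal business request."""
        self.total += 1
        if not ok:
            self.errors += 1
        # Only judge once there are enough samples to avoid tripping on noise.
        if self.total >= self.min_samples:
            if self.errors / self.total > self.max_error_rate:
                self.tripped = True

    def allow_stress_traffic(self) -> bool:
        """The stress testing machine checks this before sending more load."""
        return not self.tripped
```

Once tripped, the breaker stays open until someone resets it, which matches the intent that a review should happen before stress testing resumes.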
Logs do not have much impact on the comprehensive procedure itself. However, as digitalization advances, logs have become virtually the most important data source for BI and operations personnel to analyze the business. If log isolation is not performed, BI decision-making may be affected. For example, during stress testing, we may send a lot of traffic from a certain region to the production environment. The BI personnel, through log analysis, may then see an unusually large volume of data from that region and make incorrect operational decisions. Therefore, throughout comprehensive-procedure stress testing in production, log isolation is required so that normal production traffic and stress testing traffic are stored separately.
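One way to keep the two kinds of log records apart is sketched below with Python's standard `logging` module. The `stress` attribute on the record, the logger name, and the in-memory streams are assumptions for the example; in practice the two handlers would write to separate files or log stores.

```python
import io
import logging


class StressOnly(logging.Filter):
    """Pass only records marked as stress testing traffic."""

    def filter(self, record: logging.LogRecord) -> bool:
        return getattr(record, "stress", False)


class ProductionOnly(logging.Filter):
    """Pass only records that are not marked as stress testing traffic."""

    def filter(self, record: logging.LogRecord) -> bool:
        return not getattr(record, "stress", False)


def build_logger(prod_stream, stress_stream) -> logging.Logger:
    """Create a logger with one handler per destination, each with its own filter."""
    logger = logging.getLogger("orders.isolated")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    for stream, flt in ((prod_stream, ProductionOnly()), (stress_stream, StressOnly())):
        handler = logging.StreamHandler(stream)
        handler.addFilter(flt)
        logger.addHandler(handler)
    return logger


prod_log, stress_log = io.StringIO(), io.StringIO()
log = build_logger(prod_log, stress_log)
log.info("order created region=east")                          # normal traffic
log.info("order created region=east", extra={"stress": True})  # stress testing traffic
```

Because BI tooling would read only the production destination, a burst of stress testing requests from one region no longer shows up in its statistics.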
These are the required functions for the comprehensive-procedure stress testing and business continuity platform:
With this architecture, we can cut costs by 40% compared to the overall environment, with virtually no impact on the production business.
Next, let's talk about how to build a shadow database, including the entire traffic identification.
The orange part is the real stress testing traffic, to which we add an identifier on the stress testing machine. (Currently, we add a suffix.) On the server side, we add a filter that acts as an interceptor. It intercepts the relevant identifiers in the traffic and then distinguishes, dyes, and tracks the traffic. Thus, each request is transparently visible in every middleware and project it passes through.
During the real stress testing process, we can use the Agent to rewrite bytecode so that the normal condition is replaced with the stress testing condition. Of course, the shadow library must be built first. The tracking at the bottom layer can then put the corresponding traffic into the shadow library. Once the database is running cleanly, we run a traffic test against the shadow area to check that it is clean enough. We can ensure that all test data carries identifiers. Even if a problem is not diagnosed during the process, we can still delete the test data from the normal table, and every area the traffic passed through remains visible.
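Because every piece of test data carries an identifier, cleaning up rows that ended up in a normal table can be a simple filtered delete. The sketch below assumes the identifier is a `_PT` suffix on the order number; the suffix, the `orders` table, and its columns are hypothetical. SQLite stands in for the production database.

```python
import sqlite3

STRESS_SUFFIX = "_PT"  # hypothetical marker; the article only says a suffix is added


def purge_stress_rows(conn: sqlite3.Connection, suffix: str = STRESS_SUFFIX) -> int:
    """Delete rows whose order number ends with the stress testing suffix.

    substr(x, -n) takes the last n characters in SQLite, which avoids LIKE,
    where the underscore in the suffix would act as a wildcard.
    """
    cur = conn.execute(
        "DELETE FROM orders WHERE substr(order_no, -length(?)) = ?",
        (suffix, suffix),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no TEXT PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A1001", 59.0), ("A1002_PT", 59.0), ("A1003_PT", 12.5)],
)

deleted = purge_stress_rows(conn)
remaining = [row[0] for row in conn.execute("SELECT order_no FROM orders ORDER BY order_no")]
```

In this example the two suffixed rows are removed and the real order `A1001` is left untouched.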
With this method, there are three testing stages for most IT organizations. However, there are only two stages for some of the well-developed ones:
These are the goals we hope to achieve in each stage of the test life cycle.
Since different IT organizations are in different situations, the suggestions we provide are not necessarily applicable to all of them; they are only for your reference.
Generally, we implement comprehensive procedure stress testing for third parties and online production stress testing through five stages:
The first step is to work with a third party to sort out the business phases. We will do the following:
The second step is to transform some applications, for example, by adding traffic tagging. The business systems involved are determined from the monitored traffic. Monitoring is connected to the business systems, and the relevant third-party components are mocked. We agree on the stress testing scenarios with the third party, including the construction of traffic statements and pre-planning.
The third step is the stress testing process itself. Comprehensive-procedure stress testing in production is used to optimize the performance and evaluate the capacity of the entire system.
The fourth step is to make online comprehensive-procedure stress testing a routine practice, which involves throttling, degradation, chaos engineering acceptance, production, and release.
The fifth step is to review the entire activity to see whether the emergency plan takes effect and what else needs to be optimized. This is the life cycle of comprehensive-procedure stress testing in the production process.
We are now doing something more in-depth. DevOps is now used throughout the development process, and performance tests of a single interface may already be part of it. We have built an interface-level stand-alone performance test for enterprises. With the stand-alone test tool, performance problems in a single interface are caught during the release process, so that no code-level errors occur when the interface goes online. Ultimately, we want to eliminate the need for integrated stress testing, including stress testing in the test environment, so that users can go directly to online stress testing. In the single-interface phase, we support stress testing for the mainstream frameworks. We are also working on support for stress testing of test environment clusters, in the hope that users can skip this step and start online stress testing with traffic isolation.
The preceding figure shows what we consider as the functions required by a complete business continuity platform.
Our ultimate goals for this platform are listed below:
Long-term, routine production stress testing establishes the limits of capacity and water level, and carrying out the plans during drills helps us avoid problems and protect the system in an emergency.
Let's use Alibaba as an example. Their stress testing can be performed monthly. Taobao has monthly promotions and three major promotions every year on June 18, November 11 (Double 11), and December 12 (Double 12). Currently, we can run small-scale drills of the Double 11, Double 12, or June 18 promotions weekly. Moreover, we can organize stress testing activities within or across business units, with very clear expansion plans.
The following are our implementation cases for the third parties:
We decomposed the applications of their system. At first, about four stress testing scenarios were confirmed. Later, about 23 colored scenarios were confirmed through traffic rendering, traffic coloring, and traffic tracking. The shadow table was then created online, and a small amount of traffic was colored to connect the shadow library and the shadow table to the production environment without any impact on production. For the 23 scenarios that we stress tested, no problems such as warehouse overload or overselling occurred during last year's Double 11.
When they did this previously, it took more than 50 people about four months to maintain a separate environment, and that approach still differed somewhat from ours. After the release, there was still a backlog of orders. With our comprehensive-procedure stress testing, it takes a group of about five core people about a month to do the job. We already have a ready-to-go application, which is equipped with the capabilities of self-copying, traffic application, and traffic coloring. The test cycle is measured in days. A small iteration can complete the performance regression of the entire online product in one to two days. Performance regression of the entire primary procedure for high-traffic promotions like Double 11 and Double 12 can be completed in about one week. The capacity of the current production environment can be evaluated, including expansion and changes to the production environment.
For a customer in the cosmetics industry, all the systems were originally developed by third-party vendors with no performance evaluation. What's worse, some of the third-party vendors had been replaced, which made the entire application more complex. As a result, the entire system crashed when a function was taken offline. We estimated that the hardware cost per transaction was about 0.18 yuan. Based on my experience of stress testing for Taobao in 2012, 0.18 yuan was about 9 to 10 times Taobao's cost in 2014, and the customer still had many unknown risks. For example, when they wanted to promote a new application, a failure occurred and caused the flash sales system to collapse. In the end, the promotional activity did not work well.
We spent more than a month building an online environment for them and sorting out 22 core procedures, 22 systems, and more than 600 servers. Building the first production procedure took quite long, about half a month, and they then implemented the rest themselves. It took 55 days to clarify the online capacity of the entire system across all 22 procedures. We did not pollute the production data, and all logs were isolated. Throughout the process, based on the co-construction principle, we helped our customer build a regression mechanism for daily online stress testing.
In terms of short-term benefits, we adjusted the number of application servers, moving some servers from procedures with lower benefits to procedures with higher benefits to reduce the resource consumption rate to about 20%. After the comprehensive-procedure stress testing, we established a baseline for them, against which they run each performance iteration.
So far, they have mastered the entire stress testing process for the production environment, and they can run it as planned each time they go online. This year, their goal is to reduce overall server resources by at least 50%.