Constructing a Comprehensive Stress Testing System for Double 11

This article discusses why comprehensive stress testing in production matters and how to implement it.

By Zhou Bo

For many years, comprehensive stress testing has been performed on the production links for each Double 11 Global Shopping Festival. In practice, we found that how well stress testing works in production is closely related to the structure, maturity, and processes of an IT organization. Therefore, we no longer treat it as an isolated production activity but have made it part of the overall business continuity plan.

This article will elaborate on four aspects:

  1. The significance of comprehensive-procedure stress testing: why we should do comprehensive-procedure stress testing in the production process
  2. Technical points and solutions for implementation
  3. Suggestions on the comprehensive-procedure stress testing process in production, based on the different tolerance of each organization
  4. Cases of implementing business continuity in third-party production environments, including the results of stress testing

Significance of Comprehensive-Procedure Stress Testing


The preceding figure shows three typical problems that come up when we discuss testing with different IT organizations.

  1. Many testers told us they had already done performance testing offline, yet many problems still appeared after going online. An offline environment can rarely reproduce the online one, and third-party interfaces in particular are seldom simulated the way they behave online. This is the conclusion we have drawn after much offline testing, and it is why offline results carry over poorly to production at many companies.
  2. Most IT organizations now practice DevOps, and feature iteration has accelerated from once a month to once a week, leaving less time for testing. Functional testing time has shrunk from one or two weeks to two to four days. As a result, performance testing often cannot be completed before release, which leaves potential performance problems online and directly affects the brand of the enterprise.
  3. Normally, online business traffic is relatively low and rarely reaches its peak, but unexpected situations do occur. For example, the pandemic moved many businesses online, such as the education industry, where face-to-face classroom teaching was transferred to online platforms. Such contingencies are tricky for test engineers, development teams, and O&M teams to handle. Here, I would like to introduce the concepts of fragility and anti-fragility put forward by Nassim Nicholas Taleb, the author of The Black Swan.


What is fragility? Fragility is like glass: everyone knows glass is brittle and breaks easily. What is the antonym of fragility? It is not toughness but anti-fragility. What is anti-fragility? Take table tennis as an example. A table tennis ball lying on the ground can be crushed without much force. However, when the ball is moving at high speed, the harder we hit it, the stronger it rebounds. In other words, anti-fragility shows itself in motion.

Our IT systems are the same. No code is 100% free of problems. Our infrastructure may also be fragile: servers and databases always have limits, and our frameworks are fragile as well. Facing all these problems together, we hope to handle them with measures such as pre-planning, risk identification, fault tolerance, and circuit breaking, so that the entire IT system becomes anti-fragile. In short, we hope to give IT systems sufficient redundancy and enough plans to cope with sudden, uncertain risks.


How can we equip IT systems with anti-fragility? We use online stress testing to introduce uncertain factors, and we identify the resulting uncertainties through real-time monitoring and pre-planning. We then address these uncertainties in the review that follows each round of online production stress testing. Next, we run such stress tests in the production environment on a regular basis to keep the system stable over the long term. Eventually, we achieve the capabilities that anti-fragility requires, such as overall monitoring, operation protection, and routing control. This equips the whole IT system with anti-fragility.

Solution for Comprehensive-Procedure Stress Testing

How can comprehensive-procedure stress testing be conducted in the production environment? What technical means does it need?

Evolution of Stress Testing Process


In general, how does the test evolve from offline to online? I divide it into four stages:

1.  Currently, most IT organizations can do offline stress testing of a single system, a single interface, or a single scenario, along with system analysis and performance analysis. Much of this work is carried out by developers or testers. However, in complex business scenarios, it may not identify all the problems.

2.  The organization sets up a body similar to a testing laboratory. Such a large department can build a set of performance testing environments that resemble the production environment, where more work can be done. For example, we can perform comprehensive-procedure stress testing in the offline environment and do other offline work based on previous experience, including performance diagnosis. This is a step forward for testing. The procedures in the test environment need to be analyzed, and capabilities such as risk control evolve alongside.

3.  Currently, the vast majority of IT and Internet enterprises are willing to try business stress testing in the online production environment. This stage is similar to the second one, but it is divided into two layers. On the first layer, only read-only services go through comprehensive-procedure stress testing: for fear of data pollution, many IT companies choose to stress test read-only services outside production hours. On the next layer, some organizations go further and perform comprehensive-procedure stress testing during normal production hours, which requires higher capabilities from the organization.

For example, we need to color all stress testing traffic to distinguish it from normal business traffic. Some of it may also need environmental isolation. For stress testing during business production hours, we need to consider offsetting and throttling the traffic, as well as a circuit breaking mechanism. However carefully the testing is done, some impact on the production business is unavoidable, so a quick circuit breaking mechanism is required in case a problem occurs.

4.  The last stage builds on the capabilities above: stress testing traffic, traffic coloring, and circuit breaking. With these in place, we can perform comprehensive-procedure stress testing on the entire production procedure, including both read and write services. For this, we introduce database-level and table-level technical means, together with other technical means, to perform comprehensive-procedure stress testing in production. We also gain the capabilities of system failure drills and production change drills, so we eventually achieve data isolation, monitoring isolation, and log isolation.

Key Technologies of Comprehensive-Procedure Stress Testing


For comprehensive procedure stress testing, we need several key technologies:

  • Comprehensive-Procedure Traffic Coloring

Stress testing traffic can be recognized by marking it on the stress testing machine, for example by adding a suffix or another identifier, so that it can be routed to the relevant tables. We also need to identify the traffic as it flows through the whole procedure, accurately recognizing it at every middleware and every service it passes through. This is the first step: knowing whether traffic comes from the stress testing machine or from normal users.
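As a rough illustration of marking and propagating the mark, here is a minimal Python sketch. The header name `X-Stress-Test` and all function names are hypothetical and are not taken from any specific product; a real system would do this inside an agent or middleware rather than in application code.

```python
import contextvars

# Hypothetical header name used to mark stress testing traffic.
STRESS_HEADER = "X-Stress-Test"

# Per-request flag; contextvars keeps it isolated between concurrent requests.
_is_stress = contextvars.ContextVar("is_stress", default=False)


def color_request(headers: dict) -> dict:
    """On the traffic generator: return a copy of the headers with the mark added."""
    colored = dict(headers)
    colored[STRESS_HEADER] = "1"
    return colored


def enter_service(headers: dict) -> None:
    """At a service entry point: record whether the incoming request is colored."""
    _is_stress.set(headers.get(STRESS_HEADER) == "1")


def is_stress_traffic() -> bool:
    """Anywhere inside the service: check the flag for the current request."""
    return _is_stress.get()


def outgoing_headers(headers: dict) -> dict:
    """Before calling a downstream service: propagate the mark if it is present."""
    return color_request(headers) if is_stress_traffic() else dict(headers)
```

The key design point is that the mark is re-attached on every outgoing call, so each hop can recognize the traffic without knowing where it originated.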

  • Comprehensive-Procedure Data Isolation

We need methods such as a shadow library to isolate data. The O&M personnel can create a shadow library identical to the production one and then switch to it. Alternatively, they can create shadow tables with the same structure inside the production database. The first method is more secure, but its disadvantages are that the production environment is unavailable while the shadow library is in use and that a shadow library cannot fully simulate the online situation. The shadow table method requires stronger technical capabilities to ensure that the entire link can be traced, including the ability to recover all data in the event of an error.
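The shadow table approach can be sketched in a few lines of Python using an in-memory SQLite database. The `_shadow` suffix, the `orders` table, and the function names are assumptions made for this illustration only; they do not describe how any particular agent names or routes tables.

```python
import sqlite3

SHADOW_SUFFIX = "_shadow"  # hypothetical naming convention for shadow tables


def table_for(base: str, is_stress: bool) -> str:
    """Pick the shadow table for stress testing traffic, the normal table otherwise."""
    return base + SHADOW_SUFFIX if is_stress else base


def create_tables(conn: sqlite3.Connection) -> None:
    """Create the production table and a shadow table with the same structure."""
    for stress in (False, True):
        name = table_for("orders", stress)
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, item TEXT)")


def insert_order(conn: sqlite3.Connection, item: str, is_stress: bool) -> None:
    # The table name comes from a fixed base name, never from user input.
    name = table_for("orders", is_stress)
    conn.execute(f"INSERT INTO {name} (item) VALUES (?)", (item,))


def count_orders(conn: sqlite3.Connection, is_stress: bool) -> int:
    name = table_for("orders", is_stress)
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
```

Because writes from stress testing traffic only ever touch the shadow table, production rows stay clean, and the shadow table can be truncated after each test.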

  • Comprehensive-Procedure Risk Control Mechanism

This is also called the risk circuit breaker mechanism. Once the system detects that online stress testing in the production environment has affected our business, rules or metrics must trigger the circuit breaker automatically. Whether it cuts off the traffic from the stress testing machine or isolates the damaged part of the production system, such a mechanism is necessary for comprehensive-procedure stress testing in production.
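A rule of this kind might look like the following Python sketch, which stops stress testing traffic once the error rate over a sliding window crosses a threshold. The class name, the window size, and the minimum sample count are assumptions for illustration; the 1% default borrows the error-rate example given later in this article.

```python
from collections import deque


class StressTestBreaker:
    """Trips when the error rate over the most recent requests reaches a threshold."""

    def __init__(self, max_error_rate=0.01, window=1000, min_samples=100):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # True means the request failed
        self.tripped = False

    def record(self, is_error: bool) -> None:
        """Record one request outcome and re-evaluate the rule."""
        self.results.append(is_error)
        if len(self.results) >= self.min_samples:
            rate = sum(self.results) / len(self.results)
            if rate >= self.max_error_rate:
                self.tripped = True  # stays tripped until someone resets it

    def allow_stress_traffic(self) -> bool:
        """The traffic generator checks this before sending the next request."""
        return not self.tripped

    def reset(self) -> None:
        self.results.clear()
        self.tripped = False
```

In this sketch the breaker only blocks stress testing traffic, so normal production requests are never rejected by it. Requiring a manual reset is a deliberately conservative choice.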

  • Comprehensive-Procedure Log Isolation

Logs have little impact on the procedure itself. However, as digitalization advances, logs have become one of the most important data sources for BI and operations personnel analyzing the business. If log isolation is not performed, BI decision-making may be affected. For example, during stress testing, we may send a lot of traffic from a certain region to the production environment. BI personnel analyzing the logs may then see an unusually large amount of data from that region and make incorrect operational decisions. Therefore, comprehensive-procedure stress testing in production requires log isolation, so that logs of normal production traffic and stress testing traffic are stored separately.
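One possible way to separate the two kinds of logs is shown below, using filters from Python's standard `logging` module. The `stress` record attribute, the logger name, and the helper names are assumptions for this sketch; a real deployment would more likely set the flag from the traffic mark in an agent.

```python
import io
import logging


class StressOnly(logging.Filter):
    """Pass only records flagged as stress testing traffic."""

    def filter(self, record):
        return getattr(record, "stress", False)


class ProductionOnly(logging.Filter):
    """Pass only records that are not flagged as stress testing traffic."""

    def filter(self, record):
        return not getattr(record, "stress", False)


def build_logger(prod_stream, stress_stream):
    """Attach two handlers to one logger so each kind of traffic has its own sink."""
    logger = logging.getLogger("orders.isolated")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    for stream, flt in ((prod_stream, ProductionOnly()), (stress_stream, StressOnly())):
        handler = logging.StreamHandler(stream)
        handler.addFilter(flt)
        logger.addHandler(handler)
    return logger
```

Application code keeps calling a single logger and passes `extra={"stress": True}` for colored requests, so BI pipelines that read only the production sink never see stress testing entries.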

Core Functions of the Comprehensive-Procedure Stress Testing and Business Continuity Platform


These are the required functions for the comprehensive-procedure stress testing and business continuity platform:

  1. A stress testing traffic tool must be available, with full-region data traffic mining and traffic transformation.
  2. Next comes stress testing identification, including shadow storage. In the figure, the yellow part is normal traffic, and the blue part is stress testing traffic. We add identifiers to the blue part by modifying the stress testing machine. At the bottom layer, the Agent recognizes traffic that carries identifiers and drops it into the corresponding shadow database, shadow table, or shadow cache region.
  3. We need a proper console to manage the circuit breaking rules. The agents are managed from here, including architecture management, database and table maintenance, rule maintenance, and circuit breaker mechanism maintenance.
  4. The last is the stress testing part, where probes or agents may be installed. These agents direct traffic to the specified shadow tables and monitor the metrics. For example, if the error rate reaches 1% or the check time exceeds a certain threshold, the Agent reports the error in time and applies throttling according to the configured rules.

With this architecture, we can cut costs by 40% compared with building a complete separate environment, with virtually no impact on the production business.

Risk Prevention and Control Capabilities of Comprehensive-Procedure Stress Testing


Next, let's talk about how to build a shadow database, including the entire traffic identification.

The orange part is the real stress testing traffic, to which we add an identifier on the stress testing machine. (Currently, we add a suffix.) We also place a filter on the server, which acts as an interceptor. It intercepts the relevant identifiers in the traffic and then distinguishes, colors, and tracks that traffic. Thus, each request is transparently visible in every middleware and project it passes through.
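The server-side interceptor can be pictured as a small piece of middleware. The sketch below uses the standard WSGI calling convention in Python. It assumes, purely for illustration, that the suffix `_pt` is appended to a user ID carried in an `X-User-Id` header; the article does not say which field carries the suffix, and the names `StressTrafficInterceptor` and `stress_test.colored` are hypothetical.

```python
STRESS_SUFFIX = "_pt"             # hypothetical suffix added by the traffic generator
ENV_FLAG = "stress_test.colored"  # hypothetical key under which the result is stored


class StressTrafficInterceptor:
    """WSGI middleware that flags requests whose identifier carries the suffix."""

    def __init__(self, app, header="HTTP_X_USER_ID"):
        self.app = app
        self.header = header  # WSGI environ name of the assumed X-User-Id header

    def __call__(self, environ, start_response):
        value = environ.get(self.header, "")
        # Downstream code reads this flag instead of re-parsing the identifier.
        environ[ENV_FLAG] = value.endswith(STRESS_SUFFIX)
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Toy application that reports which data store it would use."""
    body = b"shadow" if environ[ENV_FLAG] else b"production"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

Wrapping the application as `StressTrafficInterceptor(demo_app)` means every request is classified once at the edge, and the rest of the stack only needs to consult the flag.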

During real stress testing, the Agent can rewrite bytecode so that requests carrying the stress testing identifier are handled under the stress testing condition instead of the normal one. Of course, the shadow library must be built first. Tracking at the bottom layer then puts the corresponding traffic into the shadow library. Once the database is running, we send test traffic through the shadow area to check that the isolation is clean. Because all test data carries identifiers, even if a problem goes undiagnosed during the process, we can still find and delete test data in the normal tables, and every area the traffic passed through remains visible.

With this method, there are three testing stages for most IT organizations. However, there are only two stages for some of the well-developed ones:

  1. Identify the problems before going online. Many problems are detected during offline development, testing, and debugging. We then optimize the interfaces to make sure no known code problems remain, including DNS problems. Such problems are solved in the offline development environment.
  2. In the deployment process, we add third-party plug-ins for security and other issues. With the rise of containers, the boundary between the development and deployment environments is gradually blurring.
  3. Before performing real online stress testing in the production environment, capacity planning or stress testing may be required, including testing for the entire environment (like CDN or DNS) or the entire online system capacity evaluation.

These are the goals we hope to achieve in each stage of the test life cycle.

Suggestions on the Stress Testing Process

Since different IT organizations are in different situations, the suggestions we provide are not necessarily applicable to all of them; they are only for your reference.

Generally, we implement comprehensive procedure stress testing for third parties and online production stress testing through five stages:

The first step is to work with a third party to sort out the business phases. We will do the following:

  1. Evaluate the performance and capacity metrics of business systems based on past system usage
  2. Sort out the system architecture of the existing information systems and determine the entire route of the colored traffic
  3. Communicate throughout the entire stress testing process, including the testing intervals, and confirm the design of relevant stress testing scenarios
  4. Perform the desensitization of the production data if it is sensitive
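To make the desensitization item concrete, here is a minimal Python sketch that masks phone numbers and replaces names with salted hashes before production data is reused as test data. The field names, the masking format, and the salt handling are all assumptions for illustration; real desensitization rules depend on the data and on applicable regulations.

```python
import hashlib


def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 characters and mask everything in between."""
    if len(phone) < 8:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a stable, non-reversible token (salted SHA-256 prefix)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def desensitize(record: dict, salt: str) -> dict:
    """Return a copy of the record with sensitive fields masked or pseudonymized."""
    clean = dict(record)
    clean["phone"] = mask_phone(record["phone"])
    clean["name"] = pseudonymize(record["name"], salt)
    return clean
```

Pseudonymizing with a fixed salt keeps the same customer mapped to the same token, so relationships between records survive while the original values do not.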

The second step is to transform some applications, for example by adding traffic tagging. The business systems involved are determined from the monitored traffic. The relevant monitoring is connected to the business systems, and the relevant third-party components are mocked. The creation of stress testing scenarios is agreed with the third party, including the construction of test traffic and pre-planning.

The third step is the stress testing itself. Comprehensive-procedure stress testing in production is used to optimize the performance and evaluate the capacity of the entire system.

The fourth step is to normalize the online comprehensive-procedure stress testing, which involves throttling, degradation, chaos engineering acceptance, production, and release.

The fifth step is to review the entire activity to see whether the emergency plan takes effect and what else needs to be optimized. This is the life cycle of comprehensive-procedure stress testing in the production process.

We are now going further. Everyone currently uses DevOps throughout the development process, and single-interface performance tests may already be part of it. We have built an interface-level stand-alone performance test for enterprises. With this stand-alone test tool, performance problems of a single interface are caught during the release process, so that no code-level errors occur when the interface goes online. Ultimately, we hope to eliminate the need for integrated stress testing, including stress testing in the test environment, so users can go directly to online stress testing. In the single-interface phase, we support stress testing for the mainstream frameworks. We are also working on support for stress testing test environment clusters, in the hope that users can skip this step and start online stress testing with traffic isolation.


The preceding figure shows what we consider as the functions required by a complete business continuity platform.

  1. The stress testing traffic initiating console is the traffic initiator that manages the entire stress testing traffic and scenario design.
  2. The traffic isolation console provides a unified switch for traffic. When problems occur, users can cut off the stress testing traffic and unify the routing immediately.
  3. Traffic monitoring, including system monitoring, runs during the stress testing process. There is also a performance monitoring platform for the entire application, covering procedure monitoring, JVM monitoring, and component monitoring.
  4. The corresponding rules for chaos engineering, including traffic limiting rules, isolation rules, and degradation rules, will be maintained here.
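As an example of what a traffic limiting rule could do at runtime, the following Python sketch implements a token bucket. It is one common throttling technique, chosen here for illustration; the article does not state which algorithm the platform uses, and the class and parameter names are hypothetical. The caller passes in the current time (for instance from `time.monotonic()`), which keeps the logic easy to test.

```python
class TokenBucket:
    """Allows `rate_per_sec` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A console could expose `rate_per_sec` and `burst` as the maintained rule values, while agents apply the check to stress testing traffic on each request.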

Our ultimate goals for this platform are listed below:

  1. Implement comprehensive-procedure stress testing anytime, anywhere, and at a low cost
  2. Implement periodic fault drills for O&M platforms and make this capability available to the O&M team for initiating changes anytime and anywhere
  3. Prepare for entire launch activities, including large-scale promotions, to avoid breakdowns caused by unexpected activities

Regular, long-term production stress testing establishes the limits of capacity and water level, and rehearsing the plans during drills helps us avoid problems and protect the system in emergencies.

Let's use Alibaba as an example, where stress testing can be performed monthly. Taobao has monthly promotions and three major promotions every year on June 18, November 11 (Double 11), and December 12 (Double 12). Currently, we can do small-scale drills of the Double 11, Double 12, or June 18 promotions weekly. Moreover, we can organize stress testing activities within or across business units, with very clear expansion plans.

Customer Cases

The following are our implementation cases for the third parties:

Case 1

We decomposed the applications of their system. At first, about four stress testing scenarios were confirmed. Later, about 23 colored scenarios were confirmed through traffic rendering, traffic coloring, and traffic tracking. Then, the shadow tables were created online, and a small amount of colored traffic was used to verify that the entire shadow library and the shadow tables were connected in the production environment, without any impact on production. For the 23 scenarios we stress tested, no problems such as warehouse overflow or overselling occurred during last year's Double 11.


When they did this previously, it took more than 50 people about four months to maintain a separate environment. This was still somewhat different from what we did, and after the release there was still a backlog of orders. With our comprehensive-procedure stress testing, however, it takes a team of about five core people about a month to do the job. We already have a ready-to-go application equipped with self-copying, traffic application, and traffic coloring capabilities. The test cycle is measured in days. For a small iteration, performance regression of the entire online product can be completed in one to two days. Performance regression of the entire primary procedure for high-traffic promotions like Double 11 and Double 12 can be completed in about one week. The capacity of the current production environment can also be evaluated, including expansion and changes to the production environment.

Case 2

For a customer in the cosmetics industry, all the systems were originally developed by third-party vendors with no performance evaluation. What's worse, some of the third-party vendors had been replaced, making the applications even more complex. As a result, the entire system crashed when a function was taken offline. We estimated the hardware cost per transaction at about 0.18 yuan. Based on my experience doing stress testing for Taobao in 2012, this 0.18 yuan was about 9-10 times Taobao's cost in 2014, and the customer still had many unknown risks. For example, when they wanted to promote a new application, a failure occurred that caused the flash sales system to collapse. In the end, the promotional activity did not work well.

We spent more than a month developing an online environment for them, sorting out 22 core procedures, 22 systems, and more than 600 servers. Building the first production procedure took quite long, about half a month, and then they implemented the rest themselves. It took 55 days to clarify the online capacity of the entire system across all 22 procedures. We did not pollute the production data, and all logs were isolated. Throughout the process, following the co-construction principle, we helped the customer build a regression mechanism for daily online stress testing.


In terms of short-term benefits, we adjusted the number of application servers, moving some servers from procedures with lower benefits to procedures with higher benefits, which reduced the resource consumption rate to about 20%. After the comprehensive-procedure stress testing, we established a baseline for them, against which they run each performance iteration.

So far, they have mastered the entire stress testing process for the production environment, and they can do it as planned each time they go online. This year, their goal is to reduce the resources of the entire server by at least 50%.
