Three Measures for Test Stability

In this post, a senior Alibaba tech expert shares his best practices and advice on ensuring test stability.


Testing is an essential part of every software development cycle. However, if you have tested software in realistic environments, you know how hard it is to keep tests reliable and stable. In this post, we discuss several methods for stabilizing tests and making the way we test our systems more robust.

1. Test Stability Problems

Ideally, every failed test case[1] would be caused by a real defect. In practice, test cases often fail for other reasons:

  • The deployed service version is incorrect.
  • The hard disk of the test server is full because the logs generated during the previous run were not cleared.
  • The database has dirty data.
  • The test case is inappropriate.
  • During the test, someone has manually executed a scheduled task and used the pipeline.
  • The message is sent incorrectly.

Developers and testers run into these problems so often during debugging that they stop noticing them. Some simply conclude that a failure was caused by the runtime environment and stop debugging. As a result, many real defects are ignored.

2. Three Measures for Test Stability

How can we solve test stability problems? Many people say we can ensure test stability through environments, process control, monitoring, tools, more machines, and dedicated personnel. All of these measures are valid. However, they are specific solutions rather than a methodology or a theoretical system.

At the level of methodology, we rely on three measures to keep production safe: grayscale release, monitoring, and rollback. Similarly, I have three measures for test stability:

  • Frequency
  • Isolation
  • Disposable test environments

3. Measure 1: Frequency

"If it hurts, do it more often" is the phrase I repeat most often. It comes from Martin Fowler. If you are interested, you can read his article Frequency Reduces Difficulty.

The benefits of running tests at a high frequency are as follows:

  • The verification delay is shortened.
  • Active verification is changed to "passive waiting".
  • Intermittent problems are identified.
  • Unstable factors at various layers are exposed.
  • Manual effort is forced to become automated.
  • More data for analysis is provided.
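
The points about intermittent problems and more data can be illustrated with a minimal sketch. It assumes the results of repeated runs against the same build are available as (test name, passed) pairs; the function name `flaky_tests` and this data shape are hypothetical and do not come from any tool mentioned in this post. A test that both passes and fails on the same code is intermittent, and that only becomes visible when the test is run more than once.

```python
from collections import defaultdict


def flaky_tests(history):
    """Return {test_name: failure_rate} for tests with mixed results.

    `history` is an iterable of (test_name, passed) pairs collected from
    repeated runs against the same build (a hypothetical data shape).
    """
    runs = defaultdict(list)
    for name, passed in history:
        runs[name].append(passed)

    flaky = {}
    for name, results in runs.items():
        failures = results.count(False)
        # Tests that always pass or always fail are not intermittent.
        if 0 < failures < len(results):
            flaky[name] = failures / len(results)
    return flaky


history = [
    ("test_pay", True), ("test_pay", False), ("test_pay", True), ("test_pay", True),
    ("test_refund", True), ("test_refund", True),
    ("test_transfer", False), ("test_transfer", False),
]
print(flaky_tests(history))  # {'test_pay': 0.25}
```

With a single run per day, `test_pay` would look like either a pass or an unexplained failure. With four runs, its 25% failure rate stands out from the consistently failing `test_transfer`, which is more likely a real defect or a broken test.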

Frequency is not only a measure for ensuring test stability, but also a game changer for handling other engineering problems:

  • Continuous packaging: In the past, packaging was performed only right before deployment to the testing environment. Packaging problems then delayed the deployment and held up subsequent tests. To address this, we designed continuous packaging, which packages the head of master every hour. When a problem occurs (for example, a missing Maven dependency or Antx configuration), it can be fixed immediately.
  • Daily releases to production: Currently, code is released to the production environment weekly, which is laborious. I suggest releasing to production every day. New code can still ship weekly. On the other days of the week, run the release anyway, even if there is no new code. The purpose is to expose problems through high frequency, to force manual effort to be automated, and to optimize every step of the process.
  • Frequent merging: Because merging branches is painful, merge more often, for example, once a day or even several times a day. Taken to the extreme, this becomes trunk-based development, with constant rebasing and committing.
  • High-frequency drills: This method is also used by the Site Reliability Engineers (SREs) of Ant Financial. To strengthen disaster recovery capability and improve the success rate of disaster recovery drills, the SRE team holds drills at high frequency to fully expose problems, which in turn drives capability building.

However, implementing high frequency is never easy.

High-frequency testing requires support from infrastructure. First, it requires resources, and it puts unprecedented pressure on every part of the infrastructure. Second, it requires certain capabilities. Take the SRE team's high-frequency drills as an example. If each drill caused many problems, drills could not be held frequently. Frequent drills presuppose that our isolation mechanisms and recovery capabilities have reached a certain level. Likewise, for the engineers who write and run tests, high-frequency testing is effective only if isolation and disposable test environments are in place.

A common concern about high-frequency test runs is this: if we do not have time to troubleshoot failed test cases when the tests run once a day, how will we find the time when the frequency increases? My answer is that we will. The problems converge soon after high-frequency runs start, so the total number of problems you need to handle stays roughly the same or even decreases.

4. Measure 2: Isolation

Compared with the other two measures (frequency and disposable test environments), the importance of isolation is more widely accepted. Benefits of isolation include:

  • Test runs no longer affect each other, which reduces noise.
  • Efficiency improves, because destructive tests no longer need to be coordinated with other people.

Isolation is divided into hard isolation and soft isolation. Which mode to use depends on the technology stack, architecture, and service form. Either mode can be taken to its final state:

  • Hard isolation includes fully isolated environments and physical isolation. The key to reaching the final state in hard isolation mode is cost. We need to reduce the cost without enlarging the quality blind zone. For example, an ideal final state would be to compress the entire payment system under test onto one server[2] while still thoroughly covering all functions (including middleware-level features such as scheduled tasks, message subscriptions, and Zdal rules). Everyone could then build several full environments at any time, which would be excellent. In addition, breaking down or decoupling the architecture (for example, independent release by domain) helps cut the cost of hard isolation and greatly reduces the deployment scope of the entire System Under Test (SUT).
  • Soft isolation includes semi-shared environments, logic isolation, and link isolation. The key to reaching the final state in soft isolation mode is the effectiveness of the isolation. If isolation were perfect, we could deploy today's joint debugging environment into the production environment. Environment stability problems would no longer occur, and we would have true testing in production, which is also an ideal final state.

I have achieved both of these final states in my previous work, and they do work. Taking either isolation mode to its final state is a technical challenge: both cost reduction and complete, reliable logic isolation are technical problems.

For today's payment or e-commerce systems, will the final state be hard isolation or soft isolation? It is hard to say now. Judging from technical feasibility, soft isolation is more likely to be where we end up. Because of the physical limits of the architecture, hard isolation becomes difficult beyond a certain scale, and breaking through those limits may create new quality blind zones. However, hard isolation will remain very useful for a long time. For example, we need hard isolation for various unconventional tests, which are technically very complex to perform in soft isolation mode. Since the last fiscal year, my team has been working on a project to build an incremental testing environment in one click (hard isolation). I have two reasons for this. First, pulling up an incremental testing environment in one click is relatively easy, since the key issue is automation. Second, the router-based soft isolation solution is not ready yet and cannot meet our isolation requirements in the short term.

Hard isolation and soft isolation do not conflict with each other and can be used together. For example, when we build a router-based isolation environment, we also create a new database. Hard isolation at the database layer supplements the soft isolation capability, which is absent at that layer.
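
To give a rough picture of what router-based soft isolation can look like, here is a minimal sketch. It assumes that every request carries an environment tag and that each service registers its instances under a tag. A call is routed to the instance in the caller's own environment if one is deployed there, and otherwise falls back to a shared baseline. The `Router` class, the `baseline` tag, and the addresses are all hypothetical; this is one common way to build such a router, not a description of the solution referred to in this post.

```python
BASELINE = "baseline"  # hypothetical tag for the shared, stable environment


class Router:
    """Toy service router for tag-based soft isolation (illustrative only)."""

    def __init__(self):
        # Maps service name -> {environment tag -> instance address}.
        self._instances = {}

    def register(self, service, env_tag, address):
        self._instances.setdefault(service, {})[env_tag] = address

    def resolve(self, service, env_tag):
        """Prefer the instance in the caller's environment, else the baseline."""
        by_tag = self._instances.get(service, {})
        if env_tag in by_tag:
            return by_tag[env_tag]
        if BASELINE in by_tag:
            return by_tag[BASELINE]
        raise LookupError(f"no instance of {service!r} for {env_tag!r} or baseline")


router = Router()
router.register("payment", BASELINE, "10.0.0.1:8080")
router.register("account", BASELINE, "10.0.0.2:8080")
router.register("payment", "feature-123", "10.0.1.1:8080")  # service under test

print(router.resolve("payment", "feature-123"))  # 10.0.1.1:8080 (isolated instance)
print(router.resolve("account", "feature-123"))  # 10.0.0.2:8080 (shared baseline)
```

Only the changed service is deployed per environment, which is what makes this mode cheap. The sketch also shows why the database layer needs separate treatment: routing decides which instance handles a call, but it does nothing to keep two environments from writing to the same tables.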

In short, isolation is essential. The specific isolation solution depends on complexity, cost, effect, and other factors at each stage.

5. Measure 3: Disposable Test Environments

Another of my favorite sentences is: The test environment is ephemeral. This one is my own. Ephemeral means short-lived. I often repeat these words to my QA team and hope that they always remember this principle in their daily work.

The test environment is ephemeral. Consequently, we must have the following capabilities:

  1. We must have a strong test setup capability. The one-click environment setup project that we are working on is part of this capability. After the test setup is completed, we must be able to verify the environment quickly.
  2. Our test strategy, test plan, testability design, and test automation must not depend on a long-lived test environment, including the old data in such an environment. For example, test automation must be able to create all the data it requires automatically.

With these capabilities, we can build an out-of-the-box test environment quickly, repeatably, and with no manual labor. Because we can also create all the data required for testing, the test environment becomes disposable. In other words, we can create a new environment for each test and destroy it after the test is completed. The next time we need a test environment, we build a new one. In addition to the test environment, the test executor must also be disposable.
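
The create, use, and destroy cycle described above can be sketched on a small scale. In this minimal example, a temporary directory with a SQLite database stands in for a real test environment. The names `disposable_env`, `create_test_data`, and `check_transfer`, as well as the `accounts` table, are hypothetical and chosen only for illustration. A real environment would involve deploying services, but the shape is the same: setup is automated, the test creates its own data, and teardown always runs.

```python
import contextlib
import shutil
import sqlite3
import tempfile
from pathlib import Path


@contextlib.contextmanager
def disposable_env():
    """Create a fresh environment, yield it, and always destroy it afterwards."""
    workdir = Path(tempfile.mkdtemp(prefix="test-env-"))
    conn = sqlite3.connect(workdir / "test.db")
    try:
        conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
        yield workdir, conn
    finally:
        # Teardown runs even if the test body raises.
        conn.close()
        shutil.rmtree(workdir, ignore_errors=True)


def create_test_data(conn):
    """The test creates the data it needs and never relies on old data."""
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", 100), ("bob", 0)],
    )
    conn.commit()


def check_transfer():
    with disposable_env() as (workdir, conn):
        create_test_data(conn)
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
        balances = dict(conn.execute("SELECT id, balance FROM accounts"))
        assert balances == {"alice": 70, "bob": 30}
    return workdir


leftover = check_transfer()
print(leftover.exists())  # False: the environment is gone after the test
```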

For test environments that must be retained for some time after the test is completed, we need to set a short upper limit on their lifetime. For example, the following was my practice in the past:

  • Set the default lifecycle of the joint debugging test environment to seven days.
  • If the environment needs to be retained beyond seven days, extend the expiration date. Each extension lasts at most seven days counted from the time of the request. In other words, newExpDate = now + 7 instead of newExpDate = currentExpDate + 7.
  • Counting from createDate, the environment can be extended up to a total of 30 days. To keep it longer than 30 days, submit an application for special approval, for example, approval by the CTO of the Business Group.
  • The benefit is that the pressure is pushed back onto the people who build and use the environments. It is a bit painful at the beginning, but soon everyone gets used to it, and automation improves quickly. A lot of improvements are made under pressure.
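
The lifecycle rules above can be written down as a short sketch. The function names `initial_expiry` and `extend_expiry` and the constant names are hypothetical. The special-approval path for going beyond 30 days is left out, so this version simply caps the expiration date at 30 days after creation.

```python
from datetime import datetime, timedelta

DEFAULT_LIFETIME = timedelta(days=7)   # default lifecycle and size of one extension
MAX_LIFETIME = timedelta(days=30)      # limit without special approval


def initial_expiry(create_date):
    """Expiration date assigned when the environment is created."""
    return create_date + DEFAULT_LIFETIME


def extend_expiry(create_date, now):
    """Return the new expiration date after an extension request.

    The extension counts from `now`, not from the current expiration date,
    so repeated requests cannot stack up. The result never exceeds
    create_date + 30 days.
    """
    new_exp = now + DEFAULT_LIFETIME
    hard_limit = create_date + MAX_LIFETIME
    return min(new_exp, hard_limit)


created = datetime(2024, 1, 1)
print(initial_expiry(created))                        # 2024-01-08 00:00:00
print(extend_expiry(created, datetime(2024, 1, 6)))   # 2024-01-13 00:00:00
print(extend_expiry(created, datetime(2024, 1, 28)))  # 2024-01-31 00:00:00 (capped)
```

Because the extension is computed from the current time, requesting it twice on the same day gives the same result. An environment stays alive only if someone keeps coming back to ask for it.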

You can use a disposable test environment to:

  • Prevent environment decay and reduce dirty data.
  • Improve repeatability and ensure that the environment is consistent for each test.
  • Drive the development of optimization and automation capabilities, such as test environment preparation and data creation.
  • Improve resource turnover. With the same physical resources, faster turnover increases the effective capacity.

Disposable test environments also introduce some new quality risks. In a long-lived environment, the existing data was generated by earlier versions of the code. After a new version is deployed, this old data can help us find data compatibility problems in the new code. In a disposable environment, no old data is available, so data compatibility problems may go undetected.

This risk does exist. To solve this problem, we need to work harder. We need to explore other solutions for data compatibility and other tests and methods to guarantee the quality. We even need to think about how to eliminate the data compatibility problem through the architecture design.

6. Implementation

The three measures, namely frequency, isolation, and disposable test environments, are a bit idealistic. Today, our infrastructure, architecture, and automation are still far from the ideal state.

Even so, we need to be a bit idealistic. It is quite challenging, technically, to implement these three measures. But we are optimistic, and believe we can eventually achieve our goal. We are also practical. We can break down the goal and achieve it step by step based on actual situations.


[1] The test cases here mainly refer to functional test cases, including test cases for unit testing, single-system API testing, and full-link and end-to-end testing.
[2] One possible negative impact of this approach in practice is that it may discourage microservice governance, including domain autonomy, independent testing, and independent release capabilities.

Alibaba Tech