Work ideas for stability and high availability assurance

One. A deep understanding of stability and high availability

Stability and high availability are two well-worn terms. From experience and intuition, we know that improving them makes a system healthier and gives the product a better user experience. But how would you actually define stability and high availability? How do the two differ, and how are they connected? In my opinion, we must answer these two questions before we can set clear goals and systematically formulate a complete and feasible plan.

Wikipedia defines stability as follows:

Stability is a mathematical or engineering term that describes whether a system produces bounded outputs for bounded inputs. If it does, the system is called stable; if not, it is called unstable.

Now look at high availability:

High availability (HA), an IT term, refers to the ability of a system to perform its functions without interruption and represents the degree of availability of the system. It is one of the criteria for system design. A highly available system can run for longer than the individual components that make it up.

First, distill the key words from the definition of stability: system, input, and output. In Ant's current technical architecture, an application can be regarded as a system, a service request between applications is the input, and the service response is the output. When the service response meets expectations, the application system is considered stable. When applications are combined into a larger system and presented to users as a business product, the user's request is the input and the product's behavior is the output. When the product functions normally, the product system can be considered stable. To sum up, stability can be defined as follows: if a system produces the correct, expected output for the input it receives, the system is stable; otherwise, it is unstable.

Back to the topic: why is it called stability assurance? Could we say "stability improvement" instead? From the definition above, stability describes the behavior of a system. Judging whether a system is stable is like judging whether a person is healthy: it is hard to describe and quantify in a positive statement, but it can be judged quickly in the negative, by the presence of problems. People reduce illness and stay healthy through good diet and living habits. The same holds for ensuring or improving system stability: we use various methods to avoid unstable situations. A "more stable" system does not exist in an objective sense; the phrase expresses a subjective wish to avoid or reduce instability.

Unlike stability, availability is a quantifiable indicator. Wikipedia describes its calculation as follows:

It is based on the time the system is damaged or unavailable, plus the time taken to go from an inoperative to an operational condition, compared with the total system operating time.

The three nines (99.9%) and four nines (99.99%) we often hear about measure system availability. High availability means keeping this indicator at a high level. The formula's definition divides the system's running time into three parts:

* The time during which the system operates normally, that is, the time the system is in a stable state.
* The time during which the system is damaged or unavailable, that is, the time the system is in an unstable state.
* The time it takes the system to recover from inoperability to operability, that is, the time it takes to go from an unstable state back to a stable state.

System availability and system stability are positively correlated. In reality, however, no system can stay in a stable state forever. Thinking in reverse and converting the formula above is more useful for our analysis:

[Figure: converted availability formula (image not available)]
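The converted formula was an image that is not available in this copy. One reading consistent with the three time segments listed above is sketched below. The symbols are notation assumed here and are not taken from the original figure.

```latex
\text{Availability}
  = \frac{T_{\text{stable}}}{T_{\text{total}}}
  = 1 - \frac{T_{\text{unstable}} + T_{\text{recover}}}{T_{\text{total}}},
\qquad
T_{\text{total}} = T_{\text{stable}} + T_{\text{unstable}} + T_{\text{recover}}
```

Read this way, availability rises when the system spends less time in an unstable state and when recovery takes less time.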

At this point, the goal and the KPI of this topic are clear. The goal of ensuring stability and high availability is to keep the system in a stable working state, to avoid negative impact on users, and to avoid online problems and P-level failures. The core KPI is system availability. To improve availability, we should first ensure the stability of the system and reduce the occurrence of unstable conditions. Second, when the system becomes unstable because various components fail, the problem must be discovered quickly and the system restored to a stable, available state.
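To make the KPI concrete, here is a minimal Python sketch that computes availability from a list of incidents and converts a "nines" target into an allowed downtime budget. The `Incident` record and both function names are hypothetical, introduced only for illustration, and the incident figures are made-up sample values.

```python
from dataclasses import dataclass
from typing import List

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


@dataclass
class Incident:
    """One period of instability (hypothetical record shape)."""
    unstable_minutes: float  # time the system was damaged or unavailable
    recover_minutes: float   # time taken to return to a stable state


def availability(total_minutes: float, incidents: List[Incident]) -> float:
    """Availability = 1 - (unstable time + recovery time) / total running time."""
    downtime = sum(i.unstable_minutes + i.recover_minutes for i in incidents)
    return 1.0 - downtime / total_minutes


def downtime_budget_minutes(target: float, total_minutes: float) -> float:
    """Downtime allowed in the period for a target such as 0.999 (three nines)."""
    return (1.0 - target) * total_minutes


if __name__ == "__main__":
    # Sample values only: two incidents in one year.
    incidents = [Incident(5, 25), Incident(10, 50)]
    print(f"availability: {availability(MINUTES_PER_YEAR, incidents):.5f}")
    for target in (0.999, 0.9999):
        budget = downtime_budget_minutes(target, MINUTES_PER_YEAR)
        print(f"target {target}: {budget:.1f} minutes of downtime per year")
```

Over a full year, three nines allows about 525.6 minutes (roughly 8.8 hours) of downtime, and four nines allows about 52.6 minutes, which is why faster discovery and recovery matter as much as fewer faults.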

Two. Core ideas of stability and high availability assurance

From the deduction above, we arrive at two basic approaches to improving system availability. To solve a problem, the first task is to find and define it. Therefore, to improve system stability, we first list the common unstable situations in an application system and then apply the right remedy to each:

Functionality: The application executes a function incorrectly, producing a result that is not as expected.

Capacity: When the number of requests increases, the application cannot process them normally; exceptions or timeouts occur and the service fails.

Security: When the system receives unauthorized or malicious requests, the application behaves abnormally or the service even fails.

Fault tolerance: The application cannot properly handle users' mistakes in usage.

When any of the above occurs, the system is in an unstable state, and we need to detect and handle it in time. In software systems, the causes of these problems usually fall into the following three categories:

Human failure: Insufficient consideration at any stage of software development, or problems caused by careless execution.

Hardware failure: network failure, insufficient disk space, memory failure, etc.

Software failure: thread pool exceptions, JVM exceptions, and exceptions in middleware or other dependent application services.

For a dynamically evolving system, we cannot reduce the probability of failure to zero. We can only minimize failures by establishing process specifications and mechanisms for software production. In addition, for a running system, we need to build and improve monitoring and early-warning mechanisms to detect faults in time, and to restore the system quickly by executing contingency plans. Based on these conclusions, improving system availability requires work in three areas: fault prevention, fault discovery, and fault recovery.


People make mistakes far more often than machines do, so the most important part of fault prevention is to establish a mechanism, reach consensus within the team, and consistently carry out research and development according to that process, thereby reducing the impact of personal factors (thinking, execution, state of mind, and so on) on system stability. Fault discovery and fault recovery require system monitoring and emergency plans that detect abnormalities quickly and restore the system, so as to minimize the impact of faults. Taking Ant's daily product development process as an example, and starting from the four core elements of functionality, capacity, security, and fault tolerance, the following set of solutions is offered for reference only.

1 R&D specifications

* Design phase
  * Team segmentation document template
  * High availability design specification
* Coding phase
  * Code specification
    1. General code specification
    2. Engineering structure specification
  * Unit test coverage
    * Unit test pass rate
    * Code coverage
  * Log specification
  * Security vulnerability repair specification
* Release phase
  * Change specification: three axes
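The unit test ("single test") pass rate and code coverage items above lend themselves to an automated check before release. The following is a minimal sketch, assuming a hypothetical `BuildReport` record and illustrative thresholds (100% pass rate, 80% line coverage). The article does not state which thresholds Ant uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildReport:
    """Summary of one build (hypothetical shape, for illustration only)."""
    tests_run: int
    tests_passed: int
    lines_total: int
    lines_covered: int


def gate_failures(report: BuildReport,
                  min_pass_rate: float = 1.0,    # assumed threshold
                  min_coverage: float = 0.8      # assumed threshold
                  ) -> List[str]:
    """Return the reasons a build should be blocked; an empty list means pass."""
    failures = []
    # Treat a build with no tests or no measured lines as failing the gate.
    pass_rate = report.tests_passed / report.tests_run if report.tests_run else 0.0
    coverage = report.lines_covered / report.lines_total if report.lines_total else 0.0
    if pass_rate < min_pass_rate:
        failures.append(f"unit test pass rate {pass_rate:.1%} is below {min_pass_rate:.1%}")
    if coverage < min_coverage:
        failures.append(f"code coverage {coverage:.1%} is below {min_coverage:.1%}")
    return failures


if __name__ == "__main__":
    report = BuildReport(tests_run=100, tests_passed=99,
                         lines_total=1000, lines_covered=700)
    for reason in gate_failures(report):
        print("BLOCKED:", reason)
```

Encoding the specification as a gate moves enforcement from individual discipline to the process itself, which is the point the article makes about reducing personal factors.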

2 Capacity guarantee

* Capacity assessment
  * Machine capacity
  * DB capacity
  * Cache capacity
* Stress testing
* Rate limiting scheme
* Degradation plan
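As an illustration of the rate limiting ("current limiting") and degradation ("downgrade") items above, here is a minimal token bucket sketch with a degraded fallback. The token bucket is one common technique chosen for illustration; the article does not say which algorithm is used in practice. `TokenBucket` and `call_with_degrade` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenBucket:
    """Admit at most `rate_per_sec` requests on average, with bursts up to `capacity`.

    Not thread-safe; a real service would need locking or a shared store.
    """

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Refill tokens for the elapsed time, then try to take one."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_degrade(limiter: TokenBucket,
                      primary: Callable[[], T],
                      fallback: Callable[[], T]) -> T:
    """Use the degraded fallback when the request is rate limited or the primary fails."""
    if not limiter.allow():
        return fallback()
    try:
        return primary()
    except Exception:
        return fallback()


if __name__ == "__main__":
    limiter = TokenBucket(rate_per_sec=5, capacity=2)
    for i in range(4):
        print(i, call_with_degrade(limiter, lambda: "full result", lambda: "degraded result"))
```

The limit itself would normally come from the capacity assessment and stress test results, so that the limiter rejects traffic just before the system reaches the load at which it becomes unstable.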

3 Monitoring and alarms

* Log specification
* Monitoring inventory
  * Basic application monitoring
  * Gateway monitoring
  * Service monitoring
  * Business monitoring
  * Rate limiting monitoring
* Alarm specification
* Data check
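An alarm specification usually has to balance fast fault discovery against noisy alerts. The sketch below shows one simple rule, assumed here for illustration and not taken from the article: fire only when the error rate exceeds a threshold for several consecutive monitoring windows. `ErrorRateAlarm` is a hypothetical name.

```python
from collections import deque


class ErrorRateAlarm:
    """Fire when the error rate exceeds `threshold` for `consecutive` windows in a row."""

    def __init__(self, threshold: float, consecutive: int):
        self.threshold = threshold
        # Keep only the most recent `consecutive` breach flags.
        self.recent = deque(maxlen=consecutive)

    def observe(self, total: int, errors: int) -> bool:
        """Record one monitoring window and return True if the alarm should fire."""
        rate = errors / total if total else 0.0
        self.recent.append(rate > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


if __name__ == "__main__":
    alarm = ErrorRateAlarm(threshold=0.05, consecutive=3)
    # Sample windows as (total requests, failed requests); values are made up.
    for total, errors in [(100, 1), (100, 10), (100, 12), (100, 9), (100, 2)]:
        print(total, errors, "ALARM" if alarm.observe(total, errors) else "ok")
```

Requiring consecutive breaches filters out single-window spikes at the cost of a slightly longer discovery time, a trade-off that feeds directly into the unstable time in the availability calculation.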

4 Emergency response

* Daily plan
  * Hardware exception plan
  * Middleware exception plan
  * Business exception plan
* Promotion plan
* Plan execution specification
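A plan execution specification can be thought of as "which steps, in which order, with a record of what happened". The following minimal sketch assumes a hypothetical in-memory registry that maps a fault type to ordered steps; the plan names and steps are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def execute_plan(plans: Dict[str, List[Step]], fault_type: str) -> List[str]:
    """Run the steps registered for `fault_type` in order and return an execution log.

    Execution stops at the first step that fails or raises, so a human can take over.
    """
    if fault_type not in plans:
        return [f"no plan registered for '{fault_type}'"]
    log = []
    for name, action in plans[fault_type]:
        try:
            ok = action()
        except Exception as exc:
            log.append(f"{name}: error ({exc})")
            break
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log


if __name__ == "__main__":
    # Hypothetical plans; real steps would call operational tooling.
    plans = {
        "middleware": [
            ("switch to standby cluster", lambda: True),
            ("verify traffic recovered", lambda: True),
        ],
    }
    for line in execute_plan(plans, "middleware"):
        print(line)
```

Keeping the log alongside the plan makes each execution reviewable afterwards, which helps refine the plan and shorten recovery time in later incidents.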

Three. Summary

How to ensure stability and high availability is a huge topic, and a search of the intranet turns up many articles on any small part of it. The purpose of this article is to summarize my understanding of stability and high availability assurance and to share a framework for thinking about it. I hope that after reading it, everyone will have a more comprehensive understanding of safe production without getting bogged down in details.
