Stability and High Availability Guarantee Work Ideas

Introduction

Stability and high availability are familiar, even overused, terms. From experience we know that improving these two properties makes a system healthier and gives the product a better user experience. But how should stability and high availability actually be defined, and what are the differences and connections between them? In my opinion, we must answer these two questions before we can set clear goals and systematically work out a complete and feasible solution.

An in-depth understanding of stability and high availability

Wikipedia defines stability as follows:

Stability is a mathematical or engineering term for determining whether a system with bounded inputs also produces bounded outputs. If so, the system is said to be stable; if not, it is said to be unstable.

And high availability:

High availability (abbreviated HA) is an IT term that refers to the ability of a system to perform its functions without interruption, and represents the degree to which the system is available. It is one of the criteria for system design. A high-availability system can run longer than the individual components that make it up.

First, extract the key words from the definition of stability: system, input, and output. In Ant's current technical architecture, an application can be regarded as a system, with service requests between applications as the input and service responses as the output. When the service responses meet expectations, the application system is considered stable. When applications are combined into a larger system and presented to users as a business product, the user's request is the input and the product's behavior is the output. When the product functions normally, the product system can be considered stable. In summary, we can define stability this way: a system is stable if, on receiving input, it produces correct and expected output; otherwise it is unstable.
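
As a minimal sketch of this definition, the following Python snippet treats a service as a function from request to response and calls it stable only if every probe input yields the expected output. The names `is_stable`, `probes`, and `double` are hypothetical and chosen only for illustration; a real check would call an actual service.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Req = TypeVar("Req")
Resp = TypeVar("Resp")


def is_stable(
    service: Callable[[Req], Resp],
    probes: Iterable[Tuple[Req, Resp]],
) -> bool:
    """Return True only if every probe request yields its expected response."""
    for request, expected in probes:
        try:
            if service(request) != expected:
                return False  # wrong output: the system is unstable
        except Exception:
            return False  # an unhandled error also counts as unstable
    return True


# Hypothetical service, used only for illustration.
def double(x: int) -> int:
    return x * 2
```

For example, `is_stable(double, [(1, 2), (3, 6)])` holds, while a single wrong or failing response is enough to judge the system unstable. This is the sense in which stability is easier to judge by its failures than to measure directly.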
Back to the proposition: why do we speak of guaranteeing stability rather than improving it? From the definitions above, stability describes the behavior of a system. Whether a system is stable is, like whether a person is healthy, difficult to describe and quantify in a single statement, but it can be judged quickly by negative evidence, that is, by the presence of symptoms. People reduce illness and stay healthy through good diet and living habits. The same holds for guaranteeing or improving the stability of a system: we need to avoid unstable situations through various methods. "More stable" is not an objective property; it expresses the subjective hope of avoiding or reducing unstable situations.

Unlike stability, availability is a quantifiable indicator. Wikipedia describes its calculation as follows:
Availability is based on the time the system is damaged or unusable and the time it takes to return from an inoperable to an operable state, compared with the total running time of the system.
The familiar "three nines" (99.9%) and "four nines" (99.99%) measure a system's availability. High availability means keeping this indicator at a high level. The description of the formula divides the system's running time into three parts:
1. The time the system is in normal operation, that is, in a stable state.
2. The time the system is damaged or unusable, that is, in an unstable state.
3. The time the system takes to recover from an inoperable state to an operable state, that is, from an unstable state to a stable state.
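
To make the numbers concrete, here is a small Python sketch that computes availability as uptime divided by total running time, and converts an availability target into a yearly downtime budget. It is an illustration under stated assumptions: downtime here lumps together parts 2 and 3 of the list above, a year is taken as 365 days, and the function names are my own.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the total running time during which the system was usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total running time must be positive")
    return uptime_hours / total


def allowed_downtime_minutes(target: float, period_hours: float = 365 * 24) -> float:
    """Downtime budget, in minutes, for an availability target over a period."""
    return (1.0 - target) * period_hours * 60


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        print(f"{target:.2%}: {allowed_downtime_minutes(target):.1f} min/year")
```

Under these assumptions, three nines leaves about 525.6 minutes (roughly 8.8 hours) of downtime per year, and four nines leaves about 52.6 minutes.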

System availability and system stability are positively correlated. In real life, however, no system stays in a stable state forever. Thinking in reverse, it is more useful for analysis to restate the formula in terms of the time the system is not available: the less time the system spends unstable or recovering, the higher its availability.

At this point, the goal and the KPI of this proposition are clear. The goal of guaranteeing stability and high availability is to keep the system in a stable working state, with no negative impact on users, and to avoid online problems and P-level failures. The core KPI is the availability of the system. To improve availability, we should first ensure the stability of the system and reduce the occurrence of unstable conditions. Second, when the system does become unstable because some component has failed, we should be able to find the problem quickly and restore the system to a stable, usable state.

Core ideas of stability and high availability guarantee


The deduction above gives us two basic approaches to improving system availability. To solve a problem, the first task is to find and define it. Therefore, to improve the stability of the system, we first list the common unstable situations in an application system, and then address them one by one:
•Function: The application performs a function incorrectly, with results that are not as expected.
•Capacity: When the number of requests the system receives increases, the application cannot handle them normally; exceptions or timeouts occur and the service fails.
•Security: When the system receives an unauthorized or malicious request, the application behaves abnormally or the service even fails.
•Fault tolerance: The application does not properly handle incorrect usage by users.
When any of the situations above occurs, the system is in an unstable state, and we need to be able to detect and deal with it in time. In software systems, the causes of these problems usually fall into three categories:
•Human failure: Problems caused by insufficient thought or careless execution at any stage of software development.
•Hardware failure: Network failure, insufficient disk space, memory failure, and so on.
•Software failure: Thread pool exceptions, JVM exceptions, and exceptions in middleware or other dependent application services.

For a dynamically evolving system, there is no way to reduce the probability of failure to zero. We can only minimize failures by establishing process specifications and mechanisms in software production. In addition, for a running system, we need to establish and improve monitoring and early-warning mechanisms that detect faults in time, and recover the system quickly by executing contingency plans. Based on these conclusions, improving the availability of the system requires work in three areas: fault prevention, fault discovery, and fault recovery.
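
One way to see how the three areas contribute is to model yearly downtime as the number of failures multiplied by the time to detect plus the time to recover. The Python sketch below does this. The model is a simplification of my own, not a formula from the source, and the class name, field names, and every figure in the example are hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


@dataclass
class FaultProfile:
    failures_per_year: float   # lowered by fault prevention
    minutes_to_detect: float   # lowered by fault discovery (monitoring, alarms)
    minutes_to_recover: float  # lowered by fault recovery (emergency plans)

    def downtime_minutes(self) -> float:
        """Total unavailable time per year under this simple model."""
        per_failure = self.minutes_to_detect + self.minutes_to_recover
        return self.failures_per_year * per_failure

    def availability(self) -> float:
        return 1.0 - self.downtime_minutes() / MINUTES_PER_YEAR


# Hypothetical figures, for illustration only.
baseline = FaultProfile(failures_per_year=12, minutes_to_detect=30, minutes_to_recover=60)
improved = FaultProfile(failures_per_year=6, minutes_to_detect=5, minutes_to_recover=20)
```

With these made-up figures, `baseline.downtime_minutes()` is 1080 minutes and `improved.downtime_minutes()` is 150 minutes. The point is that each of the three levers reduces downtime independently of the others.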

People make mistakes far more often than machines do, so the most important part of fault prevention is to establish a mechanism, reach consensus on it within the team, and carry out research and development according to that process consistently, so as to reduce the impact of personal factors (thinking, execution, condition, and so on) on system stability. For fault discovery and fault recovery, system monitoring and emergency plans are needed to find and recover from system abnormalities quickly and minimize the impact of faults. The following takes Ant's daily product development process as an example. Starting from the four core elements of function, capacity, security, and fault tolerance, it gives a set of solutions for reference only.

1 R&D Specifications
•Design phase
  •Team design document template
  •High availability design specifications
•Coding phase
  •Code specification
    1. General code specification
    2. Engineering structure specification
  •Unit test coverage
  •Unit test pass rate
  •Code coverage
  •Log specification
  •Security vulnerability repair specification
•Release phase
  •Change specification: "Three-Blade Axe"
2 Capacity Guarantee
•Capacity assessment
  •Machine capacity
  •DB capacity
  •Cache capacity
•Stress testing
•Rate limiting scheme
•Degradation scheme
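
As an illustration of the current limiting (rate limiting) and downgrade (degradation) items above, here is a minimal token-bucket sketch in Python. It is one common technique, not necessarily the one used at Ant. The class `TokenBucket`, the function `handle`, and the degraded response text are all hypothetical, and the limiter is not thread-safe as written.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` requests and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the limit is hit."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle(request: str, limiter: TokenBucket) -> str:
    """Serve the request normally, or fall back to a degraded response."""
    if limiter.allow():
        return f"processed {request}"
    return "degraded: please retry later"  # hypothetical downgrade path
```

Rejecting or degrading excess requests keeps the requests that are admitted within the capacity established by the assessment and stress testing, so a traffic spike slows some users down instead of failing the whole service.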
3 Monitoring and Alarms
•Log specification
•Monitoring review
  •Basic application monitoring
  •Gateway monitoring
  •Service monitoring
  •Business monitoring
  •Rate limit monitoring
•Alarm specification
•Data verification
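
To illustrate what an alarm rule in the list above might look like, here is a minimal Python sketch that fires when the failure ratio over the most recent requests exceeds a threshold. The class name `ErrorRateAlarm` and the default window, threshold, and minimum sample count are assumptions for illustration; real values would come from the team's alarm specification.

```python
from collections import deque
from typing import Deque


class ErrorRateAlarm:
    """Fires when the failure ratio over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.threshold = threshold
        self.min_samples = min_samples
        # Each entry is True if that request failed; old entries drop off automatically.
        self._results: Deque[bool] = deque(maxlen=window)

    def record(self, failed: bool) -> bool:
        """Record one request outcome and return True if the alarm should fire."""
        self._results.append(failed)
        if len(self._results) < self.min_samples:
            return False  # too few samples to judge; avoids noisy alerts
        error_rate = sum(self._results) / len(self._results)
        return error_rate > self.threshold
```

A sliding window keeps the alarm tied to recent behavior, which shortens the time to discover a fault, and the minimum sample count is one simple guard against false alarms at low traffic.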
4 Emergency Quick Response
•Routine contingency plans
  •Hardware exception plan
  •Middleware exception plan
  •Business exception plan
•Major promotion plan
•Plan execution specification
Summary
How to guarantee stability and high availability is a huge topic, and a large number of articles can be found on the intranet about any small part of it. The purpose of this article is to summarize my understanding of stability and high availability guarantee work and to share a systematic framework of ideas. I hope that after reading it you will have a more comprehensive understanding of safe production and will not get lost in the details.
