Four Steps to Putting Stability Assurance into Practice

What is stability?
I first came across the word "stability" at the Double Eleven kickoff (KO) meeting in my first year at Alibaba. Hearing terms such as rate limiting, capacity expansion, and stress testing, I only felt that stability work was trivial, complicated, without process or clear metrics, and impossible to know where to start.
This year, I put a lot of energy into stability assurance work together with two teammates. I learned a great deal about stability assurance from them and began to develop some understanding of the work.
Stability work does have structure and steps. Follow the steps one at a time, and you can complete the stability assurance work, do it right, and do it well.
As the saying goes, the palest ink is better than the best memory. I am taking this opportunity to sort out and record what I learned so that it is handy for later use.
What is stability assurance?
So what exactly is stability assurance?
By the commonly accepted definition, stability assurance means ensuring the stability of the system, so that it can still operate and provide services continuously and stably when various unpredictable situations occur.
Personally, I find stability assurance very similar to a water conservancy project. In an application system, the water can be user traffic or capital flow. Stability work ensures that the water flows along the predetermined channels, with no seepage, leakage, or channel collapse, or that such problems can be repaired in time to minimize losses.
What does stability assurance do?
What exactly does stability assurance do? Naturally, it works to achieve the goal of stability assurance.
From the definition above, the goal of stability assurance work is "continuous and stable operation and service provision when various unpredictable situations occur".
How do you achieve this goal? If you are still confused and do not know where to start, that is because the goal is too big and too vague.
When you face a large, vague goal, you can refine it with a divide-and-conquer approach. The steps are indicated by the red dotted arrows in the image below.
First, subdivide the goal into sub-goals, which can be defined by extracting keywords from the goal. From "continuous and stable operation and service provision when various unpredictable situations occur", we can extract five keywords: unpredictable situations, when they occur, continuous and stable, operation, and service provision. These five keywords are the sub-goals of stability assurance work.
Then, ask questions about each sub-goal and list every question you have, including but not limited to questions about understanding the goal, about methods for achieving it, and about the standards for having achieved it.
After that, find answers or solutions to all of your questions. It is like being given a topic by a tutor at school and then working through it yourself. This may require researching material, analyzing the current situation, selecting an approach, making a final decision, and implementing it.
In this way, the sub-goals are achieved one by one, and together they accomplish the overall goal.
To sum up, the content in the outermost light blue circle in the above figure is the work needed to achieve the goal of stability assurance. At first glance it looks like a lot, but on closer inspection there is a pattern. It boils down to four questions: What is the phenomenon? How is it detected? What is its impact? How is it handled? Phenomena, impacts, and handling methods are all tied to specific anomalies, and detection depends on monitoring and alarms.
In short: sort out abnormal situations -> configure monitoring alarms -> evaluate impact areas -> plan solutions.
Next, we walk through these four steps to explain how stability assurance work is implemented. The overall framework is shown in the following figure:
Sort out abnormal situations
So what is an abnormal situation? There are many kinds, and many are quite trivial: high RT, message queue blocking, FullGC, NPE, data exceptions, data inconsistencies, fund amount calculation errors, database connection timeouts, network exceptions, exceptions caused by code bugs, and so on. With so many things to sort out, how do you ensure coverage?
Continue to use the divide-and-conquer approach described above. This time, however, the goal cannot be refined by splitting keywords; instead, sub-goals can be defined by category.
Start by categorizing the exceptions listed above. Message queue blocking, database connection timeouts, and FullGC are exceptions caused by middleware, while code bugs, data inconsistencies, and miscalculated funds come from bugs written by developers or from product design flaws. In summary, exceptions of the middleware kind can be defined as infrastructure exceptions, and exceptions caused by bugs and product defects can be defined as business function exceptions.
Working backward from the categories, infrastructure exceptions cover the network, capacity, connections, disk, cache, JVM, and other middleware or underlying hardware. Such exceptions generally do not occur in day-to-day operation; they tend to occur when traffic changes sharply and suddenly and the infrastructure cannot handle the excess load. They therefore mostly occur during big promotions.
Business function exceptions include code logic exceptions and fund exceptions. They are closely tied to business functions, so they generally appear after a change to business logic, which corresponds to day-to-day requirements. They can be sorted out while each business requirement is developed, and switches and correction tools can be preset in the code based on the results. If an abnormal situation occurs after going online, the preset switches and correction tools can be used to stop the bleeding and repair the damage.
Incidentally, following this classification of exceptions, stability assurance work is customarily divided into daily stability assurance and big-promotion stability assurance.
• Daily stability assurance mainly targets business function exceptions. It is carried out alongside the development of each business requirement, and it must be released before the business functions themselves. Routine business function changes generally do not cause infrastructure exceptions, so daily stability assurance rarely involves infrastructure.
• Big-promotion stability assurance mainly targets infrastructure exceptions. A big promotion brings traffic that is tens, hundreds, or even thousands of times higher than usual, so the infrastructure faces enormous pressure and various exceptions appear. This work must be completed before each big promotion.
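The classification above can be kept as a small, explicit catalogue so that gaps in coverage are easy to spot. The following is only a minimal sketch: the catalogue entries come from the examples in this article, while the names `Category`, `ANOMALY_CATEGORIES`, `assurance_track`, and `uncategorized` are hypothetical and introduced purely for illustration.

```python
from enum import Enum


class Category(Enum):
    INFRASTRUCTURE = "infrastructure"
    BUSINESS_FUNCTION = "business function"


# Hypothetical catalogue built from the examples in the text; a real team
# would maintain and extend its own list.
ANOMALY_CATEGORIES = {
    "message queue blocking": Category.INFRASTRUCTURE,
    "database connection timeout": Category.INFRASTRUCTURE,
    "FullGC": Category.INFRASTRUCTURE,
    "network exception": Category.INFRASTRUCTURE,
    "code bug": Category.BUSINESS_FUNCTION,
    "data inconsistency": Category.BUSINESS_FUNCTION,
    "fund amount calculation error": Category.BUSINESS_FUNCTION,
}

# Which assurance track mainly covers each category, as described above.
ASSURANCE_TRACK = {
    Category.INFRASTRUCTURE: "big promotion",
    Category.BUSINESS_FUNCTION: "daily",
}


def assurance_track(anomaly: str) -> str:
    """Return the assurance track that mainly covers the given anomaly."""
    return ASSURANCE_TRACK[ANOMALY_CATEGORIES[anomaly]]


def uncategorized(observed: list[str]) -> list[str]:
    """List observed anomalies that the catalogue does not cover yet."""
    return [name for name in observed if name not in ANOMALY_CATEGORIES]
```

For example, `uncategorized(["FullGC", "high RT"])` reports `"high RT"` as an anomaly that still needs to be classified.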
Configure Monitoring Alarms
Monitoring alarms fall into three categories: infrastructure monitoring alarms, business function monitoring alarms, and fund security monitoring alarms.
Infrastructure monitoring alarms are generally configured when an application is created, covering the application and all of its middleware, networks, and so on. The group has clear regulations on the coverage of basic monitoring alarms.
Business function monitoring alarms are configured by developers during day-to-day feature development to monitor specific business scenarios.
Fund security monitoring alarms mainly apply to fund-related applications, such as order placement and payment. If none exist yet, they need to be built from scratch and then extended incrementally with each business feature.
On the eve of a big promotion, all monitoring and alarms are generally reviewed to find and fill any gaps.
Data Flow Diagram
"Configure monitoring alarms" is actually shorthand. The complete expression is: prepare the monitoring alarm data sources, then configure the monitoring alarms.
The overall data flow of monitoring alarms is shown in the figure above. Logs, messages, and persistent data serve as the main data sources, and data collected from them is used for monitoring and alarm configuration.
Logs mainly feed the monitoring dashboards, which reflect the real online situation in real time. Messages mainly serve as the trigger for real-time verification: a message triggers a fund security check, which verifies the online logic using bypass checks, fund consistency checks, pairwise checks, and so on. Persistent data is mainly used for offline checks, where pairwise checks verify the correctness of the data. The specific checking logic is described in detail later.
Finally, the monitoring and checks send their alarm information to the alarm system, which triggers the alarm and notifies the alarm responder for handling.
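To make the data flow more concrete, here is a rough sketch of three data sources feeding one alarm sink. Everything in it is an assumption for illustration: the function names, the message fields (`order_id`, `paid`, `payable`), the row layout, and the in-memory `alarms` list that stands in for a real alarm system.

```python
alarms: list[str] = []


def raise_alarm(source: str, detail: str) -> None:
    """Collect alarm information in one place (stand-in for the alarm system)."""
    alarms.append(f"{source}: {detail}")


def handle_log(line: str) -> None:
    # Logs normally feed a dashboard; this sketch only escalates ERROR lines.
    if "ERROR" in line:
        raise_alarm("log", line)


def handle_message(event: dict) -> None:
    # A message triggers a real-time check of the amounts it carries.
    if event["paid"] != event["payable"]:
        raise_alarm("message", f"order {event['order_id']} paid/payable mismatch")


def handle_persistent(rows: list[tuple[str, int, int]]) -> None:
    # Offline check over stored (id, upstream amount, downstream amount) rows.
    for row_id, upstream, downstream in rows:
        if upstream != downstream:
            raise_alarm("offline", f"row {row_id} differs")
```

Whatever the source, every handler ends in the same `raise_alarm` call, which mirrors the idea that all monitoring and checks summarize into a single alarm system.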
Configuration steps
To sum up, once you understand the data sources and data flow of monitoring alarms, you can configure monitoring alarms according to the following steps:
Note that although monitoring can only be configured once data exists, data preparation and monitoring configuration should in practice proceed in parallel. Prepare monitoring data according to the overall plan of monitoring items, then configure the monitoring alarms. If any data turns out not to meet the configuration's needs, return to the data preparation step.
Personally, I judge the quality of a monitoring system mainly by three indicators: correctness, coverage, and intuitiveness. Correctness is the basic function of monitoring: it must reflect the real situation, and monitoring without correctness is meaningless. Coverage is a key measure of a monitoring system's success: the higher the coverage, the more completely the monitoring reflects the actual operation of the system. Intuitive display of monitoring indicators helps you discover abnormalities and locate problems quickly.
For alarms, I think the most important things in alarm configuration are timeliness, effectiveness, and accountability. An alarm generally fires because of an abnormal situation, which may involve capital loss or a malfunction. The sooner it is discovered, the sooner the bleeding can be stopped and the smaller the loss. Alarms are ultimately handled by people, and invalid alarms waste their time, so alarm configuration should filter out noise to keep alarms effective and ensure that reported problems are real problems. Accountability means that someone must respond to every alarm, and it is best to assign each alarm to a specific person. Only an alarm that gets a response is ultimately effective.
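As one possible way to balance these three properties, the sketch below filters noise by requiring several consecutive threshold breaches and refuses to create a rule without an owner. It is an assumption-laden illustration, not a description of any real alarm platform: `AlarmRule`, `AlarmEvaluator`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmRule:
    name: str
    threshold: float           # a sample above this value counts as a breach
    consecutive_breaches: int  # breaches required in a row, to filter noise
    owner: str                 # the person who must respond (accountability)


class AlarmEvaluator:
    def __init__(self, rule: AlarmRule) -> None:
        # Accountability: a rule without an owner is rejected up front.
        if not rule.owner:
            raise ValueError(f"alarm {rule.name!r} has no owner")
        self.rule = rule
        self._streak = 0

    def observe(self, value: float) -> Optional[str]:
        """Feed one sample; return a notification when the streak is first reached."""
        if value > self.rule.threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Fire exactly once per streak, so a long incident does not spam the owner.
        if self._streak == self.rule.consecutive_breaches:
            return f"[{self.rule.name}] notify {self.rule.owner}: value={value}"
        return None
```

The trade-off is visible in `consecutive_breaches`: a larger value filters more noise but delays the alarm, so fund-related rules would likely use a small value.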
Fund security check
The essence of a fund security check is to detect capital loss events so that an alarm can be raised promptly and the bleeding stopped quickly, ultimately achieving capital loss prevention and control. Compared with general business logic, fund logic has some common traits, and based on them we can identify some general fund security check methods and capital loss prevention measures.
As for the methodology, according to the summary of those who came before, fund security problems are mainly caused by abnormalities in how the key elements of funds are processed. The life cycle of these key elements has three important nodes: production, delivery, and consumption. The errors that can occur at these nodes are production errors, missed or incorrect deliveries, and consumption errors. In response, three major verification methods have been proposed. Verification here means finding correct data to serve as the expectation and comparing the actual situation against it.
Baseline checking uses historical data as the expectation. It relies on the correctness of the historical data and requires little investment; its effectiveness is low, but it can catch large fund problems.
Pairwise checking uses upstream data as the expectation and compares downstream data against it. Its accuracy and timeliness are high, but its cost is relatively high and comprehensive coverage is difficult.
Business logic checking uses the experience of business experts as the expectation. It requires a lot of manpower and depends heavily on experience, but it offers high accuracy and high timeliness.
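The three verification methods differ only in where the expectation comes from, which a short sketch can show. The functions below are illustrative assumptions rather than an established implementation: the 50% tolerance, the order-ID keyed dictionaries, and the discount rule are all made up for the example.

```python
from decimal import Decimal


def baseline_check(today_total: Decimal, history: list[Decimal],
                   tolerance: Decimal = Decimal("0.5")) -> bool:
    """Expectation = historical average; pass if today is within the tolerance.

    `history` must be non-empty. The coarse tolerance is why this method
    only catches large problems.
    """
    average = sum(history, Decimal("0")) / len(history)
    return abs(today_total - average) <= tolerance * average


def pairwise_check(upstream: dict[str, Decimal],
                   downstream: dict[str, Decimal]) -> list[str]:
    """Expectation = upstream; return IDs missing or different downstream."""
    return [
        order_id
        for order_id, expected in upstream.items()
        if downstream.get(order_id) != expected
    ]


def business_rule_check(order_amount: Decimal, discount: Decimal) -> bool:
    """Expectation = an expert rule: 0 <= discount <= order amount."""
    return Decimal("0") <= discount <= order_amount
```

Amounts use `Decimal` rather than `float` so that comparisons of monetary values are exact.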
What specific measures can we adopt for capital loss prevention and control?
A teammate who specializes in capital loss prevention and control made a very comprehensive summary. As shown in the figure below, the strategies are to protect the existing stock, review increments, control high-risk items, and watch what is unique to the promotion. Capital loss prevention and control is a long-term process that requires frequent maintenance: continuously optimize the checks deployed for existing logic; assess capital loss risk for every new change, with a checkpoint before release to confirm that corresponding verification rules exist; give scenarios and data that are prone to capital loss dedicated, heightened protection; and pay special attention to logic or scenarios unique to the big promotion.
To prevent and control capital loss, sort out the fund security risk scenarios across the whole link, analyze the capital loss scenarios along dimensions such as loss type, rule expression, rule type, and dependency factors, and establish a corresponding capital loss model. To be both fast and accurate, listen to asynchronous messages for real-time verification, combine this with alarms on specific error logs, and add hourly offline verification of persistent data to ensure the safety of funds. The correctness of the verification scripts themselves can be ensured through organized reviews and attack-and-defense drills.
Establish a capital loss dashboard, assign dedicated staff to watch it during the peak of the big promotion, and respond to problems promptly. At the same time, prepare the necessary emergency plans in advance for items with high capital loss risk.
Evaluate impact areas
The impact area naturally also depends on the anomaly.
Infrastructure anomalies come in many kinds. Minor ones are just a short period of high load; serious ones include a piece of middleware (such as MetaQ) becoming unavailable, a data center losing power, or an optical cable being cut. The severity of the anomaly directly determines the impact area. It may be a soaring error rate, soaring RT, blocked messages, or frequent FullGC, all of which affect the continuous stability of the system. It may also be total paralysis: the system or network is unavailable and traffic drops to zero. Such serious anomalies generally do not occur, however. Applications are deployed across multiple data centers, hardware disaster recovery is handled by a dedicated team, and middleware is likewise operated and maintained by a dedicated team.
For infrastructure anomalies, stability assurance work generally only considers problems such as insufficient capacity and high system pressure caused by large traffic surges. The cause of this kind of problem is clear: the traffic is too large. The immediate symptoms are a soaring error rate, soaring RT, blocked messages, frequent FullGC, and so on. In more serious cases, it leads to customer complaints and negative public opinion.
Business function anomalies are errors. Whether it is a logic anomaly, a fund calculation error, or a fund flow error, the root cause is a gap in product design, a bug left by development, or a misconfiguration somewhere. This type of anomaly is tied to specific business scenarios: small ones affect only a minor local function, while large ones affect core functions. They may cause customer complaints and negative public opinion, and fund anomalies may lead to capital loss.
Plan solutions
From the overall framework above, we already know that the planned solutions include measures such as rate limiting, stress testing, capacity expansion, and contingency plans. So how do these solutions take effect? Is there a certain order among them?
The solution is strongly tied to the exception type, and different types call for different solutions. Solutions are therefore also divided into solutions for business function exceptions and solutions for infrastructure exceptions.
Solutions for business exceptions
Stability assurance approaches solutions for business exceptions mainly from the perspective of "what do we do if something goes wrong". It is therefore necessary to prepare a toolkit in advance for emergencies.
Solutions to business exceptions generally fall into three categories: stop-the-bleeding solutions, temporary solutions, and long-term solutions. In that order, both the time they take and the degree to which they solve the problem increase.
Stop-the-bleeding solutions generally do not require code changes or releases. They are implemented through contingency plans, settings, switches, and so on, and usually need to be planned and prepared in advance. Temporary and long-term solutions generally require code changes and releases, which take longer. So when a problem occurs, the most efficient stop-the-bleeding solution is generally applied first. If losses remain large after that, a temporary solution must be produced quickly. Although a temporary solution solves the problem to some extent, it may leave minor functional issues, performance issues, or inelegant code. It is therefore necessary to design a stable and elegant long-term solution once the problem has been alleviated. That, of course, is a matter for later and falls outside the scope of stability assurance work.
For business function exceptions, plan ahead during day-to-day development: prepare switches or settings that can degrade the function along multiple dimensions and to different degrees, and keep them in reserve to stop the bleeding when an anomaly occurs. That is the contingency plan.
The plan must be verified through rehearsal to ensure that it is configured and executed correctly.
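A preset switch and its rehearsal might look roughly like the sketch below. It assumes an in-memory registry in place of a real configuration center, and the switch name `coupon.enabled`, the class `SwitchRegistry`, and the coupon scenario are all hypothetical.

```python
class SwitchRegistry:
    """In-memory stand-in for a configuration center (hypothetical)."""

    def __init__(self, defaults: dict[str, bool]) -> None:
        self._defaults = dict(defaults)
        self._values = dict(defaults)

    def is_on(self, name: str) -> bool:
        return self._values[name]

    def set(self, name: str, value: bool) -> None:
        # Reject typos: only switches that were preset can be flipped.
        if name not in self._values:
            raise KeyError(f"unknown switch: {name}")
        self._values[name] = value

    def reset(self) -> None:
        self._values = dict(self._defaults)


switches = SwitchRegistry({"coupon.enabled": True})


def checkout_total(amount: int, coupon: int) -> int:
    """Apply the coupon unless the feature has been degraded via the switch."""
    if switches.is_on("coupon.enabled"):
        return max(amount - coupon, 0)
    return amount


def rehearse_coupon_plan() -> bool:
    """Rehearsal: execute the plan, confirm the degraded behavior, then restore."""
    switches.set("coupon.enabled", False)
    try:
        return checkout_total(100, 30) == 100
    finally:
        switches.reset()
```

Because the switch is read at call time, flipping it takes effect without a code change or release, which is what makes it usable for stopping the bleeding.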
