How to perform chaos engineering on the cloud?
In traditional operations and maintenance (O&M), teams usually watch only the main system flow and the business processes, and ignore bypass systems and the underlying architecture. When an alarm fires, it is often caused by exactly the part nobody was watching, so O&M staff cannot respond well. A large-scale fault may require coordinating staff from several business domains, but they come from different businesses with different responsibilities and expertise, so they struggle to work together and handle the fault efficiently.
From a governance and O&M perspective, a business system can be divided into four layers:
① Business system: the layer that business developers know best and work in most, such as the application code they write.
② Related business: second-party dependencies developed by the same company or department.
③ Middleware and basic components: developers generally pay little attention to this layer.
④ Infrastructure layer: O&M staff are familiar with it, and developers rarely take part in operating it.
Each layer faces different problems.
⚫ The most common problems in the business system are code bugs, traffic bursts, and problems caused by intermediate states during a release.
⚫ For the related business layer, the question is how our own business behaves when a dependency has problems. For example, when a related business is down, calls to it may return errors or response time (RT) may rise sharply; if the related business has a code bug, the result is a logic error.
⚫ Problems in middleware and basic components are mostly middleware unavailability, slow SQL, RDS failure, or message delay.
⚫ Problems in the physical infrastructure layer are mostly system downtime or network problems, including network congestion, packet loss, increased latency, and even the loss of an entire data center.
To uncover the unknown problems at these layers, we introduced chaos engineering, which verifies the system from four aspects:
① Capacity: only when the system capacity is known can we tell how many users or calls the system can bear.
② System complexity: as a system ages, it accumulates hidden risks. Iteration of business and technology also brings more technical debt, longer call chains, and fewer usable troubleshooting methods.
③ Availability: no system is 100% available, and any system may fail. What matters is how the system behaves when something goes wrong.
④ People and process: people, process, and the efficiency of staff cooperation are the most unstable factors in fault handling.
All chaos engineering drill scenarios come from faults. Once a fault has been summarized, it is first tried out in a non-production environment. Mature cases can later be turned into drills in the online environment. After each drill, conduct a review and summarize the problems the system needs to fix. This is the ultimate goal of chaos engineering: find what can be optimized through the drill, then optimize it. Finally, all stable cases are collected into an automated chaos engineering case set. This set runs regularly against the online system as a regression of its performance and complexity, and prevents changes from harming stability or availability.
The practice of chaos engineering is divided into four classic steps:
Step 1: Define and measure the steady state of the system. Clarify what kind of requests the system can support and under what conditions, or how the system performs when it runs stably. For example, the system can provide stable service at 1000 QPS.
Step 2: Form a hypothesis. Identify variables that may affect stability while the system is in its steady state. For example: even if the cache cannot serve requests normally, the system can still provide service at 1000 QPS.
Step 3: Turn the hypothesis into events that can happen in the real world. For example, a cache that cannot serve requests may correspond in reality to the cache server's network going down or the cache being forcibly evicted.
Step 4: Prove or refute the hypothesis. For example, after the cache fails, check whether the maximum QPS can still reach 1000. If it can, the system passes the stability check at least for this case; if it cannot, for example the system becomes unstable at 200 QPS, then a bottleneck has been found and can be fixed.
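The four steps can be sketched as a small experiment harness. This is only an illustration under stated assumptions: `ToyService`, `run_experiment`, and the 1000/200 QPS figures are hypothetical stand-ins chosen to mirror the cache example, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_state_qps: float
    observed_qps: float
    hypothesis_holds: bool


def run_experiment(
    measure_qps: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    steady_state_qps: float,
) -> ExperimentResult:
    """Run one chaos experiment against a hypothetical system under test."""
    # Step 1: confirm the steady state before touching anything.
    if measure_qps() < steady_state_qps:
        raise RuntimeError("system is not in steady state; aborting experiment")

    # Step 2 is the hypothesis itself: throughput stays at or above
    # steady_state_qps while the fault is active.
    # Step 3: turn the hypothesis into a real-world event.
    inject_fault()
    try:
        observed = measure_qps()
    finally:
        # Always roll the fault back, even if the measurement fails.
        remove_fault()

    # Step 4: prove or refute the hypothesis.
    return ExperimentResult(steady_state_qps, observed, observed >= steady_state_qps)


class ToyService:
    """Toy stand-in for a service with a cache; the numbers are illustrative."""

    def __init__(self) -> None:
        self.cache_up = True

    def measure_qps(self) -> float:
        return 1000.0 if self.cache_up else 200.0

    def kill_cache(self) -> None:
        self.cache_up = False

    def restore_cache(self) -> None:
        self.cache_up = True


service = ToyService()
result = run_experiment(
    service.measure_qps, service.kill_cache, service.restore_cache, 1000.0
)
```

In this toy run the hypothesis is refuted, because throughput drops to 200 QPS while the cache is down. In a real drill that outcome would mark the cache dependency as a bottleneck to fix.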
The practice of chaos engineering has five principles:
① Build a hypothesis around steady-state behavior: chaos engineering should focus on whether the system keeps working normally when destabilizing events occur, rather than on verifying how the system works.
② Vary real-world events: first, experiment with events that can really happen and ignore impossible ones. Second, list as many potential failure points in the system as possible, and give priority to events that are highly probable or have already occurred.
③ Run experiments in production: in its early days, elastic computing could not run experiments in production either, because system stability was low, observability was poor, and the impact of an injected fault could not be observed well. But testing elsewhere, for example in an environment where code is isolated but data is not, cannot reveal the system's real bottlenecks, because any small difference from production affects the accuracy of the result. So we advocate running experiments in production to reproduce as faithfully as possible how the system behaves when problems occur.
④ Automate experiments to run continuously: treat performance as part of regression, so that regression covers not only functionality but also automated performance checks.
⑤ Minimize the blast radius: with sufficiently strong observability in place, control the impact a drill may have on the system. The purpose of a drill is to expose the system's weak points, not to destroy the system. So limit the scope of the drill, minimize its impact, and try not to affect online users excessively.
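One way to picture the last principle is a drill that targets only a fraction of hosts and aborts as soon as an observed error rate crosses a limit. The sketch below is an assumption-laden illustration: `pick_targets`, `run_guarded_drill`, the host names, and the error-rate model are all hypothetical.

```python
import random
from typing import Callable, List, Set


def pick_targets(hosts: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Choose a small, reproducible subset of hosts to receive the fault."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


def run_guarded_drill(
    targets: List[str],
    inject: Callable[[str], None],
    rollback: Callable[[str], None],
    read_error_rate: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """Inject host by host and stop once the error rate is too high.

    Returns True if every target was injected without tripping the guard,
    False if the drill aborted early. Faults are rolled back in both cases.
    """
    injected: List[str] = []
    completed = True
    try:
        for host in targets:
            inject(host)
            injected.append(host)
            if read_error_rate() > max_error_rate:
                completed = False
                break
    finally:
        for host in injected:
            rollback(host)
    return completed


hosts = [f"host-{i}" for i in range(20)]
targets = pick_targets(hosts, fraction=0.25)  # 5 of 20 hosts

faulted: Set[str] = set()
completed = run_guarded_drill(
    targets,
    inject=faulted.add,
    rollback=faulted.discard,
    # Assumed model: each faulted host adds 2% to the user-facing error rate.
    read_error_rate=lambda: 0.02 * len(faulted),
    max_error_rate=0.05,
)
```

Here the guard trips after the third host, and every injected fault is rolled back before the function returns. The guard is only as good as the metric it reads, which is why the principle presupposes strong observability.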
02 Chaos engineering practice in elastic computing
During stress testing, we set a QPS upper limit for each API of each user, that is, flow control. In the stress test scenario the combined load of all users is relatively stable, but the system still has unstable points. First, although there is per-user flow control, the flow control itself has not been verified; it only ensures that a user calling within the threshold can complete the function without putting too much pressure on the system. Second, the total capacity of a single interface or of the whole application is unknown; only calls from a single user are known to keep the system stable.
When the call peaks of different users from the same call source overlap, back-end pressure becomes too high and the system fails at peak.
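The gap between per-user flow control and total capacity can be shown with a small simulation. This is a minimal sketch: the fixed-window `PerUserLimiter`, the limit of 100 QPS per user, the backend capacity of 1000 QPS, and the API name `ExampleApi` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


class PerUserLimiter:
    """Fixed-window limiter: at most `limit` calls per user per API per second."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        # Old windows are never evicted here; a real limiter would expire them.
        self._counts: Dict[Tuple[str, str, int], int] = defaultdict(int)

    def allow(self, user: str, api: str, now: float) -> bool:
        key = (user, api, int(now))  # one bucket per whole second
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


limiter = PerUserLimiter(limit=100)
backend_capacity = 1000  # assumed total QPS the backend can absorb

# 50 users all burst on the same API within the same second.
admitted = 0
for u in range(50):
    for _ in range(150):
        if limiter.allow(f"user-{u}", "ExampleApi", now=0.5):
            admitted += 1

# Every user stayed within the per-user limit, yet the sum exceeds capacity.
overloaded = admitted > backend_capacity
```

Each user is capped correctly, but 50 users at 100 QPS each admit 5000 calls in one second, five times the assumed capacity. Per-user thresholds therefore only protect the system when they are derived from a known total capacity.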
The above process can be summarized in four steps:
Step 1: A relatively stable system is established.
Step 2: A problem lies hidden: the system capacity is unknown.
Step 3: Users trigger an event that can happen in the real environment, namely different users on the same platform bursting at the same time.
Step 4: The conclusion is that the system cannot bear the pressure, which eventually leads to failure.
There are four problems to solve here: the unknown system capacity with its serious consequences, the unknown system behavior under peak traffic, the unknown cause of the traffic peak, and the unknown handling process after the system crashes. To explore the stability of the system we need to turn these unknowns into knowns, so we introduce stress testing, which has three main steps.
Step 1: Baseline. Define the system bottlenecks and the stop conditions for the stress test.
Step 2: Actual stress test. A stress test generally takes an API as its entry point, which may be an external API or an internal call. There are three stress test methods:
⚫ Simple brute-force stress test: mainly for very simple APIs where a single call already puts pressure on the system. Calling the interface concurrently is enough to stress test it.
⚫ Logical-flow stress test: for interfaces that cannot be called in isolation because some resources must be prepared first. These are stateful interfaces with contextual semantics. Orchestrate the interfaces, for example creating or querying instances, arrange a serial flow for each instance, and run the serial flows in parallel to form a logical-flow stress test case.
⚫ Online replay stress test: used for more complex cases or internal call cases.
Step 3: Automated stress test. Gather stable cases that do not harm the system into an automated case set, and set automatic alarms based on the baseline. When the automated stress test regression finds an interface that no longer meets its baseline, it raises an alarm automatically. This provides performance regression for the system and ultimately addresses the four problems mentioned above:
⚫ Unknown system capacity and its serious consequences: set a system capacity baseline, and determine the capacity by pushing the load up to the system limit;
⚫ Unknown cause of the traffic peak: test by assuming a peak value, simulating multiple users, and lifting flow control. In production, once the system capacity is defined, the unknown source of a flood peak can be handled by calculating per-user flow control and setting the thresholds to a reasonable range;
⚫ Unknown system behavior under peak traffic: once the stress test reaches the system bottleneck, the behavior at the bottleneck can be observed;
⚫ Unknown handling process after a crash: run surprise stress tests to check whether business staff can handle high-traffic scenarios quickly and steadily, and to reduce the time to discover, locate, and recover from a fault.
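The baseline-driven alarm in Step 3 can be sketched as a comparison between stored baselines and the latest measurements. The interface names, thresholds, and the `find_regressions` helper below are hypothetical; a real setup would read measurements from the stress testing tool and route the alerts to an alarm channel.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Baseline:
    min_qps: float      # lowest acceptable sustained throughput
    max_p99_ms: float   # highest acceptable 99th percentile latency


@dataclass(frozen=True)
class Measurement:
    qps: float
    p99_ms: float


def find_regressions(
    baselines: Dict[str, Baseline], results: Dict[str, Measurement]
) -> List[str]:
    """Return one alert message per baseline an interface fails to meet."""
    alerts: List[str] = []
    for api in sorted(baselines):
        base, got = baselines[api], results.get(api)
        if got is None:
            alerts.append(f"{api}: no measurement recorded")
            continue
        if got.qps < base.min_qps:
            alerts.append(f"{api}: QPS {got.qps:g} below baseline {base.min_qps:g}")
        if got.p99_ms > base.max_p99_ms:
            alerts.append(
                f"{api}: p99 {got.p99_ms:g} ms above baseline {base.max_p99_ms:g} ms"
            )
    return alerts


# Illustrative numbers only.
baselines = {
    "CreateOrder": Baseline(min_qps=1000, max_p99_ms=200),
    "QueryOrder": Baseline(min_qps=5000, max_p99_ms=50),
}
latest_run = {
    "CreateOrder": Measurement(qps=1050, p99_ms=180),
    "QueryOrder": Measurement(qps=4200, p99_ms=65),
}
alerts = find_regressions(baselines, latest_run)
```

Treating a missing measurement as an alert is a deliberate choice in this sketch: a case that silently stopped running would otherwise look like a pass.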
Fault drills and observability are inseparable. Only when observability is reasonably complete can faults be injected into the online system with the blast radius under control.
Observability is divided into four layers:
⚫ Business layer: covers result mocks, JVM OOMs, business logic exceptions, and CPU saturation in the application.
⚫ Dependent business layer: dedicated monitoring exists for the interfaces of dependent businesses, and SLA agreements cover part of our own and internal dependencies.
⚫ Middleware and basic software layer: middleware is monitored for metrics such as cache hit rate, slow SQL, and network status.
⚫ Server layer: covered by business detection and network status monitoring.
Different layers call for different fault drills. Mock drills at the business layer require aligning staff from all business domains, monitoring at the business level, and watching the inside of the VM. For dependent services, pay more attention to RT increases, mocked results, and error responses. The middleware and basic software layer may face cache slowdown, cache breakdown, slow MySQL, or failed connections. As an infrastructure provider, elastic computing pays most attention to the server layer. Whenever elastic computing goes online in a region, the control system has to pass multi-zone and downtime drills to verify its availability, so there is rich drill experience at the server layer.
The steps of fault drill are as follows:
Step 1: Get everyone to accept fault drills. Fault drills deliberately create as many avalanches and destabilizing events as possible, which conflicts with how developers normally think, so acceptance comes first. A dedicated drill team sets fixed drill times and a clear drill schedule, and all businesses take part.
Step 2: Daily drill organization. Events for daily drills are selected by two principles: frequently occurring problems come first, and risk goes from low to high. Test the water in a low-risk environment first and confirm the impact in an isolated environment. Destructive experiments and large-scale fault simulations, such as faults whose impact is completely uncontrollable, belong in the low-risk environment. Only relatively stable cases, or cases whose impact has been confirmed, may be drilled in the online environment. Online drills should follow the discover-locate-recover process.
Step 3: Surprise drills. These include red team versus blue team exercises and one-click drills. The red/blue exercise is the more conservative form: people familiar with the drill cases are selected from the drill team to act as the red team and inject problems into the system from time to time; all other business staff form the blue team, responsible for verifying the time to discover, locate, and recover from the problem. The one-click drill is more radical: usually the business leader injects the fault directly to exercise the fault handling process of all business staff. Only highly mature systems can achieve one-click drills.
Step 4: Summary and improvement. This is the ultimate goal of fault drills and stress tests in chaos engineering. Use them to determine the system's limits, including the system water level limit, the O&M response limit, the problem discovery limit, and the system recovery limit, and to clarify system performance and the problem handling process. Record unavailable nodes and performance bottlenecks, then turn them into improvement targets. The aim is to improve system stability.
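The discover-locate-recover times that a drill is meant to verify can be computed from four timestamps. The sketch below is illustrative only: the `DrillTimeline` fields, the sample timestamps, and the target durations are hypothetical values, not figures from the elastic computing team.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DrillTimeline:
    """Timestamps in seconds since the start of the drill."""

    injected_at: float
    discovered_at: float
    located_at: float
    recovered_at: float

    def durations(self) -> Dict[str, float]:
        """Time spent in each phase of fault handling."""
        return {
            "discovery": self.discovered_at - self.injected_at,
            "location": self.located_at - self.discovered_at,
            "recovery": self.recovered_at - self.located_at,
        }


def phases_over_target(
    timeline: DrillTimeline, targets: Dict[str, float]
) -> List[str]:
    """Return the phases that took longer than their target, sorted by name."""
    return sorted(
        phase
        for phase, took in timeline.durations().items()
        if took > targets[phase]
    )


# Hypothetical drill: fault injected at t=0, recovered 700 seconds later.
timeline = DrillTimeline(
    injected_at=0, discovered_at=90, located_at=400, recovered_at=700
)
# Hypothetical targets in seconds for each phase.
targets = {"discovery": 60, "location": 300, "recovery": 600}

slow_phases = phases_over_target(timeline, targets)
```

In this example discovery and location miss their targets while recovery does not, which points the improvement work at alerting and diagnosis rather than at the rollback procedure.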
As a large-scale distributed system, how does chaos engineering for the elastic computing system differ from simple chaos engineering?
① Many deployment environments. Elastic computing has 20 highly available deployments and multiple planned environments, so we want an automated tracking system that automates case coverage and region coverage.
② Many cohesive and coupled nodes. There are many dependencies to plan for and many business parties to involve. The solution is to bring in more business domains and order the drills by priority.
③ Massive real-time calls. Drills face a very high-traffic system whose traffic cannot be paused even briefly. If part of the system becomes unstable during a drill, it is amplified into a very serious online fault. We avoid this through grayscale drills, SLAs, and circuit breaking with degradation.
④ Many interface functions. We hope to achieve case coverage through automated drills, with the automated cases running as a daily regression against the online system to catch changes in system performance.
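Circuit breaking with degradation, mentioned under point ③, can be sketched as a small wrapper around a dependency call. This is a generic illustration of the pattern, not the behavior of any specific tool named in this article; `CircuitBreaker`, its thresholds, and the explicit `now` argument are assumptions made to keep the example deterministic.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after repeated failures, then serve a fallback for a cool-down."""

    def __init__(self, failure_threshold: int, reset_after: float) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T], now: float) -> T:
        if self.opened_at is not None:
            # While open, skip the dependency and serve the degraded result.
            if now - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
            return fallback()
        self.failures = 0
        return result


calls: List[int] = []


def faulty_dependency() -> str:
    """Stand-in for a dependency that has a fault injected during a drill."""
    calls.append(1)
    raise TimeoutError("dependency has an injected fault")


breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
results = [
    breaker.call(faulty_dependency, lambda: "degraded", now=float(t))
    for t in range(5)
]
```

Callers receive the degraded result on every request, but the faulty dependency is only hit three times before the breaker opens. That is what keeps an unstable part from being amplified under massive real-time traffic.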
03 System evaluation and chaos engineering tools
The maturity of a chaos engineering system can be divided into five levels.
Level 1: Mostly start-up systems deployed in a single environment and a single region. They can only be drilled in development and test environments.
Level 2: Has preliminary multi-zone deployment and can take slightly more complex fault injection.
Level 3: Drills can be run in production and grayscale environments.
Level 4: A relatively mature system that can run experiments in production and perform fairly mature fault injection, including fault injection into related businesses.
Level 5: System faults and data plane faults can be injected in every environment. This is the target state.
The cloud provides many supporting tools.
For stress testing, the performance testing tool PTS can configure scenarios, control test progress, and simulate user requests from nodes deployed nationwide, which is closer to real user behavior. During a stress test, observability is covered by cloud monitoring, ARMS, and PTS monitoring, so system pressure can be observed at the same time.
The fault injection platform AHAS can inject problems at various levels, including the application layer, the underlying system layer, and the physical machine layer.
Where there are stress tests and drills, there must be defenses. The main stress testing tools include PTS and JMeter. Protection tools against load exist both on and off the cloud, such as Sentinel and AHAS. When the system has a problem, it can be degraded through Hystrix and AHAS. Drill tools include AHAS, ChaosBlade, and Chaos Monkey. In drill scenarios, faults can be circuit-broken through AHAS. The cloud also provides observability components such as ARMS and cloud monitoring, and Prometheus can be used for observability both on and off the cloud.
Knowledge Base Team