The following is a reference for overall stability solutions based on the principles above.
Architecture Design Principle
The architecture of software systems has evolved from the Monolithic Application architecture, in which all functions run within one application, through the Traditional Distributed Application architecture, in which different functional modules are deployed on different servers, and the Microservices architecture, in which services are finely divided and communicate through lightweight mechanisms, to the Cloud-native architecture, which combines cloud computing, containerization, and microservices. Throughout this evolution, the basic attributes of a system remain unchanged: storage, computing, and networking. What changes are their implementation methods and scale, which evolve toward large scale, high performance, high reliability, and easy scalability. This places higher requirements on architecture stability.
Foreseeable stability risks include software and hardware failures and unexpected traffic, ranging from thread-level issues to regional disasters. Based on this understanding, system architecture stability can be established through disaster recovery, failure tolerance, and capacity.
Disaster Recovery
Disaster recovery (DR) refers to keeping business operations uninterrupted when a disaster occurs while minimizing data loss. Geo-redundancy and active/active deployment within the same region are examples of disaster recovery. With the multiple regions and availability zones provided by Alibaba Cloud, applications can be deployed in a disaster recovery architecture at lower cost.
Disaster recovery requires comprehensive data protection and recovery capabilities so that, when the production center fails to operate normally, the standby center can take over in the shortest possible time, restore normal business operation, preserve data integrity and business continuity, and minimize losses. For data disaster recovery, Cloud Backup provides backup, disaster recovery protection, and policy-based archive management for ECS instances, databases on ECS, file systems, NAS, and more.
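As a minimal illustration of the takeover logic described above, the following sketch routes traffic to a standby endpoint when the primary health check fails. The endpoints and health-check URL are hypothetical placeholders; a production setup would rely on DNS failover, load balancers, or dedicated DR tooling rather than application code.

```python
import urllib.request

# Hypothetical endpoints; in practice these would be the production and
# standby entry points (e.g., load balancer addresses in two regions).
PRIMARY = "https://primary.example.com/health"
STANDBY = "https://standby.example.com/health"

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Treat any HTTP 200 response within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_endpoint() -> str:
    """Route traffic to the primary site, falling back to the standby site."""
    if is_healthy(PRIMARY):
        return PRIMARY
    # Production center is unreachable: the standby center takes over.
    return STANDBY

if __name__ == "__main__":
    print("Routing traffic to:", pick_endpoint())
```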
Failure Tolerance
Failure tolerance refers to the mechanisms and strategies designed into distributed systems that automatically detect, isolate, or correct errors when failures occur, so that the system keeps running normally and its overall reliability and stability improve.
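The sketch below illustrates one common failure tolerance pattern: retry with exponential backoff plus graceful degradation to a fallback result. The flaky dependency and its failure rate are assumptions for the demo; real systems would combine this with timeouts, circuit breaking, and isolation.

```python
import random
import time

def call_with_tolerance(operation, fallback, retries=3, base_delay=0.2):
    """Retry a flaky operation with exponential backoff; degrade gracefully
    to a fallback result if the failure persists."""
    for attempt in range(retries):
        try:
            return operation()
        except Exception:
            # Back off before retrying so a struggling dependency can recover.
            time.sleep(base_delay * (2 ** attempt))
    # All retries failed: return the degraded (fallback) result instead of
    # propagating the error to the caller.
    return fallback()

if __name__ == "__main__":
    def flaky_query():
        # Hypothetical dependency that fails half of the time.
        if random.random() < 0.5:
            raise RuntimeError("temporary failure")
        return "fresh data"

    print(call_with_tolerance(flaky_query, lambda: "cached data"))
```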
Capacity
Capacity refers to the maximum workload or data volume that a system can handle within a certain period of time. System capacity is closely related to hardware, software, architecture, and network bandwidth. For applications on the cloud, it is also necessary to pay attention to the cloud service quotas within each Alibaba Cloud account to avoid business failures caused by quota limits. With the Alibaba Cloud Quota Center, users can query the quota limits of the corresponding cloud services or adjust them on demand.
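As a rough illustration of capacity planning, the sketch below estimates how many instances are needed for an assumed peak load with some redundancy headroom. The numbers are illustrative, not benchmarks; actual capacity should be derived from load testing and checked against the quotas of the cloud services in use.

```python
import math

def required_instances(peak_qps: float, qps_per_instance: float,
                       redundancy: float = 1.5) -> int:
    """Estimate how many instances are needed to absorb peak traffic.

    redundancy adds headroom for failover and unexpected spikes; the
    figures below are illustrative, not measured values.
    """
    return math.ceil(peak_qps * redundancy / qps_per_instance)

if __name__ == "__main__":
    # Assumed numbers: 12,000 QPS at peak, 800 QPS per instance.
    print(required_instances(peak_qps=12_000, qps_per_instance=800))  # -> 23
```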
Change Management Design Principle
Changes are common in every enterprise's operation and management processes. A change refers to the addition, modification, or deletion of anything that may directly or indirectly affect services. A failed change may have serious consequences, such as business interruption and negative customer sentiment. To reduce the business risks caused by changes, it is necessary to follow the change management design principles: grayscale capability, observability, and rollbackability.
Grayscale Capability
To achieve grayscale capability, it is necessary to establish a complete grayscale release mechanism, which helps reduce business impact and protect user experience when a change fails.
The grayscale release mechanism includes, but is not limited to, the release method, release batches, the interval between batches, and release observation. Note the following points during a grayscale release (a batched-rollout sketch follows this list):
Release interval time: Set a reasonable interval between batches. An overly long interval may cause data inconsistency in downstream applications.
Release method: Choose a suitable release dimension, such as by user, region, or channel, to avoid an inconsistent user experience during the release.
Release batches: Start with a small-scale grayscale verification and gradually expand the release scope.
Release observation metrics: Clearly define observable metrics during the grayscale release to judge the release result and avoid chain reactions.
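The sketch below strings these points together into a minimal batched rollout loop: release to a small percentage, observe metrics for an interval, and stop if the metrics degrade. The batch plan, observation window, and health check are placeholders to be replaced with the real deployment and monitoring systems.

```python
import time

# Hypothetical batch plan: percentage of targets released in each step.
BATCHES = [1, 5, 20, 50, 100]
OBSERVE_SECONDS = 600  # interval between batches; tune to your business

def metrics_look_healthy() -> bool:
    """Placeholder for real checks, e.g., order success rate and error count
    pulled from the monitoring system."""
    return True

def release_batch(percent: int) -> None:
    """Placeholder for the actual deployment call (by user, region, channel...)."""
    print(f"releasing to {percent}% of targets")

def grayscale_release() -> bool:
    for percent in BATCHES:
        release_batch(percent)
        time.sleep(OBSERVE_SECONDS)          # wait, then check the metrics
        if not metrics_look_healthy():
            print("metrics degraded, stop the release")
            return False
    return True
```

If any batch fails the check, the release halts and the rollback path described in the next subsection takes over.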
Rollbackability
In theory, rollback is the most appropriate and effective way to recover from a change failure in an emergency.
When a problem occurs, the top priority is to keep the business running. Other solutions may exist in practice, but their consequences are often unpredictable; in most cases, rollback remains the safest option.
During change release, it is recommended to make frequent, small, reversible changes. Large version gaps may block a rollback because of system dependency issues.
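A minimal sketch of keeping rollback possible is shown below: every release is recorded so that the previous version is always known and can be restored in one step. The version numbers are hypothetical; a real pipeline would also roll back configuration and, where necessary, data schema changes.

```python
from typing import Optional

def deploy(version: str, history: list) -> None:
    """Record each released version so the previous one is always known."""
    history.append(version)
    print(f"deployed {version}")

def rollback(history: list) -> Optional[str]:
    """Revert to the last known-good version, if one exists."""
    if len(history) < 2:
        return None          # nothing to roll back to
    history.pop()            # drop the failed release
    previous = history[-1]
    print(f"rolled back to {previous}")
    return previous

if __name__ == "__main__":
    versions = []
    deploy("v1.4.2", versions)
    deploy("v1.5.0", versions)   # assume this release misbehaves
    rollback(versions)           # -> back to v1.4.2
```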
Observability
During the change process, the existing environment as well as upstream and downstream businesses may be affected. Making the business, call links, and resources observable allows issues to be identified as early as possible. During observation, pay attention to business metrics (e.g., order success rate) and application metrics (e.g., CPU, load, exception count). When there are many metrics, focus on high-priority business metrics first: business metrics most intuitively reflect the current system status, and when they change, application metrics usually change correspondingly.
A checklist of the monitoring metrics to inspect should be prepared before the change. During the change, these metrics should be observed continuously to determine whether there are negative impacts or problems. After the change is completed, the business metrics should be compared before and after; only when the metrics show no problems can the change be considered complete.
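The before-and-after comparison can be as simple as the sketch below, which flags any metric that drops by more than a tolerance after the change. The metric names and values are illustrative; in practice they would be sampled from the monitoring system.

```python
def compare_metrics(before: dict, after: dict, tolerance: float = 0.05) -> list:
    """Flag metrics that dropped by more than `tolerance` after the change.

    `before` and `after` map metric names to values sampled from the
    monitoring system; the figures below are illustrative.
    """
    regressions = []
    for name, baseline in before.items():
        current = after.get(name, 0.0)
        if baseline > 0 and (baseline - current) / baseline > tolerance:
            regressions.append(name)
    return regressions

if __name__ == "__main__":
    before = {"order_success_rate": 0.998, "qps": 3200}
    after = {"order_success_rate": 0.92, "qps": 3150}
    print(compare_metrics(before, after))   # -> ['order_success_rate']
```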
Best Practices on Alibaba Cloud
Alibaba Cloud's Cloud Excellence (BizDevOps) services make it easier to achieve application grayscale release and rollback during the release process, keeping release changes controllable. During the change process, the affected resources, traces, and business conditions need to be monitored. Alibaba Cloud provides a rich set of monitoring products, such as CloudMonitor, Application Real-Time Monitoring Service (ARMS), and Log Service (SLS).
Emergency Response Mechanism
The key to an emergency response mechanism is the standard operating procedures and actions taken after an event occurs. Over the past decade or so, Alibaba Cloud has developed an emergency response mechanism called the 1-5-10 mechanism: discover the fault within 1 minute, organize the relevant personnel for a preliminary investigation within 5 minutes, and carry out the recovery process within 10 minutes. This can serve as a reference for enterprises designing their own emergency response mechanism, which should ensure that all relevant parties are clear about their roles and responsibilities when an event occurs.
Failure Detection
Once a failure occurs, the sooner it is detected, the sooner the response can begin. The following methods help achieve prompt failure detection (a minimal alerting sketch follows this list):
Unified alarm. After a failure is identified, the relevant personnel, including system administrators and operations and maintenance staff, must be notified promptly. Alarms can be sent through SMS, email, DingTalk, and other channels to ensure that everyone concerned learns of the failure as soon as possible and can respond quickly.
Monitoring dashboard. The monitoring dashboard displays the running status of all systems graphically on a screen so that system health can be monitored in real time. When a failure occurs, the dashboard surfaces the error and provides relevant data as a basis for troubleshooting and handling.
Risk prediction. Risk prediction refers to predicting system risks through data analysis, machine learning, and other methods before faults occur. Risk prediction can be an important reference in emergency response to faults, helping to quickly identify the root causes of problems and improving the efficiency and accuracy of fault handling.
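As a minimal illustration of unified alarming, the sketch below fires a webhook notification as soon as a sampled error rate crosses a threshold. The webhook address and threshold are assumptions; a real setup would fan out to SMS, email, DingTalk, and on-call tooling through the same entry point.

```python
import json
import urllib.request

# Hypothetical webhook address; SMS, email, and DingTalk gateways would be
# wired in the same way behind a single "unified alarm" entry point.
WEBHOOK_URL = "https://example.com/alert-webhook"

def send_alert(message: str) -> None:
    """Push one alert to every configured channel (only a webhook here)."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=3)

def check_error_rate(error_rate: float, threshold: float = 0.01) -> None:
    """Fire an alarm as soon as the sampled error rate crosses the threshold."""
    if error_rate > threshold:
        send_alert(f"error rate {error_rate:.2%} exceeded {threshold:.2%}")
```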
Failure Response
After a fault is detected, it is necessary to quickly locate the problem. Here are some common practices.
Coordination. After a fault occurs, it is necessary to quickly organize relevant personnel for emergency response. Required coordination includes setting up a command center, determining the emergency response process, assigning tasks, etc. The purpose of these steps is to improve the efficiency and accuracy of emergency response, so that everyone knows their roles and responsibilities, avoiding confusion and misoperations.
Alarm correlation analysis. When a fault occurs, the system automatically generates alarm information. To locate the cause more precisely, the various alarms need to be correlated; this helps quickly determine the scope and impact of the fault and find the root cause. Various tools and algorithms exist for alarm correlation analysis, such as event correlation analysis and machine learning (a simple grouping sketch follows this list).
Knowledge graph. A knowledge graph organizes and associates various data and knowledge into a unified database or graph, so that problems can be located and solved quickly when faults occur. In emergency response, the knowledge graph can guide fault troubleshooting and handling and improve efficiency and accuracy. Various tools and technologies support knowledge graphs, such as natural language processing and graph databases.
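A very small alarm correlation example is sketched below: alarms that share a resource and fall into the same time window are grouped together, which is often where the root cause hides. The alarm format and window size are assumptions for the demo.

```python
from collections import defaultdict

def correlate(alarms: list, window_seconds: int = 60) -> dict:
    """Group alarms that share a resource and fall into the same time window.

    Each alarm is a dict with 'timestamp' (epoch seconds), 'resource', and
    'message'; the grouping key approximates "same component, same moment".
    """
    groups = defaultdict(list)
    for alarm in alarms:
        bucket = alarm["timestamp"] // window_seconds
        groups[(alarm["resource"], bucket)].append(alarm["message"])
    return dict(groups)

if __name__ == "__main__":
    alarms = [
        {"timestamp": 1000, "resource": "db-01", "message": "connection pool full"},
        {"timestamp": 1015, "resource": "db-01", "message": "slow query spike"},
        {"timestamp": 1500, "resource": "web-03", "message": "CPU > 90%"},
    ]
    for key, messages in correlate(alarms).items():
        print(key, messages)
```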
Failure Recovery
After the cause of the fault is identified, the next step in the emergency plan is to recover the business quickly. Afterwards, a fault review should be conducted.
Plan execution. During fault response, actions should follow the previously formulated emergency plan, which covers the emergency response process, the responsibilities of each role, the handling procedure, and so on. Executing the plan helps keep the fault handling process standardized.
Fault self-healing. Fault self-healing means that the system detects faults and takes recovery measures automatically, which enables faster and more accurate fault handling and recovery. For example, with container technology, the system can automatically migrate workloads away from a failed node (a minimal self-healing loop is sketched after this list).
Fault review. A fault review analyzes and summarizes the incident to better prevent recurrence. During the review, the causes, impact, and handling process of the fault need to be recorded and analyzed in detail, and corresponding improvement measures need to be formulated. A fault review is also a self-learning and self-improving process for both the system and the team.
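The self-healing idea in the list above can be reduced to the loop sketched below: probe the service, and trigger a recovery action automatically when the probe fails. The probe and recovery functions are placeholders; on a container platform they would map to health checks and workload rescheduling.

```python
import time

def is_healthy(service: str) -> bool:
    """Placeholder health probe; a real check would hit the service's health
    endpoint or ask the container platform for workload status."""
    return True  # assume healthy so the demo loop stays quiet

def recover(service: str) -> None:
    """Placeholder recovery action, e.g., restarting the container or
    rescheduling the workload onto a healthy node."""
    print(f"recovering {service}")

def self_healing_loop(service: str, interval: float = 10.0, rounds: int = 3) -> None:
    """Detect faults and trigger recovery automatically, with no human in the loop."""
    for _ in range(rounds):
        if not is_healthy(service):
            recover(service)
        time.sleep(interval)
```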
Fault Drill Normalization
Fault drills provide a comprehensive testing concept and tool framework for end-to-end testing. In essence, they expose weaknesses in software quality by actively injecting faults. The practices recommended in this chapter, such as discovering system risks in advance, improving test quality, refining contingency plans, strengthening monitoring and alarms, and improving the efficiency of fault recovery, all serve the goals of effective prevention, timely response, and regression verification. By building on the faults themselves, fault drills help create a resilient distributed system, continuously improve software quality, and strengthen the team's confidence in software production and operation. Fault drills can be divided into disaster recovery drills for plan verification, red-blue attack and defense drills for stability acceptance, and emergency response drills for fault verification.
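A minimal fault-injection sketch is shown below: a decorator adds artificial latency and random failures to a function during a drill. The injection rates are illustrative; a real drill platform (such as a chaos engineering tool) would control the blast radius, duration, and target hosts.

```python
import functools
import random
import time

def inject_fault(error_rate: float = 0.2, extra_latency: float = 0.5):
    """Wrap a function so that, during a drill, some calls fail or slow down."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(extra_latency)                 # simulated latency
            if random.random() < error_rate:          # simulated failure
                raise RuntimeError("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(error_rate=0.3, extra_latency=0.1)
def place_order(order_id: str) -> str:
    return f"order {order_id} accepted"

if __name__ == "__main__":
    try:
        print(place_order("A1001"))
    except RuntimeError as exc:
        print("drill surfaced a failure:", exc)
```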
Disaster Recovery Drills
Disaster recovery drills simulate instances, data centers, or regional-level faults to identify system-level disaster recovery capability and response capability in the face of disasters. Disaster recovery drills can help enterprises better verify Recovery Point Objective (RPO) and Recovery Time Objective (RTO) indicators, timely discover and solve related problems, and improve the availability and reliability of systems. Alibaba Cloud Application High Availability Service (AHAS) helps enterprise users better conduct fault drills for applications and enhance application stability.
Red Team Exercises
Red team exercises originate from military training, referring to a type of comprehensive combat training. The exercise is usually divided into a red team on the defending side and a blue team on the attacking side. Red team exercises are often used for security drills, but here we apply them to stability drills. In a stability scenario, the blue team discovers vulnerabilities from a third-party perspective and injects faults into various software and hardware components, continuously verifying the reliability of the business system, while the red team responds according to the predefined fault response and emergency processes. After the exercise, it is recommended to review the three stages of the process (discovery, response, and recovery) and summarize improvement actions, thereby improving the stability of the business system.
Emergency Response Drills
An emergency response drill is an exercise in which the blue team's means and targets are kept opaque to the red team. Through emergency response drills, the technical team can comprehensively test its emergency response and recovery capabilities in the face of unexpected faults and improve internal personnel's security awareness. In emergency response drills, the red and blue teams are in full confrontation, which places higher requirements on both sides: the blue team needs to understand not only the weaknesses of the target system but also its business, while the red team needs not only to fix faults but also to detect them quickly and respond effectively. Compared with planned drills, the personnel, scenarios, and processes involved in emergency drills are more complex. It is also necessary to keep the drill plan confidential and to fully evaluate how the impact will be controlled in case the red team fails to respond to faults in time.