System Construction of Cloud Native Digital Safety Production
Within Alibaba Group, more than ten years of exploration have yielded a series of products and services, as well as a methodology for building safety production. We summarize our guiding principle for meeting the safety production challenges of the business as "high availability and stability come first."
What is digital safety production
When we talk about safety production, you may first think of its traditional form: the slogans, posters, and related notices often seen in factories, workshops, coal mines, or construction sites. Traditional safety production refers to taking accident prevention and control measures so that production and operation activities do not cause personal injury or property loss.
The digital safety production discussed here is tied to the digital transformation and upgrading of the business, and mainly addresses enterprise business continuity management. When an unexpected accident or disaster occurs, the enterprise should protect its important business activities at reasonable cost and with reasonable resources, ensure that operations resume within the specified time, and minimize the impact of the disaster and of the interruption.
Digital safety production has the following special requirements:
• Digitally enabled safety production. As the business moves from offline to online, the touchpoints of the whole business life cycle are digitized. The focus of safety production shifts from offline to online accordingly, and the safety production work itself also needs digital empowerment.
• Safety production supported by cloud native. Digital transformation brings an architecture upgrade: all systems run on the cloud and are designed with cloud native and microservice architectures. The safety production platform needs to be upgraded in step, so that it connects seamlessly with cloud native product capabilities and can extend with future architectures.
• Safety production based on best practices. A safety production system has to be tested in practice. Within Alibaba Group, a team of more than 100 people has been working on safety production and has accumulated a set of best practices that are still evolving and that we believe apply well across industries.
Contents of digital safety production construction
Based on the above, the core work of safety production is divided into three parts: construction before, during, and after the event.
• Before the event: we need the organizational structure, the rules and processes, and the system architecture in place in advance, together with capacity (water level) monitoring and fault monitoring for the relevant systems, and protection, traffic switching, and change control capabilities that match the SLA.
• During the event: we need agile, rapid collaboration so that faults are found, located, and recovered quickly. At Alibaba, for example, "Double 11" or a major failure usually requires hundreds or even thousands of people to work together. This requires, first, a unified mechanism to keep everyone consistent and the full-link monitoring (observability) capability mentioned by the previous speaker to ensure rapid discovery. It also requires system support to coordinate the event handling process automatically, trace and topology capabilities for rapid fault location, and protection capabilities together with unit-based disaster tolerance and multi-active deployment for truly rapid recovery.
• After the event: we need to review, identify the root cause, and define actions. After each emergency response is completed, we hold a review, classify the fault and assign responsibility, and produce system improvement items so that the whole architecture keeps improving iteratively. For managers, we analyze the cause of the failure, the efficiency of team collaboration during handling, and the stability statistics of teams and products, so that the whole safety production management system is measurable, assessable, and manageable. Finally, visualization lets us manage business safety production globally and by metrics.
Alibaba Group Best Practices
Alibaba Group Global Operation Command Center
First, at the organizational level, we have the Global Operation Command Center (GOC). More than 60 business units (BUs) within the group route all their safety-production-related work to GOC for unified, coordinated handling.
Next comes monitoring (observability), which we just discussed and which is a very important link. We gather all observability data and manual feedback (such as feedback collected by Taobao customer service and Alibaba Cloud customer service) into a unified event center and manage it on a systematic platform.
Finally, all fault emergencies converge on the command centers in the three places on both sides of the Taiwan Straits. The on-duty emergency staff use emergency coordination, fault location, and quick recovery tools to respond to and recover from faults, carry out post-incident review and improvement, and manage the safety production risk events of the whole group through mechanism operation and other strategies.
Overview of the safety production system
Safety production is a complete system. The group's system is relatively large, so with the help of this structure chart we give a general introduction by dividing the whole work into small modules.
First, it is supported by the technical capabilities of the platform. As introduced above, safety production involves many people in different roles and observability data from different business systems. Managing it requires capabilities for stress testing, emergency coordination, drills, fault location, traffic switching, and recovery. Our group has a corresponding platform that provides this support.
On this platform, systems for each domain are built, including fault management, multi-active deployment, full-link stress testing, and change control, all supported in a unified way. Building the safety production platform is itself a digital transformation of safety production work.
Above the platform are the related management rules, data operations, and technical culture. In the early days of our safety production work, our strongest impression was that nothing could be measured: after a failure, we could not determine where the problem was or who was responsible. By establishing fault level definitions, fault classification, a stability classification mechanism, and operation activities, we made safety production work measurable and assessable.
The platform and the rules then need to be validated through drills and standardized acceptance, to ensure that these systems and product capabilities are actually put into practice and take effect.
Core elements of safety production
The core of safety production consists of three parts. The first is the organizational structure. We believe that safety production is a top-level project for an enterprise, and that it needs a unified, top-down organization able to coordinate all safety production capabilities.
Within the group we have such a vertical organizational structure. The command center is a department at the same level as each business BU, with corresponding professional roles supporting each BU. Horizontally, there are roles such as the safety production shift leader and the arbitration committee, which ensure that the system is carried out effectively.
The second part is mechanisms and processes. Over more than ten years, the group has accumulated many rules and processes:
• Unified fault level definition across the whole group: provides a quantitative standard for resource scheduling and decision-making during an emergency;
• Standardized emergency process: keeps event handling fast and orderly;
• Fault score and stability score assessment criteria: measure the results of safety production in a uniform way;
• Fault classification, responsibility determination, and dispute negotiation mechanisms: long-term mechanisms that sustain safety production.
The last part is tools. The group's rules and processes do not exist only on paper or on the wall. Every mechanism and process is backed by a corresponding system platform, and with system capabilities such as robots and NLP technology, these mechanisms are carried into each day's actual work and each step of execution.
Fault level definition
The fault level definition is the operating basis of the safety production system. We uniformly define service interruption, service quality degradation, and experience degradation in the production environment as faults, regardless of the cause. Note that this defines faults from a business perspective. The advantage is that faults can be detected before users notice them, and detection is more accurate than with traditional monitoring.
One level down there are many supporting platforms, such as middleware, databases, the cloud platform, the network, and servers. The indicators and fault definitions at this lower level are defined according to the characteristics of each platform's business, but the overall principle is still business impact. Viewed from top to bottom, the business of a lower-level system is usually a business dependency of the upper-level system.
Building on the fault level definition, many level series are used within the group in practice. Here are a few common ones: the P series is the general level definition, the D series covers data quality, the S series covers the degree of impact on important customers, the E series covers public opinion, and the I series covers infrastructure.
Each series usually has four levels: 4 represents a common fault and 1 a serious fault. The smaller the number, the higher the urgency and importance.
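To make the naming scheme concrete, here is a minimal Python sketch of how a level such as "P1" or "D4" could be represented. The class and function names (Series, FaultLevel, parse_level) are our own illustrative assumptions and not part of any Alibaba system; only the five series letters and the 1 to 4 range come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Series(Enum):
    """The five series named in the text; the descriptions are paraphrased."""
    P = "general"
    D = "data quality"
    S = "impact on important customers"
    E = "public opinion"
    I = "infrastructure"


@dataclass(frozen=True)
class FaultLevel:
    series: Series
    severity: int  # 1 = serious fault ... 4 = common fault

    def __post_init__(self) -> None:
        if not 1 <= self.severity <= 4:
            raise ValueError("severity must be between 1 and 4")

    @property
    def code(self) -> str:
        return f"{self.series.name}{self.severity}"

    def more_urgent_than(self, other: "FaultLevel") -> bool:
        # The smaller the number, the higher the urgency and importance.
        return self.severity < other.severity


def parse_level(code: str) -> FaultLevel:
    """Parse a code such as 'P1' (hypothetical helper)."""
    return FaultLevel(Series[code[0].upper()], int(code[1:]))
```

For example, parse_level("P1").more_urgent_than(parse_level("D4")) is true, and a code such as "P5" is rejected because only four levels exist per series.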
In practice, we first bring all businesses into the management scope and define fault levels for all of them. Every role, including development, testing, product, operation and maintenance, and dependent businesses, reviews the level definitions together so that agreement is reached in advance. Once the fault level definitions are officially released, everyone invests in and staffs back-end support according to these levels. When a fault occurs, the emergency process that matches its level is launched and the corresponding resources are coordinated to join the response.
The fault level of each business scenario is determined by business importance, impact scope, and duration. The resulting definition should be structured and measurable, and should be tied to full-link observability so that faults are detected automatically.
Once a fault occurs, rules defined on observable indicators are used to calculate the fault level automatically. When the fault criteria are met, a robot automatically sends the fault notification. At the same time, the full-link topology and trace capabilities provide preliminary help with fault location.
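The idea of a structured, measurable level definition can be sketched as a small rule table. The following Python example is only an illustration under assumptions: the indicator (drop of a business metric against its baseline), the thresholds, and the names LevelRule, RULES, and classify are hypothetical, since the text does not give the real rules.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LevelRule:
    level: str           # e.g. "P1"
    min_drop_pct: float  # minimum drop of the business metric versus baseline
    min_minutes: float   # minimum duration of that drop


# Hypothetical thresholds, ordered from most to least serious.
RULES = [
    LevelRule("P1", 50.0, 10),
    LevelRule("P2", 30.0, 10),
    LevelRule("P3", 10.0, 5),
    LevelRule("P4", 5.0, 5),
]


def drop_pct(baseline: float, current: float) -> float:
    """Percentage by which the metric has fallen below its baseline."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline * 100.0)


def classify(baseline: float, current: float, minutes: float,
             rules: Sequence[LevelRule] = RULES) -> Optional[str]:
    """Return the most serious level whose rule matches, or None if no fault."""
    drop = drop_pct(baseline, current)
    for rule in rules:
        if drop >= rule.min_drop_pct and minutes >= rule.min_minutes:
            return rule.level
    return None
```

With these assumed thresholds, a metric falling from 1000 to 400 for 12 minutes is classified as "P1", while the same drop lasting only 6 minutes matches the "P3" rule. A notification robot would be triggered once classify returns a level.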
With fault levels defined, we can accurately identify business fault risks and find and handle them in time. How, then, do we measure the efficiency of fault handling? This brings us to one of the core mechanisms of digital safety production: the 1-5-10 mechanism.
Within the group, we set an assessment goal for every fault: business faults must be found and announced within 1 minute, the relevant personnel must respond and make an initial diagnosis within 5 minutes, and the fault must be recovered within 10 minutes. Starting from this core mechanism, we then break it down to the next level to build the whole safety production system.
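As an illustration of how the 1-5-10 goal could be checked against a recorded fault timeline, here is a short Python sketch. It assumes that all three targets are measured from the moment the fault starts, which the text does not state explicitly; FaultTimeline and check_1_5_10 are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets taken from the 1-5-10 goal; measuring all of them from the
# fault start time is an assumption of this sketch.
TARGETS = {
    "discover": timedelta(minutes=1),
    "locate": timedelta(minutes=5),
    "recover": timedelta(minutes=10),
}


@dataclass(frozen=True)
class FaultTimeline:
    started_at: datetime
    announced_at: datetime
    located_at: datetime
    recovered_at: datetime


def check_1_5_10(timeline: FaultTimeline) -> dict:
    """Return, per stage, whether the stage met its target."""
    elapsed = {
        "discover": timeline.announced_at - timeline.started_at,
        "locate": timeline.located_at - timeline.started_at,
        "recover": timeline.recovered_at - timeline.started_at,
    }
    return {stage: elapsed[stage] <= TARGETS[stage] for stage in TARGETS}
```

A fault announced after 50 seconds, located after 4 minutes, and recovered after 12 minutes would pass the first two stages and miss the recovery target, which is the kind of result that can feed team-level statistics.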
1-5-10 Strategy decomposition
1-5-10 focuses on three links: discovery, location, and quick recovery. These span architecture, development, and operation and maintenance, and each has its own business rules, mechanisms, and systems that need to be built.
For example, the "1" part mainly concerns full-link observability, including the intelligent baselines and full-link monitoring that usually receive the most attention.
For the second part, the 5-minute response and location, announcements are usually sent through mobile channels, including text messages, phone calls, and DingTalk. Collaboration then relies on tools: we collaborate through the DingTalk robot and use NLP robot technology for check-in and emergency process interaction, achieving ChatOps.
For fault location, we need capabilities such as the observability system, the contingency plan system, and change control. Typically, when a fault occurs, the platform first sends a fault notification, followed by information about relevant changes made before the fault. The system then pushes the contingency plans for the scenario, and the emergency personnel use the observability capabilities to help locate the fault.
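The step of attaching recent changes to a fault notification can be sketched as a simple time-window filter. This Python example is an assumption-laden illustration: ChangeRecord, changes_before_fault, and the 30-minute lookback window are hypothetical and do not describe the real change control system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ChangeRecord:
    app: str
    description: str
    finished_at: datetime


def changes_before_fault(changes: Iterable[ChangeRecord],
                         fault_time: datetime,
                         affected_apps: Set[str],
                         lookback: timedelta = timedelta(minutes=30)) -> List[ChangeRecord]:
    """Return changes to the affected apps that finished shortly before the fault.

    The most recent change comes first, since it is usually the first suspect.
    """
    window_start = fault_time - lookback
    hits = [
        change for change in changes
        if change.app in affected_apps
        and window_start <= change.finished_at <= fault_time
    ]
    return sorted(hits, key=lambda change: change.finished_at, reverse=True)
```

In practice the affected application set would come from the topology and trace data, and the returned list would be pushed to the emergency group together with the matching contingency plans.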
For the 10-minute quick recovery, our strongest measure is to switch traffic away from a unit. As long as the system can determine that the impact of the fault and the estimated recovery time are unacceptable, we can switch traffic at the unit level using the unit-based multi-active capability, and diagnose after recovery. Small-scale faults can also be recovered quickly and locally through the contingency plan system.
The last part is building the operation mechanisms and accepting them through drills. Operation mechanisms are a very important part of safety production and keep the related capabilities iterating. Drills simulate faults through fault injection and test the systems and processes in real time in the online environment.
One major pain point of safety production is that it cannot be measured. Usually we do not know which product is stable, which team is doing well, or where to improve next. On top of the product and technology system described above, we designed a number of operation standards.
• Fault score: after each fault, the system automatically calculates a score; the basic inputs are the impact scope, the duration, and a configured weight. It is a result indicator that measures product quality and emergency efficiency. Through continuous operation, we can set a fault quota for each team and then set future safety production goals.
• Stability score: composed of 14 indicators covering engineering design, architecture, and operation and maintenance. It looks at every business development team, including review coverage in the design phase, observability coverage in the operation phase, grayscale capability in the release process, and the completion rate of post-incident action items, and the evaluation indicators are generated by the system. The stability score is a process indicator that assesses the investment in safety production.
The fault score and the stability score are the two core indicators and are important criteria for judging whether a team is qualified in safety production. There are also mechanisms such as business availability, circuit breaking, and change control. These mechanisms run on their respective system platforms and are managed automatically.
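The fault score described above combines impact scope, duration, and a weight, but the actual formula is not given. The Python sketch below therefore uses a simple product as a stand-in; the formula, the function names, and the quota handling are all assumptions made for illustration.

```python
from typing import Iterable


def fault_score(impact_ratio: float, duration_minutes: float, weight: float = 1.0) -> float:
    """Hypothetical fault score: weight x impacted share x duration.

    impact_ratio is the share of users or requests affected, from 0.0 to 1.0.
    The real scoring logic is not described in the text.
    """
    if not 0.0 <= impact_ratio <= 1.0:
        raise ValueError("impact_ratio must be between 0.0 and 1.0")
    if duration_minutes < 0 or weight < 0:
        raise ValueError("duration_minutes and weight must be non-negative")
    return weight * impact_ratio * duration_minutes


def remaining_quota(team_quota: float, scores: Iterable[float]) -> float:
    """How much of a team's fault quota is left after the recorded faults."""
    return team_quota - sum(scores)
```

Under this assumed formula, a fault that affects half of the traffic for 10 minutes on a business with weight 2.0 scores 10.0, and a team with a quota of 100 that has already recorded scores of 10.0 and 25.0 has 65.0 left.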
For the emergency process, all events are collected into GOC. Events arrive in two main ways. One is user-side feedback, a manual channel connected through intelligent customer service. The other is observability alarms: we have connected dozens of monitoring systems from the group's business BUs. The large volume of incoming alarm data goes through convergence, suppression, and intelligent algorithms, combined with robot processing and filtering in the background, and is finally consolidated on a unified platform where the fault level is determined. Emergency events and faults are coordinated in DingTalk groups, while ordinary non-emergency events go through work orders, with coordination between the systems. The handling process is captured in the knowledge base, and data for the whole process is displayed on a large screen for unified visualization.
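Alarm convergence can be illustrated with a very small example: alarms for the same application and rule that arrive within a short window are merged into one event. This Python sketch is a simplified assumption of what convergence might look like; the Alarm fields, the 60-second window, and the grouping key are hypothetical, and the intelligent algorithms mentioned above are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alarm:
    source: str       # name of the monitoring system that raised the alarm
    app: str
    rule: str
    timestamp: float  # seconds since an arbitrary epoch


def converge(alarms: Iterable[Alarm], window_seconds: float = 60.0) -> List[dict]:
    """Merge alarms with the same (app, rule) that fall within one window."""
    events: List[dict] = []
    open_event: Dict[Tuple[str, str], int] = {}  # key -> index of latest event
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.app, alarm.rule)
        index = open_event.get(key)
        if index is not None and alarm.timestamp - events[index]["first_seen"] <= window_seconds:
            # Suppress the duplicate: count it instead of raising a new event.
            events[index]["count"] += 1
            events[index]["sources"].add(alarm.source)
        else:
            open_event[key] = len(events)
            events.append({
                "app": alarm.app,
                "rule": alarm.rule,
                "first_seen": alarm.timestamp,
                "count": 1,
                "sources": {alarm.source},
            })
    return events
```

Each resulting event keeps a count and the set of source systems, so that dozens of raw alarms can reach the on-duty staff as one item while the underlying evidence is preserved.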
The handling process is completed entirely in the DingTalk group. After a fault is announced, the relevant personnel check in to the group, and the emergency process is presented uniformly there. For a major failure, we escalate to the senior management group and bring in more people.
In addition to the product capabilities and organizational structures just mentioned, mechanism operation is a very important part of safety production. We run many operation activities and awards, so that teams with excellent performance share their experience and teams with poor performance review and improve, which sustains safety production over the long term.
Digital safety production system construction scheme
Overview of digital safety production
If an enterprise wants to build safety production, the core work is divided into two parts: the technical system and the service system.
For the technical system, we need a unified platform. As a previous speaker mentioned, an enterprise has many monitoring systems, and every business has its own. The application layer, the middle layer, the database, the cloud platform, and the network each have their own systems as well. Built in such a decentralized way, it is difficult to form a unified emergency command center. Our suggestion is to build a unified platform that provides the operational capabilities for safety production and integrates the capabilities of all these systems into a unified command center.
For the service system, we should ensure that the mechanisms, culture, and organizational structure effectively support the implementation of safety production.
Digital safety production platform
For the digital safety production platform, we designed a framework that integrates existing capabilities by assembling domains such as observability, contingency plans, work orders, and event management, and abstracts them into a single platform that manages people and events throughout their life cycle. On top of the platform we form the corresponding business domains, which support our business scenarios and serve the businesses at the upper level. This is the overall architecture of the unified safety production platform.
Digital safety production system construction - full life cycle service design
When making long-term plans, an enterprise needs to design the product architecture and business architecture for safety production clearly, with corresponding business management thinking.
During the event
During operation, we should build system capabilities such as drills, stress testing, rate limiting, and multi-active deployment, so that risks can be assessed and prevented effectively.
Here is a case that will be familiar from this period: epidemic prevention and control systems, such as health codes, place codes, and nucleic acid testing. Early on, we stress test the whole system to evaluate its online capacity and water level.
Suppose the evaluation shows that the online production system has a capacity of 10,000 QPS. We prepare system resources for this traffic. For peaks that exceed 10,000 QPS, we configure traffic protection so that the system does not collapse in extreme cases.
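One common way to implement this kind of traffic protection is a token bucket rate limiter. The Python sketch below is a generic illustration and not the flow control product used in the case; the TokenBucket class and its parameters are hypothetical, and the 10,000 figure is taken from the example above.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit at most `rate_per_second` requests on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = float(rate_per_second)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill tokens for the time that has passed, capped at the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A limiter sized to the capacity measured by the stress test.
limiter = TokenBucket(rate_per_second=10_000, capacity=10_000)
```

Requests for which allow() returns False would be rejected quickly or degraded, which protects the requests that are admitted. This sketch is single-threaded; a production limiter would also need locking or a distributed counter.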
One level further, if the system has higher SLA requirements, we also need to build active-active capability, which is the strongest measure in safety production. When the business system collapses in an extreme case, a corresponding active-active site must be able to take over the business. All of these capabilities need to be managed on a unified platform.
The last part is improvement after the event. It covers a wide range, such as improving emergency coordination, the product architecture, and the management mechanisms as a whole. Whether improvements are completed on time and whether they work as intended also forms a very important closed loop, and this needs platform support.
Digital safety production - holographic observable platform
For platform capability building, we take each important link in turn. The first is observability, which is the Acos full-link monitoring we just talked about. One point to add is that observability does not have to depend on a single platform; it needs to integrate all the monitoring capabilities already at the business site. Acos is compatible with ARMS application monitoring on proprietary cloud and strengthens access for business, log, and heterogeneous monitoring systems. With improved visualization, it supports rapid fault location.
Digital safety production - full-link stress testing
The second part is full-link stress testing. One of the most important tasks in safety production is understanding the processing capacity of the platform: we need to know the system's water level and its maximum load. Then, when real peak traffic arrives, we know what to expect and can handle it calmly.
Full-link stress testing is a crucial step in the group's promotion activities. Each test is carried out on the production system, so the data it produces is real and the bottlenecks and system problems it exposes are the same as those that occur online. This lets us find bottlenecks accurately and raise the overall load capacity of the business system.
Digital safety production - "1-5-10" emergency coordination
This part introduces 1-5-10 emergency coordination. When building the safety production system in practice, we first integrate manually reported events with alarms, and then define fault levels from the business point of view, so that faults are found within one minute and announced accurately and in time.
During the emergency process, horizontal support capabilities are needed, including connecting resources, collaboration across teams and vendors, and embedded DevOps and ChatOps capabilities, so that the system can automatically find the responsible contact and assist rapid fault location.
In the initial stage of construction, we mainly integrate the capabilities an enterprise already has. We do have mature solutions for every part, but that does not mean everything must be replaced and started over; enterprises generally need to build on their current situation. The quick recovery part includes rules and capabilities such as contingency plans, disaster recovery, and active-active deployment. 1-5-10 is the part of safety production construction that is easiest to put in place and quickest to show results.
Digital safety production - flow protection
I have already touched on traffic protection above.
In a real environment, we need to prepare additional resources to cope with sudden traffic peaks. This is a compromise between cost and efficiency. When building a business, we estimate its peak. We cannot prepare unlimited computing and storage resources for that peak, but when the peak arrives the system must not go down.
Therefore, the system needs rate limiting to keep services available in extreme cases and to buy time for operations staff to scale out. In general, combined with the full-link stress testing just introduced, traffic protection and the elasticity of cloud native containerization help traffic peaks pass smoothly and maximize the stability of the overall business.
Digital safety production - disaster tolerance and multi-active solution
Disaster tolerance and multi-active deployment is the ultimate strategy for safety production. When a large-scale failure occurs, relying only on locating and recovering each fault individually may not meet the SLA of important systems. In that case, we need business-level disaster tolerance and multi-active deployment.
In our high-level disaster recovery plan, the overall architecture is unit-based, that is, a business-level multi-active plan across regions. Many enterprises, however, start with disaster recovery: through database synchronization and storage replication, each application manages its own part to obtain disaster recovery capability.
The disaster-tolerant multi-active system has a central management and control platform that coordinates the traffic access layer, middleware, and databases. It can be understood as a large traffic scheduling system: after a business failure, it automatically schedules the traffic of a single application and switches the related business traffic. The platform completes the switching process automatically, with no manual adjustment of the application.
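The traffic scheduling idea can be sketched as a router that maps each user to a home unit and moves the user to a healthy unit when the home unit is marked down. This Python example is a toy model under assumptions: UnitRouter, the CRC32-based sharding, and the unit names are hypothetical, and the coordination of middleware and database synchronization that a real multi-active platform performs is not modeled.

```python
import zlib
from typing import Iterable, List, Set


class UnitRouter:
    """Route users to units and fail over when a unit is marked down."""

    def __init__(self, units: Iterable[str]) -> None:
        self.units: List[str] = list(units)
        if not self.units:
            raise ValueError("at least one unit is required")
        self.healthy: Set[str] = set(self.units)

    def mark_down(self, unit: str) -> None:
        """Switch traffic away from a unit."""
        self.healthy.discard(unit)

    def mark_up(self, unit: str) -> None:
        """Switch traffic back once the unit has recovered."""
        if unit in self.units:
            self.healthy.add(unit)

    def route(self, user_id: str) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy unit available")
        shard = zlib.crc32(user_id.encode("utf-8"))
        home = self.units[shard % len(self.units)]
        if home in self.healthy:
            return home
        # Home unit is down: spread its users over the remaining healthy units.
        candidates = [unit for unit in self.units if unit in self.healthy]
        return candidates[shard % len(candidates)]
```

Because routing is deterministic, a user returns to the same home unit after mark_up, and the application itself does not need to change when traffic is switched.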
Multi-active construction has clear advantages in resource utilization, switching success rate, and degree of automation, and is the ultimate goal of enterprise safety production construction.