Digital Safety Production Platform (DPS) Released

On November 5, at the 2022 Yunqi Conference in Hangzhou, the Digital Safety Production Platform (DPS) was launched to help enterprises move from traditional operations and maintenance to SRE (Site Reliability Engineering).

Alibaba Senior Technical Expert Zhou Yang

Under the 14th Five-Year Plan, industries across the board are accelerating their digital transformation and upgrading. As enterprise digital business grows in scale, iterates faster, and runs on increasingly complex systems, ensuring business stability has become ever more important. The following are typical scenarios and challenges:

Scenario 1: Distributed systems face new challenges in stability assurance

In recent years, despite growing attention to stability and the rapid development of new technologies, major failures still occur frequently and have a huge impact. For example, in 2021 a securities firm's IDC (Internet data center) was down for 2 hours, leaving customers unable to trade and causing asset losses, and a video website was inaccessible for 3 hours because of a server failure, which triggered a public backlash. Improper use of technology, human operating errors, hardware failures, natural disasters, and security attacks continue to pose serious risks to production.

Scenario 2: Policy guidance is steadily advancing IT system stability work

As digital transformation policies are rolled out, more and more applications with nationwide user bases have emerged, greatly facilitating people's daily lives, and enterprises of all kinds have launched their own client applications. However, most enterprises have not gone through the years of development that Internet companies have, and their ability to cope with online risks is insufficient. They urgently need to build up stable operations and maintenance capabilities in the shortest possible time while avoiding detours.

Scenario 3: Traditional operations and maintenance methods can no longer meet requirements

Traditional operations and maintenance suffers from fragmented tools, an orientation toward infrastructure rather than the business, a passive working mode, and the lack of a standardized system of processes and mechanisms. Enterprises should adopt the ideas of SRE (Site Reliability Engineering) and Platform Ops, and use software to manage systems, discover problems, solve problems, and automate operations and maintenance.

In real life, whether one is building a skyscraper or carrying out a household project, avoiding safety accidents and personal injury matters even more than ensuring the quality of the work. This calls for standardized processes, technical standards, and acceptance methods. The software industry likewise needs standardized technical capabilities and methodologies to ensure the stability of online business. For this reason, since 2018 Alibaba Group has been building up safe production for IT software: on the one hand, it has strengthened the infrastructure for high-availability architecture; on the other hand, it has provided the system of processes and mechanisms needed for the SRE transformation, forming a complete safe production methodology built around the goals of availability, organization, and disaster recovery.

The Digital Safety Production Platform (DPS) was created for this purpose. DPS distills Alibaba's ten years of operations and maintenance experience. It is a one-stop SRE operations and maintenance control platform that takes Platform Ops as its concept and business continuity as its goal. It has three typical features: it is scenario-based, digital, and cloud native.

• Scenario-based: DPS focuses on emergency scenarios and loosens the operations and maintenance constraints imposed by the organizational structure. At the same time, its comprehensive monitoring and alarm rule configuration can support the full range of business scenarios.

• Digital: DPS provides large-screen monitoring dashboards, intelligent alarms, intelligent fault location, GUI-based ("white-screen") quick fault recovery, digital measurement, personnel management, and other capabilities, contributing to the enterprise's digitalization process.

• Cloud native: DPS is built on Alibaba Cloud's rich set of cloud native products and is open enough to integrate with Alibaba Cloud's first-party systems, second-party systems, and open source systems.

Distilled from decades of Alibaba Group's Internet experience, the Digital Safety Production Platform (DPS) focuses on the following points in its platform architecture and evolution:

• Clear objectives and scenarios: safety production is a system-wide undertaking, and its overall strength is determined by its weakest link, just as a barrel holds only as much as its shortest stave allows. Safety production therefore needs clear objectives and scenarios, and the main framework must be complete.

• Connect the organizational structure: safety production must solve not only problems in systems and code, but also problems between people and between people and systems. Safe production therefore requires integrating and connecting the best technologies from Alibaba and the industry within one system.

• Future-oriented architecture: safety production focuses on reducing cost and loss, so it needs some resilience to technology cycles. The architecture should not only be compatible with the latest technology stack, but also be designed with future architectures in mind.

The Digital Safety Production Platform (DPS) supports two typical business scenarios: "1-5-10" fast fault recovery and "change three-board axe" fault prevention (the "three-board axe" being the three key safeguards applied to every change).

"1-5-10" fast fault recovery

The digital safety production platform provides full life cycle management for the discovery, response, and recovery of emergency events and faults. "1-5-10" stands for "1-minute discovery, 5-minute response, 10-minute recovery" of a fault, and sets the timeliness targets for fault handling.

• 1-minute discovery: full-link monitoring built around the business application tracks business health in real time. When a stability problem is found, it is reported within seconds to the emergency support team for troubleshooting, to reduce the likelihood of a failure.

• 5-minute response: emergency response channels and full-link fault location capabilities quickly bring in the right troubleshooting personnel. Intelligent fault location based on AIOps, together with fault status updates and notification flows based on ChatOps, improves the efficiency of fault handling.

• 10-minute recovery: a complete fault recovery system, backed by a rich set of built-in quick recovery capabilities, intelligently recommends suitable quick recovery plans for different fault types and shortens the time to recovery.
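The "1-5-10" targets can be checked against an incident timeline. The following is a minimal sketch, not part of DPS: the `Incident` record, the `check_1_5_10` function, and the sample timestamps are all hypothetical, and it assumes each target is measured from the previous milestone (discovery from fault onset, response from discovery, recovery from response), which the source does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

# Assumed interpretation: each target is measured from the previous milestone.
TARGETS = {
    "discovery": timedelta(minutes=1),
    "response": timedelta(minutes=5),
    "recovery": timedelta(minutes=10),
}


@dataclass
class Incident:
    """Hypothetical incident record; not a DPS data model."""
    started: datetime
    discovered: datetime
    responded: datetime
    recovered: datetime


def check_1_5_10(incident: Incident) -> Dict[str, bool]:
    """Return, per phase, whether the phase met its 1-5-10 target."""
    durations = {
        "discovery": incident.discovered - incident.started,
        "response": incident.responded - incident.discovered,
        "recovery": incident.recovered - incident.responded,
    }
    return {phase: durations[phase] <= TARGETS[phase] for phase in TARGETS}


# Illustrative timeline: found after 40 s, answered at minute 4, fixed at minute 16.
t0 = datetime(2022, 11, 5, 10, 0, 0)
incident = Incident(
    started=t0,
    discovered=t0 + timedelta(seconds=40),
    responded=t0 + timedelta(minutes=4),
    recovered=t0 + timedelta(minutes=16),
)
result = check_1_5_10(incident)
print(result)  # {'discovery': True, 'response': True, 'recovery': False}
```

In this example the recovery phase takes 12 minutes and misses its target, which is the kind of gap a post-incident review would examine.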

"Change three-board axe" fault prevention

The Digital Safety Production Platform (DPS) brings the change operations that are prone to causing online failures under the stability control system, so that every change is observable, can be released in grayscale (gradually), and can be rolled back.

In terms of making changes manageable, the platform covers a complete set of change systems, which greatly reduces the cost of adapting them. In terms of making changes controllable, it provides change control rules based on time, personnel, and other dimensions to prevent possible risks. In terms of keeping the business available during changes, it automatically discovers faults caused by changes and provides intelligent quick recovery capabilities such as change rollback.
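Change control rules based on time and personnel can be pictured as simple checks run before a change is allowed to proceed. The sketch below is illustrative only and does not reflect the DPS rule engine: `ChangeRequest`, `evaluate_change`, the blocked window, and the operator names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List

# Hypothetical rule parameters, chosen only for illustration.
BLOCKED_WINDOW = (time(9, 0), time(18, 0))  # assumed peak hours with no changes
AUTHORIZED_OPERATORS = {"alice", "bob"}     # assumed approved personnel


@dataclass
class ChangeRequest:
    """Hypothetical change request; not a DPS data model."""
    operator: str
    scheduled_at: datetime
    has_rollback_plan: bool


def evaluate_change(req: ChangeRequest) -> List[str]:
    """Return the list of rule violations; an empty list means the change may proceed."""
    violations = []
    start, end = BLOCKED_WINDOW
    if start <= req.scheduled_at.time() < end:
        violations.append("scheduled inside the blocked time window")
    if req.operator not in AUTHORIZED_OPERATORS:
        violations.append("operator is not authorized")
    if not req.has_rollback_plan:
        violations.append("no rollback plan attached")
    return violations


ok_change = ChangeRequest("alice", datetime(2022, 11, 5, 22, 30), True)
bad_change = ChangeRequest("mallory", datetime(2022, 11, 5, 10, 0), False)

print(evaluate_change(ok_change))   # []
print(evaluate_change(bad_change))  # three violations
```

Returning every violation rather than stopping at the first lets the operator fix all problems in one pass before resubmitting the change.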

