Intelligent O&M - Alibaba DevOps Practice Part 24

Part 24 of this 27-part series discusses the goals of intelligent O&M.

This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.

Alibaba's O&M Team is committed to building an unmanned O&M platform, using intelligence to promote efficient and low-cost application O&M. Intelligent O&M is the next natural development of O&M platforms after informatization and digitalization. Based on a solid technical foundation, intelligent O&M integrates machine learning, optimization algorithms, and expertise in various fields, providing satisfactory solutions for specific O&M scenarios.

AIOps is an intelligent O&M platform built on Alibaba's DevOps experience. It integrates big data O&M with verification through multiple algorithms from the Algorithm Team, taking O&M to a new level. We use AI to help us view data, identify exceptions, and decide on O&M operations, forming an O&M platform that integrates monitoring, management, and control.

Challenges for O&M Systems

In the DevOps era, Alibaba's O&M system faces the following challenges:

  1. Large Scale: The scale of Alibaba's infrastructure has increased exponentially. When there are thousands or tens of thousands of servers, we can still conduct O&M manually. However, when there are millions of servers, it is unrealistic to rely on purely manual operation for any step. Therefore, the first challenge is to ensure secure and efficient O&M.
  2. High Complexity: The diversity and rapid development of Alibaba's businesses place higher requirements on system stability and bring greater challenges to the O&M system. In the past, we measured system availability against a target of 99.99999% and storage against 99.9999%, but businesses such as Hema require 100% availability. For an offline business like this, it is unacceptable if orders cannot be paid for half an hour. We must take an end-to-end perspective and build stability into each stage.
  3. Cost Optimization: Cost is a threshold for entering the market. If you cannot get costs below a certain level, you have no opportunity to compete. In addition to fixed-asset investments, operating costs are a very important part. Technology can be used to optimize processes and reduce the cost of each part, which is key to improving the core competitiveness of the business.
  4. Security: Security is the biggest concern of cloud computing. The system is getting bigger and changing faster, and the internal and external risks it faces are also getting bigger. Numerous changes and upgrades are carried out at the same time every day. Therefore, another challenge is maintaining stability during system changes.
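
To make the availability figures in the second challenge concrete, the following minimal sketch converts an availability target into a yearly downtime budget and shows what a single half-hour outage does to yearly availability. The function names are illustrative and a 365-day year is assumed. Under these assumptions, 99.9999% leaves roughly 31.5 seconds of downtime per year and 99.99999% roughly 3.2 seconds, so a 30-minute payment outage exceeds either budget many times over.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, assuming a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_seconds: float) -> float:
    """Return the yearly availability (in percent) left after a single outage."""
    return 100 * (1 - outage_seconds / SECONDS_PER_YEAR)


for target in (99.9999, 99.99999):
    budget = allowed_downtime_seconds(target)
    print(f"{target}% -> {budget:.2f} s of downtime per year")

print(f"30-minute outage -> {availability_after_outage(30 * 60):.4f}% yearly availability")
```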

The volume and complexity of Alibaba's infrastructure exceed the processing power of a human brain. Therefore, it is necessary to apply machine intelligence to solve these complex problems from a new perspective.


Intelligent O&M Practices

Based on the preceding challenges, we have implemented the unmanned deployment and O&M solution in various business scenarios of Alibaba Group.

Unmanned Deployment

The new-generation release platform supports multiple release modes, such as rolling, blue-green, and canary releases. Machine learning algorithms detect exceptions during the application release process, preventing failures caused by code changes. Building on a large accumulation of monitoring data and log data, we launched an unmanned release system backed by these algorithms.

The unmanned release system, riskfree, has been implemented and optimized for nearly three years since its launch. Its current scope is failure prevention during application release. After an application is connected to riskfree and a release order is submitted, the system analyzes the monitoring data throughout the release process. If any exception occurs, it automatically suspends the release and reports the abnormal metrics and reasons. If developers confirm a problem, they can disable or roll back the application. If no problem exists, the release process continues.
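
The suspend-or-continue behavior described above can be sketched as a simple gate around a batch release. This is only an illustration and not the riskfree implementation: `run_release`, `ReleaseReport`, and the `deploy` and `detect` callbacks are hypothetical names, and the detector is a stand-in that returns a mapping from abnormal metric to reason.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class ReleaseState(Enum):
    COMPLETED = "completed"
    SUSPENDED = "suspended"


@dataclass
class ReleaseReport:
    state: ReleaseState
    released_batches: List[str] = field(default_factory=list)
    abnormal_metrics: Dict[str, str] = field(default_factory=dict)


def run_release(
    batches: List[str],
    deploy: Callable[[str], None],
    detect: Callable[[str], Dict[str, str]],
) -> ReleaseReport:
    """Deploy batch by batch and suspend as soon as the detector reports anything."""
    report = ReleaseReport(state=ReleaseState.COMPLETED)
    for batch in batches:
        deploy(batch)
        report.released_batches.append(batch)
        abnormal = detect(batch)  # {metric name: reason}; empty means healthy
        if abnormal:
            report.state = ReleaseState.SUSPENDED
            report.abnormal_metrics = abnormal
            break  # wait for a developer to roll back, disable, or resume
    return report


# Stand-in detector: the second batch triggers a made-up error-rate exception.
def fake_detect(batch: str) -> Dict[str, str]:
    if batch == "batch-2":
        return {"error_rate": "rose sharply after deployment"}
    return {}


report = run_release(["batch-1", "batch-2", "batch-3"], deploy=lambda b: None, detect=fake_detect)
print(report.state.value, report.released_batches, report.abnormal_metrics)
```

In this sketch the third batch is never deployed, which mirrors the idea that a suspended release waits for a human decision before it continues.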

Pain Points of Online Release

In the past, engineers had to perform the following careful work during the online release:

  • Before Release

Testers perform comprehensive unit tests and integration tests on the code. If a bug is found, developers handle it. There are two problems here. First, some business teams have no dedicated testers because of staffing constraints, so developers also act as testers. Second, not all bugs can be found through testing.

  • During Release

Engineers carry out pre-release, phased release, batch release, and canary release. During the slow release process for each environment, they need to go to the monitoring platform, view all metrics, and even log on to machines to refresh the logs, hoping to spot exception logs that match a particular pattern among numerous logs. For multi-party applications, they also need to check whether the monitoring of upstream and downstream applications shows any problems.

  • After Release

Check whether all the application machines start normally, and disable or replace the failed machines. Check whether the system raises any fault alarms and whether upstream or downstream teams report problems. If so, roll back right away. In short, this process is time-consuming and labor-intensive, and it cannot guarantee that no details are missed. In addition, release personnel differ in experience, so the level of stability assurance varies from release to release.

Our Solution

We have designed an unmanned release system:


The system is divided into two major parts:

  1. Online Analysis: The system detects exceptions across several dimensions, such as system monitoring, business monitoring, log monitoring, and trace calls. It intercepts or rolls back the release order once an exception is detected. If the user judges that there is no real exception, they give feedback and the release continues.
  2. Offline Analysis: The feedback that users give is very useful for our algorithm, which can be adjusted automatically based on it. After feedback data has accumulated for a period of time, the accuracy of exception detection becomes very high.

Algorithm Platform

During a release, the system collects data from each monitoring source, which places high demands on data collection, cleaning, and storage. We have designed an algorithm platform to handle the data sources, algorithm detection, algorithm verification, and algorithm launch processes. The following figure shows the architecture.


It mainly includes three parts:

  1. Data Collection and Storage: Collect data from various monitoring data sources, such as system monitoring, business monitoring, middleware monitoring, log monitoring, database monitoring, and cloud monitoring. The collected data may be stored in time series databases or relational databases based on data features.
  2. Algorithm Result Storage: Store the results of each detection to facilitate result troubleshooting and performance evaluation.
  3. Data Tagging: Each exception detection result is tagged. The tag data is used to retrain the algorithm, forming a positive feedback loop. Release personnel can also be notified of the detection result in real time by email and DingTalk. The result can also feed automatically into the self-recovery process of O&M orchestration described in previous articles. For example, it can automatically replace abnormal machines.
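
The result storage and tagging steps above can be illustrated with a small record type. This is a minimal sketch under stated assumptions: the `DetectionResult` fields, the `precision` metric, and the sample history are all hypothetical, chosen only to show how tagged results could support performance evaluation and provide labeled samples for retraining.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectionResult:
    release_id: str
    metric: str
    algorithm: str
    flagged: bool                    # what the algorithm decided
    user_tag: Optional[bool] = None  # True = real exception, False = false alarm, None = untagged


def precision(results: List[DetectionResult]) -> Optional[float]:
    """Share of tagged alerts that users confirmed as real exceptions."""
    tagged_alerts = [r for r in results if r.flagged and r.user_tag is not None]
    if not tagged_alerts:
        return None
    confirmed = sum(1 for r in tagged_alerts if r.user_tag)
    return confirmed / len(tagged_alerts)


def training_samples(results: List[DetectionResult]) -> List[DetectionResult]:
    """Only tagged results carry a label that can be fed back into retraining."""
    return [r for r in results if r.user_tag is not None]


# Made-up history for illustration only.
history = [
    DetectionResult("r-1", "cpu_util", "BoxplotDetect", flagged=True, user_tag=True),
    DetectionResult("r-2", "rt_ms", "GrubbsTest", flagged=True, user_tag=False),
    DetectionResult("r-3", "error_rate", "BoxplotDetect", flagged=True, user_tag=True),
    DetectionResult("r-4", "qps", "ArimaKSigma", flagged=True),  # not yet tagged
]
print(precision(history), len(training_samples(history)))
```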

Intelligent Algorithm

In the algorithm platform above, we have designed many exception detection algorithms. Exception detection plays an important role in unmanned release systems. It is divided into three main parts:

  1. Data Collection: We have integrated monitoring data and trace analysis from various dimensions. The range of observations far exceeds what human monitoring can cover.
  2. Exception Detection: Our carefully tuned exception detection algorithm does not depend on traditional threshold-based or 3-Sigma detection algorithms. It can automatically determine exceptions and has a good generalization ability. It supports single-index detection, multi-index detection, detection of adjacent indexes, and detection before and after release. The detection algorithms include ArimaKSigma, BoxplotDetect (Tukey), GrubbsTest, and Donat.
  3. Normal Fluctuation Exclusion: Filter normal fluctuations based on historical data and user feedback to obtain accurate detection results. The following figure shows the rule:


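Of the detection algorithms named in the exception detection step, the Tukey boxplot rule is the simplest to illustrate. The sketch below is not Alibaba's BoxplotDetect implementation. It only shows the textbook rule applied to the "detection before and after release" case: fences are built from pre-release samples, and post-release samples outside them are flagged. The function names, the fence multiplier of 1.5, and the sample values are assumptions.

```python
import statistics
from typing import List, Tuple


def tukey_fences(baseline: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(baseline, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def detect_after_release(before: List[float], after: List[float], k: float = 1.5) -> List[float]:
    """Flag post-release samples that fall outside fences built from pre-release data."""
    lower, upper = tukey_fences(before, k)
    return [x for x in after if x < lower or x > upper]


# Hypothetical response times (ms) sampled before and after a release.
before = [100, 102, 98, 101, 99, 103, 97, 100]
after = [101, 99, 140, 104]
print(tukey_fences(before))
print(detect_after_release(before, after))
```

Because the fences come from the data itself rather than a fixed threshold, the same code can be applied to metrics with very different scales, which is one reason quartile-based rules generalize better than hand-set limits.
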
Practice Effect

Since its launch, the system has covered all the application release processes of Alibaba Group, protecting release security and stability. The following figure shows the exception detection results:


With the system enabled, developers can click release and then focus on other things, without having to check on the release process from time to time. If an exception occurs during release, the system notifies the developer through DingTalk or email, and the developer can intervene. If a machine exception occurs, the abnormal machine is replaced automatically with no manual intervention, and the release continues.

In summary, the unmanned release system is an intelligent system for change fault detection and exception recommendation. It determines whether a change will cause a fault by analyzing the multi-dimensional monitoring data collected while the change is executed. If a fault is detected during release, it intercepts the fault and makes intelligent recommendations.

Unmanned Operations – ChatOps

We focus on two of the daily O&M tasks:

  1. O&M operations initiated by the user upon alarms or events
  2. Q&A and consulting for daily O&M

For the first case, we can perform a 360-degree health check on the application through O&M diagnosis to locate the exception and fix it with one click. For the second case, we have released the ChatOps robot to enhance communication and cooperation in DevOps and help R&D personnel complete some dirty, tiring, and mechanical tasks. The goal is to handle consultation and Q&A with zero manual intervention.

Overview of ChatOps

The O&M robot applies chatbots to O&M. It is also an implementation of ChatOps and an important tool for DevOps. It is positioned as an application-oriented intelligent DevOps service assistant:

  • Application-Oriented: It brings together the application developers, testers, and O&M personnel to strengthen communication and cooperation, shorten the product launch time, and reduce labor costs. It can quickly detect and fix product problems, reduce or eliminate the possibility of product service interruptions, and ensure that the development and O&M personnel are always in the same context to understand the status of their applications.
  • DevOps: It emphasizes rapid iteration and continuous delivery, strives for information sharing and technology learning and cooperation, and speeds up the information feedback cycle.
  • Intelligent: It understands the instructions entered by users and determines the values of each parameter of a command based on the metadata in the command slots and the user information. It processes and understands users' instructions through natural language processing.

The ultimate goal of the robot is to let R&D, testing, and O&M engineers work happily through a one-touch experience with responses within seconds.
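
The "Intelligent" point above, filling command parameters from slot metadata and user information, can be sketched as follows. This is a simplified illustration rather than the robot's actual design: the `Slot` structure, the regular expressions standing in for natural language processing, the `VIEW_LOGS_SLOTS` command, and the user profile keys are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    name: str
    pattern: str                        # regex with one capture group, applied to the utterance
    user_default: Optional[str] = None  # key in the user's profile to fall back on


# Hypothetical command definition: "view logs" needs an application and an environment.
VIEW_LOGS_SLOTS: List[Slot] = [
    Slot("app", r"\bapp(?:lication)?\s+([\w-]+)", user_default="owned_app"),
    Slot("env", r"\b(prod|staging|daily)\b", user_default="default_env"),
]


def fill_slots(utterance: str, slots: List[Slot], user: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Fill each slot from the utterance first, then from the user's profile."""
    filled: Dict[str, Optional[str]] = {}
    for slot in slots:
        match = re.search(slot.pattern, utterance, flags=re.IGNORECASE)
        if match:
            filled[slot.name] = match.group(1)
        elif slot.user_default and slot.user_default in user:
            filled[slot.name] = user[slot.user_default]
        else:
            filled[slot.name] = None  # the robot would ask a follow-up question here
    return filled


user = {"owned_app": "order-center", "default_env": "staging"}
print(fill_slots("show me the logs of app trade-gateway in prod", VIEW_LOGS_SLOTS, user))
print(fill_slots("show me the logs", VIEW_LOGS_SLOTS, user))
```

Falling back to user information is what lets a short request such as "show me the logs" still resolve to a complete command.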

ChatOps Advantages

Let's take a look at the value of this robot:

  1. From an individual employee's perspective, it improves work efficiency. It helps users handle simple, repetitive, and boring tasks, such as viewing logs, executing commands, switching to alert, checking machine status, viewing monitoring data, and forwarding O&M events.
  2. From the perspective of team communication, it reduces collaboration costs. Within a team, ChatOps is a transparent, cooperative, and session-driven development mode. If everyone on the team knows what is happening and when, and who will fix it and how, the team achieves a complete and transparent view of each event and a shared, queryable, and recordable event resolution process. This makes it easier for other employees to learn from and refer to the handling of similar events, which is called teaching by doing.

ChatOps is a session-driven O&M mode. It uses chatbots to connect to various system backends, integrating the development, testing, and O&M personnel, tools, environments, and automation processes involved in software development and delivery. Every person in the chat room can share information, learn technology, and cooperate on a specific topic. The testing, release, monitoring, and diagnosis of applications can be sped up and made visible to all.

The benefits of the robot include:

  1. Convenience: Many common operations of different systems are aggregated in the robot, so there is no need to log in to multiple systems to find information.
  2. Collaboration: All the information about an event is pushed to the chat room, so all members can understand what happened.
  3. Transparency: Everyone can see all the information without searching for data repeatedly.


ChatOps Implementation

Let's take a look at the implementation architecture of the robot:


It mainly includes three modules: the dialogue manager, the NLP tools, and the intent dispatcher manager. The dialogue manager determines the intent of the user's utterance and decides whether it starts a new dialogue or continues the existing intent from earlier in the conversation. It calls the processors of the NLP tools to assist in this judgment. The intent dispatcher manager is responsible for interfacing with specific business systems. The dialogue manager passes its processing results to the intent dispatcher manager, which invokes the specific business logic and triggers the execution of tasks.
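
The interaction between the three modules can be sketched in a few lines. This is an illustrative toy, not the robot's real architecture: `classify_intent` uses keyword matching as a stand-in for the NLP tools, and the class names, intent names, and handlers are assumptions made for the example.

```python
from typing import Callable, Dict, Optional


def classify_intent(utterance: str) -> Optional[str]:
    """Stand-in for the NLP tools: keyword matching instead of a real model."""
    text = utterance.lower()
    if "replace" in text and "machine" in text:
        return "replace_machine"
    if "monitor" in text:
        return "query_monitoring"
    return None


class IntentDispatcher:
    """Maps an intent name to the business-system call that carries it out."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, intent: str, handler: Callable[[str], str]) -> None:
        self._handlers[intent] = handler

    def dispatch(self, intent: str, utterance: str) -> str:
        return self._handlers[intent](utterance)


class DialogueManager:
    """Decides whether an utterance opens a new dialogue or continues the current one."""

    def __init__(self, dispatcher: IntentDispatcher) -> None:
        self._dispatcher = dispatcher
        self._current_intent: Optional[str] = None

    def handle(self, utterance: str) -> str:
        intent = classify_intent(utterance)
        if intent is not None:
            self._current_intent = intent  # a recognized intent starts a new dialogue
        elif self._current_intent is None:
            return "Sorry, I did not understand that."
        # Otherwise the utterance is treated as a follow-up to the current intent.
        return self._dispatcher.dispatch(self._current_intent, utterance)


dispatcher = IntentDispatcher()
dispatcher.register("query_monitoring", lambda u: f"[monitoring] handling: {u}")
dispatcher.register("replace_machine", lambda u: f"[replace] handling: {u}")

bot = DialogueManager(dispatcher)
print(bot.handle("show monitoring for my application"))
print(bot.handle("and for the last hour?"))  # no new intent, continues the previous one
```

Keeping the dispatcher separate from the dialogue logic means a new business system can be connected by registering one more handler, without touching intent detection.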

ChatOps Practices

Let's look at several implementation scenarios of the robot in Alibaba Group:

1.  Intelligent Q&A:


2.  Query the monitoring information of an application:


3.  Machine replacement:


In short, ChatOps can help us improve development efficiency and development happiness.


As intelligent algorithms mature and a large amount of O&M data accumulates, intelligence is being implemented in more O&M scenarios. Alibaba has developed a series of intelligent O&M products based on the R&D scenarios of Alibaba Group and uses them to empower small- and medium-sized enterprises. Our principle is to keep the complexity for ourselves and give simplicity to users. Intelligentization is the ultimate state of O&M. In the future, we will make greater investments in automatic, unmanned, and intelligent O&M to build a world-class intelligent O&M platform.
