This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.
Alibaba's O&M Team is committed to building an unmanned O&M platform, using intelligence to promote efficient and low-cost application O&M. Intelligent O&M is the next natural development of O&M platforms after informatization and digitalization. Based on a solid technical foundation, intelligent O&M integrates machine learning, optimization algorithms, and expertise in various fields, providing satisfactory solutions for specific O&M scenarios.
AIOps is an intelligent O&M platform built on Alibaba's DevOps experience. It combines big data O&M with verification by multiple algorithms from the Algorithm Team, taking O&M to a new level. We use AI to help us view data, identify exceptions, and decide on O&M operations, forming an O&M platform with integrated monitoring, management, and control.
In the DevOps era, Alibaba's O&M system faces the following challenges:
The volume and complexity of Alibaba's infrastructure exceed the processing power of a human brain. Therefore, it is necessary to apply machine intelligence to solve these complex problems from a new perspective.
Based on the preceding challenges, we have implemented the unmanned deployment and O&M solution in various business scenarios of Alibaba Group.
The new-generation release platform supports multiple release modes, such as rolling, blue-green, and canary releases. Machine learning algorithms detect exceptions during the application release process, helping prevent failures caused by code changes. Building on a large body of accumulated monitoring and log data, we launched an unmanned release system backed by these algorithms.
The unmanned release system, riskfree, took nearly three years to implement and optimize after its launch. Its current scope is failure prevention during application release. Once an application is connected to riskfree and a release order is submitted, the system analyzes the monitoring data throughout the release process. If an exception occurs, it automatically suspends the release and reports the abnormal metrics and the reasons. If developers confirm a problem, they can disable or roll back the application. If there is no problem, the release continues.
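The suspend-and-decide flow described above can be sketched as a small state machine. This is a minimal illustration, not riskfree's actual implementation: `ReleaseOrder`, `run_release`, `resolve`, and the `deploy`, `detect`, and `rollback` callbacks are all hypothetical names, and the detector is assumed to return a mapping from abnormal metric to reason.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class ReleaseState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class ReleaseOrder:
    """Hypothetical release order: hosts are grouped into batches."""
    app: str
    batches: List[List[str]]
    state: ReleaseState = ReleaseState.RUNNING
    next_batch: int = 0
    report: Dict[str, str] = field(default_factory=dict)  # abnormal metric -> reason


# Assumed detector shape: given the hosts just released, return
# {abnormal metric: reason}; an empty dict means the batch looks healthy.
Detector = Callable[[List[str]], Dict[str, str]]


def run_release(order: ReleaseOrder, deploy: Callable[[str], None],
                detect: Detector) -> ReleaseOrder:
    """Deploy batch by batch and suspend as soon as the detector reports anything."""
    while (order.state is ReleaseState.RUNNING
           and order.next_batch < len(order.batches)):
        batch = order.batches[order.next_batch]
        for host in batch:
            deploy(host)
        order.next_batch += 1
        findings = detect(batch)
        if findings:
            order.state = ReleaseState.SUSPENDED
            order.report = findings
    if order.state is ReleaseState.RUNNING:
        order.state = ReleaseState.COMPLETED
    return order


def resolve(order: ReleaseOrder, problem_confirmed: bool,
            deploy: Callable[[str], None], detect: Detector,
            rollback: Callable[[str], None]) -> ReleaseOrder:
    """Apply the developer's decision to a suspended release."""
    if order.state is not ReleaseState.SUSPENDED:
        return order
    if problem_confirmed:
        # Roll back every host that has already received the new version.
        for batch in order.batches[:order.next_batch]:
            for host in batch:
                rollback(host)
        order.state = ReleaseState.ROLLED_BACK
        return order
    # False alarm: clear the report and continue with the remaining batches.
    order.report = {}
    order.state = ReleaseState.RUNNING
    return run_release(order, deploy, detect)
```

In this sketch the detector runs once per batch for simplicity. The article describes analysis throughout the whole release, which a real system would more likely do continuously.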
In the past, engineers had to perform the following careful work during the online release:
Testers run comprehensive unit and integration tests on the code, and developers fix any bugs that are found. There are two problems here. First, some business teams have no dedicated testers because of staffing constraints, so developers double as testers. Second, not all bugs can be found through testing.
Perform pre-release, phased release, batch release, and canary release. In each environment, while the release slowly progresses, you need to go to the monitoring platform, view all the metrics, and even log on to machines to follow the logs, hoping to spot exception logs with an unusual pattern among countless log lines. For applications with multiple dependencies, you also need to check whether upstream and downstream application monitoring shows any problems.
Check whether all the application machines start normally, and disable or replace the failed machines. Check whether the system has raised alarms and whether upstream and downstream teams have raised issues. If so, roll back right away. In short, this process is time-consuming and labor-intensive, and it cannot guarantee that no detail is missed. In addition, release personnel differ in experience, so the level of release stability assurance varies.
We have designed an unmanned release system:
The system is divided into two major parts:
During a release, the system collects data from each monitoring source, which places high demands on data collection, cleaning, and storage. We designed an algorithm platform to handle data source access, algorithm detection, algorithm verification, and algorithm launch. The following figure shows the architecture.
It mainly includes three parts:
In the algorithm platform above, we have designed many exception detection algorithms. Exception detection plays an important role in unmanned release systems. It is divided into three main parts:
Since its launch, the system has covered all the application release processes of Alibaba Group, safeguarding release security and stability. The following figure shows the exception detection results:
With the system enabled, developers can click release and then focus on other things, without having to keep checking on the release process. If an exception occurs during the release, the system notifies the developer through DingTalk or email so that the developer can intervene. If a machine exception occurs, the abnormal machine is replaced automatically without manual intervention, and the release continues.
In summary, the unmanned release system is an intelligent system for change fault detection and exception recommendation. It determines whether a change will cause a fault by analyzing multi-dimensional monitoring data during the change execution. If a fault is detected during release, the system intercepts it and provides intelligent recommendations.
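As a rough illustration of judging a change from multi-dimensional monitoring data, the sketch below compares each metric after the change against its pre-change baseline using a simple z-score. The article does not describe Alibaba's actual detection algorithms, so the z-score rule, the function name `detect_change_anomalies`, and the `threshold` and `rel_floor` parameters are all assumptions chosen for brevity.

```python
from statistics import mean, pstdev
from typing import Dict, List


def detect_change_anomalies(
    baseline: Dict[str, List[float]],
    current: Dict[str, List[float]],
    threshold: float = 3.0,   # assumed cut-off, in baseline standard deviations
    rel_floor: float = 0.01,  # assumed floor so a flat baseline is not oversensitive
) -> Dict[str, str]:
    """Return {metric: reason} for metrics whose mean shifted after the change."""
    findings: Dict[str, str] = {}
    for metric, before in baseline.items():
        after = current.get(metric)
        if not after or len(before) < 2:
            continue  # not enough data to judge this metric
        mu = mean(before)
        # Floor the spread so an almost constant baseline does not flag noise.
        sigma = max(pstdev(before), rel_floor * abs(mu), 1e-9)
        z = (mean(after) - mu) / sigma
        if abs(z) > threshold:
            findings[metric] = (
                f"mean moved from {mu:.2f} to {mean(after):.2f} (z={z:.1f})"
            )
    return findings


if __name__ == "__main__":
    before = {"error_rate": [1.0, 1.2, 0.8, 1.0], "rt_ms": [100, 102, 98, 100]}
    after = {"error_rate": [5.0, 6.0], "rt_ms": [101, 99]}
    # Only error_rate is expected to be reported here; rt_ms stays near baseline.
    print(detect_change_anomalies(before, after))
```

A production detector would also need to handle seasonality, sparse metrics, and log patterns, which this sketch ignores.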
We focus on two of the daily O&M tasks:
For the first case, we can perform a 360-degree health check on the application through O&M diagnosis to locate the exception and fix it with one click. For the second case, we have released the ChatOps robot to enhance communication and cooperation within DevOps and to take tedious, tiring, and mechanical tasks off R&D personnel. The goal is consultation and Q&A with zero manual intervention.
The O&M robot applies chatbots to O&M. It is also an implementation of ChatOps and an important tool for DevOps. It is positioned as an application-oriented intelligent DevOps service assistant:
The ultimate goal of the robot is to let R&D, testing, and O&M engineers work happily through a one-click experience with responses within seconds.
Let's take a look at the value of this robot:
ChatOps is a session-driven O&M mode. It uses chatbots to connect to various backend systems, integrating the development, testing, and O&M personnel, tools, environments, and automation processes involved in software development and delivery. Everyone in the chat room can share information, learn technology, and cooperate on a specific topic. The testing, release, monitoring, and diagnosis of applications can be sped up and made visible to all.
The robot benefits include:
Let's take a look at the implementation architecture of the robot:
It mainly includes three modules: dialogue manager, NLP tools, and intent dispatcher manager. The dialogue manager determines the intent of the user's utterance, that is, whether it starts a new dialogue or continues an existing intent from earlier in the conversation. It calls the processors in NLP tools to assist in this judgment. The intent dispatcher manager interfaces with specific business systems. The dialogue manager passes its processing results to the dispatcher, which invokes the specific business logic and triggers task execution.
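The interaction among the three modules can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the class and method names are hypothetical, the keyword and regex matching stands in for real NLP processors, and the `app=<name>` slot syntax is invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Intent:
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


class NlpTools:
    """Keyword/regex stand-in for real NLP processors (assumption)."""
    PATTERNS = {
        "query_monitoring": re.compile(r"\b(monitoring|metrics)\b", re.I),
        "replace_machine": re.compile(r"\breplace\b", re.I),
    }
    APP = re.compile(r"\bapp=([\w-]+)", re.I)  # hypothetical slot syntax

    def classify(self, text: str) -> Optional[str]:
        for name, pattern in self.PATTERNS.items():
            if pattern.search(text):
                return name
        return None

    def extract_slots(self, text: str) -> Dict[str, str]:
        match = self.APP.search(text)
        return {"app": match.group(1)} if match else {}


class IntentDispatcher:
    """Maps a completed intent to the business handler registered for it."""
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Dict[str, str]], str]] = {}

    def register(self, name: str, handler: Callable[[Dict[str, str]], str]) -> None:
        self._handlers[name] = handler

    def dispatch(self, intent: Intent) -> str:
        handler = self._handlers.get(intent.name)
        return handler(intent.slots) if handler else f"No handler for {intent.name}."


class DialogueManager:
    """Decides whether an utterance starts a new dialogue or continues one."""
    REQUIRED = ("app",)

    def __init__(self, nlp: NlpTools, dispatcher: IntentDispatcher) -> None:
        self.nlp = nlp
        self.dispatcher = dispatcher
        self.pending: Dict[str, Intent] = {}  # user -> unfinished intent

    def handle(self, user: str, text: str) -> str:
        name = self.nlp.classify(text)
        slots = self.nlp.extract_slots(text)
        if name is not None:
            intent = Intent(name, slots)      # a new dialogue begins
        elif user in self.pending:
            intent = self.pending[user]       # continue the earlier intent
            intent.slots.update(slots)
        else:
            return "Sorry, I did not understand that."
        missing = [s for s in self.REQUIRED if s not in intent.slots]
        if missing:
            self.pending[user] = intent
            return f"Which {missing[0]}?"
        self.pending.pop(user, None)
        return self.dispatcher.dispatch(intent)
```

For example, after registering a handler for `query_monitoring`, the utterance "show me the metrics" makes the dialogue manager ask for the application, and a follow-up "app=order-center" is treated as a continuation and dispatched to that handler.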
Let's look at several implementation scenarios of the robot in Alibaba Group:
1. Intelligent Q&A:
2. Query the monitoring information of an application:
3. Machine replacement:
In short, ChatOps can help us improve development efficiency and developer satisfaction.
As intelligent algorithms mature and a large amount of O&M data accumulates, intelligence is being implemented in more O&M scenarios. Alibaba has developed a series of intelligent O&M products based on the R&D scenarios of Alibaba Group and used them to empower small and medium-sized enterprises. Our principle is to keep the complexity for ourselves and give simplicity to users. Intelligence is the ultimate state of O&M. In the future, we will invest more in automatic, unmanned, and intelligent O&M to build a world-class intelligent O&M platform.