O&M with Integrated Monitoring, Management, and Control - Alibaba DevOps Practice Part 19

Part 19 of this 27-part series introduces the background and ideas behind building an integrated O&M mode at Alibaba.

This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.

Alibaba's O&M system has passed through the era of scripts, the era of tools, and the era of DevOps. It is now implementing automated O&M and exploring intelligent O&M. In 2008-2009, the system was still in the era of scripts, and many O&M tasks had to be done by scripts. As businesses expanded and complexity increased, script-based O&M became increasingly difficult, so Alibaba introduced O&M tools. In the era of O&M tools, Alibaba's O&M system went through the following stages:

  1. The stage in which the Tools Team and the O&M Team worked in parallel
  2. The stage in which the Tools Team ensured the unified quality of tools
  3. The stage of the software-oriented Tools Team, which had DevOps ideas and features

Finally, the Alibaba Application O&M Team underwent a major transformation. The former Application O&M Team was split up and merged into the Software Development Team of each business to put the DevOps idea into practice.


In the DevOps era, mature, streamlined O&M tools have improved O&M efficiency to some extent, but they remain separate from each other. For example, monitoring tools are separate from maintenance tools, and inspection tools are separate from quick restoration tools. As a result, discovering, locating, and recovering from problems through monitoring is a long and inefficient process during the day-to-day O&M of applications. For O&M developers, the expected state after a business application goes online is No Ops: the monitoring and O&M systems detect exceptions and automatically fix them to restore the application and business to a normal state, and then send a notification to the developers. Working towards No Ops, we started to build a system with integrated monitoring, management, and control for Alibaba's application O&M.

New Challenges

With the continuous development of Alibaba businesses and the continuous evolution of the technical architecture, new scenarios and problems are constantly emerging, which bring new challenges to application-centered monitoring and O&M.

Ultra-Large Scale

Alibaba runs many businesses in different forms and on a large scale. Every year, the Double 11 Global Shopping Festival needs the support of ultra-large-scale IaaS resources. Before 2015, Alibaba spent a huge amount of money every year to purchase servers and build IDCs. From 2015 to 2019, Alibaba gradually moved to the cloud. During this period, part of Alibaba's infrastructure was located in on-premises data centers, and the remainder was located in Alibaba's cloud data centers. The infrastructure also needed to support active geo-redundancy in the same city or remote cities. Therefore, the company had to be highly capable of managing ultra-large-scale resources both on premises and in the cloud. After Alibaba fully migrated to the cloud in 2019, it faced a new ultra-large-scale resource management scenario: hybrid cloud.

O&M Efficiency

Businesses change rapidly, and the company's key businesses iterate especially quickly. While managing ultra-large clusters, we need the capabilities to continuously and efficiently release, deploy, change, and configure applications to guarantee both business continuity and rapid iteration. This is also the problem that DevOps needs to solve in the continuous O&M field.

O&M Security

Security is the foundation of any industry, especially in the IT O&M field. O&M failures and incidents, such as system downtime, data exceptions, data loss, and database deletion, are not uncommon, and they can threaten the survival of an enterprise. Therefore, preventing and eliminating high-risk O&M failures is a goal DevOps has always pursued. Across today's many business forms and cloud technology architectures, the most important thing is to ensure the security of enterprise IT system O&M.

Business Continuity

In Alibaba's traditional monitoring and O&M mode, the O&M engineers of an application need to configure monitoring items and alerting rules in the monitoring system. When a monitoring item triggers an alerting rule, the engineers receive an alert notification. They then need to open a computer and create a work order on the O&M tool platform. After the work order is handled, they must keep observing whether the monitoring items return to the normal state. If an engineer receives an alert on a holiday or during a break, the engineer may be unable to check the problem in time and may need to contact other team members to handle it. If the alert arrives while the engineer is asleep, the engineer has to get up and open a computer to handle the problem. The entire handling process is long and requires a lot of human effort. As a result, labor costs are high, and job satisfaction among O&M engineers may suffer.

On the other hand, as businesses continue to develop, the number of systems, monitoring items, and alerts keeps increasing. O&M engineers gradually become numb to alert notifications or start to ignore them, so they may miss important alerts, which leads to online business failures. New business forms have flourished in recent years, such as Taobao Live, Hema offline stores, Ele.Me takeaway, and DingTalk online education services. These businesses have virtually no tolerance for faults. The 99.99% availability that used to be the best target can no longer meet the requirements of new services, and the traditional mode of separated monitoring and O&M cannot meet the 100% business continuity they require.
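
To make the 99.99% figure concrete, the yearly downtime budget behind an availability target can be worked out with simple arithmetic. The sketch below is only an illustration: the function name `allowed_downtime_minutes` and the 365-day year are assumptions, not something taken from Alibaba's systems.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions, 99.99% availability still leaves a budget of roughly 52.56 minutes of downtime per year, which is why businesses with virtually no tolerance for faults ask for more.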

Solution One

We need to ensure the continuous operation of production services, improve the overall efficiency of the business system from alert reporting to fault recovery, and save labor costs while ensuring security. Therefore, we consider combining monitoring-based alert reporting with O&M execution to achieve automatic fault detection, automatic and rapid fault location, and automatic recovery, and so reach the state of No Ops.
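
The idea of joining alert reporting with O&M execution can be sketched as a small closed loop: detect, pick a recovery action, execute it, and notify. This is a minimal illustration under stated assumptions, not Alibaba's implementation. The names `Alert`, `detect`, `handle`, and the `runbook` mapping are hypothetical, and fault location is reduced here to a lookup from metric name to recovery action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    app: str
    metric: str
    value: float
    threshold: float


def detect(metrics: Dict[str, float], thresholds: Dict[str, float], app: str) -> List[Alert]:
    """Detection: emit an alert for every metric above its threshold."""
    return [
        Alert(app, name, value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


def handle(alert: Alert,
           runbook: Dict[str, Callable[[str], bool]],
           notify: Callable[[str], None]) -> bool:
    """Location and recovery: look up an action for the alert, run it, report."""
    action = runbook.get(alert.metric)
    if action is None:
        notify(f"{alert.app}: no automatic action for {alert.metric}, escalating to on-call")
        return False
    recovered = action(alert.app)
    status = "recovered automatically" if recovered else "automatic recovery failed, escalating"
    notify(f"{alert.app}: {alert.metric}={alert.value} (limit {alert.threshold}), {status}")
    return recovered


if __name__ == "__main__":
    messages: List[str] = []
    # Hypothetical recovery action; a real one might restart or roll back the app.
    runbook = {"error_rate": lambda app: True}
    alerts = detect({"error_rate": 0.2, "cpu": 0.4},
                    {"error_rate": 0.05, "cpu": 0.9}, "demo-app")
    for alert in alerts:
        handle(alert, runbook, messages.append)
    print(messages)
```

In this sketch, the engineer is notified only after the action has run, or when no automatic action exists, which mirrors the No Ops goal of notifying developers after the fix rather than before it.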

Before the integrated monitoring, management, and control system is built, traditional monitoring and O&M are separate. If O&M engineers want to track the operating status of the system while applications iterate, they must define and configure all kinds of metrics for these applications on the monitoring platform in advance. While applications are being changed, they must actively watch the metrics, or set an alerting rule for each metric and obtain timely fault information by subscribing to the preconfigured alerts. When a fault occurs, the engineers need to view the monitoring records, application logs, and application call procedure to analyze the cause of the fault. Then, they must select tasks to perform on the O&M platform and verify whether the task execution results meet expectations. The whole process is: clarify requirements → configure monitoring metrics and alerts → analyze the causes → determine troubleshooting methods → execute tasks → verify the execution results. Every stage requires the involvement of O&M engineers.


Solution Two

Alibaba has developed a set of security engineering standards for business systems based on practical experience to ensure business continuity as we gradually promote integrated monitoring, management, and control. They can help detect business faults in advance, locate them automatically, and restore businesses quickly and automatically, providing multiple solutions in the monitoring, O&M, and security protection fields.

Security Protection

In the process of promoting DevOps, we must not introduce more uncontrollable factors into the existing environment, especially in high-risk scenarios. Handing O&M tasks over to O&M developers must not create global system risks. For this reason, a security protection solution was created.
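
One way to picture protection for high-risk scenarios is a check that runs before any change is executed. The sketch below is an assumption-laden illustration rather than a description of Alibaba's protection solution. The action names, the 10% limit, and the `ChangeRequest` and `check_change` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical policy values, chosen only for illustration.
HIGH_RISK_ACTIONS = {"delete_database", "format_disk", "shutdown_cluster"}
MAX_TARGET_RATIO = 0.1  # one change may touch at most 10% of an application's hosts


@dataclass
class ChangeRequest:
    action: str
    target_hosts: int
    total_hosts: int
    approved: bool = False


def check_change(request: ChangeRequest) -> Tuple[bool, str]:
    """Decide whether a change may run, and explain the decision."""
    if request.total_hosts <= 0 or not 0 <= request.target_hosts <= request.total_hosts:
        return False, "invalid host counts"
    if request.approved:
        return True, "allowed by explicit approval"
    if request.action in HIGH_RISK_ACTIONS:
        return False, "high-risk action requires explicit approval"
    if request.target_hosts / request.total_hosts > MAX_TARGET_RATIO:
        return False, "blast radius exceeds the allowed ratio; split into batches"
    return True, "allowed"


if __name__ == "__main__":
    print(check_change(ChangeRequest("restart", 5, 100)))
    print(check_change(ChangeRequest("delete_database", 1, 100)))
    print(check_change(ChangeRequest("restart", 50, 100)))
```

The point of the design is that the guard sits in the execution path, so a developer who takes over an O&M task cannot trigger a wide-impact change by accident. Only an explicit approval lifts the restriction.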

Panoramic Monitoring

Monitoring is the foundation of O&M, and traditional resource monitoring or application monitoring can no longer meet the requirement for rapid fault detection. Based on Alibaba's extensive practice, we have developed an application-centric full-procedure monitoring solution that covers upper-layer businesses, PaaS businesses, and all the underlying resources, providing strong support for fault detection and location.

Diversified O&M

To integrate monitoring, management, and control and to promote the rapid and automatic recovery of business faults, we have replaced the original single-event-execution mode. We have explored application-oriented orchestrated O&M, intelligent O&M, and ChatOps O&M modes, opening up new perspectives in the O&M field.


The integration of monitoring, management, and control for Alibaba's application O&M is still in the stage of exploration and development. This article mainly introduces the background and ideas behind building an integrated O&M mode for Alibaba. Based on the application-centric approach, we monitor the running status of applications in real time from the perspective of application monitoring management. We make secure changes to applications through efficient release and deployment and flexible O&M orchestration. We also implement advanced protection for applications through intelligent O&M and security protection. The details are described in the next article.
