Best Practices for Event-Based Automated O&M

This article discusses event-based automated O&M, including architecture and cloud-managed events.

By Alibaba Cloud ECS

The following article highlights a speech entitled Best Practices for Event-Based Automated O&M by Wenle Bao (Alibaba Cloud Elastic Computing Technical Expert). This article is divided into four parts:

  1. Why Are Events Important?
  2. Make Event Notifications Effective
  3. Event-Driven O&M Architecture
  4. O&M of Cloud-Managed Events

Why Are Events Important?


System events represent changes in the status of cloud resources. Let's take the elastic computing system event as an example. The preceding figure represents the elastic computing system event source.

Providing cloud servers to users requires the underlying physical infrastructure and an intermediate layer of virtualization services. The guest OS runs on top of the virtualization services and provides services to users.

In the O&M system events section, Alibaba Cloud is responsible for the O&M of the physical infrastructure and virtualization services. When computing, storage, or network components fail, Alibaba Cloud issues system events of the O&M class. Cloud vendors and users should work together to handle these O&M system events.

The resource state change event section does not necessarily represent a fault or problem. However, it is the foundation for implementing the event-driven architecture.


The preceding figure shows some typical system events.

In terms of unexpected exceptions, instance downtime may cause service interruption. If an instance that uses local disks goes down, Alibaba Cloud cannot decide on the user's behalf whether to migrate the instance. Therefore, the user must respond.

In terms of planned O&M, active O&M events are the most common, for example, an instance that is scheduled to restart due to system maintenance. When the underlying hardware (such as computing, storage, or the network) has a problem that is not serious enough to cause immediate downtime, Alibaba Cloud detects it and sends a planned O&M event to the users. If the users do not respond within a certain period, Alibaba Cloud migrates the instance to healthy hardware for them.

Users who respond can pick the time within the operation window given by Alibaba Cloud that has the least impact on their service and migrate instances in advance to avoid the scheduled restart.

In terms of fees, if an instance expires, the system will send an event three days before the instance expires. Users need to decide their renewal methods, such as auto-renewal.

In terms of status notification, instance lifecycle events indicate that the control status of an ECS instance has changed. They cover instance creation, startup, shutdown, and release.


The preceding figure shows several common forms of event O&M on the cloud. From bottom to top, the degree of automation increases. Alibaba Cloud recommends that users automate event O&M wherever possible.

Layer 1: Users receive notifications passively and log on to the console to process them manually.

Layer 2: Users are aware of handling events and subscribe to event notifications.

Layer 3: Users have the technical skills to build their own automated or semi-automated O&M system. They can process event messages at various levels as needed through an event-driven architecture and call the O&M APIs of cloud products during processing.

Layer 4: Managed automated O&M. The users completely put the O&M logic of the event on Alibaba Cloud and perform management through Alibaba Cloud.

Make Event Notifications Effective


Next, let's talk about how to make event notifications precise. CloudMonitor collects the system events of all cloud products and provides a unified query entry and event alert functions.

Currently, more than 100 cloud products send their event information to CloudMonitor, including cloud servers, ApsaraDB, and container services.

CloudMonitor on the user side provides two types of services:

Type 1: Active queries by users. Users can query system events in the console or through OpenAPI.

Type 2: Event notifications pushed from the system to users. These rely on CloudMonitor event alert rules and on subscriptions to event notifications.

CloudMonitor notification channels fall into two main categories:

Category 1 - For People: It includes phone calls, text messages, emails, and webhook-based tools, such as DingTalk and Lark.

Category 2 - For Automated O&M Programs or O&M Systems: It includes message queues, log service, function compute, and URL callback. With the corresponding software libraries, users can automate the processing of events.


The preceding figure shows the common format of CloudMonitor events. The system events of more than 100 cloud products are integrated into CloudMonitor and incorporated into a unified event model.

Event data is in the JSON format, and the outer layer is the common attributes of an event, such as the unique ID of the event, the source of the event, the event level, the event name, the time when the event occurred, and the region where the event occurred.

The content field represents the content of the event. It is associated with a specific type of event. The event logic of cloud products determines that different events have different content.
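As a rough illustration of this layout, the sketch below parses one event in Python and separates the common attributes from the event-specific content. The field names (`id`, `product`, `level`, `name`, `eventTime`, `regionId`, `content`) and the sample values are assumptions made for this example, not the exact CloudMonitor schema.

```python
import json

# Hypothetical sample event; field names and values are illustrative only.
RAW_EVENT = """
{
  "id": "evt-0001",
  "product": "ECS",
  "level": "CRITICAL",
  "name": "instance_state_change",
  "eventTime": "2021-01-01T08:00:00Z",
  "regionId": "cn-hangzhou",
  "content": {"resourceId": "i-example", "state": "Stopped"}
}
"""

# Outer-layer attributes shared by every event, whatever the product.
COMMON_FIELDS = ("id", "product", "level", "name", "eventTime", "regionId")


def split_event(raw: str):
    """Return (common attributes, event-specific content) from a JSON event."""
    event = json.loads(raw)
    common = {key: event.get(key) for key in COMMON_FIELDS}
    content = event.get("content", {})
    return common, content


common, content = split_event(RAW_EVENT)
print(common["name"], content["state"])
```

Keeping the two parts separate lets generic code (routing, storage) rely only on the common attributes, while product-specific handlers interpret `content`.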


EventRule is a filtering rule for events. EventRule matches events based on the common attributes and content of events.

Rule Targets are the destinations of event notifications. Alert notifications are aimed at people and include phone calls, text messages, emails, DingTalk, and robots. Message service queues, function compute, URL callbacks, and log service are intended for consumption by automated programs.

Once an event that matches the event alert rule occurs, CloudMonitor routes the event to the corresponding Target. If multiple Targets are selected, each Target will receive a notification.
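The matching and fan-out behavior can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `EventRule` class, its `product` and `levels` fields, and the `route` function are hypothetical names for this example and do not reflect the CloudMonitor API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class EventRule:
    """Hypothetical rule: match on common attributes, then notify targets."""
    product: str
    levels: List[str]
    targets: List[Callable[[Event], None]] = field(default_factory=list)

    def matches(self, event: Event) -> bool:
        return (event.get("product") == self.product
                and event.get("level") in self.levels)


def route(event: Event, rules: List[EventRule]) -> int:
    """Deliver the event to every target of every matching rule."""
    delivered = 0
    for rule in rules:
        if rule.matches(event):
            for target in rule.targets:
                target(event)
                delivered += 1
    return delivered


# Two stand-in targets: one for people (alerts), one for programs (queue).
alerts: List[Event] = []
mq: List[Event] = []
rule = EventRule("ECS", ["CRITICAL"], [alerts.append, mq.append])

count = route({"product": "ECS", "level": "CRITICAL", "name": "demo"}, [rule])
ignored = route({"product": "ECS", "level": "INFO", "name": "demo"}, [rule])
print(count, ignored)
```

Each selected target receives its own copy of the notification, and an event that matches no rule is simply not delivered.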


To make event notifications effective and keep alert rules fine-grained, Alibaba Cloud uses labels to group resources. CloudMonitor application groups are then created based on dynamic label rules.

As a result, CloudMonitor application groups and labels have a one-to-one relationship. Users can create application groups by importing labels into the application management system. Application management automatically configures CloudMonitor application groups.

Create alert rules based on the group. First, create different contact groups based on different roles and responsibilities. Next, split the alert rules by event severity and by recipient.

In addition, users can use CloudMonitor filtering capabilities for fine-grained filtering. A single type of event can cover several states. For example, an instance may be created, released, started, or stopped. Users can use keyword filtering or SQLFilter to select one specific state within a type of event.
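The idea behind keyword filtering can be shown with a minimal Python sketch. It uses a plain substring match on the serialized event content, which is an assumption for illustration. The actual matching semantics of CloudMonitor keyword filtering and SQLFilter may differ, and the event name and state values are hypothetical.

```python
import json
from typing import Any, Dict, List


def keyword_match(event: Dict[str, Any], keywords: List[str]) -> bool:
    """True if any keyword appears in the serialized event content."""
    text = json.dumps(event.get("content", {}))
    return any(keyword in text for keyword in keywords)


# One event type, several states (hypothetical names and values).
events = [
    {"name": "instance_state_change", "content": {"state": "Running"}},
    {"name": "instance_state_change", "content": {"state": "Stopped"}},
    {"name": "instance_state_change", "content": {"state": "Deleted"}},
]

# Keep only the notifications about stopped or released instances.
selected = [e for e in events if keyword_match(e, ["Stopped", "Deleted"])]
print([e["content"]["state"] for e in selected])
```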

In terms of selecting a notification method, users should choose different notification methods according to the level of the event. DingTalk group alerts are recommended so that event notifications do not turn into spam text messages and emails.

Create an event alert template after event rules are specified. Apply the template to N groups. This enables batch replication of alert rules and reduces maintenance pressure.
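Batch replication amounts to expanding one template across many groups. The following Python sketch shows that expansion. The template fields (`event`, `level`, `notify`) and the group names are hypothetical and only serve the example.

```python
from typing import Dict, List

# Hypothetical template: each entry is one alert rule definition.
TEMPLATE: List[Dict[str, str]] = [
    {"event": "instance_state_change", "level": "INFO", "notify": "dingtalk"},
    {"event": "instance_down", "level": "CRITICAL", "notify": "phone"},
]


def apply_template(template: List[Dict[str, str]],
                   groups: List[str]) -> List[Dict[str, str]]:
    """Copy every template rule into every application group."""
    return [dict(rule, group=group) for group in groups for rule in template]


rules = apply_template(TEMPLATE, ["web", "order", "payment"])
print(len(rules))
```

With this shape, changing a rule means editing the template once and reapplying it, instead of editing one rule per group.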

Event-Driven O&M Architecture


Polling resource status through OpenAPI to detect status changes has poor real-time performance, generates a large number of redundant requests, consumes considerable system resources, and carries a risk of throttling. It also relies on the APIs of multiple cloud products, which further increases complexity. Moving to an event-driven architecture improves real-time performance and stability and reduces system consumption.


A customer originally obtained event information by polling the APIs of cloud products and stored it in the O&M system. The O&M team pushed events to the O&M portal, where each business system was responsible for responding to events. Business systems were not directly connected to the Alibaba Cloud console.

Because the customer used multiple cloud products, the O&M system contained N separate pieces of polling code, and real-time performance was poor.

The system was then transformed into an event-driven architecture. The system sets event alert rules on CloudMonitor during the initialization phase. After a cloud product publishes an event, CloudMonitor detects that the event matches a rule and pushes it to the message queue specified by the customer. The customer's O&M system pulls event messages from the message queue, stores them, and pushes them to the O&M portal.

The preceding changes simplify code logic, reduce resource consumption, and improve real-time performance and stability.
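The consumer side of such an architecture can be sketched as follows. This is a minimal Python illustration that uses the standard library `queue.Queue` as an in-memory stand-in for the customer's message queue. The names `event_store`, `portal_feed`, and `drain` are hypothetical, and no real message service SDK is involved.

```python
import queue
from typing import Dict, List

# In-memory stand-ins for the message queue, event store, and O&M portal.
message_queue: "queue.Queue[Dict[str, str]]" = queue.Queue()
event_store: List[Dict[str, str]] = []
portal_feed: List[str] = []


def drain(mq: "queue.Queue[Dict[str, str]]") -> int:
    """Pull every pending event, store it, and push a summary to the portal."""
    handled = 0
    while True:
        try:
            event = mq.get_nowait()
        except queue.Empty:
            return handled
        event_store.append(event)
        portal_feed.append(f"{event['product']}: {event['name']}")
        handled += 1


# CloudMonitor would push matching events; here they are enqueued by hand.
message_queue.put({"product": "ECS", "name": "instance stopped"})
message_queue.put({"product": "RDS", "name": "failover finished"})
handled = drain(message_queue)
print(handled, portal_feed)
```

One consumer loop replaces the per-product polling code, because every product's events arrive in the same format through the same queue.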


Auto scaling works the same way. After an instance is released, ECS publishes an instance status change event. CloudMonitor routes the event to the auto scaling message queue according to the rules. Auto scaling then consumes the event message to obtain the instance information, removes the instance from the scaling group, and proceeds to the next step based on the scaling rules.
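A simplified Python model of that handler is shown below. The `scaling_groups` structure, the `instanceId` and `state` field names, and the `Deleted` state value are assumptions made for this sketch, not the actual auto scaling implementation.

```python
from typing import Dict, Set

# Hypothetical scaling groups: group name -> set of member instance IDs.
scaling_groups: Dict[str, Set[str]] = {
    "web-group": {"i-aaa", "i-bbb"},
    "job-group": {"i-ccc"},
}


def on_instance_event(event: Dict[str, str]) -> bool:
    """Remove a released instance from its group; report whether anything changed."""
    if event.get("state") != "Deleted":
        return False  # Only release events affect group membership here.
    for members in scaling_groups.values():
        if event["instanceId"] in members:
            members.discard(event["instanceId"])
            return True
    return False


changed = on_instance_event({"instanceId": "i-bbb", "state": "Deleted"})
print(changed, sorted(scaling_groups["web-group"]))
```

After the removal, the scaling rules would decide whether a replacement instance is needed. That step is omitted here.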

O&M of Cloud-Managed Events


The Operation Orchestration Service (OOS) can automatically manage and execute O&M tasks. Compared with traditional manual or script-based O&M, OOS has a low barrier to entry, standardized security, high efficiency, and easy maintenance, and it is free of charge.


Event O&M is provided in O&M orchestration. First, set event rules to limit the resource range. Then, select the O&M template executed after the event is triggered and set the template parameters.


The preemptible instances provided by Alibaba Cloud are inexpensive. After a user purchases a preemptible instance, the instance is protected for one hour. After one hour, the instance may be released and recycled at any time. An interruption notification for the preemptible instance is issued five minutes before it is recycled.

OOS event O&M can be used to remove the instance from all associated load balancers before it is recycled.

The O&M template has five steps:

Step 1 - eventTrigger: Listens for the release events of preemptible instances.

Step 2 - describeSLB: Queries the IDs of the load balancers to which the preemptible instance being released is attached.

Step 3 - setBackendServers: Sets the weight of the instance to be released to 0 on each load balancer.

Step 4 - waitConnectionExpire: Waits for established network connections to be closed.

Step 5 - removeBackendServers: Removes the instance to be released from the list of load balancing backend servers.
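The five steps can be simulated in Python to show the order of operations. This is a sketch under stated assumptions: it is not an OOS template and calls no Alibaba Cloud API. The `load_balancers` dictionary is an in-memory stand-in, and all function names are hypothetical mirrors of the step names.

```python
import time
from typing import Dict, List

# In-memory stand-in for load balancers: SLB ID -> {instance ID: weight}.
load_balancers: Dict[str, Dict[str, int]] = {
    "lb-1": {"i-spot": 100, "i-other": 100},
    "lb-2": {"i-spot": 50},
}


def describe_slb(instance_id: str) -> List[str]:
    """Step 2: find the load balancers that have the instance as a backend."""
    return [lb for lb, backends in load_balancers.items() if instance_id in backends]


def set_backend_weight(lb_id: str, instance_id: str, weight: int) -> None:
    """Step 3: set the weight so that no new traffic is scheduled."""
    load_balancers[lb_id][instance_id] = weight


def wait_connection_expire(seconds: float) -> None:
    """Step 4: placeholder wait while established connections drain."""
    time.sleep(seconds)


def remove_backend(lb_id: str, instance_id: str) -> None:
    """Step 5: remove the instance from the backend server list."""
    del load_balancers[lb_id][instance_id]


def on_interruption(event: Dict[str, str], drain_seconds: float = 0.0) -> List[str]:
    """Step 1 is the trigger: this handler runs when the event arrives."""
    instance_id = event["instanceId"]
    lb_ids = describe_slb(instance_id)
    for lb_id in lb_ids:
        set_backend_weight(lb_id, instance_id, 0)
    wait_connection_expire(drain_seconds)
    for lb_id in lb_ids:
        remove_backend(lb_id, instance_id)
    return lb_ids


touched = on_interruption({"instanceId": "i-spot"})
print(touched, load_balancers)
```

Setting the weight to 0 before removal stops new connections while existing ones finish. The whole sequence has to fit inside the five-minute notice window.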


Q1: Is there a best practice for a ROS and OOS connection?

A1: ROS defines OOS templates and execution as resources and supports the execution of OOS templates. Since OOS orchestrates through APIs, you can call ROS APIs in OOS to create a stack.

Q2: Do I have to configure the templates of OOS O&M orchestration?

A2: No. OOS provides templates for some common O&M scenarios. You can select one in the console and click Configure.

Q3: What can I do if the performance of my burstable instance is limited?

A3: Your service may be affected. Check whether the burstable instance is being used as expected. You can upgrade the instance or replace it with a non-burstable instance.
