Best Practices for Event-Based Automated Operations and Maintenance (O&M)

01 Why events are so important

System events represent changes in the status of cloud resources. Taking Elastic Compute Service (ECS) as an example, the figure above shows where its system events come from.

To provide ECS to users, Alibaba Cloud relies on the underlying physical infrastructure and an intermediate virtualization layer. The guest OS runs on top of the virtualization layer and serves users.

For O&M system events, Alibaba Cloud is responsible for operating and maintaining the physical infrastructure and the virtualization layer. When compute, storage, or network components fail, Alibaba Cloud issues O&M system events. Handling these events requires the cloud vendor and the user to work together.

A resource state change event does not necessarily indicate a fault or problem, but such events are the basis for implementing an event-driven architecture.

The figure above shows some typical system events.

For unexpected exceptions, instance downtime may interrupt the user's service. If an instance with local disks goes down, Alibaba Cloud cannot decide on the user's behalf whether to migrate the instance, so the user must respond.

For planned O&M, proactive O&M events are the most common, for example an instance restart scheduled for system maintenance. This happens when underlying hardware such as compute, storage, or network has problems that are not serious enough to require an immediate stop.

In this case, Alibaba Cloud sends a planned O&M event to the user after detecting the problem. If the user does not respond within a certain period, Alibaba Cloud migrates the instance to healthy hardware on the user's behalf.

If the user responds, they can choose, within the operation window provided by Alibaba Cloud, the point in time that suits them best and has the least impact on the service, and migrate the instance in advance to avoid the planned restart.

For costs, if an instance will be stopped when it expires, the system issues an event three days before the expiration date. Users need to plan how to renew, for example by enabling automatic renewal.

For status notifications, instance lifecycle state changes reflect changes in the control state of an ECS instance, such as instance creation, startup, shutdown, and release.

The figure above shows several common approaches to cloud event O&M. From bottom to top, the degree of automation increases. Alibaba Cloud recommends that users automate event O&M as much as possible.

At the first layer, users passively receive notifications and log in to the console to handle them manually.

At the second layer, users are aware of the need to handle events and actively subscribe to event notifications.

At the third layer, users have the technical capability to build their own automatic or semi-automatic O&M systems. With an event-driven architecture, they can process event messages of each level as needed and call cloud product APIs to perform O&M during processing.

The fourth layer is managed automated O&M. The user places the event O&M logic entirely on Alibaba Cloud, which hosts and runs it.

02 Make event notification more effective

Next, let's talk about how to make event notification more accurate. CloudMonitor provides a unified query entry point and event alerting for the system events of all cloud products.

At present, more than 100 cloud products send their event information to CloudMonitor, including ECS, cloud databases, and container services.

CloudMonitor provides two types of services to users:

The first is active query. Users can query system events through the console or through OpenAPI.

The second is event notification from the system to the user. Users subscribe to event notifications through CloudMonitor event alert rules.

CloudMonitor notification channels fall into two categories:

The first category targets people. It includes phone calls, SMS, email, and webhook-based tools such as DingTalk and Feishu.

The second category targets automated O&M programs or O&M systems. It includes message queues, Log Service, Function Compute, and URL callbacks. With the corresponding client libraries, events can be processed automatically.

The figure above shows the general format of CloudMonitor events. The system events of more than 100 cloud products are normalized into a unified event model once they are collected by CloudMonitor.

Event data is in JSON format. The outer layer holds the event's common attributes, such as the unique event ID, source, level, name, time, and region.

The content field holds the event content. It is specific to the event type: each cloud product's event logic determines what its events contain.
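As a rough illustration of this two-layer model, the sketch below parses a JSON event into its common attributes plus a free-form content dictionary. The field names (`id`, `product`, `level`, `name`, `regionId`, `time`) and all sample values are assumptions made for the example; consult the actual event payload for the exact schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical sample payload; field names and values are illustrative assumptions.
SAMPLE_EVENT = """
{
  "id": "evt-0001",
  "product": "ECS",
  "level": "CRITICAL",
  "name": "Instance:StateChange",
  "regionId": "cn-hangzhou",
  "time": "2023-01-01T08:00:00Z",
  "content": {"resourceId": "i-example", "state": "Stopped"}
}
"""


@dataclass
class SystemEvent:
    # Common (outer-layer) attributes shared by every event.
    event_id: str
    product: str
    level: str
    name: str
    region: str
    time: str
    # Event-specific payload; its shape depends on the event type.
    content: Dict[str, Any]


def parse_event(raw: str) -> SystemEvent:
    """Split a raw JSON event into common attributes and type-specific content."""
    data = json.loads(raw)
    return SystemEvent(
        event_id=data["id"],
        product=data["product"],
        level=data["level"],
        name=data["name"],
        region=data["regionId"],
        time=data["time"],
        content=data.get("content", {}),
    )


event = parse_event(SAMPLE_EVENT)
```

Keeping `content` as an untyped dictionary mirrors the point above: the outer attributes are uniform, while the inner payload varies by product and event type.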

EventRule is an event filtering rule that matches events by their common attributes and content.

Rule Targets are the targets of event notification. Alert notification means notifying people, through phone calls, SMS, email, DingTalk robots, and so on. Message Service queues, Function Compute, URL callbacks, and Log Service are consumed automatically by programs.

Once an event matches an event alert rule, CloudMonitor routes the event to the corresponding targets. If multiple targets are selected, each target receives a notification.
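The matching and fan-out behavior described above can be sketched in a few lines. This is a conceptual model only, not the service's implementation: the `EventRule` class here, its attributes, and the list-based targets are hypothetical stand-ins for real rules, queues, and contact groups.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

Event = Dict[str, Any]
Target = Callable[[Event], None]


@dataclass
class EventRule:
    """Hypothetical rule: an empty set means 'match anything' for that attribute."""
    products: Set[str] = field(default_factory=set)
    levels: Set[str] = field(default_factory=set)
    names: Set[str] = field(default_factory=set)
    targets: List[Target] = field(default_factory=list)

    def matches(self, event: Event) -> bool:
        return (
            (not self.products or event.get("product") in self.products)
            and (not self.levels or event.get("level") in self.levels)
            and (not self.names or event.get("name") in self.names)
        )


def route(event: Event, rules: List[EventRule]) -> int:
    """Deliver the event to every target of every matching rule; return the delivery count."""
    delivered = 0
    for rule in rules:
        if rule.matches(event):
            for target in rule.targets:
                target(event)
                delivered += 1
    return delivered


# Lists stand in for real targets such as a message queue or an alert contact group.
queue_inbox: List[Event] = []
alert_inbox: List[Event] = []

rules = [
    EventRule(products={"ECS"}, levels={"CRITICAL"},
              targets=[queue_inbox.append, alert_inbox.append]),
    EventRule(products={"RDS"}, targets=[queue_inbox.append]),
]

sample = {"product": "ECS", "level": "CRITICAL", "name": "Instance:StateChange"}
delivered = route(sample, rules)
```

In this run only the first rule matches, so the event is delivered once to each of its two targets.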

To make event notification more effective, use tags to group resources so that alert rules can be applied at a finer granularity. Then create CloudMonitor application groups through dynamic tag rules.

CloudMonitor application groups and tags then have a one-to-one relationship. Users can also create application groups by importing tags in the application management system, which automatically configures the CloudMonitor application groups.

Then create alert rules based on the groups. First, create contact groups for different roles and responsibilities. Next, split the alert rules by event severity and recipient.

In addition, users can use CloudMonitor's filtering capability for fine-grained filtering. A single event type may cover several states, such as instance creation, release, start, and stop. With keyword filtering or SQL Filter, users can precisely select one state within one type of event.
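To make the idea of keyword filtering concrete, here is a minimal sketch that selects one state within one event type. It assumes a simple substring match over the serialized content, which is only an approximation; the real filter semantics, the event name, and the `state` field are assumptions for illustration.

```python
import json
from typing import Any, Dict

Event = Dict[str, Any]


def content_contains(event: Event, keyword: str) -> bool:
    """Return True when the keyword appears anywhere in the serialized content."""
    serialized = json.dumps(event.get("content", {}), sort_keys=True)
    return keyword in serialized


def select(event: Event, name: str, keyword: str) -> bool:
    """Match on the event name first, then narrow down by a keyword in the content."""
    return event.get("name") == name and content_contains(event, keyword)


# Hypothetical events of the same type that differ only in the reported state.
stopped = {"name": "Instance:StateChange", "content": {"state": "Stopped"}}
running = {"name": "Instance:StateChange", "content": {"state": "Running"}}

keep_stopped = select(stopped, "Instance:StateChange", "Stopped")
keep_running = select(running, "Instance:StateChange", "Stopped")
```

A substring match is deliberately crude; a structured condition on a named field is less likely to match unrelated text that happens to contain the keyword.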

For notification methods, users can choose different channels according to the event level. To keep event notifications from turning into spam, DingTalk group alerts are recommended.

After refining the event rules, create an event alert template and apply it to N groups. This copies alert rules in batches and reduces the maintenance burden.
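The template step amounts to stamping out one rule per group from a shared definition. The sketch below shows that expansion with plain dictionaries; the keys (`name`, `product`, `level`, `channel`, `group_id`, `rule_name`) are hypothetical and do not correspond to a specific API.

```python
import copy
from typing import Any, Dict, List


def apply_template(template: Dict[str, Any], group_ids: List[str]) -> List[Dict[str, Any]]:
    """Produce one independent rule per group from a shared template."""
    rules = []
    for group_id in group_ids:
        rule = copy.deepcopy(template)  # keep each rule independent of the template
        rule["group_id"] = group_id
        rule["rule_name"] = f"{template['name']}-{group_id}"
        rules.append(rule)
    return rules


# Hypothetical template and application group IDs.
template = {"name": "ecs-critical", "product": "ECS", "level": "CRITICAL", "channel": "dingtalk"}
group_rules = apply_template(template, ["group-a", "group-b", "group-c"])
```

Because every rule is derived from the same template, a change to the template can be re-applied to all groups instead of being edited N times by hand.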

03 Event-driven O&M architecture

Polling resource state through OpenAPI to detect state changes has poor real-time performance, generates many redundant requests, consumes significant system resources, risks throttling, and depends on the APIs of multiple cloud products, which adds further complexity. Moving to an event-driven architecture improves real-time performance and stability and reduces system consumption.

One customer originally polled cloud product APIs to obtain event information and store it in its O&M system. The O&M team pushed the events to an O&M portal, and each business system was responsible for responding to them. The business systems did not access the Alibaba Cloud console directly.

Because the customer used multiple cloud products, the O&M system contained N pieces of polling code, and real-time performance was poor.

The customer then transformed the system into an event-driven architecture. In the initialization phase, the system sets event alert rules in CloudMonitor. After a cloud product publishes an event, CloudMonitor detects a match and pushes the event to the message queue specified by the customer. The customer's O&M system pulls event messages from the queue, stores them, and pushes them to the O&M portal.

As a result, the code logic is simpler, resource consumption is lower, and real-time performance and stability are improved.
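The consumer side of this architecture (pull from the queue, store, push to the portal) can be sketched as follows. Python's in-process `queue.Queue` stands in for the real message queue, and the store and portal are plain lists; all of these are assumptions made to keep the example self-contained.

```python
import queue
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def drain_events(
    mq: "queue.Queue[Event]",
    store: List[Event],
    push_to_portal: Callable[[Event], None],
) -> int:
    """Pull every pending event, persist it, forward it, and return the count handled."""
    handled = 0
    while True:
        try:
            event = mq.get_nowait()
        except queue.Empty:
            break  # nothing left to consume right now
        store.append(event)    # stand-in for writing to the O&M system's storage
        push_to_portal(event)  # stand-in for pushing to the O&M portal
        mq.task_done()
        handled += 1
    return handled


# Simulate events that the routing layer has already pushed into the queue.
mq: "queue.Queue[Event]" = queue.Queue()
mq.put({"id": "evt-1", "name": "Instance:StateChange"})
mq.put({"id": "evt-2", "name": "Instance:StateChange"})

event_store: List[Event] = []
portal_feed: List[Event] = []
handled = drain_events(mq, event_store, portal_feed.append)
```

Compared with polling, the O&M system needs a single consumer loop like this one rather than separate polling code per cloud product.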

For Auto Scaling, when an instance is released, the ECS control plane issues an instance state change event. CloudMonitor routes it to the Auto Scaling message queue according to the rules. Auto Scaling then consumes the event message to obtain the instance information, removes the instance from the scaling group, and performs the next operation according to the scaling rules.
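A simplified view of that handler is sketched below. It assumes the scaling rule is just a minimum group size and that the instance ID is carried in `content.resourceId`; both are assumptions for illustration, not the behavior of the actual Auto Scaling service.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class ScalingGroup:
    min_size: int  # hypothetical scaling rule: keep at least this many instances
    instances: Set[str] = field(default_factory=set)


def on_instance_released(group: ScalingGroup, event: Dict[str, Any]) -> int:
    """Remove the released instance and return how many replacements are needed."""
    instance_id = event["content"]["resourceId"]
    group.instances.discard(instance_id)  # no error if the instance is not in the group
    return max(0, group.min_size - len(group.instances))


group = ScalingGroup(min_size=2, instances={"i-a", "i-b"})
release_event = {
    "name": "Instance:StateChange",
    "content": {"resourceId": "i-a", "state": "Deleted"},
}
to_launch = on_instance_released(group, release_event)
```

Here the group drops to one instance, so the handler reports that one replacement is needed to satisfy the minimum size.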

04 Managed event O&M on the cloud

Operation Orchestration Service (OOS) can automatically manage and execute O&M tasks. Compared with traditional manual or script-based O&M, it has a low barrier to entry, is standardized and secure, efficient, easy to maintain, and free of charge.

OOS provides an event O&M function. First, set event rules to limit the resource scope. Then select the O&M template to execute when the event is triggered, and set the template parameters.

Preemptible instances provided by Alibaba Cloud are relatively cheap. After purchase, a preemptible instance has a one-hour protection period. After that hour, the instance may be released and reclaimed at any time. Five minutes before reclamation, an instance interruption notification is issued for the preemptible instance.

With OOS event O&M, the instance is removed from all associated load balancers before it is reclaimed, so it goes offline gracefully without affecting the business.

There are five steps in the O&M template.

First, eventTrigger: listen for release events of preemptible instances.

Second, describeSLB: query the IDs of the load balancers associated with the preemptible instance that is about to be released.

Third, setBackendServers: set the weight of the instance to be released to 0 on those load balancers.

Fourth, waitConnectionExpire: wait for established network connections to close.

Fifth, removeBackendServers: remove the instance to be released from the backend server lists of the load balancers.
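The control flow of steps two through five can be illustrated with a small simulation. This is not an OOS template and does not call any real SDK: `FakeLoadBalancers`, its methods, and the instance and load balancer IDs are hypothetical stand-ins that only show the order of operations once the event trigger has fired.

```python
import time
from typing import Dict, List


class FakeLoadBalancers:
    """In-memory stand-in for a load balancer API; not a real SDK client."""

    def __init__(self, backends: Dict[str, Dict[str, int]]) -> None:
        # Maps load balancer ID -> {instance ID -> weight}.
        self.backends = backends

    def describe_slb(self, instance_id: str) -> List[str]:
        return sorted(lb for lb, servers in self.backends.items() if instance_id in servers)

    def set_backend_weight(self, lb_id: str, instance_id: str, weight: int) -> None:
        self.backends[lb_id][instance_id] = weight

    def remove_backend(self, lb_id: str, instance_id: str) -> None:
        del self.backends[lb_id][instance_id]


def drain_instance(slb: FakeLoadBalancers, instance_id: str, wait_seconds: float = 0.0) -> List[str]:
    """Run steps 2 to 5 for one instance and return the affected load balancer IDs."""
    lb_ids = slb.describe_slb(instance_id)             # describeSLB
    for lb_id in lb_ids:
        slb.set_backend_weight(lb_id, instance_id, 0)  # setBackendServers: stop new traffic
    time.sleep(wait_seconds)                           # waitConnectionExpire
    for lb_id in lb_ids:
        slb.remove_backend(lb_id, instance_id)         # removeBackendServers
    return lb_ids


slb = FakeLoadBalancers({
    "lb-1": {"i-spot": 100, "i-other": 100},
    "lb-2": {"i-spot": 100},
    "lb-3": {"i-other": 100},
})
drained = drain_instance(slb, "i-spot")
```

Setting the weight to 0 before removal is what makes the shutdown graceful: new requests stop arriving while existing connections are given time to finish. In practice the wait must fit within the five-minute interruption notice.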

Q&A

Q1 Are there any best practices for integrating ROS and OOS in resource orchestration?

A: They can be integrated without any problem. ROS defines OOS templates and executions as resources and supports executing OOS templates. Because OOS orchestrates through APIs, you can call ROS APIs in OOS to create a resource stack.

Q2 Do I have to configure the templates for OOS orchestration myself?

A: No. OOS provides templates for some common O&M scenarios, which can be set up in the console by clicking Configure.

Q3 What if the performance of a burstable instance is limited?

A: The service may be affected. Check whether your use of the burstable instance meets expectations, and consider upgrading the instance or replacing it with a non-burstable instance.
