Best Practices for Event-Based Automated Operation and Maintenance

System events represent changes in the state of cloud resources. Take Elastic Compute Service (ECS) as an example. The figure above shows where its system events come from.

Providing cloud servers to users requires the underlying physical infrastructure and an intermediate virtualization layer. The guest OS runs on top of the virtualization layer and ultimately serves the user's workloads.

For O&M system events, Alibaba Cloud is responsible for operating and maintaining the physical infrastructure and the virtualization services. When a compute, storage, or network component fails, Alibaba Cloud sends an O&M-type system event. Handling these events requires the cloud provider and the user to work together.

Resource state change events do not necessarily indicate a fault or problem, but they are the basis for implementing an event-driven architecture.

The figure above shows some typical system events.

For unexpected exceptions, instance downtime may interrupt the user's service. If an instance with local disks goes down, Alibaba Cloud cannot decide on the user's behalf whether to migrate the instance, so the user must respond.

For planned O&M, the most common event is proactive maintenance, in which an instance is restarted under a system maintenance plan. This happens when underlying hardware such as compute, storage, or network has a problem that is not serious enough to require an immediate shutdown.

In this case, Alibaba Cloud sends a planned O&M event to the user once the problem is detected. If the user does not respond within a certain period, Alibaba Cloud migrates the instance to healthy hardware on the user's behalf.

A user who does respond can pick the time within the operation window provided by Alibaba Cloud that has the least impact on the service, and migrate the instance in advance to avoid the planned restart.
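
As a rough illustration of choosing a time inside the operation window, the sketch below picks the hour with the lowest expected load. The function name `pick_migration_slot` and the hourly load forecast are assumptions made for this example and are not part of any Alibaba Cloud API.

```python
from datetime import datetime, timedelta
from typing import Dict


def pick_migration_slot(window_start: datetime,
                        window_end: datetime,
                        hourly_load: Dict[int, float]) -> datetime:
    """Return the hourly slot in [window_start, window_end) with the lowest load."""
    candidates = []
    slot = window_start
    while slot < window_end:
        candidates.append(slot)
        slot += timedelta(hours=1)
    if not candidates:
        raise ValueError("the operation window is empty")
    # Hours missing from the (hypothetical) forecast are treated as worst case.
    return min(candidates, key=lambda s: hourly_load.get(s.hour, float("inf")))


# Hypothetical window and load forecast (requests per second by hour of day).
window_start = datetime(2024, 1, 1, 20, 0)
window_end = datetime(2024, 1, 2, 6, 0)
forecast = {20: 80, 21: 60, 22: 40, 23: 30, 0: 20, 1: 10, 2: 5, 3: 8, 4: 15, 5: 25}
best = pick_migration_slot(window_start, window_end, forecast)
```

With this forecast the quietest slot in the window is 02:00 on the second day, so that is where the migration would be scheduled.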

For billing, when an instance is about to expire and be stopped, the system sends an event three days before the expiration. Users need to plan how to renew, for example by enabling automatic renewal.

For status notifications, a change in instance lifecycle status reflects a change in the control-plane state of an ECS instance, for example instance creation, startup, shutdown, or release.

The figure above shows several common ways of handling cloud events. From bottom to top, the degree of automation increases. Alibaba Cloud recommends automating event handling as far as possible.

In the first layer, users passively receive notifications and log in to the console to handle them manually.

In the second layer, users are aware of the need to handle events and actively subscribe to event notifications.

In the third layer, users have the technical capability to build their own automatic or semi-automatic O&M system. Through an event-driven architecture, they process event messages of various levels as needed and call cloud product APIs during processing to perform O&M.

The fourth layer is hosted automated O&M. Users place the entire event-handling logic on Alibaba Cloud and run it as a managed service.

02 Make event notification more effective

Next, let's look at how to make event notification more accurate. CloudMonitor provides a unified query entry point and event alerting for the system events of all cloud products.

At present, more than 100 cloud products send their event information to CloudMonitor, including ECS, cloud databases, and Container Service.

CloudMonitor provides two types of services on the user side:

First, users can actively query system events through the console or the OpenAPI.

Second, the system can push event notifications to users, who subscribe by creating event alert rules in CloudMonitor.

The notification channels of CloudMonitor fall into two categories:

The first category targets people. It includes phone calls, SMS, email, and webhook-based tools such as DingTalk and Feishu.

The second category targets automated O&M programs or O&M systems. It includes message queues, Log Service, Function Compute, and URL callbacks. With the corresponding client libraries, events can be processed automatically.

The figure above shows the general format of CloudMonitor events. The system events of more than 100 cloud products are normalized into a unified event model once they are collected by CloudMonitor.

The event data is in JSON format. The outer layer carries the public attributes of the event, such as its unique ID, source, level, name, time, and location.

The content field carries the event body. Its structure depends on the event type, because each cloud product defines its own event logic.
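
To make the two-layer structure concrete, here is a minimal sketch that parses an event into its public attributes and its type-specific content. The field names (`id`, `product`, `level`, `name`, `time`, `regionId`, `content`) and the sample values are assumptions chosen for illustration. Check the actual event payload for the exact keys.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SystemEvent:
    # Public attributes shared by every event (hypothetical key names).
    event_id: str
    product: str
    level: str
    name: str
    time: str
    region: str
    # Type-specific body; its shape depends on the event type.
    content: Dict[str, Any]


def parse_event(raw: str) -> SystemEvent:
    """Split a JSON event into public attributes and the content body."""
    doc = json.loads(raw)
    return SystemEvent(
        event_id=doc["id"],
        product=doc["product"],
        level=doc["level"],
        name=doc["name"],
        time=doc["time"],
        region=doc["regionId"],
        content=doc.get("content", {}),
    )


# Hypothetical sample event, not copied from real output.
SAMPLE_EVENT = json.dumps({
    "id": "evt-0001",
    "product": "ECS",
    "level": "INFO",
    "name": "Instance:StateChange",
    "time": "2024-01-01T08:00:00Z",
    "regionId": "cn-hangzhou",
    "content": {"instanceId": "i-example", "state": "Stopped"},
})
event = parse_event(SAMPLE_EVENT)
```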

An event rule (EventRule) is a filter that matches events by their public attributes and content.

Rule targets are the destinations of event notifications. Alert notifications go to people through phone calls, SMS, email, or DingTalk chatbots. Message Service queues, Function Compute, URL callbacks, and Log Service are meant for programmatic consumption.

When an event matches an event alert rule, CloudMonitor routes it to the corresponding target. If multiple targets are selected, each of them is notified.
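
The matching and fan-out behavior can be sketched in a few lines. This is a simplified in-memory model rather than CloudMonitor's implementation. The `EventRule` class here, its `products` and `levels` fields, and the callable targets are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

Event = Dict[str, Any]
Target = Callable[[Event], None]


@dataclass
class EventRule:
    products: Set[str]
    levels: Set[str]
    targets: List[Target] = field(default_factory=list)

    def matches(self, event: Event) -> bool:
        # Match on public attributes only; content filtering is left out here.
        return (event.get("product") in self.products
                and event.get("level") in self.levels)


def route(event: Event, rules: List[EventRule]) -> int:
    """Deliver the event to every target of every matching rule."""
    delivered = 0
    for rule in rules:
        if rule.matches(event):
            for target in rule.targets:
                target(event)
                delivered += 1
    return delivered


# Two stand-in targets: one for people, one for programs.
sms_outbox: List[Event] = []
message_queue: List[Event] = []
rule = EventRule({"ECS"}, {"CRITICAL"}, [sms_outbox.append, message_queue.append])

ecs_event = {"product": "ECS", "level": "CRITICAL", "name": "Instance:SystemFailure"}
delivered = route(ecs_event, [rule])
```

Because the rule lists two targets, the single matching event is delivered twice, once to each target.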

To make event notification more effective, Alibaba Cloud recommends grouping resources by tags so that alert rules can be scoped more narrowly. Then create CloudMonitor application groups through dynamic tag rules.

Application groups in CloudMonitor therefore map one-to-one to tags. Users can also create application groups by importing tags in the application management system, which then configures the CloudMonitor application groups automatically.

Then create alert rules per group. First, create contact groups for the different roles and responsibilities. Next, split the alert rules by event severity and recipient.

In addition, users can use CloudMonitor's filtering capability for fine-grained filtering. A single event type may cover several states, such as instance creation, release, start, and stop. With keyword filtering or SQLFilter, users can single out one specific state within an event type.
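
A simple way to picture keyword filtering is a substring match against the serialized event content. The sketch below is an assumption about how such a filter could behave and does not describe the actual matching semantics of CloudMonitor or SQLFilter. The `content_matches` helper and the `state` key are hypothetical.

```python
import json
from typing import Any, Dict, Iterable


def content_matches(event: Dict[str, Any],
                    keywords: Iterable[str],
                    require_all: bool = False) -> bool:
    """Return True if the serialized content contains any (or all) keywords."""
    text = json.dumps(event.get("content", {}), sort_keys=True)
    hits = [kw in text for kw in keywords]
    return all(hits) if require_all else any(hits)


# Two events of the same type that differ only in the reported state.
stopped = {"name": "Instance:StateChange", "content": {"state": "Stopped"}}
running = {"name": "Instance:StateChange", "content": {"state": "Running"}}

# Keep only the "Stopped" state from this event type.
wanted = [e for e in (stopped, running) if content_matches(e, ["Stopped"])]
```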

When selecting notification methods, choose them according to the event level. To keep event notifications from turning into spam, DingTalk group alerts are recommended.

After the event rules are refined, create an event alert template and apply it to N groups. This copies the alert rules in batches and reduces the maintenance burden.
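
The template idea can be sketched as a list of rules that is stamped onto each application group. The `AlertRule` type, the level names, and the channel names below are illustrative assumptions. The mapping of levels to channels is one possible choice, not a prescribed one.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class AlertRule:
    name: str
    level: str
    channels: Tuple[str, ...]
    group: str = ""  # Empty while the rule is still part of a template.


# Hypothetical template: louder channels for more severe events.
TEMPLATE = [
    AlertRule("critical-events", "CRITICAL", ("phone", "sms")),
    AlertRule("warning-events", "WARN", ("dingtalk-group",)),
]


def apply_template(template: List[AlertRule], groups: List[str]) -> List[AlertRule]:
    """Copy every template rule into every application group."""
    return [replace(rule, group=g) for g in groups for rule in template]


rules = apply_template(TEMPLATE, ["web", "db", "cache"])
```

Changing the template and re-applying it updates all groups in one pass, which is where the reduced maintenance effort comes from.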

03 Event-driven operation and maintenance architecture

Polling resource status through OpenAPI to detect state changes has poor real-time performance, generates many redundant requests, consumes considerable system resources, risks API throttling, and depends on the APIs of multiple cloud products, which adds complexity. Moving to an event-driven architecture improves real-time performance and stability and reduces system consumption.

A customer originally obtained event information by polling cloud product APIs and stored it in its O&M system. The O&M team pushed the events to an O&M portal, and each business system was responsible for responding to them. The business systems did not access the Alibaba Cloud console directly.

Because the customer used multiple cloud products, the O&M system contained N separate pieces of polling code, and real-time performance was poor.

The system was then transformed into an event-driven architecture. In the initialization phase, the system sets event alert rules in CloudMonitor. After a cloud product publishes an event, CloudMonitor detects that the event matches a rule and pushes it to the customer's designated message queue. The customer's O&M system pulls event messages from the queue, stores them, and pushes them to the O&M portal.

This simplified the code logic, reduced resource consumption, and improved real-time performance and stability.
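
The consumer side of this design can be sketched as a single loop that replaces the per-product polling code. In the sketch below, Python's in-memory `queue.Queue` stands in for the customer's message queue, and plain lists stand in for the O&M store and portal. All names are assumptions made for the example.

```python
import queue
from typing import Any, Dict, List

Event = Dict[str, Any]


def drain_events(events: "queue.Queue[Event]",
                 store: List[Event],
                 portal: List[str]) -> int:
    """Pull all pending events, persist them, and push a summary to the portal."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        store.append(event)                                   # persist in the O&M system
        portal.append(f"{event['product']}:{event['name']}")  # notify the portal
        handled += 1


# One queue carries events from every cloud product.
pending: "queue.Queue[Event]" = queue.Queue()
pending.put({"product": "ECS", "name": "Instance:StateChange"})
pending.put({"product": "RDS", "name": "Instance:Failover"})

store: List[Event] = []
portal: List[str] = []
handled = drain_events(pending, store, portal)
```

One consumer handles events from any product, so adding a new cloud product means adding an alert rule rather than another polling loop.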

In Auto Scaling, after an instance is released, the ECS control plane publishes an instance status change event. CloudMonitor routes it to the Auto Scaling message queue according to the rules. Auto Scaling then consumes the event message to obtain the instance information, removes the instance from the scaling group, and performs the next operation according to the scaling rules.
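
The reaction to a release event can be modeled roughly as follows. This is a toy model and does not describe how Auto Scaling is implemented. The `ScalingGroup` class, the `instanceId` key, and the "keep at least `min_size` instances" rule are assumptions used to give "the next operation" a concrete form.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class ScalingGroup:
    min_size: int
    instances: Set[str] = field(default_factory=set)

    def on_instance_released(self, event: Dict[str, Any]) -> int:
        """Remove the released instance and return how many to launch."""
        # discard() ignores instances that are not in the group.
        self.instances.discard(event["content"]["instanceId"])
        # Assumed scaling rule: top the group back up to min_size.
        return max(0, self.min_size - len(self.instances))


group = ScalingGroup(min_size=2, instances={"i-a", "i-b"})
to_launch = group.on_instance_released({"content": {"instanceId": "i-a"}})
```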

04 Cloud-hosted event operation and maintenance

CloudOps Orchestration Service (OOS) can automatically manage and execute O&M tasks. Compared with traditional manual or script-based O&M, it has a low barrier to entry, is standardized and secure, efficient, easy to maintain, and free of charge.

OOS provides an event-driven O&M feature. First, set event rules to limit the resource scope. Then select the O&M template to execute when the event is triggered, and set the template parameters.

Preemptible instances are a relatively cheap instance type offered by Alibaba Cloud. A newly purchased preemptible instance has a one-hour protection period. After that hour, the instance may be released and reclaimed at any time. Five minutes before reclamation, the preemptible instance issues an interruption notification.
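
The timing described here can be worked through with simple date arithmetic. The one-hour and five-minute values come from the text. The helper names and the sample timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta

PROTECTION_PERIOD = timedelta(hours=1)      # from the text
INTERRUPTION_NOTICE = timedelta(minutes=5)  # from the text


def earliest_reclaim(created_at: datetime) -> datetime:
    """The instance cannot be reclaimed before the protection period ends."""
    return created_at + PROTECTION_PERIOD


def notice_time(reclaim_at: datetime) -> datetime:
    """The interruption notification arrives five minutes before reclamation."""
    return reclaim_at - INTERRUPTION_NOTICE


def can_drain_in_time(drain_seconds: float) -> bool:
    """Check that connection draining fits inside the notice period."""
    return timedelta(seconds=drain_seconds) < INTERRUPTION_NOTICE


created = datetime(2024, 1, 1, 9, 0)
reclaim = datetime(2024, 1, 1, 10, 30)  # hypothetical reclamation time
first_possible = earliest_reclaim(created)
notified = notice_time(reclaim)
```

Any cleanup triggered by the notification therefore has less than five minutes to finish, which limits how long it can wait for connections to drain.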

With OOS event-driven O&M, the instance can be removed from all associated load balancers before it is reclaimed, taking it offline gracefully without affecting the business.

The O&M template has five steps.

First, eventTrigger: listen for the release event of the preemptible instance.

Second, describeSLB: query the IDs of the load balancers associated with the preemptible instance that is about to be released.

Third, setBackendServers: set the weight of the instance to 0 on each load balancer.

Fourth, waitConnectionExpire: wait for established network connections to close.

Fifth, removeBackendServers: remove the instance from the backend server lists of the load balancers.
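
The five steps can be sketched as a single handler. This is a plain Python simulation against an in-memory stand-in for a load balancer. It is not an OOS template and makes no real API calls. `FakeLoadBalancer`, `graceful_offline`, the `instanceId` key, and the 30-second drain time are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, List


class FakeLoadBalancer:
    """In-memory stand-in: maps backend instance ID to its weight."""

    def __init__(self, lb_id: str, backends: Dict[str, int]) -> None:
        self.lb_id = lb_id
        self.backends = dict(backends)


def graceful_offline(event: Dict[str, Any],
                     load_balancers: List[FakeLoadBalancer],
                     drain_seconds: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    # Step 1, eventTrigger: the handler is invoked with the interruption event.
    instance_id = event["content"]["instanceId"]
    # Step 2, describeSLB: find the load balancers that serve this instance.
    attached = [lb for lb in load_balancers if instance_id in lb.backends]
    # Step 3, setBackendServers: weight 0 stops new connections.
    for lb in attached:
        lb.backends[instance_id] = 0
    # Step 4, waitConnectionExpire: give existing connections time to finish.
    sleep(drain_seconds)
    # Step 5, removeBackendServers: detach the instance.
    for lb in attached:
        del lb.backends[instance_id]
    return [lb.lb_id for lb in attached]


lbs = [
    FakeLoadBalancer("lb-1", {"i-spot": 100, "i-other": 100}),
    FakeLoadBalancer("lb-2", {"i-other": 100}),
]
waits: List[float] = []  # record the wait instead of actually sleeping
touched = graceful_offline({"content": {"instanceId": "i-spot"}}, lbs,
                           drain_seconds=30.0, sleep=waits.append)
```

Setting the weight to 0 before removal is what makes the shutdown graceful: new requests go elsewhere while in-flight connections are allowed to complete.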

Q&A

Q1 Is there a best-practice case for combining Resource Orchestration Service (ROS) with OOS?

A: Yes. ROS defines OOS templates and executions as resources, so OOS templates can be run from ROS. Conversely, because OOS orchestrates through APIs, an OOS template can call the ROS API to create a resource stack.

Q2 Do OOS templates have to be written by the user?

A: No. OOS provides templates for common O&M scenarios, which can be configured directly in the console.

Q3 What happens if the performance of a burstable instance is limited?

A: The service may be degraded. Check whether the burstable instance is being used as expected. The instance can be upgraded or replaced with a non-burstable instance.
