Exploration and practice of cloud server observability - Apsara 2022

Date: Oct 1, 2022

On October 22, 2021, at the "Operation and Maintenance Best Practices on the Cloud" sub-forum of the Apsara Conference 2022, Jiang Wenfeng, a senior technical expert at Alibaba, delivered a speech titled "Exploration and Practice of Cloud Server Observability". This article is organized from his speech and introduces the exploration and practice of cloud server observability in three parts:
1. The value of observability
2. Cloud server observability solution
3. Summary

1. The value of observability

What is observability, and why is it so important for cloud servers? In layman's terms, observability is the ability to understand the internal operation of a cloud server. In my opinion, it matters to cloud servers for three main reasons: it improves certainty, simplifies operation and maintenance, and improves information transparency.

Neither physical machines nor cloud servers can be 100% reliable. A cloud server with complete observability capabilities can comprehensively scan its operating indicators and internal states to obtain a full picture of the server, which improves information transparency and avoids black boxes. In abnormal scenarios, the results of this scan also help quickly locate the cause of the problem and simplify operation and maintenance.

2. Cloud server observability solution

This part covers the overall observability solution for cloud servers. Let's take a look at how Alibaba Cloud does it.
Everything is data. Relying on a powerful data center, Alibaba Cloud collects nearly 100 TB of data from nearly 100 million acquisition units every day. This data reflects the operating states, indicators, and parameters within the cloud servers. After collection, the data is cleaned to remove noise, and correlation analysis is performed to match it against the defined indicators. Finally, feature calculation produces a true picture of how each cloud server is running. The processed data is then output to two types of products. The first type is our internal operation and maintenance assurance platform, which is a major solution for Alibaba Cloud to actively maintain the stability of the cloud platform.
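The clean, correlate, and feature-calculation steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample tuples, the metric names, the valid ranges, and the choice of the mean as the "feature" are all hypothetical and are not taken from Alibaba Cloud's actual pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (unit_id, metric, value). None marks a failed read.
RAW = [
    ("unit-1", "cpu_util", 35.0),
    ("unit-1", "cpu_util", None),
    ("unit-1", "cpu_util", 45.0),
    ("unit-1", "disk_latency_ms", 4.0),
    ("unit-2", "cpu_util", 250.0),  # impossible percentage, treated as noise
    ("unit-2", "cpu_util", 90.0),
]

# Assumed valid range per metric, used only to filter obvious noise.
VALID_RANGE = {"cpu_util": (0.0, 100.0), "disk_latency_ms": (0.0, 10_000.0)}


def clean(samples):
    """Drop missing values, unknown metrics, and out-of-range values."""
    kept = []
    for unit, metric, value in samples:
        if value is None or metric not in VALID_RANGE:
            continue
        low, high = VALID_RANGE[metric]
        if low <= value <= high:
            kept.append((unit, metric, value))
    return kept


def correlate(samples):
    """Group the cleaned samples by acquisition unit and by metric."""
    grouped = defaultdict(lambda: defaultdict(list))
    for unit, metric, value in samples:
        grouped[unit][metric].append(value)
    return grouped


def features(grouped):
    """Compute one simple feature (the mean) per unit and metric."""
    return {
        unit: {metric: mean(values) for metric, values in metrics.items()}
        for unit, metrics in grouped.items()
    }


profile = features(correlate(clean(RAW)))
print(profile)
```

A real pipeline would compute far richer features, but the three stages keep the same order: noise is removed before samples are matched to indicators, and features are derived last.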

The other category is the observability products delivered to the user side, that is, the operation and maintenance products provided to meet the three goals of deterministic operation, simplified operation and maintenance, and information transparency. There are four products: cloud monitoring, ECS system events, health diagnosis, and health status. Below we introduce these four products one by one.

1. Cloud monitoring
Monitoring systems should be familiar to everyone. Cloud monitoring is Alibaba Cloud's monitoring and alarm service for cloud resources and Internet applications. What advantages does it have over traditional monitoring services? I will focus on the first two points:
1. Natural integration. There is no need to purchase or activate it; as long as you have an Alibaba Cloud account, you can use it immediately.
2. Flexible alarms. Alarm rules can be configured flexibly, and the alarm push channels are flexible and rich. The push channels fall into two categories. The first is message delivery channels, such as the familiar DingTalk and SMS. The second, and more important, is automatic processing channels, which lay the foundation for automated operation and maintenance. Automatic processing channels include function computing, operation and maintenance orchestration, message service, and log service.
Regarding cloud monitoring, let's focus on its powerful host monitoring items. In addition to the common CPU, memory, load, disk, and network card metrics, cloud monitoring can also monitor processes. Through process monitoring, you can know whether a process is alive and how many resources it currently consumes. Cloud monitoring is therefore the most basic and most commonly used means of observation.
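To make the idea of flexible alarm rules and multiple push channels concrete, here is a minimal sketch. It is not the cloud monitoring API: the `AlarmRule` fields, the "N consecutive samples above a threshold" semantics, and the channel callables are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlarmRule:
    """Hypothetical rule: fire when the last N samples all exceed a threshold."""
    metric: str
    threshold: float
    consecutive: int = 3

    def __post_init__(self):
        if self.consecutive < 1:
            raise ValueError("consecutive must be at least 1")


def is_firing(rule: AlarmRule, samples: List[float]) -> bool:
    """Return True if the most recent `consecutive` samples all breach the rule."""
    if len(samples) < rule.consecutive:
        return False
    return all(value > rule.threshold for value in samples[-rule.consecutive:])


def push(rule: AlarmRule, samples: List[float],
         channels: List[Callable[[str], None]]) -> int:
    """Send one message to every channel if the rule fires; return the send count."""
    if not is_firing(rule, samples):
        return 0
    message = f"{rule.metric} above {rule.threshold} for {rule.consecutive} samples"
    for send in channels:
        send(message)
    return len(channels)


if __name__ == "__main__":
    inbox = []  # stands in for a message channel such as SMS
    rule = AlarmRule("cpu_util", 80.0, consecutive=3)
    push(rule, [50.0, 85.0, 90.0, 95.0], [inbox.append])
    print(inbox)
```

Because a channel is just a callable, a message channel and an automatic processing channel can be mixed in the same list, which mirrors the two channel categories described above.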

2. ECS system events
Alibaba Cloud proactively reports low-level O&M events and unexpected maintenance events that affect the running of ECS instances, and provides maintenance suggestions to users. How do ECS system events improve the observability of cloud servers?
1. They actively report underlying problems, which improves the certainty of server operation.
2. They simplify operation and maintenance. After a system event is reported, we can subscribe to it to process the event automatically, which improves event-handling efficiency.
3. Being event-driven improves system efficiency. As we all know, in asynchronous scenarios PUSH mode has an obvious efficiency advantage over PULL mode. Take a very familiar example: to create an ECS instance, we generally call the RunInstances API first to get an instance ID, and then repeatedly call the DescribeInstances interface to query the instance status until it becomes Running. Besides the complexity of the client-side polling code, the efficiency is low. In the event-driven mode, you only need to subscribe to the instance state change event, and when the state becomes Running, the subsequent business logic is triggered automatically, which is simple and efficient. The right side of the above figure is the flow chart of the ECS event service. Event push directly reuses the cloud monitoring exception push channel, which lays the foundation for automating event processing.
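The difference between the two modes can be shown with a small simulation. This sketch does not call any real ECS API: `FakeInstance`, its state sequence, and the `tick` method that stands in for the passage of time are all hypothetical.

```python
from typing import Callable, List


class FakeInstance:
    """Stand-in for an ECS instance whose status advances on each tick."""

    def __init__(self) -> None:
        self._states = ["Pending", "Starting", "Running"]
        self._index = 0
        self._subscribers: List[Callable[[str], None]] = []

    def describe(self) -> str:
        """PULL: the caller asks for the current status."""
        return self._states[self._index]

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """PUSH: the caller registers once and is notified on every change."""
        self._subscribers.append(callback)

    def tick(self) -> None:
        """Advance to the next state and notify subscribers of the change."""
        if self._index < len(self._states) - 1:
            self._index += 1
            for callback in self._subscribers:
                callback(self.describe())


def wait_by_polling(instance: FakeInstance, max_polls: int = 10) -> int:
    """Poll until Running; return how many status queries were needed."""
    for polls in range(1, max_polls + 1):
        if instance.describe() == "Running":
            return polls
        instance.tick()  # time passes between two polls
    raise TimeoutError("instance did not reach Running")


def wait_by_event(instance: FakeInstance) -> List[str]:
    """Subscribe once; business logic runs only when Running is pushed."""
    triggered: List[str] = []

    def on_change(status: str) -> None:
        if status == "Running":
            triggered.append("start business logic")

    instance.subscribe(on_change)
    instance.tick()
    instance.tick()
    return triggered


if __name__ == "__main__":
    print(wait_by_polling(FakeInstance()))  # number of status queries issued
    print(wait_by_event(FakeInstance()))
```

In the polling version the caller owns the loop, the retry limit, and the timeout handling; in the subscription version the caller issues no status queries at all.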

Now that we have a basic understanding of ECS system events, let's focus on how event processing can be automated. The left side of the above figure is the current event classification. Focus on the right side, which shows two solutions for automating event processing:
• The first is to push system events to the function computing service through cloud monitoring. Specific events trigger specific functions, which handle the events automatically.
• The second is to push events to the O&M orchestration service. Specific events trigger specific O&M orchestration templates that we have set up in advance, which handle the events automatically.
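Both solutions share the same shape: an event name selects a handler that was prepared in advance. A minimal sketch of that routing follows. The event names, the event dictionary layout, and the handler actions are hypothetical and do not correspond to the real ECS event catalog, function computing, or O&M orchestration templates.

```python
from typing import Callable, Dict

# Registry mapping a (hypothetical) event name to its handler.
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def on_event(name: str):
    """Decorator that registers a handler for one event name."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[name] = func
        return func
    return register


@on_event("maintenance.reboot_scheduled")
def drain_then_reboot(event: dict) -> str:
    return f"drain traffic from {event['instance_id']}, then reboot it"


@on_event("instance.state_change")
def continue_deployment(event: dict) -> str:
    return f"{event['instance_id']} is {event['state']}, continue deployment"


def dispatch(event: dict) -> str:
    """Route an event to its handler, or fall back to manual processing."""
    handler = HANDLERS.get(event["name"])
    if handler is None:
        return "no handler registered, leave for manual processing"
    return handler(event)


if __name__ == "__main__":
    print(dispatch({"name": "maintenance.reboot_scheduled", "instance_id": "i-001"}))
    print(dispatch({"name": "unknown.event", "instance_id": "i-002"}))
```

Keeping the fallback explicit matters in practice: an event without a handler should still be visible to an operator rather than silently dropped.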
ECS system events actively report the underlying events that affect the running of an instance. They are an important part of cloud server observability and go a long way toward solving the problem of deterministic running. But that is not enough. In practice, the probability of serious problems on the cloud platform is very small; in general, the cloud platform is very stable. Most operation and maintenance problems are related to how the user operates and uses the server, which means that problems often occur within the guest OS and the customer's application. System events cover only a limited set of exceptions inside the guest OS. Therefore, to further improve the observability of cloud servers, we have launched a diagnostic service. The diagnostic service is divided into three products: instance health diagnosis, instance health status, and network connectivity diagnosis.

3. Health diagnosis
First, let's look at instance health diagnosis, a service that comprehensively detects problems in the guest OS as well as software and hardware problems of the cloud platform that the cloud server depends on. The diagnostic items fall into two categories: guest OS diagnostic items and cloud platform diagnostic items.

Today, I will focus on the guest OS diagnostic items. What problems in the guest OS can health diagnosis currently detect?
1. First, health diagnosis can find common resource problems such as a fully loaded CPU, insufficient memory, and insufficient disk space, along with the top 5 processes that occupy the most resources.
2. Second, health diagnosis can find common problems with network settings, disk settings, and file system settings. Take network settings as an example: it checks whether the network card is up, whether network services are running, whether network card multi-queue is enabled to ensure network performance, and whether the network card IP configuration is correct. For example, we often encounter instances that should use DHCP to obtain an IP address dynamically but become unreachable because a custom image configured a static IP.
3. Health diagnosis also shows whether the services that affect the normal operation of the instance are running normally, for example whether common ports are listening (such as port 22 on Linux and port 3389 on Windows), whether the DHCP process that dynamically allocates the IP exists, and whether systemd, which is responsible for system initialization, is running normally.
4. Health diagnosis can also check whether a custom firewall or a custom routing table is set in the guest OS. These often cause network connectivity problems. For cloud servers, we recommend using security groups as the only firewall solution, because security groups work at the virtual network level and cannot be tampered with from inside the instance, so they are simple and secure.
These are the diagnostic capabilities we have so far, but this is far from the end. Many new diagnostic capabilities are in development. I would also like to share some of our experience in building diagnosis. Frankly speaking, it is difficult to do diagnosis well, because customers' problems vary widely, and it is hard to make diagnosis stronger and more accurate through up-front design alone. Our experience is to be problem-driven: find problems, solve them quickly, and iterate to continuously enrich the diagnostic capabilities.
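Two of the guest OS checks listed above, whether a port is listening and whether disk space is sufficient, are simple enough to sketch with the Python standard library. This is an illustration of the kind of check involved, not the implementation used by the health diagnosis service; the function names and the 10% free-space threshold are assumptions.

```python
import shutil
import socket


def is_port_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def disk_has_space(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """Return True if the free share of the filesystem meets the threshold."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_ratio


def run_checks(ssh_port: int = 22) -> dict:
    """Run the two checks and return a small report keyed by check name."""
    return {
        "ssh_port_listening": is_port_listening(ssh_port),
        "disk_space_ok": disk_has_space(),
    }


if __name__ == "__main__":
    for name, passed in run_checks().items():
        print(f"{name}: {'ok' if passed else 'PROBLEM'}")
```

Each check returns a plain boolean so that new checks can be added to the report one at a time, which fits the problem-driven, iterative approach described above.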

Next we look at the typical usage of health diagnosis. Here I list two typical scenarios:
First, detecting the cause of an abnormal instance. For example, you can see in the above figure that the server load suddenly soars. Of course, you can locate the cause through more fine-grained monitoring indicators, but a more convenient way is to run the health diagnosis of the instance. Taking this case as an example, the diagnosis clearly tells you which process occupies the most CPU resources and what its ID is. This helps you quickly locate the problem: you can then check whether it is caused by the business itself. Is it normal business traffic growth, or is there a problem in the code? Was there a recent release?
Second, we recommend implementing periodic health inspections of instances based on health diagnosis and O&M orchestration. Our O&M orchestration service supports periodic execution. You only need to periodically call the instance health diagnosis interface for the instances to be inspected, and a diagnosis report is generated. You can then handle the findings manually or automatically according to the report. If you run a large number of instances, you can implement automatic exception response and automatic operation and maintenance for very serious problems.
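The last step of a periodic inspection, deciding which findings are handled automatically and which are left to an operator, can be sketched as a small triage function. The report format, the severity labels, and the rule "critical findings are handled automatically" are assumptions for illustration; they are not the actual output of the health diagnosis interface.

```python
# Hypothetical report format: instance ID -> list of (item, severity) findings.
REPORTS = {
    "i-001": [],
    "i-002": [("disk_space", "warning")],
    "i-003": [("systemd_not_running", "critical"), ("cpu_full", "warning")],
}


def triage(reports: dict) -> dict:
    """Split inspected instances into healthy, manual, and automatic handling."""
    plan = {"healthy": [], "manual": [], "automatic": []}
    for instance_id, findings in sorted(reports.items()):
        severities = {severity for _, severity in findings}
        if "critical" in severities:
            # Very serious problems go to automatic response.
            plan["automatic"].append(instance_id)
        elif severities:
            # Everything else is left for an operator to review.
            plan["manual"].append(instance_id)
        else:
            plan["healthy"].append(instance_id)
    return plan


if __name__ == "__main__":
    print(triage(REPORTS))
```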

4. Health status
Next, let's look at the health status of the instance. Health status works on the same principle as health diagnosis, but there are three distinct differences between them.
The first is the scope of diagnostic items. The diagnostic items for instance health status are a more refined selection: we choose the basic computing, network, and storage items that ensure the healthy operation of the instance. With health status, an ECS instance now has two states: the control state and the runtime state, that is, the health status. Health diagnosis, by contrast, presents a diagnosis report to the user; in addition to the problem, it also tells you the cause. Through these three comparisons, we find that instance health status has its own applicable scenarios. The above picture shows the diagnostic items selected for instance health status, for a brief overview.

Typical usage of instance health status:
• If you have only a few instances, you can check the running status of each instance in the console in time and see whether it is healthy. If it is not healthy, there must be a problem with the underlying layer or with the user settings. You can take corresponding operation and maintenance measures, or seek technical support.
• If the cluster is relatively large and requires highly reliable infrastructure, we recommend the approach in the diagram on the right: use elastic scaling to automatically replace abnormal instances and keep the infrastructure of the whole cluster highly available. Specifically, enable instance health status detection in the elastic scaling console. The elastic scaling service then periodically checks the instance health status on your behalf. If an abnormality is found, the abnormal instance is immediately replaced with a healthy instance of the same specification, so that the infrastructure as a whole remains highly available.
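The replacement loop that elastic scaling performs can be modeled in a few lines. This is a simulation only: the health values, the instance IDs, the specification name, and the `launch` callback are hypothetical and do not reflect the elastic scaling service's real interfaces.

```python
import itertools
from typing import Callable, List, Tuple

Instance = Tuple[str, str]  # (instance_id, specification)


def replace_unhealthy(
    group: List[Instance],
    health_of: Callable[[str], str],
    launch: Callable[[str], str],
) -> List[Instance]:
    """Return the group with every unhealthy member swapped for a new instance
    of the same specification. Healthy members are kept unchanged."""
    result = []
    for instance_id, spec in group:
        if health_of(instance_id) == "Healthy":
            result.append((instance_id, spec))
        else:
            result.append((launch(spec), spec))
    return result


if __name__ == "__main__":
    health = {"i-a": "Healthy", "i-b": "Impaired", "i-c": "Healthy"}
    counter = itertools.count(1)

    def launch(spec: str) -> str:
        # Stand-in for creating a new instance; returns a made-up ID.
        return f"i-new-{next(counter)}"

    group = [("i-a", "spec-4c8g"), ("i-b", "spec-4c8g"), ("i-c", "spec-4c8g")]
    print(replace_unhealthy(group, health.get, launch))
```

The group size and the specification of each slot stay the same after a replacement, which is what keeps the capacity of the cluster stable while abnormal instances are swapped out.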

Finally, we will introduce a product we are beta testing: network connectivity diagnosis. We all know that the reasons for network failures can be very complex, and our customers are often troubled by them. Based on our long-term experience in troubleshooting, we find that three types of problems occur frequently:
1. The target process is not listening correctly;
2. Firewall settings are wrong, including custom firewalls within the guest OS and security groups;
3. The network settings of the instance itself are wrong.
Therefore, for these three high-frequency problems, we have developed an end-to-end network diagnosis. It can more accurately discover the following problems on the source and destination nodes of a connection:
• Security group and guest OS firewall settings issues
• Subnet ACL setting problem
• Instance's own network status/setting issues
• Whether the port is listening normally
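The end-to-end idea, checking each layer on the path in order and reporting the first one that blocks the traffic, can be sketched as follows. The rule model is a deliberate simplification and an assumption: each layer is a plain allow-list of `(cidr, first_port, last_port)` entries, whereas real security groups and ACLs also have priorities, directions, and deny rules.

```python
import ipaddress
from typing import Iterable, List, Set, Tuple

Rule = Tuple[str, int, int]  # (source CIDR, first port, last port), allow-only


def rule_allows(rules: Iterable[Rule], source_ip: str, port: int) -> bool:
    """Return True if any allow rule matches the source address and the port."""
    source = ipaddress.ip_address(source_ip)
    return any(
        source in ipaddress.ip_network(cidr) and first <= port <= last
        for cidr, first, last in rules
    )


def diagnose(
    source_ip: str,
    port: int,
    layers: List[Tuple[str, List[Rule]]],
    listening_ports: Set[int],
) -> str:
    """Walk the layers in path order and name the first one that blocks."""
    for name, rules in layers:
        if not rule_allows(rules, source_ip, port):
            return f"blocked at: {name}"
    if port not in listening_ports:
        return "blocked at: port not listening"
    return "reachable"


if __name__ == "__main__":
    layers = [
        ("security group", [("10.0.0.0/8", 22, 22)]),
        ("subnet ACL", [("0.0.0.0/0", 1, 65535)]),
        ("guest OS firewall", [("0.0.0.0/0", 22, 22)]),
    ]
    print(diagnose("10.1.2.3", 22, layers, {22}))
    print(diagnose("192.168.1.5", 22, layers, {22}))
```

Reporting the first blocking layer rather than a bare "unreachable" is the point of the end-to-end approach: it tells the user which setting to inspect.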
