Exploration and practice of cloud server observability

1、 The value of observability

What is observability, and why is it so important for cloud servers? Simply put, observability is the ability to understand the internal operation of a cloud server. In my opinion, its importance for cloud servers lies mainly in three aspects: improving operational certainty, simplifying operations and maintenance, and improving information transparency.

Neither physical machines nor cloud servers can achieve 100% reliability. With comprehensive observability, the operational indicators and internal states of a cloud server can be scanned thoroughly, yielding a rich picture of the system that improves information transparency and avoids black boxes. In abnormal scenarios, the results of this scan also help locate the cause of the problem quickly, which simplifies operation and maintenance.

2、 Cloud Server Observability Solution

This section covers the overall observability solution for cloud servers. Let's first take a look at how Alibaba Cloud does it.

Everything is data. Alibaba Cloud relies on a powerful data center to collect nearly 100 TB of data from nearly 100 million collection units every day. This data reflects the operational states, indicators, and parameters within the cloud server. After collection, the data is cleaned to remove noise, correlated against the defined indicators, and finally run through feature calculation to obtain a true picture of how the cloud server is operating. The processed data is then output to two types of products. The first type is our internal operation and maintenance support platform, which is a major means for Alibaba Cloud to proactively maintain the stability of the cloud platform.

The second type is the observability products offered to users, which are provided to meet three objectives: deterministic operation, simplified operation and maintenance, and information transparency. They include four products: cloud monitoring, ECS system events, health diagnosis, and health status. Below, we introduce these four products one by one.

1. Cloud monitoring

When it comes to monitoring systems, I believe everyone is familiar with them. Cloud monitoring is a monitoring and alarm service provided by Alibaba Cloud for cloud resources and internet applications. Compared with traditional monitoring services, what are its advantages? I will focus on two points:

1. Native integration. You don't need to purchase or activate anything; an Alibaba Cloud account is all you need, so it is ready to use out of the box.

2. Flexible alarms. Alarm rules can be configured flexibly, and there is a rich set of alarm push channels. The push channels fall into two categories. The first is message notification channels, such as the familiar DingTalk and SMS. The second, and more important, is automatic processing channels, which lay the foundation for automated operation and maintenance. The automatic processing channels include Function Compute, operation and maintenance orchestration, message service, and log service.
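To make the idea of an alarm rule plus two kinds of push channels concrete, here is a minimal sketch in Python. It is not the cloud monitoring API: the rule shape ("threshold exceeded for N consecutive samples"), the function names, and the in-memory channels are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence


def should_alarm(samples: Sequence[float], threshold: float, consecutive: int) -> bool:
    """Return True when the latest `consecutive` samples all exceed `threshold`."""
    if consecutive < 1 or len(samples) < consecutive:
        return False
    return all(value > threshold for value in samples[-consecutive:])


def push_alarm(message: str, channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send the message to every channel and return the channel names used."""
    for send in channels.values():
        send(message)
    return list(channels)


# Hypothetical channels: one notifies a person, one triggers automation.
outbox: List[str] = []
channels = {
    "notification": lambda msg: outbox.append(f"notify: {msg}"),
    "automation": lambda msg: outbox.append(f"run: {msg}"),
}

# CPU above 80% for three consecutive samples triggers both channels.
cpu_samples = [42.0, 85.0, 91.0, 88.0]
if should_alarm(cpu_samples, threshold=80.0, consecutive=3):
    push_alarm("CPU usage high", channels)
```

Requiring several consecutive samples is one simple way to avoid alarming on a single spike; a real rule engine would offer more options.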

Regarding cloud monitoring, let's focus on its powerful host monitoring features. In addition to common metrics such as CPU, memory, load, disk, and network card, cloud monitoring can also monitor processes. Through process monitoring, you can tell whether a process is alive and how many resources it is currently consuming. Cloud monitoring is therefore the most basic and commonly used means of observation.

2. ECS system events

Alibaba Cloud proactively reports underlying operation and maintenance events or unexpected events that affect the operation of ECS instances, and provides repair suggestions to users. How do ECS system events improve cloud server observability?

1. Proactively report underlying issues to improve the certainty of server operation.

2. Simplify operation and maintenance. Once a system event is reported, you can subscribe to it and process it automatically, which improves event processing efficiency and simplifies operation and maintenance.

3. Event-driven processing improves system efficiency. As we all know, in asynchronous scenarios PUSH mode has an obvious efficiency advantage over PULL mode. Take a very familiar example: to create an ECS instance, we usually first call the RunInstances API to obtain an instance ID, and then repeatedly call the DescribeInstances interface to query the instance status until it becomes Running. This client-side polling is complex to program and inefficient. In event-driven mode, you only need to subscribe to instance status change events; when the status becomes Running, the subsequent business logic is triggered automatically, which is simple and efficient. On the right side of the above figure is the ECS event service flowchart. Event push directly reuses the cloud monitoring alarm push channels, laying the foundation for automated event processing.
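The difference between polling and event-driven waiting can be sketched with a small in-memory simulation. `FakeEcs` below is a stand-in written for this example, not the real ECS SDK; its methods only mimic the roles of RunInstances, DescribeInstances, and an event subscription, and the instance ID is made up.

```python
import time
from typing import Callable, List


class FakeEcs:
    """In-memory stand-in for the ECS control plane (hypothetical, not the SDK)."""

    def __init__(self, polls_until_running: int = 3) -> None:
        self._remaining = polls_until_running
        self._subscribers: List[Callable[[str, str], None]] = []
        self.describe_calls = 0

    def run_instances(self) -> str:
        return "i-hypothetical001"

    def describe_status(self, instance_id: str) -> str:
        # Each call models one DescribeInstances round trip.
        self.describe_calls += 1
        self._remaining -= 1
        return "Running" if self._remaining <= 0 else "Pending"

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._subscribers.append(callback)

    def emit_state_change(self, instance_id: str, state: str) -> None:
        # Models the platform pushing a status change event.
        for callback in self._subscribers:
            callback(instance_id, state)


def wait_by_polling(ecs: FakeEcs, instance_id: str,
                    interval: float = 0.0, max_attempts: int = 60) -> bool:
    """PULL mode: ask repeatedly until the instance is Running."""
    for _ in range(max_attempts):
        if ecs.describe_status(instance_id) == "Running":
            return True
        time.sleep(interval)
    return False


def wait_by_event(ecs: FakeEcs, instance_id: str,
                  on_running: Callable[[str], None]) -> None:
    """PUSH mode: register once and let the event trigger the follow-up logic."""
    def on_change(changed_id: str, state: str) -> None:
        if changed_id == instance_id and state == "Running":
            on_running(changed_id)
    ecs.subscribe(on_change)


# Polling: every status check costs one API call.
polled = FakeEcs(polls_until_running=3)
polled_id = polled.run_instances()
wait_by_polling(polled, polled_id)
print("polling describe calls:", polled.describe_calls)

# Event-driven: no status queries, the callback fires on the Running event.
pushed = FakeEcs()
started: List[str] = []
pushed_id = pushed.run_instances()
wait_by_event(pushed, pushed_id, started.append)
pushed.emit_state_change(pushed_id, "Pending")
pushed.emit_state_change(pushed_id, "Running")
print("event-driven describe calls:", pushed.describe_calls)
```

In the polling path the client pays for every status query and has to pick an interval and a timeout; in the event path the client code shrinks to a single callback.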

Now that we have a basic understanding of ECS system events, let's focus on how to achieve automated event processing. The left side of the figure shows the current event classification; the focus here is on the right side, which shows two recommended solutions for automated event processing:

The first is to push system events through cloud monitoring to Function Compute, where specific events trigger specific functions, thereby achieving automated event processing.

The second is to push events to the operation and maintenance orchestration service, where specific events trigger preset orchestration templates to achieve automated event processing. Note that Function Compute is a paid service, while operation and maintenance orchestration is free.
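As a rough illustration of the first solution, the sketch below shows a function that receives an event and dispatches it to an action by event name. The payload shape (a JSON object with `name` and `content.instanceId`), the event names, and the two actions are assumptions made for this example; check the actual event format before relying on any of them. The `handler(event, context)` signature follows the usual function-service entry point style.

```python
import json
from typing import Callable, Dict


def restart_services(instance_id: str) -> str:
    # Placeholder action: a real handler would call your own recovery logic.
    return f"restart services on {instance_id}"


def drain_traffic(instance_id: str) -> str:
    # Placeholder action: e.g. remove the instance from a load balancer first.
    return f"drain traffic from {instance_id}"


# Map event-name prefixes to actions. The names are illustrative assumptions.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "Instance:SystemFailure.Reboot": restart_services,
    "Instance:SystemMaintenance.Reboot": drain_traffic,
}


def handler(event, context=None) -> str:
    """Parse the pushed event and run the first action whose prefix matches."""
    payload = json.loads(event)  # accepts str or bytes
    name = payload.get("name", "")
    instance_id = payload.get("content", {}).get("instanceId", "unknown")
    for prefix, action in ACTIONS.items():
        if name.startswith(prefix):
            return action(instance_id)
    return f"ignored {name}"
```

Keeping the mapping in a plain dictionary makes it easy to see at a glance which events are handled automatically and which are ignored.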

The ability of ECS system events to proactively report underlying events that affect instance operation is an important part of cloud server observability and effectively addresses the problem of operational certainty. But that is not enough. In practice, the probability of a serious problem on the cloud platform itself is very low; overall, cloud platforms are very stable. Most operation and maintenance issues are related to how users operate and use their instances, which means that problems often occur within the customer's OS and applications. However, system events cover relatively few exceptions inside the customer's OS. So, to further improve the observability of cloud servers, we have launched diagnostic services, which are divided into three products: instance health diagnosis, instance health status, and network connectivity diagnosis.

3. Health diagnosis

Let's first take a look at instance health diagnosis, a service that comprehensively detects issues within the customer's OS as well as software and hardware issues of the cloud platform that the cloud server relies on. Our diagnostic items are divided into two categories: customer OS diagnostic items and cloud platform diagnostic items.

Today, we will focus on the customer OS diagnostic items. What issues can health diagnosis detect within the customer OS?

1. Common resource utilization issues, such as a saturated CPU, insufficient memory, and insufficient disk space, along with the top 5 processes with the highest resource usage.

2. Common issues with network settings, disk settings, and file system settings. Taking network settings as an example, health diagnosis checks whether the network card is up, whether network services are running, whether multi-queue is enabled on the network card to ensure network performance, and whether the IP configuration method of the network card is correct. For example, we often encounter cases where an instance should use DHCP to obtain its IP dynamically, but a custom image configures a static IP, which makes the network inaccessible.

3. Whether the services that affect the normal operation of the instance are running normally, such as whether common ports are listening (for example, port 22 on Linux and port 3389 on Windows), whether the DHCP process that dynamically allocates the IP exists, and whether systemd, which is responsible for system initialization, is running normally.

4. Whether a custom firewall or routing table has been set up in the customer's OS, which often leads to network connectivity issues. For cloud servers, we recommend using security groups as the only firewall solution, as security groups are located at the virtual network level and cannot be tampered with by users, making them both simple and secure.
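A couple of the checks in this list are easy to approximate from inside the OS. The following sketch uses only the Python standard library to test whether a local port accepts connections and whether disk usage is under a threshold. It is not how the health diagnosis service is implemented; the function names, the report keys, and the 90% threshold are assumptions for illustration.

```python
import shutil
import socket
from typing import Dict, Sequence


def port_is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def disk_usage_percent(path: str = "/") -> float:
    """Return the used share of the file system holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def run_checks(ports: Sequence[int] = (22,), disk_threshold: float = 90.0) -> Dict[str, bool]:
    """Build a small report: one entry per port plus a disk space entry."""
    report = {f"port_{port}_listening": port_is_listening(port) for port in ports}
    report["disk_space_ok"] = disk_usage_percent() < disk_threshold
    return report


if __name__ == "__main__":
    # On a Windows instance you would check port 3389 instead of 22.
    print(run_checks())
```

A check like this only sees the instance from the inside, which is why the diagnosis described above combines it with checks on the cloud platform side.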

These are the diagnostic capabilities we have so far, but they are far from the end point; many new diagnostic capabilities are still under development. I would also like to share some of our experience in building diagnosis. To be honest, it is hard to do diagnosis well, because customers' problems vary greatly, and it is difficult to achieve comprehensive and accurate diagnosis through up-front design alone. Our experience is to be problem driven: discover and solve problems quickly, iterate, and continuously enrich the diagnostic capabilities.

Next, let's take a look at typical usage of health diagnosis. Here I have listed two typical scenarios:

First, cause detection for abnormal instances. For example, in the above figure you can see that the server load suddenly spiked. Of course, you can use finer-grained monitoring indicators to locate the cause, but a more convenient approach is to run health diagnosis on the instance. In this case, the diagnosis tells you clearly which process consumes the most CPU and what its ID is, which helps you locate the problem quickly. You can then check whether the cause lies in the business itself: is it normal business traffic growth, is there an issue with the code, or was there a recent release?

Second, periodic health inspection of instances based on health diagnosis and operation and maintenance orchestration. Our operation and maintenance orchestration service supports periodic execution. As long as you regularly call the instance health diagnosis interface for the instances that need inspection, diagnostic reports are generated, and you can then handle them manually or automatically according to the prompts in the report. If the number of instances is relatively large, automated exception response and automated operation and maintenance can be implemented for very serious problems.
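The periodic inspection flow can be sketched as a triage step that runs once per inspection cycle. In the sketch below, `diagnose` is a placeholder for whatever call produces a diagnostic report; the report shape (a list of items with a `severity` field) and the severity names are assumptions for this example, not the real report format.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Issue = Dict[str, str]  # assumed shape: {"item": "...", "severity": "..."}


def triage(instance_ids: Iterable[str],
           diagnose: Callable[[str], List[Issue]],
           auto_levels: Sequence[str] = ("Critical",)) -> Dict[str, List[str]]:
    """Sort instances into healthy, manual follow-up, and automated response."""
    plan: Dict[str, List[str]] = {"healthy": [], "manual": [], "auto": []}
    for instance_id in instance_ids:
        issues = diagnose(instance_id)
        if not issues:
            plan["healthy"].append(instance_id)
        elif any(issue["severity"] in auto_levels for issue in issues):
            # Very serious problems go to automated handling.
            plan["auto"].append(instance_id)
        else:
            plan["manual"].append(instance_id)
    return plan


# Example run with canned reports in place of a real diagnosis call.
canned_reports: Dict[str, List[Issue]] = {
    "i-a": [],
    "i-b": [{"item": "disk space", "severity": "Warning"}],
    "i-c": [{"item": "cpu", "severity": "Critical"}],
}
plan = triage(canned_reports, lambda instance_id: canned_reports[instance_id])
print(plan)
```

A scheduler (for example, a periodic orchestration task) would call `triage` on each cycle and hand the `auto` list to whatever automated response you have defined.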

4. Health status

Next, let's take a look at instance health status. Health status works on the same principle as health diagnosis, but differs from it in three obvious ways.

The first is the scope of diagnostic items. The diagnostic items of instance health status are more streamlined: we select the basic compute, network, and storage diagnostic items that ensure the healthy operation of the instance. The second is that health status is a state of the instance. From now on, ECS has two states: one is the control state, and the other is the runtime state, which is the health state. The third is the output: health diagnosis presents a diagnostic report to the user, which tells you not only what the problem is but also why it occurred. These three comparisons show that instance health status has its own unique application scenarios. The above figure shows the diagnostic items selected for instance health status, for a quick overview.

Typical usage of instance health status. If you have very few instances, you can check the current running status of each instance through the console in a timely manner and see whether it is healthy.

If it is not healthy, something must be wrong at the underlying level or in the user settings, and you can take corresponding operation and maintenance measures or seek technical support.

If the cluster is relatively large and the reliability requirements for the infrastructure are very high, we suggest the approach shown in the right figure: use the automatic replacement of abnormal instances feature of elastic scaling to ensure high availability of the entire cluster's infrastructure. Specifically, enable instance health status detection in the elastic scaling console. The elastic scaling service then periodically checks instance health status on the customer's behalf. If any abnormality is found, it immediately replaces the instance with a healthy one of the same specification, ensuring high availability at the infrastructure level.
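The replace-on-unhealthy loop that elastic scaling runs for you can be pictured with a small sketch. This is not the elastic scaling service itself: the fleet representation, the `launch` callable, and the status names "Ok", "Impaired", and "Initializing" are assumptions used only to show one reconciliation pass.

```python
from typing import Callable, Dict, Tuple

Fleet = Dict[str, Tuple[str, str]]  # instance id -> (specification, health status)


def replace_unhealthy(fleet: Fleet,
                      launch: Callable[[str], str]) -> Tuple[Fleet, Dict[str, str]]:
    """Return the new fleet and a map of replaced id -> replacement id."""
    new_fleet: Fleet = {}
    replaced: Dict[str, str] = {}
    for instance_id, (spec, health) in fleet.items():
        if health == "Impaired":
            # Launch a replacement with the same specification.
            new_id = launch(spec)
            new_fleet[new_id] = (spec, "Initializing")
            replaced[instance_id] = new_id
        else:
            new_fleet[instance_id] = (spec, health)
    return new_fleet, replaced


# Example run with a fake launcher that hands out sequential ids.
_counter = iter(range(1000))


def fake_launch(spec: str) -> str:
    return f"i-new{next(_counter)}"


fleet: Fleet = {
    "i-1": ("hypothetical.spec", "Ok"),
    "i-2": ("hypothetical.spec", "Impaired"),
}
fleet, replaced = replace_unhealthy(fleet, fake_launch)
print(replaced)
```

Running this pass on a schedule keeps the number of healthy instances constant without anyone watching the console.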

Finally, let me introduce a product that is currently in public beta: network connectivity diagnosis. Everyone knows that the causes of network connectivity issues can be very complex. Our customers often struggle with them, and based on our long-term experience troubleshooting for users, we have found that three types of problems occur most frequently:

1. The target process is not listening correctly;

2. Firewall settings issues, including custom firewalls within the customer's OS and security groups;

3. Problems with the network settings of the instance itself.

So we have developed end-to-end network diagnosis to address these three high-frequency issues. For the source and destination nodes of a communication path, it can accurately discover:

• Security group and customer OS firewall settings issues

• Subnet ACL setting issue

• Network status/settings issues with the instance itself

• Whether the port is listening normally
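To show how checks like these can be chained along a path, here is a minimal sketch that evaluates security group rules, then subnet ACL rules, then the listening state, and reports the first layer that fails. The rule model is a deliberate simplification (rules are evaluated in order, the first match wins, and no match means deny) and does not reproduce real security group or ACL semantics; the `Rule` type and function names are assumptions for this example. The OS firewall and instance network settings checks are left out for brevity.

```python
import ipaddress
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    cidr: str          # source range, e.g. "10.0.0.0/8"
    port_from: int
    port_to: int
    allow: bool = True


def allows(rules: Sequence[Rule], source_ip: str, port: int) -> bool:
    """First matching rule wins; no match means deny (a simplification)."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        network = ipaddress.ip_network(rule.cidr, strict=False)
        if address in network and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def diagnose_path(source_ip: str, port: int,
                  security_group: Sequence[Rule],
                  subnet_acl: Sequence[Rule],
                  port_listening: bool) -> str:
    """Walk the layers in order and name the first one that blocks the path."""
    if not allows(security_group, source_ip, port):
        return "blocked by security group"
    if not allows(subnet_acl, source_ip, port):
        return "blocked by subnet ACL"
    if not port_listening:
        return "target port is not listening"
    return "path looks open"


# Example: SSH from an internal address, with one subnet explicitly denied.
security_group = [Rule("10.0.0.0/8", 22, 22)]
subnet_acl = [Rule("10.0.5.0/24", 1, 65535, allow=False), Rule("0.0.0.0/0", 1, 65535)]
print(diagnose_path("10.0.1.7", 22, security_group, subnet_acl, port_listening=True))
```

Reporting the first failing layer mirrors how a person would troubleshoot: there is no point checking the listener if the packet never reaches the instance.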

3、 Summary

1. Comparison of Several Products

Okay, we have introduced five products that serve the three goals of cloud server observability (deterministic operation, simplified operation and maintenance, and information transparency): cloud monitoring, ECS system events, instance health diagnosis, instance health status, and end-to-end network diagnosis. Finally, let's summarize and review. First, let's compare the characteristics of these products:

Cloud monitoring: Indicator monitoring and alarms, particularly suitable for the customer OS and customer processes.

System events: Cover a wide range of problem domains, but mainly report issues that affect instance operation due to cloud platform system maintenance or system errors. Relatively few events relate to the customer OS and customer processes.

Health diagnosis: Covers a wide range of problem domains, with more diagnostic items than system events, especially for the customer OS; the items are still being continuously enriched.

Health status: The principle is similar to health diagnosis, but the diagnostic items are more streamlined and suited to specific scenarios.

2. Product Selection in Different Scenarios

Finally, from a scenario perspective, let's look at which products and tools suit which scenarios.

If we want business or host monitoring, measurement, and alarms, we prioritize cloud monitoring.

If an instance behaves abnormally, or we want to conduct periodic health checks on instances, we suggest using ECS system events and ECS health diagnosis.

If your scenario involves containers or secondary virtualization and the high availability requirements for infrastructure are very high, we suggest using elastic scaling with automatic detection of abnormal instance health status to ensure high availability of the entire cluster.

If you encounter network connectivity issues, prioritize end-to-end network diagnosis, which checks for issues with security groups, firewalls inside the customer OS, port listening, and the network settings of the instance itself.
