Observability on the Cloud: Problem Discovery and Localization in Practice
01 The Value of ECS Observability
Cloud server observability is the customer's ability to see how a server is operating internally, which in turn helps keep cloud resources reliable. Compared with traditional IT operations and maintenance (O&M), both the way resources are used and the way they are operated have changed on the cloud. In the traditional scenario, customers buy their own machine rooms and machines and operate the hardware directly; after moving to the cloud, they operate computing resources through OpenAPI. Likewise, in the traditional scenario the business scale is limited by the machine room or the physical machines; thanks to the elasticity of the cloud, customers can easily scale out to hundreds or thousands of servers.
O&M on the cloud also brings new challenges and difficulties, and cloud server observability is what addresses them.
The value of ECS observability lies mainly in the following three points:
① Improve the efficiency of problem location. Through self-service, customers can locate problems quickly and reliably.
② Simplify O&M and make it easy to understand the operating details of ECS.
③ Improve resource reliability: keep track of the state inside the ECS guest OS and of the underlying platform in a timely manner, and avoid black boxes.
Alibaba Cloud provides a set of mainstream tools for improving ECS observability, including health diagnosis, system events, cloud monitoring, ARMS, and operation audit. Although these tools differ in positioning and perspective, they share one purpose: to let customers clearly see the health of the current instance, find problems quickly, and reduce O&M costs.
02 Use self-service tools to locate and analyze typical problems
Self-service tools: on-demand self-service means that users can obtain computing resources or services on their own, without having to go through the service provider.
Typical problems that customers face on the cloud include an instance failing to start, an instance that cannot be connected to, and operations that do not take effect. In the traditional scenario, customers can only submit a support ticket, and how fast the problem is solved depends on how well the support staff understand the problem and how quickly they respond.
In the self-service scenario, we build all of these typical customer problems into the health diagnosis toolset. Customers can start a diagnosis themselves on the console, and locating the problem takes only minutes.
There are two kinds of reasons why an instance cannot be started:
The first is a problem inside the operating system. Examples include a virus in the customer's operating system, key files that are damaged or deleted, core system services that fail to start because of a customer mistake, an incorrect fstab configuration, or a conflict between the image and the instance specification. For problems inside the operating system, the health diagnosis tool can quickly locate the problem and push the corresponding repair solution to the customer.
The second is a problem in the underlying cloud platform, which is relatively rare. It mainly includes insufficient inventory, host alarms, control system exceptions, virtualization exceptions, and exceptions while expanding or shrinking a disk. For such problems, the diagnostic tool points the customer to the manual support entry. For some serious errors, it also pushes a problem reporting portal to the customer. If the O&M team confirms that the reported problem is serious, it proactively triggers the O&M action, and the steps are fully visible to the customer.
When an instance cannot be started, how can the Alibaba Cloud diagnostic tools inspect the inside of the instance's operating system?
Take a PC operating system as an example. When a PC breaks down, a USB flash drive is usually used as a repair disk and selected as the boot device at startup. Once the PC has booted from the USB flash drive, the system is reinstalled or repaired; finally the repair disk is removed and the computer can start normally.
The diagnostic tool works in a similar way, as shown at the bottom left of the figure above. If the customer's operating system fails to start normally, the diagnostic tool attaches a repair disk for the customer and generates a temporary password for logging in to it. After the repair disk is attached, the instance is started automatically for the customer. The original system disk is attached to the current instance as a data disk, and real-time detection is then performed on it. If a problem is found, a specific repair plan is recommended, and the customer can fix the problem by following it. Once the problem on the original system disk is solved, the repair disk can be detached and the instance can start normally. The whole process is fully visible to the customer.
Failure to connect to an instance remotely mainly covers two cases: two ECS instances cannot reach each other, or an ECS instance and a public IP address cannot reach each other. The diagnostic tool supports three types of input: you can select ECS instances, network interface cards, or public IP addresses.
The diagnostic tool lists the critical paths between the source and the destination, such as the instance account status, the instance operating system, and the vSwitch to which the current instance belongs. It then checks in turn whether each critical path is connected and draws a conclusion.
The critical paths fall into two categories: instance configuration and configuration inside the operating system. The instance configuration category includes overdue payment on the instance, a vSwitch that does not permit the traffic, and locked instances. The operating system category mainly uses Cloud Assistant to run open source diagnostic commands in real time; if a problem is found inside the operating system, the user is told about it in the repair plan. The network connectivity diagnosis report lists the critical paths that are not connected and the cause for each.
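The check-each-critical-path-in-turn idea can be sketched as follows. This is a minimal illustration, not the diagnostic tool's actual implementation: the `PathCheck` type, the `diagnose_connectivity` function, and the stub probes are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PathCheck:
    """One critical path between source and destination (hypothetical model)."""
    name: str                            # e.g. "vSwitch"
    category: str                        # "instance configuration" or "operating system"
    probe: Callable[[], Optional[str]]   # returns None if connected, else the cause


def diagnose_connectivity(checks: List[PathCheck]) -> List[Dict[str, str]]:
    """Probe every critical path in order and report the ones that are not connected."""
    report = []
    for check in checks:
        cause = check.probe()
        if cause is not None:
            report.append(
                {"path": check.name, "category": check.category, "cause": cause}
            )
    return report


# Stub probes stand in for the real checks, which are not described in detail here.
checks = [
    PathCheck("instance account status", "instance configuration", lambda: None),
    PathCheck("vSwitch", "instance configuration", lambda: "traffic not permitted"),
    PathCheck("instance operating system", "operating system", lambda: None),
]
report = diagnose_connectivity(checks)
```

In this sketch every path is probed even after a failure, so the report can list all disconnected paths and their causes rather than only the first one.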
An instance change that does not take effect means that the customer has made a change on the console but the result is not what was expected. This kind of problem is hard to handle, because there are many kinds of changes and many reasons why one may fail to take effect. At present, the self-service diagnostic tools support the following cases: cloud disk expansion does not take effect, password reset does not take effect, instance configuration change does not take effect, and instance renewal fails.
For example, a customer's cloud disk runs out of capacity and is expanded on the console from 40 GB to 100 GB. The console shows 100 GB, but the customer still has to log in to the OS and run the expansion commands before the new capacity really takes effect. Otherwise, the business will be affected. The diagnostic tool has a dedicated check for cloud disk expansion: if the disk size actually in effect is found to be inconsistent with the expanded size, it pushes expansion suggestions to the user to avoid damage to the customer's business.
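The comparison behind such a check can be sketched as below. It is only an illustration under stated assumptions: the function name, the 5% tolerance (to allow for file system overhead), and the wording of the suggestion are all hypothetical, and the size actually in effect is passed in rather than measured.

```python
from typing import Optional


def check_disk_expansion(console_size_gb: float,
                         effective_size_gb: float,
                         tolerance: float = 0.05) -> Optional[str]:
    """Return a suggestion if the size in effect lags behind the console size.

    tolerance is an assumed allowance for file system overhead, so that a
    file system slightly smaller than the disk is not reported as a problem.
    """
    if effective_size_gb < console_size_gb * (1 - tolerance):
        return (
            f"Console shows {console_size_gb:g} GB but only "
            f"{effective_size_gb:g} GB is in effect inside the OS; "
            "extend the partition and file system to finish the expansion."
        )
    return None


# The 40 GB to 100 GB example from the text: the OS still sees the old size.
suggestion = check_disk_expansion(console_size_gb=100, effective_size_gb=40)
```

Inside a Linux or Windows guest, the size in effect could be read with something like `shutil.disk_usage(path).total`; how the actual tool obtains it is not stated in the text.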
In other cases, an instance change does not take effect because the customer is not familiar with the product rules. Here the diagnostic tool pushes the relevant product rules to the customer.
The typical problems above have to be diagnosed by the customer on the console, so this is a passive service. The proactive detection tool behind the self-service capability is the reconciliation system. If what the customer sees for a cloud service is inconsistent with what is actually running, the business will be affected. The reconciliation tool is therefore used to make sure that what the customer sees on the console matches the values actually in effect. For example, it checks whether the IP address shown on the console is the same as the actual IP address.
By combining customer-initiated diagnosis with the proactive service behind the self-service tools, we can make sure that the data the customer observes is consistent with what is actually running.
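A reconciliation pass of this kind boils down to comparing two views of the same instance. The sketch below assumes both views are available as plain dictionaries; the `reconcile` function and the field names are hypothetical and only illustrate the comparison.

```python
from typing import Any, Dict, Tuple


def reconcile(console_view: Dict[str, Any],
              actual_state: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every field whose console value differs from the actual value.

    Each mismatch maps the field name to (value on console, actual value).
    A field missing from one side is reported with None for that side.
    """
    mismatches = {}
    for field in console_view.keys() | actual_state.keys():
        shown = console_view.get(field)
        actual = actual_state.get(field)
        if shown != actual:
            mismatches[field] = (shown, actual)
    return mismatches


# Example: the IP address on the console no longer matches the running instance.
console_view = {"private_ip": "10.0.0.12", "disk_size_gb": 100}
actual_state = {"private_ip": "10.0.0.27", "disk_size_gb": 100}
mismatches = reconcile(console_view, actual_state)
```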
The above figure shows an overview of the diagnostic capabilities and user scenarios of the self-service diagnostic tool.
Diagnostic capabilities fall into two main categories: problem troubleshooting and product rules. The problem troubleshooting category is subdivided into operating system and cloud platform, with about 80 diagnostic capabilities in total; the product rule category currently provides more than 30 capabilities.
Cloud platform diagnosis depends on Alibaba Cloud's underlying data collection. Alibaba Cloud has nearly 30 regions and hundreds of zones around the world, and real-time data is collected continuously, covering physical machines, IDC machine rooms, operating performance, serial port logs, and so on. These basic logs are the input of the health diagnosis tool. On top of this underlying data, the diagnosis cleans the data, aggregates it, extracts the features related to exceptions, and finally produces the root cause.
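The clean, aggregate, extract features, and conclude sequence can be pictured with a toy pipeline like the one below. Everything in it is an assumption made for illustration: the record layout, the `SIGNATURES` table, the function names, and the threshold are hypothetical and do not reflect the platform's real data or rules.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical raw records, e.g. lines taken from serial port logs.
RAW_RECORDS = [
    {"host": "h1", "source": "serial_log", "message": "Kernel panic - not syncing"},
    {"host": "h1", "source": "serial_log", "message": ""},
    {"host": "h1", "source": "serial_log", "message": "Kernel panic - not syncing"},
    {"host": "h2", "source": "performance", "message": "ok"},
]

# Hypothetical mapping from a log substring to an exception feature.
SIGNATURES = {"kernel panic": "guest_kernel_panic", "i/o error": "disk_io_error"}


def clean(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Data cleaning: drop records with an empty message."""
    return [r for r in records if r.get("message", "").strip()]


def extract_features(records: List[Dict[str, str]]) -> Counter:
    """Aggregate by (host, feature), counting messages that match a signature."""
    features: Counter = Counter()
    for record in records:
        text = record["message"].lower()
        for needle, feature in SIGNATURES.items():
            if needle in text:
                features[(record["host"], feature)] += 1
    return features


def root_causes(features: Counter, min_count: int = 2) -> List[Tuple[str, str]]:
    """Report a feature as a root cause once it has been seen min_count times."""
    return sorted(key for key, count in features.items() if count >= min_count)


causes = root_causes(extract_features(clean(RAW_RECORDS)))
```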
The other part, diagnosis inside the operating system, is closely tied to the customer's own environment and is carried out through the Cloud Assistant service installed in the instance. When a customer starts a diagnosis, Cloud Assistant runs open source scripts on the customer's instance to collect real-time data. The data covers load items, such as the CPU, memory, and I/O load in the customer's OS, and configuration items, such as DHCP and IP settings.
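As a rough idea of what a load collection script might gather, the sketch below takes a small snapshot using only the Python standard library. It is not one of the scripts that Cloud Assistant runs; the function name and the returned fields are assumptions, and it covers only CPU and disk figures, not memory or I/O.

```python
import os
import shutil
from typing import Any, Dict


def collect_load_snapshot(path: str = os.sep) -> Dict[str, Any]:
    """Collect a few load figures from inside the OS (illustrative subset)."""
    usage = shutil.disk_usage(path)
    snapshot: Dict[str, Any] = {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
    }
    # os.getloadavg is only available on Unix-like systems, so guard its use.
    getloadavg = getattr(os, "getloadavg", None)
    if getloadavg is not None:
        try:
            snapshot["load_avg_1m"] = getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this system
    return snapshot


snapshot = collect_load_snapshot()
```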
Cloud platform diagnosis and diagnosis inside the operating system together make up the health diagnosis service. At present, the health diagnosis service is available to cloud customers on the console and is also provided to internal cloud products. An OpenAPI will be launched in the near future.
03 Integrating diagnosis to automate O&M
For small and medium-sized customers without their own O&M platform, we recommend using the diagnostic product directly on the console, which is convenient and fast. The diagnostic product also provides many detailed solutions for customers' reference.
For medium and large customers with their own O&M systems, we recommend integrating the diagnosis into those systems through APIs, which is efficient and convenient. For example, the monitoring system can be integrated with diagnosis: when it detects an abnormal instance load, it calls the diagnosis API directly and acts on the diagnosis results. The diagnosis service can also be integrated into the inspection system to diagnose the core instances of a cluster every day, so that any abnormal instance can be replaced or scaled out in time. Finally, the diagnosis service can be integrated into the back-end O&M system for use by on-duty staff. For example, when a diagnosis is started on an RDS instance, a diagnosis on the ECS layer can be started at the same time.
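The monitoring integration described above could look roughly like this. Because the text does not give the diagnosis API, the call is injected as a plain function: `run_diagnosis`, the result layout, the 90% threshold, and the stub are all hypothetical placeholders for whatever the real API provides.

```python
from typing import Callable, Dict, List


def on_load_alarm(instance_id: str,
                  cpu_percent: float,
                  run_diagnosis: Callable[[str], Dict],
                  threshold: float = 90.0) -> List[str]:
    """Call the diagnosis only when the load is abnormal and return its advice."""
    if cpu_percent < threshold:
        return []  # load is normal, nothing to diagnose
    result = run_diagnosis(instance_id)
    return [
        f"{issue['code']}: {issue['suggestion']}"
        for issue in result.get("issues", [])
    ]


def fake_run_diagnosis(instance_id: str) -> Dict:
    """Stub standing in for the real diagnosis call (assumed result shape)."""
    return {
        "instance_id": instance_id,
        "issues": [{"code": "instance_overloaded", "suggestion": "upgrade configuration"}],
    }


advice = on_load_alarm("i-example", cpu_percent=97.0, run_diagnosis=fake_run_diagnosis)
```

Injecting the call keeps the handler testable and lets the same logic be reused from an inspection job or an on-duty tool.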
In addition, the diagnostic product is combined with the O&M orchestration capability. Many public O&M orchestration templates are available, which make it quick to set up scheduled O&M and event-driven O&M. Batch operations are also supported, such as starting diagnoses in batches or across regions. All of these capabilities can be used directly on the O&M orchestration console with public scripts. ECS products can also trigger operations based on the diagnosis results. For example, if the results indicate that the current instance is overloaded, you can trigger the creation of a new instance or a configuration upgrade; if the results show that the cloud disk is full, you can trigger the capacity expansion command.
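Triggering an operation from a diagnosis result is essentially a lookup from finding to action. The sketch below shows that idea with a dispatch table; the finding codes, the action descriptions, and the `plan_actions` function are hypothetical and are not taken from any public template.

```python
from typing import Dict, List

# Hypothetical mapping from a diagnosis finding to the follow-up operation.
ACTION_FOR_FINDING: Dict[str, str] = {
    "instance_overloaded": "create a new instance or upgrade the configuration",
    "cloud_disk_full": "run the cloud disk capacity expansion",
}


def plan_actions(findings: List[str]) -> List[str]:
    """Map findings to actions, skipping unknown findings and duplicates."""
    actions: List[str] = []
    for finding in findings:
        action = ACTION_FOR_FINDING.get(finding)
        if action is not None and action not in actions:
            actions.append(action)
    return actions


actions = plan_actions(["cloud_disk_full", "unknown_finding", "cloud_disk_full"])
```

Unknown findings are deliberately ignored here, so that a new diagnostic capability never triggers an operation until a mapping for it has been added.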
The self-service diagnostic tools are intended to become the general entry point for locating and troubleshooting ECS problems. In the near future, we will release the OpenAPI for health diagnosis, and we are actively developing more diagnostic capabilities. In addition, a technology circle has been set up in Alibaba Cloud's official community for professional exchange and sharing among peers in the industry.
Q&A: audience questions
Q1: Why don't you perform automatic repair after the automatic diagnosis?
A: We have tried some automatic repairs, but the results were not good. First, some repair actions are risky; second, a repair action may require user authorization. We therefore decided to provide only the repair solution to the customer.
Q2: How accurate is the problem discovery and localization?
A: The diagnostic tools cannot cover every problem. At present, the priority is to cover the problems that occur most frequently. Problems come from three sources: the first is customer support tickets, which we use to evaluate which problems occur frequently and can be built into the diagnostic tool; the second is regular visits to customers at GC 3 or above; the third is the team's weekly on-duty report for the management and control system.
Knowledge Base Team