Artificial Intelligence for IT Operations (AIOps) has attracted a great deal of attention in recent years. It aims to solve operations and maintenance (O&M) problems intelligently and has become a major trend in O&M. AIOps covers intelligent discovery, analysis, and handling of problems, operational decision-making, risk perception, and related areas.
Detecting, locating, repairing, and preventing exceptions in Kubernetes clusters is difficult because of the complexity of Kubernetes itself, its deep integration with multiple cloud products, and the complex business scenarios on the user side. We have gradually built up AIOps capabilities in security inspection and Container Intelligence Service (CIS) and turned them into product features that reduce cluster exceptions and improve cluster stability. As a result, users can quickly detect, diagnose, and resolve cluster exceptions.
We found that the problems users encounter when using Kubernetes follow a clear distribution: some problems occur far more often than others. To address high-frequency and high-risk issues, we developed a configurable inspection framework that regularly checks clusters for potential risks. Based on expert experience, each risk item also comes with repair suggestions, which improves users' ability to handle O&M themselves.
Based on Kubernetes best practices, security inspection detects security risks in clusters and provides detailed descriptions and repair samples. Examples of detected risks include a container with no CPU limit set, a container with no readiness probe configured, and a container that starts as the root user.
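The three example risks above can be illustrated with a minimal sketch. This is not the inspection framework described in this article. The function name `inspect_container` and the finding messages are hypothetical, and the input is assumed to be a container entry from a pod spec that has already been loaded into a Python dictionary. Only container-level settings are examined, so a pod-level `securityContext` is ignored here.

```python
from typing import Any


def inspect_container(container: dict[str, Any]) -> list[str]:
    """Return the risks found in one container entry of a pod spec.

    Hypothetical helper for illustration; messages are not the product's wording.
    """
    findings: list[str] = []

    # Risk 1: no upper bound on CPU usage (resources.limits.cpu is missing).
    limits = (container.get("resources") or {}).get("limits") or {}
    if "cpu" not in limits:
        findings.append("CPU limit is not set")

    # Risk 2: no readiness probe configured for the container.
    if "readinessProbe" not in container:
        findings.append("readiness probe is not configured")

    # Risk 3: nothing prevents the container from starting as root (UID 0).
    security = container.get("securityContext") or {}
    run_as_user = security.get("runAsUser")
    if not security.get("runAsNonRoot", False) and run_as_user in (None, 0):
        findings.append("container may start as the root user")

    return findings


if __name__ == "__main__":
    example = {
        "name": "web",
        "image": "nginx",
        "resources": {"limits": {"memory": "128Mi"}},
    }
    for finding in inspect_container(example):
        print(f"{example['name']}: {finding}")
```

The root check is deliberately conservative: when neither `runAsNonRoot` nor a non-zero `runAsUser` is set, the effective user depends on the image, so the sketch reports a possible risk rather than a confirmed one.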
Currently, regular cluster inspection supports four event types comprising 28 check items, covering multiple dimensions (such as resource pressure, resource quota, versions and certificates, and cluster risk). By default, each user has a quota and a pressure threshold for available resources. Exceptions sometimes occur because the resource quota is insufficient or resource usage exceeds the pressure threshold, but users are not aware of it. Regular cluster inspection detects these high-frequency problems, so users can raise the resource quota or threshold as prompted.
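As a rough illustration of a quota check, the sketch below flags resources whose usage is close to the quota. The names `QuotaItem` and `check_quota`, the resource names, and the 80% warning ratio are all assumptions made for this example and do not come from the product.

```python
from dataclasses import dataclass


@dataclass
class QuotaItem:
    """Usage of one resource against its quota (hypothetical structure)."""
    name: str
    used: float
    limit: float


def check_quota(items: list[QuotaItem], warn_ratio: float = 0.8) -> list[str]:
    """Return the names of resources whose usage has reached warn_ratio of the quota."""
    at_risk: list[str] = []
    for item in items:
        if item.limit <= 0:
            # A non-positive limit is treated as "no quota configured" in this sketch.
            continue
        if item.used / item.limit >= warn_ratio:
            at_risk.append(item.name)
    return at_risk


if __name__ == "__main__":
    usage = [
        QuotaItem("ecs_instances", used=95, limit=100),
        QuotaItem("load_balancers", used=10, limit=50),
    ]
    for name in check_quota(usage):
        print(f"{name}: usage is close to the quota, consider requesting an increase")
```

Running a check like this on a schedule is what lets a quota problem surface before it causes a failure, for example during cluster expansion.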
Cluster inspection addresses the high-frequency, common problems, which account for 20% of all problems. The fault diagnosis feature handles the remaining 80%, which are less common. According to our analysis, the top problems users encounter when using Kubernetes include node exceptions, pod status exceptions, network failures, abnormal application behavior (such as DNS errors, errors when accessing external services, restarts, and crashes), and cluster expansion failures. The main purpose of fault diagnosis is to tell users what caused these errors and how to resolve them. We have designed a fault diagnosis framework that includes node diagnosis, pod diagnosis, and network diagnosis, with a total of 91 check items covering 43 exception scenarios.
Network problems are the most common among them, and the vast majority of users are not equipped to deal with them. Even with very simple network problems (for example, being unable to access the network), users are often helpless. We designed and developed Skoop, an automated container network diagnosis tool that performs end-to-end container network fault diagnosis. When network access is blocked, users only need to run the network diagnosis once and give Skoop the source and destination of the access. Skoop then automatically finds the reason the access is blocked.
Alibaba Cloud Container Service for Kubernetes serves tens of thousands of users. The business applications running in these clusters span AI, big data, online business, batch processing, video broadcasting, IoT, and other scenarios. Deployment forms include public cloud, private cloud, local cloud, hybrid cloud, and edge nodes. We have distilled and accumulated extensive expert experience in handling problems across these scenarios and applications. Please see the FAQ page for more information.
We also provide node self-healing for problems that can be repaired automatically. One example is a node whose systemd version is too low. If the systemd version of a node is earlier than systemd-219-67, the node status changes frequently between Ready and NotReady. As a result, service pods are frequently evicted, and exceptions occur. To fix this, the systemd version on the node must be upgraded. In a large cluster, manually logging on to hundreds of ECS instances to repair them is very cumbersome. In this case, users can rely on the self-healing capability of the node pool to solve the problem.
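The version condition above can be sketched as a simple comparison. This is only an illustration, not the check used by the node pool: the function names are hypothetical, and the package string is assumed to follow the form `systemd-<version>-<release>`, optionally followed by a distribution suffix. A real package manager applies more detailed version comparison rules than this two-number comparison.

```python
import re

# Threshold taken from the text above: versions earlier than systemd-219-67 are affected.
MIN_VERSION = (219, 67)


def parse_systemd_version(package: str) -> tuple[int, int]:
    """Extract (version, release) from a string such as 'systemd-219-62.el7'."""
    match = re.match(r"systemd-(\d+)-(\d+)", package)
    if match is None:
        raise ValueError(f"unrecognized systemd package string: {package!r}")
    return int(match.group(1)), int(match.group(2))


def needs_upgrade(package: str) -> bool:
    """Return True if the package is earlier than systemd-219-67."""
    return parse_systemd_version(package) < MIN_VERSION


if __name__ == "__main__":
    for pkg in ("systemd-219-62.el7", "systemd-219-67.el7", "systemd-239-45.el8"):
        state = "needs upgrade" if needs_upgrade(pkg) else "ok"
        print(f"{pkg}: {state}")
```

Comparing the two numbers as a tuple keeps the ordering correct when the major version differs, which a plain string comparison would not guarantee.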
This series focuses on the inner workings of CloudOps, DevOps, SecOps, AIOps, and FinOps and how they relate to End-to-End Cloud-Native Application Management, which enables efficient, secure, and transparent container management. Learn more by visiting the landing page, and be sure to check out the other articles in this series!