In the cloud-native era, enterprises face IT O&M challenges such as complicated architectures, diverse customer demands, and massive volumes of O&M data. As part of their digital transformation, many enterprises urgently need precise alerts, intelligent diagnosis, root-cause location, exception prediction, and automatic repair.
On September 26, 2020, Teng Shengbo, a senior technical expert at Alibaba, delivered a keynote speech titled "Unmanned Operations and Self-service Practices of On-cloud Servers" at the GOPS Global O&M Conference. He shared how the Alibaba Cloud ECS team achieved unmanned operations of on-cloud servers by using AI to empower automated O&M. This approach helps customers reduce the complexity of managing instances on the cloud and keeps instance services running stably and efficiently. This article is excerpted from Teng Shengbo's speech.
Teng Shengbo, senior technical expert of Alibaba
O&M is a service that includes infrastructure software services and manpower services. In an enterprise, it serves the business teams that use the infrastructure. IaaS-based cloud computing is itself an O&M service for the developers and operations teams who use cloud services. With the widespread adoption of cloud computing, most enterprises have already migrated their business to the cloud. Currently, more than 1 million users run their business on Alibaba Cloud's platforms, and that number continues to grow.
As the number of users grows, they face three pain points when operating and maintaining ECS instances:
As a result, the ECS team has to invest a lot of manpower in customer service to solve users' problems efficiently. To keep O&M costs from rising rapidly as the user base grows, the ECS team uses AI to empower O&M management for users.
Just as unmanned retail and driverless vehicles are being developed, the ECS team believes that unmanned operations of on-cloud servers will be realized in the future.
In the ten years since the launch of Alibaba Cloud's elastic computing products, the ECS team has accumulated a great deal of O&M experience and catalogued the abnormal "behavior" of ECS instances. By analyzing this abnormal-behavior data with machine learning, the team has built an unmanned architecture for on-cloud servers and provided a series of self-services. With these, the team has achieved automatic diagnosis, repair, optimization, and O&M of ECS instances, which reduces the complexity of instance management for users and keeps instance services running stably and efficiently.
The O&M of IaaS-based cloud computing can be split into service-side O&M and customer-side O&M. Service-side O&M is the O&M of Alibaba Cloud's own platforms and is usually invisible to users. It involves three layers: infrastructure, basic products, and upper-layer management. These cover the O&M of equipment rooms and physical devices, resource virtualization, resource scheduling, and hot migration. As the number of users increases, these O&M tasks become more and more complex. In contrast, customer-side O&M is visible to users and focuses on the changes and automation that users apply to their ECS instances. It includes capacity expansion, restarts, monitoring, customer service, ticket response, and resource and O&M orchestration.
Alibaba Cloud's unmanned architecture for on-cloud servers provides a series of self-services for users of Alibaba Cloud's platforms. In a broad sense, these self-services span four dimensions: ECS instances, instance lifecycle management, system management and automation, and market and ecosystem, as shown in the following figure.
Self-services in a broad sense
In a narrow sense, Alibaba Cloud's self-services provide users with three functions for ECS instances: diagnosis, repair, and recommendation. The self-service tools include an instance diagnosis tool, recommendations of instance optimization solutions, an automatic repair tool, best-template recommendation, and ECS event automation. These tools can resolve 80% of common ECS issues and reduce the average time to resolve an issue from several hours to several minutes.
Unmanned operations of on-cloud servers run smoothly without human involvement in customer service and without the risk of privacy disclosure. In the future, continually driven by AI and data, ECS instance diagnosis and repair will become more accurate.
According to platform statistics, users may encounter the following problems when using ECS instances:
For this reason, the intelligent diagnosis service covers the ECS system, disk health, network health, and guest OS system configuration. Users can diagnose an instance with a single click.
After the intelligent diagnosis is completed, the ECS team also provides an automatic solution to fix the problems that were found. The solution can repair ECS system services, the network, and disks. Once a problem has been diagnosed, the automatic solution can fix it within 1 to 3 minutes.
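The diagnose-then-repair flow described above can be illustrated with a minimal Python sketch. It is not the ECS team's implementation. The check functions, thresholds, and problem names are hypothetical, and treating guest OS configuration findings as manual follow-ups is an assumption based only on the repair scope listed above (system services, network, and disks).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four diagnosis areas named in the article.
CATEGORIES = ("ecs_system", "disk_health", "network_health", "guest_os_config")
# Assumption: only the areas the article lists as repairable are fixed automatically.
AUTO_REPAIRABLE = {"ecs_system", "disk_health", "network_health"}


@dataclass(frozen=True)
class Finding:
    category: str
    problem: str


# Hypothetical checks: each takes one area's state and returns problem names.
def check_disk(state: dict) -> List[str]:
    return ["disk_almost_full"] if state.get("used_pct", 0) >= 95 else []


def check_network(state: dict) -> List[str]:
    return [] if state.get("nic_up", True) else ["nic_down"]


def check_guest_os(state: dict) -> List[str]:
    return [] if state.get("dhcp_enabled", True) else ["dhcp_disabled"]


CHECKS: Dict[str, Callable[[dict], List[str]]] = {
    "disk_health": check_disk,
    "network_health": check_network,
    "guest_os_config": check_guest_os,
}


def diagnose(instance: Dict[str, dict]) -> List[Finding]:
    """Run every registered check, in category order, against one instance."""
    findings = []
    for category in CATEGORIES:
        check = CHECKS.get(category)
        if check is None:
            continue  # no check registered for this area in the sketch
        for problem in check(instance.get(category, {})):
            findings.append(Finding(category, problem))
    return findings


def plan(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into automatically repairable ones and manual follow-ups."""
    auto = [f for f in findings if f.category in AUTO_REPAIRABLE]
    manual = [f for f in findings if f.category not in AUTO_REPAIRABLE]
    return auto, manual


if __name__ == "__main__":
    instance = {
        "disk_health": {"used_pct": 97},
        "guest_os_config": {"dhcp_enabled": False},
    }
    auto, manual = plan(diagnose(instance))
    print("auto-repair:", auto)
    print("manual follow-up:", manual)
```

Keeping diagnosis and repair planning as separate steps mirrors the article's order: the one-click diagnosis produces findings first, and only then is a repair chosen.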
Beyond automating repair, the ECS team believes that automatic repair should be transparent and compliant. Repairs are carried out by the O&M automation engine of Operation Orchestration Service (OOS) together with Cloud Assistant, which executes commands inside the guest OS. The ECS team has also open-sourced code for OOS and Cloud Assistant so that all the repair logic is visible to users. Using images, snapshots, and data backups of ECS instances, users can quickly roll back a repair. In addition, users can control all permissions through Alibaba Cloud Resource Access Management (RAM) and audit all records through Alibaba Cloud ActionTrail, which keeps the process transparent and compliant.
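The three safeguards in this paragraph (permission control, rollback from a snapshot, and an audit record for every action) can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions. It does not call OOS, Cloud Assistant, RAM, or ActionTrail, and every name in it (AuditLog, guarded_repair, the actor strings) is hypothetical.

```python
import copy
from typing import Callable, Dict, List, Set


class AuditLog:
    """Hypothetical stand-in for an audit trail: an append-only list of records."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def record(self, actor: str, action: str, outcome: str) -> None:
        self.records.append({"actor": actor, "action": action, "outcome": outcome})


def guarded_repair(
    instance: dict,
    repair: Callable[[dict], None],
    verify: Callable[[dict], bool],
    actor: str,
    allowed_actors: Set[str],
    audit: AuditLog,
) -> bool:
    """Apply a repair only if permitted, and roll back if verification fails."""
    # Permission check, standing in for access control.
    if actor not in allowed_actors:
        audit.record(actor, "repair", "denied")
        return False

    # Take a snapshot before touching anything so the repair can be undone.
    snapshot = copy.deepcopy(instance)
    try:
        repair(instance)
        ok = verify(instance)
    except Exception:
        ok = False

    if not ok:
        # Restore the pre-repair state in place.
        instance.clear()
        instance.update(snapshot)
        audit.record(actor, "repair", "rolled_back")
        return False

    audit.record(actor, "repair", "succeeded")
    return True


if __name__ == "__main__":
    audit = AuditLog()
    server = {"nic_up": False}

    def bring_nic_up(state: dict) -> None:
        state["nic_up"] = True

    fixed = guarded_repair(
        server, bring_nic_up, lambda s: s["nic_up"], "ops-bot", {"ops-bot"}, audit
    )
    print(fixed, server, audit.records)
```

Taking the snapshot before the repair runs, rather than after a failure is detected, is what makes the rollback path safe even when the repair itself raises an exception.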
AI and data provide the technical foundation for intelligent diagnosis and automatic repair. On top of the underlying data middle platform, the ECS team collects, cleanses, analyzes, and models data from physical machines, the network, control planes, guest OSs, and the virtualization layer. Together with continuously optimized AI algorithms, this has allowed the team to build user profiles, decision trees, and prediction and recommendation models that make diagnosis and automatic repair more accurate and efficient.
In the current ECS self-service architecture, the control and monitoring center monitors data in real time from the log service, middleware, API request monitoring, the monitoring console, and automatic diagnosis. A machine learning engine is then used to raise alerts and resolve problems, and users can apply OOS to fix problems automatically.
With this AI-driven self-service architecture, Alibaba Cloud ECS detects real-time memory exceptions with an accuracy of over 70% and keeps the latency of the process within 100 seconds. In addition, by integrating expert experience, a case library, and a knowledge base, the ECS team has built a powerful diagnostic decision tree to quickly locate and fix problems.
Over the past two years, the Alibaba Cloud ECS team has invested continuously in building an abnormal-behavior dataset. In the future, the team plans to develop an "ImageNet dataset" for exception prediction within Alibaba Group and release it as an open-source dataset. The team hopes that it will create more value for the development of exception prediction across the industry.
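The monitor, detect, alert, and repair loop described above can be sketched as follows. The article does not say which models the machine learning engine uses, so this sketch swaps in a simple z-score test as a stand-in. The function names, the threshold of 3.0, and the sample values are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List


def is_anomalous(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag a sample that lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def watch(
    history: List[float],
    samples: List[float],
    repair: Callable[[float], None],
) -> List[float]:
    """Feed samples through the detector; alert and trigger repair on anomalies."""
    alerts = []
    for value in samples:
        if is_anomalous(history, value):
            alerts.append(value)   # alerting step
            repair(value)          # automated repair step (hypothetical callback)
        else:
            history.append(value)  # only normal samples extend the baseline
    return alerts


if __name__ == "__main__":
    memory_pct = [50.0, 52.0, 51.0, 49.0, 50.0]
    repaired = []
    alerts = watch(memory_pct, [51.0, 95.0, 50.0], repaired.append)
    print("alerts:", alerts, "repairs triggered for:", repaired)
```

Anomalous samples are deliberately kept out of the baseline so that a single spike does not widen the range the detector treats as normal.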
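A diagnostic decision tree of the kind mentioned above can be illustrated with a small hand-built tree. This sketch assumes nothing about the ECS team's actual tree. The questions, thresholds, diagnoses, and suggested repairs are hypothetical, and in practice they would come from the expert experience, case library, and knowledge base the article describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Leaf:
    diagnosis: str
    repair: Optional[str]  # suggested repair, or None if nothing to do


@dataclass
class Node:
    question: str
    test: Callable[[dict], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def walk(tree: Union[Node, Leaf], facts: dict) -> Tuple[Leaf, List[Tuple[str, bool]]]:
    """Follow the tree for one instance and return the leaf plus the path taken."""
    path = []
    node = tree
    while isinstance(node, Node):
        answer = node.test(facts)
        path.append((node.question, answer))
        node = node.yes if answer else node.no
    return node, path


# Hypothetical tree: check reachability first, then memory pressure.
TREE = Node(
    question="Does the instance respond to ping?",
    test=lambda f: f.get("ping_ok", False),
    yes=Node(
        question="Is memory usage at or above 90%?",
        test=lambda f: f.get("mem_pct", 0) >= 90,
        yes=Leaf("memory_pressure", "restart_leaking_service"),
        no=Leaf("healthy", None),
    ),
    no=Leaf("network_unreachable", "reset_network_config"),
)

if __name__ == "__main__":
    leaf, path = walk(TREE, {"ping_ok": True, "mem_pct": 96})
    print(leaf.diagnosis, leaf.repair)
    for question, answer in path:
        print(" ", question, "->", answer)
```

Returning the path along with the leaf keeps the diagnosis explainable, which fits the article's emphasis on making repair logic visible to users.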