New Ideas of Automated O&M in the Future: CloudOps

Tian Taotao of Alibaba Cloud delivered a speech that introduced a new concept called CloudOps.

By Alibaba Cloud ECS

On December 10, at the 2021 ECS CloudBuild Summit, Alibaba Cloud released the industry's first Cloud Automated Operation and Maintenance White Paper (CloudOps White Paper for short), in which it proposed a CloudOps maturity model.

Tian Taotao, Senior Tech Expert and Head of the Elastic Computing Experience and Control System of Alibaba Cloud, delivered a speech entitled CloudOps: New Ideas for Automated O&M at the summit. He shared his views on the future development trends of cloud O&M and DevOps. The highlights of his speech follow.


1. Cloud and DevOps Need to Be More Closely Integrated

1.1 New Trends in DevOps


DevOps has been widely adopted in the more than ten years since it was first proposed. Several trends have emerged in recent years.

1) The scope and content of DevOps have changed with the rise of public cloud platforms. It is no longer necessary to manage infrastructure as traditional O&M does. DevOps and SRE enable enterprises to build and publish applications at a higher rate of change.

2) The deepening of microservice transformation and service governance and the spread of cloud-native have shown how verticalization and standardization bring the benefit of fast delivery. More enterprise architectures have service-oriented designs, which means the idea of service extends to a larger scope. The surge in the number of applications brings unprecedented challenges to O&M. In a complex mesh of applications, real-time and accurate observability is a huge challenge. At the same time, failures in some applications have had a much larger blast radius than expected.

3) Over the past few years, automation has been the most important strategy in DevOps. However, as enterprise applications change and as organizations and application delivery become faster and more agile, the need for automation through APIs and as-a-service offerings has become more urgent in today's open environment.

Openness brings great challenges. As support shifts from point-to-point arrangements to a single infrastructure that serves many internal and external customers, the ability of each team to troubleshoot problems independently and quickly makes organizations more agile. Therefore, self-service has become an important trend. Only self-service can fully reduce the marginal cost of services. The self-service capability is the most important capability of cloud computing.

1.2 The Cloud Can Leverage the Benefits of DevOps Further

The preceding trends and challenges faced by DevOps can be tackled by making full use of the cloud. Before turning to how, it is worth noting that this may not be a coincidence: DevOps and cloud computing have a lot in common.


The main advantages brought by DevOps are reducing costs, improving delivery efficiency, improving flexibility, and enhancing the reliability of delivery quality. Cloud computing also has great advantages in these four aspects.

Reducing Costs: DevOps can reduce the cost of communication and collaboration between organizations and improve the degree of automation. The cloud can help reduce the hardware procurement and investment in basic resource O&M for enterprises and provide more convenient choices.

Improving Delivery Efficiency: The agile organization and automated builds of DevOps can significantly improve delivery efficiency. The cloud platform is a huge resource pool that can create and release, on demand, the large number of resources that applications require. Combining the cloud and DevOps can significantly speed up the building of both resources and applications.

Improving Flexibility: DevOps has a huge advantage in flexibility, allowing enterprise operators to pay more attention to business innovation, and cloud computing can rapidly and automatically deliver resources to meet operational needs.

Enhancing System Reliability: DevOps improves system reliability through systematic processes, standardization, and tooling. Building tools and platforms helps avoid or reduce human error and the faults it causes. At the same time, efficient organizational integration can improve internal communication efficiency. The primary responsibility of the cloud platform is to ensure reliability and availability. The cloud provides high-availability infrastructure, tools, and service-oriented capabilities, which can reduce system costs and create more flexible, secure, and standardized systems.

DevOps and the cloud can help enterprises reduce costs and increase efficiency.

1.3 Next Form of DevOps: CloudOps

The transformation from traditional R&D and O&M to DevOps has improved the efficiency of organizational culture, application delivery, and deployment. This is a huge advance for system delivery and O&M and enables enterprises to focus on business innovation.

Today, as more enterprises use cloud resources and entrust the responsibility of infrastructure O&M to cloud vendors, we believe a new era of cloud-centered DevOps has come that will redefine DevOps. We have defined a new concept called CloudOps by fully combining the advantages and capabilities of cloud computing and DevOps. CloudOps focuses on how to practice DevOps on the cloud platform and realize the evolution of O&M again.


CloudOps is an extension of traditional IT O&M and DevOps and realizes the re-evolution of O&M through cloud-native architecture. It helps enterprises reduce IT O&M costs, improve delivery efficiency and system flexibility and agility, enhance system reliability, and build a more secure, trusted, and open business platform.

CloudOps Maturity Model

The report shows that almost all enterprises currently recognize the products, services, and capabilities provided by the public cloud, and most enterprises have used DevOps in the public cloud. However, only a few enterprises believe they have fully realized the potential of the cloud.

We believe the cloud needs to be properly managed to achieve the best performance and benefits. To this end, the cloud provides a large number of automated and self-service capabilities to help enterprises. In the process of practicing CloudOps, we need to think about the following questions.

1) The cloud provides a large number of automated tools and self-service capabilities. How can we use these tools better to achieve automation?

2) The cloud platform provides sufficient elasticity. How can elasticity be utilized?

3) How can high reliability and availability on the cloud be realized?

4) Network management, security, and audit capabilities on the cloud are far more challenging. How should they be managed?

5) If cloud resources are poor in cost management, threshold design, and quantification, it will cause a huge waste. How can this be optimized?

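Based on these challenges, we summarized five construction and measurement dimensions of CloudOps: automation, elasticity, reliability, security and compliance, and cost and resource quantification.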


1.4 Automation

One of the core capabilities of DevOps is automation. Similarly, automation is a core capability of the cloud. The cloud platform exposes a large number of open APIs and provides many automation products and capabilities that make resources more automatable and programmable. The benefit is that enterprises do not need many DevOps experts and can instead make full use of the automation capabilities of the cloud platform.


The main automation capabilities provided by the cloud platform include three major parts:

The first part is the Infrastructure as Code (IaC) capability. With the help of IaC tools and open APIs, deployments can be repeated quickly and automatically, and deployment scripts can be kept under version management. IaC uses standardized policies to reduce differences between environments and makes application delivery traceable at the same time. Alibaba Cloud offers services to orchestrate basic resources and support automation, such as Resource Orchestration Service (ROS), and also supports Terraform.
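The declarative idea behind IaC can be illustrated with a minimal Python sketch. It is not how ROS or Terraform is implemented; the resource names, the spec fields, and the `plan_changes` helper are hypothetical, chosen only to show how a desired state kept under version control can be compared with the actual state to derive a repeatable change plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    create: dict  # resources that exist only in the desired state
    update: dict  # resources whose spec differs from the actual state
    delete: list  # resources that exist only in the actual state


def plan_changes(desired: dict, actual: dict) -> Plan:
    """Compare a desired state with the actual state and return the changes."""
    create = {name: spec for name, spec in desired.items() if name not in actual}
    update = {
        name: spec
        for name, spec in desired.items()
        if name in actual and actual[name] != spec
    }
    delete = sorted(name for name in actual if name not in desired)
    return Plan(create, update, delete)


# Hypothetical states: the desired state would normally live in a versioned file.
desired = {
    "web-1": {"type": "instance", "cpu": 2},
    "web-2": {"type": "instance", "cpu": 2},
}
actual = {
    "web-1": {"type": "instance", "cpu": 1},
    "old-db": {"type": "instance", "cpu": 4},
}

plan = plan_changes(desired, actual)
print(plan)
```

Running the same comparison against an environment that already matches the desired state yields an empty plan, which is what makes repeated deployment safe.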

After basic resources and application delivery are completed, daily O&M mainly performs operations on existing resources. O&M task complexity increases as more tasks use the automation mode. Complicated tasks need to be deconstructed to complete O&M automation by combining more atomic tasks. More enterprises are beginning to use the capability of Pipeline (Ops) as Code. Each job unit can be atomized by sorting out and visualizing the dependencies in the context of the execution task. Then, you can efficiently complete unit tasks, reduce the complexity of individual tasks, and maintain and extend functions through task abstraction.
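As a rough illustration of breaking a complicated O&M job into atomic tasks with explicit dependencies, the following sketch orders tasks with Python's standard `graphlib` module. The task names and the `run_pipeline` helper are hypothetical and do not represent any specific Alibaba Cloud product.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on):
    """Run atomic tasks in dependency order.

    tasks maps a task name to a callable that receives earlier results.
    depends_on maps a task name to the set of tasks it must wait for.
    """
    graph = {name: depends_on.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical atomic tasks for a patching job.
def stop_service(results):
    return "stopped"


def apply_patch(results):
    return f"patched after {results['stop_service']}"


def start_service(results):
    return "started"


def health_check(results):
    return results["start_service"] == "started"


TASKS = {
    "stop_service": stop_service,
    "apply_patch": apply_patch,
    "start_service": start_service,
    "health_check": health_check,
}
DEPENDS_ON = {
    "apply_patch": {"stop_service"},
    "start_service": {"apply_patch"},
    "health_check": {"start_service"},
}

order, results = run_pipeline(TASKS, DEPENDS_ON)
print(order)
```

Because each unit only declares what it depends on, a new step can be added by registering one more task and one more dependency entry, without rewriting the rest of the job.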

In addition to the infrastructure automation and the automated O&M of basic resources mentioned earlier, the cloud platform makes a large number of resources programmable and exposes other auxiliary capabilities through open APIs to help manage resources throughout their lifecycles. However, the platform needs to expose more capabilities as the complexity of the business system increases. For example, the event system can report changes to the underlying resources in real time to improve transparency. More metrics can be exposed through the monitoring system. When application problems occur, the self-diagnosis service can shorten the time needed to find them. Problems can then be fixed with one click through Cloud Assistant, which serves as the control and O&M channel.
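The way an event system can drive automated responses can be sketched as a small event router. The event type, the payload fields, and the `EventRouter` class below are hypothetical; a real cloud event service defines its own event names and delivery mechanism.

```python
from collections import defaultdict


class EventRouter:
    """Route resource change events to the handlers registered for their type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def dispatch(self, event):
        # Events with no registered handler are ignored and return an empty list.
        return [handler(event) for handler in self._handlers.get(event["type"], [])]


router = EventRouter()


def restart_if_stopped(event):
    # Hypothetical remediation: decide what to do when an instance stops.
    if event["state"] == "stopped":
        return f"restart {event['resource_id']}"
    return "no action"


router.on("instance.state_changed", restart_if_stopped)

actions = router.dispatch(
    {"type": "instance.state_changed", "resource_id": "i-example", "state": "stopped"}
)
print(actions)
```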

1.5 Elasticity

Elasticity is one of the most important capabilities of cloud computing. Backed by ultra-large-scale resource pools, it can supply resources on demand within minutes and meet the elastic requirements of scenarios of different scales. Flexible elasticity can help enterprises reduce costs and improve availability. Using elastic capabilities on the cloud can comprehensively improve the flexibility and stability of enterprise businesses.


Elasticity capability can be divided into two directions according to business requirements. One is vertical elasticity capability, and the other is horizontal elasticity capability.

Vertical elasticity is suitable for scenarios where applications cannot be scaled out horizontally. In common scenarios, such as monolithic applications, isolated applications, and stateful applications, you need to upgrade or downgrade configurations to cope with business changes.

Horizontal elasticity is suitable for distributed applications and stateless applications. You can scale thousands of computing resources in minutes using Alibaba Cloud's console, APIs, and automation tools.

To lower the cost of using elasticity, Auto Scaling supports several modes for scaling resources automatically. It can also intelligently predict resource demand based on historical records.
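To make horizontal scaling concrete, here is a minimal sketch of a proportional, target-tracking style rule. It is a common textbook approach and an assumption on our part, not the algorithm used by Alibaba Cloud Auto Scaling; the function name, parameters, and numbers are hypothetical.

```python
import math


def desired_capacity(current, metric, target, min_size, max_size):
    """Return the instance count that would bring the metric back to the target.

    current:  number of instances running now
    metric:   observed average load per instance (for example, CPU percent)
    target:   load per instance that the group should hold
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    raw = math.ceil(current * metric / target)
    # Clamp to the configured group limits.
    return max(min_size, min(max_size, raw))


# Four instances at 90% CPU with a 60% target: scale out.
scale_out = desired_capacity(current=4, metric=90.0, target=60.0, min_size=2, max_size=10)

# Four instances at 15% CPU: scale in, but never below the minimum.
scale_in = desired_capacity(current=4, metric=15.0, target=60.0, min_size=2, max_size=10)

print(scale_out, scale_in)
```

The minimum and maximum bounds matter as much as the formula: the minimum protects availability, and the maximum caps cost when a metric spikes.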

1.6 Reliability

The cloud platform provides reliability building capabilities from multiple levels, including IDC, hardware, data, and self-service.


The ultra-large-scale IDCs of cloud computing and multi-zone support enable users to build high-availability solutions on the cloud, such as zone-disaster recovery and geo-disaster recovery, at low cost and with high scalability and reliability. When planning and deploying applications, you need to prioritize the design and deployment of disaster recovery architectures to improve reliability.
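The benefit of multi-zone deployment can be estimated with a short calculation. The sketch below assumes that zones fail independently and that the application stays available as long as at least one zone is up; both are simplifying assumptions, and the 99% figure is a hypothetical input rather than a published SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that survives while any one zone is up."""
    if not 0.0 <= zone_availability <= 1.0 or zones < 1:
        raise ValueError("availability must be in [0, 1] and zones must be >= 1")
    return 1.0 - (1.0 - zone_availability) ** zones


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


single_zone = combined_availability(0.99, 1)
two_zones = combined_availability(0.99, 2)

print(f"{downtime_minutes_per_year(single_zone):.1f} minutes/year in one zone")
print(f"{downtime_minutes_per_year(two_zones):.1f} minutes/year across two zones")
```

In practice, failures are not fully independent and failover itself takes time, so the real gain is smaller than this idealized estimate.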

In terms of data reliability, cloud platforms benefit from economies of scale. This is reflected in multi-copy storage and in SLA guarantees of high data reliability, and the cloud platform exposes these capabilities to users as services through open APIs. Users can take advantage of the snapshot and image capabilities provided by cloud vendors to build highly reliable data backup and disaster recovery.

Observability has attracted a lot of attention in DevOps in recent years. Cloud platforms usually provide the following types of monitoring service capabilities to support different levels of user requirements: cloud resource monitoring, application-layer APM, and monitoring at the user business layer.

In addition to fault tolerance in infrastructure and data, cloud service vendors usually provide fault tolerance for application services to help users build distributed systems with elasticity and fault tolerance. For example, with Application High Availability Service (AHAS), you can implement automated traffic control, business degradation, and plan execution for applications through traffic protection, fault drills, multi-active disaster recovery, and switch plans. You can also use security groups to perform network disconnection drills.
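Traffic control is often built on a rate limiter. The following sketch shows a token bucket, one widely used technique; it is offered as an illustration only and is not a description of how AHAS works. The class name and parameters are hypothetical, and the clock is passed in explicitly so that the behavior is deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity` requests and `rate` requests per second after that."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Rejected requests are where a degradation path would take over.
        return False


bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [
    bucket.allow(0.0),  # first request of the burst
    bucket.allow(0.0),  # second request of the burst
    bucket.allow(0.0),  # bucket is empty
    bucket.allow(1.0),  # one token refilled after one second
]
print(decisions)
```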

1.7 Security and Compliance

According to the Flexera 2021 State of Cloud Report, 81% of enterprises are most concerned about cloud security, ranking first. 75% of enterprises are concerned about cloud compliance. Therefore, security and compliance are the most important topics on the cloud.

The cloud platform provides numerous policies, controls, and technologies that work together to help users ensure data, infrastructure, and application security and protect the cloud computing environment from external and internal network security threats and vulnerabilities.


In terms of security and compliance, the cloud platform is responsible for the security, trust, and audit of infrastructure and products, including identity and access control and management, monitoring, and operations, thus providing customers with highly available and secure cloud services. Customers need to properly configure and leverage the capabilities of platforms and products to build their cloud applications.

The network is the only entry point to all cloud services. Network attacks are the most diverse, the most harmful, and the most difficult to defend against. The cloud computing platform provides a mature network security architecture to deal with various threats from the Internet. Security groups, subnetwork ACLs, and routing policies can be used to ensure communication and isolation among internal networks. Cloud firewalls, application firewalls, and DDoS protection provided by the cloud security center can secure the network of the system.

Action auditing and tracing are important parts of the security lifecycle. They can identify potential security configuration errors, threats, or unexpected behaviors. They can also be used to support quality processes, legal or compliance obligations, and threat identification and response. Similar to the log audit service, the cloud platform provides audit and change tracking capability to facilitate the quick tracing of change scope and source.

Traditional O&M channels rely on SSH, which requires managing keys and opening the corresponding network ports. Improper key management and exposed network ports put the security of cloud resources at risk. Cloud Assistant, the native Alibaba Cloud automated O&M channel, can help customers maintain cloud resources securely and efficiently.
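The port-exposure risk described above can be checked mechanically. The sketch below scans a list of firewall-style rules for management ports that are open to every address. The rule format (`name`, `direction`, `action`, `source_cidr`, `port_from`, `port_to`) is hypothetical and does not match any particular cloud API; only the standard `ipaddress` module is used.

```python
import ipaddress

MANAGEMENT_PORTS = (22, 3389)  # SSH and RDP


def find_exposed_rules(rules):
    """Return (rule name, exposed ports) for ingress rules open to all addresses."""
    findings = []
    for rule in rules:
        if rule["direction"] != "ingress" or rule["action"] != "allow":
            continue
        source = ipaddress.ip_network(rule["source_cidr"])
        if source.prefixlen != 0:  # only 0.0.0.0/0 or ::/0 match every address
            continue
        allowed = range(rule["port_from"], rule["port_to"] + 1)
        exposed = [port for port in MANAGEMENT_PORTS if port in allowed]
        if exposed:
            findings.append((rule["name"], exposed))
    return findings


# Hypothetical rules for illustration.
rules = [
    {"name": "ssh-from-anywhere", "direction": "ingress", "action": "allow",
     "source_cidr": "0.0.0.0/0", "port_from": 22, "port_to": 22},
    {"name": "https-public", "direction": "ingress", "action": "allow",
     "source_cidr": "0.0.0.0/0", "port_from": 443, "port_to": 443},
    {"name": "ssh-from-office", "direction": "ingress", "action": "allow",
     "source_cidr": "203.0.113.0/24", "port_from": 22, "port_to": 22},
    {"name": "all-ports-open", "direction": "ingress", "action": "allow",
     "source_cidr": "0.0.0.0/0", "port_from": 1, "port_to": 65535},
]

findings = find_exposed_rules(rules)
print(findings)
```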

1.8 Cost and Resource Quantification

One of the biggest differences between cloud services and IDCs is that you use resources rather than hold assets. You can quickly create and release resources on the cloud and reduce the usage cost compared with IDCs. According to the Flexera 2021 State of Cloud Report, the second-largest concern of cloud customers is cloud cost and its management.


Take a cloud server as an example. Its resource cost is mainly composed of computing, storage, and network. The billing mode determines the pricing of resources on the cloud. Choosing an appropriate billing mode can save costs. For example, compared with pay-as-you-go billing, preemptible instances can save up to 90% of costs. Different products provide various specifications and billing modes, and appropriate specifications can reduce resource costs significantly. Similarly, you can save considerable costs by improving the utilization rate of resources.
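The effect of billing mode and utilization on cost can be worked through with simple arithmetic. In the sketch below, the hourly price and the utilization figure are hypothetical, and the 90% discount is the upper bound quoted above for preemptible instances, not a guaranteed rate.

```python
def monthly_cost(hourly_price: float, hours: float, discount: float = 0.0) -> float:
    """Cost of running one instance for `hours` at a given discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_price * hours * (1.0 - discount)


def cost_per_useful_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one fully used hour when only part of the capacity is used."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


HOURLY_PRICE = 0.10  # hypothetical pay-as-you-go price per hour
HOURS = 720          # a 30-day month

pay_as_you_go = monthly_cost(HOURLY_PRICE, HOURS)
preemptible_best_case = monthly_cost(HOURLY_PRICE, HOURS, discount=0.90)
effective_price = cost_per_useful_hour(HOURLY_PRICE, utilization=0.25)

print(f"pay-as-you-go: {pay_as_you_go:.2f}")
print(f"preemptible (best case): {preemptible_best_case:.2f}")
print(f"effective hourly price at 25% utilization: {effective_price:.2f}")
```

The last figure shows why utilization matters: an instance that is busy only a quarter of the time effectively costs four times its listed price per useful hour.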

We provide a series of products, covering cost analysis, resource optimization, resource specification, resource usage insight, and automation to realize cost optimization and resource quantification. They can help enterprises reduce unnecessary cloud resource costs.

1.9 Panorama of CloudOps Maturity Model

Cloud O&M is a management process that progresses from simple to complex and from growth to maturity. Its main goal is to reduce costs and improve efficiency. In practice, users approach cloud O&M differently because their stage of cloud adoption and scale of usage differ. Based on the commonly used maturity models, we classify the CloudOps maturity model into five levels.


The responsibility of the cloud platform is to build a solid and reliable infrastructure and a full set of O&M services and capabilities for that infrastructure (most of which are free). Unless the enterprise plans to become a cloud platform itself, any investment it makes in rebuilding platform-level capabilities is a waste of resources to some extent.

Today, we always stress the importance of speed as software development and delivery are experiencing drastic changes. Traditional O&M needs to adopt new ideas, covering everything from the shift from monolithic to distributed applications and microservices architecture to automation and observability. Enterprises should focus less on infrastructure and basic resources and more on the application itself.

We believe enterprises that embrace cloud-native will use new tools and ideas to complete application development and O&M. Cloud platforms and enterprises can evolve together. Cloud platforms are born in the cloud era and can help enterprises seize opportunities.
