Author: Yi Li
This series of articles:
* Article 1: cloud native infrastructure
* Article 2: cloud native software architecture
* Article 3: cloud native application delivery and O&M (this article)
The year 2020 was full of uncertainty, but also full of opportunity. The sudden COVID-19 pandemic hit the accelerator on the digital transformation of society as a whole. Cloud computing is no longer just a technology; it has become key infrastructure supporting the digital economy and business innovation. As enterprises reshape their IT with cloud computing, cloud-native technologies, which are born in the cloud, grow in the cloud, and maximize the value of the cloud, have been recognized by more and more enterprises as an important means of reducing IT costs and improving efficiency. However, cloud-native transformation is not limited to the technical level of infrastructure and application architecture; it also drives changes in enterprise IT organization, processes, and culture.
According to the CNCF 2020 survey report, 83% of organizations use Kubernetes in production. However, the top three challenges they face are complexity, cultural change, and security.
To accelerate business innovation and cope with Internet-scale challenges, cloud-native application architectures and development methods have emerged. Compared with traditional monolithic architectures, distributed microservice architectures offer faster iteration, lower development complexity, and better scalability and elasticity. However, just as in the Star Wars universe, the Force has both a light side and a dark side: the complexity of deploying, operating, and managing microservice applications increases greatly. A DevOps culture, together with supporting automation tools and platform capabilities, becomes crucial.
DevOps theory had been developing for many years before container technology emerged. However, if the development and operations teams cannot communicate in the same language and collaborate on the same technology, the organizational and cultural barriers between them can never be broken down. Docker container technology standardized the software delivery process: build once, deploy anywhere. Combined with the programmable infrastructure of cloud computing and the declarative API of Kubernetes, continuous integration and continuous delivery of both applications and infrastructure can be automated through pipelines, which greatly accelerates the convergence of development and O&M roles.
Cloud native also reshapes the business value and functions of teams. Some responsibilities of the traditional O&M team, such as application configuration and release, move to the development team, which reduces the labor cost of each release. O&M responsibilities shift toward system stability and IT governance. Site Reliability Engineering (SRE), advocated by Google, addresses the complexity and stability of system O&M through software and automation. In addition, security and cost optimization have become key concerns of cloud O&M.
Security is one of the core concerns of enterprises migrating to the cloud. The agility and dynamism of cloud native pose new challenges to enterprise security. Because cloud security follows a shared responsibility model, enterprises need to understand where their responsibility ends and the cloud provider's begins, and consider how to codify security best practices through tooling and automated processes. In addition, the traditional security architecture protects the perimeter with firewalls and fully trusts any user or service inside it. During the 2020 pandemic, many enterprises needed remote work and remote collaboration between employees and customers, and enterprise applications had to be deployed and to interact across both data centers (IDCs) and the cloud. With the physical security perimeter gone, cloud security is undergoing a profound change.
The COVID-19 pandemic has also made enterprises pay more attention to IT cost optimization. An important advantage of cloud native is that it makes full use of the elasticity of the cloud to provide the computing resources a business needs on demand, avoiding waste and optimizing cost. However, unlike under the traditional budget review system, the dynamic, high-density application deployment of cloud native makes IT cost management more complex.
To this end, cloud-native concepts and technologies continue to evolve to help users reduce potential risks and system complexity. The following describes some new trends in cloud-native application delivery and O&M.
Kubernetes has become a universal and unified cloud control plane.
The word Kubernetes comes from Greek, meaning helmsman or navigator, and shares its root with the English word "cybernetics". Kubernetes has become the de facto standard in container orchestration, thanks not only to Google's halo and the hard work of the CNCF (Cloud Native Computing Foundation), but also to Google's accumulated experience and systematic thinking, gained from Borg, in large-scale distributed resource scheduling and automated O&M. Studying the Kubernetes architecture carefully helps in thinking about some essential problems of scheduling and management in distributed systems.
The core of the Kubernetes architecture is the controller loop, a typical "negative feedback" control system. When a controller detects that the current state differs from the desired state, it continuously adjusts resources to drive the current state toward the desired state. For example, it scales an application out or in to match the declared number of replicas, and automatically reschedules applications after a node goes down.
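The negative-feedback idea can be illustrated with a few lines of Python. This is only a minimal sketch, not Kubernetes code: `ReplicaState`, `reconcile`, `run_until_converged`, and the `max_step` limit are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    desired: int  # replica count declared by the user
    current: int  # replica count actually running


def reconcile(state: ReplicaState, max_step: int = 2) -> int:
    """Run one pass of the loop and return the change that was applied."""
    error = state.desired - state.current  # the feedback signal
    # Clamp the correction so one pass never changes more than max_step replicas.
    delta = max(-max_step, min(max_step, error))
    state.current += delta
    return delta


def run_until_converged(state: ReplicaState, max_passes: int = 100) -> int:
    """Repeat reconcile() until current == desired; return the passes used."""
    passes = 0
    while state.current != state.desired and passes < max_passes:
        reconcile(state)
        passes += 1
    return passes
```

If a node failure later lowers `current`, the same `reconcile` call restores it without any special failure-handling path, because the loop only ever looks at the gap between the two numbers.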
The success of K8s rests on three important architectural choices.
- Declarative API: on Kubernetes, developers only need to define the target state of abstract resources, and the controllers work out how to reach it. Examples are the abstractions for different workload types, such as Deployment, StatefulSet, and Job. This lets developers focus on the application itself rather than on the details of system execution. The declarative API is an important design concept of cloud native. It reduces overall O&M complexity by handing that complexity to the infrastructure, which implements it and optimizes it continuously. In addition, because distributed systems face inherent stability challenges, a level-triggered implementation based on declarative, desired-state APIs yields more robust distributed systems than an edge-triggered, event-driven implementation based on imperative APIs.
- Masking the underlying implementation: K8s uses a series of abstractions, such as LoadBalancer Service, Ingress, CNI, and CSI, so that business applications can use infrastructure through business semantics without caring about differences in the underlying implementation.
- Extensible architecture: all K8s components are implemented on, and interact through, consistent and open APIs. Third-party developers can also provide domain-specific extensions through CRDs (Custom Resource Definitions) and Operators, which greatly broadens the scenarios in which K8s can be applied.
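The difference between edge-triggered and level-triggered handling can be shown with a toy model. The sketch below is an assumption-laden illustration (the function names and the use of `None` to model a lost event are invented here), not how any Kubernetes component is written.

```python
from typing import List, Optional


def edge_triggered(start: int, events: List[Optional[int]]) -> int:
    """Apply each scaling event (a delta) as it arrives.

    None models an event that was lost in transit; the handler never sees it,
    so the resulting replica count silently drifts from what was intended.
    """
    replicas = start
    for delta in events:
        if delta is not None:
            replicas += delta
    return replicas


def level_triggered(desired: int, observed: int) -> int:
    """Compute the correction from the full state, ignoring event history."""
    return desired - observed
```

With `start=3` and three `+1` events of which the middle one is lost, the edge-triggered handler ends at 5 instead of 6. A level-triggered pass that compares the desired value 6 with the observed value 5 still produces the missing `+1`.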
Because of this, the scope of resources and infrastructure managed by Kubernetes now extends far beyond containerized applications. The following are some examples:
- Infrastructure management: unlike open-source Terraform or the Infrastructure as Code (IaC) tools provided by cloud vendors, such as Alibaba Cloud ROS and AWS CloudFormation, Crossplane and AWS Controllers for Kubernetes extend the management and abstraction of infrastructure on top of Kubernetes. In this way, K8s applications and cloud infrastructure can be managed and changed in a consistent manner.
- Virtual machine management: with KubeVirt, K8s can schedule and manage virtual machines and containers in a unified way. Virtualization can make up for some limitations of container technology. For example, in CI/CD scenarios, Windows virtual machines can be used for automated testing.
- IoT device management: edge container technologies such as KubeEdge and OpenYurt provide the ability to manage a large number of edge devices.
- Kubernetes cluster management: the management of Alibaba Cloud container service ACK clusters is itself automated by Kubernetes. ACK Infra supports tens of thousands of Kubernetes clusters deployed around the world, and uses Kubernetes to automate their scaling, fault discovery, and self-healing.
Automatic upgrade of workloads
The K8s controller ideal of "keeping the complexity for yourself and leaving simplicity to others" is admirable. However, implementing an efficient and robust controller is full of technical challenges.
- The built-in K8s workloads have limitations, so some requirements of enterprise application migration cannot be met. Extending K8s through the Operator framework has become a common solution. However, on the one hand, reinventing the wheel for the same requirements wastes resources; on the other hand, it fragments the technology and reduces portability.
- As more and more enterprise IT architectures move from "on Kubernetes" to "in Kubernetes", the large number of CRDs and custom controllers poses challenges to the stability and performance of Kubernetes. Desired-state automation is a double-edged sword: it brings declarative deployment capabilities to applications, but it can also amplify a mistaken operation once that mistake is written into the desired state. When an operation goes wrong, mechanisms such as replica-count maintenance, version consistency, and cascading deletion are likely to enlarge the blast radius.
OpenKruise is Alibaba Cloud's open-source automation engine for managing cloud-native applications. It is also a Sandbox project hosted by the Cloud Native Computing Foundation (CNCF). It grew out of Alibaba's years of experience with containerization and cloud-native technology: it is a set of standard Kubernetes extension components used at scale in Alibaba's internal production environment, together with the technical concepts and best practices suited to large-scale Internet scenarios. OpenKruise is open source and built together with the community. On the one hand, this helps enterprise customers avoid detours, reduce technical fragmentation, and improve stability as they explore cloud native; on the other hand, it encourages the upstream community to gradually improve and enrich the application lifecycle automation capabilities of Kubernetes.
For more information, see "OpenKruise 2021 planning exposure: More than workloads".
A new collaboration interface between development and O&M appears.
The emergence of cloud-native technology has also changed the organizational structure of enterprise IT. To better meet the need for business agility, microservice application architectures have given rise to "two-pizza teams": small, independent, self-contained development teams that can reach consensus more easily and accelerate business innovation. The SRE team has become a horizontal support team that improves R&D efficiency and system stability for the teams above it. As Kubernetes has developed, SRE teams can build their own enterprise application platforms on K8s to promote standardization and automation, allowing application development teams to manage resources and the application lifecycle through self-service. We are now seeing the organizational model change further: new platform engineering teams are emerging.
This is also consistent with how K8s positions itself. Kubernetes is positioned as application-oriented infrastructure and a platform for building platforms, rather than an integrated application platform for developers. More and more enterprises build their own PaaS platforms on Kubernetes to improve R&D and O&M efficiency.
A classic PaaS implementation such as Cloud Foundry establishes its own independent conceptual model, technical implementation, and extension mechanisms. This approach simplifies the user experience, but it also has drawbacks: it cannot be combined with the fast-growing Kubernetes ecosystem, and it cannot take full advantage of new technologies such as Serverless programming models or new computing services such as AI and data analytics. On the other hand, PaaS platforms built on K8s lack a unified architectural design and implementation plan, which leads to many fragmented implementations and is not conducive to sustainable development.
The Open Application Model (OAM) and its Kubernetes implementation, the KubeVela project, are standard model and framework projects for cloud-native application delivery and management, jointly launched by Alibaba Cloud, Microsoft, and the cloud native community. OAM is designed to provide a unified, end-user-oriented application definition model for any cloud infrastructure, including Kubernetes. KubeVela is the PaaS reference implementation of this unified model on Kubernetes.
KubeVela/OAM provides Kubernetes-oriented service abstraction and service assembly capabilities. It can abstract and describe workloads and O&M traits of different implementations in a unified manner, and it provides a plug-in registration and discovery mechanism for dynamic assembly. The platform engineering team can add new capabilities in a consistent way and maintain good interoperability with new application frameworks on Kubernetes. For application development and O&M teams, it implements separation of concerns by decoupling the application definition, O&M capabilities, and infrastructure, making the application delivery process more efficient, reliable, and automated.
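The separation of concerns described above can be sketched as a tiny plug-in registry. This is an illustrative model only and does not use the real KubeVela or OAM APIs: `TRAIT_REGISTRY`, `register_trait`, `render`, the `scaler` and `expose` traits, and the image name are all hypothetical.

```python
from typing import Callable, Dict, List

# Platform team: registers O&M capabilities ("traits") as plug-ins.
TraitHandler = Callable[[dict, dict], dict]
TRAIT_REGISTRY: Dict[str, TraitHandler] = {}


def register_trait(name: str) -> Callable[[TraitHandler], TraitHandler]:
    """Decorator that makes a trait discoverable under the given name."""
    def decorator(handler: TraitHandler) -> TraitHandler:
        TRAIT_REGISTRY[name] = handler
        return handler
    return decorator


@register_trait("scaler")
def scaler(workload: dict, props: dict) -> dict:
    return {**workload, "replicas": props["replicas"]}


@register_trait("expose")
def expose(workload: dict, props: dict) -> dict:
    return {**workload, "service_port": props["port"]}


def render(component: dict, traits: List[dict]) -> dict:
    """Assemble the developer's component with the operator's traits."""
    workload = {"name": component["name"], "image": component["image"], "replicas": 1}
    for trait in traits:
        handler = TRAIT_REGISTRY[trait["type"]]  # plug-in discovery by name
        workload = handler(workload, trait["properties"])
    return workload
```

In this model the developer only writes the component (`name`, `image`), the O&M side only writes the trait list, and the platform team adds a new capability by registering one more handler, without touching either of the other two.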
In the field of cloud-native application model definition, the industry is also exploring other directions. For example, Proton, released by AWS, is a service for cloud-native application delivery. Proton reduces the complexity of deploying and operating containers and Serverless applications, and can be combined with GitOps to improve the automation and manageability of the entire application delivery process. Knative, as supported on Alibaba Cloud Serverless K8s, can run both Serverless containers and functions to implement event-driven applications, allowing developers to use a single programming model while choosing the underlying Serverless compute that gives the best cost-effectiveness.
Ubiquitous security risks lead to security architecture changes
DevSecOps becomes a key factor
Agile development and programmable cloud infrastructure have greatly improved the delivery efficiency of enterprise applications. However, if security risk control is neglected along the way, the losses can be huge. Gartner predicts that, by 2025, 99% of security breaches of cloud infrastructure will be caused by incorrect configuration and management.
In the traditional software development process, security staff step in for a security audit only after system design and development are complete and before release and delivery. This process cannot keep up with rapid business iteration. "Shifting left on security" is therefore receiving more attention: application designers and developers collaborate with the security team as early as possible, and security practices are embedded seamlessly into their work. Shifting security left reduces not only security risks but also the cost of fixing them. IBM researchers found that fixing a security problem during design is about 6 times cheaper than fixing it during coding and about 15 times cheaper than fixing it during testing.
The DevOps R&D collaboration process has thus been extended to DevSecOps. First, it is a change of mindset and culture: security becomes everyone's responsibility rather than the responsibility of the security team alone. Second, security problems are solved as early as possible by shifting security left to the software design stage, which reduces the overall cost of security governance. Finally, risk prevention, continuous monitoring, and timely response are achieved through automated toolchains rather than manual governance.
The technical prerequisite for DevSecOps is a verifiable and reproducible build and deployment process, so that the security of the architecture can be continuously verified and improved across environments such as testing, pre-release, and production. We can put DevSecOps into practice by combining the immutable infrastructure of cloud native technology with declarative policy management (Policy as Code). The following figure shows the simplest DevSecOps pipeline for a container application.
After the code is committed, Alibaba Cloud Container Registry (ACR) can scan the application image and sign it. When the Container Service Kubernetes cluster deploys the application, a security policy verifies the image and can reject any image that fails signature verification. Similarly, if we use Infrastructure as Code to change infrastructure, a scan engine can perform a risk scan before the change is applied. If security risks are found, the change is terminated and an alert is raised.
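The sign-then-verify gate in this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an HMAC with a shared key from the Python standard library as a stand-in for the asymmetric signatures that real image-signing systems use, and the function names are hypothetical rather than part of ACR or Kubernetes.

```python
import hashlib
import hmac
from typing import Optional


def image_digest(image_bytes: bytes) -> str:
    """Content digest of an image; any change to the bytes changes the digest."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


def sign_digest(digest: str, key: bytes) -> str:
    """Build stage: sign the digest after the scan has passed.

    Assumption: HMAC with a shared key stands in for a real signature scheme.
    """
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def admit(digest: str, signature: Optional[str], key: bytes) -> bool:
    """Deploy stage: admit the image only if its signature verifies."""
    if signature is None:  # unsigned images are rejected by default
        return False
    expected = sign_digest(digest, key)
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the digest rather than a mutable tag, an image that is modified after scanning no longer matches its signature and is rejected at deployment time.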
In addition, after an application is deployed to production, every change must go through the automated process described above. This minimizes the security risks caused by human misconfiguration. Gartner predicts that 60% of enterprises will adopt DevSecOps and immutable infrastructure practices by 2025, reducing security incidents by 70% compared with 2020.
Service mesh accelerates the implementation of zero-trust security architecture
Distributed microservice applications increase not only deployment and management complexity but also the attack surface. In the traditional three-tier architecture, security protection focuses mainly on north-south traffic, whereas in a microservice architecture, protecting east-west traffic is the bigger challenge. Under the traditional perimeter protection model, once an application is compromised through a security defect, there is no security control mechanism to stop the internal threat from moving laterally.
"Zero Trust" was first proposed by Forrester around 2010. Simply put, zero trust assumes that any threat can occur and trusts no person, device, or application, whether inside or outside the network. The basis of trust for access control must be rebuilt on authentication and authorization, shifting the security architecture from network-centric to identity-centric and protecting resources with micro-perimeters instead.
Google is vigorously promoting cloud-native security and zero-trust architectures, for example through its BeyondProd methodology. Alibaba and Ant Financial have also introduced the concept and practice of zero-trust architecture in their migration to the cloud. The key elements are:
- Unified identity system: provides an independent identity for each service component in the microservice architecture.
- Unified access authorization model: all calls between services must be authenticated by identity.
- Unified access control policy: access control for all services is centrally managed and uniformly enforced in a standardized manner.
Security architecture is a cross-cutting concern that runs through the entire IT architecture and touches every component. If it is coupled to a specific microservice framework, any adjustment to the security architecture may require every application service to be recompiled and redeployed. In addition, microservice implementers could bypass the security system. A service mesh can provide a loosely coupled, distributed zero-trust security architecture that is independent of application implementation.
The following figure shows the security architecture of the Istio service mesh:
- You can use existing identity services as well as SPIFFE-format identities. An identity can be carried in an X.509 certificate or a JWT.
- Security policies such as authentication, authorization, and service naming are managed in a unified manner through the service mesh control plane API.
- An Envoy sidecar or an edge proxy serves as the policy enforcement point (PEP) that enforces security policies, providing access control for both east-west and north-south service traffic. In addition, the sidecar acts as an application-level firewall for each microservice, and network micro-segmentation minimizes the attack surface.
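The identity-centered, default-deny check that a policy enforcement point performs can be modeled in a few lines. The sketch below is not Istio's API: `AllowRule` and `is_allowed` are hypothetical names, and the SPIFFE IDs use an invented `example.org` trust domain.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class AllowRule:
    source: str              # SPIFFE ID of the calling service
    destination: str         # SPIFFE ID of the called service
    methods: FrozenSet[str]  # HTTP methods the caller may use


def is_allowed(rules: Iterable[AllowRule], source: str, destination: str, method: str) -> bool:
    """Default deny: a call passes only if some rule explicitly allows it."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        for rule in rules
    )
```

The decision depends only on the authenticated identities of the two workloads and never on their network location, which is what makes the same policy usable for both east-west and north-south traffic.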
The service mesh decouples the network security architecture from applications, so that each can evolve and be managed independently, which improves security compliance. In addition, the telemetry that the mesh collects on service calls can be used to analyze the risk of traffic between services and to drive automatic, data-driven, and intelligent defense. Cloud-native zero-trust security is still at an early stage. We expect more security capabilities to sink into the infrastructure in the future.
A new generation of software delivery methods began to emerge
From Infrastructure as Code to Everything as Code
Infrastructure as Code (IaC) is a typical declarative API that has changed how enterprise IT architectures on the cloud are managed, configured, and collaborated on. With IaC tools, we can automatically create, configure, and assemble cloud resources such as cloud servers, networks, and databases.
We can extend the IaC concept to cover the entire delivery and O&M process of cloud-native software, that is, Everything as Code. The following figure shows the various models involved in an application environment, from infrastructure and application model definition to global delivery methods and security systems. All of these application configurations can be created, managed, and changed declaratively.
In this way, we can provide flexible, robust, and automated lifecycle management capabilities for distributed cloud-native applications.
- All configurations can be managed, traced, and audited.
- All configurations can be maintained, tested, understood, and collaborated on.
- All configurations can be statically analyzed to ensure the predictability of changes.
- All configurations can be reproduced in different environments, and all environment differences need to be declared to improve consistency.
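The last two properties, static analysis and explicitly declared environment differences, can be illustrated with a short sketch. The function names and configuration keys are hypothetical and chosen only for this example.

```python
def render_env(base: dict, overrides: dict) -> dict:
    """Build an environment's configuration from a shared base.

    Every environment difference must be declared as an override of a key
    that already exists in the base, so a typo fails before deployment.
    """
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"overrides for undeclared keys: {sorted(unknown)}")
    return {**base, **overrides}


def diff_envs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Rendering `staging` with no overrides and `production` with `{"replicas": 6}` from the same base, then calling `diff_envs`, lists `replicas` as the only difference, which is exactly the difference that was declared.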
Declarative CI/CD practices have gradually attracted attention.
Furthermore, we can manage all of an application's environment configuration in a source code control system, and deliver and change the application toward its desired state through an automated process. This is the core concept of GitOps.
GitOps was first proposed by Alexis Richardson of Weaveworks to provide a set of best practices for the unified deployment, management, and monitoring of applications. In GitOps, all environment information, from application definitions to infrastructure configuration, is treated as source code and managed in Git, and every change is recorded in the Git history. In this way, Git becomes the source of truth. We can efficiently trace historical changes and easily roll back to a specified version. Combined with the declarative APIs and immutable infrastructure advocated by Kubernetes, GitOps ensures that the same configuration is reproducible and avoids the unpredictable stability risks caused by configuration drift in online environments.
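A minimal sketch of the synchronization step, assuming the desired state of each revision is available as a plain dictionary keyed by resource name. `plan_sync` and the resource names are hypothetical and do not correspond to the API of any particular GitOps tool.

```python
from typing import Dict, List, Tuple


def plan_sync(desired: Dict[str, dict], live: Dict[str, dict]) -> List[Tuple[str, str]]:
    """List the actions that make the live environment match a Git revision."""
    actions = []
    for name in sorted(set(desired) | set(live)):
        if name not in live:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))  # drift: not declared in Git
        elif desired[name] != live[name]:
            actions.append(("update", name))  # drift: changed outside Git
    return actions
```

A rollback is the same operation with an older revision passed as `desired`, which is why keeping the full history in Git makes reverting to a specified version straightforward.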
Combined with the DevSecOps automation process described above, we can provide consistent testing and pre-release environments before a service goes online. This lets us catch stability risks earlier and faster, and verify canary release and rollback measures more thoroughly.
GitOps improves delivery efficiency, the developer experience, and the stability of distributed application delivery. Over the past two years, GitOps has been widely used in both Alibaba Group and Ant Financial and has become a standardized delivery method for cloud-native applications. GitOps is still at an early stage of development, and the open source community is continuously improving the related tools and best practices. In 2020, the Weaveworks Flagger project was merged into Flux. Developers can now implement progressive delivery strategies such as canary releases, blue-green releases, and A/B tests in a GitOps manner, which limits the blast radius of a release and improves release stability. At the end of 2020, the CNCF application delivery team officially announced the establishment of the GitOps Working Group. We can expect the community to further promote standardization and technology adoption in this area.
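Progressive delivery can be reduced to a simple loop for illustration: shift traffic in steps and stop at the first unhealthy step. This sketch is not Flagger's or Flux's implementation; `progressive_rollout` and the `healthy_at` callback are hypothetical.

```python
from typing import Callable, List, Tuple


def progressive_rollout(weights: List[int], healthy_at: Callable[[int], bool]) -> Tuple[int, str]:
    """Shift traffic to the new version step by step.

    weights: percentage of traffic sent to the new version at each step.
    healthy_at: hypothetical health check evaluated at the given weight.
    Returns the final traffic weight and a status string.
    """
    applied = 0
    for weight in weights:
        applied = weight
        if not healthy_at(weight):
            # Only `weight` percent of traffic was exposed to the bad version.
            return 0, f"rolled back at {weight}%"
    return applied, "promoted"
```

With steps of 10, 50, and 100 percent, a defect that shows up at the 50 percent step never reaches the remaining traffic, which is how a stepwise release bounds the blast radius.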
The O&M system has evolved from standardization and automation to data and intelligence.
With the growth of microservice applications, the complexity of locating problems and optimizing performance has exploded. Enterprises already have a set of tools in the IT service management field, such as log analysis, performance monitoring, and configuration management. However, these management systems are data silos and cannot provide the end-to-end visibility needed to diagnose complex problems. Many existing tools rely on rule-based monitoring and alerting. In an increasingly complex and dynamic cloud-native environment, rule-based methods are too fragile, costly to maintain, and difficult to scale.
AIOps uses technologies such as big data analysis and machine learning to automate IT O&M processes. By processing large volumes of logs and performance data and analyzing system environment configuration, AIOps gains visibility into the internal and external dependencies of IT systems, improves foresight and insight into problems, and enables autonomous O&M.
Thanks to the development of the cloud-native technology ecosystem, AIOps and technologies such as Kubernetes will reinforce each other and further improve solutions for enterprise IT cost optimization, fault detection, and cluster optimization. Several factors help:
- Standardization of observability: with the development of projects such as Prometheus, OpenTelemetry, and OpenMetrics in the cloud native community, application observability has been further standardized and integrated across logging, monitoring, and tracing, which enriches the data sets available for multi-metric and root cause analysis. The non-intrusive telemetry of a service mesh can capture richer business metrics without modifying existing applications. This improves the accuracy and coverage of AIOps.
- Standardization of application delivery and management: the declarative APIs and desired-state application delivery of Kubernetes provide a more consistent management and O&M experience. The non-intrusive traffic management of a service mesh allows us to manage applications and automate O&M transparently.
By combining Yunxiao, Alibaba Group's DevOps platform, with the release and change system of the container platform, applications can be released unattended. During a release, the system continuously collects metrics, including system data, log data, and business data, and uses algorithms to compare the metrics before and after the release. Once a problem is found, the release can be blocked or even rolled back automatically. With this technology, any development team can complete a release safely without worrying about major failures caused by online changes.
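The before-and-after comparison can be sketched with a deliberately simple rule. The source does not describe the algorithms Yunxiao actually uses, so the threshold check below is only an assumed stand-in, and `release_verdict` and `tolerance` are hypothetical names.

```python
from statistics import mean
from typing import Sequence


def release_verdict(before: Sequence[float], after: Sequence[float], tolerance: float = 0.2) -> str:
    """Compare one metric (for example, error rate) before and after a release.

    Assumption: a plain relative threshold on the mean stands in for the
    real detection algorithm, which is not specified in the article.
    """
    baseline = mean(before)
    current = mean(after)
    if current > baseline * (1 + tolerance):
        return "rollback"
    return "continue"
```

In practice such a check would run repeatedly during the release and over many metrics at once; the sketch shows only the decision for a single metric.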
Cloud Native cost optimization
As enterprises migrate more core businesses from data centers to the cloud, more and more of them urgently need budgeting, cost accounting, and cost optimization for their cloud environments. Moving from a fixed financial cost model to a variable, pay-as-you-go cloud financial model is a significant change in both mindset and technology. However, most enterprises lack a clear understanding of, and the technical means for, cloud financial management. In the FinOps 2020 survey report, nearly half of the respondents (49%) had little or no automation for managing cloud spending. To help organizations better understand cloud costs and IT benefits, the FinOps concept has become popular.
FinOps is an approach to cloud financial management and a transformation of the enterprise IT operating model. Its goal is to improve an organization's understanding of cloud costs so that it can make better decisions. In August 2020, the Linux Foundation announced the establishment of the FinOps Foundation to promote cloud financial management through best practices, education, and standards. Cloud vendors are gradually increasing their support for FinOps to help enterprises adapt their financial processes to the variability and dynamics of cloud resources. For example, AWS Cost Explorer and the Alibaba Cloud expense center help enterprises analyze and allocate costs.
More and more enterprises manage and use infrastructure resources through Kubernetes platforms on the cloud. Containers improve deployment density and application elasticity, which reduces overall computing costs. However, the dynamic nature of Kubernetes makes resource metering and cost allocation more complex. Because multiple containers can be dynamically deployed on the same virtual machine instance and scaled on demand, the underlying cloud resources cannot simply be mapped one-to-one to container applications. In November 2020, the CNCF and the FinOps Foundation released a white paper on cloud financial management for Kubernetes, "FinOps for Kubernetes: Unpacking container cost allocation and optimization", to help people better understand the relevant financial management practices.
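One common way to approach the allocation problem is to split the cost of a node across the workloads on it in proportion to their resource requests. The sketch below assumes this request-based model and considers CPU only; it is an illustration, not the method prescribed by the white paper, and `allocate_node_cost` and the `<idle>` bucket are hypothetical names.

```python
from typing import Dict


def allocate_node_cost(node_cost: float, node_cpu: float, cpu_requests: Dict[str, float]) -> Dict[str, float]:
    """Split one node's cost across workloads in proportion to CPU requests.

    Capacity that nobody requested is reported under "<idle>" so that
    unused spend stays visible instead of being hidden in the workloads.
    """
    requested = sum(cpu_requests.values())
    if requested > node_cpu:
        raise ValueError("CPU requests exceed node capacity")
    shares = {name: node_cost * cpu / node_cpu for name, cpu in cpu_requests.items()}
    shares["<idle>"] = node_cost * (node_cpu - requested) / node_cpu
    return shares
```

For a 16-core node that costs 8.0 per day, a workload requesting 4 cores is charged 2.0, one requesting 8 cores is charged 4.0, and the remaining 2.0 is idle cost that points to a possible optimization.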
Alibaba Cloud container service also has many built-in best practices for cost management and optimization. Many customers want to know how to optimize costs using Kubernetes and resource elasticity. We recommend that enterprises first understand their business types, divide their Kubernetes clusters into different node pools accordingly, and find a balance across multiple dimensions such as stability and performance.
- Daily business: for predictable and relatively stable loads, we can use subscription bare metal instances or large virtual machines to improve resource utilization and reduce costs.
- Planned short-term or periodic business: for example, short-term peaks such as the Double 11 shopping festival and New Year events, or periodic load changes such as month-end settlement. We can use virtual machines or elastic container instances to cope with these peaks.
- Unexpected bursts: for example, breaking news or temporary computing tasks. Elastic container instances can easily scale out by thousands of instances per minute.
For more information about Kubernetes planning, see https://developer.aliyun.com/article/743627.
In the past decade, several major technology trends, such as the migration of infrastructure to the cloud, the upgrading of Internet application architectures, and agile R&D processes, have combined with technical innovations such as containers, Serverless, and service meshes to create and develop the concept of cloud native. Cloud native is redefining computing infrastructure, application architecture, and organizational processes, and it is the inevitable direction of cloud computing. Thanks to all our colleagues in the cloud native era. Let's explore and define the future of cloud native together.
Easter egg: the titles of the three articles in this series pay tribute to the Star Wars series :-) Did you notice?