As Kubernetes is adopted more widely in production and deployments grow more complex, stability assurance faces more challenges.
Ensuring stability has become a basic requirement for Kubernetes-based cloud products. Stability defects can cause heavy losses, such as user churn, reduced user confidence, and slower product iteration.
Although stability assurance for Kubernetes-based products is important, the industry has no standardized, practice-based solution for it. As a result, the same problems recur within a product and across different products. Best practices are not carried over to products that share the same technology stack, and the best practices developed by different products cannot complement one another.
To address this, the Kubernetes Stability Assurance Handbook was created, drawing on past development practice and stability assurance experience with Kubernetes. The handbook distills the best practices of stability assurance into a comprehensive view of Kubernetes-based stability assurance. It also allows the corresponding tools and services to become shared infrastructure that products with similar technology stacks can reuse, which accelerates the dissemination, iteration, and adoption of these best practices.
This article highlights the core content of stability assurance based on the Kubernetes Stability Assurance Handbook.
To reduce the cost of practice, it is necessary to understand the elements and interactions in cloud products and to break complex systems down along these two basic aspects:
Element (two types)
Interaction (two types, three scenarios in total)
Inside cloud products
Between cloud products
The specific relationship is shown in the following figure:
As the number of elements and interactions grows, systems become more complex and stability assurance becomes harder. Unnecessary complexity should therefore be avoided.
For this reason, it is necessary to map out the current running process, analyze the importance of each process, and organize the overall component diagram to determine the blast radius of each component. On this basis, the participants should also be reviewed to avoid single points of failure in staffing.
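The blast-radius and staffing review described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the `DEPENDS_ON` and `OWNERS` maps, and the rule that a component with fewer than two owners counts as a single-point risk are all hypothetical and do not come from the handbook.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "api-gateway": ["auth-service", "order-service"],
    "order-service": ["database"],
    "auth-service": ["database"],
    "database": [],
}

# Hypothetical ownership map: component -> people able to operate it.
OWNERS = {
    "api-gateway": ["alice", "bob"],
    "order-service": ["carol"],
    "auth-service": ["alice", "dave"],
    "database": ["erin"],
}


def blast_radius(depends_on, failed):
    """Return the components that depend on `failed`, directly or transitively."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first search over the inverted graph.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    # A dependency cycle could lead back to the failed component; exclude it.
    affected.discard(failed)
    return affected


def rank_by_blast_radius(depends_on):
    """List (component, affected count) pairs, largest blast radius first."""
    sizes = {name: len(blast_radius(depends_on, name)) for name in depends_on}
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


def single_owner_components(owners, minimum=2):
    """Return components with fewer than `minimum` owners (assumed risk rule)."""
    return sorted(name for name, people in owners.items() if len(people) < minimum)


if __name__ == "__main__":
    for name, size in rank_by_blast_radius(DEPENDS_ON):
        print(f"{name}: {size} dependent component(s)")
    print("single-owner components:", single_owner_components(OWNERS))
```

In this toy data, the component that everything else depends on has the largest blast radius and also has only one owner, which is the kind of combined technical and staffing risk such a review is meant to surface.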
The running process diagram is shown below:
An example of process importance is shown below:
An example of the interaction between cloud products is shown below:
Based on the preceding analysis of system complexity and the running process, solutions can be proposed and implemented effectively for each problem domain of stability assurance.
Complex systems usually have the following role relationships:
Sorting out the roles at each layer makes it easier for participants to find the right counterpart, which shortens problem handling time.