Kubernetes Stability Assurance Handbook – Part 1: Highlights

Part 1 of this 3-part series highlights the core content of stability assurance based on the Kubernetes Stability Assurance Handbook.

By Wupeng


As Kubernetes is adopted more widely and production deployments grow more complex, stability assurance faces more challenges.

Ensuring stability has become a basic requirement for Kubernetes-based cloud products. Stability defects can cause heavy losses, such as lost users, reduced user confidence, and slower product iteration.

Although stability assurance for Kubernetes is important, the industry lacks a standardized, practice-based solution. As a result, the same problems recur within a product and across products, best practices cannot be applied to other products built on the same technology stack, and the practices developed by different products cannot complement one another.

To this end, the Kubernetes Stability Assurance Handbook was created from past development practice and stability assurance experience with Kubernetes. The handbook distills the best practices of stability assurance into a comprehensive understanding of Kubernetes-based stability assurance. It also lets the corresponding tools and services become infrastructure that can be reused by products with similar technology stacks, which accelerates the dissemination, iteration, and adoption of those best practices.

This article highlights the core content of stability assurance based on the Kubernetes Stability Assurance Handbook.

Handbook Objectives

  • Understand stability assurance objectives in one minute
  • Grasp the global view of stability assurance in three minutes
  • Quickly find recommended stability assurance tools or services

Stability Assurance Objectives

  • Meet the stability requirements of services or products
  • Accelerate service or product iterations

Stability Assurance Check Items


Stability Assurance Level




Global View

Practice Process:

  1. Organize the running process diagram and mark whether the process is critical.
  2. Configure observability based on the running process diagram.
  3. Perform controllable governance based on process importance.
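
As a rough illustration of the first two steps, the sketch below records each process step with a critical flag and reports critical steps that still lack observability. The step names, the `ProcessStep` type, and the required signal labels ("metrics", "alerts") are hypothetical and are not taken from the handbook.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessStep:
    name: str
    critical: bool
    # Observability signals already configured for this step (hypothetical labels).
    signals: set = field(default_factory=set)


def missing_observability(steps, required=frozenset({"metrics", "alerts"})):
    """Return {step name: sorted missing signals} for critical steps only."""
    gaps = {}
    for step in steps:
        if not step.critical:
            continue  # non-critical steps are not checked in this sketch
        missing = required - step.signals
        if missing:
            gaps[step.name] = sorted(missing)
    return gaps


# Hypothetical running process, for illustration only.
steps = [
    ProcessStep("create-cluster", critical=True, signals={"metrics", "alerts"}),
    ProcessStep("scale-node-pool", critical=True, signals={"metrics"}),
    ProcessStep("export-report", critical=False),
]
print(missing_observability(steps))  # {'scale-node-pool': ['alerts']}
```

Checking only critical steps is a design choice made for this sketch; a real setup might apply a lighter requirement to non-critical steps instead of skipping them.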

To reduce the cost of practice, it helps to understand the elements and interactions in cloud products and to decompose complex systems into these basic elements and interactions:

  • Element (two types)

    • Cloud product components
    • Cloud products
  • Interaction (two types, three scenarios in total)

    • Inside cloud products

      • Within a component
      • Between components
    • Between cloud products

      • Between cloud products

The specific relationship is shown in the following figure:


As the number of elements and interactions grows, systems become more complex and stability assurance becomes harder. Unnecessary complexity should therefore be avoided.

For this reason, it is necessary to sort out the current running process diagram, analyze the importance of each process, and sort out the overall component diagram to determine the blast radius of each component. On this basis, it is also necessary to review the participants to avoid a single point of failure in staffing.
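
One way to estimate a component's blast radius is to treat the component diagram as a dependency graph and collect everything that depends on the component, directly or transitively. The following is a minimal sketch under that assumption; the component names and the `depends_on` mapping are invented for illustration and do not come from the handbook.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges: for each component, record who depends on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk over the inverted graph, starting at the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical component diagram: each key depends on the components in its set.
depends_on = {
    "api-gateway": {"auth-service", "order-service"},
    "order-service": {"database"},
    "auth-service": {"database"},
    "report-job": {"order-service"},
    "database": set(),
}
print(sorted(blast_radius(depends_on, "database")))
# ['api-gateway', 'auth-service', 'order-service', 'report-job']
```

A component with a large result set is a candidate for stricter governance, while a leaf such as `report-job` returns an empty set.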

The running process diagram is shown below:


An example of process importance is shown below:


An example of the interaction between cloud products is shown below:


Based on the analysis of system complexity and the running process above, solutions can be proposed and implemented effectively for each problem domain of stability assurance.


Practice Process:

  1. Maintain role lists, functional flowcharts, and running process diagrams over the long term.
  2. Detect the occurrence and recovery of problems through multi-level alarm groups.
  3. Handle issues and review problems afterward in a single, dedicated problem-handling group.
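
The second step can be pictured as a routing table from severity level to alarm groups, where both the occurrence and the recovery of a problem are delivered to the same groups. This is only a sketch: the severity levels (`P0` to `P2`), the group names, and the `route_alarm` function are assumptions made for illustration, not part of the handbook or of any alerting tool.

```python
# Hypothetical mapping from severity level to the alarm groups that are notified.
ALARM_GROUPS = {
    "P0": ["oncall-pager", "team-chat"],
    "P1": ["team-chat"],
    "P2": ["daily-digest"],
}

VALID_EVENTS = ("firing", "resolved")


def route_alarm(severity, event):
    """Return (group, event) pairs for one alarm occurrence or recovery."""
    if severity not in ALARM_GROUPS:
        raise ValueError(f"unknown severity: {severity}")
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event: {event}")
    # Occurrence and recovery go to the same groups, so every group that saw
    # a problem start also sees it end.
    return [(group, event) for group in ALARM_GROUPS[severity]]


print(route_alarm("P0", "firing"))
# [('oncall-pager', 'firing'), ('team-chat', 'firing')]
print(route_alarm("P0", "resolved"))
# [('oncall-pager', 'resolved'), ('team-chat', 'resolved')]
```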

Complex systems usually have the following role relationships:


Sorting out the roles at each layer makes it easier for participants to find the right contacts, which shortens problem handling time.

Problem Domain




