Kruise Rollout Enables Progressive Delivery for All Workloads

By Zhao Mingshan (Liheng)

Preface

OpenKruise is an open-source suite from Alibaba Cloud for automating the management of cloud-native applications, and it is currently hosted as a Sandbox project under the Cloud Native Computing Foundation (CNCF). It grew out of Alibaba's years of experience with containerization and cloud-native technology. It is a set of standard Kubernetes extension components used for large-scale applications in Alibaba's internal production environment, and it embodies technical ideas and best practices that closely follow upstream community standards while adapting to large-scale Internet scenarios. In addition to its original workload and sidecar management capabilities, Kruise is now exploring progressive delivery.

What Is Progressive Delivery?

The term progressive delivery originated in large, complex industrial projects. It breaks a complex project into stages and reduces delivery cost and time through continuous, small, closed-loop iterations. With the spread of Kubernetes and cloud-native concepts, and especially since the emergence of continuous deployment pipelines, progressive delivery has provided Internet applications with both infrastructure and implementation methods.

The specific behaviors of progressive delivery can be attached to the pipeline as a product iterates, so the entire delivery pipeline can be regarded as both a product iteration process and a progressive delivery cycle. In practice, progressive delivery relies on A/B testing, canary release, and similar techniques. Take Taobao product recommendation as an example: every time a major feature is released, it goes through a typical progressive delivery process, which improves the stability and efficiency of delivery.

Why Do We Use Kruise Rollout?

Kubernetes provides only the Deployment controller for application delivery and the Ingress and Service abstractions for traffic. It has no standard definition of how to combine these pieces into an easy-to-use progressive delivery solution. Argo Rollouts and Flagger are currently popular progressive delivery solutions in the community, but they differ from our ideas in some capabilities and concepts. First, they support only Deployment, not StatefulSet or DaemonSet, let alone custom operators. Second, they are intrusive. For example, Argo Rollouts cannot work with the native Kubernetes Deployment, and Flagger copies the Deployment created by the business, which changes its name and causes compatibility problems with GitOps or a self-built PaaS.

In addition, open and diverse development is a major feature of cloud native. The Alibaba Cloud Container Team is responsible for evolving the cloud-native architecture of the entire container platform, and it also has a strong demand for progressive application delivery. Therefore, drawing on the community solutions and Alibaba's internal scenarios, we set the following goals when designing Rollout:

Non-Intrusive: It makes no modifications to the definitions of native workload controllers or to user-defined application YAML, which keeps native resources clean and consistent.
Scalability: It supports Kubernetes-native workloads, custom workloads, and multiple traffic scheduling methods (such as Nginx and Istio) in an extensible manner.
Ease of Use: It is easy to use and combines easily with community GitOps tools or a self-built PaaS.

Kruise Rollout: Bypass-Style Progressive Delivery Capabilities

Kruise Rollout is Kruise's abstract definition model for progressive delivery. A complete Rollout definition covers canary release, blue-green release, and A/B test release, and it keeps application traffic matched to the instances actually deployed. The release process can proceed in batches and pause automatically based on Prometheus metrics. Rollout attaches to a workload as a bypass, transparently to the workload itself, and is compatible with multiple existing workload types (Deployment, CloneSet, DaemonSet). The architecture is listed below:

Traffic Scheduling (Canary Release, A/B Test Release, Blue-Green Release) and Phased Release

Canary release and phased release are the most commonly used release methods in progressive delivery practices:

workloadRef selects, in bypass fashion, the workload (Deployment, CloneSet, or DaemonSet) that the Rollout applies to.
canary.steps divides the entire rollout into five phases. The first phase releases only one Pod of the new version, routes 5% of the traffic to it, and waits for manual confirmation before the release continues.
The second phase updates 40% of the Pods to the new version and routes 40% of the traffic to them. After the phase completes, the rollout sleeps for 10 minutes (600 seconds), and the subsequent phases are released automatically.
trafficRoutings sets the Ingress controller of the service to Nginx. This part is designed to be extensible: in addition to Nginx, it can support other traffic controllers (such as Istio and ALB).


apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  objectRef:
    workloadRef:
      apiVersion: apps/v1
      # Deployment, CloneSet, Advanced DaemonSet, etc.
      kind: Deployment
      name: echoserver
  strategy:
    canary:
      steps:
        # route 5% of the traffic to the new version
      - weight: 5
        # wait for manual confirmation before releasing the following steps
        pause: {}
        # optional: number of new-version replicas released in this step; if not set, 'weight' is used (5% here)
        replicas: 1
      - weight: 40
        # sleep for 600s, then release the following steps
        pause: {duration: 600}
      - weight: 60
        pause: {duration: 600}
      - weight: 80
        pause: {duration: 600}
        # no configuration is required for the last batch
      trafficRoutings:
        # echoserver service name
      - service: echoserver
        # nginx ingress
        type: nginx
        # echoserver ingress name
        ingress:
          name: echoserver
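
The way the steps above translate into Pod counts and traffic weights can be sketched in a few lines of Python. This is only an illustration, not Kruise Rollout's implementation: the CanaryStep class, the plan_rollout function, the total of 10 replicas, and the round-up rule for converting a percentage into whole Pods are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CanaryStep:
    """Hypothetical mirror of one entry under canary.steps."""
    weight: int                          # percentage of traffic sent to the new version
    replicas: Optional[int] = None       # explicit Pod count; falls back to weight when unset
    pause_seconds: Optional[int] = None  # None models `pause: {}` (manual confirmation)


def plan_rollout(total_replicas: int, steps: List[CanaryStep]) -> List[dict]:
    """Return, per phase, the new-version Pod count, traffic weight, and pause mode."""
    plan = []
    for step in steps:
        if step.replicas is not None:
            pods = step.replicas
        else:
            # Assumption: a percentage is rounded up to a whole number of Pods.
            pods = (total_replicas * step.weight + 99) // 100
        pause = "manual" if step.pause_seconds is None else f"{step.pause_seconds}s"
        plan.append({"weight": step.weight, "pods": pods, "pause": pause})
    # The last batch needs no configuration: all Pods and all traffic move over.
    plan.append({"weight": 100, "pods": total_replicas, "pause": "none"})
    return plan


# The four configured steps from the manifest above, plus the implicit final batch.
steps = [
    CanaryStep(weight=5, replicas=1),
    CanaryStep(weight=40, pause_seconds=600),
    CanaryStep(weight=60, pause_seconds=600),
    CanaryStep(weight=80, pause_seconds=600),
]

for phase in plan_rollout(10, steps):
    print(phase)
```

With 10 replicas, the sketch yields five phases with 1, 4, 6, 8, and 10 new-version Pods, which matches the five-phase description above under the stated rounding assumption.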

Automate Batching and Pausing Based on Metrics

During a rollout, Prometheus metrics can be analyzed automatically and combined with steps to decide whether the rollout should continue or be suspended. In the example below, the HTTP status codes of the service over the past five minutes are analyzed after each batch is released. If the proportion of successful (non-5xx) responses falls below the threshold set in successCondition, the rollout is suspended.

apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
      - weight: 5
        ...
      # metrics analysis 
      analysis:
        templates:
        - templateName: success-rate
          startingStep: 2 # delay starting the analysis run until the 40% step
          args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local

# metrics analysis template
apiVersion: rollouts.kruise.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    # NOTE: prometheus queries return results in the form of a vector.
    # So it is common to access the index 0 of the returned array to obtain the value
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) / 
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
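
The check that this template expresses can be sketched in Python as follows. It is a simplified model rather than the controller's actual logic: the function names and sample numbers are hypothetical, and it assumes the analysis counts as failed once the number of failing measurements exceeds failureLimit.

```python
from typing import List, Tuple


def success_rate(total_requests: float, server_errors: float) -> float:
    """Share of requests that did not return a 5xx code (the ratio in the query)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - server_errors) / total_requests


def should_suspend(
    samples: List[Tuple[float, float]],
    threshold: float = 0.95,   # mirrors successCondition: result[0] >= 0.95
    failure_limit: int = 3,    # mirrors failureLimit: 3
) -> bool:
    """Return True when the rollout should be suspended.

    Each sample is (total_requests, server_errors) for one measurement interval.
    Assumption: the analysis fails once failures exceed failure_limit.
    """
    failures = 0
    for total, errors in samples:
        if success_rate(total, errors) < threshold:
            failures += 1
            if failures > failure_limit:
                return True
    return False


# Hypothetical measurements: one healthy interval followed by four unhealthy ones.
unhealthy = [(1000, 2), (1000, 80), (1000, 90), (1000, 70), (1000, 60)]
print(should_suspend(unhealthy))

# Three unhealthy intervals stay within the failure limit.
tolerated = [(1000, 2), (1000, 80), (1000, 90), (1000, 70)]
print(should_suspend(tolerated))
```

Separating the ratio from the failure counting keeps the threshold and the tolerance for transient errors independently adjustable, which is the same split the template makes between successCondition and failureLimit.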

Canary Release Practices

1. Assume that a user has deployed the echoserver service on Kubernetes (below) and exposes it externally through Nginx Ingress:

2. Define a Kruise Rollout canary release (one Pod of the new version and 5% of the traffic) and apply it to the Kubernetes cluster with kubectl apply -f:

apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
      - weight: 5
        pause: {}
        replicas: 1
      trafficRoutings:
        ...

3. Upgrade the echoserver image version (1.10.2 -> 1.10.3) and apply the change to the Kubernetes cluster with kubectl apply -f:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  template:
    spec:
      containers:
      - name: echoserver
        image: cilium/echoserver:1.10.3

Once Kruise Rollout detects this change, the canary release process starts automatically. As shown below, a canary Deployment, Service, and Ingress are generated automatically, and 5% of the traffic is routed to the new-version Pods.

4. After confirming over a period of time that the new version shows no exceptions, R&D personnel can run the command kubectl-kruise rollout approve rollout/rollouts-demo -n default to release all remaining Pods. Rollout precisely controls the rest of the process. When the release is complete, all canary resources are reclaimed and the cluster is restored to the user-deployed state.

5. If the new version behaves abnormally during the canary process, change the image back to the previous version (1.10.2) and apply it to the Kubernetes cluster with kubectl apply -f. Kruise Rollout detects this change and reclaims all canary resources to achieve a quick rollback.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  template:
    spec:
      containers:
      - name: echoserver
        image: cilium/echoserver:1.10.2

Summary

As more and more applications are deployed on Kubernetes, platform builders must find a balance between rapid business iteration and application stability. Kruise Rollout is OpenKruise's new exploration in the field of progressive delivery. It aims to solve the problems of traffic scheduling and batch deployment in application delivery. Kruise Rollout has officially released v0.1.0 and is integrated with the community OAM KubeVela project, so Vela users can quickly deploy and use Rollout capabilities through Addons. We also hope community users will join us in exploring the application delivery field together.

GitHub: https://github.com/openkruise/rollouts
Official: https://openkruise.io/
Slack: Channel in Kubernetes Slack
