Kruise Rollout v0.3.0: How to Master Deployment Batch Release and Traffic Grayscale

By Mingzhou

Preface

Kruise Rollout is an open-source progressive delivery framework from OpenKruise. It provides a set of standard, bypass-style Kubernetes release components that combine traffic release with instance grayscale release, support various release methods (such as canary, blue-green, and A/B testing), and support automated release processes that are non-intrusive to workloads and easy to extend based on custom metrics (such as Prometheus metrics).

The latest version, Kruise Rollout 0.3.0, brings several new features. First, we enhanced the release capability for Deployment, the most widely used workload in the Kubernetes community. Second, we expanded the traffic grayscale capability. Third, we added support for extending to more gateway protocols through pluggable Lua scripts.

Batch Release of Deployment: Deployment can release pods in batches just like StatefulSet or CloneSet.
Header&Cookie-Based North-South Traffic Grayscale: It allows you to divide Layer 7 traffic based on Header&Cookie matching rules and route different traffic groups to instances of different versions, to perform A/B testing or finer-grained traffic scheduling for new features.
Lua Script-Based Ingress Traffic Extension: It allows you to configure a Lua script to create a Kruise Rollout plug-in for more types of traffic components. This plug-in supports more types of Ingress extension protocols.

Concept

Before introducing the new features, let's take a look at the current mainstream release forms of Kubernetes workloads:

1. Rolling Upgrade: The mainstream release mode of native Deployment. You cannot set checkpoints at which the release pauses in this mode.

Advantages: High release efficiency
Disadvantages: The blast radius is large, and large-scale release failures are prone to occur.

2. Canary Release: A release mode supported by Flagger and Kruise Rollout for Deployment. When Deployment is released, a canary version of Deployment is created for verification. After the verification is passed, a full workload upgrade is performed, and the canary version of Deployment is deleted.

Advantages: Rollback does not require rebuilding or re-releasing pods, so it is fast and convenient.
Disadvantages: It consumes additional resources, and pods of the new version have to be released twice (once for the canary and once for the full upgrade). The release is not fully compatible with HPA.

Figure 1: Canary Release Mode

3. Standard Batch Release: A standard batch release is performed using the partition feature provided by StatefulSet or CloneSet. During the release, the metadata (such as the original workload name) remains unchanged, and other workloads are not split.

Advantages: The release does not waste resources, keeps the blast radius under control, and is fully compatible with HPA and other components that need to reference the workload.
Disadvantages: It is difficult for Deployment to support this type of release. Currently, only Kruise Rollout enables Deployment to perform this type of release.

Figure 2: Standard Batch Release Mode

4. Non-Standard Batch Release: The native logic of Deployment does not support batch release. Therefore, the rollout solution proposed by the KubeVela community performs a rolling release across two Deployments. A new Deployment is created for each release, and the old Deployment is scaled in as the new one is scaled out. This means the Deployment is replaced after each release.

Advantages: No additional resources are required during release, and the blast radius can be controlled.
Disadvantages: The workload is split into multiple workloads during release, so there is no unified control plane. This may cause conflicts between release and scaling, and it is difficult to make compatible with components such as HPA, which may leave the release stuck.

Figure 3: Non-Standard Batch Release Method

5. A/B Testing: It divides user traffic into two disjoint paths (A and B) based on certain rules and routes them to pod instances of different versions, so that the capabilities of the new version can be observed, compared, or released in grayscale. In general, A/B testing needs to be combined with canary release or batch release.

Figure 4: A/B Testing

Solution Comparison

Apart from the rolling upgrade provided by Deployment, which does not rely on any third-party components, the release forms above depend to some degree on other components or on an upper-layer PaaS platform. What are the advantages and disadvantages of Kruise Rollout compared with other solutions? We compared it with two solutions that are currently popular in the open-source community: Flagger [1] and Argo-Rollout [2].

In general, the advantages of Kruise-Rollout are summarized below:

Flexibility: Kruise Rollout works as a pluggable bypass component. Once you apply a Rollout resource, the corresponding Deployment gains the standard batch release capability immediately. If you no longer need it, you can delete the Rollout resource at any time (even during a release), and the Deployment immediately reverts to the native rolling release.
Compatibility: Fully compatible with HPA and other third-party components that need to reference the workload.
Easy Access: You only need to apply the configuration for it to take effect. You do not need to migrate pods or workloads, and existing running containers and scaling pipelines are not affected.

Features

Before introducing the new features, let's talk about why the OpenKruise community is obsessed with Rollout:

In Kubernetes, container lifecycle and traffic lifecycle are managed asynchronously, so a Deployment cannot detect when traffic is attached to or detached from its pods. We once saw a customer's traffic component malfunction during a Deployment rolling upgrade, which resulted in a traffic outage. It lasted only ten minutes but caused a large loss.
Bugs in business logic cannot be detected during a Deployment rolling update. Once the new version is fully rolled out, they may cause serious faults. The blast radius is difficult to control because a rolling update proceeds to full release as long as pods are available.
Problems such as a version that runs well in the test environment but fails in production are common. Environment isolation alone cannot solve all of them. In production, it is better to upgrade step by step rather than all at once.

If batch release is used in the scenarios above, the blast radius of a problem can be confined to the grayscale range as far as possible, leaving sufficient time for grayscale verification and observation. However, the native logic of Deployment does not support batch operations. If Argo-Rollout is used, all workloads and pods need to be migrated, which is risky and troublesome to adopt. If Flagger is used, pods still need to be migrated, and double resources are required during release, which is expensive.

At this time, what you need may be Kruise-Rollout. It only takes two steps to make your existing Deployment ready for standard batch release.

New Feature 1: How to Master Deployment Standard Batch Release

Pre-Step

Use an existing Kubernetes cluster or create a new Kubernetes cluster:

Kubernetes version >= 1.19

Note: The requirements of this version are mainly caused by the major changes in Ingress API in 1.19. If you do not need the complex traffic grayscale capability, which means you do not need to configure the TrafficRouting field, you can pull and modify charts to avoid this version requirement.

Step 1: Install Kruise-Rollout with One Click

$ helm install kruise-rollout openkruise/kruise-rollout --version 0.3.0

Step 2: Bind and Issue a Batch Release Rule to Your Deployment

cat <<EOF | kubectl apply -f -
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
  namespace: default
  annotations:
    rollouts.kruise.io/rolling-style: partition
spec:
  objectRef:  # Bind your Deployment
    workloadRef:
      apiVersion: apps/v1
      kind: Deployment
      name: echoserver
  strategy:   # Make your batch release rules
    canary:
      steps:
      - replicas: 1     # The first batch releases one pod. The rollout then pauses and waits for manual confirmation before the next batch.
      - replicas: 60%   # The second batch brings the new version to 60% of pods. The rollout then pauses and waits for manual confirmation.
      - replicas: 100%  # The third batch releases all remaining pods. The rollout completes automatically after this last batch.
EOF
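
To make the step semantics concrete, the following Python sketch computes how many pods run the new version after each batch. It is an illustration only: the helper name `target_updated` is hypothetical, and rounding percentages up is an assumption rather than a documented guarantee of Kruise Rollout (with the five replicas used in the walkthrough below, no rounding is involved).

```python
def target_updated(step, total_replicas):
    """Return the number of new-version pods expected after a batch.

    `step` is either an absolute count (int) or a percentage string
    such as "60%". Rounding percentages up is an assumption.
    """
    if isinstance(step, int):
        return min(step, total_replicas)
    percent = int(step.rstrip("%"))
    # Integer ceiling division avoids floating-point rounding surprises.
    return min(-(-total_replicas * percent // 100), total_replicas)


steps = [1, "60%", "100%"]  # the three steps from the Rollout above
for batch, step in enumerate(steps, start=1):
    print(f"batch {batch}: {target_updated(step, 5)} of 5 pods updated")
```

For a Deployment with five replicas, this yields one, three, and five updated pods for the three batches.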

Step 3: Master the Batch Release of Deployment

From now on, each subsequent release turns the rolling upgrade of the Deployment into a batch release. The following uses a Deployment named echoserver as an example to describe the batch release process.

1. Before Release

Check that the number of Deployment replicas is 5 and that the current version is 789b88f977.

2. Start Releasing the First Batch

We modify an environment variable of the container to trigger the release. You can see that only one pod is released in the first batch, and the version number is d8db56c5b.

3. Continue to Release the Second Batch

After the first batch of pods is released and verified, you can continue with the second batch by confirming the batch with the command-line tool kubectl-kruise. This tool is an extension of kubectl and is currently maintained by the OpenKruise community.

Note: The command to issue the next batch is kubectl-kruise rollout approve rollout/rollouts-demo.

As shown in the preceding process, the Rollout is in the StepUpgrade state while a batch is being released and enters the StepPaused state once the batch is finished.

4. Release the Last Batch

When the second batch of release is confirmed and the last batch is issued, Rollout enters the Completed state, indicating the release is complete.
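
The state transitions described above can be sketched as a small simulation. This is a simplified model rather than the controller's actual logic: the class `RolloutSim` and its methods are hypothetical, and only the state names StepUpgrade, StepPaused, and Completed come from the walkthrough.

```python
class RolloutSim:
    """Toy model of a batch release with manual approval between batches."""

    def __init__(self, batch_count):
        self.batch_count = batch_count
        self.batch = 1
        self.state = "StepUpgrade"  # the first batch starts upgrading

    def batch_ready(self):
        """Called when every pod of the current batch is updated and ready."""
        if self.state != "StepUpgrade":
            return
        # The last batch finishes the rollout; earlier batches wait for approval.
        if self.batch == self.batch_count:
            self.state = "Completed"
        else:
            self.state = "StepPaused"

    def approve(self):
        """Models `kubectl-kruise rollout approve`: move on to the next batch."""
        if self.state == "StepPaused":
            self.batch += 1
            self.state = "StepUpgrade"


sim = RolloutSim(batch_count=3)
history = [sim.state]
for _ in range(3):
    sim.batch_ready()
    history.append(sim.state)
    sim.approve()
    if sim.state == "StepUpgrade":
        history.append(sim.state)
print(" -> ".join(history))
```

With three batches, the model alternates between StepUpgrade and StepPaused twice and then ends in Completed, matching the sequence observed in the echoserver example.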

Note that the rolling release rules still apply within a single batch. In other words, you can adjust the maxUnavailable and maxSurge settings of the Deployment to balance stability and efficiency. For example, the following configurations suit the following scenarios.

Scale out and then scale in in a single batch to ensure stable release to the greatest extent.

kind: Deployment
spec:
  strategy:
    rollingUpdate: 
      maxUnavailable: 0
      maxSurge: 20%

Scale in and then scale out in a single batch to maximize resource usage.

kind: Deployment
spec:
  strategy:
    rollingUpdate: 
      maxUnavailable: 20%
      maxSurge: 0

Scale out and scale in at the same time in a single batch to maximize the release efficiency.


kind: Deployment
spec:
  strategy:
    rollingUpdate: 
      maxUnavailable: 25%
      maxSurge: 25%
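
To see what these settings mean in pod counts, the sketch below converts the percentages for a given number of replicas. It relies on the Kubernetes convention that maxSurge percentages round up and maxUnavailable percentages round down. The function name `rolling_bounds` is hypothetical, five replicas are assumed to match the echoserver example, and how Kruise Rollout applies these values inside a batch may differ in detail.

```python
def rolling_bounds(replicas, max_unavailable, max_surge):
    """Return (min_available, max_total) pods during a rolling update.

    Both settings are whole-number percentages (for example 20 for "20%").
    """
    surge = -(-replicas * max_surge // 100)          # round up
    unavailable = replicas * max_unavailable // 100  # round down
    return replicas - unavailable, replicas + surge


for name, unavail, surge in [
    ("scale out first", 0, 20),
    ("scale in first", 20, 0),
    ("both at once", 25, 25),
]:
    low, high = rolling_bounds(5, unavail, surge)
    print(f"{name}: at least {low} available, at most {high} pods in total")
```

The first configuration never drops below the desired replica count, the second never exceeds it, and the third trades a little of both for speed.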

In addition, the solution fully considers various release scenarios to maximize flexibility:

Consecutive Release Scenario: If v3 is released before the v2 release has finished, v3 restarts the standard batch release process from the first batch.
Quick Rollback: If a release from v1 to v2 is in progress and is rolled back to v1, a quick rollback is performed. By default, the rollback is not performed in batches.
Release Policy Deletion: If you delete the Rollout after or even during a release, the Deployment automatically falls back to the native rolling release. This allows you to make changes quickly in special cases.

New Feature 2: Header&Cookie-Based Traffic Grayscale

In Kruise-Rollout versions earlier than v0.3.0, we provided a traffic canary release solution based on adjusting the traffic weight. However, in most scenarios, Ingress and other traffic components already balance load across pods, which meets everyday traffic canary requirements. For example, 10% of canary replicas automatically receive about 10% of the traffic. Unless you need a specific traffic ratio (for example, 10% of canary replicas receiving only 1% of the traffic), you do not need to configure this capability separately.

However, release-sensitive businesses may require special release forms (such as A/B testing). When a specific batch of marked traffic is routed to the new version of the pods, the traffic of the old and new versions must be isolated. Typical scenarios include:

Making new business features available only to a selected group of users, which reduces the risk caused by the uncertainty of new features.
Isolating the traffic of the old and new versions to facilitate controlled experiments and better observe the effectiveness of the new version's features.

Kruise-Rollout users can use the following configuration to enable this capability:

apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
  namespace: default
  annotations:
    rollouts.kruise.io/rolling-style: partition
spec:
  objectRef:
    workloadRef:
      apiVersion: apps/v1
      kind: Deployment
      name: echoserver
  strategy:
    canary:
      steps:
      - matches:   # Set header&cookie matching rules.
        - headers:
          - name: UserAgent
            type: Exact
            value: iOS
        pause: {}
        replicas: 1
      - replicas: 50%
      - replicas: 100%
      trafficRoutings:
      - ingress:
          classType: nginx
          name: echoserver
        service: echoserver

Compared with the simple batch release configuration, this one adds the Header&Cookie matching rules and a TrafficRouting reference. The configuration here uses Ingress-Nginx as an example. The corresponding Ingress controller must itself support this kind of routing (Nginx provides the data plane capability, and Kruise-Rollout provides the control plane capability).

In this configuration, if a Deployment with ten replicas exists, it will be divided into three batches for release. The specific behavior is listed below:

In the first batch, a total of one pod of the new version and nine pods of the old version are used. Only user traffic that meets the UserAgent=iOS rule is sent to the pods of the new version. The remaining traffic is evenly sent to the remaining nine pods of the old version.
In the second batch, there are five pods of the new version and five pods of the old version, and the traffic matching rule is canceled. All traffic directly goes to the load balancing policy.
In the third batch, there are ten pods of the new version and 0 pods of the old version. The traffic matching rule is canceled, and all traffic directly goes to the load balancing policy.
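
The routing behavior of the three batches can be sketched as a small decision function. This is a simplified model, not the Ingress controller's implementation: the function `choose_version` is hypothetical, only Exact header rules are modeled, and treating the headers inside one match as a logical AND is an assumption.

```python
def choose_version(headers, matches):
    """Decide where one request goes for the current batch.

    Returns "new" or "old" while a matching rule is active, and "any"
    when no rule is set and plain load balancing applies.
    """
    if not matches:
        return "any"
    for match in matches:
        # Assumption: every header rule inside one match must be satisfied.
        if all(headers.get(rule["name"]) == rule["value"]
               for rule in match["headers"]):
            return "new"
    return "old"


# Matching rule of the first batch, copied from the Rollout above.
first_batch = [{"headers": [{"name": "UserAgent", "type": "Exact", "value": "iOS"}]}]

print(choose_version({"UserAgent": "iOS"}, first_batch))      # matched traffic
print(choose_version({"UserAgent": "Android"}, first_batch))  # everything else
print(choose_version({"UserAgent": "iOS"}, None))             # batches 2 and 3
```

In the first batch, only the iOS request reaches the new version. In the second and third batches no rule is set, so every request is spread over all pods regardless of its headers.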

New Feature 3: Ingress Traffic Extension Solution Based on Lua Scripts

With the development of cloud-native technology, cloud-native gateways are flourishing. In addition to Nginx Ingress and the Gateway API provided by Kubernetes, there are many network provider solutions, such as Alibaba Cloud ALB, MSE, and ASM, the community projects Istio, Kong, and Apisix, and gateway solutions and protocols from other companies. From the beginning of its design, Kruise Rollout considered how to support this variety of gateways. Hard-coding each one is time-consuming and laborious, and it is inconvenient for developers from different companies to use and maintain.

In the end, Kruise Rollout chose a Lua script-based approach that lets users support more types of gateway protocols in the form of plug-ins. (This version only supports Ingress-based extension protocols. Other custom resource protocols will be supported in the next version.) Kruise Rollout implements the common parts of the capability, while the specifics of each network provider are handled by a Lua script, so you only need to write the corresponding Lua script for each implementation. See the Nginx and ALB Lua script examples [3] for more information. To make it easier to write your own Lua scripts, the following explains the Lua script for Nginx Ingress (the corresponding Rollout configuration is the one shown in New Feature 2). The script can be placed in a specific directory or a specific ConfigMap.

-- Because the Ingress grayscale release protocol is implemented based on annotations, all operations of this script
-- modify the annotations to the target state. Kruise Rollout then patches the annotations to the
-- canary Ingress resource.
annotations = {}
-- obj.annotations is Ingress.Annotations. This sentence does not need to be changed.
if ( obj.annotations )
then
    annotations = obj.annotations
end
-- This is the standard of nginx grayscale release protocol, and other implementations can be adjusted according to actual situation.
annotations["nginx.ingress.kubernetes.io/canary"] = "true"
-- Nginx's grayscale release protocol mainly has the following changes. To simplify the complexity of switching back and forth between multiple batches, each time,
-- empty these annotations first.
annotations["nginx.ingress.kubernetes.io/canary-by-cookie"] = nil
annotations["nginx.ingress.kubernetes.io/canary-by-header"] = nil
annotations["nginx.ingress.kubernetes.io/canary-by-header-pattern"] = nil
annotations["nginx.ingress.kubernetes.io/canary-by-header-value"] = nil
annotations["nginx.ingress.kubernetes.io/canary-weight"] = nil
-- obj.weight is rollout.spec.strategy.canary.steps[x].weight
-- Indicates the grayscale percentage of the current batch, which is '-1' when it is not set (the lua script does not support nil, so it is represented by '-1'). 
-- If it is not '-1', you need to set obj.weight to annotations.
if ( obj.weight ~= "-1" )
then
    annotations["nginx.ingress.kubernetes.io/canary-weight"] = obj.weight
end
-- obj.matches is rollout.spec.strategy.canary.steps[x].matches (same as data structure). 
-- If no settings are set, this step does not need to be published by A/B Testing, and you can return it directly.
if ( not obj.matches )
then
    return annotations
end
-- For A/B testing, traverse matches and set them in annotations.
-- Note: Nginx does not support multiple headers, so no real traversal is required here, and only the first header of each match is taken.
for _,match in ipairs(obj.matches) do
    -- Note that the array in the lua script starts with the subscript '1'.
    local header = match.headers[1]
    -- cookie
    if ( header.name == "canary-by-cookie" )
    then
        annotations["nginx.ingress.kubernetes.io/canary-by-cookie"] = header.value
    -- header
    else
        annotations["nginx.ingress.kubernetes.io/canary-by-header"] = header.name
        -- Whether it is regular.
        if ( header.type == "RegularExpression" )
        then
            annotations["nginx.ingress.kubernetes.io/canary-by-header-pattern"] = header.value
        else
            annotations["nginx.ingress.kubernetes.io/canary-by-header-value"] = header.value
        end
    end
end
-- The script must return annotations.
return annotations
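
To check what the script produces for a given step, the following sketch transcribes its logic into Python and runs it on the first step of the New Feature 2 configuration. It is for reading and experimentation only, since the plug-in itself must remain a Lua script. The function name `nginx_annotations` and the dictionary shape of `obj` are assumptions that mirror the fields named in the script's comments.

```python
PREFIX = "nginx.ingress.kubernetes.io/"


def nginx_annotations(obj):
    """Python transcription of the Lua script above (illustration only)."""
    annotations = dict(obj.get("annotations") or {})
    annotations[PREFIX + "canary"] = "true"
    # Clear the canary settings left over from a previous batch.
    for key in ("canary-by-cookie", "canary-by-header",
                "canary-by-header-pattern", "canary-by-header-value",
                "canary-weight"):
        annotations.pop(PREFIX + key, None)
    # "-1" stands for "weight not set", as in the Lua script.
    if obj.get("weight", "-1") != "-1":
        annotations[PREFIX + "canary-weight"] = obj["weight"]
    for match in obj.get("matches") or []:
        header = match["headers"][0]  # only the first header is used
        if header["name"] == "canary-by-cookie":
            annotations[PREFIX + "canary-by-cookie"] = header["value"]
        else:
            annotations[PREFIX + "canary-by-header"] = header["name"]
            if header["type"] == "RegularExpression":
                annotations[PREFIX + "canary-by-header-pattern"] = header["value"]
            else:
                annotations[PREFIX + "canary-by-header-value"] = header["value"]
    return annotations


step_one = {
    "annotations": {},
    "weight": "-1",
    "matches": [{"headers": [{"name": "UserAgent", "type": "Exact", "value": "iOS"}]}],
}
result = nginx_annotations(step_one)
for key in sorted(result):
    print(key, "=", result[key])
```

For this input, the result contains the canary flag plus the canary-by-header and canary-by-header-value annotations, and no canary-weight annotation.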

Note: This version is only implemented for Ingress resources. Other custom resources (CRDs) (such as Apisix and Kong) will be supported in the next version. Related PR[4] has been submitted to GitHub. You are welcome to discuss it together.

Planning

More Gateway Protocol Support: Kruise Rollout currently supports multiple types of gateway protocols through Lua script plug-ins, and we will increase our investment in this area. However, given the large and growing number of protocol types, the community maintainers alone cannot cover them all. We hope that more community partners will join us to improve this area.
More Complete Release System: We need to build related capabilities (such as hook calls and Prometheus metrics analysis) to support a more complete release system, including grayscale release, alerting, observability, automatic rollback, and unattended operation. We are currently working closely with the KubeVela community to fill these gaps by integrating with KubeVela's existing workflow system. Whether these capabilities should be built into Kruise Rollout in the future is still open, and we welcome opinions and discussion.

Get Involved

You are welcome to get involved with OpenKruise by joining us via GitHub or Slack.

Slack [5]

References

[1] Flagger
https://github.com/fluxcd/flagger

[2] Argo-Rollout
https://github.com/argoproj/argo-rollouts

[3] Nginx and ALB Lua script examples
https://github.com/openkruise/rollouts/tree/master/lua_configuration/trafficrouting_ingress

[5] Slack channel
https://kubernetes.slack.com/?redir=%2Farchives%2Fopenkruise
