Release Strategy - Alibaba DevOps Practice Part 22

This article is from Alibaba DevOps Practice Guide written by Alibaba Cloud Yunxiao Team

DevOps pursues shorter iteration cycles and more frequent releases. However, the more often we release, the more likely we are to introduce faults. More faults lower the availability of services and degrade the customer experience. To ensure service quality while releasing frequently, Alibaba has gradually developed a release strategy that meets the requirements of DevOps.

Before getting into Alibaba's practices, let's briefly introduce several common release strategies, along with their applicable scenarios, advantages, and disadvantages.

Common Release Policies

Downtime Release

Downtime release shuts down services before a release, stops user access, and upgrades all services at once. Releases of this kind are usually infrequent and require sufficient testing beforehand.

The downtime release has the following features:

All components that need to be upgraded are integrated into a single release.
Most applications in a project are updated.
The R&D process and testing process before the release often take a long time.
The cost of fixing and rollback is high if any problem occurs during the release.
It takes a long time to complete a downtime release under the collaboration of many teams.
Client and server must be upgraded synchronously.

Downtime release is not suitable for Internet companies because the interval between two releases is very long. It takes too long for a feature to go from proposal to market, so the company cannot respond quickly to market feedback and is at a disadvantage in a fully competitive market. Each release also requires downtime, which incurs economic losses.

Advantages:

It is simple and does not need to consider the compatibility between the old and new versions.

Disadvantages:

Services are unavailable during the release.
The release can only be implemented during the off-peak period, which is often at night and requires many teams to work together.
It is difficult to roll back after a failure.

Applicable Scenarios:

Development testing environments
Non-critical applications with little impact on users
Scenarios with difficult compatibility control

Canary Release

The term canary release comes from a coal-mining practice of the early 20th century. At that time, British coal miners would carry canaries into the mine before mining. Canaries are more sensitive than humans to toxic gases. If the concentration of toxic gases, such as carbon monoxide, in the mine was too high, the canaries were poisoned first, and the miners knew they should evacuate immediately. A canary release means releasing the new version of the software to some users before releasing it to all users. The new version is tested with real customer traffic to check that the software has no serious problems, which reduces the release risk.

In practice, canary releases usually involve a small proportion of servers at the beginning. For example, 2% of the servers are upgraded for traffic verification, getting feedback from the process to decide whether to expand the release or roll back. The monitoring system is usually used in canary releases to observe the health status of the canary instances based on monitoring metrics. If the canary release shows no problem, all the remaining instances are upgraded to the new version. Otherwise, the code is rolled back.
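The decision loop described above can be sketched in Python. This is a minimal illustration, not Alibaba's release system: the names `Metrics`, `pick_canary_servers`, and `canary_decision`, the host names, and the error-rate threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Health metrics sampled from a group of servers (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def pick_canary_servers(servers: list[str], percent: int = 2) -> list[str]:
    """Select the first `percent` percent of servers as canaries, at least one."""
    count = max(1, len(servers) * percent // 100)
    return servers[:count]


def canary_decision(canary: Metrics, baseline: Metrics,
                    max_increase: float = 0.01) -> str:
    """Return "expand" if the canary error rate exceeds the baseline error
    rate by no more than `max_increase`, otherwise "rollback"."""
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"
    return "expand"


servers = [f"host-{i}" for i in range(100)]
canaries = pick_canary_servers(servers)          # 2% of 100 servers
decision = canary_decision(
    canary=Metrics(requests=1_000, errors=50),    # canary instances
    baseline=Metrics(requests=100_000, errors=100),  # remaining instances
)
```

A real release system would look at several metrics over a time window rather than a single error-rate comparison.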

Advantages:

It has little impact on user experience. Only a few users are affected in the canary release process.
It improves release safety, since problems can be caught before the full rollout.

Disadvantages:

Because only a small number of servers are involved in a canary release, some problems may not be revealed.

Applicable Scenarios:

Applications with comprehensive monitoring that is integrated with the release system

Phased Release/Rolling Release

Phased release is an extension of canary release. It divides a release into phases or batches, with the number of users increasing progressively in each phase or batch. If no problems are found with the new version in the current phase, the number of users is increased and the next phase begins, until all users are included.

Phased release is a zero-downtime release strategy that reduces release risks. It switches gradually from one version to another by adjusting the routing weights between versions that coexist online. The entire release process takes a long time, and old and new code coexist during this period. Therefore, we need to consider version compatibility during development, and the coexistence of old and new code must not affect feature availability or user experience. When the new version causes an error, phased release can quickly roll back to the earlier version.
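Switching routing weights between coexisting versions can be sketched as follows. The phase percentages and the names `phase_weights` and `route` are assumptions for illustration; in practice the weights would live in a load balancer or service mesh configuration.

```python
import random


def phase_weights(phases: list[int]) -> list[dict[str, int]]:
    """For each phase, give routing weights (in percent) for old and new versions."""
    return [{"old": 100 - pct, "new": pct} for pct in phases]


def route(weights: dict[str, int], rng: random.Random) -> str:
    """Pick a version for one request according to the current weights."""
    return "new" if rng.randrange(100) < weights["new"] else "old"


# Hypothetical schedule: 5% -> 25% -> 50% -> 100% of traffic on the new version.
schedule = phase_weights([5, 25, 50, 100])

# Rolling back at any phase means restoring the weights to 100% old.
rollback = {"old": 100, "new": 0}

rng = random.Random(42)
sample = [route(schedule[1], rng) for _ in range(10)]  # mostly "old" at 25%
```

Because both versions serve traffic at every intermediate phase, the compatibility requirement described above applies for the whole duration of the schedule.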

Phased release enables more complex and flexible release policies when combined with techniques such as feature switches.

Advantages:

It has little impact on user experience and does not require downtime.
It can control release risks.

Disadvantages:

The release time is relatively long.
It requires a complex release system and load balancer.
Compatibility between the old and new versions needs to be considered.

Applicable Scenarios:

High-availability production environments

Blue-Green Release

Blue-green deployment involves two identical, independent production environments. One is called the blue environment, and the other is called the green environment. Suppose the green environment is the production environment users are currently using. To deploy a new version, deploy it to the blue environment and then run a smoke test there to check whether it works properly. If the new version passes the test, the release system updates the routing configuration, and user traffic is directed from the green environment to the blue environment. The blue environment then becomes the production environment. This switch can usually be completed within a second. If a problem occurs, switch the route back to the green environment and debug in the blue environment to find the cause. Blue-green deployment can therefore launch a new version to all users with a single switch, and new features are visible to all users immediately.
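The switch and rollback described above amount to swapping a single routing pointer. The sketch below assumes a hypothetical `BlueGreenRouter` class and a caller-supplied smoke test; it is not tied to any particular load balancer.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two identical environments receives user traffic."""

    def __init__(self, live: str = "green", idle: str = "blue") -> None:
        self.live = live   # environment currently serving users
        self.idle = idle   # environment where the new version is deployed

    def release(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if its smoke test passes."""
        if not smoke_test(self.idle):
            return False
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Switch traffic back to the previously live environment."""
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter()
switched = router.release(smoke_test=lambda env: True)  # new version verified on blue
```

After a successful `release`, the old environment stays intact as the idle one, which is why `rollback` is as fast as the original switch.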

Advantages:

It is fast during upgrade, switch, and rollback.
It has zero downtime.

Disadvantages:

It is a one-time full switch, so if there is a problem with the release, it will have a significant impact on users.
Twice the server resources are required.
Middleware and applications must support traffic switch in the hot standby cluster.

Applicable Scenarios:

Applications with abundant or on-demand server resources (with support from cloud vendors)

A/B Testing

A/B testing is very similar to phased release. The two differ in purpose. A/B testing focuses on comparing version A and version B and choosing one of them for deployment, so it is more decision-oriented than phased release. Compared with canary release, A/B testing is more flexible in how weights and traffic are switched.

For example, suppose a feature has two implementations, A and B. Through fine-grained traffic control, 50% of users are always directed to implementation A, and the remaining 50% are always directed to implementation B. The conversion rates of the two are then compared, and the implementation with the higher conversion rate (say, A) is selected as the final version of the feature.
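One common way to keep each user on the same implementation is to hash a stable user ID. The sketch below assumes that approach; the experiment name, the function names, and the sample figures are invented for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "checkout-button") -> str:
    """Deterministically map a user to "A" or "B" (roughly a 50/50 split).

    Hashing the user ID keeps each user on the same implementation
    across requests.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    return conversions / visitors if visitors else 0.0


def pick_winner(stats: dict[str, tuple[int, int]]) -> str:
    """`stats` maps variant -> (conversions, visitors); return the variant
    with the highest conversion rate."""
    return max(stats, key=lambda variant: conversion_rate(*stats[variant]))


# Invented figures: A converts 120 of 1000 visitors, B converts 90 of 1000.
winner = pick_winner({"A": (120, 1000), "B": (90, 1000)})
```

This sketch compares raw conversion rates only. A production experiment would normally also check that the difference is statistically significant before choosing a version.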

Advantages:

It has rapid experiment capability.
It has minimal impact on user experience.
It can be tested through the traffic from the production environment.
It can be tested for specific users.

Disadvantages:

It requires capabilities to identify and control complex business traffic.
It needs to consider the relatively complex compatibility issues between the old and new versions.

Applicable Scenarios:

Business exploration and innovation testing
Decision-making for multiple solutions

Traffic Isolation Environment Release

In the release strategies above, the unit of release is a single application. However, a functional module is often a service provided by a combination of multiple applications. When the application being released has a fault, the resulting exception may not show up in that application itself. In complex cases, it does not surface until the request reaches a downstream application. It is very important to find such problems without affecting the user experience. Ideally, after a new version goes online, it affects only a small number of users. A traditional phased release cannot identify business traffic, so a problem on even one server of an application may affect all users.

For example, in the phased release on the left side of the following figure, requests from any App1 server may be routed to the faulty (red) App2 servers. In the isolation environment release on the right, the new version is first released to an end-to-end isolation environment, so even if there is a problem during the release, only a small number of users are affected.
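The routing rule behind an isolation environment can be sketched as follows: a small share of requests is tagged at the entry point, and every application along the call chain routes tagged requests only to isolation instances. The `Instance` type, the tag values, the fallback rule, and the 1% share are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    app: str
    host: str
    env: str  # "isolation" (new version) or "online" (current version)


def request_env_for(user_id: int, isolation_percent: int = 1) -> str:
    """Tag a small, fixed share of users so their traffic enters the
    isolation environment."""
    return "isolation" if user_id % 100 < isolation_percent else "online"


def choose_instances(instances: list[Instance], app: str,
                     request_env: str) -> list[Instance]:
    """Return the instances of `app` that may serve a request tagged
    `request_env`. Untagged ("online") requests never reach isolation hosts."""
    candidates = [i for i in instances if i.app == app and i.env == request_env]
    if not candidates and request_env == "isolation":
        # Assumption: an app with no isolation deployment falls back to online.
        candidates = [i for i in instances if i.app == app and i.env == "online"]
    return candidates


fleet = [
    Instance("App1", "app1-a", "online"),
    Instance("App1", "app1-iso", "isolation"),
    Instance("App2", "app2-a", "online"),
    Instance("App2", "app2-iso", "isolation"),
    Instance("App3", "app3-a", "online"),
]
tagged_targets = choose_instances(fleet, "App2", request_env_for(user_id=0))
```

For this to work, the tag has to be propagated with the request through middleware and downstream calls, which is the design cost noted in the disadvantages.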

Advantages:

It can find some complex problems involving multiple applications.
Only a small number of users are affected when a fault occurs.

Disadvantages:

Independent monitoring of traffic isolation environments is required.
The system design is complex. The middleware and every application along the call chain need to be able to identify the business traffic.

Applicable Scenarios:

Core production business scenarios

Best Release Practices of Alibaba

This section describes the best release practices of Alibaba based on the release process.

Release Plan

Before a release, we need to verify functional modules and consider how to mitigate the risks introduced by a fault. Therefore, it is very important to draw up a release plan beforehand. A typical release plan is listed below:

a. Participants in this release
    Developer
    Tester
    Code reviewer
b. Release content
c. Test process
d. Risk description
e. Online verification scheme
f. Risk mitigation scheme for online problems
g. Release steps
    Release in x batches. Suspend for x hours after x batches are released.
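A plan like the one above can also be kept as structured data so that tooling can check it for completeness. The sketch below is one possible shape; the `ReleasePlan` class, its field names, and the sample values are hypothetical and simply mirror items a to g.

```python
from dataclasses import dataclass


@dataclass
class ReleasePlan:
    developers: list[str]
    testers: list[str]
    code_reviewers: list[str]
    release_content: str
    test_process: str
    risk_description: str
    online_verification: str
    risk_mitigation: str
    batches: int             # release in this many batches
    pause_after_batch: int   # suspend after this many batches are released
    pause_hours: float       # length of the suspension

    def missing_sections(self) -> list[str]:
        """Names of plan sections that are still empty."""
        return [name for name, value in vars(self).items()
                if value in ("", [], None)]


plan = ReleasePlan(
    developers=["dev-a"], testers=[], code_reviewers=["rev-a"],
    release_content="Order service v2 pricing API",
    test_process="Unit, integration, and pre-release regression",
    risk_description="Price calculation may differ for legacy coupons",
    online_verification="Place test orders in the isolation environment",
    risk_mitigation="",
    batches=5, pause_after_batch=1, pause_hours=2.0,
)
gaps = plan.missing_sections()
```

A release pipeline could refuse to start while `missing_sections()` returns anything.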

Use Different Release Policies in Different Environments

Each of the release strategies mentioned above has advantages and disadvantages. You need to select a suitable release strategy based on the scenario's features and needs.

Generally, the test environment is used for preliminary functional testing, so code is updated and released frequently here. If you use phased release and the number of batches is relatively large, the development efficiency will be reduced substantially. In this situation, a single-server or multi-server single-batch downtime release is a good choice.

For the pre-release environment, we need to consider our own testing needs and those of other upstream and downstream developers. Therefore, a single-batch downtime release is no longer appropriate. We can release in two batches instead.

For the online environment, we can release to the traffic isolation environment first and then release to the rest of the online environment in multiple batches.
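The per-environment guidance above can be sketched as a batch planner. The batch counts in `BATCHES_BY_ENV` are illustrative assumptions (the article only specifies one batch for testing, two for pre-release, and "multiple" for online), and `split_into_batches` is a hypothetical helper.

```python
def split_into_batches(servers: list[str], batches: int) -> list[list[str]]:
    """Split servers into `batches` groups whose sizes differ by at most one."""
    batches = max(1, min(batches, len(servers)))
    base, extra = divmod(len(servers), batches)
    result, start = [], 0
    for i in range(batches):
        size = base + (1 if i < extra else 0)
        result.append(servers[start:start + size])
        start += size
    return result


# Hypothetical batch counts per environment.
BATCHES_BY_ENV = {"test": 1, "pre-release": 2, "online": 5}

online_hosts = [f"online-{i}" for i in range(7)]
online_plan = split_into_batches(online_hosts, BATCHES_BY_ENV["online"])
test_plan = split_into_batches(["test-0", "test-1"], BATCHES_BY_ENV["test"])
```

For the online environment, the isolation environment would be released before the first of these batches.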

Focus on Monitoring Alarms during Release

Release policies alone cannot prevent faults. It is very important to carefully observe the monitoring data of applications during and after release. The monitoring data of core application metrics, such as QPS (queries per second), RT (response time), success rate, and error rate, helps detect faults as early as possible. In addition, if the number of batches and the number of servers per batch are small in the production environment, it is important to configure independent monitoring for the released servers. Otherwise, even if some metrics on those servers are abnormal, their data volume is relatively small, and the fault may be drowned out in the overall monitoring data.
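A small calculation shows how a fault on released servers can be drowned out. The figures are invented: 2 of 100 servers are released, each server handles 1,000 requests in the sampling window, the released servers fail 20% of requests, and the others fail 0.1%.

```python
def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0


released = {"requests": 2_000, "errors": 400}     # 2 servers, 20% errors
unreleased = {"requests": 98_000, "errors": 98}   # 98 servers, 0.1% errors

# What a fleet-wide dashboard shows: 498 errors out of 100,000 requests.
overall = error_rate(
    released["errors"] + unreleased["errors"],
    released["requests"] + unreleased["requests"],
)

# What independent monitoring of the released servers shows.
released_only = error_rate(released["errors"], released["requests"])
```

The overall error rate is about 0.5%, which could pass for noise, while the released servers alone show 20%. Monitoring the released servers separately makes the fault obvious.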

Canary Release and Unattended System

Most Alibaba applications are deployed in multiple data centers or units. It is possible for the same code and configuration to work normally in some data centers or units but fail in others. Therefore, the first batch should cover every combination of data center and unit so that such problems are exposed early. In addition, R&D personnel often focus on the first few batches. If problems occur in later batches, they may not be able to respond quickly.

Unitization is designed to solve disaster recovery and scalability problems. The preceding figure shows Alibaba's unitization deployment architecture.

Moreover, an application has many monitoring items. If the release cycle is relatively long, R&D personnel cannot watch every monitoring item all the time. Some intelligent tooling is needed to help them identify the monitoring items that need attention.

Alibaba has designed and implemented a canary release strategy to solve the two preceding problems. The canary release takes 10% of an application's servers from each data center or unit and places them in the first batch. The unattended intelligent monitoring system sets up independent monitoring for these servers. For each monitoring item, the unattended system compares the metric data of released servers with that of unreleased servers, and the metric data of servers before and after release. If any anomaly is found, the unattended system reports it to the R&D personnel for further judgment.
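The two ideas above, a first batch drawn from every unit and an automated two-way metric comparison, can be sketched as follows. This is not the actual unattended system: the function names, the metric names, and the simple relative-deviation threshold are assumptions chosen to keep the example short.

```python
def first_batch(servers_by_unit: dict[str, list[str]],
                percent: int = 10) -> dict[str, list[str]]:
    """Take `percent` percent of the servers from every unit, at least one each."""
    return {
        unit: hosts[:max(1, len(hosts) * percent // 100)]
        for unit, hosts in servers_by_unit.items()
    }


def find_anomalies(released_after: dict[str, float],
                   released_before: dict[str, float],
                   unreleased_now: dict[str, float],
                   tolerance: float = 0.5) -> list[tuple[str, str]]:
    """Flag each metric whose value on released servers deviates by more than
    `tolerance` (relative) from a reference: the metric before the release,
    or the same metric on unreleased servers."""
    references = (("before release", released_before),
                  ("unreleased servers", unreleased_now))
    anomalies = []
    for name, value in released_after.items():
        for label, reference in references:
            base = reference.get(name)
            if base and abs(value - base) / abs(base) > tolerance:
                anomalies.append((name, label))
    return anomalies


batch = first_batch({
    "unit-a": [f"a{i}" for i in range(30)],
    "unit-b": ["b0", "b1"],
})
alerts = find_anomalies(
    released_after={"rt_ms": 120.0, "qps": 100.0},
    released_before={"rt_ms": 50.0, "qps": 105.0},
    unreleased_now={"rt_ms": 52.0, "qps": 98.0},
)
```

Here the response time is flagged against both references while QPS is not, so R&D personnel only need to look at one monitoring item.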

This canary release strategy helps R&D personnel identify problems as early as possible, reduces the workload of R&D personnel, and improves development efficiency.

Continuous Integration and Release

Selecting release strategies sensibly and carrying out releases according to the preceding best practices can confine release risk to a small range, even smaller than that of a downtime release. A good release practice has a short release cycle and involves only a small amount of code in each release. A long interval between deployments means more code changes in each deployment, which brings more defects and a higher risk of downtime. People then tend to add more reviews to reduce release risk. This lengthens deployment time but does little to reduce the risk. The result is a self-reinforcing loop that keeps getting worse. We need to break this vicious circle through high-frequency, continuous deployment.

Summary

Agile development can shorten the time for products to go to market. It enables consumers to get the desired functions faster and enables product teams to get consumers' feedback faster and iterate products accordingly. This article describes multiple release strategies, including their advantages, disadvantages, and applicable scenarios, to help reduce the risks of frequent releases in agile development. Combining these strategies in different scenarios can accelerate the delivery of high-quality products.
