This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.
DevOps pursues shorter iteration cycles and more frequent releases. However, the more often we release, the more likely we are to introduce faults. More faults lower service availability and hurt the customer experience. To maintain service quality while releasing frequently, Alibaba gradually developed a release strategy that meets the requirements of DevOps.
Before getting into Alibaba's practices, let's briefly review the common release strategies along with their applicable scenarios, advantages, and disadvantages.
Downtime release shuts down services before a release, stops user access, and upgrades all services at once. Releases of this kind are usually infrequent and require thorough testing beforehand.
The downtime release has the following features:
Downtime release is not suitable for Internet companies because the interval between two releases is very long. Features take too long to get from proposal to market, and the team cannot respond quickly to market feedback, which puts the company at a disadvantage in a fully competitive market. Each release also requires downtime, which incurs economic losses.
The term canary release takes its name from a practice that dates back to the early 20th century. At that time, British coal miners carried canaries into the mine. Canaries are more sensitive than humans to toxic gases. If the concentration of toxic gases, such as carbon monoxide, in the mine was too high, the canaries were poisoned first, and the miners knew they should evacuate immediately. A canary release means releasing a new version of the software to some users before releasing it to all users. The new version is validated with real customer traffic so that serious problems are caught early and the release risk is reduced.
In practice, a canary release usually starts with a small proportion of servers. For example, 2% of the servers are upgraded and verified with real traffic, and the feedback determines whether to expand the release or roll back. A monitoring system is usually used to observe the health of the canary instances based on monitoring metrics. If the canary release shows no problem, all the remaining instances are upgraded to the new version. Otherwise, the code is rolled back.
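The expand-or-roll-back decision described above can be sketched as a comparison between canary metrics and baseline metrics. This is a minimal illustration, not Alibaba's implementation: the `Metrics` fields, the function name, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Health metrics sampled from a group of instances (illustrative fields)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile response time in milliseconds


def canary_decision(canary: Metrics, baseline: Metrics,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.2) -> str:
    """Return "expand" if the canary looks healthy, otherwise "rollback".

    The thresholds are hypothetical defaults; real values depend on the service.
    """
    # Roll back if the canary fails noticeably more often than the baseline.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # Roll back if the canary is noticeably slower than the baseline.
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "expand"
```

In a real pipeline, the baseline would come from the instances still running the old version during the same time window, so that traffic fluctuations affect both groups equally.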
Phased release is an extension of canary release. It divides a release into phases or batches, with the number of users increasing progressively in each one. If no problems are found with the new version in the current phase, the number of users is increased and the next phase begins, until all users are included.
Phased release is a zero-downtime release strategy that reduces release risks. It switches gradually from one version to another by adjusting the routing weights between the versions that coexist online. The entire release process takes a long time, and old and new code coexist during this period. Therefore, we need to consider version compatibility during development, and the coexistence of old and new code must not affect feature availability or user experience. When the new version causes an error, phased release can quickly roll back to the earlier version.
Phased release enables more complex and flexible release policies when combined with techniques such as feature switches.
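The gradual switching of routing weights in a phased release can be sketched as follows. The function names and the percentage steps are assumptions made for illustration; a real release system would apply these weights in a load balancer or service registry.

```python
import random


def ramp_schedule(steps):
    """Yield (old_weight, new_weight) pairs in percent, one per phase."""
    for new_weight in steps:
        if not 0 <= new_weight <= 100:
            raise ValueError("weights are percentages between 0 and 100")
        yield 100 - new_weight, new_weight


def pick_version(new_weight, rng=random):
    """Route one request: "new" with a probability of new_weight percent."""
    return "new" if rng.random() * 100 < new_weight else "old"


def rollback():
    """Rolling back means returning all routing weight to the old version."""
    return 100, 0
```

For example, `list(ramp_schedule([5, 25, 50, 100]))` produces the four phases `(95, 5)`, `(75, 25)`, `(50, 50)`, and `(0, 100)`. Because both versions receive traffic in every intermediate phase, the compatibility requirement above applies for the whole duration of the ramp.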
Blue-green deployment involves two identical and independent production environments. One is called the blue environment, and the other is called the green environment. Suppose the green environment is the production environment that users are currently using. To deploy a new version, deploy it to the blue environment and then run a smoke test there to check whether the new version is working properly. If the new version passes the test, the release system updates the routing configuration, and user traffic is directed from the green environment to the blue environment. The blue environment then becomes the production environment. This switch can usually be finished within a second. If a problem occurs, switch the route back to the green environment and debug in the blue environment to find the cause. Blue-green deployment can therefore launch a new version to all users with a single switch, and new features become visible to all users immediately.
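The switch and the rollback described above amount to swapping a pointer between two environments. The sketch below assumes a hypothetical `BlueGreenRouter` class and a caller-supplied `smoke_test` function; it only models the routing decision, not the deployment itself.

```python
class BlueGreenRouter:
    """Tracks which of two environments currently serves user traffic."""

    def __init__(self, live="green", idle="blue"):
        self.live = live   # environment users are routed to
        self.idle = idle   # environment that receives the new version

    def release(self, smoke_test):
        """Switch traffic to the idle environment if its smoke test passes."""
        if not smoke_test(self.idle):
            return False   # the live environment is left untouched
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self):
        """Point traffic back at the previously live environment."""
        self.live, self.idle = self.idle, self.live
```

Keeping the old environment running after the switch is what makes the rollback a single routing change rather than a redeployment.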
A/B testing is very similar to phased release; the two differ in purpose. A/B testing focuses on comparing version A with version B and choosing which one to deploy, so it is more decision-oriented than phased release. Compared with canary release, A/B testing offers more flexible control over weights and traffic.
For example, a feature has two implementation versions: implementation A and implementation B. Through fine-grained traffic control, 50% of users are always directed to implementation A, and the remaining 50% are always directed to implementation B. After the conversion rates of the two implementations are compared, the one with the higher conversion rate is selected as the final version of the feature.
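One common way to make sure a user is always directed to the same implementation is to hash a stable user identifier into a bucket. The sketch below assumes this hashing approach and hypothetical function names; the article does not say how Alibaba implements its traffic control.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, percent_a: int = 50) -> str:
    """Deterministically map a user to "A" or "B" for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100   # stable, 0..99
    return "A" if bucket < percent_a else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors who converted; 0.0 when there were no visitors."""
    return conversions / visitors if visitors else 0.0


def pick_winner(a, b):
    """a and b are (conversions, visitors) tuples; ties go to "A"."""
    return "A" if conversion_rate(*a) >= conversion_rate(*b) else "B"
```

`pick_winner` compares raw conversion rates only. A production decision would normally also check that the difference is statistically significant before choosing a version.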
In the release strategies above, the unit of release is the application. However, a functional module is often a service provided by several applications working together. When the application being released has a defect, the resulting exception may not show up in that application itself. In complex cases, it only shows up in downstream applications. It is very important to find such problems without affecting the user experience. Ideally, a new version that goes online affects only a small number of users. A traditional phased release cannot identify business traffic, so even if a problem occurs on only one server of an application, it may affect all users.
For example, in the phased release on the left side of the following figure, any server of App 1 may route requests to the faulty App 2 server shown in red. In the isolation environment release on the right, the new version is first released to an end-to-end isolation environment, so even if a problem occurs during the release, it affects only a small number of users.
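One way to picture the isolation environment is as routing that keeps tagged traffic on isolated instances at every hop. The sketch below is an assumption about how such routing could work: the `"isolated"` tag, the instance tuples, and the fallback rule are all hypothetical and are not taken from Alibaba's system.

```python
def route(request_tag, instances):
    """Pick candidate instances of one application for a request.

    instances is a list of (name, env) tuples, where env is "isolated" or
    "online". Requests tagged "isolated" stay inside the isolation
    environment when the application has isolated instances. All other
    requests only ever reach online instances.
    """
    wanted = "isolated" if request_tag == "isolated" else "online"
    matches = [name for name, env in instances if env == wanted]
    if matches:
        return matches
    # Assumed fallback: an application without isolated instances serves
    # tagged traffic from its online instances.
    return [name for name, env in instances if env == "online"]
```

With this rule, untagged user traffic can never land on the newly released instance, which is what limits the impact of a faulty release to the small tagged group.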
This section describes the best release practices of Alibaba based on the release process.
Before a release, we need to verify the functional modules and consider how to mitigate the risks that a fault would introduce. Therefore, it is very important to write down a release plan beforehand. A typical release plan covers the following items:
a. Participants in this release
b. Release content
c. Test process
d. Risk description
e. Online verification scheme
f. Risk mitigation scheme for online problems
g. Release steps, for example: release in x batches and suspend for x hours after x batches are released.
Each of the release strategies mentioned above has advantages and disadvantages. You need to select a suitable release strategy based on the scenario's features and needs.
Generally, the test environment is used for preliminary functional testing, so code is updated and released frequently there. If you use phased release with a relatively large number of batches, development efficiency drops substantially. In this situation, a single-server or multi-server single-batch downtime release is a good choice.
For the pre-release environment, we need to consider our own testing needs as well as those of upstream and downstream developers. Therefore, single-batch downtime release is no longer appropriate. We can release in two batches, which keeps part of the environment available throughout the release.
For the online environment, we can release to the traffic isolation environment first and then release to the rest of the online environment in multiple batches.
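The per-environment choices above mostly differ in how many batches the servers are split into. The following sketch shows one way to split a server list into batches. The `BATCHES_PER_ENV` mapping is illustrative: one batch for test and two for pre-release follow the text, while the value for the online environment is an assumed placeholder.

```python
# Batch counts per environment. The "online" value is a hypothetical example.
BATCHES_PER_ENV = {"test": 1, "pre-release": 2, "online": 5}


def split_batches(servers, batch_count):
    """Split servers into batch_count batches of near-equal size, in order."""
    if batch_count < 1:
        raise ValueError("batch_count must be at least 1")
    # Never create more batches than there are servers.
    batch_count = min(batch_count, len(servers)) or 1
    size, extra = divmod(len(servers), batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        # The first `extra` batches take one additional server each.
        end = start + size + (1 if i < extra else 0)
        batches.append(servers[start:end])
        start = end
    return batches
```

For the online environment, the servers in the traffic isolation environment would be released before any of these batches.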
Release policies alone cannot prevent faults. It is very important to carefully observe the monitoring data of applications during and after a release. The monitoring data of core application metrics, such as QPS, RT, success rate, and error rate, can help users detect faults as early as possible. In addition, if the number of released batches and the number of servers per batch are small in the production environment, it is important to configure independent monitoring for the released servers. Otherwise, even if some of their metrics show problems, the data volume is relatively small, and the fault may be submerged in the overall monitoring data.
Most Alibaba applications are deployed in multiple data centers or units. One possible scenario is that the same code and configuration work normally in some data centers or units but fail in others. Therefore, the first batch should cover every combination of data center and unit so that such problems are exposed early. In addition, R&D personnel often focus on the first few batches. If problems occur in later batches, they may not be able to respond quickly.
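A short calculation shows how a fault on a few released servers can be submerged in the overall data. The sketch assumes that every server receives the same share of traffic, and the numbers in the usage note are made up for illustration.

```python
def overall_error_rate(total_servers, released_servers,
                       released_error_rate, baseline_error_rate):
    """Error rate on an aggregate dashboard, assuming equal traffic per server."""
    unreleased_servers = total_servers - released_servers
    return (released_servers * released_error_rate
            + unreleased_servers * baseline_error_rate) / total_servers
```

With 100 servers, 2 of them released and failing 20% of requests, and a 0.1% baseline on the rest, `overall_error_rate(100, 2, 0.20, 0.001)` is about 0.005. The aggregate view shows roughly 0.5% while the released servers themselves sit at 20%, which is why they need their own monitoring.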
Unitization is designed to solve disaster recovery and scalability problems. The preceding figure shows Alibaba's unitization deployment architecture.
Moreover, an application has many monitoring items. If the release cycle is relatively long, R&D personnel cannot watch every monitoring item the whole time. Some degree of intelligent tooling is needed to help them identify the monitoring items that need attention.
Alibaba has designed and implemented a canary release strategy to solve the two preceding problems. It selects 10% of the servers from each data center or unit of an application for the first batch. The unattended intelligent monitoring system sets up independent monitoring for these servers. For each monitoring item, the unattended system compares the metric data of released servers with that of unreleased servers, and the metric data of the same servers before and after the release. If any exception is found, the unattended system reports it to the R&D personnel for further judgment.
This canary release strategy helps R&D personnel identify problems as early as possible, reduces the workload of R&D personnel, and improves development efficiency.
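The two parts of this canary strategy, selecting the first batch and comparing metrics, can be sketched as follows. This is a simplified illustration under stated assumptions: rounding up to at least one server per unit, the relative `tolerance` check, and all function names are hypothetical, and the real unattended system is certainly more sophisticated.

```python
def first_batch(servers_by_unit, percent=10):
    """Pick `percent` of the servers from every data center or unit.

    Assumption: round up, and take at least one server from each
    non-empty unit so that every unit is represented.
    """
    batch = {}
    for unit, servers in servers_by_unit.items():
        if not servers:
            continue
        count = max(1, (len(servers) * percent + 99) // 100)  # integer ceiling
        batch[unit] = servers[:count]
    return batch


def find_anomalies(released, unreleased, before, tolerance=0.2):
    """Return the metrics where released servers deviate too much.

    A metric is flagged when its value on released servers differs by more
    than `tolerance` (relative) from the unreleased servers or from the
    same servers before the release.
    """
    flagged = []
    for name, value in released.items():
        for reference in (unreleased[name], before[name]):
            if abs(value - reference) > tolerance * max(abs(reference), 1e-9):
                flagged.append(name)
                break
    return flagged
```

Flagged metrics are only candidates for attention. As in the strategy above, the final judgment is left to the R&D personnel.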
Selecting release strategies sensibly and following the preceding best practices can confine release risks to a small range, even smaller than the risks of downtime release. A good release practice has a short release cycle and involves only a small amount of code in each release. A long interval between deployments results in more code changes per deployment, which means more defects and a higher risk of downtime. People tend to add more reviews to reduce release risks. This increases the deployment time but does little to reduce the release risk. The result is a self-reinforcing loop that keeps getting worse. We need to break this vicious circle through high-frequency, continuous deployment.
Agile development can shorten the time it takes for products to reach the market. It enables consumers to get the desired functions faster and enables product teams to get consumers' feedback faster and iterate on products accordingly. This article describes multiple release strategies, including their advantages, disadvantages, and applicable scenarios, to help avoid the risks of frequent releases in agile development. Combining these strategies in different scenarios can accelerate the delivery of high-quality products.