As a database system that has been iterated for more than 15 years at Alibaba and Alibaba Cloud, Tair (Redis® OSS-Compatible) relies on core advantages such as high performance, low latency, distributed nature, and high reliability. It has been widely applied in multiple industries such as AI, E-commerce, gaming, transportation, education, and healthcare.
At critical moments such as E-commerce sales promotions, gaming battles, or financial transactions, every database O&M change is like "walking a tightrope." For a long time, the master-slave architecture (Standalone) of the Redis® open source community has had a defect that gives developers headaches: during version upgrades or high availability switchovers, connection interruptions and errors lasting several seconds are inevitable. In modern business scenarios that demand an extreme experience, a service jitter of just a few seconds often means lost orders or disconnected users.
Can a truly imperceptible switchover be achieved, with no errors, no connection interruptions, and the business continuously online? The Alibaba Cloud Tair team's answer is yes. This article analyzes Tair's innovative practices in imperceptible switchover technology. The feature significantly reduces instance unavailability time during O&M operations such as minor version upgrades and high availability switchovers, so that clients do not perceive obvious performance fluctuations.
For the open source database Valkey or Redis®, the two currently supported architectures are as follows:
Standalone: In the standard master-slave architecture, the primary database usually provides read and write services, and the secondary database serves only as a replacement node in case of failure. The mainstream high availability solution in the community is to introduce the Sentinel component as an independent monitoring and arbitration layer to continuously detect the survival status and replication health of the primary database and secondary database. When the primary database is detected to be continuously unreachable, failover is automatically initiated after a majority of Sentinel instances reach a consensus.
Cluster: The cluster architecture integrates sharding and high availability. The entire keyspace is divided into 16,384 slots distributed across multiple master nodes, and each master can also be configured with multiple replicas. When a master fails, its corresponding replica is automatically promoted to the new master through the voting and election mechanism within the cluster and takes over all slots managed by that master, which realizes automatic master-slave switchover. Cluster nodes synchronize status via the gossip protocol, and the client switches routes automatically when it receives a MOVED/ASK redirection.

Figure 1: High-availability mechanisms of Standalone and Cluster architectures
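To make the slot division concrete, here is a minimal sketch of how open source Redis® and Valkey cluster map a key to one of the 16,384 slots: CRC16 (XMODEM variant) of the key, modulo 16384, hashing only the hash tag when the key contains a non-empty `{...}` section. The class and method names are illustrative and are not taken from any client library.

```java
import java.nio.charset.StandardCharsets;

final class KeySlot {
    static final int SLOT_COUNT = 16384;

    // CRC16/XMODEM: polynomial 0x1021, initial value 0, processed MSB first.
    static int crc16(byte[] data, int from, int to) {
        int crc = 0;
        for (int i = from; i < to; i++) {
            crc ^= (data[i] & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // If the key contains a non-empty {tag}, only the tag is hashed,
    // so related keys can be forced into the same slot.
    static int slotOf(String key) {
        byte[] b = key.getBytes(StandardCharsets.UTF_8);
        int from = 0, to = b.length;
        int open = indexOf(b, '{', 0);
        if (open >= 0) {
            int close = indexOf(b, '}', open + 1);
            if (close > open + 1) {
                from = open + 1;
                to = close;
            }
        }
        return crc16(b, from, to) % SLOT_COUNT;
    }

    private static int indexOf(byte[] b, char c, int start) {
        for (int i = start; i < b.length; i++) {
            if (b[i] == (byte) c) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(slotOf("123456789"));
        // Both keys share the hash tag "user1000", so they land in the same slot.
        System.out.println(slotOf("{user1000}.following") == slotOf("{user1000}.followers"));
    }
}
```

Hash tags are the usual way to keep multi-key commands inside one slot in the cluster architecture, which is the slot limitation that the master-replica architecture does not have.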
According to RESP, the standard protocol of open source Redis®, after a primary/secondary failover occurs in the cluster architecture, the secondary database returns a MOVED error to requests on existing connections. On receiving MOVED, the client refreshes its route table and retries the command, which makes the switchover imperceptible.
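The client side of this flow can be sketched as follows. This is a simplified illustration, not the code of any real client: the `MovedRouting` class and its per-slot map are hypothetical, and production clients typically refresh the whole slot map rather than a single entry. The only fact relied on is the error format `-MOVED <slot> <host>:<port>`.

```java
import java.util.HashMap;
import java.util.Map;

final class MovedRouting {
    // Hypothetical route table: slot number -> "host:port" of the owning node.
    private final Map<Integer, String> slotOwner = new HashMap<>();

    // Parses "-MOVED <slot> <host>:<port>", records the new owner, and returns
    // the node on which the command should be retried. Returns null for any
    // other error so the caller can surface it unchanged.
    String handleError(String error) {
        String text = error.startsWith("-") ? error.substring(1) : error;
        String[] parts = text.trim().split("\\s+");
        if (parts.length != 3 || !parts[0].equals("MOVED")) {
            return null;
        }
        int slot = Integer.parseInt(parts[1]);
        slotOwner.put(slot, parts[2]);
        return parts[2];
    }

    String ownerOf(int slot) {
        return slotOwner.get(slot);
    }

    public static void main(String[] args) {
        MovedRouting routing = new MovedRouting();
        String retryOn = routing.handleError("-MOVED 100 10.0.0.2:6379");
        System.out.println("retry on " + retryOn + ", slot 100 now routed to " + routing.ownerOf(100));
    }
}
```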
When implementing the direct connection cluster architecture (non-Proxy cluster architecture), Tair did not simply adopt the open source architecture in its entirety but transformed it. First, it introduced the virtual IP address (VIP) mechanism to ensure link connectivity after switchover to achieve imperceptible switchover. Second, it abandoned the gossip protocol and used a centralized Config Server to uniformly distribute the route table. This "strong management" pattern eliminates the uncertainty of distributed protocols and significantly improves cluster stability.

Figure 2: Schematic of Alibaba Cloud Tair direct connection cluster architecture
• key1: 100 indicates that key1 is located in slot 100. A request to the master node at VIP1-1 is accepted directly.
• key1: 100 A request to the replica node at VIP1-2 returns MOVED.
• key2: 200 A request to the master node at VIP2-1 returns MOVED, because the slot range it manages is 8192 to 16383, which does not include the slot where the key is located.
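The three cases can be reproduced with a small model of the node-side decision. This is only a sketch under assumptions: the `ShardNode` class is hypothetical, the slot range 0 to 8191 for the VIP1 shard is inferred from the 8192 to 16383 range stated for VIP2-1, and the MOVED target is shown as a VIP name rather than a real address.

```java
final class ShardNode {
    final String vip;
    final boolean master;
    final int firstSlot;
    final int lastSlot;

    ShardNode(String vip, boolean master, int firstSlot, int lastSlot) {
        this.vip = vip;
        this.master = master;
        this.firstSlot = firstSlot;
        this.lastSlot = lastSlot;
    }

    // Returns "ACCEPT" when this node is the master that owns the slot,
    // otherwise a MOVED-style reply that names the owning master.
    String route(int slot, String ownerVip) {
        boolean ownsSlot = slot >= firstSlot && slot <= lastSlot;
        if (master && ownsSlot) {
            return "ACCEPT";
        }
        return "MOVED " + slot + " " + ownerVip;
    }

    public static void main(String[] args) {
        ShardNode vip11 = new ShardNode("VIP1-1", true, 0, 8191);      // assumed range
        ShardNode vip12 = new ShardNode("VIP1-2", false, 0, 8191);     // replica of VIP1-1
        ShardNode vip21 = new ShardNode("VIP2-1", true, 8192, 16383);  // range from the text

        System.out.println(vip11.route(100, "VIP1-1")); // master that owns slot 100
        System.out.println(vip12.route(100, "VIP1-1")); // replica: redirected to the master
        System.out.println(vip21.route(200, "VIP1-1")); // master of another slot range
    }
}
```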
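Unlike the cluster architecture, which has redirection built in, the master-slave architecture must overcome two major "genetic defects" to achieve imperceptible switchover: the kernel has no redirection protocol for master-replica instances, and the network layer in front of the instance behaves differently from a self-managed deployment during a switchover.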
Kernel-level support for the master-replica seamless protocol [1] was first submitted to the Redis® community. After more than a hundred rounds of discussion, and after the closed-sourcing of Redis®, it was finally merged in the new Valkey community [2]. Three major obstacles came up during the discussion:
• Architecture bias: Users outside China mostly use the Cluster Edition, so the community had little appetite for master-replica optimization. Alibaba Cloud, however, has a large number of Standard Edition customers, because the Standard Edition is less restrictive to use: for example, multi-key commands are not subject to slot limitations.
• Inconsistent experience standards: Some customers choose to tolerate connection breaks, but the Alibaba Cloud Tair team attaches great importance to customers' growing demands on response time (RT) and a fully seamless experience.
• Ecosystem inertia: The community is relatively resistant to introducing new protocols because clients must then be adapted again. To dispel this concern, Tair entered the Redis® client ecosystem directly and is currently the owner of the valkey-java [3] client.
Although Redis® was closed-sourced during this period, the Alibaba Cloud Tair team upheld the open source spirit and joined multiple core contributors from the former Redis community and major vendors to establish the Valkey community. The team then drove the development and merging of seamless switching in Valkey, making it a standard capability of the Valkey database. The new protocol format is: REDIRECT HOST PORT, where REDIRECT indicates that the client needs to redirect the request to the target node.

Figure 3: Support PR for Valkey seamless switching
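On the client side, handling this reply starts with extracting the target node. The sketch below is illustrative and is not the Valkey-Java implementation: the `RedirectReply` class is hypothetical, and because the exact separator between host and port is an assumption here, the parser accepts both `REDIRECT host port` and `REDIRECT host:port`.

```java
final class RedirectReply {
    final String host;
    final int port;

    RedirectReply(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Returns the redirect target, or null when the error is not a REDIRECT.
    // Assumption: host and port are separated by either a space or a colon.
    static RedirectReply parse(String error) {
        String text = (error.startsWith("-") ? error.substring(1) : error).trim();
        if (!text.startsWith("REDIRECT ")) {
            return null;
        }
        String target = text.substring("REDIRECT ".length()).trim();
        int sep = Math.max(target.lastIndexOf(' '), target.lastIndexOf(':'));
        if (sep <= 0 || sep == target.length() - 1) {
            return null;
        }
        try {
            int port = Integer.parseInt(target.substring(sep + 1).trim());
            return new RedirectReply(target.substring(0, sep).trim(), port);
        } catch (NumberFormatException e) {
            return null; // malformed port: treat as an ordinary error
        }
    }

    public static void main(String[] args) {
        RedirectReply r = RedirectReply.parse("-REDIRECT 10.0.0.5 6379");
        System.out.println("reconnect to " + r.host + " on port " + r.port);
    }
}
```

A client would then reconnect to the returned node and retry the failed command there, instead of surfacing the error to the application.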
After Redis® was closed-sourced, core contributors to the Jedis community from the Tair team forked Jedis to create the Valkey-Java client, so that customers continue to have client support. Valkey-Java implements master-replica seamless switching. Customers only need to upgrade the client to the following version or later to use it:
    <dependency>
        <groupId>io.valkey</groupId>
        <artifactId>valkey-java</artifactId>
        <version>5.3.0</version>
    </dependency>
Handling in-flight connections deserves particular attention when supporting seamless switching. When one or more connections that are used concurrently from the connection pool receive -REDIRECT at the same time, re-initialization of the connection pool is coordinated through a two-phase lock:
• tryLock of ReentrantLock: only the thread whose tryLock succeeds becomes the candidate that renews the connection pool.
• ReentrantReadWriteLock: controls reads and writes of the connection pool. Once the write lock is held, external API requests block until the connection pool update completes, which makes concurrent access safe.

Figure 4: Valkey-Java Code for handling in-flight connections
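The two-phase locking idea can be sketched independently of the real client. The code below is a simplified illustration and not the Valkey-Java source: `RedirectAwarePool` is a hypothetical class, and a plain target string stands in for the actual connection pool.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class RedirectAwarePool {
    // Phase 1: elects a single thread to renew the pool.
    private final ReentrantLock renewLock = new ReentrantLock();
    // Phase 2: readers are normal API calls, the writer is the renewal.
    private final ReentrantReadWriteLock poolLock = new ReentrantReadWriteLock();

    private String target;   // stand-in for the real connection pool
    private int renewCount;

    RedirectAwarePool(String initialTarget) {
        this.target = initialTarget;
    }

    // Normal request path: many threads may hold the read lock at once,
    // but all of them wait while a renewal holds the write lock.
    String currentTarget() {
        poolLock.readLock().lock();
        try {
            return target;
        } finally {
            poolLock.readLock().unlock();
        }
    }

    // Called by any connection that received -REDIRECT. The caller must not
    // hold the read lock, because a read lock cannot be upgraded to a write lock.
    boolean onRedirect(String newTarget) {
        if (!renewLock.tryLock()) {
            return false; // another thread is already renewing the pool
        }
        try {
            poolLock.writeLock().lock();
            try {
                if (newTarget.equals(target)) {
                    return false; // pool already points at the new primary
                }
                target = newTarget;
                renewCount++;
                return true;
            } finally {
                poolLock.writeLock().unlock();
            }
        } finally {
            renewLock.unlock();
        }
    }

    int renewCount() {
        poolLock.readLock().lock();
        try {
            return renewCount;
        } finally {
            poolLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        RedirectAwarePool pool = new RedirectAwarePool("10.0.0.1:6379");
        System.out.println(pool.onRedirect("10.0.0.2:6379")); // first redirect renews
        System.out.println(pool.onRedirect("10.0.0.2:6379")); // duplicate is a no-op
        System.out.println(pool.currentTarget());
    }
}
```

Using tryLock rather than lock in the first phase means that threads which lose the race do not queue up to rebuild the pool again; they simply return and retry their command against the renewed pool.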
Seamless switching is now a standard capability in the Valkey client community. Besides the Valkey-Java client, other clients in the community are adding support one after another:
| Client | URL | Imperceptible Switchover Support | Starting Version |
| --- | --- | --- | --- |
| Valkey-Java | https://github.com/valkey-io/valkey-java | Supported | 5.3.0 |
| Valkey-Go | https://github.com/valkey-io/valkey-go | Supported | 1.0.67 |
| Valkey-py | https://github.com/valkey-io/valkey-py | In progress | - |
| Glide | https://github.com/valkey-io/valkey-glide | In progress | - |
Apart from the differences in modifications and optimizations at the kernel layer, the difference in network environments between ApsaraDB and self-managed databases is also a very important aspect. Alibaba Cloud Tair adopts an "LB (load balancing) + VIP" architecture. The LB is responsible for connecting the VPC network. Regardless of how the backend primary and secondary nodes change, the client only needs to access a fixed VIP, which greatly reduces access costs.
However, in a standard high availability switchover, when the VIP mapping changes, the LB usually immediately resets (RST) the old connection. This leads to a consequence: the old primary database loses the link before it has time to send the redirection instruction REDIRECT to the client.
The implementation of Tair's imperceptible primary-secondary switchover relies on Connection Draining. This is a "graceful shutdown" mechanism which acts as a delay protector. It forces the old connection to remain open for a period of time (configurable) after the VIP switch. This ensures that the old primary database has sufficient time to successfully deliver the REDIRECT instruction to the client before gracefully disconnecting, thereby achieving a truly imperceptible switchover.

Figure 5: Imperceptible primary-secondary switchover of Alibaba Cloud Tair Standard Edition architecture
After imperceptible switchover was released, Tair-pulse [4] was used to run a comparative test on an instance during a switchover. The feature brings improvements in two respects, the number of errors and the unavailable time:

Figure 6: The old solution has 5 types of errors during the switchover procedure, and the unavailable time is 5 s

Figure 7: The imperceptible switchover solution has 0 errors, and the unavailable time is about 1 s
Currently, there is still space for optimization in the primary-secondary imperceptible switchover of the Tair cloud architecture. We will continue to work with the network team to further reduce the switchover duration and provide high-quality services to customers. You are welcome to try out the Tair primary-secondary imperceptible switchover capability and provide valuable feedback.
Appendix: Practical guide for Tair imperceptible switchover:
Tair supports primary-secondary imperceptible switchover starting from major version 7.0. If your major version is lower than 7.0, upgrade to 7.0. If you are already on version 7.0, simply upgrade the minor version to 0.2.9. Note that Valkey-Java 5.3.0 or later is required to get the complete imperceptible switchover capability.
[1] https://github.com/redis/redis/pull/12192
[2] https://github.com/valkey-io/valkey/pull/325
[3] https://github.com/valkey-io/valkey-java
[4] https://github.com/tair-opensource/tair-tools/tree/main/tair-pulse