Tair (Redis® OSS-Compatible): Disaster recovery solutions

Last Updated: Jun 20, 2026

As a high-performance key-value database, Tair (Redis OSS-compatible) often stores large amounts of business-critical data. To keep that data safe and available when failures occur, Tair (Redis OSS-compatible) provides several disaster recovery solutions.

Evolution of disaster recovery architectures

A disaster recovery mechanism ensures data consistency and service availability if an instance fails due to unexpected events, such as hardware malfunctions or power outages in the data center.

Figure 1. Evolution of disaster recovery architectures

| Disaster recovery solution | Protection level | Description |
| --- | --- | --- |
| Single-zone HA solution | ★★★☆☆ | The master node and replica node are deployed on different machines within the same zone. If a node fails, the high availability (HA) system automatically performs a failover to prevent service interruptions caused by a single point of failure (SPOF). |
| Zone-disaster recovery (multi-zone) solution | ★★★★☆ | The master node and replica node are deployed in two different zones within the same region. If a zone becomes unavailable due to factors like power outages or network failures, the HA system performs a failover to ensure the instance remains available. |
| Cross-region disaster recovery solution | ★★★★★ | A Global Distributed Cache instance consists of multiple child instances that synchronize data in real time through dedicated channels. A channel manager monitors the health of the child instances and handles exceptions, such as a failover. This solution is ideal for scenarios like geo-disaster recovery, active geo-redundancy, routing users to the nearest application access point, and distributing load. |

Single-zone HA solution

All instance architectures support a single-zone HA architecture. The HA system monitors the health of the master and replica nodes and automatically performs a failover to prevent service interruptions caused by an SPOF.

Deployment architecture

Description

Standard architecture (dual-replica)

Figure 2. HA architecture for a standard dual-replica instance

A standard architecture instance uses a two-node master-replica setup. If the HA system detects a failure on the master node, it automatically initiates a failover, promoting the replica node to become the new master node. When the original master node recovers, it reconnects as the new replica node.

Cluster architecture (multi-replica)

Figure 3. HA architecture for a multi-replica cluster instance

In a multi-replica cluster architecture, data is stored on data shards. Each data shard has a multi-replica configuration with nodes deployed on different machines to ensure high availability. If a master node fails, the system automatically performs a failover to maintain service availability.

Read/write splitting architecture

Figure 4. HA architecture for a read/write splitting instance

  • The system automatically monitors the health of each node. If an anomaly is detected, it initiates a failover or rebuilds a read replica and updates the corresponding routing and weight information.

  • The proxy continuously probes the status of read replicas. The proxy performs traffic control actions under the following conditions:

    • The read replica is in an abnormal state: The proxy reduces the service weight of the node. If multiple connection attempts fail, the proxy stops routing traffic to the node until the issue is resolved and the node is re-enabled.

    • The read replica is undergoing full data synchronization: The proxy temporarily stops routing traffic to the node until the full data synchronization is complete.
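The traffic control rules above can be illustrated with a minimal Python sketch. It is not the proxy's actual implementation: the class names, the threshold of three failed probes, and the halving of the weight are all assumptions made for illustration, because the source only says that the weight is reduced and that traffic stops after multiple failed connection attempts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ReplicaState(Enum):
    HEALTHY = "healthy"
    ABNORMAL = "abnormal"
    FULL_SYNC = "full_sync"  # replica is receiving a full data synchronization


@dataclass
class ReadReplica:
    name: str
    base_weight: int = 100          # hypothetical default service weight
    state: ReplicaState = ReplicaState.HEALTHY
    failed_probes: int = 0          # consecutive failed connection attempts


# Assumed threshold; the source only says "multiple connection attempts".
MAX_FAILED_PROBES = 3


def effective_weight(replica: ReadReplica) -> int:
    """Return the weight a proxy might use when routing reads to this replica."""
    if replica.state is ReplicaState.FULL_SYNC:
        # No traffic until the full data synchronization is complete.
        return 0
    if replica.state is ReplicaState.ABNORMAL:
        if replica.failed_probes >= MAX_FAILED_PROBES:
            # Stop routing until the node is re-enabled.
            return 0
        # Assumed reduction factor: halve the weight of an abnormal node.
        return replica.base_weight // 2
    return replica.base_weight


def routing_table(replicas: List[ReadReplica]) -> Dict[str, int]:
    """Map each routable replica to its weight, omitting replicas with weight 0."""
    weights = {r.name: effective_weight(r) for r in replicas}
    return {name: weight for name, weight in weights.items() if weight > 0}
```

With one healthy replica, one abnormal replica that has failed a single probe, one that has failed three probes, and one in full synchronization, `routing_table` keeps only the first two, with the abnormal one at half weight.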

Multi-zone disaster recovery

Tair (Redis OSS-compatible) provides a zone-disaster recovery architecture that spans multiple zones. If your services are deployed in a single region and require a high level of disaster recovery, you can select the multi-zone option when you create an instance. For instructions, see Create an instance.

Figure 5. Create a zone-disaster recovery instance

After the instance is created, a replica instance with the same specifications as the primary instance is created in the secondary zone. Data is synchronized between the primary and secondary zones over a dedicated replication channel.

If the primary zone experiences a power or network failure, the system promotes the replica instance to the master instance and calls the Config Server API to update the routing information for the proxy.

In addition, Tair (Redis OSS-compatible) optimizes the Redis synchronization mechanism. Similar to the global transaction identifier (GTID) feature in MySQL, Tair uses a global Opid to manage synchronization points. A lock-free background thread performs the Opid lookup, and the AOF binlog is sent asynchronously with rate limiting, so that replication does not degrade the performance of the Redis service.

Figure 6. Data synchronization process of a zone-disaster recovery instance
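The idea of resuming synchronization from an operation ID can be sketched in Python as follows. This is only a conceptual illustration and does not reflect Tair's internal data structures: `LogEntry`, `resume_index`, `entries_after`, and `batches` are hypothetical names, the log is assumed to be an in-memory list sorted by Opid, and fixed-size batches stand in for real rate limiting.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class LogEntry:
    opid: int      # assumed to increase monotonically with each write
    command: str   # serialized write command (illustrative)


def resume_index(log: List[LogEntry], last_applied_opid: int) -> int:
    """Binary-search a log sorted by opid for the first entry
    whose opid is greater than last_applied_opid."""
    lo, hi = 0, len(log)
    while lo < hi:
        mid = (lo + hi) // 2
        if log[mid].opid <= last_applied_opid:
            lo = mid + 1
        else:
            hi = mid
    return lo


def entries_after(log: List[LogEntry], last_applied_opid: int) -> List[LogEntry]:
    """Return the entries that the replica has not applied yet."""
    return log[resume_index(log, last_applied_opid):]


def batches(entries: List[LogEntry], max_per_batch: int) -> Iterator[List[LogEntry]]:
    """Yield entries in chunks of at most max_per_batch.

    Sending one chunk per time interval is a crude stand-in for rate limiting.
    """
    if max_per_batch <= 0:
        raise ValueError("max_per_batch must be positive")
    for start in range(0, len(entries), max_per_batch):
        yield entries[start:start + max_per_batch]
```

For a log with Opids 1 through 5 and a replica that last applied Opid 2, `entries_after` returns the entries with Opids 3, 4, and 5, and `batches(..., 2)` splits them into one batch of two entries followed by one batch of one.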

Cross-region disaster recovery

As your business expands globally, a cross-region access architecture can lead to high latency and a poor user experience. The Tair Global Distributed Cache feature reduces this cross-region latency. It offers the following advantages:

  • You can directly create or specify child instances for synchronization. This eliminates complex application-level redundancy designs, which significantly simplifies development and allows you to focus on your core business logic.

  • It enables you to quickly implement geo-disaster recovery and active geo-redundancy.

This feature is suitable for cross-region data synchronization and global deployments in industries such as multimedia, gaming, and e-commerce. For more information, see Global Distributed Cache.

Figure 7. Architecture of Tair Global Distributed Cache

Responding to failures

Failures such as hardware malfunctions, data center power outages, and natural disasters fall into two categories: master node failures and zone-level failures. Although rare, a failure can temporarily block data writes, cause transient connection issues, or even lead to downtime or data loss.

The reliability of an instance is closely tied to its architecture. Cluster architectures generally offer higher reliability. To minimize the impact of failures, instances with multi-replica and multi-zone deployments automatically perform a failover, which significantly reduces downtime. The following sections describe how instances with different disaster recovery solutions respond to failures.

Responding to a node failure

When a master node fails:

  • If the instance has multiple replicas in a single zone (for example, a master node and a replica node): The system promotes the replica node with the lowest replication latency to become the new master node and updates the routing information.

  • If the instance is deployed across multiple zones: The system promotes a replica node in another zone to become the new master node and updates the routing information. However, this may result in cross-zone access between your instance and other services.

    Note

    In a multi-zone cluster architecture, if replica nodes exist in both the primary and secondary zones, a failover preferentially promotes a replica node from the primary zone. This avoids cross-zone access for your application.
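The promotion rules described above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the HA system's actual algorithm: `Replica` and `pick_new_master` are hypothetical names, and the sketch assumes that zone preference takes priority over replication latency when replicas exist in both zones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    node_id: str
    zone: str
    replication_lag_ms: int   # lower means closer to the failed master's data
    healthy: bool = True


def pick_new_master(replicas: List[Replica], primary_zone: str) -> Optional[Replica]:
    """Choose the replica to promote after the master node fails.

    Assumed ordering: prefer a healthy replica in the primary zone (to avoid
    cross-zone access), then the lowest replication latency.
    """
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    # False sorts before True, so replicas in the primary zone come first.
    return min(candidates, key=lambda r: (r.zone != primary_zone, r.replication_lag_ms))
```

In a single-zone instance every candidate is in the primary zone, so the choice reduces to the lowest replication latency. In a multi-zone instance with no healthy replica left in the primary zone, the function falls back to a replica in the other zone.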

Responding to a zone-level failure

When a zone-level failure occurs, such as a power outage or a fire that makes an entire data center unavailable:

  • If the instance is deployed in a single zone: The instance becomes unavailable. You must wait for the zone to recover. During this time, you can create a new instance in another zone by using historical backup data.

  • If the instance is deployed across multiple zones: The system triggers an automatic failover.

Important
  • For maximum reliability, deploy the instance across multiple zones and create multiple replicas in each zone, which can significantly reduce downtime. However, you must weigh the probability of failure and the importance of your data against the associated costs.

  • The preceding principles also apply to the child instances of Global Distributed Cache. The failure of a single child instance does not affect the availability of other child instances. We recommend deploying child instances across multiple zones to prevent data write failures if a single child instance fails.
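Because a failover can cause transient connection issues, applications commonly retry failed operations for a short time. The following Python sketch shows one way to do this with exponential backoff. It is a generic pattern, not something this document prescribes: `call_with_retry` and its parameters are hypothetical, and the built-in `ConnectionError` is used as a placeholder for whichever connection exception your client library raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff.

    The delay doubles after each failure and is capped at max_delay.
    The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

Keep the total retry window short and make sure that retried write commands are safe to repeat, because a command may have been applied before the connection dropped.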

Related documents

Avoid Cross-Zone Failover by Customizing Node Count