ApsaraDB for Redis monitors the health of the nodes in an instance. If the master node of an instance fails, ApsaraDB for Redis automatically triggers a master-replica switchover: the roles of the master and replica nodes are swapped to ensure the high availability (HA) of the instance. You can also manually trigger a master-replica switchover, for example, to perform disaster recovery drills or, when the nodes are deployed in different zones, to let your application access the nearest node.

Causes

  • Manual switchover

    A master-replica switchover is manually performed by you or an authorized Alibaba Cloud technical expert. For more information, see Manually switch workloads from a master node to a replica node.

  • Risk mitigation

    Alibaba Cloud detects vulnerabilities in an ApsaraDB for Redis instance that may prevent the instance from running as expected. In this case, ApsaraDB for Redis fixes the vulnerabilities and performs a master-replica switchover during the specified maintenance window. Fixes for high-risk vulnerabilities are applied automatically at the earliest opportunity, and a master-replica switchover is triggered.

    You can find the events that were triggered under the preceding conditions in history events. For more information, see Query history events. You can also manage pending events of master-replica switchovers. For more information, see Query and manage pending events.

  • Instance failure

    Alibaba Cloud detects failures in an ApsaraDB for Redis instance that prevent the instance from running as expected. In this case, ApsaraDB for Redis performs a master-replica switchover to switch your workloads over to the replica nodes. This minimizes the impact of the failures.

    You are notified of such events with internal messages in the following format:

    [Alibaba Cloud] Dear ******: Your ApsaraDB for Redis instance r-bp1zxszhcgatnx**** (name: ****) has an error. A switchover is triggered to ensure that your instance runs as expected. We recommend that you check whether your application is still connected to your instance and configure your application to automatically reconnect to the instance.

Impacts

Cause: Manual switchover
Impact:
  • The data nodes on which the switchover is performed are disconnected for a few seconds. A switchover carries a risk of data loss. For example, data may become inconsistent between the master and replica nodes due to synchronization latency. To prevent data loss caused by the switchover and double writes caused by Domain Name System (DNS) caching, the data nodes become read-only for up to 30 seconds.
  • After an instance enters the Switching state, you cannot manage the instance. For example, you cannot modify the instance configurations or migrate the instance to another zone.

Cause: Risk mitigation or instance failure
Impact:
  • The data nodes on which the switchover is performed are disconnected for a few seconds.
  • After an instance enters the Switching state, you cannot manage the instance. For example, you cannot modify the instance configurations or migrate the instance to another zone.

Related suggestion: Make sure that your applications are configured to automatically reconnect to the instance or handle exceptions. Otherwise, one of the following error messages may be returned during a switchover:
  • READONLY You can't write against a read only instance
  • DISABLE You can't write or read against a disable instance

Note After the master-replica switchover is complete, the state of the instance becomes Running.
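
The following is a minimal sketch of how an application might handle exceptions during a switchover, using only the Python standard library. It retries an operation with exponential backoff when the failure looks transient: a dropped connection, a timeout, or an error whose message starts with READONLY or DISABLE. The names RedisCommandError, is_switchover_error, and call_with_retry are hypothetical and are not part of any Redis client library. The 40-second retry budget is an assumption, chosen to be slightly longer than the read-only window of up to 30 seconds. In a real application, map RedisCommandError to the exception type that your client library raises.

```python
import time


class RedisCommandError(Exception):
    """Hypothetical stand-in for the error type raised by a Redis client library."""


# Error prefixes taken from the messages listed above; a command that fails
# with either one is treated as transient and retried.
TRANSIENT_PREFIXES = ("READONLY", "DISABLE")


def is_switchover_error(exc):
    """Return True if the exception looks like a switchover-related failure."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    return isinstance(exc, RedisCommandError) and str(exc).startswith(TRANSIENT_PREFIXES)


def call_with_retry(operation, max_wait=40.0, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    max_wait is an assumed total sleep budget in seconds. Once it is used up,
    the last exception is re-raised to the caller.
    """
    waited = 0.0
    delay = base_delay
    while True:
        try:
            return operation()
        except Exception as exc:
            # Give up on non-transient errors or when the budget is exhausted.
            if not is_switchover_error(exc) or waited >= max_wait:
                raise
            pause = min(delay, max_wait - waited)
            sleep(pause)
            waited += pause
            delay *= 2
```

A caller would wrap each command, for example call_with_retry(lambda: client.set("key", "value")), where client is whatever Redis client object the application already uses. Errors that do not match the transient patterns are re-raised immediately so that genuine bugs are not hidden by the retry loop.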

FAQ

  • Q: How does a master-replica switchover that is triggered by an instance failure work?
    A: The HA system relies on its detection mechanism to detect failures. The HA mechanism handles the following events:

    Health check: The HA system checks whether the master and replica nodes are healthy.

    Master node failure:
    1. When the master node is determined to be unavailable, a replica node takes over as the master node. At the same time, the virtual IP address (VIP) of the master node is switched to that replica node.
    2. Another replica node is created to ensure data synchronization.

    Replica node failure: When a replica node is determined to be unavailable, another replica node is created to ensure data synchronization and maintain the data persistence of the master-replica architecture.

    Note Some data that was recently written to the master node may be lost because synchronization between the master and replica nodes is asynchronous.
  • Q: Does a master-replica switchover affect the use of read replicas in read/write splitting instances? For more information about read/write splitting instances, see Read/write splitting instances.

    A: A master-replica switchover does not affect the use of read replicas in read/write splitting instances.

  • Q: Does a master-replica switchover triggered for a specific data shard in an instance affect the instance as a whole if the instance is a cluster master-replica instance? For more information about cluster master-replica instances, see Cluster master-replica instances.

    A: The instance as a whole is not affected. Only the data shard is affected. For more information, see Impacts.
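
The HA mechanism described in the first answer above can be illustrated with a small in-memory model. This is only a sketch for intuition and not how ApsaraDB for Redis is implemented: the Node and Instance classes, their fields, and the node names are all hypothetical, and asynchronous replication is modeled as an explicit replicate() call so that the effect of replication lag is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)


@dataclass
class Instance:
    master: Node
    replica: Node
    vip_target: str = ""  # name of the node that the VIP currently points to
    created: int = 0      # number of replacement replicas created so far

    def __post_init__(self):
        self.vip_target = self.master.name

    def write(self, key, value):
        """Writes land on the master only; replication happens later."""
        self.master.data[key] = value

    def replicate(self):
        """Model asynchronous replication as an explicit catch-up step."""
        self.replica.data = dict(self.master.data)

    def _new_replica(self):
        """Create a replacement replica seeded from the current master."""
        self.created += 1
        return Node(f"replica-new-{self.created}", data=dict(self.master.data))

    def health_check(self):
        if not self.master.healthy:
            # Step 1: the replica takes over and the VIP moves to it.
            self.master = self.replica
            self.vip_target = self.master.name
            # Step 2: another replica is created to resume synchronization.
            self.replica = self._new_replica()
        elif not self.replica.healthy:
            # Replica failure: only the replica is replaced.
            self.replica = self._new_replica()


inst = Instance(Node("node-a"), Node("node-b"))
inst.write("k1", "v1")
inst.replicate()
inst.write("k2", "v2")        # written to the master but not yet replicated
inst.master.healthy = False   # the master fails before the next replication
inst.health_check()
print(inst.vip_target)        # node-b
print(inst.master.data)       # {'k1': 'v1'}
```

In this model, the write of k2 is lost because it reached the old master after the last replication step, which mirrors the note about asynchronous synchronization.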