ApsaraDB for Redis monitors the health of the nodes in an instance. If the master node of an instance becomes unavailable, ApsaraDB for Redis automatically triggers a master-replica switchover, which swaps the roles of the master and replica nodes to keep the instance highly available. You can also trigger a master-replica switchover manually, for example, to run disaster recovery drills or, when the nodes are deployed in different zones, to let your application access a nearby node.

Causes

  • Manual switchover

    A master-replica switchover is manually performed by you or an authorized Alibaba Cloud technical expert. For more information, see Manually switch workloads from a master node to a replica node.

  • Risk mitigation

    Alibaba Cloud automatically detects vulnerabilities in an ApsaraDB for Redis instance that may prevent the instance from running as expected. In this case, ApsaraDB for Redis fixes the vulnerabilities and performs a master-replica switchover during a specified maintenance window.

    You can find the events that are triggered under the preceding conditions in logs. For more information, see Query history events. You can also manage pending events of master-replica switchovers. For more information, see Query and manage pending events.

  • Instance failure

    Alibaba Cloud detects failures in an ApsaraDB for Redis instance that prevent the instance from running as expected. In this case, ApsaraDB for Redis performs a master-replica switchover to switch your workloads to the replica node, which minimizes the impact of the failures.

    You are notified of such events with internal messages in the following format:

    [Alibaba Cloud] Dear ******: Your ApsaraDB for Redis instance r-bp1zxszhcgatnx**** (name: ****) has an error. A switchover is triggered to ensure that your instance runs as expected. We recommend that you check whether your application is still connected to your instance and configure your application to automatically reconnect to the instance.

Impacts

Cause: Manual switchover
Impact:
  • The data nodes on which the switchover is performed are disconnected for a few seconds. A switchover carries a risk of data loss. For example, synchronization latency may leave the data on the master and replica nodes inconsistent. To prevent data loss caused by the switchover and double writes caused by Domain Name System (DNS) caching, the data nodes become read-only for up to 30 seconds.
  • After an instance enters the Switching state, you cannot manage the instance. For example, you cannot modify the instance configurations or migrate the instance to another zone.
Related suggestion: Make sure that your applications are configured to automatically reconnect to the instance or handle exceptions. Otherwise, one of the following error messages may be returned during a switchover:
  • READONLY You can't write against a read only instance
  • DISABLE You can't write or read against a disable instance

Cause: Risk mitigation, Instance failure
Impact:
  • The data nodes on which the switchover is performed are disconnected for a few seconds.
  • After an instance enters the Switching state, you cannot manage the instance. For example, you cannot modify the instance configurations or migrate the instance to another zone.

Note: After the master-replica switchover is complete, the state of the instance becomes Running.
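
The suggestion to reconnect automatically or handle exceptions can be illustrated with a small retry wrapper. The following is a minimal sketch, not an official client feature. It uses only the Python standard library. The names `call_with_retry` and `is_transient` and the 35-second wait budget are assumptions made for illustration. A real Redis client library typically raises its own exception classes, so `is_transient` would need to be adapted to them.

```python
import time

# Error prefixes quoted in the impact description above.
TRANSIENT_PREFIXES = ("READONLY", "DISABLE")


def is_transient(exc):
    """Return True for errors that are expected during a switchover."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    return str(exc).startswith(TRANSIENT_PREFIXES)


def call_with_retry(operation, max_wait=35.0, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Run operation(), retrying transient errors with exponential backoff.

    max_wait is a hypothetical budget chosen to be slightly longer than the
    30-second read-only window. Tune it for your application.
    """
    deadline = clock() + max_wait
    delay = base_delay
    while True:
        try:
            return operation()
        except Exception as exc:
            # Give up on non-transient errors or when the budget is exhausted.
            if not is_transient(exc) or clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

An application would wrap each command in a callable, for example `call_with_retry(lambda: client.set("key", "value"))`, where `client` stands for whatever Redis client object the application already uses. Errors that are not related to a switchover are raised immediately so that genuine bugs are not hidden by the retry loop.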

FAQ

  • Q: How does a master-replica switchover that is triggered by an instance failure work?

    A: The High Availability (HA) system detects failures and handles them as follows:

    Health check: The HA system checks whether the master and replica nodes are healthy.

    Master node failure:
    1. When the master node is determined to be unavailable, the replica node takes over as the master node. At the same time, the virtual IP address (VIP) of the master node is switched to the replica node.
    2. Another replica node is created to ensure data synchronization.

    Replica node failure: When the replica node is determined to be unavailable, another replica node is created to ensure data synchronization and maintain the data persistence of the master-replica architecture.

    Note: Some data that was recently written to the master node may be lost because synchronization between the master and replica nodes is asynchronous.

  • Q: Does a master-replica switchover affect the use of read replicas in read/write splitting instances?

    A: No. A master-replica switchover does not affect the use of read replicas in read/write splitting instances.

  • Q: If the instance is a cluster master-replica instance, does a master-replica switchover that is triggered for a specific data shard affect the instance as a whole?

    A: No. Only that data shard is affected. For more information, see Impacts.
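
The failure handling described in the first FAQ answer (the replica takes over, the VIP is switched, and a new replica is created) can be modeled with a short simulation. This is a toy sketch for illustration only and does not reflect the actual implementation of the HA system. The class names `Node` and `Shard`, the node names, and the `_pending` queue are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    healthy: bool = True


class Shard:
    """Toy model of one master-replica pair behind a virtual IP (VIP)."""

    def __init__(self):
        self.master = Node("node-1")
        self.replica = Node("node-2")
        self.vip_target = self.master.name  # the VIP points at the master
        self._pending = []                  # writes not yet replicated
        self._next_id = 3

    def write(self, key, value):
        # The master accepts the write before the replica has received it.
        self.master.data[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Asynchronous replication catches the replica up at a later point.
        for key, value in self._pending:
            self.replica.data[key] = value
        self._pending.clear()

    def health_check(self):
        """Apply the failure handling and return the action that was taken."""
        if not self.master.healthy:
            # 1. The replica takes over and the VIP is switched to it.
            #    Writes still in _pending never reached it and are lost.
            self.master = self.replica
            self.vip_target = self.master.name
            # 2. A new replica is created and synchronized from the new master.
            self.replica = self._new_replica()
            return "master failover"
        if not self.replica.healthy:
            self.replica = self._new_replica()
            return "replica rebuilt"
        return "healthy"

    def _new_replica(self):
        # A new replica starts as a full copy, so nothing remains pending.
        node = Node(f"node-{self._next_id}", data=dict(self.master.data))
        self._next_id += 1
        self._pending.clear()
        return node
```

In this model, a write that the master accepted but had not yet replicated when it failed is absent from the promoted node, which mirrors the data loss note in the FAQ answer.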