High-availability service

Last Updated: Mar 30, 2017

The high-availability service consists of several modules including the Detection, Repair, and Notification modules. In combination, these modules guarantee the availability of the data link services and process any internal database exceptions.

In addition, you can improve the availability of an RDS instance by migrating it to a region that supports multiple zones and by adopting the appropriate high-availability policy.

Detection

The Detection module checks whether the master and slave nodes of the DB Engine are providing their services normally. The HA node uses heartbeat information, acquired at an interval of 8 to 10 seconds, to determine the health status of the master node. This information, combined with the health status of the slave node and heartbeat information from other HA nodes, allows the Detection module to eliminate any risk of misjudgment caused by exceptions such as network jitter. As a result, switchover can be completed within 30 seconds.
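
The judgment described above can be pictured as a vote across HA nodes. The following is a minimal Python sketch, not the RDS implementation. The names `Observation` and `master_is_down`, the majority rule, the 10-second staleness threshold, and the requirement that the slave node be healthy are all assumptions made for illustration. The source states only that heartbeats are acquired every 8 to 10 seconds and that they are combined with the slave node's health and with heartbeat information from other HA nodes.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold: a heartbeat older than this counts as missed.
# The source states only that heartbeats arrive every 8 to 10 seconds.
HEARTBEAT_TIMEOUT_S = 10.0


@dataclass
class Observation:
    """One HA node's view of the master node (hypothetical structure)."""
    ha_node: str
    last_heartbeat_age_s: float  # seconds since this HA node last saw a heartbeat


def master_is_down(observations: List[Observation], slave_healthy: bool) -> bool:
    """Declare the master down only if a majority of HA nodes have missed its
    heartbeat (assumed rule) and the slave node is healthy enough to take over."""
    if not observations:
        return False
    missed = sum(1 for o in observations if o.last_heartbeat_age_s > HEARTBEAT_TIMEOUT_S)
    majority_missed = missed > len(observations) / 2
    return majority_missed and slave_healthy
```

With a rule like this, a single HA node that loses contact because of network jitter cannot trigger a switchover on its own.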

Repair

The Repair module maintains the replication relationship between the master and slave nodes of the DB Engine. It can also repair any errors that may occur on either node, such as:

  • Automatic restoration of master/slave replication in case of disconnection.

  • Automatic repair of table-level damage to the master or slave nodes.

  • On-site saving and automatic repair if the master or slave nodes crash.

Notification

The Notification module informs the SLB or Proxy of status changes to the master and slave nodes to ensure that you can continue to access the correct node.

For example, the Detection module discovers that the master node has an exception and instructs the Repair module to fix it. If the Repair module fails to resolve the problem, it directs the Notification module to initiate traffic switching. The Notification module then forwards the switching request to the SLB or Proxy, which begins to redirect all traffic to the slave node. At the same time, the Repair module creates a new slave node on another physical server and synchronizes this change back to the Detection module. The Detection module then incorporates this new information and starts to recheck the health status of the instance.
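
The hand-off between the three modules in this example can be sketched as follows. This is a hedged illustration only: the function `handle_master_exception` and its callback parameters are hypothetical, and the steps run sequentially here, whereas the source says the new slave node is created at the same time as traffic is switched.

```python
from typing import Callable, List


def handle_master_exception(
    try_repair: Callable[[], bool],          # Repair module: True if the master was fixed
    switch_traffic: Callable[[str], None],   # Notification module: tells SLB/Proxy the new target
    rebuild_slave: Callable[[], str],        # Repair module: returns the new slave's host
) -> List[str]:
    """Return the ordered list of steps taken after Detection reports an exception."""
    steps = ["detection: master exception found"]
    if try_repair():
        steps.append("repair: master fixed in place")
        return steps
    steps.append("repair: fix failed")
    switch_traffic("slave")
    steps.append("notification: SLB/Proxy redirected to slave")
    new_slave = rebuild_slave()
    steps.append(f"repair: new slave created on {new_slave}")
    steps.append("detection: rechecking instance health")
    return steps
```

Calling it with a repair callback that returns `False` walks through the full switchover path; a callback that returns `True` stops after the in-place repair and leaves traffic untouched.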

Multi-zone

Multi-zone refers to the physical area that is formed by combining multiple individual zones within the same region. Multi-zone RDS instances can withstand higher-level disasters than single-zone instances. For example, a single-zone RDS instance can withstand server and rack failures, while a multi-zone RDS instance can survive the failure of an entire equipment room.

There is currently no extra charge for multi-zone RDS instances. Users in a region where multi-zone has been enabled can purchase multi-zone RDS instances directly or convert single-zone RDS instances into multi-zone RDS instances by using inter-zone migration.

Note: Multiple zones may have a certain amount of network latency. As a result, when a multi-zone RDS instance uses a semi-synchronous data replication solution, its response time to any individual update may be longer than that of a single-zone instance. In this case, the best way to improve overall throughput is to increase concurrency.
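
A rough calculation shows why concurrency compensates for the added latency. The sketch below assumes each connection issues one update at a time and waits for the response. The function name and the response times used in the usage note are illustrative assumptions, not measured RDS figures.

```python
def max_updates_per_second(concurrency: int, response_time_ms: float) -> float:
    """Upper bound on throughput when every connection sends one update,
    waits for the response, and then sends the next one."""
    return concurrency * 1000.0 / response_time_ms
```

Under these assumptions, 16 connections at a 2 ms response time allow at most 8000 updates per second. If inter-zone replication raised the response time to 4 ms, the same 16 connections would allow at most 4000, and doubling concurrency to 32 would restore the bound to 8000.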

High-availability policy

The high-availability policies use a combination of service priorities and data replication modes to meet the needs of your business.

There are two service priorities:

  • RTO (Recovery Time Objective) priority: The database must restore services as soon as possible within a specified time frame. This is best for users who require their databases to provide uninterrupted online service.

  • RPO (Recovery Point Objective) priority: The database must restore services with as little data loss as possible. This is best for users whose highest priority is data consistency.

There are three data replication modes:

  • Asynchronous replication (Async)

    In this mode, the master node does not immediately synchronize data to the slave node. When an application initiates an update request, which may include add, delete, or modify operations, the master node responds to the application immediately after completing the operation but does not necessarily replicate that data to the slave node right away. This means that the operation of the master node is not affected if the slave node is unavailable, but data inconsistencies can occur between the master and slave nodes.

  • Forced synchronous replication (Sync)

    In this mode, the master node synchronizes all data to the slave node at all times. When an application initiates an update request, which may include add, delete, or modify operations, the master node replicates the data to the slave node immediately after completing the operation and waits for the slave node to return a success message before it responds to the application. This means that the operation of the master node will be affected if the slave node is unavailable, but the data on the master and slave nodes will always be consistent.

  • Semi-synchronous replication (Semi-Sync)

    This mode is a hybrid of the two preceding replication modes. As long as both nodes are functioning normally, data replication is identical to the forced synchronous replication mode. However, when an exception occurs, such as the slave node becoming unavailable or a network exception occurring between the two nodes, the master node keeps attempting to replicate data to the slave node and suspends its response to the application for a set period of time. Once this period has timed out, the master node degrades to asynchronous replication. At this point, if the master node becomes unavailable and the application obtains its data from the slave node, that data will not necessarily be consistent with the data on the master node. When data replication between the two nodes returns to normal, because the slave node or network connection has recovered, forced synchronous replication is reinstated. The amount of time it takes for the nodes to return to forced synchronous replication depends on how the semi-synchronous replication mode is implemented. For instance, ApsaraDB for MySQL 5.5 is different from ApsaraDB for MySQL 5.6 in this regard.
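
The degrade-and-recover behavior of semi-synchronous replication can be modeled as a small state machine. This is a simplified sketch under stated assumptions, not how any ApsaraDB engine implements it: the class `SemiSyncMaster`, the `ack_timeout_s` parameter and its 1-second default, and the rule that synchronous mode is reinstated as soon as one acknowledgment arrives are all hypothetical. As noted above, the actual recovery timing depends on the implementation.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    SYNC = "forced synchronous"
    ASYNC = "degraded to asynchronous"


class SemiSyncMaster:
    """Toy model of a master node using semi-synchronous replication."""

    def __init__(self, ack_timeout_s: float = 1.0) -> None:
        self.ack_timeout_s = ack_timeout_s  # assumed wait before degrading
        self.mode = Mode.SYNC

    def commit(self, slave_ack_delay_s: Optional[float]) -> float:
        """Process one update and return how long the application waits.

        slave_ack_delay_s is the time the slave needs to acknowledge,
        or None if the slave is unreachable.
        """
        acked = slave_ack_delay_s is not None and slave_ack_delay_s <= self.ack_timeout_s
        if self.mode is Mode.SYNC:
            if acked:
                return slave_ack_delay_s  # waited for the slave, data consistent
            self.mode = Mode.ASYNC        # timed out: degrade
            return self.ack_timeout_s
        # Already degraded: respond immediately, replicate in the background.
        if acked:
            self.mode = Mode.SYNC         # assumed: reinstate on first ack
        return 0.0
```

In this model, the first update after the slave node becomes unreachable pays the full timeout, later updates return immediately, and the first acknowledged update moves the master back to synchronous mode.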

Several combinations of service priorities and data replication modes are available in order to meet your database and business needs. The characteristics of key combinations are detailed in the following table:

| Cloud Data Engine | Service Priority | Data Replication Mode | Combination Characteristics |
|---|---|---|---|
| MySQL 5.1 | RPO | Async | If the master node fails, the slave node will switch over after applying all of the relay logs. If the slave node fails, application operations on the master node are not affected. The data on the master node will be synchronized after the slave node recovers. |
| MySQL 5.5 | RPO | Async | If the master node fails, the slave node will switch over after applying all of the relay logs. If the slave node fails, application operations on the master node are not affected. The data on the master node will be synchronized after the slave node recovers. |
| MySQL 5.5 | RTO | Semi-Sync | If the master node fails and data replication has not degraded, RDS will immediately trigger the switchover and direct traffic to the slave node because data consistency has been guaranteed. If the slave node fails, application operations on the master node will time out, and data replication will degrade to asynchronous replication. After the slave node recovers and the data on the master node is synchronized completely, data replication will return to forced synchronization. If the master node fails while the two nodes have inconsistent data and the data replication mode has degraded to asynchronous replication, the slave node will switch over after applying all of the relay logs. |
| MySQL 5.6 | RPO | Async | If the master node fails, the slave node will switch over after applying all of the relay logs. If the slave node fails, application operations on the master node are not affected. The data on the master node will be synchronized after the slave node recovers. |
| MySQL 5.6 | RTO | Semi-Sync | If the master node fails and data replication has not degraded, RDS will immediately trigger the switchover and direct traffic to the slave node because data consistency has been guaranteed. If the slave node fails, application operations on the master node will time out, and data replication will degrade to asynchronous replication. After the slave node recovers and the data on the master node is synchronized completely, data replication will return to forced synchronization. If the master node fails while the two nodes have inconsistent data and the data replication mode has degraded to asynchronous replication, the slave node will switch over after applying all of the relay logs. |
| MySQL 5.6 | RPO | Semi-Sync | If the master node fails and data replication has not degraded, RDS will immediately trigger the switchover and direct traffic to the slave node because data consistency has been guaranteed. If the slave node fails, application operations on the master node will time out, and data replication will degrade to asynchronous replication. When the slave node can obtain information from the master node again, because the slave node or network connection recovers, data replication will return to forced synchronization. If the master node fails while the two nodes have inconsistent data and the data difference on the slave node cannot be reconciled completely, you can obtain the time of the slave node through the API. Then you can decide when to switch over and which method you plan to use to reconcile the data. |
| MySQL 5.7 | X | X | This engine does not currently support policy adjustments. |
| SQL Server 2008 R2 | X | X | This engine does not currently support policy adjustments. |
| SQL Server 2012 | X | X | This engine does not currently support policy adjustments. |
| PostgreSQL | X | X | This engine does not currently support policy adjustments. |
| PPAS | X | X | This engine does not currently support policy adjustments. |