High-availability service

Last Updated: Dec 04, 2017

The high-availability service consists of several modules, including the Detection, Repair, and Notification modules. Together, these modules guarantee the availability of the data link services and handle any internal database exceptions.

In addition, you can improve the high availability of an RDS instance by migrating it to a region that supports multiple zones and by choosing an appropriate high-availability policy.

Detection

The Detection module checks whether the master and slave nodes of the DB Engine are providing their services normally. The HA (High Availability) node uses heartbeat information, acquired at an interval of 8 to 10 seconds, to check the health status of the master node. This information, combined with the health status of the slave node and heartbeat information from other HA nodes, allows the Detection module to eliminate the risk of misjudgment caused by exceptions such as network jitter, and ensures that a switchover triggered by an exception can be completed within 30 seconds.
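The following Python sketch illustrates one way such a cross-check could work. It is not the actual RDS implementation: the majority rule, the missed-heartbeat threshold, the use of the slave's view of the master, and all names are assumptions made for illustration. Only the 8 to 10 second heartbeat interval comes from the text above.

```python
from dataclasses import dataclass
from typing import List

HEARTBEAT_INTERVAL_S = 10    # upper bound of the 8 to 10 second interval in the text
MISSED_BEATS_THRESHOLD = 2   # assumption: suspect the master after two missed beats


@dataclass
class Observation:
    """What one HA node currently knows about the master (hypothetical type)."""
    ha_node: str
    seconds_since_last_heartbeat: float


def suspects_master(obs: Observation) -> bool:
    """A single HA node suspects the master once enough heartbeats are missed."""
    limit = HEARTBEAT_INTERVAL_S * MISSED_BEATS_THRESHOLD
    return obs.seconds_since_last_heartbeat > limit


def master_is_down(observations: List[Observation], slave_sees_master: bool) -> bool:
    """Declare a failure only when a strict majority of HA nodes suspect the
    master and the slave has also lost contact with it (both rules assumed)."""
    suspicious = sum(suspects_master(o) for o in observations)
    majority = suspicious * 2 > len(observations)
    return majority and not slave_sees_master
```

With three HA nodes, a single node that misses heartbeats because of local network jitter is outvoted by the other two, so no switchover is triggered.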

Repair

The Repair module maintains the replication relationship between the master and slave nodes of the DB Engine. It can also repair errors that occur on either node. Its repair capabilities include:

  • Automatic restoration of master/slave replication in case of disconnection

  • Automatic repair of table-level damage on the master or slave node

  • On-site saving and automatic repair if the master or slave node crashes

Notification

The Notification module informs the SLB or Proxy of status changes on the master and slave nodes to guarantee that you can continue to access the correct node.

For example, the Detection module discovers that the master node has an exception and instructs the Repair module to fix it. If the Repair module fails to resolve the problem, it directs the Notification module to initiate traffic switching. The Notification module then forwards the switching request to the SLB or Proxy, which begins to redirect all traffic to the slave node. Simultaneously, the Repair module creates a new slave node on another physical server and synchronizes this change back to the Detection module. The Detection module then incorporates this new information and starts to recheck the health status of the instance.
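The sequence in this example can be sketched as a short Python function. This is only an illustration of the order of events described above; the function name, the `try_repair` callback, and the log messages are hypothetical and do not correspond to any RDS API.

```python
from typing import Callable, List


def handle_master_exception(try_repair: Callable[[], bool], log: List[str]) -> str:
    """Walk through the example flow and return the node that serves traffic
    afterwards: "master" if the repair worked, otherwise "slave"."""
    log.append("detection: master exception found, repair requested")
    if try_repair():
        # The Repair module fixed the problem, so no switchover is needed.
        log.append("repair: master fixed in place")
        return "master"
    # Repair failed: the Notification module has the SLB or Proxy switch traffic.
    log.append("repair: failed, traffic switching requested")
    log.append("notification: SLB/Proxy redirects all traffic to the slave node")
    # In the text this happens at the same time as the traffic switch.
    log.append("repair: new slave node created on another physical server")
    log.append("detection: new topology recorded, health check restarted")
    return "slave"
```

Calling `handle_master_exception(lambda: False, [])` follows the failure path of the example, and `lambda: True` follows the path where the repair succeeds.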

Multi-zone

Multi-zone refers to a physical area that is formed by combining multiple individual zones within the same region. Multi-zone RDS instances can withstand higher-level disasters than single-zone instances. For example, a single-zone RDS instance can withstand server and rack failures, while a multi-zone RDS instance can survive the failure of an entire data center.

Currently, multi-zone RDS instances incur no extra charge. Users in a region where multi-zone is enabled can purchase multi-zone RDS instances directly or convert single-zone RDS instances into multi-zone RDS instances by using inter-zone migration.

Note: There may be some network latency between zones. As a result, when a multi-zone RDS instance uses a semi-synchronous data replication solution, its response time to any individual update may be longer than that of a single-zone instance. In this case, the best way to improve overall throughput is to increase concurrency.
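The arithmetic behind this note can be sketched as follows. The sketch assumes a simple model in which each connection has one update in flight at a time and the instance is not yet saturated, so throughput is roughly concurrency divided by response time. The response times used in the usage note (2 ms and 5 ms) are made-up illustrative values, not measurements.

```python
import math


def throughput_per_second(concurrency: int, response_time_ms: float) -> float:
    """Approximate updates per second when each of `concurrency` connections
    issues one update at a time (assumed closed-loop model)."""
    return concurrency * 1000 / response_time_ms


def concurrency_needed(target_per_second: float, response_time_ms: float) -> int:
    """Smallest number of connections that reaches the target throughput."""
    return math.ceil(target_per_second * response_time_ms / 1000)
```

For example, under these assumptions 10 connections at 2 ms per update give 5000 updates per second, while the same 10 connections at 5 ms give 2000. Reaching 5000 updates per second at 5 ms requires `concurrency_needed(5000, 5)`, that is 25 connections.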

High-availability policies

The high-availability policies use a combination of service priorities and data replication modes to meet your business needs.

The service priorities are as follows:

  • RTO (Recovery Time Objective) priority: The database must restore services as soon as possible within a specified time frame. This is best for users who require their databases to provide uninterrupted online service.

  • RPO (Recovery Point Objective) priority: The database must guarantee data reliability, that is, lose as little data as possible. This is best for users whose highest priority is data consistency.

There are three data replication modes:

  • Asynchronous replication (Async)

    In this mode, the master node does not immediately synchronize data to the slave node. When an application initiates an update request, which may include add, delete, or modify operations, the master node responds to the application immediately after completing the operation but does not necessarily replicate that data to the slave node right away. This means that the operation of the master node is not affected if the slave node is unavailable, but data inconsistencies may occur if the master node becomes unavailable.

  • Forced synchronous replication (Sync)

    In this mode, the master node synchronizes all data to the slave node at all times. When an application initiates an update request, which may include add, delete, or modify operations, the master node replicates the data to the slave node immediately after completing the operation and waits for the slave node to return a success message before it responds to the application. This means that the operation of the master node is affected if the slave node is unavailable, but the data on the master and slave nodes is always consistent.

  • Semi-synchronous replication (Semi-Sync)

    This mode functions as a hybrid of the two preceding replication modes. When both nodes are functioning normally, data replication is identical to the forced synchronous replication mode. However, when there is an exception, such as the slave node becoming unavailable or a network exception occurring between the two nodes, the master node keeps trying to replicate data to the slave node and suspends its response to the application for a set period of time. Once this period has timed out, the master node degrades to asynchronous replication. At this point, if the master node becomes unavailable and the application reads its data from the slave node, that data may not be consistent with the data on the master node. When data replication between the two nodes returns to normal, because the slave node or network connection has recovered, forced synchronous replication is reinstated. The amount of time it takes for the nodes to return to forced synchronous replication depends on how the semi-synchronous replication mode is implemented. For instance, ApsaraDB for MySQL 5.5 is different from ApsaraDB for MySQL 5.6 in this regard.
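The difference between the three modes can be illustrated with a toy Python model of how a master node acknowledges an update. This is a simplified sketch, not how ApsaraDB or MySQL implements replication: the class, its fields, and the returned strings are hypothetical, the timeout is not simulated in real time, later asynchronous shipping to the slave is not modeled, and the slave is assumed to catch up instantly once it is reachable again.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "async"
    SYNC = "sync"
    SEMI_SYNC = "semi-sync"


class Master:
    """Toy master node that records how each update is acknowledged."""

    def __init__(self, mode: Mode, slave_available: bool = True):
        self.mode = mode
        self.slave_available = slave_available
        self.degraded = False   # semi-sync only: currently behaving as async
        self.master_log = []    # updates committed on the master
        self.slave_log = []     # updates confirmed by the slave

    def write(self, record: str) -> str:
        if self.mode is Mode.SYNC and not self.slave_available:
            # Forced sync: no acknowledgement without the slave. Simplification:
            # the update is treated as not applied at all.
            return "blocked"
        self.master_log.append(record)
        if self.mode is Mode.ASYNC:
            return "acked before replication"
        if self.slave_available:
            # Sync, or semi-sync in its normal state (catch-up assumed instant).
            self.slave_log = list(self.master_log)
            self.degraded = False
            return "acked after slave confirmed"
        if not self.degraded:
            # Semi-sync: the wait for the slave timed out, so degrade.
            self.degraded = True
            return "acked after timeout, degraded to async"
        return "acked before replication"
```

In this model, a semi-synchronous master whose slave disappears acknowledges the next update only after the timeout, then behaves asynchronously until the slave returns. During that window `slave_log` lags behind `master_log`, which is the inconsistency risk described above.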

Several combinations of service priorities and data replication modes are available to meet your database and business needs. The characteristics of key combinations are detailed in the following table.

Cloud data engine | Service priority | Data replication mode | Combination characteristics
MySQL 5.1 | RPO | Async
  • If the master node fails, the slave node switches over after applying all of the relay logs.
  • If the slave node fails, application operations on the master node are not affected. The data on the master node is synchronized after the slave node recovers.
MySQL 5.5 | RPO | Async
  • If the master node fails, the slave node switches over after applying all of the relay logs.
  • If the slave node fails, application operations on the master node are not affected. The data on the master node is synchronized after the slave node recovers.
MySQL 5.5 | RTO | Semi-Sync
  • If the master node fails and data replication has not degraded, RDS immediately triggers the switchover and directs traffic to the slave node because data consistency is guaranteed.
  • If the slave node fails, application operations on the master node time out, and data replication degrades to asynchronous replication. After the slave node recovers and the data on the master node is synchronized completely, data replication returns to forced synchronization.
  • If the master node fails while the two nodes have inconsistent data and data replication has degraded to asynchronous replication, the slave node switches over after applying all of the relay logs.
MySQL 5.6 | RPO | Async
  • If the master node fails, the slave node switches over after applying all of the relay logs.
  • If the slave node fails, application operations on the master node are not affected. The data on the master node is synchronized after the slave node recovers.
MySQL 5.6 | RTO | Semi-Sync
  • If the master node fails and data replication has not degraded, RDS immediately triggers the switchover and directs traffic to the slave node because data consistency is guaranteed.
  • If the slave node fails, application operations on the master node time out, and data replication degrades to asynchronous replication. After the slave node recovers and the data on the master node is synchronized completely, data replication returns to forced synchronization.
  • If the master node fails while the two nodes have inconsistent data and data replication has degraded to asynchronous replication, the slave node switches over after applying all of the relay logs.
MySQL 5.6 | RPO | Semi-Sync
  • If the master node fails and data replication has not degraded, RDS immediately triggers the switchover and directs traffic to the slave node because data consistency is guaranteed.
  • If the slave node fails, application operations on the master node time out, and data replication degrades to asynchronous replication. When the slave node can obtain information from the master node again, because the slave node or network connection has recovered, data replication returns to forced synchronization.
  • If the master node fails while the two nodes have inconsistent data and the data difference on the slave node cannot be reconciled completely, you can obtain the time of the slave node through the API and then decide when to switch over and which method to use to reconcile the data.
MySQL 5.7 | X | X | Currently, this engine does not support policy adjustments.
SQL Server 2008 R2 | X | X | Currently, this engine does not support policy adjustments.
SQL Server 2012 | X | X | Currently, this engine does not support policy adjustments.
PostgreSQL | X | X | Currently, this engine does not support policy adjustments.
PPAS | X | X | Currently, this engine does not support policy adjustments.
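
As a reading aid, the master-failure behavior in the MySQL rows of the table can be summarized in a small Python function. This is only a sketch of the table: the function and parameter names are hypothetical, and it reads the first Semi-Sync bullet of each row as the case in which replication has not degraded.

```python
def action_on_master_failure(priority: str, mode: str, degraded: bool = False) -> str:
    """Return what happens when the master node fails, per the table's MySQL rows.

    priority: "RTO" or "RPO"; mode: "Async" or "Semi-Sync";
    degraded: whether semi-synchronous replication had already fallen back
    to asynchronous replication (ignored for "Async").
    """
    if mode == "Async" and priority == "RPO":
        return "switch over after the slave applies all relay logs"
    if mode == "Semi-Sync" and priority in ("RTO", "RPO"):
        if not degraded:
            # Both nodes hold the same data, so switching is safe right away.
            return "switch over immediately"
        if priority == "RTO":
            return "switch over after the slave applies all relay logs"
        # RPO priority: the data difference may not be reconcilable automatically.
        return "user decides when and how to switch over"
    raise ValueError(f"combination not listed in the table: {priority}, {mode}")
```

For example, `action_on_master_failure("RPO", "Semi-Sync", degraded=True)` returns the manual decision path, while the same call with `"RTO"` returns the automatic switchover after the relay logs are applied.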