ApsaraDB for HBase can store large amounts of data and is often used to store mission-critical business data. To ensure high availability (HA) and data security, ApsaraDB for HBase provides the active-active redundancy and primary/secondary disaster recovery features. This topic describes these features.

Scenarios

  • Active-active redundancy: real-time online workloads that require low latency for random reads over large amounts of data, such as user-oriented recommendation and security risk control. In these scenarios, the P999 latency is required to be less than 50 milliseconds.
  • Primary/secondary disaster recovery: An ApsaraDB for HBase cluster may fail due to unexpected reasons, such as a device failure or a power or network failure in a data center. In this case, disaster recovery can help ensure data consistency and service availability.

Benefits

  • Active-active redundancy provides the following benefits:
    • Fewer high-latency requests
    • Automatic fault tolerance
    • High resource utilization
  • Primary/secondary disaster recovery provides the following benefits:
    • Supports hybrid deployment of ApsaraDB for HBase, E-MapReduce HBase, and self-managed HBase clusters in primary/secondary mode.
    • Supports HBase V1.X and HBase V2.X.
    • Supports automatic data synchronization task management and highly efficient bidirectional synchronization with a latency of hundreds of milliseconds.
    • Supports automatic fault tolerance.
    • Allows you to view detailed information in the Lindorm Tunnel Service (LTS) web UI, including the synchronization latency, the number of client connections, and the status of the primary and secondary clusters.

How active-active redundancy works

  • Glitch reduction: The probability that latency glitches occur on two standalone nodes at the same time is lower than the probability that a glitch occurs on a single node. Therefore, two nodes are deployed to serve the same data, which improves stability while the data is kept eventually consistent. When a user initiates a request, the request is first sent to the primary cluster. If the primary cluster does not return a result within a specified period of time, the same request is also sent to the secondary cluster. The result that is returned first is used as the response to the request, as shown in the sketch after this list.
  • Automatic fault tolerance: The active-active redundancy feature provides automatic fault tolerance. If a fault occurs, a failover is automatically performed. A failover is required in the following scenarios:
    • A power or network failure occurs in the data center, the primary cluster cannot be connected, and all requests fail with errors.
    • The primary cluster breaks down due to software bugs.
    • Requests to the primary cluster time out due to slow disk access or damaged disks.
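
The following Java sketch illustrates the hedged request pattern described above: a read is sent to the primary cluster first, and the same read is sent to the secondary cluster only if the primary does not respond within a delay, with the first result used as the response. This is a minimal sketch of the pattern under stated assumptions, not the implementation used by ApsaraDB for HBase; the connections primaryConn and secondaryConn and the delay value are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HedgedReadSketch {

    private final Executor executor = Executors.newCachedThreadPool();

    // primaryConn and secondaryConn are hypothetical connections to the two clusters.
    Result hedgedGet(Connection primaryConn, Connection secondaryConn,
                     TableName table, Get get, long hedgeDelayMillis) throws Exception {
        // Send the request to the primary cluster first.
        CompletableFuture<Result> primary = CompletableFuture.supplyAsync(
                () -> doGet(primaryConn, table, get), executor);

        // If the primary does not respond within hedgeDelayMillis,
        // send the same request to the secondary cluster.
        CompletableFuture<Result> secondary = CompletableFuture.supplyAsync(
                () -> {
                    sleepQuietly(hedgeDelayMillis);
                    if (primary.isDone()) {
                        // The primary already answered; skip the hedged request.
                        return primary.join();
                    }
                    return doGet(secondaryConn, table, get);
                }, executor);

        // Use whichever result arrives first as the response.
        return (Result) CompletableFuture.anyOf(primary, secondary).join();
    }

    private Result doGet(Connection conn, TableName table, Get get) {
        try (Table t = conn.getTable(table)) {
            return t.get(get);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```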

How primary/secondary disaster recovery works

An ApsaraDB for HBase cluster may fail due to unexpected reasons, such as a device failure or a power or network failure in a data center. In this case, disaster recovery can help ensure data consistency and service availability. ApsaraDB for HBase provides two disaster recovery solutions to meet the requirements of different business scenarios:

  • Single-zone HA solution: The primary node and the secondary node are deployed on different servers in the same zone. The HA system monitors the health status of each node in real time. If the primary node fails, the HA system performs a failover to prevent service interruptions caused by single points of failure (SPOFs).
  • Zone-disaster recovery solution: You can implement primary/secondary zone-disaster recovery between ApsaraDB for HBase clusters, between an ApsaraDB for HBase cluster and a self-managed HBase cluster, and between self-managed HBase clusters. The primary node and the secondary node are deployed in two different zones in the same region. If the zone in which the primary node resides becomes unavailable due to force majeure factors such as a power or network failure, the HA system performs a failover to ensure continuous availability of the entire system. You can use LTS to implement two-way synchronization of real-time incremental data between the primary and secondary nodes and to perform the failover operation. You can use the alihbase-connector plug-in to access the HBase nodes; the plug-in listens for switchover events on the ZooKeeper clusters that are associated with the primary and secondary nodes to complete the failover operation (see the connection sketch below).
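
The following Java sketch shows, at a high level, how a client might connect to such a primary/secondary deployment. It uses only the standard open source HBase client API; the endpoint value ha-cluster-zk-endpoint and the table name my_table are hypothetical placeholders, and the actual configuration keys and transparent failover behavior depend on the alihbase-connector plug-in and the connection information of your instance.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical placeholder: point the client at the connection address that is
        // provided for the primary/secondary deployment. The exact key may differ when
        // the alihbase-connector plug-in is used; switchover events are then handled
        // by the plug-in rather than by application code.
        conf.set("hbase.zookeeper.quorum", "ha-cluster-zk-endpoint");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Result result = table.get(new Get("row-1".getBytes()));
            System.out.println(result);
        }
    }
}
```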

FAQ

  • Is data cyclically sent during two-way synchronization between the primary and secondary nodes?

    No, data is not cyclically sent. During two-way synchronization, the system uses the cluster ID to distinguish data that is written by clients from data that is synchronized from the peer cluster, so synchronized data is not forwarded back to the cluster from which it originated. The sketch below illustrates this idea.
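
The following sketch is a simplified, hypothetical model of cluster-ID-based loop prevention, in the spirit of HBase replication, where each WAL entry records the clusters it has already passed through. It is an illustration of the idea, not LTS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified model of a WAL entry that remembers which clusters it has visited.
class WalEntry {
    final byte[] row;
    final List<UUID> consumedClusterIds = new ArrayList<>();

    WalEntry(byte[] row, UUID originClusterId) {
        this.row = row;
        this.consumedClusterIds.add(originClusterId);
    }
}

class ReplicationFilter {
    // Ship the entry to the destination only if it has not already
    // passed through that cluster; this breaks replication loops.
    static boolean shouldReplicate(WalEntry entry, UUID destinationClusterId) {
        return !entry.consumedClusterIds.contains(destinationClusterId);
    }
}

public class LoopPreventionSketch {
    public static void main(String[] args) {
        UUID clusterA = UUID.randomUUID();
        UUID clusterB = UUID.randomUUID();

        // A client write on cluster A is tagged with cluster A's ID.
        WalEntry entry = new WalEntry("row-1".getBytes(), clusterA);

        // A -> B: the entry has not visited B yet, so it is replicated.
        System.out.println(ReplicationFilter.shouldReplicate(entry, clusterB)); // true
        entry.consumedClusterIds.add(clusterB);

        // B -> A: the entry already carries A's ID, so it is not sent back.
        System.out.println(ReplicationFilter.shouldReplicate(entry, clusterA)); // false
    }
}
```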

  • Does LTS cache data when data cannot be written to the destination cluster?

    No, LTS does not cache data. During data synchronization, if the destination cluster fails and data cannot be written to it, LTS records the position up to which data has been synchronized in the write-ahead log (WAL) file. This ensures that LTS can continue to synchronize data after the destination cluster recovers. The data itself remains in the HLog of the source cluster. The sketch below outlines this checkpoint-and-resume behavior.
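
A minimal sketch of that behavior, assuming hypothetical helper interfaces (WalReader, Destination, Checkpoint) that stand in for reading the source WAL, writing to the destination cluster, and persisting the synchronized position:

```java
import java.util.List;

// Minimal sketch of checkpoint-based synchronization. The data is never
// buffered by the synchronization task itself; it stays in the source WAL.
public class WalSyncSketch {

    interface WalReader {
        List<String> readBatch(long fromPosition);                  // hypothetical
        long positionAfter(long fromPosition, List<String> batch);  // hypothetical
    }

    interface Destination {
        void write(List<String> batch) throws Exception;            // hypothetical
    }

    interface Checkpoint {
        long load();                                                 // last synced WAL position
        void save(long position);
    }

    static void syncLoop(WalReader wal, Destination dest, Checkpoint checkpoint)
            throws InterruptedException {
        long position = checkpoint.load();
        while (true) {
            List<String> batch = wal.readBatch(position);
            if (batch.isEmpty()) {
                Thread.sleep(100);
                continue;
            }
            try {
                dest.write(batch);
                // Only after a successful write is the position advanced.
                position = wal.positionAfter(position, batch);
                checkpoint.save(position);
            } catch (Exception destinationDown) {
                // The destination failed: do not advance the position. The next
                // iteration re-reads from the same WAL position, so synchronization
                // resumes automatically once the destination recovers.
                Thread.sleep(1000);
            }
        }
    }
}
```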

  • A data record (D1) is written to the primary instance, and a failover occurs while D1 is being synchronized to the secondary instance. If D1 is synchronized after another data record (D2) is written to the secondary instance, does the secondary instance store D1 or D2?

    The secondary instance stores the data record that has the larger timestamp. When LTS synchronizes data records, the original timestamps of the data records remain unchanged. In most cases, D2 is stored. However, a clock difference of milliseconds may exist between the primary and secondary instances. As a result, the timestamp of D2 may be smaller than the timestamp of D1, in which case D1 is stored. Multi-zone Lindorm instances provide strong consistency and can ensure that D2 is stored. For more information about multi-zone Lindorm instances, submit a ticket. The snippet below shows how this timestamp-based resolution plays out in HBase.
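
Because LTS preserves the original cell timestamps, the outcome follows HBase's normal versioning rule: for the same row and column, a read returns the cell with the larger timestamp. The snippet below uses the standard HBase client API against a hypothetical table t1 with hypothetical timestamps to show this; it assumes an existing Connection.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampResolutionSketch {

    static void demo(Connection conn) throws Exception {
        byte[] row = Bytes.toBytes("user-1");
        byte[] cf = Bytes.toBytes("f");
        byte[] col = Bytes.toBytes("c");

        try (Table table = conn.getTable(TableName.valueOf("t1"))) {
            // D2: written directly to the secondary instance at timestamp 2000.
            table.put(new Put(row).addColumn(cf, col, 2000L, Bytes.toBytes("D2")));

            // D1: replicated later by the synchronization task, but it keeps
            // its original timestamp 1000.
            table.put(new Put(row).addColumn(cf, col, 1000L, Bytes.toBytes("D1")));

            // The read returns the cell with the larger timestamp (D2 here),
            // regardless of the order in which the two puts arrived.
            Result result = table.get(new Get(row));
            System.out.println(Bytes.toString(result.getValue(cf, col))); // "D2"
        }
    }
}
```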