Database solutions in active geo-redundancy scenarios

Background

As cloud computing matures, more information systems are deployed in cloud environments, so the data quality and service capabilities of cloud services become critical to these systems. Disasters such as fires, floods, earthquakes, regional outages, and human-caused incidents can do irreversible damage to information systems. When you build an information system, you must also build a disaster recovery system to ensure its reliability and stability.

In 2007, the Information Office of the China State Council worked together with enterprises from eight major industries to release the standard GB/T 20988-2007 Information security technology - Disaster recovery specifications for information systems. These industries include banking, electric power, civil aviation, railway, and securities. The standard defines disaster recovery requirements in six grades. Enterprises can build disaster recovery systems based on the requirements of this China national standard, or use the standard as a set of compliance requirements. However, most traditional disaster recovery solutions cannot meet the grade 5 or grade 6 requirements of the standard. These traditional solutions include zone-disaster recovery, active-active zone-disaster recovery, geo-disaster recovery, and three data centers across two zones. They are costly, and the instances reserved for disaster recovery may not be reliably available when they are needed.

Active geo-redundancy is a next-generation disaster recovery solution that ensures a high level of continuous availability for your business. This solution can help you save costs and allows you to scale out databases within a region. This topic describes how to use major Alibaba Cloud database services in active geo-redundancy scenarios.

Architecture

The active geo-redundancy architecture is implemented based on a top-down approach for business traffic isolation. The business traffic is split based on a specific dimension and is routed to different regions. In the active geo-redundancy architecture, databases are deployed across regions, and all databases in a region are referred to as a unit cluster. Among all unit clusters that use the active geo-redundancy architecture, one unit cluster acts as the logical center that provides centralized service capabilities. For example, the logical center allows you to create sequences and can ensure strong consistency for read operations. The architecture of each unit cluster consists of three layers: access layer, service layer, and data layer.
  • Access layer: Business traffic is resolved by Alibaba Cloud DNS and distributed to the access layers of different unit clusters. At the access layer, each unit cluster decides whether to forward a request to its own service layer based on the traffic splitting identifier in the request header or cookie and on the custom distribution policies. If the request meets the forwarding conditions of the unit cluster, the unit cluster forwards it to the local service layer. Otherwise, the request is forwarded to a different unit cluster. PolarDB-X can redirect the traffic to an upstream server or use an HTTP 302 redirect based on the connection established over the VPC.
  • Service layer: Business application systems including middleware are deployed at the service layer. You can use Alibaba Cloud middleware or remote procedure calls (RPCs) to connect to the databases that use the active geo-redundancy architecture. Alibaba Cloud middleware or RPCs can help you route traffic and troubleshoot routing errors at the service layer. The supported Alibaba Cloud middleware includes Cloud Service Bus (CSB) and Message Queue (MQ). If you use RPCs to access databases that use the active geo-redundancy architecture, PolarDB-X divides the services into the following types: unit service, centralized service, and regular service.
    • Unit service: a fully autonomous service in a unit cluster. This is the major service type in active geo-redundancy scenarios. PolarDB-X determines the unit cluster to which the traffic from a unit service is routed and troubleshoots traffic routing errors.
    • Centralized service: a service that is strongly dependent on the logical center. Traffic from a centralized service is forwarded to the logical center. A centralized service can access the data layer of the logical center.
    • Regular service: a service that is not transformed. Regular services within a region can invoke each other. A regular service can access the data layer of the unit cluster that runs the regular service.
    Unit services are the major services that support active geo-redundancy. However, in complex business systems, it can be difficult to classify every business module along a single dimension, and some services require single-node deployment to prevent the data inconsistency that distributed deployment may cause. Centralized services and regular services cover these cases.
  • Data layer: The data layer resolves the issues that can occur during data synchronization when databases are deployed across regions. If a disaster occurs, the data protection policy provided at the data layer helps you ensure data quality during failovers. PolarDB-X provides two data synchronization policies, UNIT and COPY, that correspond to the service types at the upper layer.
    • UNIT: Each unit cluster is deployed in an independent database system. Data Transmission Service (DTS) is used to perform two-way data synchronization between unit clusters in real time. This ensures that each unit cluster contains full data. You can perform read and write operations in each unit cluster. When you send read and write requests to a unit cluster, the read/write splitting policy that you configure based on your business scenario is used to enable write protection for the unit cluster. The UNIT synchronization policy takes effect on the unit services at the service layer and is a core synchronization policy in active geo-redundancy scenarios.
    • COPY: Each unit cluster is deployed in an independent database system. DTS is used to perform one-way data synchronization between unit clusters in real time. This ensures that each unit cluster contains full data. You can perform read and write operations in the logical center. If a unit cluster is not a logical center, you can perform read-only operations in the unit cluster. The COPY synchronization policy takes effect on centralized services and regular services. The requests sent from centralized services are routed to and processed by the logical center. The read requests sent from regular services can be processed by the unit cluster that runs the regular services.
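
The routing and write rules described for the three layers can be summarized in a small model. The following Python sketch is illustrative only and is not an Alibaba Cloud API: the names `Topology`, `target_unit`, `sync_policy_for`, and `can_write` are hypothetical, the modulo mapping from a traffic splitting identifier to a unit cluster stands in for whatever custom distribution policy you configure, and the UNIT write rule assumes that write protection means a unit cluster accepts writes only for the identifiers it owns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ServiceType(Enum):
    UNIT = "unit"                # fully autonomous within a unit cluster
    CENTRALIZED = "centralized"  # strongly dependent on the logical center
    REGULAR = "regular"          # not transformed, runs where it is deployed


class SyncPolicy(Enum):
    UNIT = "UNIT"  # two-way synchronization between unit clusters
    COPY = "COPY"  # one-way synchronization from the logical center


@dataclass(frozen=True)
class Topology:
    units: Tuple[str, ...]  # hypothetical unit cluster names
    center: str             # unit cluster that acts as the logical center

    def owner_of(self, split_id: int) -> str:
        # Assumption: a simple modulo rule replaces the real distribution policy.
        return self.units[split_id % len(self.units)]


def target_unit(topology: Topology, service: ServiceType,
                split_id: int, local_unit: str) -> str:
    """Return the unit cluster that should process the request."""
    if service is ServiceType.UNIT:
        return topology.owner_of(split_id)  # forwarded if not the local unit
    if service is ServiceType.CENTRALIZED:
        return topology.center              # always the logical center
    return local_unit                       # regular services stay local


def sync_policy_for(service: ServiceType) -> SyncPolicy:
    """UNIT applies to unit services, COPY to centralized and regular services."""
    return SyncPolicy.UNIT if service is ServiceType.UNIT else SyncPolicy.COPY


def can_write(topology: Topology, policy: SyncPolicy,
              split_id: int, unit: str) -> bool:
    """Decide whether a unit cluster may accept a write under the given policy."""
    if policy is SyncPolicy.UNIT:
        # Assumed write protection: only the owning unit cluster accepts the write.
        return topology.owner_of(split_id) == unit
    return unit == topology.center          # COPY: other unit clusters are read-only
```

For example, with `Topology(units=("hangzhou", "beijing"), center="hangzhou")`, a unit service request whose identifier is 7 is routed to `beijing`, while a centralized service request is always routed to `hangzhou`.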

Scenarios

The active geo-redundancy solution described in this topic can be used in the following business scenarios:
  • High disaster recovery requirements: The active geo-redundancy solution meets the grade 6 disaster recovery requirements of the China national standard. It suits core business systems, business scenarios that have high disaster recovery requirements, and business scenarios in which traffic fluctuates.
  • Fine-grained traffic management: The active geo-redundancy solution allows you to use various policies to manage traffic. This solution can be used in business scenarios that have complex requirements for fine-grained traffic management. For example, this solution can be used in business scenarios in which you want to deploy databases in a region that is close to your location or distribute requests to data centers based on user information.
  • Rapid business growth: The active geo-redundancy solution allows you to scale out a unit cluster or a data center based on your business requirements. After the configuration file of a unit cluster is defined, you can use the configuration file as an image to deploy multiple unit clusters in an efficient manner. This way, you can quickly scale out the storage.
  • Read-heavy workloads: If your business performs many more read operations than write operations, the issues caused by asynchronous replication can be mitigated or prevented. In this scenario, you can transform your database system to the active geo-redundancy architecture at a low cost.

Benefits

The active geo-redundancy solution provides the following benefits:
  • Use business systems as disaster recovery systems. If you use a traditional geo-disaster recovery solution or deploy databases in three data centers across two zones, the remote secondary data centers are used only for disaster recovery and do not provide database services. If catastrophic failures occur in the regions where the data centers for disaster recovery reside, the availability of the data centers and the success rate of failovers cannot be ensured. With the active geo-redundancy solution, the data centers in each region play the same active role and handle business traffic. Every data center is continuously exercised by production traffic, so each one serves as both a business system and a disaster recovery system.
  • Ensure business continuity. In the active geo-redundancy architecture, the data centers in each region handle business traffic. If a failure occurs on a data center, the inbound traffic is routed to other data centers that are running as expected. This way, a failover can be performed within minutes. When an increasing number of data centers are deployed in the active geo-redundancy architecture, the amount of the inbound traffic that must be routed from a failed data center accounts for a decreasing proportion of the total amount of the inbound traffic. The failover does not affect the inbound traffic destined for the data centers that are running as expected.
  • Support rapid business growth. When your business experiences explosive growth, the resources available in a single region may become insufficient and limit database performance. In this case, a single point of failure (SPOF) or a performance bottleneck can occur at the data layer. If your database system uses a disaster recovery architecture other than the active geo-redundancy architecture, all write operations are performed only in the primary production center. The active geo-redundancy solution provides a closed-loop system to handle business traffic. Each data center allows you to perform read and write operations. This way, you can quickly scale out data centers or deploy databases across regions.
  • Isolate traffic in an efficient manner. Database systems in the active geo-redundancy architecture provide inherent capabilities to isolate traffic from the top down. The business traffic of each data center is isolated from that of the others. You can change the threshold for the amount of traffic that each data center handles. If an isolated data center handles only a minimal share of business traffic, for example, one percent of the total traffic, you can use it to try out technical upgrades. For example, you can upgrade the infrastructure or verify new technologies in the data center. Before you perform these operations, make sure that you can handle or mitigate the potential risks.
  • Control costs in an efficient manner. If you deploy three data centers across two zones or use geo-disaster recovery, the disaster recovery center must provide sufficient system resources to handle all business traffic so that it can take over after a catastrophic failure in a single region. As a result, the total cost of the database systems is twice the cost of the production center. The active geo-redundancy solution allows you to deploy data centers in different cities, all of which handle business traffic. This can help you reduce database costs.
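
The business continuity and cost benefits can be made concrete with a little arithmetic. The following Python sketch is a simplified model, not a sizing guide. It assumes that traffic is split evenly across n data centers, that only one data center fails at a time, and that cost is proportional to provisioned capacity. The function names are hypothetical.

```python
from fractions import Fraction


def failover_share(n: int) -> Fraction:
    """Share of total traffic that must be rerouted when one of n data centers fails."""
    if n < 2:
        raise ValueError("at least two data centers are required")
    return Fraction(1, n)


def survivor_load(n: int) -> Fraction:
    """Share of total traffic that each surviving data center handles after one failure."""
    if n < 2:
        raise ValueError("at least two data centers are required")
    return Fraction(1, n - 1)


def provisioned_capacity(n: int) -> Fraction:
    """Total capacity, relative to peak traffic, if every data center is sized
    to absorb its share of a failed peer: n * 1 / (n - 1)."""
    return n * survivor_load(n)


for n in (2, 3, 4):
    print(n, failover_share(n), survivor_load(n), provisioned_capacity(n))
```

Under these assumptions, the rerouted share shrinks from 1/2 to 1/4 as the number of data centers grows from two to four, and the total provisioned capacity falls from 2 times to 4/3 times the peak traffic. A traditional deployment with a full standby center stays at 2 times.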

Case 1: Build database systems for the State Administration of Taxation in China

Background

Building a system for individual income taxation is one of the major projects in the construction of finance and tax systems. The system stores basic information of approximately 780 million natural persons and sensitive data of approximately 360 million active taxpayers. It has become an important strategic information asset and serves as a large-scale cloud platform for the public service sector. This scenario requires high disaster recovery capabilities for database systems, and the cost of the disaster recovery system must be kept under control. The active geo-redundancy solution meets the disaster recovery requirements. In addition to disaster recovery, the design must address how to reduce costs and how to support explosive business growth.

Architecture

The customer builds database systems based on the hybrid transaction/analytical processing (HTAP) and active geo-redundancy architectures. The customer uses multiple cloud services at the same time to implement active geo-redundancy. The cloud services include ApsaraDB RDS for MySQL, PolarDB-X, AnalyticDB for MySQL, Data Transmission Service (DTS), Data Management (DMS), and Multi-site High Availability (MSHA). The active geo-redundancy solution meets the requirements of disaster recovery in grade 6 based on the China national standard.
  • Use ApsaraDB RDS for MySQL and PolarDB-X to process transactional data. Use AnalyticDB for MySQL to process analytical data.
  • Use MSHA to throttle the traffic of databases that use the active geo-redundancy architecture and to perform failovers.
  • Use DTS to synchronize cross-region data in real time and synchronize data in the cloud.
  • Use DMS to manage routine O&M tasks and data changes.
  • Use Database Backup to back up data to a third-party database.

Results

  • Multiple traffic splitting rules take effect based on the business modules of the customer. For online business of the Natural Person Electronic Tax Department, the business traffic is distributed based on the file numbers of natural persons. For offline business of the Natural Person Electronic Tax Department, the business traffic is distributed based on the regions.
  • The active geo-redundancy solution meets the grade 6 disaster recovery requirements of the China national standard. The customer can perform a failover within seconds without data loss.
  • The customer deploys two unit clusters. In most cases, each unit cluster handles half the business traffic. This way, the resources provided by the two unit clusters can be fully utilized.
  • The customer can flexibly configure the traffic splitting rules that determine how business traffic is routed to the databases that use the active geo-redundancy architecture. This way, key services can be rolled out through canary releases.
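
A weighted traffic splitting rule of the kind described in these results might look like the following Python sketch. It is not the MSHA configuration format or the customer's actual rule. The unit cluster names, the percentage weights, and the hash-based bucketing of file numbers are all assumptions made for illustration.

```python
import hashlib
from typing import Dict


def bucket(file_number: str, buckets: int = 100) -> int:
    """Map a file number to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(file_number.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(file_number: str, weights: Dict[str, int]) -> str:
    """Pick a unit cluster for a file number.

    weights maps each (hypothetical) unit cluster name to an integer
    percentage of traffic. The percentages must add up to 100.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    position = bucket(file_number)
    upper = 0
    for unit, percent in weights.items():  # dicts preserve insertion order
        upper += percent
        if position < upper:
            return unit
    raise AssertionError("unreachable when weights add up to 100")


# Normal operation: each unit cluster handles about half of the traffic.
steady = {"unit-a": 50, "unit-b": 50}
# Canary release: only about one percent of file numbers reach unit-b.
canary = {"unit-a": 99, "unit-b": 1}

print(route("file-0001", steady), route("file-0001", canary))
```

Because the bucket of a file number never changes, the same taxpayer is always routed to the same unit cluster for a given set of weights. With the canary unit cluster listed last, raising its weight only adds file numbers to its share and never moves existing ones away.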

Case 2: Build a customer service system for China Unicom

Background

The new customer service system of China Unicom supports its customer service business in China, which requires continuous availability of the business system. This project marks the start of China Unicom's transformation of its business systems into high-availability systems. The major business of China Unicom is transactional business.

Architecture

The customer uses multiple cloud services at the same time to implement active geo-redundancy. The cloud services include ApsaraDB RDS for MySQL, PolarDB-X, DTS, and MSHA. This way, the active geo-redundancy feature is enabled for the seven business centers of the new customer service system.
  • Use ApsaraDB RDS for MySQL and PolarDB-X to process business data. Use a centralized console to manage the database systems that use the active geo-redundancy architecture.
  • Use DTS to synchronize data across regions in real time and report the status of each data synchronization task.
  • Use MSHA to throttle the traffic of databases that use the active geo-redundancy architecture and to perform failovers.

Results

  • The new customer service system of China Unicom is connected to seven business systems. The business systems include the inbound call center, the outbound call center, and the business support center. The business traffic is distributed to business systems based on regions.
  • Multiple disaster recovery drills are performed. Cross-region failovers can be performed within seconds. No data loss occurs when failovers are performed.
  • The customer deploys two unit clusters. Each unit cluster handles half the business traffic. This way, the resources provided by the two unit clusters can be fully utilized.