
Disaster Recovery for Databases: High-availability Architecture of PolarDB-X


By Yan Hua

This article systematically analyzes the high-availability (HA) architecture of PolarDB-X, a cloud-native distributed database of Alibaba Cloud. From disaster recovery fundamentals and HA design principles to real-world deployment solutions, this article reveals how PolarDB-X achieves financial-grade disaster recovery by using multi-layered redundancy and automated failover mechanisms.

1. Fundamentals of Disaster Recovery

1.1. Risk Pyramid

The core of disaster recovery design lies in applying tiered protection strategies to different levels of risks. Based on the scope of impact, fault risks can be categorized into four levels.

Risk level | Fault probability | Scope of impact | Typical faults
Process-level | High | A single service | OOM errors, segmentation faults, and killed processes
Machine-level | Medium | All services on a single machine | Kernel panics and disk failures
Data center-level | Low | All machines in a single data center | Power outages, fire, and fiber disconnections
City-level | Extremely low | All data centers in a single city | Earthquakes and floods

1.2. Core Metrics

In the current national standard GB/T 20988-2025 "Cybersecurity technology - Disaster recovery specifications for information systems," disaster recovery capabilities are defined in six levels.

Level | Name | RTO | RPO
Level 1 | Basic support | ≥ 2 days | 1 to 7 days
Level 2 | Alternate site support | ≥ 24 hours | 1 to 7 days
Level 3 | Electronic transmission and partial device support | ≥ 12 hours | Several hours to 1 day
Level 4 | Electronic transmission and complete device support | Several hours to 1 day | Several hours
Level 5 | Real-time data transmission and complete device support | Several minutes to several hours | 0
Level 6 | Zero data loss and remote cluster support | Several minutes | 0

The core metrics are recovery point objective (RPO) and recovery time objective (RTO):

RPO: the required point in time to which the system and data must be restored after a disaster.

  • In simpler terms, it represents the maximum amount of data loss that the system can tolerate, such as losing up to one hour of data.
  • In essence, it reflects the backup method and frequency. For example, if the requirement is RPO ≤ 24 hours, meaning up to 24 hours of data may be lost, the system must perform at least one backup per day. If the requirement is RPO = 0 (zero data loss), real-time backup is needed.

RTO: the maximum acceptable delay between the interruption and recovery of a service.

  • Put simply, it is the upper limit of system downtime that the business can tolerate.
  • Essentially, it reflects the speed of service recovery. For example, an RTO of 4 hours means that a service must be restored within 4 hours. An RTO less than 8 seconds requires that automatic failover be complete within 8 seconds.
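To make the arithmetic above concrete, here is a minimal Python sketch (hypothetical helper functions, not part of any PolarDB-X tooling) that derives the worst-case RPO from a backup interval and checks it against a target:

```python
from datetime import timedelta

def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the most data that can be lost is one full interval."""
    return backup_interval

def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    return worst_case_rpo(backup_interval) <= rpo_target

# A daily full backup satisfies RPO <= 24 hours, but not RPO <= 1 hour.
print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))  # True
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))   # False
# Per-minute log backups bound the worst-case loss to 60 seconds.
print(worst_case_rpo(timedelta(minutes=1)))                 # 0:01:00
```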


1.3. Core Principles of Disaster Recovery

The fundamental principle of database HA and disaster recovery is to ensure rapid data restoration and continuous service delivery under various risks, faults, or disasters by employing multi-layered data protection and redundancy mechanisms. The key technologies include three main parts:

• Periodic full backups provide the baseline for disaster recovery.

  • Full backups can address logical errors, such as accidental table deletion, and large-scale data corruption.
  • Their drawback is limited recovery timeliness. For example, a daily full backup implies an RPO of 24 hours.
  • Full backup operations also impose load and operational constraints on the live production system.

• Incremental log backups capture data changes in real time or at scheduled intervals based on transaction logs, recording physical or logical entries for each operation such as INSERT, UPDATE, or DELETE.

  • Replaying these logs enables point-in-time recovery (PITR), allowing restoration to any specified point in time and significantly reducing RPO. For example, with log backups performed every minute, RPO can be at most 60 seconds.
  • However, incremental backups are also limited in recovery timeliness, because the RPO is constrained by the latency of the most recently backed-up log file.

• Real-time replica synchronization uses primary/secondary replication protocols or distributed consensus protocols (such as Paxos or Raft) to achieve data synchronization across multiple nodes. The table below compares the common schemes, and a short sketch of the majority-commit rule follows it.

Replication type | Description | Data consistency | Failover speed
Synchronous replication | The primary node must wait for all secondary nodes to complete data persistence before it can commit a transaction. | RPO = 0 | RTO < 30 seconds
Asynchronous replication | The primary node does not wait for any secondary node to persist data when it commits a transaction. | RPO > 0 | Within minutes
Semi-synchronous replication | The primary node must wait for a number of secondary nodes to complete data persistence before it can commit a transaction. If replication latency occurs, the mechanism degrades to asynchronous replication. | RPO ≈ 0 | RTO < 30 seconds
Distributed consensus protocol | The primary node must wait for a majority (more than half) of secondary nodes to complete data persistence before it can commit a transaction. | RPO = 0 | RTO < 8 seconds
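The majority rule in the last row of the table is the foundation of the consensus-based approach discussed in the rest of this article. The following minimal Python sketch (illustrative only, not PolarDB-X code) captures the commit condition:

```python
def majority(total_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return total_replicas // 2 + 1

def can_commit(acked_replicas: int, total_replicas: int) -> bool:
    """Commit only after a majority of replicas has durably persisted the log entry."""
    return acked_replicas >= majority(total_replicas)

# 3-replica cluster: the leader's own copy plus one acknowledgement is enough.
print(majority(3), can_commit(2, 3))  # 2 True
# Committing with only the primary's copy (asynchronous replication) fails the rule,
# which is why asynchronous replication cannot guarantee RPO = 0.
print(can_commit(1, 3))               # False
```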

2. HA Architecture of PolarDB-X

2.1. Architectural Overview

PolarDB-X integrates centralized and distributed architectures. It is offered in two editions: Standard Edition (the centralized form) and Enterprise Edition (the distributed form). In this integrated design, storage nodes can operate independently in a centralized manner and remain fully compatible with the standalone database model.


When business grows and distributed scaling becomes necessary, the centralized Standard Edition can be upgraded to the distributed Enterprise Edition. The Enterprise Edition adopts a shared-nothing, compute-storage separation architecture, delivering financial-grade HA, distributed horizontal scalability, hybrid-workload support, low-cost storage, and extreme elasticity. Each component implements its own HA mechanisms:

Compute node (CN): provides distributed routing and computation, and uses the two-phase commit protocol (2PC) to coordinate distributed transactions. A CN also executes DDL statements in a distributed manner and maintains global indexes. CNs are stateless. Failure of any single CN process does not affect overall compute-layer availability. For production, we recommend that you deploy at least two CNs.
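The 2PC coordination performed by CNs can be pictured with a minimal sketch. The following Python code illustrates the protocol only, with a hypothetical shard interface; it is not the actual CN implementation:

```python
from typing import Iterable, Protocol

class Shard(Protocol):
    def prepare(self, txn_id: str) -> bool: ...   # phase 1: persist intent, vote yes/no
    def commit(self, txn_id: str) -> None: ...    # phase 2a: make the change visible
    def rollback(self, txn_id: str) -> None: ...  # phase 2b: undo the prepared change

def two_phase_commit(txn_id: str, shards: Iterable[Shard]) -> bool:
    """Toy 2PC coordinator: commit only if every shard votes yes in the prepare phase."""
    nodes = list(shards)
    if all(node.prepare(txn_id) for node in nodes):
        for node in nodes:
            node.commit(txn_id)
        return True
    for node in nodes:
        node.rollback(txn_id)  # rolling back a shard that never prepared is a no-op
    return False

class YesShard:
    def prepare(self, txn_id: str) -> bool: return True
    def commit(self, txn_id: str) -> None: pass
    def rollback(self, txn_id: str) -> None: pass

print(two_phase_commit("txn-42", [YesShard(), YesShard()]))  # True
```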

Data node (DN): provides highly reliable storage services and uses multiversion concurrency control (MVCC) for distributed transactions. A DN also provides the pushdown computation feature. DNs use the distributed consensus protocol Paxos to achieve high availability. Failure of any single DN does not compromise overall DN availability. Based on Paxos requirements, a production DN cluster contains at least three nodes, typically deployed as two full-featured replicas plus one logger replica.

Global meta service (GMS): provides distributed metadata and a global timestamp distributor named Timestamp Oracle (TSO), and maintains meta information such as tables, schemas, and statistics. GMS also maintains security information such as accounts and permissions. A GMS is effectively implemented as an independent DN cluster, with its HA ensured by the same DN-level HA mechanisms.
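The role of the TSO can be pictured as a service that hands out strictly increasing timestamps. The sketch below uses a simplified hybrid physical/logical scheme for illustration; it is not the actual GMS implementation:

```python
import threading
import time

class SimpleTSO:
    """Toy timestamp oracle: issues strictly increasing (physical_ms << 16 | logical) values."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_physical_ms = 0
        self._logical = 0

    def next_timestamp(self) -> int:
        with self._lock:
            now_ms = int(time.time() * 1000)
            if now_ms > self._last_physical_ms:
                self._last_physical_ms, self._logical = now_ms, 0
            else:
                self._logical += 1  # same millisecond (or clock went back): bump the counter
            return (self._last_physical_ms << 16) | self._logical

tso = SimpleTSO()
assert tso.next_timestamp() < tso.next_timestamp()  # timestamps are strictly increasing
```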

Change data capture (CDC) node: provides a primary/secondary replication protocol that is compatible with the protocols and data formats of MySQL binary logging, and uses this protocol to exchange data. CDC nodes are stateful, with their states persisted in GMS. For production, we recommend that you deploy at least two CDC nodes. If the primary CDC node fails, the secondary CDC node automatically promotes itself to primary. If both primary and secondary CDC nodes fail, a CDC node can be rebuilt from HA metadata stored in GMS.
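The CDC failover and rebuild behavior described above can be sketched as a simple decision based on heartbeat freshness. The structures and thresholds below are hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass

@dataclass
class CdcNodeState:
    role: str               # "primary" or "secondary"
    heartbeat_age_s: float  # seconds since the node last reported to GMS

def next_action(nodes: list[CdcNodeState], lease_s: float = 10.0) -> str:
    """Toy failover decision for CDC nodes whose state is persisted in GMS (simplified)."""
    primary_alive = any(n.role == "primary" and n.heartbeat_age_s < lease_s for n in nodes)
    secondary_alive = any(n.role == "secondary" and n.heartbeat_age_s < lease_s for n in nodes)
    if primary_alive:
        return "keep the current primary"
    if secondary_alive:
        return "promote the secondary to primary"
    return "rebuild a CDC node from the HA metadata stored in GMS"

# Primary heartbeat is stale, secondary is healthy: promote the secondary.
print(next_action([CdcNodeState("primary", 30.0), CdcNodeState("secondary", 1.0)]))
```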

Columnar node: builds columnar indexes on top of Object Storage Service (OSS) and provides real-time updates and snapshot-consistent columnar queries. Columnar nodes are stateful, with their states persisted in both GMS and OSS. OSS is a highly available storage service built on the Pangu distributed file system. In production, we recommend that you deploy at least two columnar nodes in a primary/secondary configuration. If the primary columnar node fails, the secondary columnar node automatically promotes itself to primary. If both primary and secondary columnar nodes fail, a columnar node can be rebuilt from HA metadata stored in GMS and persisted objects in OSS.

In summary, the cluster-level HA of PolarDB-X fundamentally relies on the HA capabilities of DNs, which can deliver an RPO of 0 and an RTO less than 8 seconds.

2.2. HA Principles

2.2.1. Limits of Traditional Primary/Secondary Replication

Issue type | Synchronous replication | Asynchronous replication | Semi-synchronous replication | Distributed consensus (Paxos) protocol
Data consistency | RPO = 0 | RPO > 0, with inevitable data loss | RPO ≈ 0, with possible data loss | RPO = 0
Failover duration | Automatic failover within seconds | Depends on manual verification, modification, and switchover | Depends on manual verification, modification, and switchover | Automatic failover within seconds
Network partition tolerance | Unavailable | With data loss | Degrades to asynchronous replication, with data loss | Available if the majority of nodes are alive
Performance | Low | High | Medium | Medium-high

2.2.2. Core Innovations of XPaxos

To address the limits of traditional primary/secondary replication, PolarDB-X implements a Paxos-based distributed consensus solution and extends it with its own enhanced, innovative XPaxos design.

Paxos optimizations with strong leadership

  • Provides autonomous clusters. XPaxos enables automatic failover as long as a majority of nodes remain alive, keeping services continuously available without dependence on any external component.
  • Supports weighted leader election. Node election weights can be configured dynamically to enforce a predetermined leader-priority order in disaster recovery scenarios (see the sketch after this list).
  • Implements a policy-driven majority. A set of nodes can be configured to always hold strongly consistent data so they can be immediately promoted to leader in disaster recovery scenarios.
  • Enables automatic leader handoff. If a newly promoted leader cannot catch up within a configured replay-delay window after an outage, XPaxos will attempt to hand off leadership automatically to a latency-free node to ensure cluster availability.
  • Supports customizable node roles. XPaxos introduces lightweight logger replicas (without data) to reduce storage costs and learner replicas (without election weights) to provide asynchronous read-only services.
  • Achieves seamless switchover. Leader transitions are invisible to the application layer (no errors or failures), enabling uninterrupted services during leadership changes.
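As referenced above, weighted leader election and the replay-delay check can be illustrated with a toy model. The data structures, weights, and thresholds below are hypothetical and greatly simplified compared with the real XPaxos election:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Replica:
    name: str
    weight: int            # election weight; logger and learner replicas get the lowest
    has_data: bool         # logger replicas store logs only and cannot serve as leader
    alive: bool
    replay_delay_s: float  # how far behind the replica is in applying logs

def elect_leader(replicas: list[Replica], max_replay_delay_s: float = 5.0) -> Optional[Replica]:
    """Toy weighted election: prefer the highest-weight live full replica that is caught up.
    If every live candidate lags beyond the replay-delay window, fall back to the
    highest-weight live candidate so the cluster does not stay leaderless."""
    candidates = [r for r in replicas if r.alive and r.has_data and r.weight > 0]
    if not candidates:
        return None
    caught_up = [r for r in candidates if r.replay_delay_s <= max_replay_delay_s]
    return max(caught_up or candidates, key=lambda r: r.weight)

cluster = [
    Replica("full-1", weight=9, has_data=True,  alive=False, replay_delay_s=0.0),  # failed leader
    Replica("full-2", weight=9, has_data=True,  alive=True,  replay_delay_s=0.2),
    Replica("logger", weight=1, has_data=False, alive=True,  replay_delay_s=0.0),
]
print(elect_leader(cluster).name)  # full-2 takes over within the failover window
```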

Deep integration of XPaxos with the binary log module

  • Uses binary logs as XPaxos logs. Followers directly consume and replay binary logs, eliminating relay log conversion and reducing storage overheads.
  • Adds a replicate step to the binary log group commit (BGC) process, enabling majority-based asynchronous log transmission and lowering transaction commit latency.
  • Applies majority-based protocol validation during crash recovery to ensure cluster data consistency across all disaster recovery scenarios.
  • Enables lock-free XtraBackup based on XPaxos log sequence numbers (LSNs), removing backup-induced interference and operational constraints.
  • Achieves PITR based on XPaxos LSNs, supporting full-cluster PITR to any specified point in time.
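LSN-based PITR can be pictured as replaying consensus log entries on top of a consistent backup. The sketch below is a conceptual illustration with a hypothetical log-entry shape, not the actual backup or recovery tooling:

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    lsn: int        # XPaxos-style log sequence number
    statement: str  # logical or physical change recorded in the log

def point_in_time_recovery(backup_lsn: int, target_lsn: int, log: list[LogEntry]) -> list[str]:
    """Toy PITR: start from a consistent backup taken at backup_lsn and replay every
    entry with backup_lsn < lsn <= target_lsn, in LSN order."""
    to_replay = sorted((e for e in log if backup_lsn < e.lsn <= target_lsn), key=lambda e: e.lsn)
    return [e.statement for e in to_replay]

log = [LogEntry(101, "INSERT ..."), LogEntry(102, "UPDATE ..."), LogEntry(103, "DELETE ...")]
print(point_in_time_recovery(backup_lsn=101, target_lsn=102, log=log))  # ['UPDATE ...']
```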

Performance-oriented enhancements

  • Provides two-tier XPaxos log caching to avoid frequent I/O reads during log transmission, maintaining low latency under high concurrency.
  • Integrates the Libeasy network framework of Alibaba to implement an event-driven, asynchronous I/O mechanism that delivers low-latency, high-throughput network communication.
  • Supports batching and pipelining for transmission, enabling self-adaptation to network latency and maintaining throughput even under high-latency conditions (a simplified batching sketch follows this list).
  • Supports secondary database acceleration for follower replicas in one-phase commit mode to reduce secondary database latency and shorten the cluster RTO.
  • Enables the automatic splitting of large transactions and large objects to prevent oversized transaction logs from harming system stability and performance.
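The batching idea referenced above can be sketched as packing consecutive log entries into bounded batches so that one network round trip carries many entries. The thresholds below are hypothetical:

```python
def batch_entries(entries: list[bytes],
                  max_batch_bytes: int = 256 * 1024,
                  max_batch_count: int = 64) -> list[list[bytes]]:
    """Toy batching: pack consecutive log entries into batches bounded by size and count,
    so that one network round trip can carry many entries on a high-latency link."""
    batches: list[list[bytes]] = []
    current: list[bytes] = []
    current_bytes = 0
    for entry in entries:
        if current and (current_bytes + len(entry) > max_batch_bytes or len(current) >= max_batch_count):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(entry)
        current_bytes += len(entry)
    if current:
        batches.append(current)
    return batches

# 300 one-KB entries are sent as a handful of batches instead of 300 round trips.
print(len(batch_entries([b"x" * 1024] * 300)))  # 5
```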


3. HA Deployment for PolarDB-X DNs

HA deployment for PolarDB-X DNs must be carefully designed around disaster recovery objectives (RPO/RTO), while considering the risk levels (process/machine/data center/city) of business scenarios. The following sections describe practical deployment solutions tailored to different disaster recovery requirements.

3.1. Single-machine Deployment: Process-level Disaster Recovery (RPO = 0, RTO < 8 Seconds)

Scenarios: used to mitigate the high-frequency risk of single-process crashes, meeting the most basic single-machine HA requirements.

Limits and suggestions: unable to meet machine-level, data center-level, or city-level HA disaster recovery requirements. Recommended only for testing environments.

Deployment method:

Topology: Deploy three processes (two full-featured replicas + one logger replica) on a single machine.

Fault recovery:

  • If the leader process fails, the remaining processes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader process does not affect the DN cluster's service availability.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable global transaction identifier (GTID) by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three processes: 9, 9, and 1. Set the weight of the logger replica to 1. The logger replica can be deployed on a lightweight 1c8g (1 CPU core, 8 GB of memory) specification to reduce deployment costs.
  • For the three processes on a single machine, assign separate data and log disks to prevent a single-disk failure from affecting cluster availability.


3.2. Single-data Center Deployment: Process/Machine-level Disaster Recovery (RPO = 0, RTO < 8 Seconds)

Scenarios: used to mitigate high-frequency risks such as single-process crashes or machine hardware failures, meeting process-level and machine-level HA requirements.

Limits and suggestions: unable to meet data center-level or city-level HA disaster recovery requirements. Recommended only for non-critical business workloads.

Deployment method:

Topology: Deploy three nodes (two full-featured replicas + one logger replica) on different machines in the same data center.

Fault recovery:

  • If the leader node fails, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader node does not affect the DN cluster's service availability.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three nodes: 9, 9, and 1. Set the weight of the logger replica to 1. The logger replica can be deployed on a lightweight 1c8g configuration to reduce deployment costs.


3.3. Three-replica Deployment for Two Data Centers-based Zone-disaster Recovery: Process/Machine/Data Center-level Disaster Recovery (RPO = 0, RTO < 30 Seconds)

Scenarios: used to mitigate infrastructure failures such as data center power outages or fiber disconnections and ensure business continuity within the same city, meeting process-level, machine-level, and data center-level HA requirements.

Limits and suggestions: unable to meet city-level HA disaster recovery requirements. Data center-level failover has an approximately 50% probability of requiring manual or operational intervention and cannot be fully automated. Recommended for standard business workloads.

Deployment method:

Topology: Deploy three nodes (two full-featured replicas + one logger replica) in two data centers (A and B) in the same city, where the leader and logger replicas are in Data Center A and the follower replica is in Data Center B.

Fault recovery:

  • If the leader node fails, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader node does not affect the DN cluster's service availability.
  • If Data Center B fails, the DN cluster remains fully available, with an RPO of 0 and an RTO less than 8 seconds.
  • If Data Center A fails, manual or operational intervention is required to force the promotion of the follower node in Data Center B to leader, achieving an RTO less than 30 seconds (the sketch at the end of this section shows why the surviving node cannot form a majority on its own).

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three nodes: 9, 9, and 1. Set the weight of the logger replica to 1. The logger replica can be deployed on a lightweight 1c8g configuration to reduce deployment costs.
  • To maintain an RPO of 0 if Data Center A fails, configure the follower node in Data Center B for synchronous replication. Otherwise, majority-based replication is used by default.
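The failover asymmetry described in this section comes down to whether the surviving data center still holds a majority of replicas. The following simplified sketch (it ignores election weights and replica roles) makes the comparison with the three-data-center layout in Section 3.5 explicit:

```python
def survives_majority(replicas_by_dc: dict[str, int], failed_dc: str) -> bool:
    """Toy check: after losing one data center, do the remaining replicas still form a
    majority of the original cluster (the precondition for automatic failover)?"""
    total = sum(replicas_by_dc.values())
    remaining = total - replicas_by_dc.get(failed_dc, 0)
    return remaining >= total // 2 + 1

# Two data centers, three replicas: leader + logger in A, follower in B (this section).
two_dc = {"A": 2, "B": 1}
print(survives_majority(two_dc, failed_dc="B"))  # True  -> cluster stays available
print(survives_majority(two_dc, failed_dc="A"))  # False -> manual promotion is required

# Three data centers, one replica each (Section 3.5): any single failure keeps a majority.
print(all(survives_majority({"A": 1, "B": 1, "C": 1}, dc) for dc in "ABC"))  # True
```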


3.4. Four-replica Deployment for Two Data Centers-based Zone-disaster Recovery: Process/Machine/Data Center-level Disaster Recovery (RPO = 0, RTO < 30 Seconds)

Scenarios: used to mitigate infrastructure failures such as data center power outages or fiber disconnections and ensure business continuity within the same city, fully meeting process-level, machine-level, and data center-level HA requirements.

Limits and suggestions: unable to meet city-level HA disaster recovery requirements. Data center-level failover requires manual or operational intervention and cannot be fully automated. Recommended for standard business workloads.

Deployment method:

Topology: Deploy four nodes (four full-featured replicas) in two data centers (A and B) in the same city, where the leader and one follower replica are in Data Center A and two follower replicas are in Data Center B.

Fault recovery:

  • If the leader node fails, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader node does not affect the DN cluster's service availability.
  • If either data center fails or a network partition occurs between the data centers, manual or operational intervention is required to promote the node with the most up-to-date logs in the surviving data center to leader, achieving an RPO of 0 and an RTO less than 30 seconds.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the four nodes: 9, 9, 5, and 5. Deploy the leader node in the data center where business workloads reside to minimize network latency and response time (RT) when business workloads access the DN cluster.


3.5. Three Data Centers-based Zone-disaster Recovery Deployment: Process/Machine/Data Center-level Disaster Recovery (RPO = 0, RTO < 8 Seconds)

Scenarios: used to mitigate infrastructure failures such as data center power outages or fiber disconnections and ensure business continuity within the same city, meeting process-level, machine-level, and data center-level HA requirements.

Limits and suggestions: unable to meet city-level HA disaster recovery requirements. Recommended for core business workloads.

Deployment method:

Topology: Deploy three nodes (two full-featured replicas + one logger replica) in three data centers in the same city, with one node in each data center.

Fault recovery:

  • If the leader node fails or the data center where the leader node resides fails, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader node, or any data center hosting a non-leader node, does not affect the DN cluster's service availability.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three nodes: 9, 9, and 1. Set the weight of the logger replica to 1. The logger replica can be deployed on a lightweight 1c8g configuration to reduce deployment costs.
  • Deploy the leader node in the data center where business workloads reside to minimize network latency and RT when business workloads access the DN cluster.


3.6. Deployment of Three Data Centers across Two Regions: Process/Machine/Data Center/City-level Disaster Recovery (RPO = 0, RTO < 8 Seconds)

Scenarios: used to mitigate city-level disasters such as earthquakes or floods, enabling geo-disaster recovery and rapid restoration.

Limits and suggestions: only partially meets city-level HA disaster recovery requirements. City-level failover has an approximately 50% probability of requiring manual or operational intervention and cannot be fully automated. Recommended for key business workloads.

Deployment method:

Topology: Deploy three nodes (three full-featured replicas) in three data centers across two cities, with one node in each data center. Deploy the leader node in the city with two data centers.

Fault recovery:

  • If the leader node fails or the data center where the leader node resides fails, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • If a disaster occurs in the city hosting the leader node, manual or operational intervention is required to force the promotion of the follower node in the other city to leader, achieving an RTO less than 30 seconds.
  • Failure of any non-leader node or any data center hosting a non-leader node, or a disaster in the city not hosting the leader node, does not affect the DN cluster's service availability.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three nodes: 9, 9, and 1. Set the weight of the remote follower replica to 1.
  • Deploy the leader node in the data center where business workloads reside to minimize network latency and RT when business workloads access the DN cluster.


3.7. Deployment of Five Data Centers across Three Regions: Process/Machine/Data Center/City-level Disaster Recovery (RPO = 0, RTO < 8 Seconds)

Scenarios: used to mitigate city-level disasters such as earthquakes or floods, enabling geo-disaster recovery and rapid restoration.

Suggestions: recommended for financial-grade or other key business workloads.

Deployment method:

Topology: Deploy five nodes (four full-featured replicas + one logger replica) in five data centers across three cities, with one node in each data center. Deploy a leader node and a follower node in City 1, two follower nodes in City 2, and a logger node in City 3.

Fault recovery:

  • If the leader node fails, the data center where the leader node resides fails, or a disaster occurs in the city hosting the leader node, the remaining nodes automatically perform leader election based on weights (higher weight takes priority), achieving an RPO of 0 and an RTO less than 8 seconds.
  • Failure of any non-leader node or any data center hosting a non-leader node, or a disaster in the city not hosting the leader node, does not affect the DN cluster's service availability.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the five nodes: 9, 9, 7, 7, and 1. Set the weight of the remote logger replica to 1.
  • Deploy the leader node in the city where business workloads reside to minimize network latency and RT when business workloads access the DN cluster.


3.8. GDN Dual Independent Cluster Deployment (RPO ≈ 0, RTO < 30 Seconds)

Scenarios: Two independent clusters form a global database network (GDN), meeting requirements for multicloud deployment and active geo-redundancy.

Limits and suggestions: The two independent clusters are linked by the community semi-synchronous primary/secondary replication scheme, so the RPO is not 0 (a short sketch at the end of this section illustrates why). Recommended for standard business workloads.

Deployment method:

Topology:

  • A PolarDB-X DN can be deployed as either a primary cluster or a secondary cluster.
  • The topologies of the previously mentioned single-data center, two data centers-based, and three data centers-based deployments in one city are all applicable.

Fault recovery:

  • When a PolarDB-X DN is deployed as the primary cluster:

    • The impact of a node or data center failure in the primary cluster varies depending on the specific deployment architecture of that cluster. The overall RPO is 0 and the overall RTO is less than 8 seconds.
    • If the entire primary cluster fails (such as due to a city-level disaster), manual or operational intervention is required to promote the remote secondary cluster to primary, enabling read/write capabilities and achieving an RTO less than 30 seconds.
  • When a PolarDB-X DN is deployed as the secondary cluster:

    • The impact of a node or data center failure in the primary or secondary cluster varies depending on the specific deployment architecture of that cluster. The overall RPO is 0 and the overall RTO is less than 8 seconds.
    • After HA recovery within the secondary cluster, the primary/secondary replication links must be automatically re-established from the new leader. Replication is resumable and continues from the previous position.
    • If the entire primary cluster fails (such as due to a city-level disaster), manual or operational intervention is required to promote the remote secondary cluster to primary, enabling read/write capabilities and achieving an RTO less than 30 seconds.

Key configuration points:

  • Specify sync_binlog=1 and innodb_flush_log_at_trx_commit=1 for all DN processes and enable GTID by specifying gtid_mode=ON and enforce_gtid_consistency=ON to ensure single-process HA.
  • Database processes must be managed by a daemon process to ensure automatic restart after a process-level failure.
  • Configure weights for the three nodes: 9, 9, and 1. Set the weight of the logger replica to 1.
  • If a DN is deployed as a primary cluster, downstream primary/secondary replication links can be attached to the DN cluster's leader or follower node.
  • If a DN is deployed as a secondary cluster, only the leader node can create and attach replication links to the primary cluster.
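As noted in the limits above, the replication between the two clusters is semi-synchronous, which is why the RPO is approximately rather than exactly 0. The following toy sketch illustrates the degrade-to-asynchronous behavior; the helper and timeout are hypothetical:

```python
from typing import Callable

def semi_sync_commit(wait_for_ack: Callable[[float], bool], timeout_s: float = 1.0) -> str:
    """Toy semi-synchronous commit: wait up to timeout_s for one secondary acknowledgement.
    On timeout the transaction commits anyway (degrade to asynchronous), which is why the
    scheme guarantees RPO ~= 0 rather than RPO = 0."""
    if wait_for_ack(timeout_s):
        return "committed with secondary ack: no loss if the primary fails now"
    return "committed without ack: the latest transactions may be lost if the primary fails"

print(semi_sync_commit(lambda timeout: True))   # healthy replication link
print(semi_sync_commit(lambda timeout: False))  # lagging or partitioned secondary
```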


4. Summary

High availability is no longer optional. It is the baseline for business survival. PolarDB-X fundamentally reshapes the disaster recovery paradigm for distributed databases: from backup-and-restore to real-time self-healing, from manual failover to automated, weight-based leader election, and from protecting against process-level crashes to city-level disasters. With its zero-data-loss guarantee, second-level failover, and cost-efficiency, PolarDB-X establishes a new financial-grade disaster recovery standard for the cloud era.
