PolarDB Architecture: Compute-Storage Separation on Alibaba Cloud

This article examines the architecture of Alibaba Cloud PolarDB, focusing on compute-storage separation, PolarFS with ParallelRaft consensus, and physical redo replication.

Relational databases deployed in cloud environments routinely encounter scaling boundaries that originate not from the database engine itself but from the deployment model surrounding it. Single-instance storage caps individual databases at a few terabytes. Scaling through native replication introduces seconds of lag and operational complexity proportional to the replica count. Failover relies on binary log replay, which extends recovery time as transaction volume grows. Scaling compute capacity demands data redistribution. Each of these constraints stems from a single architectural assumption: that compute and storage are co-located on the same physical node.

PolarDB is engineered around the inverse assumption. Compute nodes and the storage layer are decoupled at the file system level and connected over an RDMA-capable network. A single cluster supports a primary read-write node and up to 15 read-only nodes, all of which share the same distributed storage substrate. The architectural consequences of this separation propagate through every layer of the system and are the focus of this article.

Figure 1. PolarDB cluster architecture: compute-storage separation over RDMA.

Compute-Storage Separation Architecture

PolarDB consists of two independently scalable tiers. The compute tier hosts the database engine (MySQL, PostgreSQL, or PostgreSQL with Oracle compatibility) and is responsible for query parsing, optimisation, execution, and buffer pool management. The storage tier, PolarStore, presents a shared distributed block device to every compute node in the cluster. Because storage is shared, read-only nodes do not maintain independent copies of the database; they read directly from the same physical pages the primary writes to.

Two operational properties follow directly from this design. First, adding a read-only node does not trigger data replication; the new node attaches to existing storage and becomes queryable within minutes, regardless of database size. Second, storage capacity scales independently of compute capacity. A 50 TB database can be served by a small compute node during off-peak periods and scaled to a high-core configuration during analytical batch windows without any data movement.

The separation is implemented through a custom block protocol over RDMA over Converged Ethernet (RoCE v2), which reduces I/O latency by bypassing the kernel TCP/IP stack. Round-trip times for storage operations remain in the tens of microseconds range, comparable to local NVMe latency in most query patterns.

Distributed Storage with PolarFS

PolarStore is built on PolarFS, a distributed file system optimised for database workloads. Within the storage layer, each volume is split into 10 GB chunks, and each chunk is replicated three times. The replicas are placed in different fault domains so that data remains durable through hardware failures.
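The chunk arithmetic can be illustrated with a short sketch. It is not PolarFS code: the function names, the round-robin placement, and the reading of "10 GB" as 10 GiB are assumptions made for illustration only.

```python
CHUNK_SIZE = 10 * 1024**3  # 10 GB per chunk as stated above (assumed to mean GiB)
REPLICAS = 3               # three replicas per chunk


def locate(volume_offset: int) -> tuple[int, int]:
    """Map a byte offset in a volume to (chunk index, offset inside that chunk)."""
    if volume_offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(volume_offset, CHUNK_SIZE)


def chunk_count(volume_bytes: int) -> int:
    """Number of chunks needed to hold a volume (ceiling division)."""
    return -(-volume_bytes // CHUNK_SIZE)


def place_replicas(chunk_index: int, fault_domains: list[str]) -> list[str]:
    """Hypothetical placement: pick REPLICAS distinct fault domains round-robin."""
    if len(fault_domains) < REPLICAS:
        raise ValueError("need at least one fault domain per replica")
    start = chunk_index % len(fault_domains)
    return [fault_domains[(start + i) % len(fault_domains)] for i in range(REPLICAS)]
```

Under these assumptions a 50 TiB volume maps to 5,120 chunks, each of which can be placed and recovered independently of the others.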

Consistency among replicas is maintained by ParallelRaft, a variant of Raft that permits log entries to be acknowledged, committed, and applied out of order. Standard Raft commits log entries strictly in sequence, so one slow entry delays every entry behind it and inflates tail latency under parallel I/O. ParallelRaft relaxes the ordering constraint where it is safe to do so: each entry carries a look-behind buffer summarising the block ranges written by the preceding entries, and an entry may proceed ahead of a missing predecessor only if their ranges do not overlap. Higher concurrency therefore does not degrade P99 write latency, and the consistency guarantees the database depends on are preserved.
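The out-of-order rule can be sketched as a conflict check over block ranges. This is a simplified model, not the ParallelRaft implementation: the `LogEntry` fields, the `window` parameter, and the choice to treat an unknown predecessor as a conflict are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    index: int      # position in the replicated log (1-based)
    lba_start: int  # first logical block address written
    lba_count: int  # number of blocks written


def overlaps(a: LogEntry, b: LogEntry) -> bool:
    """True if the two entries write at least one common block."""
    return (a.lba_start < b.lba_start + b.lba_count
            and b.lba_start < a.lba_start + a.lba_count)


def can_apply_out_of_order(entry: LogEntry, log: dict[int, LogEntry],
                           applied: set[int], window: int) -> bool:
    """Decide whether `entry` may be applied before all predecessors are applied.

    `window` models how many preceding entries the look-behind information covers.
    """
    for idx in range(1, entry.index):
        if idx in applied:
            continue
        if idx < entry.index - window:
            return False  # unapplied hole lies beyond the look-behind window
        earlier = log.get(idx)
        if earlier is None or overlaps(entry, earlier):
            return False  # unknown or conflicting predecessor: wait for it
    return True
```

Entries that touch disjoint block ranges pass the check and proceed in parallel, while overlapping writes fall back to strict ordering.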

To reduce context switching on the I/O path, PolarFS exposes storage to clients through a user-space library. I/O requests from the database engine bypass the kernel page cache and block layer, avoiding the overhead associated with both.

Physical Replication and Read Scaling

In conventional MySQL deployments, read scaling relies on asynchronous logical replication. SQL statements or row-change events from the binary log are replayed on each replica. This model introduces two costs: replication lag proportional to write volume, and CPU consumption on replicas for statement re-execution. In write-heavy workloads, replicas commonly fall seconds behind the primary.

PolarDB read-only nodes do not replicate data. Because all nodes share the same underlying storage, only buffer pool synchronisation is required. The primary node ships physical redo log records and block-level page modifications to read-only nodes over a dedicated link. Read-only nodes apply these records to invalidate or update their in-memory pages without re-executing SQL. Replication lag in this model is typically sub-second under steady-state load and remains bounded during write spikes, because the redo stream is significantly smaller than the equivalent logical replication payload.
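A minimal sketch of how a read-only node might consume the redo stream follows. The record layout, class names, and the policy of skipping pages that are not cached are assumptions for illustration; the actual record format and apply logic are not described in this article.

```python
from dataclasses import dataclass


@dataclass
class RedoRecord:
    lsn: int       # log sequence number of the change
    page_id: int   # page the change applies to
    offset: int    # byte offset inside the page
    data: bytes    # new bytes to write at that offset


@dataclass
class CachedPage:
    lsn: int         # LSN of the last change reflected in this copy
    body: bytearray  # in-memory page contents


class ReadOnlyBufferPool:
    """Hypothetical buffer pool of a read-only node."""

    def __init__(self) -> None:
        self.pages: dict[int, CachedPage] = {}
        self.applied_lsn = 0

    def apply(self, rec: RedoRecord) -> None:
        page = self.pages.get(rec.page_id)
        # Only cached pages need patching; uncached pages are read from
        # shared storage on demand. Records older than the page are ignored.
        if page is not None and rec.lsn > page.lsn:
            page.body[rec.offset:rec.offset + len(rec.data)] = rec.data
            page.lsn = rec.lsn
        self.applied_lsn = max(self.applied_lsn, rec.lsn)
```

No SQL is re-executed at any point: the node patches bytes in pages it already holds and advances its applied LSN.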

For application design, the implication is clear. Read traffic can be directed to read-only nodes with substantially reduced risk of stale-read anomalies. For transactions that require read-after-write consistency on the same logical session, the primary endpoint remains the correct target.

Cluster Topology and Connection Routing

A PolarDB cluster exposes three categories of endpoints. The cluster endpoint provides read/write splitting through PolarProxy, an intermediary that inspects each statement, routes writes to the primary, and load-balances reads across the available read-only nodes. The primary endpoint routes all traffic exclusively to the primary and suits connections that require transactional read-after-write consistency. Custom endpoints group specific subsets of read-only nodes behind a dedicated address, enabling workload isolation: for example, two read-only nodes can be dedicated to reporting queries while the remainder stay available for application traffic.

PolarProxy maintains connection pooling and transaction-aware routing, preserving multi-statement transactions on the primary rather than splitting them across nodes. Hint syntax is supported for cases where the application needs to override the proxy routing decision on a per-query basis.
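The routing behaviour described above can be approximated in a few lines. This is a sketch of the general technique, not PolarProxy's logic: the `Router` class, the keyword-based statement classification, and the round-robin balancing are assumptions, and hint handling is omitted.

```python
import itertools


class Router:
    """Hypothetical transaction-aware read/write splitter."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None
        self.in_transaction = False

    def route(self, sql: str) -> str:
        words = sql.strip().split(None, 1)
        verb = words[0].upper().rstrip(";") if words else ""
        if verb in ("BEGIN", "START"):
            self.in_transaction = True
            return self.primary
        if verb in ("COMMIT", "ROLLBACK"):
            self.in_transaction = False
            return self.primary
        # Anything inside a transaction, and any non-SELECT, stays on the primary.
        if self.in_transaction or verb != "SELECT" or self._replicas is None:
            return self.primary
        if "FOR UPDATE" in sql.upper():
            return self.primary  # locking reads must see the primary's state
        return next(self._replicas)
```

Keeping the whole transaction on one node is what makes the split safe: a read issued after a write in the same transaction is never sent to a replica that has not yet applied the change.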

Failover and High Availability

A PolarDB cluster is deployed with at least one primary and one read-only node, with the read-only node functioning as a standby. Fault detection is handled by VotingDisk, a distributed coordination service that uses atomic read/write semantics on shared storage to determine node health without relying on inter-node heartbeat networks alone. On confirmed primary failure, the standby is promoted by reattaching to the shared storage volume. No data recovery from log replay is required because the standby already has up-to-date buffer pool state from the physical redo stream.

Two distinct failover modes are exposed, with materially different recovery characteristics. With the hot replica feature enabled on a read-only node, the standby pre-fetches undo pages and applies redo updates continuously, and failover completes within 5 to 10 seconds. Without a hot replica, the elected read-only node must build transaction-system state from undo pages on promotion, and failover typically takes 20 to 30 seconds. The decision between the two is driven by the application’s connection-retry tolerance and the cost of provisioning a read-only node at primary-matching specifications, which the hot replica feature requires.
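Connection-retry tolerance can be checked against these windows with simple arithmetic. The sketch below assumes a capped exponential backoff policy on the client; the function name and the default delays are illustrative and do not come from any PolarDB driver.

```python
def backoff_schedule(window_s: float, base_s: float = 0.5, cap_s: float = 5.0) -> list[float]:
    """Return retry delays (seconds) whose sum covers a failover window.

    Delays double from `base_s` up to `cap_s`; retries continue until the
    cumulative wait is at least `window_s`.
    """
    delays: list[float] = []
    total = 0.0
    delay = base_s
    while total < window_s:
        delays.append(delay)
        total += delay
        delay = min(delay * 2, cap_s)
    return delays
```

With these defaults, covering a 10-second failover takes five retries, while covering a 30-second failover takes nine, which gives a concrete basis for deciding whether the hot replica's additional cost is justified.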

Cross-zone deployment places compute nodes and storage replicas in separate availability zones within a region. Storage durability is independently maintained by the PolarStore three-replica chunk model, which survives zone-level outages without compute-tier intervention. For cross-region resilience, the Global Database Network capability supports asynchronous replication of an entire PolarDB cluster to a secondary region, with sub-second cross-region replication lag under typical inter-region network conditions.

Operational Considerations

Three factors shape long-term operational outcomes for PolarDB deployments.

Endpoint strategy: Applications should map their read and write paths to the appropriate endpoints rather than relying on a single connection string. Routing all traffic through the cluster endpoint with PolarProxy is appropriate for OLTP applications with mixed read and write patterns. Reporting and analytical traffic should target a custom endpoint scoped to dedicated read-only nodes, isolating its resource consumption from transactional workloads.
Storage growth and cost: Because storage is decoupled from compute, capacity scales automatically up to the configured ceiling and is billed by consumed rather than provisioned size. This eliminates the overprovisioning that single-instance databases require to accommodate growth headroom. Backup retention windows and binary log retention periods should be reviewed periodically, as both contribute to billed storage beyond the active dataset.
Security boundary configuration: PolarDB clusters should be deployed within a VPC with security group rules restricting access to known application subnets. RAM policies should grant cluster-level permissions scoped to specific resource ARNs, and the root account should not be used for application connections. Transparent Data Encryption (TDE) can be enabled at the storage layer with keys managed by KMS, satisfying common compliance requirements without application changes.

Conclusion

The scalability challenges of traditional managed relational databases can be overcome with five specific engineering decisions, as implemented in PolarDB:

Decoupling compute from storage allows each tier to scale independently and removes data copying when compute is resized or nodes are added.
PolarFS with ParallelRaft consensus provides durable distributed storage while keeping tail latency low under concurrent writes.
RDMA-based I/O keeps storage latency comparable to that of local NVMe.
Physical replication through redo log shipping avoids the lag and CPU cost that logical replication imposes on read-only replicas.
Fault detection with VotingDisk, combined with a hot replica standby, allows failover within 5 to 10 seconds without replay-based recovery.

Such traits make PolarDB an excellent choice for applications with rapid storage growth, mixed transactional and analytical queries, or strict recovery time goals. However, when the workload fits neatly into a single instance storage footprint and scales predictably in reads, a simpler managed database is likely more practical.

Disclaimer: The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.
