
Analyze the Technical Essentials of PolarDB at the Architecture Level

This article analyzes the technical points of PolarDB at the architecture level (Storage-Computing Separation Architecture and HTAP Architecture).

Background

PolarDB for PostgreSQL is an enterprise-level database product developed by Alibaba Cloud. It uses a compute-storage separation architecture and is compatible with PostgreSQL and Oracle. PolarDB supports the scale-out of storage and computing capabilities. PolarDB provides enterprise-level database features (such as high reliability, high availability, and elastic scaling). At the same time, PolarDB has large-scale parallel computing capabilities, which can cope with the mixed load of OLTP and OLAP. It also has multi-mode innovative features (such as spatio-temporal, vector, search, and graph), which can meet the new requirements of enterprises for rapid data processing.

PolarDB supports multiple deployment modes: storage and computing separation deployment, X-Paxos three-node deployment, and local disk deployment.

Problems with Traditional Databases

As user business data becomes larger and more complex, traditional database systems face huge challenges:

  1. The storage space cannot exceed the upper limit of a single machine.
  2. Read-only instances are used for read scaling, but each read-only instance has its own exclusive copy of the storage, resulting in increased costs.
  3. As the amount of data increases, the time required to create a read-only instance increases.
  4. The primary/secondary latency is high.

Benefits of PolarDB Cloud-Native Databases

[Figure 1]

The PolarDB cloud-native database was developed to solve the preceding problems of traditional databases. It adopts a self-developed architecture that separates computing clusters and storage clusters. It has the following advantages:

  1. Extensibility: Storage and computing separation and extreme elasticity
  2. Cost: Share data with low storage costs
  3. Ease of Use: One-write, multiple-read, and transparent read/write splitting
  4. Reliability: Three replicas and a second-level backup

An Overview of the Overall Architecture of PolarDB

The following section describes the architecture of PolarDB from two aspects: storage-computing separation architecture and HTAP architecture.

Storage-Computing Separation Architecture

[Figure 2]

PolarDB is designed to separate storage and computing. Storage clusters and computing clusters can be expanded independently:

  1. If the computing capacity is insufficient, you can expand the computing cluster separately.
  2. If the storage capacity is insufficient, you can expand the storage cluster separately.

Based on shared storage, the primary node can share the storage data with multiple read-only nodes. However, the primary node can no longer flush dirty pages as freely as in the traditional approach. Otherwise:

  1. The page a read-only node reads from storage may be an outdated version that does not match its replay state.
  2. The page a read-only node reads may be a future version that is ahead of the data it has replayed in memory.
  3. When the primary node fails over to a read-only node and the read-only node takes over data updates, the pages in storage may be old, and logs must be read to restore the dirty pages.

The first problem requires a page multi-version capability. The second problem requires the primary node to control the speed at which dirty pages are flushed.

HTAP Architecture

After read/write splitting, a single compute node cannot take advantage of the large I/O bandwidth on the storage side or accelerate large queries by adding computing resources. Therefore, we developed Shared-Storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios. PolarDB supports two types of computing engines on the same set of OLTP data:

  • Standalone Execution Engine: handles highly concurrent OLTP workloads.
  • Distributed Execution Engine: handles large OLAP queries.

[Figure 3]

With the same hardware resources, performance reaches 90% of traditional Greenplum. At the same time, PolarDB has SQL-level elasticity: when computing power is insufficient, the CPUs participating in OLAP queries can be increased at any time without data redistribution.

PolarDB – Storage-Computing Separation Architecture

Challenges Posed by Shared Storage

With shared storage, the database moves from the traditional shared-nothing architecture to a shared-storage architecture, and the following issues need to be resolved:

  • Data Consistency: The original N copies of compute + N copies of storage become N copies of compute + 1 copy of storage.
  • Read/Write Splitting: How to achieve low-latency replication based on the new architecture
  • High Availability: How to perform recovery and failover
  • IO Model: How to optimize from Buffer-IO to Direct-IO

Architecture Description

[Figure 4]

First, let's look at the architecture of Shared-Storage-based PolarDB:

  • The primary node is a read-write node (RW), and the read-only node is a read-only node (RO).
  • Only the primary node writes data to the shared-storage layer, so the data on disk seen by the primary node and the read-only nodes is consistent.
  • The memory state of a read-only node is kept in sync with the primary node by replaying WAL.
  • WAL logs of the primary node are written to Shared-Storage. Only WAL meta is copied to read-only nodes.
  • The read-only node reads WAL from Shared-Storage and plays it back.

Data Consistency

Memory State Synchronization of Traditional Databases

In a traditional shared-nothing database, the primary node and the read-only node each have their own memory and storage. The primary node only needs to copy WAL logs to the read-only node, which replays them in sequence. This is also the basic principle of the replicated state machine.

Shared-Storage-Based Memory State Synchronization

As mentioned earlier, after the separation of storage and computing, the pages read on Shared-Storage are consistent, and the memory state is obtained by reading the latest WAL from Shared-Storage and playing it back, as shown in the following figure:

[Figure 5]

  1. The primary node flushes the page of version 200 to Shared-Storage.
  2. A read-only node, whose copy of the page is still at version 100, replays the logs to bring it to version 200.

"Outdated Pages" Based on Shared-Storage

In the preceding process, pages replayed from logs on a read-only node may later be evicted from the buffer pool. When such a page is read from storage again, an older version of the page is returned; these are called outdated pages. Please see the following figure:

[Figure 6]

  1. At T1, the primary node writes the log LSN=200, updating the content of page P1 from 500 to 600.
  2. At this time, the content of page P1 on the read-only node is still 500.
  3. At T2, the primary node sends the meta information of log 200 to the read-only node, so the read-only node learns that a new log exists.
  4. At T3, when page P1 is read on the read-only node, the log with LSN=200 for P1 must be read and replayed to obtain the latest content of P1, which is 600.
  5. At T4, the replayed latest page P1 is evicted because the BufferPool on the read-only node is insufficient.
  6. The primary node has not yet flushed the latest content of P1 (600) to Shared-Storage.
  7. At T5, the read-only node reads P1 again. Since P1 has been evicted from memory, it is read from Shared-Storage, and an outdated page is returned.

The Solution to "Outdated Page"

Whenever a read-only node reads a page, it needs to find the corresponding base page and the logs starting from that version, and replay them in sequence (as shown in the following figure):

[Figure 7]

  1. Maintain the log meta corresponding to each Page in the memory of the read-only node.
  2. When one Page is read, the logs are applied one by one as needed until the desired Page version.
  3. When a log is applied, it is read from Shared-Storage using the meta of the log.

The preceding analysis shows that an inverted index from each page to its logs must be maintained, and the memory of a read-only node is limited, so this page-to-log index must be persisted. PolarDB therefore designed a persistent index structure, LogIndex, which is a persistent hash data structure.

  1. A read-only node receives WAL metadata from the primary node through a WAL receiver.
  2. WAL meta records which pages are modified in the log.
  3. Insert the WAL meta into LogIndex. The key is PageID and the value is LSN.
  4. A WAL record may update multiple pages (for example, in an index split), producing multiple entries in LogIndex.
  5. At the same time, mark the page with outdate in BufferPool, so the corresponding log is replayed from LogIndex during the next reading.
  6. When memory usage reaches a certain threshold, LogIndex asynchronously flushes the in-memory hash to disk.

[Figure 8]

With LogIndex, the "outdated page" problem is solved, and replay on read-only nodes becomes Lazy replay: only the meta information of the logs needs to be replayed up front.
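The following is a minimal Python sketch of the LogIndex idea, assuming a simplified model in which LogIndex maps each PageID to the ordered list of LSNs that modified it; the helpers `storage.read_base_page`, `storage.read_wal`, and `apply_wal` are hypothetical stand-ins for the kernel's page and WAL access routines, not PolarDB's actual API:

```python
from collections import defaultdict

class LogIndex:
    """Toy model of the PageID -> LSN-list hash; the real LogIndex is persistent."""
    def __init__(self):
        self._index = defaultdict(list)          # PageID -> ordered list of LSNs

    def insert(self, page_id, lsn):
        # Called when WAL meta arrives; a record that touches several pages
        # (e.g. an index split) produces one entry per page.
        self._index[page_id].append(lsn)

    def lsns_after(self, page_id, base_lsn):
        # Logs that still need to be applied on top of the on-disk base page.
        return [lsn for lsn in self._index[page_id] if lsn > base_lsn]

def read_page(page_id, buffer_pool, log_index, storage, apply_wal):
    """Lazy replay on a read-only node: read the (possibly outdated) base page,
    then apply only the WAL records that LogIndex lists for this page."""
    page, page_lsn = storage.read_base_page(page_id)        # may be an outdated version
    for lsn in log_index.lsns_after(page_id, page_lsn):
        page = apply_wal(page, storage.read_wal(lsn))        # full WAL comes from shared storage
    buffer_pool[page_id] = page                              # cache the reconstructed version
    return page
```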

Shared-Storage-Based "Future Pages"

After storage and computing are separated, dirty page flushing also introduces a "future page" problem. Please see the following figure:

[Figure 9]

  1. At T1, the primary node updates P1 twice and generates two logs. At this time, the content of page P1 on the primary node and read-only nodes is 500.
  2. At T2, the log LSN=200 is sent to the read-only node.
  3. At T3, the read-only node replays the log with LSN=200, and the content of P1 becomes 600. At this time, the read-only node has replayed only up to LSN=200; the log with LSN=300 has not arrived yet.
  4. At T4, the primary node flushes dirty pages, writing the latest content of P1 (700) to Shared-Storage. At the same time, the BufferPool on the read-only node evicts page P1.
  5. At T5, the read-only node reads page P1 again. Since P1 is not in the BufferPool, the latest P1 is read from shared storage. However, the read-only node has not yet replayed the log with LSN=300, so it reads a future page that is ahead of its replay position.
  6. The problem with future pages is that some pages are future pages while others are normal, so the data becomes inconsistent. For example, after an index split involving two pages, reading one normal page and one future page breaks the B+Tree index structure.

The Solution to "Future Page"

Future pages appear because the primary node flushes dirty pages faster than some read-only node can replay (even though the read-only node's Lazy replay is already very fast). Therefore, the solution is to control the primary node's dirty page flushing progress so it never exceeds the replay point of the slowest read-only node, as shown in the following figure:

[Figure 10]

  1. The read-only node has replayed up to point T4.
  2. When the primary node flushes dirty pages, it sorts all dirty pages by LSN and flushes only those at or before T4; dirty pages after T4 are not flushed.
  3. The LSN position of T4 is called the consistency point. A minimal sketch of this rule follows.
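A minimal sketch of the consistency-point rule, assuming the primary tracks each dirty page's LSN and knows every read-only node's replay position (`write_page` is a hypothetical storage call):

```python
def consistency_lsn(ro_replay_lsns):
    """The slowest read-only node's replay position bounds what may be flushed."""
    return min(ro_replay_lsns)

def flush_dirty_pages(dirty_pages, ro_replay_lsns, write_page):
    """dirty_pages: iterable of (lsn, page_id, page) tuples held by the primary node."""
    limit = consistency_lsn(ro_replay_lsns)
    for lsn, page_id, page in sorted(dirty_pages, key=lambda d: d[0]):
        if lsn > limit:
            break                       # newer pages stay in memory until all ROs catch up
        write_page(page_id, page)       # safe: every RO has already replayed past this LSN

# Example: with read-only nodes replayed to LSN 400 and 700, only dirty pages
# with LSN <= 400 are flushed to shared storage.
```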

Low-Latency Replication

Problems with Traditional Stream Replication

  1. Synchronization Procedure: The log synchronization path involves many I/O operations, and the network transmission volume is large.
  2. Page Replay: Reading pages and applying buffer modifications are slow (I/O-intensive and CPU-intensive).
  3. DDL Replay: The file being modified must be locked, and acquiring the lock is easily blocked, resulting in slow DDL statements.
  4. Snapshot Update: Transaction snapshot updates are slow under high RO concurrency.

Please see the following figure:

[Figure 11]

  1. The primary node writes WAL logs to the local file system.
  2. The WAL Sender process reads the WAL logs and sends them.
  3. The WAL Receiver process of the read-only node receives them and writes them to the local file system.
  4. The replay process reads the WAL logs, reads the corresponding pages into the BufferPool, and replays them in memory.
  5. The primary node flushes dirty pages to Shared Storage.

As you can see, the entire link is very long, and the latency of read-only nodes is high, which affects the load balancing of read/write splitting for user services.

Optimization 1 – Copy Meta Only

The underlying layer is Shared-Storage, and read-only nodes can directly read the required WAL data from Shared-Storage. Therefore, the primary node only copies the metadata of the WAL log (with the payload removed) to the read-only node. Therefore, the network transmission volume is small, and the IO on the critical path is reduced. Please see the following figure:

[Figure 12]

  1. WAL Record is composed of Header, PageID, and Payload.
  2. Read-only nodes can directly read WAL files on Shared-Storage. Therefore, the primary node only sends (copies) the metadata of WAL logs to read-only nodes, including Header and PageID.
  3. On the read-only node, the complete WAL record is read directly from the WAL file on Shared-Storage using the WAL metadata (a sketch of this split follows).
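A hedged sketch of the "copy meta only" idea; the record layout below is illustrative and does not reproduce the actual PostgreSQL WAL format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WalRecord:                 # full record as written to shared storage
    lsn: int
    page_ids: List[int]          # pages this record modifies
    payload: bytes               # bulk of the record, e.g. tuple data / full-page images

@dataclass
class WalMeta:                   # what the primary actually streams to read-only nodes
    lsn: int
    page_ids: List[int]          # header + PageIDs only; no payload

def to_meta(record: WalRecord) -> WalMeta:
    return WalMeta(record.lsn, list(record.page_ids))

def on_meta_received(meta: WalMeta, log_index):
    # The read-only node just registers the meta; the payload is read from the
    # shared WAL file later, only if and when the page is actually needed.
    for page_id in meta.page_ids:
        log_index.insert(page_id, meta.lsn)
```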

The preceding optimization significantly reduces the amount of network transmission between the primary node and read-only nodes. As the following figure shows, the network transmission volume is reduced by 98%.

[Figure 13]

Optimization 2 – Page Replay Optimization

In traditional database log replay, a large number of pages are read, logs are applied to them one by one, and the pages are written back to disk, all on the critical path of user read I/O. With storage-computing separation, this can be avoided: if a page is not in the read-only node's BufferPool, no I/O is issued and only a LogIndex entry is recorded.

The following IO operations in the replay process can be offloaded to the session process:

  1. Data page IO overhead
  2. Log apply overhead
  3. Multi-version page replay based on LogIndex

As shown in the following figure, when applying a WAL meta in the replay process on the read-only node:

[Figure 14]

  1. If the corresponding page is not in the memory, only LogIndex is recorded.
  2. If the corresponding page is in memory, mark it as outdated and record the LogIndex entry; the replay process is then done with this record.
  3. When a user session process reads the page, it reads the correct base page into the BufferPool and uses LogIndex to replay the corresponding logs.
  4. In this way, the main I/O operations are offloaded from the original single replay process to multiple user processes.

With the optimizations above, you can significantly reduce the latency of replay.

[Figure 15]

Optimization 3 – DDL Lock Replay Optimization

When you execute DDL statements on the primary node (drop table, for example), the table is locked on all nodes. This ensures that the table file is not deleted by the primary node while it is being read on a read-only node (because only one copy of the file is stored on Shared-Storage). The exclusive lock on the table is replicated to all read-only nodes through WAL, and the read-only nodes replay the DDL lock.

When the replay process replays the DDL lock, acquiring the table lock may block it for a long time. Therefore, the DDL lock acquisition can be offloaded to other processes to keep it off the replay critical path, as sketched below.
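A simplified sketch of this offloading, modeled here with a helper thread and a queue rather than PolarDB's actual process model; `acquire_table_lock` and the record layout are assumptions for illustration:

```python
import queue
import threading

def start_lock_worker(acquire_table_lock):
    """Helper that performs the potentially blocking lock waits off the critical path."""
    requests = queue.Queue()

    def run():
        while (table_oid := requests.get()) is not None:
            acquire_table_lock(table_oid)        # may block for a long time
    threading.Thread(target=run, daemon=True).start()
    return requests

def replay_loop(wal_records, apply_record, lock_requests):
    for rec in wal_records:
        if rec.get("type") == "ddl_lock":
            lock_requests.put(rec["table_oid"])  # offload; do not wait here
        else:
            apply_record(rec)                    # ordinary replay continues smoothly
```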

[Figure 16]

Through the preceding optimization, the replay process is always in a smooth state, and the critical path of replay is not blocked by waiting for DDL.

[Figure 17]

The preceding three optimizations significantly reduce the replication latency and bring the following advantages:

  • Read/Write Splitting: better load balancing, closer to the Oracle RAC experience
  • High Availability: It accelerates the HA process.
  • Stability: It minimizes the number of future pages, so fewer or no page snapshots need to be written.

Recovery Optimization

Background

The recovery time after a database OOM or crash is long; in essence, log replay is slow. The problem is even more prominent under the shared-storage Direct-IO model.

[Figure 18]

Lazy Recovery

As mentioned earlier, Lazy replay has been implemented on read-only nodes through LogIndex. Recovery after a restart of the primary node is essentially log replay as well, so Lazy replay can also be used to accelerate the recovery process:

[Figure 19]

  1. Read WAL logs sequentially from the checkpoint.
  2. Only the LogIndex entries for these logs are built; replay is then considered complete.
  3. Recovery finishes, and the node starts serving requests.
  4. The real page replay is offloaded to the session processes that come in after the restart, as sketched below.
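A minimal sketch of Lazy recovery under the same assumptions as the LogIndex sketch above: recovery only registers WAL meta in LogIndex and performs no page I/O.

```python
def lazy_recovery(wal_records_from_checkpoint, log_index):
    """Crash recovery under Lazy replay: build LogIndex entries only, no page I/O."""
    for record in wal_records_from_checkpoint:       # WAL read sequentially from the checkpoint
        for page_id in record.page_ids:
            log_index.insert(page_id, record.lsn)    # meta only; no page reads or writes
    # At this point recovery is "done" and the node can accept connections;
    # sessions later reconstruct each page on demand (see read_page above).
```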

After optimization (replaying a 500 MB log volume):

[Figure 20]

Persistent BufferPool

The preceding solution optimizes the restart speed during recovery, but after the restart, session processes still replay pages on demand by reading WAL logs, so responses are briefly slow right after recovery. The further optimization is to keep the BufferPool alive when the database restarts (as shown in the following figure): the BufferPool is not destroyed across a crash and restart.

[Figure 21]

The shared memory in the kernel is divided into two parts:

  1. Global Structure (such as ProcArray)
  2. BufferPool Structure: The BufferPool is allocated from named shared memory, which remains valid after the process restarts, while the global structure must be reinitialized after the restart. A toy illustration of the named-shared-memory idea follows.
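As an illustration of the idea (not PolarDB's implementation), Python's standard library can create a named shared-memory segment that a restarted process re-attaches to instead of rebuilding; the segment name below is hypothetical.

```python
from multiprocessing import shared_memory

SEGMENT_NAME = "polar_bufferpool_demo"        # hypothetical segment name

def attach_buffer_pool(size=16 * 1024 * 1024):
    """Re-attach to the named segment if it survived the restart, else create it."""
    try:
        return shared_memory.SharedMemory(name=SEGMENT_NAME)
    except FileNotFoundError:
        return shared_memory.SharedMemory(name=SEGMENT_NAME, create=True, size=size)

buf = attach_buffer_pool()
buf.buf[0] = 42        # "cached page" bytes written here survive a process restart
buf.close()            # detach but do not unlink(), so the segment outlives this process
```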

[Figure 22]

However, not every page in the BufferPool can be reused. For example, if a process held an exclusive (X) lock on a page and then crashed, no process remains to release that lock. Therefore, after a crash and restart, the whole BufferPool must be traversed to discard pages that cannot be reused. In addition, reclaiming the BufferPool memory depends on Kubernetes.

After this optimization, the performance is stable before and after the restart.

[Figure 23]

PolarDB – HTAP Architecture

After PolarDB implements read/write splitting, the underlying layer is a storage pool, so the I/O throughput is theoretically unlimited. However, large queries can only be executed on a single compute node, whose CPU, memory, and I/O are limited. Therefore, a single compute node cannot take advantage of the large I/O bandwidth on the storage side or accelerate large queries by adding computing resources. We developed Shared-Storage-based MPP distributed parallel execution to accelerate OLAP queries in OLTP scenarios.

Principles of HTAP Architecture

The underlying storage of PolarDB is shared across nodes, so tables cannot simply be scanned directly as in traditional MPP. We added MPP distributed parallel execution on top of the original standalone execution engine and optimized it for Shared-Storage. Shared-Storage-based MPP is an industry first. Its principles are listed below:

  1. The Shuffle operator masks the data distribution.
  2. The ParallelScan operator masks the shared storage.

[Figure 24]

As shown in the figure:

  1. Table A and Table B are joined and aggregated.
  2. The table in the shared storage is still a single table and is not physically partitioned.
  3. Four types of scanning operators are redesigned to scan tables on the shared storage in slices to form virtual partitions.

Distributed Optimizer

Based on the community GPORCA optimizer, we extended a set of Transformation Rules that are aware of shared-storage characteristics. This enables the optimizer to explore the plan space unique to shared storage: for example, a table in PolarDB can be scanned in full or scanned in slices as virtual partitions. This is the essential difference from traditional MPP.

In the following figure, the upper gray part is the adaptation layer between the PolarDB kernel and the GPORCA optimizer. The lower part is the ORCA kernel, where the gray modules are extensions for shared-storage features.

[Figure 25]

Operator Parallelization

Four types of operators in PolarDB need to be parallelized. The following describes the parallelization of the representative SeqScan operator. During a sequential scan, the table is logically split into 4 MB units to make maximum use of the storage's large I/O bandwidth, and the I/O is spread across as many disks as possible so that all disks serve reads at the same time. A further advantage is that each read-only node scans only part of the table files, so the total table size that can be cached is the sum of the BufferPools of all read-only nodes. A sketch of the logical splitting follows.
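A hedged sketch of the virtual-partitioning idea: the shared table file is cut into 4 MB logical slices and handed to workers. The round-robin assignment shown here is a simplification, since PolarDB dispatches slices dynamically (see the data skew section below).

```python
SLICE_BYTES = 4 * 1024 * 1024        # scan unit chosen to exploit large I/O bandwidth

def logical_slices(table_size_bytes):
    """Yield (offset, length) virtual partitions covering the shared table file."""
    offset = 0
    while offset < table_size_bytes:
        yield offset, min(SLICE_BYTES, table_size_bytes - offset)
        offset += SLICE_BYTES

def assign_round_robin(slices, n_workers):
    """Static assignment shown for simplicity; PolarDB dispatches slices dynamically."""
    buckets = [[] for _ in range(n_workers)]
    for i, s in enumerate(slices):
        buckets[i % n_workers].append(s)
    return buckets

# e.g. a 1 GiB table split across 4 workers:
# assign_round_robin(logical_slices(1 << 30), 4)
```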

[Figure 26]

In the following chart:

  1. Adding read-only nodes improves scanning performance linearly, up to 30 times.
  2. When the buffer is enabled, the scan time drops from 37 minutes to 3.75 seconds.

[Figure 27]

Testing Scenario | Time | Read Throughput | Speedup
Buffer off, single RO, 64 concurrent | 47 minutes | 360 MB/s | 30x
Buffer off, 6 RO, 15 concurrent | 1.5 minutes | 11.34 GB/s total | 30x
Buffer on, single RO, 64 concurrent | 37 minutes | 280 MB/s | 600x
Buffer on, 6 RO, 40 concurrent | 3.75 seconds | 0 | 600x

Data Skew Elimination

Data skew is an inherent problem in traditional MPP.

In PolarDB, large objects in heap tables are stored in associated TOAST tables, so no matter which table is partitioned, the load cannot be balanced. In addition, jitter in the transaction, buffer, network, and I/O load of different read-only nodes can also cause long-tail processes during distributed execution.

[Figure 28]

  1. The coordinator node is internally divided into a DataThread and a ControlThread.
  2. The DataThread is responsible for collecting and summarizing tuples.
  3. The ControlThread is responsible for controlling the scan progress of each scan operator.
  4. Workers that scan faster can scan more logical data slices.
  5. The affinity of the buffer needs to be considered during the process.

Note: Although slices are allocated dynamically, buffer affinity is preserved as much as possible. In addition, the context of each operator is stored in the worker's private memory, and the coordinator does not store information about specific tables. A simplified sketch of this dynamic dispatch follows.
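A simplified model of the dynamic dispatch, using threads and a work queue purely for illustration: each worker pulls the next slice when it finishes the previous one, so faster workers naturally scan more slices and no worker becomes a long-tail straggler.

```python
import queue
import threading

def run_parallel_scan(slices, worker_scan_fns):
    """worker_scan_fns: one scan function per worker; each pulls work on demand."""
    work = queue.Queue()
    for s in slices:
        work.put(s)
    for _ in worker_scan_fns:
        work.put(None)                      # one stop marker per worker

    results = []
    def worker(scan_fn):
        while (s := work.get()) is not None:
            results.append(scan_fn(s))      # a slow scan_fn simply ends up with fewer slices

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in worker_scan_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```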

As shown below, when large objects appear, static partitioning causes data skew, while dynamic scanning still scales linearly.

[Figure 29]

SQL-Level Elastic Extension

Taking further advantage of the shared-data characteristics, we can also meet the cloud-native requirement of extreme elasticity. The external dependencies of every module on the coordinator link reside on shared storage, and the runtime parameters needed on the worker link are synchronized from the coordinator through the control link, making the coordinator and workers stateless.

[Figure 30]

Thus:

  1. Any read-only node that an SQL connection lands on can act as the coordinator node, which eliminates the coordinator as a single point of failure.
  2. A single SQL statement can start any number of workers on any nodes, achieving SQL-level elastic scaling of computing power. It also gives businesses more scheduling strategies: different business domains can run on different node sets at the same time.

[Figure 31]

Transaction Consistency

Data consistency across multiple compute nodes is achieved by the wait-for-replay and global snapshot mechanisms. Waiting for replay ensures that all workers can see the required data version, while the global snapshot ensures that a unified version is selected. A minimal sketch follows.
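A minimal sketch of the idea, assuming a hypothetical `node.replay_lsn()` accessor and a snapshot LSN chosen by the coordinator:

```python
import time

def run_fragment_on_node(node, snapshot_lsn, execute_fragment):
    """A worker first waits until its node has replayed up to the snapshot LSN
    chosen by the coordinator, then executes with that single global snapshot,
    so every worker sees the same data version."""
    while node.replay_lsn() < snapshot_lsn:      # "waiting for replay"
        time.sleep(0.001)
    return execute_fragment(snapshot_lsn)        # unified version across all nodes
```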

[Figure 32]

TPCH Performance – Acceleration Ratio

[Figure 33]

We tested with 1 TB of TPC-H data. First, we compared PolarDB's new distributed parallel execution with standalone parallel execution: 3 SQL statements were accelerated by a factor of 60, and 19 SQL statements were accelerated by more than a factor of 10.

[Figure 34]

In addition, when testing the distributed execution engine with increasing CPU resources, performance improves linearly from 16 cores to 128 cores. Looking at the 22 SQL statements individually, the performance of each statement improves linearly as CPUs are added.

TPCH Performance – Comparison with Greenplum

Compared with traditional MPP Greenplum, with both using 16 nodes, the performance of PolarDB reaches 90% of Greenplum.

[Figure 35]

As mentioned earlier, PolarDB's distributed engine is elastic and data does not need to be redistributed. When the degree of parallelism (dop) is increased to 8, the performance is 5.6 times that of Greenplum.

Distributed Execution Acceleration Index Creation

A large number of indexes are built in OLTP businesses. Analysis shows that, during index creation, 80% of the time is spent sorting and building index pages and 20% is spent writing index pages. Distributed parallelism accelerates the sorting process and supports batched page writes, as sketched below.
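A hedged sketch of the accelerated index build: keys are sorted in parallel, merged into one ordered stream, and written as index pages in batches. This models the idea only; it is not PolarDB's B-tree build code, and `write_page_batch` is a hypothetical sink.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

def build_index(key_chunks, write_page_batch, page_capacity=256):
    """key_chunks: per-worker portions of the key stream to be indexed."""
    with ProcessPoolExecutor() as pool:
        sorted_runs = list(pool.map(sorted, key_chunks))   # ~80% of the work: parallel sort
    page = []
    for key in heapq.merge(*sorted_runs):                   # single globally sorted order
        page.append(key)
        if len(page) == page_capacity:                      # ~20%: build pages, write in batches
            write_page_batch(page)
            page = []
    if page:
        write_page_batch(page)
```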

[Figure 36]

The preceding optimizations can improve index creation by four to five times.

[Figure 37]

Distributed Parallel Execution Accelerates Multimode – Spatio-Temporal Database

PolarDB is a multi-mode database that supports spatio-temporal data. Spatio-temporal queries are compute-intensive and I/O-intensive and can be accelerated with distributed execution. We developed the ability to scan shared RTree indexes on shared storage.

[Figure 38]

  • Data Volume: 400 million, 500 GB
  • Specification: five read-only nodes (each node has a 16-core CPU and 128 GB of memory)
  • Performance:

    • Increases linearly with the number of CPU cores
    • With 80 CPU cores, the improvement is 71 times

[Figure 39]

Summary

This article analyzes the technical points of PolarDB at the architecture level:

  • Storage-Computing Separation Architecture
  • HTAP Architecture

Subsequent articles will discuss more technical details specifically, including how to use a shared-storage-based query optimizer, how LogIndex achieves high performance, how to flash back to any point in time, how to support MPP on shared-storage, and how to combine with X-Paxos to build high availability. Please stay tuned!
