Cloud Database Open Source Release: Features and Key Technologies of PolarDB HTAP

Introduction:

At the Alibaba Cloud Open Source PolarDB Enterprise Architecture Conference on March 2, Yan Hua, a core technology expert for Alibaba Cloud PolarDB, gave a speech titled "PolarDB HTAP Explained". On top of PolarDB's storage-compute separation architecture, the team developed a distributed MPP execution engine based on shared storage, solving the problem that a single SQL statement could neither use the computing resources of other nodes nor the large I/O bandwidth of the shared storage pool. Combined with elastic computing and elastic scaling, this gives PolarDB initial HTAP capability. This article introduces the features and key technologies of PolarDB HTAP.

Live review video: https://developer.aliyun.com/topic/PolarDB_release
PDF download: https://developer.aliyun.com/topic/download?id=8353
The following is organized based on the video content of the speech:

1. Background


Many PolarDB users need to run both TP and AP workloads on the same data. They use PolarDB to process highly concurrent TP requests during the day, and run AP report analysis at night when TP traffic drops and the machines are idle. Even so, the idle machine resources are still not fully utilized.
The reason is that the native PolarDB PG system faces two challenges when handling complex AP queries. First, under the native PG execution engine, a single SQL statement can only run on a single node, whether serially or with single-machine parallelism; it cannot use the CPU and memory of other nodes, so it can only scale up, not scale out. Second, the bottom layer of PolarDB is a shared storage pool whose I/O throughput is, in theory, almost unlimited, yet a single SQL statement running on one node is constrained by that node's CPU and memory and cannot exploit the large I/O bandwidth on the storage side.
To solve these pain points for users, PolarDB decided to support HTAP. There are currently three main HTAP solutions in the industry:

① TP and AP separated in both storage and computing. This fully isolates TP and AP so they do not affect each other, but it causes problems in practice. First, TP data must be imported into the AP system, which introduces latency and lowers data freshness; second, a redundant AP system must be added, which raises the total cost; third, running an extra AP system increases the difficulty of operation and maintenance.

② TP and AP shared in both storage and computing. This minimizes cost and maximizes resource utilization, but two problems remain. First, because computing is shared, AP queries and TP queries affect each other to some degree when they run at the same time; second, when the proportion of AP queries grows, the system must add computing nodes and redistribute the data, so it cannot scale out quickly and elastically.

③ TP and AP shared in storage but separated in computing. PolarDB uses a storage-compute separation architecture, so it naturally supports this solution.

2. Principle


Based on PolarDB's storage-compute separation architecture, we developed a distributed MPP execution engine that provides cross-machine parallel execution, elastic computing, and elastic scaling, giving PolarDB initial HTAP capability.

On the right of the figure above is the architecture diagram of PolarDB HTAP. The bottom layer is pooled shared storage: TP and AP share one copy of the data, which reduces cost, provides millisecond-level data freshness, and allows computing nodes to be added quickly. This is the first feature of PolarDB HTAP.

The upper layer consists of computing nodes with read-write separation. PolarDB uses two execution engines to process HTAP queries: the single-machine execution engine handles highly concurrent TP queries on the read-write node, while the distributed MPP execution engine handles complex AP queries on the read-only nodes. TP and AP queries are thus naturally physically isolated; their computing environments are decoupled, and mutual interference on CPU and memory is eliminated. This is the second feature of PolarDB HTAP: TP/AP physical isolation.

PolarDB HTAP also supports serverless elastic scaling, which removes the single-point limitation of the coordinator in a traditional MPP database: MPP can be initiated on any read-only node, and both the set of MPP nodes and the per-node degree of parallelism can be adjusted flexibly, supporting both scale-out and scale-up. The elastic adjustment takes effect immediately and requires no data redistribution; this is the third feature of PolarDB HTAP. Finally, by fully taking the affinity of the PG buffer pool into account, PolarDB HTAP can eliminate data skew and computation skew and schedule work so that more capable nodes do more; this is the fourth feature.

The core of the PolarDB HTAP principle is the distributed MPP execution engine. The figure shows a typical volcano-model plan: tables A and B are joined and then aggregated, which is also how the PG single-machine execution engine executes the query.
In a traditional MPP execution engine, data is scattered across different nodes, and the data on different nodes may have different distribution properties, such as hash distribution, random distribution, or replicated-table distribution. The traditional MPP execution engine inserts operators into the plan according to the data distribution characteristics of each table, so that the upper-layer operators are unaware of the data distribution.

PolarDB is a shared storage architecture, and the underlying shared data can be fully accessed by every computing node. If a traditional MPP execution engine were used directly, every worker would scan the full data set, producing duplicate results and gaining nothing from dividing the scan work, so it would not be an MPP engine in the true sense.
Therefore, in the PolarDB distributed MPP execution engine, we drew on the ideas of the volcano model paper, parallelized all scan operators, and introduced the PxScan operator to shield shared storage. The PxScan operator maps shared-storage data to shared-nothing data: through coordination among the workers, the target table is divided into multiple virtual partition data blocks, and each worker scans its own virtual partitions, achieving cross-machine distributed parallel scanning.
The data scanned by PxScan is then redistributed by the Shuffle operator; after redistribution, each worker executes the rest of the plan as if on a single machine, following the volcano model.
The above is the core of PolarDB's distributed MPP execution engine: the Shuffle operator shields data distribution, and the PxScan operator shields shared storage.
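
To make this concrete, below is a minimal Python sketch (illustrative only, not PolarDB code) of the two mechanisms just described: a PxScan-style operator that splits one shared table into disjoint virtual partitions so the table is scanned exactly once across all workers, and a hash shuffle that redistributes the scanned rows by join key so each worker can then run an ordinary local join. All names and the block-assignment scheme are assumptions made for the example.

```python
# Illustrative sketch (not PolarDB source): PxScan-style virtual partitioning of a
# shared table, followed by a hash shuffle so each worker can join locally.
from typing import List, Tuple

Row = Tuple[int, str]  # (join_key, payload)

def px_scan(table: List[Row], worker_id: int, num_workers: int,
            block_size: int = 4) -> List[Row]:
    """Each worker scans only the blocks assigned to it, so the shared table is
    read exactly once across all workers (shared storage -> shared nothing)."""
    rows: List[Row] = []
    for block_start in range(0, len(table), block_size):
        if (block_start // block_size) % num_workers == worker_id:
            rows.extend(table[block_start:block_start + block_size])
    return rows

def shuffle(rows: List[Row], num_workers: int) -> List[List[Row]]:
    """Redistribute rows by hash of the join key; bucket i is sent to worker i."""
    buckets: List[List[Row]] = [[] for _ in range(num_workers)]
    for row in rows:
        buckets[hash(row[0]) % num_workers].append(row)
    return buckets

if __name__ == "__main__":
    num_workers = 4
    table_a = [(i, f"a{i}") for i in range(16)]
    table_b = [(i % 8, f"b{i}") for i in range(16)]

    # Phase 1: every worker scans its own virtual partitions of A and B.
    scanned_a = [px_scan(table_a, w, num_workers) for w in range(num_workers)]
    scanned_b = [px_scan(table_b, w, num_workers) for w in range(num_workers)]

    # Phase 2: shuffle both sides by join key, then each worker joins locally.
    recv_a: List[List[Row]] = [[] for _ in range(num_workers)]
    recv_b: List[List[Row]] = [[] for _ in range(num_workers)]
    for w in range(num_workers):
        for dst, part in enumerate(shuffle(scanned_a[w], num_workers)):
            recv_a[dst].extend(part)
        for dst, part in enumerate(shuffle(scanned_b[w], num_workers)):
            recv_b[dst].extend(part)

    for w in range(num_workers):
        joined = [(ka, pa, pb) for ka, pa in recv_a[w] for kb, pb in recv_b[w] if ka == kb]
        print(f"worker {w}: {len(joined)} joined rows")
```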

A traditional MPP engine can only initiate MPP queries on a designated node, so on each node a table can only be scanned by a single worker. To support serverless elastic scaling in a cloud-native setting, we introduced a distributed transaction consistency guarantee.
First, an arbitrary node is chosen as the coordinator node; its ReadLSN is used as the agreed LSN, and the smallest snapshot version number among all MPP nodes is chosen as the globally agreed snapshot version number. Through LSN replay waiting and the global snapshot synchronization mechanism, it is guaranteed that when any node initiates an MPP query, both the data and the snapshot are in a consistent, usable state.
To achieve serverless elastic scaling, we also exploited the shared storage and moved the external dependencies required by every module on the coordinator node's full link onto shared storage. The parameters each worker node needs at run time are likewise synchronized from the coordinator node over the control link, so the entire link between the coordinator node and the worker nodes is stateless.
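
The consistency handshake described above can be summarized with a small sketch (illustrative only, not PolarDB code): an arbitrary node acts as the coordinator, its ReadLSN becomes the agreed LSN that every worker must replay up to, and the minimum snapshot version across all participating nodes becomes the global snapshot. The waiting loop below merely simulates WAL replay progress.

```python
# Illustrative sketch (not PolarDB source) of the MPP consistency guarantee:
# workers replay WAL up to the coordinator's ReadLSN, and the globally agreed
# snapshot is the minimum snapshot version across all participating nodes.
import time
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Node:
    name: str
    replayed_lsn: int       # how far this RO node has replayed the shared WAL
    snapshot_version: int   # this node's current snapshot version

    def wait_for_replay(self, target_lsn: int, poll_interval: float = 0.01) -> None:
        """Block until local replay reaches the agreed LSN (simulated here)."""
        while self.replayed_lsn < target_lsn:
            time.sleep(poll_interval)
            self.replayed_lsn += 1   # stand-in for real WAL replay progress

def start_mpp_query(coordinator: Node, workers: List[Node]) -> Tuple[int, int]:
    agreed_lsn = coordinator.replayed_lsn                               # coordinator's ReadLSN
    global_snapshot = min(n.snapshot_version for n in [coordinator] + workers)
    for w in workers:                                                   # LSN wait-for-replay
        w.wait_for_replay(agreed_lsn)
    return agreed_lsn, global_snapshot   # the query can now run on consistent data and snapshot

if __name__ == "__main__":
    ro1 = Node("RO1", replayed_lsn=105, snapshot_version=42)
    ro2 = Node("RO2", replayed_lsn=100, snapshot_version=40)
    ro3 = Node("RO3", replayed_lsn=103, snapshot_version=41)
    print(start_mpp_query(coordinator=ro1, workers=[ro2, ro3]))   # -> (105, 40)
```
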
Based on the above two designs, PolarDB's elastic scaling has the following advantages:
① Any node can become the coordinator node, which solves the coordinator single-point problem of traditional MPP databases.
② PolarDB can scale out horizontally (elastic expansion of computing power across nodes) or scale up vertically (elastic expansion of single-machine parallelism), and the adjustment takes effect immediately without data redistribution.
③ Businesses can adopt more flexible scheduling strategies, running different workloads on different node sets. As shown on the right side of the figure above, business domain SQL 1 can choose nodes RO1 and RO2 to execute AP queries, while business domain SQL 2 can choose nodes RO3 and RO4; the computing nodes used by the two business domains can be scheduled elastically.
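
As a trivial illustration of point ③ (the business domains, node names, and routing table below are hypothetical, made up only for the example), such scheduling amounts to mapping each business domain onto its own read-only node set:

```python
# Hypothetical illustration of routing business domains to read-only node sets;
# the domains, node names, and default below are made up for the example.
from typing import Dict, List

BUSINESS_NODE_SETS: Dict[str, List[str]] = {
    "reporting": ["RO1", "RO2"],   # business domain SQL 1
    "analytics": ["RO3", "RO4"],   # business domain SQL 2
}

def pick_mpp_nodes(business_domain: str) -> List[str]:
    # Any node in the chosen set could act as the coordinator (no single point).
    return BUSINESS_NODE_SETS.get(business_domain, ["RO1"])

print(pick_mpp_nodes("analytics"))   # ['RO3', 'RO4']
```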

Skew is an inherent problem of traditional MPP, caused by skew in both data distribution and computation. Data distribution skew usually comes from uneven data dispersion; in PG, storing large objects in TOAST tables also introduces some unavoidable data distribution imbalance. Computation skew is usually caused by jitter in concurrent transactions, buffers, network, and I/O across different nodes. Skew produces the bucket effect in traditional MPP execution: the slowest node determines the overall execution time.

To address this, PolarDB designed and implemented an adaptive scanning mechanism. As shown on the right side of the figure above, the coordinator node coordinates how the worker nodes execute the query. When scanning data, the coordinator node creates a task manager in memory and schedules the worker nodes according to the scanning tasks. The coordinator node runs two threads: the data thread serves the data link and collects and merges tuples, while the control thread serves the control link and controls the scanning progress of each scan operator.
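
A rough sketch of this idea follows (illustrative only, not PolarDB code; the pull-based dispatch loop is an assumption made to show the effect): the coordinator's task manager hands out scan units on demand, so a faster worker simply asks for, and receives, more units, which absorbs data and computation skew.

```python
# Illustrative sketch (not PolarDB source) of adaptive scanning: the coordinator's
# task manager hands out scan units on demand, so faster workers scan more blocks.
import queue
import threading

NUM_BLOCKS = 64    # scan units of the target table
NUM_WORKERS = 4

task_manager: "queue.Queue[int]" = queue.Queue()
for block_no in range(NUM_BLOCKS):
    task_manager.put(block_no)

results = []       # tuples collected by the coordinator's data thread
results_lock = threading.Lock()

def worker(worker_id: int, cost_per_block: int) -> None:
    """Pull the next scan unit from the coordinator until none remain."""
    scanned = 0
    while True:
        try:
            block_no = task_manager.get_nowait()   # control link: ask for the next unit
        except queue.Empty:
            break
        for _ in range(cost_per_block):            # simulate the scan; a slow node takes longer
            pass
        with results_lock:                         # data link: hand tuples back
            results.append((worker_id, block_no))
        scanned += 1
    print(f"worker {worker_id} scanned {scanned} blocks")

# Three fast workers and one slow one: the slow worker ends up scanning fewer blocks.
threads = [threading.Thread(target=worker, args=(i, cost))
           for i, cost in enumerate([1_000, 1_000, 1_000, 2_000_000])]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total blocks scanned: {len(results)}")     # always NUM_BLOCKS, with no overlap
```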

3. Features


After continuous iteration, PolarDB HTAP currently provides five main features for Parallel Query:

① Full support for basic operators, including the scan, join, and aggregation operator classes as well as SubqueryScan and HashJoin.

② Operators optimized for shared storage, including Shuffle operator sharing, ShareSeqScan sharing, and ShareIndexScan sharing. ShareSeqScan and ShareIndexScan sharing mean that when a large table joins a small table, the small table is handled with a mechanism similar to a replicated table, which reduces broadcast overhead and improves performance.

③ Partitioned table support, including full support for the Hash/Range/List partitioning modes, multi-level partitioning, static partition pruning, and dynamic partition pruning. The PolarDB distributed MPP execution engine also supports Partition Wise Join for partitioned tables.

④ Flexible parallelism control, including parallelism control at the global, table, session, and query levels.

⑤ Serverless elastic scaling, including initiating MPP from any node, using any combination of nodes within the MPP node range, automatic maintenance of cluster topology information, and support for the shared storage mode, the primary-standby mode, and the three-node mode.

The PolarDB distributed MPP execution engine can be used not only for queries and DML but also to accelerate index building. OLTP workloads contain a large number of indexes, and roughly 80% of index creation time is spent on sorting and building index pages, with the remaining 20% spent on writing the index pages. As shown in the upper-right figure, PolarDB uses the RO nodes to sort the data with distributed MPP acceleration, builds the index pages in a pipelined fashion, and uses batch-write technology to speed up writing the index pages.
Currently, PolarDB supports both normal creation and online creation of the commonly used B-tree index.
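
The flow can be sketched roughly as follows (illustrative only, not PolarDB code; the slicing scheme, page capacity, and batch size are assumptions for the example): workers sort disjoint slices of the key data in parallel, the sorted runs are merged, and the resulting index pages are written out in batches.

```python
# Illustrative sketch (not PolarDB source) of MPP-accelerated B-tree index building:
# parallel sort of key slices, merge of sorted runs, page filling, and batched writes.
import heapq
import random
from concurrent.futures import ProcessPoolExecutor
from typing import List

PAGE_CAPACITY = 128   # keys per leaf index page (made-up value)
WRITE_BATCH = 16      # pages flushed per batch write (made-up value)

def sort_slice(keys: List[int]) -> List[int]:
    """One worker sorts its slice of the key data."""
    return sorted(keys)

def build_index(keys: List[int], num_workers: int = 4) -> int:
    # Phase 1: distributed sort -- each worker sorts a disjoint slice of the keys.
    slices = [keys[i::num_workers] for i in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        sorted_runs = list(pool.map(sort_slice, slices))

    # Phase 2: merge the sorted runs and fill leaf index pages.
    pages, current = [], []
    for key in heapq.merge(*sorted_runs):
        current.append(key)
        if len(current) == PAGE_CAPACITY:
            pages.append(current)
            current = []
    if current:
        pages.append(current)

    # Phase 3: write the pages back in batches instead of one page at a time.
    written = 0
    for i in range(0, len(pages), WRITE_BATCH):
        written += len(pages[i:i + WRITE_BATCH])   # stand-in for the actual page writes
    return written

if __name__ == "__main__":
    data = [random.randint(0, 10**9) for _ in range(100_000)]
    print(f"wrote {build_index(data)} index pages")
```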

The figure above compares the PolarDB distributed MPP engine with PolarDB's native single-machine parallelism. We built a 1 TB TPC-H environment on 16 PolarDB RO instances, each with 16 cores and 256 GB of memory, for the comparison. Compared with single-machine parallelism, distributed MPP makes full use of the computing resources of all RO nodes and of the I/O bandwidth of the underlying shared storage, fundamentally solving the HTAP challenges described above. Among the 22 TPC-H queries, 3 were accelerated by more than 60 times, 19 by more than 10 times, and the average speedup was 23 times. We also tested how performance changes as computing resources are elastically scaled: as the total number of CPU cores grew from 16 to 128, overall TPC-H performance increased almost linearly, as did the performance of each individual query, which verifies the serverless elastic-scaling characteristic of PolarDB HTAP.

The tests also found that when the total number of CPU cores was increased to 256, the performance gain was no longer significant: the I/O bandwidth of the PolarDB shared storage was already saturated and had become the bottleneck. This also shows how strong the computing power of the PolarDB distributed MPP execution engine is.

In addition, we compared PolarDB's distributed MPP with a traditional MPP database, again using 16 nodes with 16 cores and 256 GB of memory each. On 1 TB of TPC-H data, with single-machine parallelism fixed at 1 to match the traditional MPP database, PolarDB reaches 90% of the traditional MPP database's performance. The main reason for the gap is that a traditional MPP database uses hash distribution by default: when the join keys of two tables are also their distribution keys, the join can be executed locally without a shuffle. The bottom layer of PolarDB is a shared storage pool, and the data scanned in parallel by PxScan is effectively randomly distributed, so it must be shuffled and redistributed before it can be processed the way a traditional MPP database does. Therefore, for TPC-H queries that join tables, PolarDB incurs one more network shuffle than the traditional MPP database.
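
The distinction can be stated as a simple rule of thumb (illustrative only; the function and column names below are made up for the example): a table avoids the shuffle only when it is hash-distributed on exactly the join key.

```python
# Illustrative rule of thumb (not PolarDB source): hash distribution on the join key
# allows a co-located join, while PxScan's effectively random distribution does not.
def needs_shuffle(distribution: str, distribution_key: str, join_key: str) -> bool:
    """Return True if the table must be redistributed before a join on join_key."""
    if distribution == "hash" and distribution_key == join_key:
        return False   # matching rows are already on the same node: join locally
    return True        # random or mismatched distribution: shuffle by join_key first

# Traditional MPP database: both tables hash-distributed on the join key.
print(needs_shuffle("hash", "customer_id", "customer_id"))   # False -> no shuffle
# PolarDB PxScan output: effectively random distribution.
print(needs_shuffle("random", "", "customer_id"))            # True  -> one extra shuffle
```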

Because PolarDB's distributed MPP can be scaled elastically without data redistribution, PolarDB can keep increasing single-machine parallelism on the same 16 machines and make full use of each machine's resources. When PolarDB's single-machine parallelism is 8, its performance is 5-6 times that of the traditional MPP database; as the single-machine parallelism grows linearly, PolarDB's total performance also grows linearly, and the change takes effect immediately simply by modifying configuration parameters.

We also ran performance tests on PolarDB HTAP's accelerated index building. With 500 GB of data, when the index covers 1, 2, or 4 fields, building it with distributed MPP parallelism is about 5 times faster than building it with single-machine parallelism; when the number of indexed fields increases to 8, the improvement is about 4 times.
