Huawei FusionInsight LibrA (FI-MPPDB) is an OLAP system. According to the book "Core Technologies of openGauss Database", LibrA is GaussDB 200; judging from the openGauss source code, the OLAP features of openGauss also come from this product.
This article covers four aspects of FI-MPPDB:
- system availability: online expansion and online upgrade;
- auto tuning: runtime feedback plus machine learning;
- SQL-on-HDFS solutions;
- modern computer systems: cost-based JIT compilation;
R&D started in 2012 and a prototype was available in 2014.
- V1: vectorization and the threading model;
- V2: column storage, compression, and intelligent query scheduling; SCTP replaces the original TCP;
In 2016, FI-MPPDB added the SQL-on-Hadoop feature. Similar to HAWQ, FI-MPPDB can directly read HDFS data and partitioned tables to avoid data movement. It is compatible with the SQL:2008 standard and supports full ACID (implemented using local heap tables).
In 2017, FI-MPPDB was offered on Huawei Cloud, where it is also called LibrA.
2.1 System Overview
Cross-partition transactions are completed with 2PC and the GTM (Global Transaction Manager). There is a fast-path optimization: a transaction that involves only a single partition skips the GTM round trip.
SCTP is used instead of TCP:
- a message-based, multi-homed connection protocol;
- 65535 streams share one SCTP connection;
- supports out-of-band control messages;
2.1.2 High Availability and Replication
For high availability, each DN has a corresponding secondary node. In addition, a log-only replica is added:
- when the secondary node fails, the primary node can continue to serve because the log-only replica still guarantees the required number of log copies;
- after the secondary node recovers, it does not need to catch up from the primary node;
2.1.3 Workload Management
- avoids avalanches caused by resource competition among queries;
- scheduling uses resource pools, workload groups, and a controller: every query is associated with a resource pool; the workload group manages the workload and admits new queries into the resource pool; if a query's estimated execution cost exceeds the system's remaining capacity, the query is queued;
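The queueing behavior above can be sketched as a small admission controller. This is a minimal illustration with assumed names and structure, not the actual FI-MPPDB implementation:

```python
from collections import deque

class ResourcePool:
    """Admit a query if its estimated cost fits the remaining capacity,
    otherwise queue it until running queries finish (illustrative sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.queue = deque()

    def submit(self, query_id, cost):
        if cost <= self.capacity - self.used:
            self.used += cost
            return "running"
        self.queue.append((query_id, cost))
        return "queued"

    def finish(self, cost):
        # a query completed; admit queued queries that now fit
        self.used -= cost
        while self.queue and self.queue[0][1] <= self.capacity - self.used:
            _, c = self.queue.popleft()
            self.used += c
```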
2.2 Online Expansion
2.2.1 Solution Overview
Objective: during data redistribution, the database can still execute DML and queries.
- the original table enters append-only mode;
- during redistribution, new rows go to an append-delta and deleted rows are recorded in a delete-delta;
- visible data is constructed from these three parts (original table, append-delta, delete-delta);
2.2.2 Core Algorithm
A hidden column original_tuple_key is added to the shadow table so that tuples deleted during migration can be removed from it.
- the original table T is divided into segments by ctid; new data is appended to the tail of T;
- segments of T are redistributed one by one into the shadow table S; tuples deleted during the process are recorded in the delete-delta D; after a segment finishes migrating, the tuples recorded in D are deleted and D is emptied. Note: data in an earlier segment may be deleted while a later segment is being copied;
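The segment-by-segment flow above can be sketched roughly as follows. This is a simplified model (dict-based tables, a set for D keyed by original_tuple_key), not the real storage code:

```python
def redistribute(T_segments, S, delete_delta):
    """Copy segments of T into shadow table S one by one, applying the
    delete-delta D after each segment. T_segments is a list of segments,
    each a dict {original_tuple_key: row}; delete_delta is a set of keys
    deleted concurrently during migration (illustrative sketch)."""
    migrated_keys = set()
    for segment in T_segments:
        # copy every tuple of this segment into the shadow table
        for key, row in segment.items():
            S[key] = row
        migrated_keys.update(segment.keys())
        # apply deletes recorded while copying; they may also hit tuples
        # migrated in earlier segments
        for key in list(delete_delta):
            if key in migrated_keys:
                S.pop(key, None)
                delete_delta.discard(key)
```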
2.3 Auto Tuning in Query Optimizer
The FI-MPPDB optimizer can:
- generate MPP plans based on cost;
- account for the different costs of vectorized execution and of different underlying storage formats (e.g. ORC);
- adapt to more optimization scenarios through runtime feedback and machine learning (e.g. estimating predicate selectivity);
Machine learning based on runtime feedback rests on a reasonable assumption: OLAP workloads issue relatively fixed SQL.
The row counts estimated by the optimizer and the actual row counts reported during execution are collected at the operator level: scan, join, and agg.
How the PlanStore is used:
- selectivity matching: corrects the statistics estimates of scan and join operators;
- selectivity similarity: estimates similar predicates from a predicate cache, using the KNN (k-nearest-neighbor) algorithm to interpolate values;
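The KNN-based similarity estimate can be sketched like this: previously observed (predicate parameter, actual selectivity) pairs live in the predicate cache, and a new parameter's selectivity is estimated from its k nearest cached neighbors. The function name and cache shape are assumptions for illustration:

```python
def knn_selectivity(cache, param, k=3):
    """Estimate selectivity for `param` as the average observed
    selectivity of its k nearest neighbors in the predicate cache.
    cache: list of (param_value, observed_selectivity) pairs."""
    if not cache:
        return None  # fall back to the optimizer's default fixed ratio
    neighbors = sorted(cache, key=lambda e: abs(e[0] - param))[:k]
    return sum(sel for _, sel in neighbors) / len(neighbors)
```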
2.4 MPPDB over HDFS (aka SQLonHDFS)
2.4.1 SQLonHDFS using Foreign Data Wrapper
- the Gauss MPPDB coordinator receives SQL statements;
- the planner generates a distributed plan; during planning, table distribution information is read from the HDFS NameNode;
- the plan is split into fragments and mapped to tasks;
- data nodes read data from HDFS, locally where possible and remotely otherwise;
Optimizations on HDFS:
- a directory in HDFS corresponds to one FDW table;
- tables can be partitioned, with each subpartition in its own HDFS directory, which enables partition pruning;
- for ORC and similar formats, predicates can be pushed down;
- vectorized execution is supported;
- runtime filters optimize star joins;
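Directory-per-subpartition layout makes partition pruning a matter of skipping whole directories. A minimal sketch (directory naming and function are assumed for illustration):

```python
def prune_partitions(partition_dirs, predicate):
    """Keep only the HDFS directories whose partition-key value satisfies
    the query predicate; all other directories are never scanned.
    partition_dirs: {key_value: hdfs_path}."""
    return [path for key, path in sorted(partition_dirs.items())
            if predicate(key)]
```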
2.4.2 Advanced Data Collocation and Hash Partitioning
(Similar to HAWQ)
- FI-MPPDB data nodes read HDFS data directly through short-circuit reads on the HDFS data nodes;
- joins between co-located hash-partitioned tables further reduce network traffic;
Because HDFS stores data in three replicas, the local table is also stored in three replicas.
2.4.3 DML support
The block map table supports full ACID.
- write: to improve write performance, small amounts of data are first written to a local delta table, then converted to PAX format and written to HDFS once enough has accumulated;
- delete: if the deleted data is in the local delta table, it is deleted immediately; if it is in a PAX file, the corresponding entry in the block map table's delete map is marked;
- compaction runs periodically;
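The write/delete paths above can be sketched as a toy in-memory model. The class, field names, and flush threshold are assumptions for illustration, not FI-MPPDB's actual structures:

```python
FLUSH_THRESHOLD = 4  # assumed batch size for illustration

class BlockMapTable:
    def __init__(self):
        self.delta = {}          # row_id -> row (local delta table)
        self.pax_files = []      # flushed batches, immutable once written
        self.delete_map = set()  # row_ids marked deleted in PAX files

    def insert(self, row_id, row):
        self.delta[row_id] = row
        if len(self.delta) >= FLUSH_THRESHOLD:
            # enough accumulated: convert the delta batch to a "PAX file"
            self.pax_files.append(dict(self.delta))
            self.delta.clear()

    def delete(self, row_id):
        if row_id in self.delta:
            del self.delta[row_id]       # in delta table: delete immediately
        else:
            self.delete_map.add(row_id)  # in PAX: mark the delete map

    def visible_rows(self):
        # visible data = delta rows + PAX rows not marked deleted
        rows = dict(self.delta)
        for pax in self.pax_files:
            for rid, row in pax.items():
                if rid not in self.delete_map:
                    rows[rid] = row
        return rows
```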
2.5 Intelligent JIT Compiled Execution
Three execution modes:
- normal interpreted execution, no JIT;
- JIT with a low optimization level;
- JIT with O3-level optimization;
JIT compilation of the IR is itself time-consuming, so the cost of JIT must be estimated accurately for the optimizer to decide whether to JIT-compile a given function or code snippet. PostgreSQL still decides statically whether to JIT based on a cost threshold; FI-MPPDB's treatment is an innovation.
P = (T1 − T2) × N − Tjit_cost
- T1: per-tuple cost without JIT;
- T2: per-tuple cost after JIT optimization;
- N: the size of the dataset;
- Tjit_cost: the cost of JIT compilation itself, proportional to the size of the code to be generated;
If P is positive, JIT still yields a net benefit.
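The profit formula turns mode selection into a simple comparison. A worked sketch (the numbers and function names are illustrative, not from the paper):

```python
def jit_profit(t1, t2, n, tjit_cost):
    """P = (T1 - T2) * N - Tjit_cost: positive means JIT pays off."""
    return (t1 - t2) * n - tjit_cost

def choose_mode(t1, candidates, n):
    """Pick the mode with the highest positive profit, else interpret.
    candidates: list of (mode_name, t2, tjit_cost)."""
    best = ("interpret", 0.0)
    for name, t2, tjit_cost in candidates:
        p = jit_profit(t1, t2, n, tjit_cost)
        if p > best[1]:
            best = (name, p)
    return best[0]
```

On a small dataset the one-off compilation cost dominates and interpretation wins; on a large dataset the higher compile cost of O3 is amortized over more tuples.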
3.1 Online Expansion
Scale-out during a TPC-DS run:
- online expansion is generally slower than offline expansion, but queries and DML remain available during it;
- hash distribution is 20% faster than random distribution;
3.2 Auto Tuning in Query Optimizer
select o_orderpriority, count(*) as ct from lineitem, orders where l_orderkey = o_orderkey and l_receiptdate <op> l_commitdate + date ':?';
Most optimizers use a fixed ratio when estimating predicate selectivity, for example 1/3.
The following figure shows the actual selectivity values:
3.2.1 Hash join plan problem
For hash join, choosing the correct side to build the hash table on is key for the optimizer.
Machine learning improves performance because different predicates produce very different amounts of scanned data, which in turn affects the join order; a large table may yield a small result set after its predicate is applied.
3.2.2 Unnecessary data shuffle
HDFS data is distributed round-robin, so it is not co-located; data must be shuffled by either:
- broadcasting one of the tables; or
- redistributing both tables by hash;
The estimate of predicate selectivity affects the shuffle strategy.
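The dependence on selectivity can be sketched with a toy cost comparison: broadcasting ships the smaller side to every node, while hash redistribution moves each row of both sides once. The cost model here is deliberately simplified and the row counts come from the (possibly mis-estimated) selectivity:

```python
def choose_shuffle(rows_a, rows_b, n_nodes):
    """Pick the cheaper shuffle strategy under a naive cost model:
    broadcast cost = smaller side * number of nodes;
    redistribute cost = every row of both sides moves once."""
    broadcast_cost = min(rows_a, rows_b) * n_nodes
    redistribute_cost = rows_a + rows_b
    return "broadcast" if broadcast_cost < redistribute_cost else "redistribute"
```

If the optimizer underestimates a predicate's selectivity, `rows_b` may look tiny and trigger a broadcast that is actually far more expensive than redistribution.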
3.2.3 Insufficient hash buckets for hash aggregation
The estimate of predicate selectivity also affects the pre-allocation of buckets for hash aggregation. If the buckets are insufficient, they must be reallocated, which can double the performance gap.
3.2.4 Join Selectivity Experiment
The effect of learning table-scan selectivity was described earlier; the same method can also be used to estimate join predicate selectivity.
Estimating join selectivity is likewise important, because it affects the join order of multiple tables.
3.3 Cache Model-based Selectivity Estimation
- with a single parameter, the estimation error rate is below 4%;
- with two parameters, the error rate is below 6%;
- even when the predicate cache is small, the error rate stays low;
3.4 Advanced Data Collocation Performance Improvement for SQLonHDFS
Conclusion: once hash partitioning is supported on HDFS, shuffles are significantly reduced, and TPC-H performance improves by 30%.
3.5 Intelligent JIT Compiled Execution
- cost-based JIT can select the final optimization level;
- the overall TPC-H improvement is 29%;
Online Expansion
- Greenplum: queries are served during scale-out, but DML is blocked;
- Amazon Redshift: the old cluster becomes read-only;
Auto Tuning
FI-MPPDB's learning-based approach, in brief: each operator's estimates are corrected by real-time feedback at runtime, and a predicate cache with KNN provides similarity-based selectivity estimation.
SQL on HDFS
- HAWQ: reads and writes HDFS directly and integrates with YARN to improve performance;
- SQL Server PDW: supports managing and reading Hadoop clusters directly with SQL, and supports indexes to accelerate queries;
- Redshift Spectrum: supports a wide range of data formats, such as ORC, Parquet, and CSV;
Intelligent JIT Compiled Execution
- Amazon Redshift and MemSQL: SQL is compiled to C/C++;
- Cloudera Impala and HyPer: LLVM IR is used, without a refined cost model;
- Spark 2.0's Tungsten: converts a query into a single function through runtime bytecode optimization;
- FI-MPPDB: uses LLVM together with a refined cost model;