Odyssey Data Value Mining Tool | A Next-generation Computing Engine for AnalyticDB for PostgreSQL

This article outlines the features of the new Odyssey computing engine of AnalyticDB for PostgreSQL in terms of architecture design, core technology, and performance evaluation.

By Kuairui, Changbie, Zhishu, et al.

Purpose

With the advent of the digital economy era, more and more applications rely on data analysis to mine the value of data. Databases with online analytical processing (OLAP) capabilities provide the foundation for big data storage and online analysis and also provide a technical platform to mine the value of data online.

Through its research, Alibaba's OLAP team found that the execution engines of OLAP databases are generally developed based on existing online transaction processing (OLTP) execution engines, which leads to certain disadvantages, such as massive performance loss, legacy issues, and insufficient use of the latest optimization technologies and hardware.

With the rapid growth of data volumes and the increasing demand for data analysis, OLAP systems have to bear an exponential increase in computing volumes and do not perform well enough to meet the future needs for online data analysis. Upon analysis, Alibaba's OLAP team concluded that an OLAP execution engine must provide the following technical features to meet the needs of cloud migration and the digital economy:

Support for multiple hardware platforms: To meet the needs of enterprises migrating their businesses to the cloud, the OLAP execution engine must support the advanced RISC machine (ARM) platform in addition to the traditional x86 platform, and support acceleration through new hardware such as the graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).
Excellent cost performance: The OLAP execution engine must maximize hardware performance and select the optimal algorithms and operators to improve the hardware execution efficiency. It must also use special algorithms and hardware to improve the efficiency of intensive and complex computing.
High SQL compatibility: The OLAP execution engine must be highly compatible with existing SQL standards, optimizer standards, and storage standards to simplify migration and learning. Users should only have to upgrade the minor versions of software to obtain the latest performance optimization technologies, without having to rewrite SQL statements, learn new optimization technologies, or migrate data.

To meet the preceding requirements, Alibaba's OLAP AnalyticDB for PostgreSQL R&D team developed an innovative computing engine called Odyssey over more than 1 year. The Odyssey computing engine provides the following technical features:

Support for multiple software and hardware platforms: In addition to the traditional x86 platform, the Odyssey computing engine supports the ARM chip server and accelerates core algorithms through GPUs and FPGAs.
Robust performance: Unlike traditional database execution engines, the Odyssey computing engine improves execution efficiency through hardware. The Odyssey computing engine uses algorithm design to solve performance issues arising from volcano models, fragmented memory allocation, and redundancy logic and allocates CPU resources to core computing tasks. The Odyssey computing engine uses Low-Level Virtual Machine (LLVM) for dynamic code generation (CodeGen) to improve expression computing performance, simplify the computational logic, and streamline logic computing. In addition, Odyssey uses the supercomputer optimization technology to model operator performance, analyze hot spots in code, and strictly evaluate the execution performance of each line of code. It further improves performance by using new hardware features, such as single instruction, multiple data (SIMD) technology for CPUs, ultra-high bandwidth technology for GPUs, and deep pipelines for FPGAs.
High compatibility: The Odyssey computing engine is endogenous to AnalyticDB for PostgreSQL and compatible with the SQL standard, optimizers, and storage layer of PostgreSQL. Users can directly migrate data from a native engine to the Odyssey computing engine without having to perform any operations.

This article describes the features of the innovative Odyssey computing engine in terms of architecture design, core technology, and performance evaluation.

Introduction to the Odyssey Computing Engine

Basic Architecture

The following figure shows the system framework of AnalyticDB for PostgreSQL and its relationship with the Odyssey computing engine.

AnalyticDB for PostgreSQL is a highly available and scalable distributed analytic database system based on the massively parallel processing (MPP) architecture. An AnalyticDB for PostgreSQL cluster consists of a primary node and several segment nodes, which are interconnected through a network. The primary node receives user requests, parses and optimizes SQL statements, and sends computing tasks. The segment nodes store data and execute computing tasks. Multiple data shuffles may occur between the segment nodes. The performance of AnalyticDB for PostgreSQL is closely related to the network interconnection speed, storage performance, and execution performance of the segment nodes. The Odyssey computing engine improves the execution performance of the segment nodes.

The Odyssey computing engine is intended to add a more efficient computing engine to AnalyticDB for PostgreSQL without modifying the existing modules. After receiving an SQL request from a user, the primary node parses and optimizes the SQL statement, generates an execution plan, and sends the plan to a segment node. The Odyssey computing engine does not affect this process, so no SQL modification or optimization is needed.

After receiving the execution plan from the primary node, the segment node calls the Odyssey computing engine if it is enabled. In this case, the native execution engine is not called. At the storage layer, the Odyssey computing engine is fully compatible with the heap table, AOCS table, and AORO table of AnalyticDB for PostgreSQL and reuses the native storage structure and access interfaces. The Odyssey computing engine focuses on the execution layer. In theory, the Odyssey computing engine is orthogonal to and can seamlessly integrate different optimizations of AnalyticDB for PostgreSQL, such as optimizer optimizations and storage optimizations.

The preceding figure shows the execution process of the Odyssey computing engine. The Odyssey computing engine is connected to AnalyticDB for PostgreSQL through hooks and is transparent to users. When executing an SQL statement, AnalyticDB for PostgreSQL selects different execution engines by selecting different hook functions. Select the native execution engine or set the GUC parameter to select the Odyssey computing engine. The next version will use a cost model and rules to automatically select a computing engine, allowing us to make better use of their respective advantages.

To ensure system stability and functional integrity, the Odyssey computing engine automatically rolls back execution to the native engine when functions are not supported or the execution fails. Execution engines are stateless, and thus allow to switch between engines at any granularity, such as database, session, and SQL statement.

Design Ideas

The preceding architecture allows the Odyssey computing engine to fully reuse the storage, optimizers, and transaction control function of AnalyticDB for PostgreSQL. The Odyssey computing engine is specially designed to improve computing performance and work with AnalyticDB for PostgreSQL to provide more powerful and richer functions. We developed the Odyssey computing engine by analyzing the native execution engine of AnalyticDB for PostgreSQL and comparing different execution engines. The Odyssey computing engine has redesigned plan nodes, CodeGen, and executor. As shown in the following figure, each plan node corresponds to a CodeGen unit and an executor. The CodeGen unit generates code (IR) based on the plan node. The executor selects different backends based on specific hardware platforms, generates an execution file from IR, and applies for resources for execution. The execution results are returned to the client by the primary node.

Optimized execution model: The Odyssey computing engine replaces the traditional volcano model with the batch volcano model. Batch execution reduces the overhead of function calls and supports vectorized computing to improve data processing efficiency. This makes development more flexible and allows for more optimizations.
Instant compilation: The Odyssey computing engine uses Just-in-Time (JIT) technology and LLVM to dynamically generate code. JIT technology is applied to core operations to reduce the performance overhead due to the excessive abstraction of advanced languages. Expression computing and complex logical operations are optimized at the assembly level, which minimizes the instruction size.
Optimized memory management: The native execution engine of AnalyticDB for PostgreSQL has performance issues such as fragmented memory allocation and fragmented memory copy. The Odyssey computing engine is redesigned at the algorithm layer. It implements modular memory allocation and fully reuses the allocated memory, significantly reducing the number of memory allocations required.
Optimizations based on supercomputer software: The Odyssey computing engine was developed by using the many optimization technologies involved in the development of supercomputer software. Supercomputer software is optimized by various means. For example, a theoretical performance model is created, performance profiling is performed, and the code performance is strictly reviewed. This effectively improves the software performance. The Odyssey computing engine uses relevant technologies to optimize every operator and every piece of code.
Multi-platform compatibility: In addition to the traditional x86 platform, the Odyssey computing engine is compatible with other x86 platforms and the ARM platform. It supports different versions of the Linux operating system to provide more options and optimization opportunities.
Optimization through hardware acceleration: The Odyssey computing engine leverages the high throughput of GPUs and the high-density computing capability of FPGAs. It uses GPUs to accelerate operators with a large amount of computation and high throughput requirements and uses FPGAs to accelerate operations with fixed computational logic and complex logic.

Core Technologies

Optimized Execution Model

The preceding figure shows the execution plan of TPCH-Q3. In AnalyticDB for PostgreSQL, an SQL statement is converted to a tree-like execution plan. In the traditional volcano model used by databases, the parent node drives child nodes to execute an SQL statement. The child nodes return data by row to the parent node, which then continues to execute the SQL statement. The volcano model has three performance issues:

1) Excessive function calls: At least one function call is required to transmit each row of data, resulting in high CPU overhead.
2) Difficult optimization: Some logical operations involve multiple functions and cannot be optimized.
3) Low CPU utilization: After the CPU completes one operation of an operator, it immediately returns to the upper node to perform another operation. This process causes many logical jumps and makes it difficult to improve the CPU execution efficiency.

The Odyssey computing engine rewrites the volcano model and implements batch execution with improved performance based on the original tree-like architecture.

In the volcano model, data is transmitted between nodes by row. After each node generates a row of data, this data is pulled by the upper node layer by layer for subsequent computation. The Odyssey computing engine transmits data in batches. A batch contains multiple rows of data (usually thousands of rows) and data is stored in the column-first format. After the rows of data generated by a node fill up a batch, the data is returned to the upper node. This solves the performance issues of the volcano model. Batch execution reduces the number of function calls by three orders of magnitude, which eliminates the performance impact of excessive function calls in the volcano model. In addition, batch execution reduces cross-function operations, making it easy to optimize algorithms and select the optimal algorithm for each operator. Batch execution also reduces logical jumps and makes effective use of modern CPU resources for execution.

Memory Management

The Odyssey computing engine reuses the memory management module of AnalyticDB for PostgreSQL. With improved performance, the native memory management module of AnalyticDB for PostgreSQL automatically tracks memory allocation for automatic memory release. This improves performance and avoids memory leaks. This module has been tested by extensive cloud practices and applied to projects. The Odyssey computing engine reuses this module and has optimized its memory usage to reduce unnecessary and fragmented memory allocation.

The Odyssey computing engine reuses allocated memory as much as possible to avoid repeated memory allocation. In the volcano model, after a node generates a row of data, new memory space may be allocated to the data. Therefore, memory allocation may be required for each row of data. The Odyssey computing engine replaces the volcano model with batch data forwarding. After a row of data in each batch is consumed by the upper node, the Odyssey computing engine reuses the memory space allocated to this batch. This avoids repeated memory allocation.

// Native engine: fragmented allocation and memory allocation by row
for (...) {
    void* hash_entry = palloc(32);
    
    ... // copy data
}
// Odyssey引擎：整块内存分配。
void* hash_table = palloc(32*1000,000);
for () {
    ... // copy data
}

The Odyssey computing engine allocates large memory blocks in a more efficient way, as shown in the preceding pseudocode. For example, when a hash table is created through hash join, one row of data is inserted at a time. The native execution engine of AnalyticDB for PostgreSQL allocates memory to this row before data is inserted. Therefore, memory is allocated each time data is inserted into a row, which results in many memory fragments. If a hash table has 1 million rows and each row is 32 bytes in length, then 1 million 32-byte memory blocks are allocated. Fragmented memory allocation has a significant impact on the performance of modern systems. The allocation process as well as subsequent management, tracking, and memory release are time-consuming.

Through diagnosis and cause analysis, the Odyssey computing engine identifies the core algorithm problems that cause fragmented memory allocation in the native execution engine of AnalyticDB for PostgreSQL. Through algorithm design, the Odyssey computing engine avoids the problem of fragmented memory allocation, which improves the performance of memory management without reducing memory utilization efficiency.

CodeGen

The Odyssey computing engine uses LLVM-based CodeGen, a code generation technology, to reduce the number of virtual function calls and improve the performance of complex logical expressions and logical judgment. The Odyssey computing engine applies CodeGen to whole operators in a different way from how CodeGen is applied to expressions in PostgreSQL 11. Currently, CodeGen is used to optimize operators and logical expressions and implement code specialization.

Operator optimization: Use CodeGen to create a function based on an operator to reduce virtual function calls and redundant code. Also, create a function based on multiple operators to reduce virtual function calls and lower the memory usage and copy frequency. For example, create a function based on Join and Group By to perform the join and aggregate (AGG) operations. Only one hash table needs to be created when the hash key and group key are the same.
Logical expression optimization: The following sample code illustrates how this optimization is implemented. AnalyticDB for PostgreSQL calls three functions to implement the filter conditions a > 10 and b < 5. The first function Int32GT(a, 10) implements a > 10, the second function Int32LT(b, 5) implements b < 5, and the third function implements the AND operation. The need to make three function calls has a significant impact on performance. By contrast, the Odyssey computing engine uses LLVM to generate LLVM IR and compiles it into underlying machine code so that the preceding filter conditions are implemented through three instructions. This avoids many redundant operations and generates the least underlying code based on different expressions and logical judgments.

// 实例SQL
select count(*) from table where a > 10 and b < 5;
// ADBPG表达式方案：多次函数调用
result = Int8AndOp(Int32GT(a, 10), Int32LT(b, 5));
// Odyssey方案：生成最小化底层代码
%res1 = icmp ugt i32 %a, 10;
%res2 = icmp ult i32 %b, 5; 
%res = and i8 %res1, %res2;

Code specialization: Use LLVM to implement code specialization for eliminating known logical judgment. For example, the types of data structures are determined after SQL statements are parsed before they are executed by the engine. This includes the output data type of each node and the data length. If SQL statements are executed by using an advanced language, such as the C language, it is necessary to determine the type and length of each element because the coder does not know the data type. LLVM determines the data type when code is generated, so the code is directly executed without determining the data type. Though only a small amount of time is saved each time, it significantly improves performance in the long term considering that the execution engine needs to determine the data type and length frequently.

Vectorized Execution

After the data pulling mode of the volcano model is changed from row-based pulling to batch pulling, it's possible to further improve the processor performance through column store. For example, the batch processing model uses CPU vector instructions or many-core GPUs to improve performance.

// 伪代码样例：利用SIMD指令加速聚合  
for (int i = 0; i < round_row; i += 8) {
    __m256 m_tmp = _mm256_load_ps(&(aligned_input_data[i]));
    __m256 ymm2 = _mm256_permute2f128_ps(m_tmp , m_tmp , 1);
    m_tmp = _mm256_add_ps(m_tmp, ymm2);
    m_tmp = _mm256_hadd_ps(m_tmp, m_tmp);
    m_tmp = _mm256_hadd_ps(m_tmp, m_tmp);
    result += m_tmp[0];
  }

For example, the common aggregate operation in OLAP is performed by using a SIMD instruction, which improves computing performance by about 400%. The preceding figure shows the pseudocode. The GPU many-core computing capability can be applied to complex algorithms, such as the parallel filter, aggregation, and reduction algorithms, to improve computing performance by up to one order of magnitude.

Multi-platform Support

In the era of cloud computing, a cloud platform supports different hardware resources, including the traditional x86 platform, x86 and ARM platforms made in China, GPUs, and FPGAs. You simply need to buy resources of relevant specifications based on different scenarios. The Odyssey computing engine supports more hardware resources than the native execution engine of AnalyticDB for PostgreSQL. Currently, Odyssey effectively runs on the x86 and ARM platforms.

The Odyssey computing engine accelerates key operators through GPUs and FPGAs. A GPU, as a many-core architecture with sufficient computing and bandwidth resources, is suitable for data-intensive and high-throughput scenarios. An FPGA, as a custom processor, is suitable for operations with complex logic and effectively improves the resource utilization of hardware. Currently, the Odyssey computing engine uses GPUs to accelerate some high-density operators, which improves performance by more than 500%. The Odyssey computing engine also uses FPGAs to accelerate compression and decompression at the storage layer, which improves performance by more than 10x.

Performance Results

The standard TPCH test set is used to test performance. The official TPCH tool helps to generate 1 TB data and 22 SQL statements, with a scale factor of 1000 and a cluster size of 32 segments. The 22 SQL statements are sequentially and continuously executed in single concurrency mode. The time when each SQL statement is sent is recorded as the start time, and the time when results are returned is recorded as the end time. The execution duration is calculated by subtracting the end time from the start time.

Test Results for the x86 Platform

The x86 platform is tested in Alibaba Cloud, with the following instance types:

Number of cores per node: 4
Memory size per node: 32 GB
Disk size per node and disk type: 320 GB, SSD
Number of nodes per instance: 32

The test results are as follows:

As shown in the preceding figure, the Odyssey computing engine performs much better than the native execution engine of AnalyticDB for PostgreSQL. The former displays more than double the execution performance for the Q1, Q4, Q9, Q11, Q20, Q21, and Q22 statements, and improves that of the Q17 statement by more than 200%. The degree of performance improvement is related to the SQL statement type and is higher for computing-intensive SQL statements. For I/O-intensive SQL statements, we have adapted and optimized the I/O interface to improve the computing and I/O performance.

Test Results for the ARM Platform

Three servers are used to test the ARM platform. One server is used as the primary node and the other two servers are used as compute nodes. Sixteen segments are deployed on each compute node, amounting to 32 segments. Each server is configured as follows:

CPU: 128 cores
Memory: 378 GB
Disk: 3.84 TB × 4 AliFlash

As shown in the preceding figure, the Odyssey computing engine performs much better than the native execution engine of AnalyticDB for PostgreSQL. The performance improvement is more significant on the ARM platform than on the x86 platform.

Summary

This article introduces the technical details of the new Odyssey computing engine of AnalyticDB for PostgreSQL. It uses new technologies, including JIT and the volcano model for batch execution, and new hardware features such as the CPU AVX instructions, GPU many-core computing, and FPGA dedicated computing. According to actual TPCH testing with 1 TB data in Alibaba Cloud, the time required for the Odyssey computing engine to execute 22 SQL statements is 50% less than that of the native execution engine. The Odyssey computing engine improves performance by more than 200% for each SQL statement. The Odyssey computing engine doubles the performance of AnalyticDB for PostgreSQL on the ARM server platform. The performance improvement is more significant on the ARM platform than on the x86 platform.

The Odyssey computing engine will be adapted and optimized at the storage layer to provide even better performance. Please stay tuned for updates.

Community

Odyssey Data Value Mining Tool | A Next-generation Computing Engine for AnalyticDB for PostgreSQL

Purpose

Introduction to the Odyssey Computing Engine

Basic Architecture

Design Ideas

Core Technologies

Optimized Execution Model

Memory Management

CodeGen

Vectorized Execution

Multi-platform Support

Performance Results

Test Results for the x86 Platform

Test Results for the ARM Platform

Summary

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

Hologres

Time Series Database (TSDB)

Personalized Content Recommendation Solution

Big Data Consulting for Data Technology Solution