All Products
Document Center

Distributed execution and parallel query

Last Updated: Jun 18, 2021

This topic describes distributed execution and parallel query.

Distributed execution

For a distributed system based on the Shared-Nothing architecture, the data of a relational table is stored in partitions on various nodes of the system. So, for a cross-zone data query, the execution plan is required to handle data on multiple nodes. Therefore, OceanBase Database is provided with capabilities to generate and execute distributed execution plans.

For distributed execution plans, partitioning improves query performance. You do not need to partition a small relational table. However, for a large relational table, you need to carefully select the partition key based on requirements of the upper-layer application system. Make sure that most queries can use the partition key for partition pruning, which reduces the data access workload.

We recommend that you use the join key as the partition key for related tables and apply the same partitioning method to configure the same partitions on the same node by using TABLE GROUP, to reduce cross-node data interaction.

The optimizer of OceanBase Database automatically generates a distributed execution plan based on the physical distribution of queries and data.

Parallel query

A parallel query refers to the process of restructuring an execution plan to increase its CPU resources and I/O processing capacity, and thus reduce the system response time to the corresponding query. The parallel query technology applies to both distributed and local execution plans.

When a single query involves access to data stored on different nodes, you need to fetch such data to the same node for computing through data redistribution. The system takes each data redistribution node as the upper and lower boundaries to vertically divide each execution plan of OceanBase Database into multiple Data Flow Objects (DFOs). Each DFO is split into tasks of specified parallelism to improve the execution efficiency by running tasks in parallel.

Generally, higher parallelism leads to shorter response time to the query, but it requires more CPU, I/O, and memory resources to execute the query. For Decision Support Systems (DSS) or data warehouses that support querying massive amounts of data, the response time is shortened.

Generally, the execution plan of a parallel query is similar to a distributed execution plan in the execution logic. After an execution plan is decomposed, each part of it is executed by multiple execution threads. Concurrent execution of DFOs of an execution plan and tasks within a DFO is achieved through scheduling. Parallel queries are ideal for batch update, index creation, and index maintenance in online transaction processing (OLTP) scenarios.

Parallel queries can effectively improve the processing performance when the system meets the following conditions:

  • Sufficient I/O bandwidth

  • Low CPU utilization

  • Adequate memory resources

So, if the system does not have sufficient resources for additional parallel processing tasks, you cannot improve the execution performance by using parallel queries or increasing the parallelism of the tasks. On the contrary, when the application system is overloaded, the operating system has to take more scheduling tasks, such as context switching and page switching, which may degrade its performance.

Generally, for DSS systems of a large number of partitions and data warehouses, parallel execution can reduce the response time. For simple DML operations, intra-partition queries, or queries across a small number of partitions, parallel queries cannot significantly reduce the response time.

Note that the performance bottleneck of any part of a system will become a point that restricts the performance of the entire system. Therefore, to achieve the best performance through parallel queries, you need to properly configure every part of the system.