This topic describes the write-ahead log (WAL) parallel replay feature for PolarDB for PostgreSQL (Compatible with Oracle).
Applicability
This feature is available for PolarDB for PostgreSQL (Compatible with Oracle) clusters that use Oracle syntax compatibility 2.0 and have a minor engine version of 2.0.14.5.1.0 or later.
You can view the minor engine version in the PolarDB console or by running the SHOW polardb_version; statement. If your cluster's minor engine version does not meet the requirements, you must upgrade the minor engine version.
Background information
A PolarDB for PostgreSQL (Compatible with Oracle) cluster uses a one-writer, multiple-reader architecture. On a running read-only node (replica node), the LogIndex background worker process and a backend process use LogIndex data to replay WAL records in different buffers. This method essentially achieves the parallel replay of WAL records.
WAL log replay is critical for the high availability (HA) of PolarDB clusters. Therefore, applying the parallel replay method to the standard log replay path is an effective optimization.
Parallel WAL log replay offers advantages in at least three scenarios:
The crash recovery process for the primary database, read-only nodes, and secondary databases.
The continuous replay of WAL logs by the LogIndex background worker process on a read-only node.
The continuous replay of WAL logs by the Startup process on a secondary database.
Terms
Block: A data block.
WAL: Write-Ahead Logging.
Task Node: A subtask execution node in the parallel execution framework that can receive and execute one subtask.
Task Tag: A classification identifier for a subtask. Subtasks with the same tag must be executed in sequence.
Hold List: A linked list that each child process in the parallel execution framework uses to schedule and execute replay subtasks.
How it works
Overview
A single WAL log may modify multiple data blocks. The WAL log replay process can be defined as follows:
Assume that the i-th WAL log has an LSN of LSN<sub>i</sub> and modifies m data blocks. The list of data blocks modified by the i-th WAL log is represented as Block<sub>i</sub> = [Block<sub>i,0</sub>, Block<sub>i,1</sub>, ..., Block<sub>i,m</sub>].
The smallest replay subtask is defined as Task<sub>i,j</sub> = LSN<sub>i</sub> -> Block<sub>i,j</sub>. This subtask represents replaying the i-th WAL log on the data block Block<sub>i,j</sub>.
Therefore, a WAL log that modifies m blocks can be represented as a collection of m replay subtasks: TASK<sub>i,∗</sub> = [Task<sub>i,0</sub>, Task<sub>i,1</sub>, ..., Task<sub>i,m</sub>].
Furthermore, multiple WAL logs can be represented as a series of replay subtask collections: TASK<sub>∗,∗</sub> = [TASK<sub>0,∗</sub>, TASK<sub>1,∗</sub>, ..., TASK<sub>N,∗</sub>].
In the collection of replay subtasks TASK<sub>∗,∗</sub>, the execution of a subtask does not always depend on the result of the preceding subtask. Assume the collection of replay subtasks is TASK<sub>∗,∗</sub> = [TASK<sub>0,∗</sub>, TASK<sub>1,∗</sub>, TASK<sub>2,∗</sub>], where:
TASK<sub>0,∗</sub> = [Task<sub>0,0</sub>, Task<sub>0,1</sub>, Task<sub>0,2</sub>]
TASK<sub>1,∗</sub> = [Task<sub>1,0</sub>, Task<sub>1,1</sub>]
TASK<sub>2,∗</sub> = [Task<sub>2,0</sub>]
Also assume that Block<sub>0,0</sub> = Block<sub>1,0</sub>, Block<sub>0,1</sub> = Block<sub>1,1</sub>, and Block<sub>0,2</sub> = Block<sub>2,0</sub>.
Then, there are three sequences of subtasks that can be replayed in parallel with each other: [Task<sub>0,0</sub>, Task<sub>1,0</sub>], [Task<sub>0,1</sub>, Task<sub>1,1</sub>], and [Task<sub>0,2</sub>, Task<sub>2,0</sub>].
In summary, many subtask sequences within the collection of replay subtasks can be executed in parallel without affecting the consistency of the final result. PolarDB uses this concept in its parallel task execution framework, which is applied to the WAL log replay process.
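The grouping rule above can be sketched in a few lines of Python (an illustrative sketch only; names such as `group_subtasks` are not part of PolarDB): subtasks that touch the same block form one ordered sequence, and the per-block sequences can be replayed in parallel with each other.

```python
from collections import OrderedDict

def group_subtasks(wal_logs):
    """Group replay subtasks by the data block they touch.

    wal_logs: list of (lsn, [block, ...]) pairs, in WAL order.
    Returns an ordered mapping {block: [lsn, ...]}. Each per-block
    LSN list must be replayed in order, but lists for different
    blocks can be replayed in parallel.
    """
    groups = OrderedDict()
    for lsn, blocks in wal_logs:
        for block in blocks:
            groups.setdefault(block, []).append(lsn)
    return groups

# The example from the text: three WAL logs touching overlapping blocks.
logs = [
    (0, ["A", "B", "C"]),  # TASK0,* touches blocks A, B, C
    (1, ["A", "B"]),       # TASK1,* touches blocks A, B
    (2, ["C"]),            # TASK2,* touches block C
]
groups = group_subtasks(logs)
# → {"A": [0, 1], "B": [0, 1], "C": [0, 2]}: three parallel sequences
```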
Parallel task execution framework
A segment of shared memory is divided equally based on the number of concurrent processes. Each segment acts as a circular queue and is allocated to one process. You can configure the depth of each circular queue by setting a parameter.

Dispatcher process.
Controls concurrent scheduling by dispatching tasks to specified processes.
Removes completed tasks from the queue.
Process group.
Each process in the group retrieves a task from its corresponding circular queue and executes it based on the task's state.

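The per-process queues can be illustrated with a minimal Python sketch. In PolarDB the queues live in a shared-memory segment and the depth comes from a configuration parameter; here a plain list stands in for both, and `RingQueue` is a hypothetical name, not the PolarDB API.

```python
class RingQueue:
    """Fixed-depth circular queue, one per worker process (sketch)."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.head = 0   # next slot the worker consumes
        self.tail = 0   # next slot the dispatcher fills
        self.count = 0

    def full(self):
        return self.count == len(self.slots)

    def push(self, task):
        # Called by the dispatcher; a full queue forces it to pick
        # another process (see the scheduling policies below).
        if self.full():
            raise OverflowError("queue full")
        self.slots[self.tail] = task
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        # Called by the owning worker process.
        if self.count == 0:
            return None
        task = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return task

# One queue per process in the group, e.g. 4 workers with depth 8 each:
queues = [RingQueue(8) for _ in range(4)]
```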
Tasks
A circular queue consists of Task Nodes. Each Task Node has one of five states: Idle, Running, Hold, Finished, or Removed.
Idle: The Task Node is not assigned a task.
Running: The Task Node is assigned a task and is either waiting for execution or is being executed.
Hold: The task in the Task Node has a dependency on a preceding task and must wait for that task to finish.
Finished: A process in the process group has finished executing the task.
Removed: When the Dispatcher process finds that a task and all of its prerequisite tasks are in the Finished state, it changes the task's state to Removed. This state indicates that the Dispatcher process has deleted the task and its prerequisites from the management struct. This mechanism ensures that the Dispatcher process handles the results of dependent tasks in the correct order.
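The five states can be written down as a small state machine. The sketch below is assembled from the state descriptions above; the exact transition set (including reuse of a Removed node as Idle) is an assumption read from the prose, not the confirmed PolarDB state diagram.

```python
from enum import Enum

class TaskState(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    HOLD = "Hold"
    FINISHED = "Finished"
    REMOVED = "Removed"

# Transitions implied by the descriptions above; comments note which
# side (dispatcher vs. worker) is assumed to perform each one.
TRANSITIONS = {
    (TaskState.IDLE, TaskState.RUNNING),      # dispatcher assigns a task
    (TaskState.IDLE, TaskState.HOLD),         # dispatcher assigns a dependent task
    (TaskState.HOLD, TaskState.RUNNING),      # worker: prerequisite finished
    (TaskState.RUNNING, TaskState.FINISHED),  # worker completes the task
    (TaskState.RUNNING, TaskState.HOLD),      # worker: TaskHandler failed, retry later
    (TaskState.FINISHED, TaskState.REMOVED),  # dispatcher reclaims the node
    (TaskState.REMOVED, TaskState.IDLE),      # node returns to the idle pool
}

def transition(current, new):
    """Apply a state change, rejecting transitions not in the sketch."""
    if (current, new) not in TRANSITIONS:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```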
In the state machine transitions shown above, the transitions marked by black lines are completed in the Dispatcher process. The transitions marked by orange lines are completed in the parallel replay process group.
Dispatcher process
The Dispatcher process uses three key data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.
Task HashMap: Records the hash mapping between Task Tags and their corresponding task execution lists.
Each task has a specific Task Tag. If two tasks have a dependency, they share the same Task Tag.
When a task is dispatched, if the task has a prerequisite, its state is marked as Hold. The task must wait for its prerequisite to be executed.
Task Running Queue: Records the tasks that are currently being executed.
Task Idle Nodes: Records the Task Nodes that are currently in the Idle state for each process in the process group.
The Dispatcher uses the following scheduling policies:
If a task with the same Task Tag as the new task is already running, the new task is assigned by preference to the process that is handling the last task in that Task Tag's linked list. This policy aims to execute dependent tasks on the same process to reduce the overhead of inter-process synchronization.
If the preferred process's queue is full, or if no task with the same Task Tag is running, a process is selected in sequence from the process group. An Idle Task Node is then retrieved from that process's queue to schedule the task. This policy aims to distribute tasks as evenly as possible across all processes.

Process group
This parallel execution applies to tasks of the same type that share the same Task Node data structure. When the process group is initialized, you can configure SchedContext to specify the function pointers that execute specific tasks:
TaskStartup: Performs the required initialization action before a process executes tasks.
TaskHandler: Executes a specific task based on the incoming Task Node.
TaskCleanup: Performs the required cleanup action before a process exits.
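In Python, the callback table can be mimicked with callables standing in for C function pointers. The field names mirror the three hooks above; the signatures are illustrative assumptions, not the real prototypes.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SchedContext:
    """Sketch of the per-process-group callback table."""
    task_startup: Callable[[], Any]      # TaskStartup: run once before tasks
    task_handler: Callable[[Any], Any]   # TaskHandler: handle one Task Node
    task_cleanup: Callable[[], Any]      # TaskCleanup: run once before exit

def run_worker(ctx, task_nodes):
    """Life cycle of one process in the group, per the hooks above."""
    ctx.task_startup()
    try:
        for node in task_nodes:
            ctx.task_handler(node)
    finally:
        ctx.task_cleanup()

handled = []
ctx = SchedContext(
    task_startup=lambda: handled.append("startup"),
    task_handler=lambda node: handled.append(node),
    task_cleanup=lambda: handled.append("cleanup"),
)
run_worker(ctx, ["t1", "t2"])
# handled == ["startup", "t1", "t2", "cleanup"]
```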

A process in the process group retrieves a Task Node from the circular queue. If the state of the Task Node is Hold, the process inserts the Task Node at the tail of the Hold List. If the state of the Task Node is Running, the process calls TaskHandler to execute the task. If TaskHandler fails, the system sets a retry count for the Task Node (default: 3) and inserts the Task Node at the head of the Hold List.
The process searches the Hold List from the beginning for an executable task. If a task's state is Running and its wait count is 0, the process executes the task. If the task's state is Running and its wait count is greater than 0, the process decrements the wait count by 1.
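One pass over the Hold List might look like the following sketch. The dict-based node is a stand-in for the real Task Node struct, and the real code may stop at the first executable task rather than sweeping the whole list in one pass.

```python
def drain_hold_list(hold_list, execute):
    """One sweep over the Hold List, following the rules above.

    hold_list: list of dicts with "state" and "wait_count" keys.
    Running nodes with wait_count == 0 are executed and dropped;
    Running nodes with wait_count > 0 get their count decremented;
    all other nodes stay queued. Returns the remaining list.
    """
    remaining = []
    for node in hold_list:
        if node["state"] == "Running" and node["wait_count"] == 0:
            execute(node)            # executable: replay it now
        elif node["state"] == "Running":
            node["wait_count"] -= 1  # still waiting on predecessors
            remaining.append(node)
        else:
            remaining.append(node)   # Hold nodes stay queued
    return remaining

executed = []
hold = [
    {"id": 1, "state": "Running", "wait_count": 0},
    {"id": 2, "state": "Running", "wait_count": 2},
    {"id": 3, "state": "Hold", "wait_count": 0},
]
remaining = drain_hold_list(hold, lambda n: executed.append(n["id"]))
# executed == [1]; node 2's wait count drops to 1; node 3 stays queued
```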
WAL parallel replay
LogIndex data records the mapping between Write-Ahead Logging (WAL) logs and the data blocks they modify, and also supports retrieval by LSN. During the continuous replay of WAL logs on a standby node, PolarDB uses a parallel task execution framework. This framework uses LogIndex data to parallelize WAL log replay tasks, which accelerates data synchronization on the standby node.
Workflow
Startup process: Parses WAL logs and builds LogIndex data without replaying the WAL logs.
LogIndex BGW replay process: Acts as the Dispatcher process in the parallel task execution framework. This process uses LSNs to retrieve LogIndex data, builds log replay subtasks, and assigns them to the parallel replay process group.
Processes in the parallel replay process group: Execute log replay subtasks and replay a single log on a data block.
Backend process: When reading a data block, the process uses the PageTag to retrieve LogIndex data. This action obtains the linked list of LSN logs that modified the block. The process then replays the entire log chain on the data block.

The Dispatcher process uses LSNs to retrieve LogIndex data. It enumerates PageTags and their corresponding LSNs in LogIndex insertion order to build {LSN -> PageTag} mappings, which serve as Task Nodes. The PageTag is used as the Task Tag for the Task Node.
The enumerated Task Nodes are dispatched to the child processes in the parallel execution framework's process group for replay.
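The construction of Task Nodes from LogIndex data can be sketched as follows. The `(lsn, page_tag)` pair list is a stand-in for the real LogIndex retrieval API, and the dict-based node is illustrative only.

```python
def build_task_nodes(logindex_entries):
    """Turn LogIndex data into replay Task Nodes, per the steps above.

    logindex_entries: iterable of (lsn, page_tag) pairs in LogIndex
    insertion order. Each node carries an {LSN -> PageTag} mapping as
    its payload and uses the PageTag as its Task Tag, so all replays
    of the same page are serialized on one worker process.
    """
    nodes = []
    for lsn, page_tag in logindex_entries:
        nodes.append({"task_tag": page_tag, "lsn": lsn, "page_tag": page_tag})
    return nodes

# Two WAL records touch page ("rel1", 0); their nodes share a Task Tag,
# so LSN 100 and LSN 101 replay in order on the same process.
entries = [(100, ("rel1", 0)), (100, ("rel1", 1)), (101, ("rel1", 0))]
nodes = build_task_nodes(entries)
```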

Usage guide
To enable the WAL parallel replay feature, add the following parameter to the postgresql.conf file on the standby node.
polar_enable_parallel_replay_standby_mode = ON