PolarDB: WAL parallel replay

Last Updated: Jan 14, 2026

This topic describes the write-ahead log (WAL) parallel replay feature for PolarDB for PostgreSQL (Compatible with Oracle).

Applicability

This feature is available for PolarDB for PostgreSQL (Compatible with Oracle) clusters that use Oracle syntax compatibility 2.0 and have a minor engine version of 2.0.14.5.1.0 or later.

Note

You can view the minor engine version in the PolarDB console or by running the SHOW polardb_version; statement. If your cluster's minor engine version does not meet the requirements, you must upgrade the minor engine version.

Background information

A PolarDB for PostgreSQL (Compatible with Oracle) cluster uses a one-writer, multiple-reader architecture. On a running read-only node (replica node), the LogIndex background worker process and backend processes use LogIndex data to replay WAL records in different buffers. This method essentially achieves parallel replay of WAL records.

WAL log replay is critical for the high availability (HA) of PolarDB clusters. Therefore, applying the parallel replay method to the standard log replay path is an effective optimization.

Parallel WAL log replay offers advantages in at least three scenarios:

  • The crash recovery process for the primary database, read-only nodes, and secondary databases.

  • The continuous replay of WAL logs by the LogIndex background worker process on a read-only node.

  • The continuous replay of WAL logs by the Startup process on a secondary database.

Terms

  • Block: A data block.

  • WAL: Write-Ahead Logging.

  • Task Node: A subtask execution node in the parallel execution framework that can receive and execute one subtask.

  • Task Tag: A classification identifier for a subtask. Subtasks with the same tag must be executed in sequence.

  • Hold List: A linked list that each child process in the parallel execution framework uses to schedule and execute replay subtasks.

How it works

  • Overview

    A single WAL log may modify multiple data blocks. The WAL log replay process can be defined as follows:

    1. Assume that the i-th WAL log has an LSN of LSN<sub>i</sub> and modifies m data blocks. The list of data blocks modified by the i-th WAL log is represented as Block<sub>i</sub> = [Block<sub>i,0</sub>, Block<sub>i,1</sub>, ..., Block<sub>i,m-1</sub>].

    2. The smallest replay subtask is defined as Task<sub>i,j</sub> = LSN<sub>i</sub> -> Block<sub>i,j</sub>. This subtask represents replaying the i-th WAL log on the data block Block<sub>i,j</sub>.

    3. Therefore, a WAL log that modifies m blocks can be represented as a collection of m replay subtasks: TASK<sub>i,∗</sub> = [Task<sub>i,0</sub>, Task<sub>i,1</sub>, ..., Task<sub>i,m-1</sub>].

    4. Furthermore, multiple WAL logs can be represented as a series of replay subtask collections: TASK<sub>∗,∗</sub> = [Task<sub>0,∗</sub>, Task<sub>1,∗</sub>, ..., Task<sub>N,∗</sub>].

    In the collection of log replay subtasks TASK<sub>∗,∗</sub>, the execution of a subtask does not always depend on the result of the preceding subtask.

    Assume the collection of replay subtasks is TASK<sub>∗,∗</sub> = [Task<sub>0,∗</sub>, Task<sub>1,∗</sub>, Task<sub>2,∗</sub>], where:

    • Task<sub>0,∗</sub> = [Task<sub>0,0</sub>, Task<sub>0,1</sub>, Task<sub>0,2</sub>]

    • Task<sub>1,∗</sub> = [Task<sub>1,0</sub>, Task<sub>1,1</sub>]

    • Task<sub>2,∗</sub> = [Task<sub>2,0</sub>]

    Also, Block<sub>0,0</sub> = Block<sub>1,0</sub>, Block<sub>0,1</sub> = Block<sub>1,1</sub>, and Block<sub>0,2</sub> = Block<sub>2,0</sub>.

    Then, there are three collections of subtasks that can be replayed in parallel: [Task<sub>0,0</sub>, Task<sub>1,0</sub>], [Task<sub>0,1</sub>, Task<sub>1,1</sub>], and [Task<sub>0,2</sub>, Task<sub>2,0</sub>].

    In summary, many subtask sequences within the collection of replay subtasks can be executed in parallel without affecting the consistency of the final result. PolarDB uses this concept in its parallel task execution framework, which is applied to the WAL log replay process.
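The grouping in the example above can be sketched in Python. This is a simplified model, not PolarDB's C implementation: subtasks that share a data block form one ordered sequence, and the per-block sequences are mutually independent, so each sequence can be replayed in parallel with the others.

```python
from collections import defaultdict

# Each subtask is (wal_index, block). Subtasks that touch the same block
# must stay in WAL order; subtasks on different blocks are independent.
# Block identities follow the example above: Block0,0 = Block1,0 ("A"),
# Block0,1 = Block1,1 ("B"), and Block0,2 = Block2,0 ("C").
subtasks = [
    (0, "A"), (0, "B"), (0, "C"),  # Task0,0  Task0,1  Task0,2
    (1, "A"), (1, "B"),            # Task1,0  Task1,1
    (2, "C"),                      # Task2,0
]

def group_by_block(tasks):
    """Partition subtasks into per-block ordered sequences.

    Sequences for different blocks can be replayed in parallel;
    within one sequence, WAL order must be preserved.
    """
    groups = defaultdict(list)
    for wal_idx, block in tasks:
        groups[block].append((wal_idx, block))
    return dict(groups)

groups = group_by_block(subtasks)
# Three independent replay sequences, matching the example:
# "A": [Task0,0, Task1,0], "B": [Task0,1, Task1,1], "C": [Task0,2, Task2,0]
```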

  • Parallel task execution framework

    A segment of shared memory is divided equally based on the number of concurrent processes. Each segment acts as a circular queue and is allocated to one process. You can configure the depth of each circular queue by setting a parameter.

    • Dispatcher process.

      • Controls concurrent scheduling by dispatching tasks to specified processes.

      • Removes completed tasks from the queue.

    • Process group.

      Each process in the group retrieves a task from its corresponding circular queue and executes it based on the task's state.

    • Tasks

      A circular queue consists of Task Nodes. Each Task Node has one of five states: Idle, Running, Hold, Finished, or Removed.

      • Idle: The Task Node is not assigned a task.

      • Running: The Task Node is assigned a task and is either waiting for execution or is being executed.

      • Hold: The task in the Task Node has a dependency on a preceding task and must wait for that task to finish.

      • Finished: A process in the process group has finished executing the task.

      • Removed: When the Dispatcher process finds that a task and all of its prerequisite tasks are in the Finished state, it changes the task's state to Removed. This state indicates that the Dispatcher process has deleted the task and its prerequisites from the management struct. This mechanism ensures that the Dispatcher process handles the results of dependent tasks in the correct order.

      In the state machine transitions shown above, the transitions marked by black lines are completed in the Dispatcher process. The transitions marked by orange lines are completed in the parallel replay process group.
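The five states and their legal transitions can be sketched as a small state machine. The transition table below is a hypothetical reading of the description above (in particular, Removed -> Idle for node reuse is an assumption, not stated in the text), not the actual Task Node struct:

```python
from enum import Enum, auto

class TaskState(Enum):
    IDLE = auto()      # node is not assigned a task
    RUNNING = auto()   # assigned and waiting for, or under, execution
    HOLD = auto()      # blocked on a prerequisite task
    FINISHED = auto()  # a worker finished executing the task
    REMOVED = auto()   # dispatcher deleted the task from its structs

# Hypothetical transition table consistent with the description above.
# Dispatcher-side: assign a task (Idle -> Running/Hold) and retire it
# (Finished -> Removed; Removed -> Idle is an assumed reuse step).
# Worker-side: release a held task (Hold -> Running) and complete it
# (Running -> Finished).
ALLOWED = {
    TaskState.IDLE: {TaskState.RUNNING, TaskState.HOLD},
    TaskState.HOLD: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.FINISHED},
    TaskState.FINISHED: {TaskState.REMOVED},
    TaskState.REMOVED: {TaskState.IDLE},
}

def transition(state, new_state):
    """Apply one state change, rejecting transitions the table forbids."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```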

    • Dispatcher process

      The Dispatcher process uses three key data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.

      • Task HashMap: Records the hash mapping between Task Tags and their corresponding task execution lists.

        • Each task has a specific Task Tag. If two tasks have a dependency, they share the same Task Tag.

        • When a task is dispatched, if the task has a prerequisite, its state is marked as Hold. The task must wait for its prerequisite to be executed.

      • Task Running Queue: Records the tasks that are currently being executed.

      • Task Idle Nodes: Records the Task Nodes that are currently in the Idle state for different processes in the process group.

      The Dispatcher uses the following scheduling policies:

      • If a task with the same Task Tag as the new task is already running, the new task is preferentially assigned to the process that is handling the last task in that Task Tag's linked list. This policy executes dependent tasks on the same process to reduce the overhead of inter-process synchronization.

      • If the preferred process's queue is full, or if no task with the same Task Tag is running, a process is selected in sequence from the process group. An Idle Task Node is then retrieved from that process's queue to schedule the task. This policy aims to distribute tasks as evenly as possible across all processes.

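The two scheduling policies can be sketched as follows. This is a deliberately simplified model: each worker's circular queue is a plain bounded list, and the Task HashMap is reduced to a `tag_owner` dict mapping a Task Tag to its preferred worker; none of these names come from the PolarDB source.

```python
class Dispatcher:
    """Sketch of the dispatcher's two scheduling policies."""

    def __init__(self, n_workers, queue_depth):
        self.queues = [[] for _ in range(n_workers)]  # one queue per worker
        self.depth = queue_depth                      # circular-queue depth
        self.tag_owner = {}                           # Task Tag -> worker index
        self.next_rr = 0                              # round-robin cursor

    def dispatch(self, tag, task):
        """Assign a task to a worker; return the worker index or None."""
        # Policy 1: tag affinity -- send dependent tasks (same Task Tag)
        # to the worker already handling that tag.
        owner = self.tag_owner.get(tag)
        if owner is not None and len(self.queues[owner]) < self.depth:
            self.queues[owner].append((tag, task))
            return owner
        # Policy 2: otherwise pick workers in sequence so tasks are
        # distributed as evenly as possible.
        for _ in range(len(self.queues)):
            w = self.next_rr
            self.next_rr = (self.next_rr + 1) % len(self.queues)
            if len(self.queues[w]) < self.depth:
                self.queues[w].append((tag, task))
                self.tag_owner[tag] = w
                return w
        return None  # every queue is full
```

For example, two tasks with the tag of page A land on the same worker even though a round-robin assignment would have split them, while a task with a new tag goes to the next worker in sequence.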
    • Process group

      This parallel execution applies to tasks of the same type that share the same Task Node data structure. When the process group is initialized, you can configure SchedContext to specify the function pointers that execute specific tasks:

      • TaskStartup: Performs the required initialization action before a process executes tasks.

      • TaskHandler: Executes a specific task based on the incoming Task Node.

      • TaskCleanup: Performs the required cleanup action before a process exits.

      A process in the process group retrieves a Task Node from the circular queue. If the state of the Task Node is Hold, the process inserts the Task Node at the tail of the Hold List. If the state of the Task Node is Running, the process calls TaskHandler to execute the task. If TaskHandler fails, the system sets a retry count for the Task Node (default: 3) and inserts the Task Node at the head of the Hold List.

      The process searches the Hold List from the beginning for an executable task. If a task's state is Running and its wait count is 0, the process executes the task. If the task's state is Running and its wait count is greater than 0, the process decrements the wait count by 1.
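The worker-side loop described above can be sketched in Python. The dict-based node with `state`, `wait`, and `retries` keys is a hypothetical stand-in for the Task Node struct, and `handler` stands in for TaskHandler:

```python
from collections import deque

def worker_step(node, hold_list, handler):
    """One step of a worker: take a node from its queue and act on its state."""
    if node["state"] == "Hold":
        hold_list.append(node)          # dependency not met: park at the tail
    elif node["state"] == "Running":
        if handler(node):               # TaskHandler succeeded
            node["state"] = "Finished"
        else:                           # TaskHandler failed
            node["retries"] = 3         # default retry count
            hold_list.appendleft(node)  # retry soon: park at the head

def scan_hold_list(hold_list, handler):
    """Scan the Hold List from the head for one executable task.

    Running nodes with a wait count of 0 are executed; Running nodes
    with a positive wait count have it decremented by 1.
    """
    for node in list(hold_list):
        if node["state"] == "Running":
            if node["wait"] == 0:
                if handler(node):
                    node["state"] = "Finished"
                    hold_list.remove(node)
                return
            node["wait"] -= 1
```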

  • WAL parallel replay

    LogIndex data records the mapping between Write-Ahead Logging (WAL) logs and the data blocks they modify, and also supports retrieval by LSN. During the continuous replay of WAL logs on a standby node, PolarDB uses a parallel task execution framework. This framework uses LogIndex data to parallelize WAL log replay tasks, which accelerates data synchronization on the standby node.

    Workflow

    • Startup process: Parses WAL logs and builds LogIndex data without replaying the WAL logs.

    • LogIndex BGW replay process: Acts as the Dispatcher process in the parallel task execution framework. This process uses LSNs to retrieve LogIndex data, builds log replay subtasks, and assigns them to the parallel replay process group.

    • Processes in the parallel replay process group: Execute log replay subtasks and replay a single log on a data block.

    • Backend process: When reading a data block, the process uses the PageTag to retrieve LogIndex data. This action obtains the linked list of LSN logs that modified the block. The process then replays the entire log chain on the data block.

    • The Dispatcher process uses LSNs to retrieve LogIndex data. It enumerates PageTags and their corresponding LSNs in their LogIndex insertion order to build {LSN -> PageTag} mappings, which serve as Task Nodes.

    • The PageTag is used as the Task Tag for the Task Node.

    • The enumerated Task Nodes are dispatched to the child processes in the parallel execution framework's process group for replay.
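The Dispatcher's enumeration step can be illustrated with a small sketch. The flat `logindex` list and dict-based nodes are hypothetical simplifications of the actual LogIndex structures:

```python
# Hypothetical flat view of LogIndex data: each entry records one WAL
# record (by LSN) and one page it modifies (PageTag), in insertion order.
logindex = [
    (100, ("tbl", 0)),  # WAL record at LSN 100 modifies page 0 of "tbl"
    (100, ("tbl", 1)),
    (110, ("tbl", 0)),
    (120, ("idx", 3)),
]

def build_task_nodes(entries):
    """Turn LogIndex entries into {LSN -> PageTag} replay Task Nodes.

    The PageTag doubles as the Task Tag, so replays of the same page keep
    their LSN order while different pages can be replayed in parallel.
    """
    return [{"lsn": lsn, "tag": page} for lsn, page in entries]

nodes = build_task_nodes(logindex)
# Nodes tagged ("tbl", 0) must run in order (LSN 100, then 110);
# nodes with other tags are independent of them.
```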

Usage guide

To enable the WAL parallel replay feature, add the following parameter to the postgresql.conf file on the standby node.

polar_enable_parallel_replay_standby_mode = ON