The online promotion feature in PolarDB for PostgreSQL lets you promote a read-only node to the primary node.
Scope
Your PolarDB for PostgreSQL cluster must run one of the following minor engine versions:
PostgreSQL 18 with minor engine version 2.0.18.0.1.0 or later
PostgreSQL 17 with minor engine version 2.0.17.2.1.0 or later
PostgreSQL 16 with minor engine version 2.0.16.3.1.1 or later
PostgreSQL 15 with minor engine version 2.0.15.7.1.1 or later
PostgreSQL 14 with minor engine version 2.0.14.5.1.0 or later
PostgreSQL 11 with minor engine version 2.0.11.2.1.0 or later
You can view the minor engine version in the PolarDB console or by running the SHOW polardb_version; statement. If the minor engine version does not meet the requirements, upgrade it.
Background information
PolarDB for PostgreSQL uses a one-writer, multiple-reader architecture based on shared storage. This differs from the primary/secondary architecture of a traditional database in the following ways:
Standby node: A standby node is the secondary node of a traditional database. It has independent storage and synchronizes data with the primary node by transferring complete write-ahead log (WAL) records.
Read-only node: A Replica node is a read-only secondary node in PolarDB for PostgreSQL. It shares storage with the primary node and synchronizes data by transferring WAL metadata.
A traditional database supports promoting a standby node to a primary node without requiring a restart. The promoted node can then continue to serve read and write requests. This ensures high availability (HA) and reduces the recovery time objective (RTO).
PolarDB for PostgreSQL also needs the capability to promote a read-only secondary node to the primary node. Because Replica nodes are different from the standby nodes of traditional databases, PolarDB for PostgreSQL provides an online promotion mechanism for its one-writer, multiple-reader architecture.
Usage
You can use the pg_ctl tool to promote the Replica node:
pg_ctl promote -D [datadir]
How it works
The online promotion feature is based on a trigger mechanism.
Trigger mechanism
PolarDB for PostgreSQL uses the same method as a traditional database to promote a secondary node. The feature is triggered in one of the following ways:
Run the promote command using the pg_ctl utility. The pg_ctl utility sends a signal to the postmaster process, which then notifies other processes to perform the required operations and complete the promotion.
Define the path of the trigger file in the recovery.conf file. Other components are triggered when this trigger file is generated.
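For the trigger-file method, the path can be set with the standard PostgreSQL trigger_file parameter in recovery.conf. This is a minimal sketch with a hypothetical path, assuming the stock PostgreSQL 11-style parameter:

```ini
# recovery.conf on the Replica node (PostgreSQL 11-style configuration)
# Promotion starts when this file appears; the path below is an example.
trigger_file = '/path/to/promote.trigger'
```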
Note: Compared to promoting a standby node in a traditional database, promoting a Replica node in PolarDB for PostgreSQL involves several considerations:
After a Replica node is promoted to the primary node, the shared storage must be remounted in read/write mode.
A Replica node maintains important control information in memory. On a primary node, this information is persisted to shared storage. During promotion, this information must also be persisted to shared storage.
The online promotion process must identify which data can be written to shared storage.
When a Replica node replays WAL logs, its buffer eviction methods and dirty page flushing characteristics are different from those of the primary node. The online promotion process must handle these differences.
The following sections describe how each child process of a Replica node handles the online promotion process.
Postmaster process
The postmaster process starts the online promotion process after it discovers the trigger file or receives the online promotion command.
It sends a SIGTERM signal to all current backend processes.
Note: Read-only nodes can continue to provide read-only services during the online promotion process, but the data may not be up-to-date. To prevent reading stale data from the new primary node during the switchover, all backend sessions must be disconnected. Read and write services become available after the Startup process exits.
It remounts the shared storage in read/write mode.
Note: This step requires support from the underlying storage.
It sends a SIGUSR2 signal to the Startup process to end log replay and handle the online promotion operation.
It sends a SIGUSR2 signal to the Polar Worker auxiliary process to stop parsing some LogIndex data. This data is useful only to a Replica node during normal operation.
It sends a SIGUSR2 signal to the LogIndex background worker (BGW) process to handle the online promotion operation.
The following figure shows this process:

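The postmaster steps above can be sketched as an ordered signal-dispatch sequence. This is an illustrative model, not PolarDB's actual C implementation; all function and process names here are assumptions:

```python
# Sketch of the postmaster's online promotion sequence described above.
# Names (online_promote, the process labels) are illustrative only.
SIGTERM, SIGUSR2 = "SIGTERM", "SIGUSR2"

def online_promote(backends, storage):
    """Return the ordered actions the postmaster performs on promotion."""
    actions = []
    # 1. Disconnect all backend sessions so no stale reads cross the switch.
    for pid in backends:
        actions.append((SIGTERM, pid))
    # 2. Remount shared storage in read/write mode (needs storage support).
    storage["mode"] = "rw"
    actions.append(("remount", "rw"))
    # 3. Tell Startup to end replay, Polar Worker to stop parsing
    #    Replica-only LogIndex data, and the LogIndex BGW to promote.
    for proc in ("startup", "polar worker", "logindex bgw"):
        actions.append((SIGUSR2, proc))
    return actions

acts = online_promote(backends=[101, 102], storage={"mode": "ro"})
```

The ordering matters: sessions are terminated and storage is remounted before the replay-side processes are signaled, so the Startup process finishes replay against writable shared storage.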
Startup process
The Startup process replays all WAL logs generated by the old primary node and generates the corresponding LogIndex data.
It confirms that the last checkpoint from the old primary node has also completed on the Replica node. This ensures that the checkpoint data that must be written locally on the Replica node has been flushed to disk.
It waits for the LogIndex BGW process to enter the POLAR_BG_WAITING_RESET state.
It copies the local data, such as clog, from the Replica node to the shared storage.
It resets the WAL Meta Queue memory space and reloads slot information from shared storage. It then resets the replay offset of the LogIndex BGW process to the minimum of its current offset and the consistency offset. This new offset is the starting point for the next replay by the LogIndex BGW process.
It sets the node role to primary and sets the state of the LogIndex BGW process to POLAR_BG_ONLINE_PROMOTE. At this point, the cluster can serve read and write requests.
The following figure shows this process:

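The key arithmetic in the Startup steps above is the offset reset: background replay restarts from the smaller of the current replay offset and the consistency offset, so no WAL record between them is skipped. A minimal sketch, with illustrative names and integer stand-ins for LSNs:

```python
# Sketch of the replay-offset reset performed by the Startup process.
# Real offsets are WAL LSNs; plain integers are used here for clarity.

def reset_bgw_replay_offset(current_offset, consistency_offset):
    """Choose where the LogIndex BGW process restarts background replay.

    Replay of the old primary's WAL, the checkpoint confirmation, and the
    copy of local data (such as clog) to shared storage are assumed to
    have completed before this point.
    """
    return min(current_offset, consistency_offset)

# The BGW restarts from 0x2800 here, the lower of the two offsets.
restart = reset_bgw_replay_offset(current_offset=0x3000,
                                  consistency_offset=0x2800)
```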
LogIndex BGW process
The LogIndex BGW process has its own state machine and runs according to this state machine throughout its lifecycle. The following table describes the operations for each state.
State
Description
POLAR_BG_WAITING_RESET
The LogIndex BGW process state is reset. It notifies other processes that the state machine has changed.
POLAR_BG_ONLINE_PROMOTE
Reads LogIndex data, organizes and distributes replay tasks, and uses the parallel replay process group to replay WAL logs. The process in this state must replay all LogIndex data before it can switch to another state. Finally, it advances the replay offset of the background replay process.
POLAR_BG_REDO_NOT_START
Indicates that the replay task has ended.
POLAR_BG_RO_BUF_REPLAYING
When the Replica node is running normally, the process is in this state. It reads LogIndex data and replays a certain amount of WAL logs in sequence. After each round of replay, it advances the replay offset of the background replay process.
POLAR_BG_PARALLEL_REPLAYING
The LogIndex BGW process reads a certain amount of LogIndex data, organizes and distributes replay tasks, and uses the parallel replay process group to replay WAL logs. After each round of replay, it advances the replay offset of the background replay process.
The following figure shows this process:

After the LogIndex BGW process receives the SIGUSR2 signal from the postmaster process, it performs the online promotion operation as follows:
It flushes all LogIndex data to disk and switches its state to POLAR_BG_WAITING_RESET.
It waits for the Startup process to switch its state to POLAR_BG_ONLINE_PROMOTE.
Before the Replica node performs the online promotion operation, the background replay process replays only the pages in the buffer pool.
During the online promotion process of the Replica node, some pages from the previous primary node may not have been flushed to disk from memory. Therefore, the background replay process replays all WAL logs in sequence. After replay, it calls MarkBufferDirty to mark the page as a dirty page, which then waits to be flushed.
After the replay is complete, it advances the replay offset of the background replay process and then switches its state to POLAR_BG_REDO_NOT_START.
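The state table and the promotion steps above can be read as a small state machine. This is a simplified sketch of that reading; the state names come from the source, while the event names and transition rules are assumptions:

```python
# Simplified sketch of the LogIndex BGW state machine during promotion.
# Event names ("sigusr2", "startup_reset", "replay_done") are illustrative.

PROMOTE_PATH = [
    "POLAR_BG_RO_BUF_REPLAYING",   # normal Replica operation
    "POLAR_BG_WAITING_RESET",      # LogIndex flushed; waiting for Startup
    "POLAR_BG_ONLINE_PROMOTE",     # replay ALL remaining LogIndex data
    "POLAR_BG_REDO_NOT_START",     # background replay has ended
]

def next_state(state, event):
    """Advance the BGW state for a given event (simplified transitions)."""
    table = {
        # SIGUSR2 from the postmaster: flush LogIndex data, then wait.
        ("POLAR_BG_RO_BUF_REPLAYING", "sigusr2"): "POLAR_BG_WAITING_RESET",
        # The Startup process switches the state to start promotion replay.
        ("POLAR_BG_WAITING_RESET", "startup_reset"): "POLAR_BG_ONLINE_PROMOTE",
        # All LogIndex data replayed and the replay offset advanced.
        ("POLAR_BG_ONLINE_PROMOTE", "replay_done"): "POLAR_BG_REDO_NOT_START",
    }
    return table.get((state, event), state)  # unknown events keep the state
```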
Dirty page flushing control
Each dirty page has an Oldest LSN. This LSN is ordered in the FlushList and is used to determine the consistency offset.
After a Replica node is promoted, both replay and new page writes occur simultaneously. On a primary node, the current WAL insertion offset is directly set as the buffer's Oldest LSN. If this is done on the newly promoted node, a new consistency offset might be set before a buffer with a smaller LSN is flushed to disk.
Therefore, two issues must be addressed during the online promotion of a Replica node:
Setting the Oldest LSN for dirty pages when replaying WAL logs from the old primary node.
Setting the Oldest LSN for dirty pages generated by the new primary node.
Note: During the online promotion of a Replica node, PolarDB for PostgreSQL sets the Oldest LSN for dirty pages in both cases to the replay offset advanced by the LogIndex BGW process. The consistency offset is advanced only after all buffers marked with the same Oldest LSN are flushed to disk.
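The rule above can be sketched with a FlushList ordered by Oldest LSN. This is an illustrative model, not PolarDB's buffer manager; the class and method names are assumptions, and integers stand in for LSNs:

```python
# Sketch of consistency-offset tracking with a FlushList ordered by
# Oldest LSN, as described above.
import heapq

class FlushList:
    def __init__(self):
        self._heap = []  # (oldest_lsn, buffer_id), ordered by Oldest LSN

    def mark_dirty(self, buffer_id, oldest_lsn):
        # During promotion, oldest_lsn is the replay offset advanced by
        # the LogIndex BGW process, for replayed and new pages alike.
        heapq.heappush(self._heap, (oldest_lsn, buffer_id))

    def flush_oldest(self):
        heapq.heappop(self._heap)  # the buffer with the smallest Oldest LSN

    def consistency_lsn(self):
        # The consistency offset cannot pass an LSN until every buffer
        # marked with that Oldest LSN has been flushed to disk.
        return self._heap[0][0] if self._heap else None

fl = FlushList()
replay_offset = 0x2800                 # advanced by the LogIndex BGW
fl.mark_dirty("page A", replay_offset)  # replayed old-primary page
fl.mark_dirty("page B", replay_offset)  # new write on the promoted node
fl.flush_oldest()                       # one buffer at 0x2800 still dirty
```

Because both pages carry the same Oldest LSN, the consistency offset stays at that LSN until both are flushed, which is exactly the hazard the note describes for using the WAL insertion offset directly.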