PolarDB Bulk Import Architecture and High-Throughput Ingestion Principles - PolarDB

Bulk import operations traditionally suffer from per-page I/O overhead, frequent WAL log writes, and synchronous disk persistence before each transaction commit. PolarDB for PostgreSQL (Compatible with Oracle) addresses these bottlenecks with a bulk import infrastructure that applies four layered optimizations across the entire import path, significantly reducing latency for operations on new objects within a transaction.

Prerequisites

Before you begin, ensure that you have:

A PolarDB for PostgreSQL (Compatible with Oracle) cluster running version 2.0, revision version 2.0.14.13.28.0 or later

Check the revision version in the PolarDB console or by running SHOW polardb_version;. If the revision version does not meet the requirement, create or refresh materialized views for bulk writes instead.

Supported operations

The following SQL operations automatically use the bulk import infrastructure:

Operation	Notes
`CREATE INDEX`
`REINDEX`
`VACUUM FULL`
`ALTER TABLE SET TABLESPACE`
`CREATE MATERIALIZED VIEW`
`REFRESH MATERIALIZED VIEW`
`CREATE TABLE AS` / `SELECT INTO`	Conditional — see note below

Whether CREATE TABLE AS or SELECT INTO uses the bulk import infrastructure depends on the wal_level parameter. If wal_level is set to a value lower than logical, the operation uses the bulk import infrastructure. If wal_level is logical, the operation uses the bulk insert interface of Multi-Insert Optimization instead.

How it works

When you bulk import data into new objects during a transaction, transaction isolation keeps those objects invisible to other processes until the transaction commits. PolarDB for PostgreSQL (Compatible with Oracle) takes advantage of this property to apply four layered optimizations across the entire import path.

1. Bypass buffer management

During page construction, data is written directly into process-private memory instead of the shared buffer pool. This eliminates contention for buffer pool locks and speeds up page construction.

2. Bulk page extension

Traditional import writes pages one at a time, triggering a storage management layer call for each filled page. On cloud storage with higher latency, these per-page calls accumulate into significant I/O wait overhead.

The bulk import infrastructure pre-aggregates multiple pages in process-private memory and issues a single batch page extension call when the number of pre-aggregated pages reaches a predefined threshold. This reduces the overhead from frequent single-page extensions.

3. Bulk WAL logging

Traditional import generates a separate WAL record for each inserted row, resulting in frequent I/O operations and high metadata overhead.

The bulk import infrastructure consolidates multiple rows into a single page and pre-aggregates multiple pages in process-private memory before logging. After a batch of pages is built, it uses XLOG_FPI-type WAL logs to record the entire page at once rather than incremental modifications. This produces more compact WAL records and reduces metadata overhead compared to row-by-row logging.

4. Asynchronous page persistence

Normally, all constructed pages must be physically written to storage before a transaction can commit — a synchronous step that adds latency to the main execution path.

After batch page construction completes, the bulk import infrastructure immediately copies pages into the shared buffer pool and marks them as dirty pages, allowing the transaction to commit without waiting for a physical disk write. A background parallel dirty page flushing process then writes those dirty pages to persistent storage asynchronously, using PolarDB's high-throughput dirty page flushing capability to overlap page construction and background writes.

Crash safety: Because XLOG_FPI-type WAL logs are recorded during page construction — before the transaction commits — these logs are available during crash recovery to ensure the integrity of committed data.

Configure the bulk import infrastructure

The polar_bulk_write_maxpages parameter controls when the bulk import infrastructure flushes pre-aggregated pages from process-private memory to the shared buffer pool.

Parameter	Default	Trigger	When to adjust
`polar_bulk_write_maxpages`	`1MB`	When pre-aggregated pages reach this limit, bulk page extension, bulk WAL logging, and page copying to the shared buffer pool are triggered	The default is pre-tuned for most workloads. Adjust only if profiling identifies memory pressure or I/O contention during bulk import operations.

To check the current value, run:

SHOW polar_bulk_write_maxpages;
 polar_bulk_write_maxpages
---------------------------
 1MB
(1 row)

References

Multi-Insert Table AM infrastructure