AnalyticDB: Bulk update

Last Updated: Mar 28, 2026

Single-row updates in AnalyticDB for PostgreSQL carry significant distributed transaction overhead. Batching multiple updates into a single SQL statement dramatically reduces that overhead and improves throughput.

AnalyticDB for PostgreSQL treats a data merge as an upsert operation: if an incoming row matches an existing record, it replaces the old version; if no match exists, the row is inserted.

How it works

Every UPDATE statement triggers a full distributed transaction cycle:

  1. The master node receives the UPDATE statement, initiates a distributed transaction, and locks the target table. Concurrent updates to the same table are not allowed.

  2. The master node routes the request to the matching segment nodes.

  3. Each segment node scans the index to locate the target row. For column-oriented tables, an update is physically a delete of the old row followed by an append of the new row at the end of the table. The updated data page is written to memory cache, and the change in table file length is recorded in the write-ahead log (WAL).

  4. Before the transaction commits, the updated data page and WAL are synchronized to the mirror. After synchronization completes, the master node ends the distributed transaction and returns a success response.

Each statement carries fixed overhead: SQL parsing, distributed transaction management, table locking, master-to-segment network round trips, and segment-to-mirror log synchronization. For single-row updates, this overhead dominates the actual work.
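A back-of-envelope calculation shows why the fixed cost dominates and why batching helps. The cost figures below are illustrative assumptions for the sketch, not measured AnalyticDB numbers:

```python
# Illustrative (assumed) costs per statement, in milliseconds.
FIXED_OVERHEAD_MS = 10.0   # parsing, transaction setup, locking, network, WAL sync
PER_ROW_WORK_MS = 0.05     # actual index lookup plus row rewrite

def total_time_ms(rows: int, batch_size: int) -> float:
    """Time to update `rows` rows using statements of `batch_size` rows each."""
    statements = -(-rows // batch_size)  # ceiling division
    return statements * FIXED_OVERHEAD_MS + rows * PER_ROW_WORK_MS

single = total_time_ms(10_000, 1)      # one UPDATE per row
batched = total_time_ms(10_000, 1000)  # 1,000 rows per UPDATE

print(f"single-row: {single:.0f} ms, batched: {batched:.0f} ms")
```

With these assumed numbers, batching 1,000 rows per statement pays the fixed overhead 10 times instead of 10,000 times, so total time drops by more than two orders of magnitude even though the per-row work is unchanged.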

To reduce overhead, batch updates into a single SQL statement and a single transaction wherever possible.

Bulk update

Use a staging table to batch multiple row updates into one statement.

Step 1: Prepare the target table

CREATE TABLE target_table (c1 INT, c2 INT, PRIMARY KEY (c1));
INSERT INTO target_table SELECT s, s FROM generate_series(1, 10000000) AS s;

The target table is typically large. The primary key creates a unique index that the optimizer uses to locate rows efficiently during updates.

Step 2: Load new data into a staging table

A staging table holds the incoming data before it is merged into the target table. Load data into the staging table using the COPY command, an OSS external table, or direct INSERT.

CREATE TABLE source_table (c1 INT, c2 INT);
INSERT INTO source_table SELECT generate_series(1, 100), generate_series(1, 100);

After loading data, run ANALYZE source_table so that the optimizer has accurate row count statistics.

Step 3: Run the bulk UPDATE

SET optimizer = ON;
UPDATE target_table
SET c2 = source_table.c2
FROM source_table
WHERE target_table.c1 = source_table.c1;

Enable the ORCA optimizer with SET optimizer = ON before the UPDATE so that the plan uses the index. If ORCA is not available, run SET enable_nestloop = ON to force a nested loop join with index scans. For complex cases, such as multiple index fields or partitioned tables, ORCA is required to match the index correctly.

How the optimizer chooses a plan

When the staging table is small, the optimizer uses a nested loop join and index scans:

EXPLAIN UPDATE target_table SET c2 = source_table.c2
FROM source_table WHERE target_table.c1 = source_table.c1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Update  (cost=0.00..586.10 rows=25 width=1)
   ->  Result  (cost=0.00..581.02 rows=50 width=26)
         ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..581.02 rows=50 width=22)
               Hash Key: public.target_table.c1
               ->  Assert  (cost=0.00..581.01 rows=50 width=22)
                     Assert Cond: NOT public.target_table.c1 IS NULL
                     ->  Split  (cost=0.00..581.01 rows=50 width=22)
                           ->  Nested Loop  (cost=0.00..581.01 rows=25 width=18)
                                 Join Filter: true
                                 ->  Table Scan on source_table  (cost=0.00..431.00 rows=25 width=8)
                                 ->  Index Scan using target_table_pkey on target_table  (cost=0.00..150.01 rows=1 width=14)
                                       Index Cond: public.target_table.c1 = source_table.c1

As the staging table grows, the optimizer may switch to a hash join with full table scans if it estimates that to be more efficient:

INSERT INTO source_table SELECT generate_series(1, 1000), generate_series(1, 1000);
ANALYZE source_table;
EXPLAIN UPDATE target_table SET c2 = source_table.c2
FROM source_table WHERE target_table.c1 = source_table.c1;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Update  (cost=0.00..1485.82 rows=275 width=1)
   ->  Result  (cost=0.00..1429.96 rows=550 width=26)
         ->  Assert  (cost=0.00..1429.94 rows=550 width=22)
               Assert Cond: NOT public.target_table.c1 IS NULL
               ->  Split  (cost=0.00..1429.93 rows=550 width=22)
                     ->  Hash Join  (cost=0.00..1429.92 rows=275 width=18)
                           Hash Cond: public.target_table.c1 = source_table.c1
                           ->  Table Scan on target_table  (cost=0.00..477.76 rows=2500659 width=14)
                           ->  Hash  (cost=431.01..431.01 rows=275 width=8)
                                 ->  Table Scan on source_table  (cost=0.00..431.01 rows=275 width=8)

Bulk delete

Use the same staging table pattern for DELETE operations. The DELETE ... USING syntax identifies which rows to remove by joining the target table against the staging table:

DELETE FROM target_table
USING source_table
WHERE target_table.c1 = source_table.c1;

The query plan uses an index scan on the target table:

EXPLAIN DELETE FROM target_table
USING source_table WHERE target_table.c1 = source_table.c1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Delete (slice0; segments: 4)  (rows=50 width=10)
   ->  Nested Loop  (cost=0.00..41124.40 rows=50 width=10)
         ->  Seq Scan on source_table  (cost=0.00..6.00 rows=50 width=4)
         ->  Index Scan using target_table_pkey on target_table  (cost=0.00..205.58 rows=1 width=14)
               Index Cond: target_table.c1 = source_table.c1

Merge data with Delete + Insert

When merging data, first load the data into a staging table.

  • If you know in advance that all data to be merged already exists in the target table, use UPDATE statements to merge the data.

  • In most cases, part of the data to be merged already exists in the target table, and part of it is new with no matching records. In this case, use a combination of bulk delete and bulk insert within a single transaction:

SET optimizer = ON;
BEGIN;
DELETE FROM target_table USING source_table WHERE target_table.c1 = source_table.c1;
INSERT INTO target_table SELECT * FROM source_table;
COMMIT;

This removes all matching rows from the target table and then inserts all rows from the staging table, effectively upserting the entire staging table in one pass.
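The equivalence between delete-then-insert and an upsert can be illustrated with a small in-memory sketch. Plain Python dictionaries stand in for the tables here; this models the semantics only, not the storage layer:

```python
# target and source as {primary_key: value} maps standing in for the tables.
target = {1: "old-1", 2: "old-2", 3: "old-3"}
source = {2: "new-2", 4: "new-4"}  # key 2 exists in target, key 4 is new

# Step 1: bulk delete -- drop every target row whose key appears in source.
for key in source:
    target.pop(key, None)

# Step 2: bulk insert -- append all source rows.
target.update(source)

print(target)  # keys 1 and 3 untouched, 2 replaced, 4 inserted
```

The end state is exactly what a per-row upsert would produce: unmatched target rows survive, matched rows carry the new values, and brand-new rows appear.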

Real-time updates using VALUES expressions

The staging table pattern requires creating, populating, and dropping (or truncating) a table for each batch. For continuous, low-latency data synchronization, use inline VALUES expressions instead:

-- Bulk update without a staging table
UPDATE target_table
SET c2 = t.c2
FROM (VALUES (1, 1), (2, 2), (3, 3), ..., (2000, 2000)) AS t(c1, c2)
WHERE target_table.c1 = t.c1;

-- Bulk delete without a staging table
DELETE FROM target_table
USING (VALUES (1, 1), (2, 2), (3, 3), ..., (2000, 2000)) AS t(c1, c2)
WHERE target_table.c1 = t.c1;

Assemble the rows to update into the VALUES list in your application, then submit the statement. Both SET optimizer = ON and SET enable_nestloop = ON generate query plans that use indexes for VALUES-based updates. For complex cases involving multiple index fields or partitioned tables, use SET optimizer = ON.
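The application-side assembly step can be sketched as a small statement builder. This helper is an illustration, not part of any driver API: the function name is an assumption, and it handles trusted integer pairs only. For untrusted or non-numeric input, use your database driver's parameter binding or quoting instead of string formatting:

```python
def build_bulk_update(table: str, rows: list[tuple[int, int]]) -> str:
    """Build an UPDATE ... FROM (VALUES ...) statement for (c1, c2) integer pairs."""
    if not rows:
        raise ValueError("need at least one row")
    # Splice each (c1, c2) pair into the inline VALUES list.
    values = ", ".join(f"({c1}, {c2})" for c1, c2 in rows)
    return (
        f"UPDATE {table} SET c2 = t.c2 "
        f"FROM (VALUES {values}) AS t(c1, c2) "
        f"WHERE {table}.c1 = t.c1;"
    )

stmt = build_bulk_update("target_table", [(1, 1), (2, 2), (3, 3)])
print(stmt)
```

The same builder pattern applies to the DELETE ... USING form; only the statement template changes.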
