
ApsaraDB RDS: Troubleshoot WAL log accumulation in PostgreSQL

Last Updated: Mar 28, 2026

WAL (write-ahead logging) logs accumulate when the checkpointer process cannot delete them — typically because a replication slot is holding them back, WAL-related parameters are set too high, or write operations overwhelm the cleanup process. This topic covers how to diagnose and resolve WAL log accumulation in both the primary database and read-only instances.

Background information

Write-ahead logging (WAL) is a core PostgreSQL mechanism that ensures data durability and improves system reliability and performance. WAL helps prevent data loss and ensures that data can be restored reliably even if multiple faults occur.

WAL log accumulation in the primary database

Inactive replication slots or unreported LSN

A replication slot is a key tool in PostgreSQL to implement high availability and disaster recovery. A replication slot tells PostgreSQL not to delete WAL logs until the consumer has processed them. If a slot becomes inactive — because the consumer stopped reporting its log sequence number (LSN) — WAL logs tied to that slot accumulate indefinitely.

Step 1: Identify slots holding back WAL deletion.

Run the following query against the pg_replication_slots system view to check how much WAL is held back by each slot:

SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn)) AS delay_size
FROM pg_replication_slots;

If delay_size for a slot is large or matches the total WAL accumulation you observe, and the slot is inactive, that slot is likely stalling deletion.

Step 2: Drop the inactive slot.

Evaluate whether the slot is still needed based on your business requirements, then drop it if it is not:

SELECT pg_drop_replication_slot('<slot_name>');
Warning

Dropping a replication slot permanently removes it. Verify the slot is genuinely inactive and has no downstream consumers before dropping it.

Step 3: Confirm removal.

Re-run the identification query from step 1 to verify the slot has been removed.

For more information on managing replication slots, see Use the WAL log management feature.

Incorrect parameter settings

If wal_keep_segments (PostgreSQL 12 and earlier), wal_keep_size (PostgreSQL 13 and later), or max_wal_size is set too high, PostgreSQL retains far more WAL than necessary. Review the current values and lower them to match your actual replication and recovery needs.
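You can inspect the current values of these parameters from the pg_settings view. The ALTER SYSTEM example below is a sketch only; on ApsaraDB RDS, parameters are typically changed through the console rather than in SQL, and the '1GB' value is illustrative, not a recommendation:

-- Show the WAL retention parameters and their units.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_segments', 'wal_keep_size', 'max_wal_size', 'min_wal_size');

-- On a self-managed instance with sufficient privileges, lower an oversized value and reload:
ALTER SYSTEM SET wal_keep_size = '1GB';
SELECT pg_reload_conf();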

VACUUM storm and high write volume

A VACUUM storm — many automatic or manual VACUUM operations running simultaneously — generates a large volume of WAL logs in a short period. This can spike I/O load and delay WAL cleanup. To reduce the impact, tune VACUUM parameters and stagger VACUUM execution to spread the load over time.
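For example, the following queries show the autovacuum parameters that control concurrency and throttling. The 10ms value is illustrative only; increasing autovacuum_vacuum_cost_delay slows each worker down, which spreads WAL generation over time at the cost of slower cleanup:

-- Check how many autovacuum workers can run at once and how aggressively they work.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_max_workers', 'autovacuum_naptime',
               'autovacuum_vacuum_cost_delay', 'autovacuum_vacuum_cost_limit');

-- On a self-managed instance, throttle autovacuum workers and reload (use the console on RDS):
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';
SELECT pg_reload_conf();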

High write throughput has a similar effect. Schedule bulk writes during off-peak periods where possible.

WAL log accumulation in read-only instances

Replication latency

WAL logs accumulate on a read-only instance when replay falls behind. Two common causes:

  • Long-running transactions blocking replay. A long-running transaction on the read-only instance can conflict with WAL log replay. Review the hot_standby_feedback and max_standby_streaming_delay settings. For example, if hot_standby_feedback is off and max_standby_streaming_delay is set to a large value, long-running queries on the read-only instance may delay replay significantly.

  • Underpowered read-only instance specifications. If the read-only instance has lower compute or storage specifications than the primary database, it may not keep up with the replication stream. Unreplayed WAL logs cannot be deleted. Evaluate and select appropriate specifications for the read-only database based on your business requirements.
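On the read-only instance, the following queries can help confirm replay lag, the conflict-related settings mentioned above, and any long-running transactions (run them as a user with sufficient privileges):

-- Replay lag: difference between the last WAL received and the last WAL replayed.
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(),
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;

-- Current values of the conflict-related settings.
SELECT name, setting
FROM pg_settings
WHERE name IN ('hot_standby_feedback', 'max_standby_streaming_delay');

-- Long-running transactions that may block replay, oldest first.
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;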

What's next

If WAL log accumulation persists after you complete these steps, contact the technical support team for ApsaraDB RDS for PostgreSQL.

References

You can manually delete inactive replication slots to allow AliPG to automatically delete WAL logs. For more information, see Use the WAL log management feature.