This topic describes the monitoring metrics for log sequence number (LSN) latency and how to handle the latency. The following LSNs are monitored:
- The write LSN of the current log on the primary node. You can use the pg_current_wal_insert_lsn() function to query this LSN.
- The minimum LSN for log replay among all secondary nodes. You can use the polar_oldest_apply_lsn() function to query this LSN.
- The consistent LSN of the primary node. You can use the polar_consistent_lsn() function to query this LSN. Note: For more information, see Consistency LSNs.
- The checkpoint, which is the starting LSN for log replay when the database crashes and is restored. You can use the redo_lsn column returned by the pg_control_checkpoint() function to query this LSN.
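For reference, each of the functions listed above can also be called on its own. This is only a usage sketch of the functions already named; the polar_* functions require the polar_monitor plug-in, as noted in the next section:
select pg_current_wal_insert_lsn();
select polar_oldest_apply_lsn();
select polar_consistent_lsn();
select redo_lsn from pg_control_checkpoint();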
Monitoring method
Execute the following statement to query the preceding LSNs and the differences between them:
select *,
pg_size_pretty(pg_wal_lsn_diff(WP, AP)) as "WP-AP",
pg_size_pretty(pg_wal_lsn_diff(AP, CP)) as "AP-CP",
pg_size_pretty(pg_wal_lsn_diff(WP, CP)) as "WP-CP",
pg_size_pretty(pg_wal_lsn_diff(WP, redo_lsn)) as "WP-REDO"
from
(select pg_current_wal_insert_lsn() as WP,
polar_oldest_apply_lsn() as AP,
polar_consistent_lsn() as CP,
redo_lsn
from pg_control_checkpoint()
) as lsn_info;
Note: To use the polar_* functions in the preceding statement, you must install the polar_monitor plug-in.
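A minimal installation sketch, assuming polar_monitor is packaged as a standard PostgreSQL extension (verify the installation method for your PolarDB release):
create extension if not exists polar_monitor;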
Replay latency of secondary nodes
If the monitoring results show a high WP-AP value, the replay latency of the secondary nodes is high. High replay latency has the following impacts:
- The primary node cannot reuse WAL logs, so a large number of WAL logs accumulate and storage usage increases. You can check this problem by querying the storage space used by the cluster or by WAL logs (see the example after this list).
- You cannot query the latest updated data of the primary node from the secondary node.
- If a proxy is used for read/write splitting, the proxy may route all requests to the primary node, so read/write splitting becomes ineffective.
- High replay latency also causes latency of the consistent LSN and checkpoints, which affects cluster availability.
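A hedged example of checking WAL storage usage, assuming the pg_ls_waldir() function is available and accessible as in community PostgreSQL 10 and later:
-- Count WAL segment files and their total size on the primary node
select count(*) as wal_files,
       pg_size_pretty(sum(size)) as wal_size
from pg_ls_waldir();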
Perform the following operations to troubleshoot the issue:
- Execute the following statement to query replication slots and check whether the primary/secondary synchronization is interrupted:
select * from polar_replication_slots;
Check whether the active field of any replication slot is f. If yes, the primary/secondary synchronization is interrupted, and you must further investigate the causes of the interruption. Note: Replication slots in PolarDB fall into two types: replica and standby. Replication slot types correspond to node types in a PolarDB cluster. A replica node shares the same data as the primary node, so interrupted synchronization between the primary node and a replica node has a profound impact on the primary node and must be handled in a timely manner. A standby node and the primary node are in a relationship similar to that between the traditional primary and secondary nodes in PostgreSQL, and each of them has independent data. The replication slot type is identified by the name displayed in the slot_name field.
- If the primary/secondary synchronization is normal, execute the following statement to check the latency between the primary and secondary nodes:
select pid, usename, application_name, client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), sent_lsn) as sent_delay,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), write_lsn) as write_delay,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), flush_lsn) as flush_delay,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), replay_lsn) as replay_delay
from pg_stat_replication;
The query results show the following latency types:
- sent_delay: the latency for the primary node to send logs.
- write_delay: the latency for the secondary node to write the received logs to the storage.
- flush_delay: the latency for the secondary node to flush the received logs to the storage.
- replay_delay: the latency for the secondary node to replay logs.
Note: If the replay_delay value is high, you usually must log on to the host where the secondary node is located and check the running status of the startup process on the secondary node, as shown in the example below. Further troubleshooting is required based on the check results.
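A hedged way to check the startup process from SQL, assuming the pg_stat_activity view reports background processes with a backend_type column as in community PostgreSQL 10 and later; alternatively, inspect the process directly on the host:
-- Run on the secondary node to check the replay (startup) process
select pid, backend_type, wait_event_type, wait_event
from pg_stat_activity
where backend_type = 'startup';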
Latency of consistent LSN
A high WP-CP value indicates a long latency of the consistent LSN. Latency of the consistent LSN prevents timely execution of checkpoints, which increases the time required to restore a crashed cluster and affects cluster availability. Perform the following checks to handle the latency:
- If latency of the consistent LSN coincides with latency of the secondary node, troubleshoot the secondary node latency first, because it may be the cause of the consistent LSN latency.
- Check whether batch data import or modification is being performed. Such operations produce a large amount of dirty data in a short period of time, and the database cluster may not be able to write the dirty pages to the storage in a timely manner, which results in latency of the consistent LSN. This is a normal case, and the latency of the consistent LSN decreases after the operations are complete.
- Check whether the cluster specifications are high and a large shared_buffers value is set. The larger the shared_buffers value, the larger the buffer pool of the database and the more dirty pages that can be cached in it. The upper limit of the PolarDB flushing rate depends on disk I/O and parallel flushing efficiency. When the upper limit is reached, the rate at which dirty pages in the buffer pool are written to the storage is also at its peak. If dirty pages are generated faster than they are flushed, dirty pages accumulate in the buffer pool and the latency of the consistent LSN gradually increases.
You can modify the following flushing-related parameters to improve flushing efficiency:
- polar_parallel_bgwriter_workers: the number of parallel flushing processes. Default value: 5. Note: This parameter defines the number of parallel flushing processes that are started by default. PolarDB also dynamically starts and stops some parallel flushing processes based on the primary/secondary latency. The actual number of parallel flushing processes is min(2 * polar_parallel_bgwriter_workers, 16).
- polar_parallel_bgwriter_delay: the waiting time before each flush by a parallel flushing process. Default value: 10ms.
You can increase the polar_parallel_bgwriter_workers value and decrease the polar_parallel_bgwriter_delay value to improve flushing efficiency. Note: An excessively large number of flushing processes may degrade the overall throughput of the cluster, and high I/O consumption by flushing may reduce the efficiency of writing WAL logs for other requests and even the efficiency of SQL execution in the cluster.
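A minimal adjustment sketch, assuming both parameters are reloadable PostgreSQL configuration parameters (GUCs) and that polar_parallel_bgwriter_delay accepts a time unit (its default is listed as 10ms above); the values are illustrative only:
-- Increase parallel flushing processes and shorten the flush delay (illustrative values)
alter system set polar_parallel_bgwriter_workers = 8;
alter system set polar_parallel_bgwriter_delay = '5ms';
select pg_reload_conf();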
Checkpoint latency
A high WP-REDO value indicates a long checkpoint latency, which affects the number of logs that must be replayed when a crashed cluster is restored. The more logs that must be replayed, the longer it takes to restore the cluster and the greater the impact on cluster availability. Possible causes and solutions:
- Latency of the consistent LSN causes checkpoint latency. In this case, first troubleshoot the causes of the consistent LSN latency.
- The factors that cause latency of the consistent LSN, such as batch data modification and inefficient flushing, can also cause checkpoint latency.
- The checkpoint execution interval is long and many data modifications are performed within the interval. You can decrease the checkpoint_timeout or max_wal_size value to increase the checkpoint frequency, as shown in the example after this list.
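A hedged example of adjusting the checkpoint frequency with the standard PostgreSQL parameters named above; the values are illustrative and should be tuned for your workload:
-- Check the current settings
show checkpoint_timeout;
show max_wal_size;
-- Trigger checkpoints more frequently (illustrative values)
alter system set checkpoint_timeout = '5min';
alter system set max_wal_size = '4GB';
select pg_reload_conf();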
Other monitoring metrics
- Number of dirty pages
Execute the following statement to view the number of dirty pages in the buffer pool:
select size from polar_flushlist;
The polar_flushlist view records partial statistics of the flush list, and the size value in the preceding query result is the number of dirty pages. Therefore, the total size of dirty pages in the buffer pool is size multiplied by the size of each page (8 KB by default), as shown in the example after this list.
- Copy buffer pool
Execute the following statement to view the number of buffers in the copy buffer pool:
select copy - release as cbuf_size from polar_cbuf;
The polar_cbuf view records partial statistics of the copy buffer pool. The copy value indicates the number of buffers copied into the copy buffer pool, and the release value indicates the number of buffers released. The difference between the two values is the number of buffers that remain in the copy buffer pool.
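A small worked example of the dirty-page size computation described above, assuming the default 8 KB page size:
-- Total size of dirty pages = number of dirty pages * 8 KB
select size as dirty_pages,
       pg_size_pretty(size::bigint * 8192) as dirty_pages_size
from polar_flushlist;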