
How Does PolarDB Solve the Performance Problem of FPW?

This article discusses how PolarDB addresses the performance problems caused by FPW.

By digoal

Background

A page that is modified for the first time after a checkpoint must be written into the WAL in its entirety (a full page write, or FPW) to prevent bad block (torn page) problems.

With FPW, bad blocks can be repaired and online copying (such as pg_basebackup) is supported: if only part of a block is copied, FPW can repair it.

However, FPW leads to much larger WAL volumes, especially in update-heavy workloads. The effect is even more pronounced when checkpoints are frequent.

The higher the checkpoint frequency, the fewer WAL records need to be replayed for recovery, and the faster recovery is; database crash recovery, for example, becomes very fast. Therefore, there is a contradiction:

  • Lengthening the checkpoint period reduces the performance impact of enabling FPW but leads to a longer recovery time.
  • Shortening the checkpoint period reduces the recovery time but causes a large number of full-page images to be written into the WAL, hurting database write performance.
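
The WAL inflation from FPW is easy to observe directly. Since PostgreSQL 14, the pg_stat_wal view counts full-page images separately (wal_fpi). A minimal check, assuming a running PostgreSQL 14+ instance with a psql connection configured:

psql -c "select pg_stat_reset_shared('wal');"    # reset cumulative WAL statistics
# ... run a write-heavy workload for a while ...
psql -c "select wal_records, wal_fpi, pg_size_pretty(wal_bytes) as wal_bytes from pg_stat_wal;"

Immediately after each checkpoint, wal_fpi climbs sharply, because every page touched for the first time is logged in full.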

PolarDB supports bad block repair through its standby and can therefore disable FPW altogether. This resolves the contradiction above and significantly improves performance. At the same time:

  • Oracle uses a similar method, but it relies on a checksum to judge whether a page is damaged. This has a disadvantage: a damaged block can carry the same checksum as a normal block, so the damage goes undetected and becomes permanent. Why can this happen? The checksum of an 8 KB block is only 8 bytes, so hash collisions are possible: pages with different contents can produce the same checksum.
  • In PolarDB, every page that is modified for the first time is pulled from the hot standby, so this problem does not exist.


With high-frequency checkpoints, disabling FPW can improve write workload performance by about 30%.
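
In stock PostgreSQL, full_page_writes can be changed with a configuration reload, no restart needed, which makes A/B comparisons like the one below easy. A sketch of the toggle:

psql -c "alter system set full_page_writes = off;"
psql -c "select pg_reload_conf();"
psql -c "show full_page_writes;"    # confirm the new value took effect

Set it back to on the same way before the FPW-enabled run.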

Comparison Test

Test environment: 8 cores, 64 GB of memory, and a 2 TB NVMe SSD:

rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm       
      
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex  openjade bzip2 git iotop lvm2 perf centos-release-scl       
  
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
      
yum install -y postgresql14*   
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%  
  
mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -b 4096 -L lv01     
vi /etc/fstab             
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0         
mkdir /data01          
        
mount -a      
vi /etc/sysctl.conf  
  
# add by digoal.zhou          
fs.aio-max-nr = 1048576          
fs.file-max = 76724600          
          
# Options: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p                   
# The /data01/corefiles directory that is used to store core dumps is created with the 777 permission before testing. If a symbolic link is used, change the corresponding directory to 777.          
          
kernel.sem = 4096 2147483647 2147483646 512000              
# Specify semaphore limits. You can run ipcs -l or ipcs -u to view current usage. Each set of 16 PostgreSQL processes requires a semaphore set containing 17 semaphores.          
          
kernel.shmall = 107374182                
# Specify the total size of shared memory segments. Recommended value: 80% of the memory capacity. Unit: pages.          
kernel.shmmax = 274877906944             
# Specify the maximum size of a single shared memory segment. Recommended value: 50% of the memory capacity. Unit: bytes. In PostgreSQL versions later than 9.2, the use of shared memory significantly drops.          
kernel.shmmni = 819200                   
# Specify the total number of shared memory segments that can be generated. There are at least 2 shared memory segments within each PostgreSQL cluster.          
          
net.core.netdev_max_backlog = 10000          
net.core.rmem_default = 262144                 
# The default setting of the socket receive buffer in bytes.          
net.core.rmem_max = 4194304                    
# The maximum receive socket buffer size in bytes          
net.core.wmem_default = 262144                 
# The default setting (in bytes) of the socket send buffer.          
net.core.wmem_max = 4194304                    
# The maximum send socket buffer size in bytes.          
net.core.somaxconn = 4096          
net.ipv4.tcp_max_syn_backlog = 4096          
net.ipv4.tcp_keepalive_intvl = 20          
net.ipv4.tcp_keepalive_probes = 3          
net.ipv4.tcp_keepalive_time = 60          
net.ipv4.tcp_mem = 8388608 12582912 16777216          
net.ipv4.tcp_fin_timeout = 5          
net.ipv4.tcp_synack_retries = 2          
net.ipv4.tcp_syncookies = 1              
# Enable SYN cookies. If the SYN backlog queue overflows, SYN cookies can defend against a small number of SYN attacks.          
net.ipv4.tcp_timestamps = 1              
# Reduce time_wait.          
net.ipv4.tcp_tw_recycle = 0              
# If you set this parameter to 1, sockets in the TIME-WAIT state over TCP connections are recycled. However, if network address translation (NAT) is used, TCP connections may fail. We recommend that you set this parameter to 0 on the database server.          
net.ipv4.tcp_tw_reuse = 1                
# Enable the reuse function. This function enables network sockets in the TIME-WAIT state to be reused over new TCP connections.          
net.ipv4.tcp_max_tw_buckets = 262144          
net.ipv4.tcp_rmem = 8192 87380 16777216          
net.ipv4.tcp_wmem = 8192 65536 16777216          
          
net.nf_conntrack_max = 1200000          
net.netfilter.nf_conntrack_max = 1200000          
          
vm.dirty_background_bytes = 409600000                 
# When the amount of dirty page cache in the system reaches this value, the background writeback process (pdflush, or its successors) automatically flushes to disk the dirty pages that are older than (dirty_expire_centisecs/100) seconds.          
# The default limit is 10% of the memory capacity. We recommend specifying the limit in bytes on machines with a large memory capacity.          
          
vm.dirty_expire_centisecs = 3000   
  
# Specify the maximum period to retain dirty pages. Dirty pages are flushed to disks after the time period specified by this parameter elapses. The value 3000 indicates 30 seconds.          
vm.dirty_ratio = 95                                    
# If background writeback is too slow and dirty pages come to exceed 95% of memory, processes that write data must flush dirty pages to disk themselves during calls such as fsync and fdatasync.          
# Set this parameter high enough to prevent user processes from being forced into synchronous flushing; this is very effective when a single machine hosts multiple instances and cgroups are used to limit the IOPS of each instance.            
          
vm.dirty_writeback_centisecs = 100                      
# Specify the time interval at which the background scheduling process (such as pdflush or other processes) flushes dirty pages to disks. The value 100 indicates 1 second.          
          
vm.swappiness = 0          
# Disable the swap function.          
          
vm.mmap_min_addr = 65536          
vm.overcommit_memory = 0               
# Controls memory overcommit. The default value 0 lets the kernel use a heuristic that allows allocating somewhat more memory than malloc has reserved. If you set this parameter to 1, the system always considers the available memory sufficient. If the test environment has little memory, we recommend setting it to 1.            
          
vm.overcommit_ratio = 90               
# Specify the memory capacity that can be allocated when the overcommit_memory parameter is set to 2.          
vm.zone_reclaim_mode = 0               
# Disable NUMA memory reclaim (zone reclaim). You can also disable NUMA itself in the kernel boot options.           
net.ipv4.ip_local_port_range = 40000 65535              
# Specify the range of TCP or UDP port numbers that are automatically allocated locally.          
fs.nr_open=20480000          
# Specify the maximum number of file handles that a single process can open.          
          
# Take note of the following parameters:          
# vm.extra_free_kbytes = 4096000          
# vm.min_free_kbytes = 2097152  # We recommend setting vm.min_free_kbytes to 1 GB for every 32 GB of memory.         
# If the physical host does not provide much memory, we recommend that you do not configure vm.extra_free_kbytes and vm.min_free_kbytes.          
# vm.nr_hugepages = 66536              
# If the size of the shared buffer exceeds 64 GB, we recommend that you use huge pages. You can specify the page size by setting the Hugepagesize parameter in the /proc/meminfo file.          
# vm.lowmem_reserve_ratio = 1 1 1          
# If the memory capacity exceeds 64 GB, we recommend that you set this parameter. Otherwise, we recommend that you retain the default value 256 256 32.   
sysctl -p  
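
A quick sanity check that the reloaded values took effect (the outputs should match the file above):

sysctl -n kernel.shmmax fs.nr_open    # print the individual values
ipcs -l    # semaphore and shared memory limits as the kernel sees them
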
vi /etc/security/limits.d/20-nproc.conf    
      
# If nofile exceeds 1048576, the fs.nr_open of sysctl must be set to a larger value, and then you can continue to set nofile after sysctl takes effect.          
          
* soft    nofile  1024000          
* hard    nofile  1024000          
* soft    nproc   unlimited          
* hard    nproc   unlimited          
* soft    core    unlimited          
* hard    core    unlimited          
* soft    memlock unlimited          
* hard    memlock unlimited    
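
These limits apply only to new sessions; log in again (for example, su - postgres) and verify:

ulimit -n    # expect 1024000
ulimit -c    # expect unlimited
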
chmod +x /etc/rc.d/rc.local  
vi /etc/rc.local      
      
touch /var/lock/subsys/local      
           
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then          
   echo never > /sys/kernel/mm/transparent_hugepage/enabled          
fi      
  
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource      
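
The tsc clocksource keeps clock reads cheap, which matters for the per-statement latency reporting used later (pgbench -r). Once PostgreSQL is built (below), the bundled pg_test_timing tool can confirm the overhead:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource    # tsc should be listed
pg_test_timing -d 3    # with tsc, per-loop overhead is typically tens of nanoseconds
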
useradd postgres  
su - postgres  
  
wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2  
  
tar -jxvf postgresql-snapshot.tar.bz2  
cd postgresql-15devel  
./configure --prefix=/home/postgres/pg15  
make world -j 16  
make install-world   
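
A quick check that the build and installation succeeded (paths follow the --prefix above):

/home/postgres/pg15/bin/pg_config --version    # expect: PostgreSQL 15devel
/home/postgres/pg15/bin/psql --version
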
cd ~  
vi .bash_profile  
  
# Append the following parameters:        
export PS1="$USER@`/bin/hostname -s`-> "          
export PGPORT=1921          
export PGDATA=/data01/pg15_$PGPORT/pg_root         
        
export LANG=en_US.utf8          
export PGHOME=/home/postgres/pg15  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH          
export DATE=`date +"%Y%m%d%H%M"`          
export PATH=$PGHOME/bin:$PATH:.          
export MANPATH=$PGHOME/share/man:$MANPATH          
export PGHOST=$PGDATA          
export PGUSER=postgres          
export PGDATABASE=postgres          
alias rm='rm -i'          
alias ll='ls -lh'          
unalias vi  
su - root  
mkdir /data01/pg15_1921  
chown postgres:postgres /data01/pg15_1921  
su - postgres  
  
initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8  
cd $PGDATA  
  
vi postgresql.conf  
  
listen_addresses = '0.0.0.0'          
port = 1921                  
max_connections = 1000              
superuser_reserved_connections = 13      
unix_socket_directories = '., /tmp'      
tcp_keepalives_idle = 60          
tcp_keepalives_interval = 10          
tcp_keepalives_count = 6          
shared_buffers = 16GB              
maintenance_work_mem = 1GB          
dynamic_shared_memory_type = posix      
vacuum_cost_delay = 0              
bgwriter_delay = 10ms              
bgwriter_lru_maxpages = 1000          
bgwriter_lru_multiplier = 5.0          
effective_io_concurrency = 0          
max_parallel_workers_per_gather = 0      
wal_level = replica              
fsync = on                  
synchronous_commit = on          
full_page_writes = on              
wal_writer_delay = 10ms          
wal_writer_flush_after = 1MB          
max_wal_size = 1GB    # A small value forces high-frequency checkpoints, which makes the impact of FPW obvious  
min_wal_size = 80MB  
random_page_cost = 1.1              
effective_cache_size = 64GB  
log_destination = 'csvlog'          
logging_collector = on          
log_truncate_on_rotation = on          
log_checkpoints = on  
log_timezone = 'Asia/Shanghai'  
autovacuum_vacuum_cost_delay = 0ms      
vacuum_freeze_table_age = 750000000  
vacuum_multixact_freeze_table_age = 750000000  
datestyle = 'iso, mdy'  
timezone = 'Asia/Shanghai'  
lc_messages = 'en_US.utf8'              
lc_monetary = 'en_US.utf8'              
lc_numeric = 'en_US.utf8'              
lc_time = 'en_US.utf8'                  
default_text_search_config = 'pg_catalog.english'  
pg_ctl start  
pgbench -i -s 5000  
pgbench -M prepared -n -r -P 1 -c 16 -j 16 -T 120  
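
With max_wal_size = 1GB, checkpoints during the run are triggered mostly by WAL volume rather than by timeout. To confirm the checkpoint frequency, query the statistics from another session while the benchmark runs (in this version the counters live in pg_stat_bgwriter), or read the server log, since log_checkpoints = on:

psql -c "select checkpoints_timed, checkpoints_req from pg_stat_bgwriter;"

A fast-growing checkpoints_req means each test run spans many checkpoint cycles, which is exactly the regime where FPW costs the most.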

TPC-B Read/Write Performance with Full Page Writes Enabled

transaction type: <builtin: TPC-B (sort of)>  
scaling factor: 5000  
query mode: prepared  
number of clients: 16  
number of threads: 16  
duration: 120 s  
number of transactions actually processed: 1210084  
latency average = 1.586 ms  
latency stddev = 1.181 ms  
initial connection time = 8.439 ms  
tps = 10072.852551 (without initial connection time)  
statement latencies in milliseconds:  
         0.001  \set aid random(1, 100000 * :scale)  
         0.000  \set bid random(1, 1 * :scale)  
         0.000  \set tid random(1, 10 * :scale)  
         0.000  \set delta random(-5000, 5000)  
         0.077  BEGIN;  
         0.681  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
         0.111  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
         0.117  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
         0.107  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
         0.091  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
         0.399  END;  

TPC-B Read/Write Performance with Full Page Writes Disabled

transaction type: <builtin: TPC-B (sort of)>  
scaling factor: 5000  
query mode: prepared  
number of clients: 16  
number of threads: 16  
duration: 120 s  
number of transactions actually processed: 1569385  
latency average = 1.223 ms  
latency stddev = 0.970 ms  
initial connection time = 9.154 ms  
tps = 13070.045981 (without initial connection time)  
statement latencies in milliseconds:  
         0.001  \set aid random(1, 100000 * :scale)  
         0.000  \set bid random(1, 1 * :scale)  
         0.000  \set tid random(1, 10 * :scale)  
         0.000  \set delta random(-5000, 5000)  
         0.075  BEGIN;  
         0.402  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
         0.113  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
         0.114  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
         0.111  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
         0.096  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
         0.310  END;  

After FPW is turned off, performance improves by about 30%: throughput rises from 10,072.85 TPS to 13,070.05 TPS, i.e. (13070 − 10073) / 10073 ≈ 29.8%. The largest gains are in the UPDATE on pgbench_accounts (0.681 ms → 0.402 ms) and in commit latency (0.399 ms → 0.310 ms).
