×
Community Blog Open-Source PolarDB for PostgreSQL on Shared Storage Block Device: Multi-Machine Deployment Practice

Open-Source PolarDB for PostgreSQL on Shared Storage Block Device: Multi-Machine Deployment Practice

This article describes the deployment of open-source PolarDB for PostgreSQL in an environment with shared storage and multiple compute nodes.

By digoal

Background

This article describes the deployment of open-source PolarDB for PostgreSQL in an environment with shared storage and multiple compute nodes.

The shared storage adopts Network Block Device (NBD). This environment is relatively easy to deploy. You can build it as long as you have a server, lowering the threshold for learning PolarDB for PG. However, it is not recommended to use it in the production environment. you are advised to use commercial SAN or commercial distributed shared block storage in the production environment. The setup method is similar to the content in this article. Skip the NBD deployment and replace the NBD with another shared block device.

Environment:

8 cores, 32 GB  
  
ecs shared storage, 3 network shared disks, 1   
172.25.9.67  
  
1 ecs rw   
172.25.9.68  
  
2 ecs ro   
172.25.9.69  
172.25.9.70  

ECS needs to be able to connect to the public network. The installation process includes some installation operations, such as obtaining cpan and yum.

The open-source address of PolarDB for PostgreSQL: https://github.com/ApsaraDB/PolarDB-for-PostgreSQL

1. Deploy OS (All ECS Instances)

2. Deploy NBD Software (All ECS Instances)

3. Export Shared NBD (ECS Shared Storage)

Please refer to Network Block Device for Testing RAC and Shared Storage Version of PolarDB for PostgreSQL for the three sections above.

Check the block devices used for NBD: vdb, vdc, and vdd

[root@iZbp10sz66ubwpqzg2ry5gZ ~]# lsblk  
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT  
vda    253:0    0  100G  0 disk   
└─vda1 253:1    0  100G  0 part /  
vdb    253:16   0  100G  0 disk   
vdc    253:32   0  100G  0 disk   
vdd    253:48   0  100G  0 disk   

Write the NBD server configuration file. Note: There must be no space at the end of the normal configuration line of nbd.conf. Otherwise, the startup will fail.

# vi /root/nbd.conf  
  
# This is a comment  
[generic]
    # The [generic] section is required, even if nothing is specified  
    # there.  
    # When either of these options are specified, nbd-server drops  
    # privileges to the given user and group after opening ports, but  
    # _before_ opening files.  
    # user = nbd  
    # group = nbd  
    listenaddr = 0.0.0.0
    port = 1921
[export1]
    exportname = /dev/vdb
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true
[export2]
    exportname = /dev/vdc
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true
[export3]
    exportname = /dev/vdd
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true

Start nbd-server:

# nbd-server -C /root/nbd.conf    
# netstat -anp|grep 1921  
tcp        0      0 0.0.0.0:1921            0.0.0.0:*               LISTEN      1296/nbd-server   

4. Mount Shared NBD (rw, ro ECS)

Please refer to Network Block Device for Testing RAC and Shared Storage Version of PolarDB for PostgreSQL.

After restarting the server, perform the command-modprobe nbd to load the module and then mount NBD:

nbd-client 172.25.9.67 1921 -N export1 /dev/nbd0     
nbd-client 172.25.9.67 1921 -N export2 /dev/nbd1   
nbd-client 172.25.9.67 1921 -N export3 /dev/nbd2    
[root@iZbp13tgwor95f2508zo4oZ ~]# nbd-client 172.25.9.67 1921 -N export1 /dev/nbd0     
Negotiation: ..size = 102400MB  
bs=1024, sz=107374182400 bytes  
[root@iZbp13tgwor95f2508zo4oZ ~]# nbd-client 172.25.9.67 1921 -N export2 /dev/nbd1   
Negotiation: ..size = 102400MB  
bs=1024, sz=107374182400 bytes  
[root@iZbp13tgwor95f2508zo4oZ ~]# nbd-client 172.25.9.67 1921 -N export3 /dev/nbd2    
Negotiation: ..size = 102400MB  
bs=1024, sz=107374182400 bytes  
vi /etc/rc.local  
  
  
nbd-client 172.25.9.67 1921 -N export1 /dev/nbd0     
nbd-client 172.25.9.67 1921 -N export2 /dev/nbd1     
nbd-client 172.25.9.67 1921 -N export3 /dev/nbd2     
  
#nbd-client -d /dev/nbd0  
#nbd-client -d /dev/nbd1  
#nbd-client -d /dev/nbd2  

5. Install PolarDB File System (PFS) Software (rw, ro ECS)

https://github.com/ApsaraDB/PolarDB-for-PostgreSQL/blob/POLARDB_11_STABLE/README.md

https://github.com/ApsaraDB/PolarDB-FileSystem/blob/master/Readme.md

yum install -y cmake3 gcc gcc-c++ libaio-devel git unzip    
  
ln -s /usr/bin/cmake3 /usr/bin/cmake  

https://github.com/HardySimpson/zlog/releases

wget https://github.com/HardySimpson/zlog/archive/refs/tags/1.2.15.tar.gz  
  
tar -zxvf 1.2.15.tar.gz  
  
cd zlog-1.2.15/  
  
make  
  
make install  
  
echo "/usr/local/lib" >> /etc/ld.so.conf  
  
ldconfig
ldconfig -v|grep zlog  
 ... ...   
    libzlog.so.1.2 -> libzlog.so.1.2  
su - root

cd ~
  
# Note the branch. We recommend that you use the stable branch or the release branch. For more information, see the corresponding documentation.

git clone https://github.com/ApsaraDB/PolarDB-FileSystem  
  
cd PolarDB-FileSystem  
  
./autobuild.sh   
... ...  
[100%] Linking CXX executable ../../../bin/pfsdaemon  
[100%] Built target pfsdaemon  
~/PolarDB-FileSystem-master  
 end compile, binary's in ./bin, library's in ./lib   
  
./install.sh  
install pfsd success!  

6. Configure PFS (rw, ro ECS)

There are different paths to different block devices, and different directories can also be created in the same block device. This way, block devices are mapped to directories, thus mapping to the tablespace of the database.

Block Device Renaming

PFS only supports block devices that start with specific characters for access. We recommend that all block device access nodes share the same name to access shared block devices through a soft link.

Execute on the NBD client host:

ln -s /dev/nbd0 /dev/nvme0n1  
ln -s /dev/nbd1 /dev/nvme0n2  
ln -s /dev/nbd2 /dev/nvme0n3  

Block Device Initialization

Note: Only perform PFS operations on RW nodes to format shared block devices:

pfs -C disk mkfs nvme0n1  
pfs -C disk mkfs nvme0n2  
pfs -C disk mkfs nvme0n3  
  
...  
pfs mkfs succeeds!  

Block Device Mounting

Start PFS and mount shared disks on RW and RO nodes, respectively:

/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n1  
/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n2  
/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n3  
  
...  ...    
pfsdaemon nvme0n1 start success  
pfsdaemon nvme0n2 start success  
pfsdaemon nvme0n3 start success  
vi /etc/rc.local  
  
/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n1  
/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n2  
/usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme0n3  
  
# /usr/local/polarstore/pfsd/bin/stop_pfsd.sh nvme0n1  
# /usr/local/polarstore/pfsd/bin/stop_pfsd.sh nvme0n2  
# /usr/local/polarstore/pfsd/bin/stop_pfsd.sh nvme0n3  

7. Install PolarDB Software (rw, ro ECS)

Deployment Guide for Compute-Storage Separation Architecture on PolarDB for PostgreSQL Open-Source Version

su - root  
  
yum install -y readline-devel zlib-devel perl-CPAN bison flex git       
      
cpan -fi Test::More IPC::Run  
useradd polardb    
      
su - polardb      
      
cd ~      
  
wget https://github.com/ApsaraDB/PolarDB-for-PostgreSQL/archive/refs/heads/POLARDB_11_STABLE.zip     
  
unzip POLARDB_11_STABLE.zip
  
cd PolarDB-for-PostgreSQL-POLARDB_11_STABLE  
  
  
./polardb_build.sh --noinit --with-pfsd  

8. Initialize the rw Node (rw ECS)

su - polardb  

cd ~
  
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D primary -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8 -U polardb  
  
# shared storage initialization  
su - root  
pfs -C disk mkdir /nvme0n1/shared_data  

# This step is also executed with root.
/home/polardb/tmp_basedir_polardb_pg_1100_bld/bin/polar-initdb.sh /home/polardb/primary/ /nvme0n1/shared_data/  

Node Configuration:

su – polardb  
cd ~/primary  # Note  

Open postgresql.conf and add the following configuration items:

listen_addresses = '0.0.0.0'          
port = 5432                  
max_connections = 1000              
unix_socket_directories = '., /tmp'      
tcp_keepalives_idle = 60          
tcp_keepalives_interval = 10          
tcp_keepalives_count = 10          
shared_buffers = 8GB              
maintenance_work_mem = 1GB          
dynamic_shared_memory_type = posix      
parallel_leader_participation = off  
random_page_cost = 1.1              
log_destination = 'csvlog'          
logging_collector = on          
log_directory = 'log'              
log_truncate_on_rotation = on          
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose    
log_lock_waits = on              
log_statement = 'ddl'              
log_timezone = 'Asia/Shanghai'  
log_autovacuum_min_duration = 0      
autovacuum_vacuum_cost_delay = 0ms      
datestyle = 'iso, mdy'  
timezone = 'Asia/Shanghai'  
lc_messages = 'en_US.UTF-8'              
lc_monetary = 'en_US.UTF-8'              
lc_numeric = 'en_US.UTF-8'              
lc_time = 'en_US.UTF-8'                  
default_text_search_config = 'pg_catalog.english'  
polar_hostid=1  # Note  
polar_enable_shared_storage_mode=on  
polar_disk_name='nvme0n1' # Note  
polar_datadir='/nvme0n1/shared_data/'  # Note  
polar_vfs.localfs_mode=off  
shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'  
polar_storage_cluster_name='disk'  
synchronous_standby_names='replica1, replica2'  # Note  

Open the pg_hba.conf and add the following configuration items:

host    replication    polardb     172.25.9.68/32    trust  
host    replication    polardb     172.25.9.69/32    trust  
host    replication    polardb     172.25.9.70/32    trust  
host all all 0.0.0.0/0 md5  

9. Start rw (rw ECS)

Start and Check:

su - polardb  

Start:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/primary  

Check:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -d postgres -c 'select version();'  
  
                                                 version                                                   
---------------------------------------------------------------------------------------------------------  
 PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit  
(1 row)  

Configure environment variables for ease of use:

su - polardb  
  
  
vi ~/.bashrc    
    
# Append the following parameters:            
export PS1="$USER@`/bin/hostname -s`-> "              
export PGPORT=5432              
export PGDATA=/home/polardb/primary          
            
export LANG=en_US.utf8              
export PGHOME=/home/polardb/tmp_basedir_polardb_pg_1100_bld    
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH              
export DATE=`date +"%Y%m%d%H%M"`              
export PATH=$PGHOME/bin:$PATH:.              
export MANPATH=$PGHOME/share/man:$MANPATH              
export PGHOST=127.0.0.1           
export PGUSER=polardb              
export PGDATABASE=postgres              
alias rm='rm -i'              
alias ll='ls -lh'              
unalias vi    

10. Stream Replication Preparation for Read-Only Nodes (rw ECS)

Create the corresponding replication slot for physical stream replication of the read-only node next:

su - polardb  
  
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -d postgres -c "select pg_create_physical_replication_slot('replica1');"  
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -d postgres -c "select pg_create_physical_replication_slot('replica2');"  

11. Create a tablespace (Corresponding to Different Paths of PFS) (rw ECS)

Currently, mapping multiple block devices to different tablespaces is not supported.

There are multiple disks, but how do you use them? Lvm2 logical volume can be used for management. It can be configured on the NBD server to make a large volume.

pvcreate  
vgcreate  
lvcreate  
lvextend  

Use PFS to view the contents of the shared storage after initialization:

[root@iZbp13tgwor95f2508zo4oZ ~]# pfs -C disk ls /nvme0n1/
  File  1     4194304           Mon Nov  8 14:21:58 2021  .pfs-paxos
  File  1     1073741824        Mon Nov  8 14:22:03 2021  .pfs-journal
   Dir  1     1280              Mon Nov  8 15:23:05 2021  shared_data
total 2105344 (unit: 512Bytes)
[root@iZbp13tgwor95f2508zo4oZ ~]# pfs -C disk ls /nvme0n1/shared_data
   Dir  1     512               Mon Nov  8 15:11:38 2021  base
   Dir  1     7424              Mon Nov  8 15:11:39 2021  global
   Dir  1     0                 Mon Nov  8 15:11:39 2021  pg_tblspc
   Dir  1     10368             Mon Nov  8 18:06:41 2021  pg_wal
   Dir  1     896               Mon Nov  8 18:06:41 2021  pg_logindex
   Dir  1     0                 Mon Nov  8 15:11:39 2021  pg_twophase
   Dir  1     512               Mon Nov  8 17:57:42 2021  pg_xact
   Dir  1     0                 Mon Nov  8 15:11:40 2021  pg_commit_ts
   Dir  1     256               Mon Nov  8 15:11:40 2021  pg_multixact
   Dir  1     512               Mon Nov  8 15:23:18 2021  polar_fullpage
total 0 (unit: 512Bytes)

12. Initialize the ro Node (ro ECS)

RO 1

su - polardb  

Node Initialization:

cd ~
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D replica1  # Note  

Node Configuration:

cd ~/replica1   # Note  

Open postgresql.conf and add the following configuration items:

listen_addresses = '0.0.0.0'          
port = 5432                  
max_connections = 1000              
unix_socket_directories = '., /tmp'      
tcp_keepalives_idle = 60          
tcp_keepalives_interval = 10          
tcp_keepalives_count = 10          
shared_buffers = 8GB              
maintenance_work_mem = 1GB          
dynamic_shared_memory_type = posix      
parallel_leader_participation = off  
random_page_cost = 1.1              
log_destination = 'csvlog'          
logging_collector = on          
log_directory = 'log'              
log_truncate_on_rotation = on          
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose    
log_lock_waits = on              
log_statement = 'ddl'              
log_timezone = 'Asia/Shanghai'  
log_autovacuum_min_duration = 0      
autovacuum_vacuum_cost_delay = 0ms      
datestyle = 'iso, mdy'  
timezone = 'Asia/Shanghai'  
lc_messages = 'en_US.UTF-8'              
lc_monetary = 'en_US.UTF-8'              
lc_numeric = 'en_US.UTF-8'              
lc_time = 'en_US.UTF-8'                  
default_text_search_config = 'pg_catalog.english'  
polar_hostid=2    # Note  
polar_enable_shared_storage_mode=on  
polar_disk_name='nvme0n1' # Note 
polar_datadir='/nvme0n1/shared_data/' # Note 
polar_vfs.localfs_mode=off  
shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'  
polar_storage_cluster_name='disk'  

Open the pg_hba.conf and add the following configuration items:

host    replication    polardb     172.25.9.68/32    trust  
host    replication    polardb     172.25.9.69/32    trust  
host    replication    polardb     172.25.9.70/32    trust  
host all all 0.0.0.0/0 md5  

Create recovery.conf and add the following configuration items:

polar_replica='on'  
recovery_target_timeline='latest'  
primary_slot_name='replica1'  # Note  
primary_conninfo='host=172.25.9.68 port=5432 user=polardb dbname=postgres application_name=replica1'  # Note  

Configure environment variables for ease of use:

su - polardb  
  
  
vi ~/.bashrc    
    
# Append the following parameters:            
export PS1="$USER@`/bin/hostname -s`-> "              
export PGPORT=5432              
export PGDATA=/home/polardb/replica1        # Note  
            
export LANG=en_US.utf8              
export PGHOME=/home/polardb/tmp_basedir_polardb_pg_1100_bld    
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH              
export DATE=`date +"%Y%m%d%H%M"`              
export PATH=$PGHOME/bin:$PATH:.              
export MANPATH=$PGHOME/share/man:$MANPATH              
export PGHOST=127.0.0.1           
export PGUSER=polardb              
export PGDATABASE=postgres              
alias rm='rm -i'              
alias ll='ls -lh'              
unalias vi    

RO 2

Node Initialization:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/initdb -D replica2    # Note  

Node Configuration:

cd ~/replica2 # Note  

Open postgresql.conf and add the following configuration items:

listen_addresses = '0.0.0.0'          
port = 5432                  
max_connections = 1000              
unix_socket_directories = '., /tmp'      
tcp_keepalives_idle = 60          
tcp_keepalives_interval = 10          
tcp_keepalives_count = 10          
shared_buffers = 8GB              
maintenance_work_mem = 1GB          
dynamic_shared_memory_type = posix      
parallel_leader_participation = off  
random_page_cost = 1.1              
log_destination = 'csvlog'          
logging_collector = on          
log_directory = 'log'              
log_truncate_on_rotation = on          
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose    
log_lock_waits = on              
log_statement = 'ddl'              
log_timezone = 'Asia/Shanghai'  
log_autovacuum_min_duration = 0      
autovacuum_vacuum_cost_delay = 0ms      
datestyle = 'iso, mdy'  
timezone = 'Asia/Shanghai'  
lc_messages = 'en_US.UTF-8'              
lc_monetary = 'en_US.UTF-8'              
lc_numeric = 'en_US.UTF-8'              
lc_time = 'en_US.UTF-8'                  
default_text_search_config = 'pg_catalog.english'  
polar_hostid=3  # Note  
polar_enable_shared_storage_mode=on  
polar_disk_name='nvme0n1' # Note 
polar_datadir='/nvme0n1/shared_data/' # Note  
polar_vfs.localfs_mode=off  
shared_preload_libraries='$libdir/polar_vfs,$libdir/polar_worker'  
polar_storage_cluster_name='disk'  

Open the pg_hba.conf and add the following configuration items:

host    replication    polardb     172.25.9.68/32    trust  
host    replication    polardb     172.25.9.69/32    trust  
host    replication    polardb     172.25.9.70/32    trust  
host all all 0.0.0.0/0 md5  

Create recovery.conf and add the following configuration items:

polar_replica='on'  
recovery_target_timeline='latest'  
primary_slot_name='replica2'  # Note  
primary_conninfo='host=172.25.9.68 port=5432 user=polardb dbname=postgres application_name=replica2'  # Note  

Configure environment variables for ease of use:

su - polardb  
  
  
vi ~/.bashrc    
    
# Append the following parameters:            
export PS1="$USER@`/bin/hostname -s`-> "              
export PGPORT=5432              
export PGDATA=/home/polardb/replica2        # Note  
            
export LANG=en_US.utf8              
export PGHOME=/home/polardb/tmp_basedir_polardb_pg_1100_bld    
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH              
export DATE=`date +"%Y%m%d%H%M"`              
export PATH=$PGHOME/bin:$PATH:.              
export MANPATH=$PGHOME/share/man:$MANPATH              
export PGHOST=127.0.0.1           
export PGUSER=polardb              
export PGDATABASE=postgres              
alias rm='rm -i'              
alias ll='ls -lh'              
unalias vi    

13. Start the ro Node (ro ECS)

RO 1

Start and Check:

su - polardb  

Start:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica1  

Check:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -d postgres -c 'select version();'  

RO 2

Start and Check:

su - polardb  

Start:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl start -D $HOME/replica2  

Check:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -d postgres -c 'select version();'  

14. Database Usage Test (rw, ro ECS)

Instance Check and Test

After the deployment, you need to check and test the instance to ensure the master node supports data writing and the read-only node supports data reading.

Log on to the RW master node, create a test table, and insert sample data:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q -p 5432 -d postgres -c "create table t(t1 int primary key, t2 int);insert into t values (1, 1),(2, 3),(3, 3);"  

Log on to any RO read-only node and query the sample data that was inserted:

$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -q -p 5432 -d postgres -c "select * from t;"  
  
 t1 | t2   
----+----  
  1 |  1  
  2 |  3  
  3 |  3  
(3 rows)  

15. Other Checks

RW Replication Status Check:

postgres=# select * from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
pid              | 17361  
usesysid         | 10  
usename          | polardb  
application_name | replica1  
client_addr      | 172.25.9.69  
client_hostname  |   
client_port      | 56684  
backend_start    | 2021-11-08 15:34:04.711213+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 0/166A090  
write_lsn        | 0/166A090  
flush_lsn        | 0/166A090  
replay_lsn       | 0/166A090  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | sync  
-[ RECORD 2 ]----+------------------------------  
pid              | 17363  
usesysid         | 10  
usename          | polardb  
application_name | replica2  
client_addr      | 172.25.9.70  
client_hostname  |   
client_port      | 45858  
backend_start    | 2021-11-08 15:34:30.442495+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 0/166A090  
write_lsn        | 0/166A090  
flush_lsn        | 0/166A090  
replay_lsn       | 0/166A090  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 2  
sync_state       | potential 
postgres=# select * from pg_replication_slots ;  
-[ RECORD 1 ]-------+----------  
slot_name           | replica1  
plugin              |   
slot_type           | physical  
datoid              |   
database            |   
temporary           | f  
active              | t  
active_pid          | 17361  
xmin                |   
catalog_xmin        |   
restart_lsn         | 0/1669C78  
confirmed_flush_lsn |   
-[ RECORD 2 ]-------+----------  
slot_name           | replica2  
plugin              |   
slot_type           | physical  
datoid              |   
database            |   
temporary           | f  
active              | t  
active_pid          | 17363  
xmin                |   
catalog_xmin        |   
restart_lsn         | 0/1669C78  
confirmed_flush_lsn |   

Stress Testing (before optimization):

pgbench -i -s 100  

The read-only stress testing is performed on all nodes at the same time.

pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 120 -S  

The reading/writing stress testing is performed on the RW node.

pgbench -M prepared -n -r -P 1 -c 8 -j 8 -T 120   

Query throughput is basically a linear increase with the increase in the number of nodes (because the data is in memory).

Stress Testing Process

RW Node:

Reading/Writing Stress Testing:

pgbench -M prepared -n -r -P 1 -c 8 -j 8 -T 1200   
  
progress: 192.0 s, 0.0 tps, lat 0.000 ms stddev 0.000  
progress: 193.0 s, 0.0 tps, lat 0.000 ms stddev 0.000  
progress: 194.0 s, 559.0 tps, lat 58.408 ms stddev 456.270  
progress: 195.0 s, 1616.7 tps, lat 4.959 ms stddev 18.420  
progress: 196.0 s, 2153.2 tps, lat 3.709 ms stddev 1.102  
progress: 197.0 s, 646.0 tps, lat 3.635 ms stddev 1.042  
progress: 198.0 s, 0.0 tps, lat 0.000 ms stddev 0.000  
progress: 199.0 s, 0.0 tps, lat 0.000 ms stddev 0.000  
progress: 200.0 s, 283.1 tps, lat 104.779 ms stddev 595.861  
progress: 201.0 s, 2214.0 tps, lat 3.620 ms stddev 1.123  
progress: 202.0 s, 2153.0 tps, lat 3.709 ms stddev 1.096  
progress: 203.0 s, 2377.8 tps, lat 3.369 ms stddev 0.977  
progress: 204.0 s, 2313.3 tps, lat 3.460 ms stddev 0.987  
progress: 205.0 s, 2329.9 tps, lat 3.429 ms stddev 0.976  
progress: 206.0 s, 2283.7 tps, lat 3.508 ms stddev 1.052  
progress: 207.0 s, 2098.3 tps, lat 3.809 ms stddev 1.895  
progress: 208.0 s, 2340.1 tps, lat 3.417 ms stddev 0.977  
progress: 209.0 s, 2052.9 tps, lat 3.902 ms stddev 6.712  
progress: 210.0 s, 2346.1 tps, lat 3.408 ms stddev 0.965  
progress: 211.0 s, 2316.9 tps, lat 3.452 ms stddev 0.994  

RO Node 1, 2:

Read-Only Stress Testing:

pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 1200 -S  
  
  
progress: 254.0 s, 30921.5 tps, lat 1.034 ms stddev 0.892  
progress: 255.0 s, 32351.0 tps, lat 0.990 ms stddev 1.660  
progress: 256.0 s, 33540.0 tps, lat 0.953 ms stddev 0.859  
progress: 257.0 s, 33027.0 tps, lat 0.971 ms stddev 1.040  
progress: 258.0 s, 32791.1 tps, lat 0.976 ms stddev 0.631  
progress: 259.0 s, 32839.6 tps, lat 0.975 ms stddev 1.837  
progress: 260.0 s, 33539.4 tps, lat 0.954 ms stddev 0.527  
progress: 261.0 s, 34344.5 tps, lat 0.932 ms stddev 0.984  
progress: 262.0 s, 32383.9 tps, lat 0.988 ms stddev 0.618  
progress: 263.0 s, 33186.0 tps, lat 0.964 ms stddev 0.512  
progress: 264.0 s, 33253.3 tps, lat 0.962 ms stddev 0.497  
progress: 265.0 s, 32584.0 tps, lat 0.982 ms stddev 0.466  
progress: 266.0 s, 32959.8 tps, lat 0.967 ms stddev 1.310  
progress: 267.0 s, 32392.0 tps, lat 0.991 ms stddev 0.701  
progress: 268.0 s, 33307.8 tps, lat 0.961 ms stddev 0.505  
progress: 269.0 s, 33255.8 tps, lat 0.962 ms stddev 0.589  
progress: 270.0 s, 33994.6 tps, lat 0.941 ms stddev 0.449  
progress: 271.0 s, 34127.2 tps, lat 0.937 ms stddev 1.441  
pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 1200 -S  
  
progress: 253.0 s, 32289.2 tps, lat 0.991 ms stddev 0.621  
progress: 254.0 s, 32778.7 tps, lat 0.976 ms stddev 0.658  
progress: 255.0 s, 32314.9 tps, lat 0.984 ms stddev 1.320  
progress: 256.0 s, 31984.6 tps, lat 1.006 ms stddev 1.159  
progress: 257.0 s, 31889.4 tps, lat 1.004 ms stddev 0.679  
progress: 258.0 s, 32466.4 tps, lat 0.986 ms stddev 0.565  
progress: 259.0 s, 33572.2 tps, lat 0.953 ms stddev 0.517  
progress: 260.0 s, 33482.7 tps, lat 0.956 ms stddev 0.532  
progress: 261.0 s, 33047.4 tps, lat 0.968 ms stddev 0.487  
progress: 262.0 s, 32951.0 tps, lat 0.971 ms stddev 0.634  
progress: 263.0 s, 34039.1 tps, lat 0.940 ms stddev 1.068  
progress: 264.0 s, 33919.8 tps, lat 0.944 ms stddev 0.936  
progress: 265.0 s, 34062.7 tps, lat 0.940 ms stddev 0.648  
progress: 266.0 s, 31726.4 tps, lat 1.009 ms stddev 0.567  
progress: 267.0 s, 34335.7 tps, lat 0.932 ms stddev 1.252  
progress: 268.0 s, 33604.4 tps, lat 0.952 ms stddev 0.571  
progress: 269.0 s, 34043.3 tps, lat 0.940 ms stddev 0.673  
progress: 270.0 s, 33909.3 tps, lat 0.944 ms stddev 0.547  

Low Observation Delay:

postgres=# select * from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
pid              | 17361  
usesysid         | 10  
usename          | polardb  
application_name | replica1  
client_addr      | 172.25.9.69  
client_hostname  |   
client_port      | 56684  
backend_start    | 2021-11-08 15:34:04.711213+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 2/456D3BF8  
write_lsn        | 2/456D3BF8  
flush_lsn        | 2/456D3BF8  
replay_lsn       | 2/456D3BF8  
write_lag        | 00:00:00.000147  
flush_lag        | 00:00:00.000147  
replay_lag       | 00:00:00.000244  
sync_priority    | 1  
sync_state       | sync  
-[ RECORD 2 ]----+------------------------------  
pid              | 17363  
usesysid         | 10  
usename          | polardb  
application_name | replica2  
client_addr      | 172.25.9.70  
client_hostname  |   
client_port      | 45858  
backend_start    | 2021-11-08 15:34:30.442495+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 2/456D3BF8  
write_lsn        | 2/456D3BF8  
flush_lsn        | 2/456D3BF8  
replay_lsn       | 2/456D3BF8  
write_lag        | 00:00:00.000517  
flush_lag        | 00:00:00.000517  
replay_lag       | 00:00:00.00052  
sync_priority    | 2  
sync_state       | potential  

Analyze wait events:

The data is concentrated in wal writer. Since NBD server runs on TCP, it is normal if the delay is very high. On the other hand, TPS 0 indicates that I/O traffic is limited. This environment is built by the NBD server composed of ECS + cloud disk. The ECS network layer and cloud disk support traffic limitations, so TPS 0 may occur. You can create a local SSD server with good performance as an NBD server next time.

If there is a bottleneck in RT but not in IOBW throughput, group commit can be used to improve performance. If a bottleneck occurs in IOBW, there is no way to improve performance.

postgres=# select wait_event_type,wait_event,count(*) from pg_stat_activity group by 1,2 order by 3 desc;  
 wait_event_type |     wait_event      | count   
-----------------+---------------------+-------  
 LWLock          | WALWriteLock        |     8  
 Activity        | WalSenderMain       |     2  
 Activity        | CheckpointerMain    |     1  
 IO              | WALInitWrite        |     1  
 Activity        | AutoVacuumMain      |     1  
                 |                     |     1  
 IO              | VFSFileOpen         |     1  
 Activity        | LogicalLauncherMain |     1  
(8 rows)  

nbd server, observe network, and disk throughput:

dstat  
  
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--  
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw   
  0   2  80  17   0   0|   0    51M|  54M   33M|   0     0 |  61k   99k  
  0   2  79  18   0   0|   0    50M|  54M   35M|   0     0 |  63k  102k  
  0   1  89  10   0   0|   0    26M|  28M   36M|   0     0 |  45k   69k  
  0   1  92   7   0   0|   0    17M|  19M   31M|   0     0 |  36k   55k  
  0   1  92   7   0   0|   0    18M|  19M   28M|   0     0 |  36k   53k  
  
iostat -x 1  
  
avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
           0.13    0.00    0.77    7.17    0.00   91.93  
  
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util  
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00  
vdb               0.00  1142.00    0.00 6946.00     0.00 18460.00     5.32     0.55    0.08    0.00    0.08   0.14  99.20  
vdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00  
vdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00  

Monitor IO Performance:

su - root
pfsadm  mountstat nvme0n1

Open group commit:

// Sets the delay in microseconds between transaction commit and flushing WAL to disk.  
postgres=# alter role polardb set commit_delay ='10';  
ALTER ROLE  
  
// Sets the minimum concurrent open transactions before performing commit_delay.   
postgres=# alter role polardb set commit_siblings =5;  
ALTER ROLE  

Set the number of transaction processes in the reading and writing stress testing to 32 concurrent requests:

pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 1200   

There is no improvement because the IOBW is full.

After changing to the unlogged table, the performance of group commit with 32 concurrent requests is improved. (Note: Unlogged tables and temporary tables are not supported on RO nodes. Unlogged tables created by the master node cannot be queried on RO nodes.)

progress: 164.0 s, 4019.8 tps, lat 7.965 ms stddev 2.940
progress: 165.0 s, 1676.1 tps, lat 8.152 ms stddev 2.298
progress: 166.0 s, 4027.8 tps, lat 12.503 ms stddev 63.648
progress: 167.0 s, 4445.0 tps, lat 7.197 ms stddev 2.822
progress: 168.0 s, 4357.3 tps, lat 7.342 ms stddev 2.804
progress: 169.0 s, 4567.0 tps, lat 7.006 ms stddev 3.003
progress: 170.0 s, 4648.9 tps, lat 6.881 ms stddev 2.792
progress: 171.0 s, 4427.9 tps, lat 7.226 ms stddev 3.254
progress: 172.0 s, 4468.9 tps, lat 7.163 ms stddev 3.111
progress: 173.0 s, 4571.2 tps, lat 7.003 ms stddev 3.023
progress: 174.0 s, 4695.8 tps, lat 6.814 ms stddev 2.940
progress: 175.0 s, 4627.2 tps, lat 6.914 ms stddev 2.644
progress: 176.0 s, 4466.9 tps, lat 7.159 ms stddev 3.036
progress: 177.0 s, 4508.4 tps, lat 7.109 ms stddev 2.564
progress: 178.0 s, 4474.7 tps, lat 7.143 ms stddev 2.683
progress: 179.0 s, 4476.1 tps, lat 7.156 ms stddev 2.609
progress: 180.0 s, 4622.0 tps, lat 6.924 ms stddev 2.884
progress: 181.0 s, 4726.6 tps, lat 6.770 ms stddev 2.798
progress: 182.0 s, 4480.2 tps, lat 7.142 ms stddev 2.644
progress: 183.0 s, 2224.2 tps, lat 6.899 ms stddev 2.849
progress: 184.0 s, 3633.6 tps, lat 13.391 ms stddev 70.771
progress: 185.0 s, 4314.0 tps, lat 7.416 ms stddev 2.274
0 0 0
Share on

digoal

255 posts | 20 followers

You may also like

Comments