
Network Block Device for Testing RAC and Shared Storage Version of PolarDB for PostgreSQL

This article discusses the background of Network Block Devices (NBD) and deployment solutions.

By digoal


Network Block Device (NBD) is a cheap shared storage solution. NBD can be used as a lightweight shared storage test solution to deploy and learn Oracle RAC and PolarDB for PostgreSQL.

In addition to Transmission Control Protocol (TCP), NBD supports the Remote Direct Memory Access (RDMA)-based Sockets Direct Protocol (SDP), which can also be used to test the performance of RAC and PolarDB for PostgreSQL.

The open-source address of PolarDB for PostgreSQL: https://github.com/ApsaraDB/PolarDB-for-PostgreSQL



Build Oracle-RAC using NBD: http://www.fi.muni.cz/~kripac/orac-nbd/

There are some issues to be aware of, such as caching at the operating system level. When you fail over from one NBD server to another, any writes still cached against the image file on the old server are lost. This problem can be solved with the nbd-server sync export mode.
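
To make the failure mode concrete: the distinction is between a buffered write that may sit only in the server host's page cache and a synchronous write that reaches stable storage before returning, which is what the sync export mode asks of the backing file. A minimal Python sketch of the two open modes (the file path is illustrative, not part of the setup):

```python
import os
import tempfile

# Scratch "image file" standing in for an nbd-server backing file.
path = os.path.join(tempfile.mkdtemp(), "image.img")

# Buffered write: the data may sit in the page cache of the NBD server host;
# if that host crashes before the kernel flushes it, the write is lost.
fd = os.open(path, os.O_CREAT | os.O_WRONLY)
os.write(fd, b"buffered")
os.close(fd)

# Synchronous write: with O_SYNC, write() returns only after the data is on
# stable storage -- the behavior sync = true requests from nbd-server.
fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_SYNC)
os.write(fd, b"synced!!")
os.close(fd)

with open(path, "rb") as f:
    print(f.read())   # b'synced!!' -- the O_SYNC write overwrote the first one
```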

1. Deploy the Environment

One server (with two 100 GB disks, vdb and vdc) and two clients. Runtime environment: CentOS 7.9.

yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm        
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm        
yum install -y centos-release-scl  
yum install -y postgresql14*      

Kernel Parameter

vi /etc/sysctl.conf            
# add by digoal.zhou                
fs.aio-max-nr = 1048576                
fs.file-max = 76724600                
# Options: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p                         
# The /data01/corefiles directory that is used to store core dumps is created with the 777 permission before testing. If a symbolic link is used, change the corresponding directory to 777                
kernel.sem = 4096 2147483647 2147483646 512000                    
# Specify the semaphore. You can run the ipcs -l or -u command to obtain the semaphore count. Each group of 16 processes requires a semaphore with a count of 17.                
kernel.shmall = 107374182                      
# Specify the total size of shared memory segments. Recommended value: 80% of the memory capacity. Unit: pages.                
kernel.shmmax = 274877906944                   
# Specify the maximum size of a single shared memory segment. Recommended value: 50% of the memory capacity. Unit: bytes. In PostgreSQL versions later than 9.2, the use of shared memory significantly drops.                
kernel.shmmni = 819200                         
# Specify the total number of shared memory segments that can be generated. There are at least 2 shared memory segments within each PostgreSQL cluster.                
net.core.netdev_max_backlog = 10000                
net.core.rmem_default = 262144                       
# The default setting of the socket receive buffer in bytes.                
net.core.rmem_max = 4194304                          
# The maximum receive socket buffer size in bytes                
net.core.wmem_default = 262144                       
# The default setting (in bytes) of the socket send buffer.                
net.core.wmem_max = 4194304                          
# The maximum send socket buffer size in bytes.                
net.core.somaxconn = 4096                
net.ipv4.tcp_max_syn_backlog = 4096                
net.ipv4.tcp_keepalive_intvl = 20                
net.ipv4.tcp_keepalive_probes = 3                
net.ipv4.tcp_keepalive_time = 60                
net.ipv4.tcp_mem = 8388608 12582912 16777216                
net.ipv4.tcp_fin_timeout = 5                
net.ipv4.tcp_synack_retries = 2                
net.ipv4.tcp_syncookies = 1                    
# Enable SYN cookies. If an SYN waiting queue overflows, you can enable SYN cookies to defend against a small number of SYN attacks.                
net.ipv4.tcp_timestamps = 1                    
# Reduce time_wait.                
net.ipv4.tcp_tw_recycle = 0                    
# If you set this parameter to 1, sockets in the TIME-WAIT state over TCP connections are recycled. However, if network address translation (NAT) is used, TCP connections may fail. We recommend that you set this parameter to 0 on the database server.                
net.ipv4.tcp_tw_reuse = 1                      
# Enable the reuse function. This function enables network sockets in the TIME-WAIT state to be reused over new TCP connections.                
net.ipv4.tcp_max_tw_buckets = 262144                
net.ipv4.tcp_rmem = 8192 87380 16777216                
net.ipv4.tcp_wmem = 8192 65536 16777216                
net.nf_conntrack_max = 1200000                
net.netfilter.nf_conntrack_max = 1200000                
vm.dirty_background_bytes = 409600000                       
# When the amount of dirty pages in the system reaches this value, the background flush process (pdflush or its successors) automatically writes to disk the dirty pages that are older than (dirty_expire_centisecs/100) seconds.                
# The default limit is 10% of the memory capacity. We recommend that you specify the limit in bytes on machines with a large memory capacity.                
vm.dirty_expire_centisecs = 3000                             
# Specify the maximum period to retain dirty pages. Dirty pages are flushed to disks after the time period specified by this parameter elapses. The value 3000 indicates 30 seconds.                
vm.dirty_ratio = 95                                          
# If background flushing is too slow and dirty pages exceed 95% of memory, the processes that write data to disks must flush dirty pages synchronously themselves (via fsync, fdatasync, and so on).                
# Set this parameter properly to avoid such user-process flushing. This is very effective when a single machine hosts multiple instances and cgroups are used to limit the IOPS of a single instance.                  
vm.dirty_writeback_centisecs = 100                            
# Specify the time interval at which the background scheduling process (such as pdflush or other processes) flushes dirty pages to disks. The value 100 indicates 1 second.                
vm.swappiness = 0                
# Disable the swap partition                
vm.mmap_min_addr = 65536                
vm.overcommit_memory = 0                     
# Specify whether the system may commit more memory than is actually available when allocating. If you set this parameter to 1, the system always considers the available memory space sufficient. If the memory capacity provided in the test environment is low, we recommend that you set this parameter to 1.                  
vm.overcommit_ratio = 90                     
# Specify the memory capacity that can be allocated when the overcommit_memory parameter is set to 2.                
vm.zone_reclaim_mode = 0                     
# Disable non-uniform memory access (NUMA). You can also disable NUMA in the vmlinux file.                 
net.ipv4.ip_local_port_range = 40000 65535                    
# Specify the range of TCP or UDP port numbers that are automatically allocated locally.                
# Specify the maximum number of file handles that a single process can open.                
# Take note of the following parameters:                
# vm.extra_free_kbytes = 4096000                
# vm.min_free_kbytes = 6291456  # We recommend that you set vm.min_free_kbytes to 1 GB for every 32 GB of memory.               
# If the physical host does not provide much memory, we recommend that you do not configure vm.extra_free_kbytes and vm.min_free_kbytes.                
# vm.nr_hugepages = 66536                    
# If the size of the shared buffer exceeds 64 GB, we recommend that you use huge pages. You can specify the page size by setting the Hugepagesize parameter in the /proc/meminfo file.                
# vm.lowmem_reserve_ratio = 1 1 1                
# If the memory capacity exceeds 64 GB, we recommend that you set this parameter. Otherwise, we recommend that you retain the default value 256 256 32.            
# Take effect       
# sysctl -p        
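
As a cross-check of the shared-memory values above: they are consistent with a machine that has 512 GiB of RAM (an assumption inferred from the numbers, not stated in the article), with kernel.shmall at 80% of RAM in 4 KiB pages and kernel.shmmax at 50% of RAM in bytes:

```python
PAGE_SIZE = 4096                     # bytes per page on x86_64
ram_bytes = 512 * 1024**3            # assumed 512 GiB of RAM

shmall = int(ram_bytes * 0.8) // PAGE_SIZE   # 80% of RAM, in pages
shmmax = int(ram_bytes * 0.5)                # 50% of RAM, in bytes

print(shmall)   # 107374182    -> matches kernel.shmall above
print(shmmax)   # 274877906944 -> matches kernel.shmmax above
```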

Configure Limits

vi /etc/security/limits.conf            
# If nofile exceeds 1048576, the fs.nr_open of sysctl must be set to a larger value, and then you can continue to set nofile after sysctl takes effect.                
# Comment out the other lines and add the following:        
* soft    nofile  1024000                
* hard    nofile  1024000                
* soft    nproc   unlimited                
* hard    nproc   unlimited                
* soft    core    unlimited                
* hard    core    unlimited                
* soft    memlock unlimited                
* hard    memlock unlimited            

At the same time, modify the corresponding settings under /etc/security/limits.d (if any):


Disable Transparent Huge Pages (Optional)

echo never > /sys/kernel/mm/transparent_hugepage/enabled         

To make the configuration take effect permanently:

chmod +x /etc/rc.d/rc.local      
vi /etc/rc.local            
touch /var/lock/subsys/local            
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then                
   echo never > /sys/kernel/mm/transparent_hugepage/enabled                
fi

Modify the Clock (Optional)

vi /etc/rc.local      
touch /var/lock/subsys/local      
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then      
   echo never > /sys/kernel/mm/transparent_hugepage/enabled      
fi      
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource         

Supported clocks:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource         
kvm-clock tsc acpi_pm         

Modify the clock:

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource         

2. Deploy NBD

The NBD package can be used directly on the server, but the NBD kernel module must be compiled on the client (the CentOS 7 kernel may not ship a built-in NBD module by default).

yum install -y nbd  


[root@iZbp1eo3op9s5gxnvc7aokZ ~]# ifconfig  
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  
        inet  netmask  broadcast  
        inet6 fe80::216:3eff:fe00:851f  prefixlen 64  scopeid 0x20<link>  
        ether 00:16:3e:00:85:1f  txqueuelen 1000  (Ethernet)  
        RX packets 159932  bytes 229863288 (219.2 MiB)  
        RX errors 0  dropped 0  overruns 0  frame 0  
        TX packets 30124  bytes 3706650 (3.5 MiB)  
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0  
[root@iZbp1eo3op9s5gxnvc7aokZ ~]# lsblk  
vda    253:0    0  100G  0 disk   
└─vda1 253:1    0  100G  0 part /  
vdb    253:16   0  100G  0 disk   
vdc    253:32   0  100G  0 disk   

Write the nbd-server configuration file (see man 5 nbd-server for more information).

Note: No trailing spaces are allowed at the end of a configuration line, or parsing problems may occur.

vi /root/nbd.conf  
# This is a comment  
[generic]
    # The [generic] section is required, even if nothing is specified  
    # there.  
    # When either of these options are specified, nbd-server drops  
    # privileges to the given user and group after opening ports, but  
    # _before_ opening files.  
    # user = nbd  
    # group = nbd  
    listenaddr =
    port = 1921
[export1]
    exportname = /dev/vdb
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true
[export2]
    exportname = /dev/vdc
    readonly = false
    multifile = false
    copyonwrite = false
    flush = true
    fua = true
    sync = true

Start NBD-server:

[root@iZbp1eo3op9s5gxnvc7aokZ ~]# nbd-server -C /root/nbd.conf


The NBD module additionally needs to be compiled for the client kernel. The detailed compilation steps are as follows:

yum install -y kernel-devel kernel-headers elfutils-libelf-devel gcc gcc-c++  
[root@iZbp1eo3op9s5gxnvc7aolZ ~]# uname -r  

In theory, the configuration of the 1160.42.2 minor version should be consistent with that above, but no matching rpm package was found; kernel-3.10.0-1160.el7 seems to be available.

PS: The src.rpm for updating to the matching minor version was later found here.

curl https://vault.centos.org/7.9.2009/os/Source/SPackages/kernel-3.10.0-1160.el7.src.rpm -o ./kernel-3.10.0-1160.el7.src.rpm  
Revised to:
curl https://vault.centos.org/7.9.2009/updates/Source/SPackages/kernel-3.10.0-1160.42.2.el7.src.rpm -o ./kernel-3.10.0-1160.42.2.el7.src.rpm  
rpm -ivh kernel-3.10.0-1160.el7.src.rpm   
Revised to:
rpm -ivh kernel-3.10.0-1160.42.2.el7.src.rpm

cd rpmbuild/SOURCES/  
tar xvf linux-3.10.0-1160.el7.tar.xz -C /usr/src/kernels/  
Revised to:
tar xvf linux-3.10.0-1160.42.2.el7.tar.xz -C /usr/src/kernels/  
cd /usr/src/kernels/linux-3.10.0-1160.el7  
Revised to:
cd /usr/src/kernels/linux-3.10.0-1160.42.2.el7  

make mrproper  
cp /usr/src/kernels/3.10.0-1160.42.2.el7.x86_64/Module.symvers ./  
cp /boot/config-3.10.0-1160.el7.x86_64 ./.config  
Revised to:
cp /boot/config-3.10.0-1160.42.2.el7.x86_64 ./.config  

make oldconfig  
make prepare  
make scripts  

The following section fixes a compilation error. (The error occurs because REQ_TYPE_SPECIAL is hidden behind __GENKSYMS__ in the blkdev.h header, so the symbol is undefined when the module is built):

REQ_TYPE_SPECIAL = 7 is defined in /usr/src/kernels/linux-3.10.0-1160.el7/include/linux/blkdev.h
Revised to:
REQ_TYPE_SPECIAL = 7 is defined in /usr/src/kernels/linux-3.10.0-1160.42.2.el7/include/linux/blkdev.h
/*
 * request command types
 */
enum rq_cmd_type_bits {  
        REQ_TYPE_FS             = 1,    /* fs request */  
        REQ_TYPE_BLOCK_PC,              /* scsi command */  
        REQ_TYPE_SENSE,                 /* sense request */  
        REQ_TYPE_PM_SUSPEND,            /* suspend request */  
        REQ_TYPE_PM_RESUME,             /* resume request */  
        REQ_TYPE_PM_SHUTDOWN,           /* shutdown request */  
#ifdef __GENKSYMS__  
        REQ_TYPE_SPECIAL,               /* driver defined type */  
#else  
        REQ_TYPE_DRV_PRIV,              /* driver defined type */  
#endif  
        /*
         * for ATA/ATAPI devices. this really doesn't belong here, ide should  
         * use REQ_TYPE_DRV_PRIV and use rq->cmd[0] with the range of driver  
         * private REQ_LB opcodes to differentiate what type of request this is  
         */

Modify the file by replacing the symbolic name with its numeric value:

vi drivers/block/nbd.c  
sreq.cmd_type = REQ_TYPE_SPECIAL;  
Revised to:
sreq.cmd_type = 7;  

Continue the compilation and install the resulting module:

make M=drivers/block  
cp drivers/block/nbd.ko /lib/modules/3.10.0-1160.42.2.el7.x86_64/kernel/drivers/block/  

Load the NBD module:

depmod -a  
modinfo nbd  
modprobe nbd  

Configure automatic loading of the NBD module:

#cd /etc/sysconfig/modules/

#vi nbd.modules
Add the following contents to the file:
#!/bin/sh
/sbin/modinfo -F filename nbd > /dev/null 2>&1
if [ $? -eq 0 ]; then
    /sbin/modprobe nbd
fi

#chmod 755 nbd.modules   # This step is crucial.

Mount the network block device:

[root@iZbp1eo3op9s5gxnvc7aomZ ~]# nbd-client 1921 -N export1 /dev/nbd0   
Negotiation: ..size = 102400MB  
bs=1024, sz=107374182400 bytes  
[root@iZbp1eo3op9s5gxnvc7aomZ ~]# nbd-client 1921 -N export2 /dev/nbd1   
Negotiation: ..size = 102400MB  
bs=1024, sz=107374182400 bytes  
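
The negotiated sizes can be sanity-checked against the 100 GB disks seen in lsblk earlier; nbd-client reports the size in MiB:

```python
sz = 107374182400          # bytes, from the nbd-client negotiation line

mb = sz // (1024 * 1024)   # nbd-client's "MB" is MiB
gb = sz // (1024 ** 3)

print(mb)   # 102400 -> matches "size = 102400MB"
print(gb)   # 100    -> matches the 100G vdb/vdc disks from lsblk
```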

Format the file system and mount it:

mkfs.ext4 /dev/nbd0  
mkfs.ext4 /dev/nbd1  
mkdir /data01  
mkdir /data02  
mount /dev/nbd0 /data01  
mount /dev/nbd1 /data02  
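
If the mounts should survive a reboot, an /etc/fstab sketch follows; the _netdev option defers mounting until the network is up, and it assumes the nbd-client associations above are re-established at boot first (for example, from rc.local). This fragment is an illustration, not part of the original setup:

```
/dev/nbd0  /data01  ext4  defaults,_netdev  0 0
/dev/nbd1  /data02  ext4  defaults,_netdev  0 0
```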

Write tests:

# dd if=/dev/zero of=/data01/test oflag=direct bs=1M count=1000  
1000+0 records in  
1000+0 records out  
1048576000 bytes (1.0 GB) copied, 4.90611 s, 214 MB/s  
# dd if=/dev/zero of=/data02/test oflag=direct bs=1M count=1000  
1000+0 records in  
1000+0 records out  
1048576000 bytes (1.0 GB) copied, 4.90611 s, 214 MB/s  
df -h  
/dev/nbd0        99G  1.1G   93G   2% /data01  
/dev/nbd1        99G  1.1G   93G   2% /data02  

Some IO operations can be seen on the server side iotop:

13899 be/4 root        0.00 B/s   42.56 M/s  0.00 % 73.39 % nbd-server -C /root/nbd.conf [pool]  
13901 be/4 root        0.00 B/s   42.81 M/s  0.00 % 73.00 % nbd-server -C /root/nbd.conf [pool]  
13897 be/4 root        0.00 B/s   42.56 M/s  0.00 % 72.95 % nbd-server -C /root/nbd.conf [pool]  
13900 be/4 root        0.00 B/s   42.32 M/s  0.00 % 72.47 % nbd-server -C /root/nbd.conf [pool]  

fsync test:

[root@iZbp1eo3op9s5gxnvc7aomZ data01]# /usr/pgsql-14/bin/pg_test_fsync -f /data01/test  
5 seconds per test  
O_DIRECT supported on this platform for open_datasync and open_sync.  
Compare file sync methods using one 8kB write:  
(in wal_sync_method preference order, except fdatasync is Linux's default)  
        open_datasync                      1056.250 ops/sec     947 usecs/op  
        fdatasync                          1032.631 ops/sec     968 usecs/op  
        fsync                               404.807 ops/sec    2470 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                           414.387 ops/sec    2413 usecs/op  
Compare file sync methods using two 8kB writes:  
(in wal_sync_method preference order, except fdatasync is Linux's default)  
        open_datasync                       553.453 ops/sec    1807 usecs/op  
        fdatasync                          1011.726 ops/sec     988 usecs/op  
        fsync                               404.171 ops/sec    2474 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                           208.758 ops/sec    4790 usecs/op  
Compare open_sync with different write sizes:  
(This is designed to compare the cost of writing 16kB in different write  
open_sync sizes.)  
         1 * 16kB open_sync write           405.717 ops/sec    2465 usecs/op  
         2 *  8kB open_sync writes          208.324 ops/sec    4800 usecs/op  
         4 *  4kB open_sync writes          106.849 ops/sec    9359 usecs/op  
         8 *  2kB open_sync writes           52.999 ops/sec   18868 usecs/op  
        16 *  1kB open_sync writes           26.657 ops/sec   37513 usecs/op  
Test if fsync on non-write file descriptor is honored:  
(If the times are similar, fsync() can sync data written on a different descriptor.)  
        write, fsync, close                 413.350 ops/sec    2419 usecs/op  
        write, close, fsync                 417.832 ops/sec    2393 usecs/op  
Non-sync'ed 8kB writes:  
        write                            608345.462 ops/sec       2 usecs/op  
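
The two columns reported by pg_test_fsync are reciprocals of each other: usecs/op is roughly 1,000,000 divided by ops/sec. Checking the open_datasync line as an example:

```python
ops_per_sec = 1056.250                   # open_datasync, "one 8kB write" test
usecs_per_op = 1_000_000 / ops_per_sec   # microseconds per operation

print(round(usecs_per_op))               # 947, matching the reported column
```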

The other client server operates in the same way, except that mkfs is not required. If you want a cluster file system (one that propagates write changes to all nodes and supports distributed locking), you can use Linux GFS2.

Disconnect the NBD:

Unmount first:  
umount /data01   
umount /data02   
nbd-client -d /dev/nbd0  
nbd-client -d /dev/nbd1  

3. Other Information

An introduction to NBD (from Wikipedia):

Network block device  
From Wikipedia, the free encyclopedia  
In Linux, a network block device is a device node whose content is provided by a remote machine. Typically, network block devices are used to access a storage device that does not physically reside in the local machine but on a remote one. As an example, the local machine can access a fixed disk that is attached to another computer.  
Kernel client/userspace server  
Technically, a network block device is realized by two components. In the client machine, where the device node is to work, a kernel module named nbd controls the device. Whenever a program tries to access the device, this kernel module forwards the request to the server machine, where the data physically resides.  
On the server machine, requests from the client are handled by a userspace program called nbd-server. This program is not implemented as a kernel module because all it has to do is to serve network requests, which in turn just requires regular access to the server filesystem.  
If the file /tmp/xxx on ComputerA has to be made accessible on ComputerB, one performs the following steps:  
On ComputerA:  
nbd-server 2000 /tmp/xxx  
On ComputerB:  
modprobe nbd  
nbd-client ComputerA 2000 /dev/nbd0  
The file is now accessible on ComputerB as device /dev/nbd0. If the original file was for example a disk image, it could be mounted for example via mount /dev/nbd0 /mnt/whatever.  
The command modprobe nbd is not necessary if module loading is done automatically. Once the module is in the kernel, nbd-client is used to send commands to it, such as associating a given remote file to a given local nb device. To finish using /dev/nbd0, that is, to destroy its association with the file on other computer, one can run nbd-client -d /dev/nbd0 on ComputerB.  
In this example, 2000 is the number of the server port through which the file is made accessible. Any available port could be used.  
The network block device client module is available on Linux and GNU Hurd.  
Since the server is a userspace program, it can potentially run on every Unix-like platform. It was ported to Solaris.[1]  

In CentOS or RHEL, you can use the EPEL additional repository to install NBD:

[root@150 postgresql-9.3.5]# yum install -y nbd  
Loaded plugins: fastestmirror, refresh-packagekit, security, versionlock  
Loading mirror speeds from cached hostfile  
epel/metalink                                                                                                | 5.4 kB     00:00       
 * base: mirrors.skyshe.cn  
 * epel: mirrors.ustc.edu.cn  
 * extras: mirrors.163.com  
 * updates: centos.mirror.cdnetworks.com  
base                                                                                                         | 3.7 kB     00:00       
extras                                                                                                       | 3.3 kB     00:00       
updates                                                                                                      | 3.4 kB     00:00       
updates/primary_db                                                                                           | 5.3 MB     00:21       
Setting up Install Process  
Resolving Dependencies  
--> Running transaction check  
---> Package nbd.x86_64 0:2.9.20-7.el6 will be installed  
--> Finished Dependency Resolution  
Dependencies Resolved  
 Package                     Arch                           Version                              Repository                    Size  
 nbd                         x86_64                         2.9.20-7.el6                         epel                          43 k  
Transaction Summary  
Install       1 Package(s)  
Total download size: 43 k  
Installed size: 83 k  
Downloading Packages:  
nbd-2.9.20-7.el6.x86_64.rpm                                                                                  |  43 kB     00:00       
Running rpm_check_debug  
Running Transaction Test  
Transaction Test Succeeded  
Running Transaction  
  Installing : nbd-2.9.20-7.el6.x86_64                                                                                          1/1   
  Verifying  : nbd-2.9.20-7.el6.x86_64                                                                                          1/1   
  nbd.x86_64 0:2.9.20-7.el6                                                                                                           

[root@iZbp1eo3op9s5gxnvc7aokZ ~]# rpm -ql nbd 

4. References
