Kenan
Assistant Engineer

[PostgreSQL Development] A case study of SSD write amplification caused by misalignment

Posted: Sep 19, 2016 13:55
Background
SSDs are organized so that data is written in fixed-size units (pages, grouped into erase blocks), so partitions and logical volumes must be aligned to those units.
Misalignment has serious consequences: it not only degrades performance but also causes write amplification.
Consider the figure: the solid lines separate the SSD's write units. If misalignment is introduced during partitioning or LVM setup, a single I/O can straddle two units, doubling the amount of data actually written.
Moreover, each SSD cell endures only a fixed number of program/erase cycles, so this amplification not only hurts performance but can also cut the device's lifetime roughly in half.
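A quick way to see the amplification is to count how many flash pages a single write spans at an aligned versus a misaligned offset. This is a sketch with an assumed 4 KiB page size; real page and erase-block sizes vary by device.

```shell
# Pages touched by an 8 KiB write, aligned (offset 0) vs misaligned (offset 512).
page=4096
io=8192
for off in 0 512; do
  first=$(( off / page ))
  last=$(( (off + io - 1) / page ))
  echo "offset $off: $(( last - first + 1 )) pages"
done
```

A 512-byte shift turns a 2-page write into a 3-page write: the SSD must read-modify-write the two partially covered pages, which is exactly the amplification described above.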


Example
Two PCI-E SSDs configured as a striped LVM volume performed worse than a single SSD. Let's find out why.
SSD misalignment test
pvcreate /dev/xxa
pvcreate /dev/xxb

# 128 MB physical extents; stripe the LV across both PVs with an 8 KiB stripe size
vgcreate -s 128M vgdata01 /dev/xxa /dev/xxb
lvcreate -i 2 -I 8 -n lv01 -l 100%VG vgdata01

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1

mkdir /data01

mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/mapper/vgdata01-lv01 /data01

# six parallel ~1 TB sequential writes with direct I/O
dd if=/dev/zero of=/data01/img01 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img02 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img03 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img04 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img05 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img06 bs=1024k count=1024000 oflag=direct &


Performance data
# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   3  92   5   0   0|8192B 2951M| 470B  810B|   0     0 |  42k   99k
  0   3  92   5   0   0|4096B 2971M| 246B  358B|   0     0 |  42k   99k
  0   3  92   5   0   0|8192B 2945M| 220B  750B|   0     0 |  40k   98k
  0   3  92   5   0   0|4096B 2940M|  66B  268B|   0     0 |  39k   92k
  0   3  92   5   0   0|4096B 2896M|  66B  268B|   0     0 |  40k   94k
  0   3  92   5   0   0|4096B 2883M|  66B  358B|   0     0 |  40k   96k


The write speed stays at only around 2.9 GB/s.
SSD alignment Test 1
Align the disks using the parted command:
umount /data01
lvchange -an /dev/mapper/vgdata01-lv01
lvremove /dev/mapper/vgdata01-lv01
vgremove vgdata01
pvremove /dev/xxa
pvremove /dev/xxb

parted -a optimal -s /dev/xxa mklabel gpt mkpart primary 1MB 6390GB
parted -a optimal -s /dev/xxb mklabel gpt mkpart primary 1MB 6390GB
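To confirm the result, the partition start in bytes must divide evenly by the disk's optimal I/O size. A sketch with an assumed 1 MiB optimal size and the 2048-sector start that `-a optimal` typically produces for a 1MB request:

```shell
# 0 means the partition start falls on an optimal boundary.
start_sector=2048        # typical start for "mkpart primary 1MB" with -a optimal
sector_bytes=512
opt_bytes=1048576        # assumed optimal_io_size; check your device's actual value
echo $(( start_sector * sector_bytes % opt_bytes ))
```

parted can also perform this check itself: `parted /dev/xxa align-check opt 1` reports whether partition 1 is optimally aligned.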


Alignment options from the parted man page:
-a alignment-type, --align alignment-type
       Set alignment for newly created partitions, valid alignment types are:

       none   Use the minimum alignment allowed by the disk type.

       cylinder
              Align partitions to cylinders.

       minimal
              Use minimum alignment as given by the disk topology information.
              This and the opt value will use layout information provided by the disk to align the logical partition table addresses to actual physical blocks on the disks.
              The min value is the minimum alignment needed to align the partition properly to physical blocks, which avoids performance degradation.

       optimal
              Use optimum alignment as given by the disk topology information.
              This aligns to a multiple of the physical block size in a way that guarantees optimal performance.
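The "disk topology information" mentioned above is exported by the kernel under sysfs. This loop prints it for whatever block devices are present; the values are device-specific, so the output varies by machine.

```shell
# Topology hints consulted by parted -a minimal/optimal.
for q in /sys/block/*/queue; do
  [ -r "$q/optimal_io_size" ] || continue
  dev=$(basename "$(dirname "$q")")
  echo "$dev: phys=$(cat "$q/physical_block_size") min_io=$(cat "$q/minimum_io_size") opt_io=$(cat "$q/optimal_io_size")"
done
```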


Partition and test
pvcreate /dev/xxa1
pvcreate /dev/xxb1

vgcreate -s 128M vgdata01 /dev/xxa1 /dev/xxb1

lvcreate -i 2 -I 8 -n lv01 -l 100%VG vgdata01

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1

mkdir /data01

mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/mapper/vgdata01-lv01 /data01

dd if=/dev/zero of=/data01/img01 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img02 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img03 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img04 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img05 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data01/img06 bs=1024k count=1024000 oflag=direct &


Performance data
# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   8  88   4   0   0|  16k 5959M| 190B  268B|   0     0 | 100k  229k
  0   8  87   4   0   0|8192B 5967M|  66B  358B|   0     0 |  98k  228k
  0   9  87   4   0   0|  16k 5936M| 190B  178B|   0     0 | 112k  232k
  0   8  87   4   0   0|8192B 5920M|  66B  426B|   0     0 | 110k  233k


With write amplification eliminated, the write speed roughly doubles to around 5.9 GB/s.
SSD alignment Test 2
Considering possible lock contention at the file system layer, we next split the SSDs into multiple LVs, each with its own file system, and repeat the test.
lvcreate -i 2 -I 8 -n lv01 -l 30%VG vgdata01
lvcreate -i 2 -I 8 -n lv02 -l 30%VG vgdata01
lvcreate -i 2 -I 8 -n lv03 -l 30%VG vgdata01

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1
mkfs.ext4 /dev/mapper/vgdata01-lv02 -m 0 -O extent,uninit_bg -E lazy_itable_init=1
mkfs.ext4 /dev/mapper/vgdata01-lv03 -m 0 -O extent,uninit_bg -E lazy_itable_init=1

mkdir /data01
mkdir /data02
mkdir /data03

mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/mapper/vgdata01-lv01 /data01
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/mapper/vgdata01-lv02 /data02
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/mapper/vgdata01-lv03 /data03

dd if=/dev/zero of=/data01/img01 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data02/img01 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data03/img01 bs=1024k count=1024000 oflag=direct &

dd if=/dev/zero of=/data01/img02 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data02/img02 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data03/img02 bs=1024k count=1024000 oflag=direct &


Performance data
# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   7  87   6   0   0|8192B 6498M|  66B  338B|   0     0 |  92k  222k
  0   7  87   6   0   0|  16k 6447M| 156B  268B|   0     0 |  92k  222k
  0   7  87   6   0   0|8192B 6483M|  66B  178B|   0     0 | 102k  219k
  0   7  87   6   0   0|  16k 6396M|  66B  178B|   0     0 |  95k  205k
  0   7  87   6   0   0|8192B 6403M| 220B  750B|   0     0 |  80k  191k
  0   6  87   6   0   0|  16k 6330M| 190B  178B|   0     0 |  95k  206k
  0   6  88   6   0   0|8192B 6474M| 132B  272B|   0     0 |  97k  233k
  0   6  88   6   0   0|  16k 6441M| 190B  178B|   0     0 | 100k  229k
  0   6  87   6   0   0|8192B 6375M|  66B  516B|   0     0 |  88k  208k
  0   7  87   6   0   0|  16k 6365M| 715B  437B|   0     0 |  95k  207k
  0   6  88   6   0   0|8192B 6500M|  66B  252B|   0     0 |  95k  220k
  0   6  88   6   0   0|8192B 6433M|  66B  178B|   0     0 |  93k  224k


Using multiple file systems breaks the per-file-system lock bottleneck, raising the write speed to around 6.4 GB/s.
SSD alignment Test 3
Finally, let's see the performance when we skip LVM and use the partitioned block devices directly.
lvchange -an /dev/mapper/vgdata01-lv01
lvremove /dev/mapper/vgdata01-lv01
lvchange -an /dev/mapper/vgdata01-lv02
lvremove /dev/mapper/vgdata01-lv02
lvchange -an /dev/mapper/vgdata01-lv03
lvremove /dev/mapper/vgdata01-lv03
vgremove vgdata01
pvremove /dev/xxa1
pvremove /dev/xxb1

parted -a optimal -s /dev/xxa mklabel gpt mkpart primary 1MB 6390GB
parted -a optimal -s /dev/xxb mklabel gpt mkpart primary 1MB 6390GB

mkfs.ext4 /dev/xxa1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1
mkfs.ext4 /dev/xxb1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1

mkdir /data01
mkdir /data02

mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/xxa1 /data01
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback /dev/xxb1 /data02


dd if=/dev/zero of=/data01/img01 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data02/img01 bs=1024k count=1024000 oflag=direct &

dd if=/dev/zero of=/data01/img02 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data02/img02 bs=1024k count=1024000 oflag=direct &

dd if=/dev/zero of=/data01/img03 bs=1024k count=1024000 oflag=direct &
dd if=/dev/zero of=/data02/img03 bs=1024k count=1024000 oflag=direct &


Performance data
# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   3  89   8   0   0|8192B 6564M| 322B  428B|   0     0 |  42k  103k
  0   3  89   8   0   0|  24k 6558M|  66B  178B|   0     0 |  43k  102k
  0   3  89   8   0   0|  24k 6518M|  66B  268B|   0     0 |  42k  102k
  0   3  89   8   0   0|8192B 6545M| 344B  750B|   0     0 |  43k  101k
  0   3  89   8   0   0|8192B 6543M|  66B  268B|   0     0 |  43k  103k
  0   3  89   8   0   0|  16k 6592M| 132B  362B|   0     0 |  42k  101k
  0   3  89   8   0   0|  16k 6586M| 280B  482B|   0     0 |  42k  105k
  0   3  89   8   0   0|8192B 6617M|  66B  302B|   0     0 |  42k  103k
  0   3  89   8   0   0|8192B 6560M|  66B  268B|   0     0 |  40k  101k
  0   3  89   8   0   0|  24k 6541M|  66B  178B|   0     0 |  39k  101k


Without LVM, writing directly to the partitioned block devices reaches around 6.5 GB/s.
Once the SSDs are aligned, the LVM stripe scales linearly over a single disk, essentially matching the combined throughput of the independent block devices.
Summary
SSD alignment is crucial, and the tests above bear this out: proper alignment not only delivers the best performance but also preserves the device's lifetime.
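Pulling the measurements together, computed from the numbers reported above:

```shell
# Summarize the four test results and derive the alignment gain.
awk 'BEGIN {
  misaligned = 2.9   # GB/s, striped LVM, misaligned
  aligned    = 5.9   # GB/s, striped LVM, aligned
  multi_fs   = 6.4   # GB/s, aligned, 3 LVs / 3 file systems
  raw        = 6.5   # GB/s, no LVM, 2 block devices
  printf "alignment speedup  : %.1fx\n", aligned / misaligned
  printf "aligned LVM vs raw : %.0f%% of raw throughput\n", multi_fs / raw * 100
}'
```

Misalignment alone cost half the throughput, while a properly aligned LVM stripe gives up only a few percent compared with the raw devices.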