By digoal
In e-commerce operations, the following scenario is common: because of the relationship between upstream and downstream vendors, when a user places an order, both parties generate an order record in their own systems and send feedback to each other. After receiving a feedback order, the local system first caches the order record in a queue; a background process then retrieves orders from the queue for processing.
The key steps in this process are: high-speed writing to the queue, sequential extraction from the queue and high-speed processing, and removal of processed order records from the queue.
During periods of high concurrency, every consumer accesses the queue in the same order, which creates hotspots and can lead to bottlenecks such as lock conflicts, wasted I/O scans, and wasted CPU computation. A further bottleneck can arise when index entries for already-processed order records are not cleaned up promptly: each lookup of a dead index entry incurs extra I/O and CPU for heap visits to verify tuple visibility.
This article presents optimization strategies and a demo showing how to boost database performance for queue processing, achieving a 10- to 20-fold performance improvement.
1. Test Environment
MacBook Pro (15-inch, 2018)
2.2 GHz six-core Intel Core i7
32 GB 2400 MHz DDR4
PostgreSQL 15.1
Since the environment is macOS, you may need to raise ulimit:
ulimit -n 1000000
2. Queue table for the orders written by the upstream
create table t_order_q (
id serial8 primary key, -- Auto-increment primary key
order_id uuid unique, -- The order number passed from the upstream
cts timestamp not null -- The creation time of the order passed from the upstream
);
-- create index on t_order_q (cts); -- If orders are fetched by order creation time, create an index on the time field. If orders are processed in auto-increment primary key order instead, the time index is not needed.
3. State table of orders taken out and processed
create table t_order_u (
id serial8 primary key, -- Auto-increment primary key
order_id uuid unique, -- The order number passed from the upstream
cts timestamp not null, -- The creation time of the order passed from the upstream
uts timestamp not null, -- Order processing time
status int not null -- Order processing status marker
);
4. Write 1 million orders into the queue
insert into t_order_q (order_id, cts) select gen_random_uuid(), clock_timestamp() from generate_series(1,1000000);
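As a quick sanity check (not part of the original demo), the load can be verified:
select count(*), min(cts), max(cts) from t_order_q; -- expect 1,000,000 rows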
5. Write a pgbench stress test script that takes orders out of the queue. An advisory lock (AdLock) is taken on the queue ID to mark the record as being processed; a transaction-level AdLock is released automatically when the transaction ends. AdLock is also commonly used to relieve contention in flash sale scenarios.
vi t.sql
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
Or
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
Or (use \sleep to simulate the extra application-side processing time after the application receives the order to be processed):
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
\sleep 10ms
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
6. Stress-test the queue consumption with 256 concurrent clients; each connection handles an average of 3906 transactions:
select 1000000/256.0;
3906.2500000000000
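Note that 256 clients × 3906 transactions each = 999,936 transactions, so the run consumes 999,936 of the 1,000,000 queued orders and leaves 64 rows in the queue.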
7. Results of stress test
pgbench -M extended -f ./t.sql -n -r -P 1 -c 256 -j 2 -t 3906
transaction type: ./t.sql
scaling factor: 1
query mode: extended
number of clients: 256
number of threads: 2
maximum number of tries: 1
number of transactions per client: 3906
number of transactions actually processed: 999936/999936
number of failed transactions: 0 (0.000%)
latency average = 8.111 ms
latency stddev = 5.376 ms
initial connection time = 429.698 ms
tps = 25379.081141 (without initial connection time)
statement latencies in milliseconds and failures:
8.114 0 with tmp as
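As a cross-check added here (not part of the original output), the tables can be inspected after the run:
select count(*) from t_order_q; -- expected: 64 rows left (1,000,000 - 999,936)
select count(*) from t_order_u; -- expected: 999,936 processed orders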
1. For comparison, write a pgbench stress test script that takes orders out of the queue with select ... for update, so that concurrent consumers contend on the same rows instead of skipping locked ones via AdLock.
vi t1.sql
begin;
select id as vid from t_order_q order by id limit 1 for update \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
2. Results of stress test
pgbench -M extended -f ./t1.sql -n -r -P 1 -c 256 -j 2 -t 3906
TPS is about 1200.
After adding skip locked, TPS only reaches about 2500. With reduced concurrency, the skip locked variant improves to about 8K TPS (see the example invocation after the script):
begin;
select id as vid from t_order_q order by id limit 1 for update skip locked \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
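The original does not show the reduced-concurrency invocation. Assuming the skip locked script above is saved as t2.sql, a run along these lines (16 clients × 62500 transactions = 1,000,000 orders) would exercise it:
pgbench -M extended -f ./t2.sql -n -r -P 1 -c 16 -j 2 -t 62500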
The bottlenecks above suggest several optimizations:
1. Reduce wasted I/O and CPU computation:
• Under concurrency, order by id limit 1 has to scan several rows rather than one, because some IDs are already held by advisory locks. With n concurrent consumers, the wasted pg_try_advisory_xact_lock() calls add up to roughly n + (n-1) + (n-2) + ... + 1 = n(n+1)/2, and the wasted I/O per fetch is roughly n rows.
Optimization method:
• Fix N connections and have each connection consume a different data shard based on id hash mod N, reducing the wasted I/O and CPU (see the sharding sketch after this list).
• Or split the queue table into several partitions by id hash mod at insert time and assign each partition to a different consumer process, thereby reducing conflicts and wasted scans and improving concurrency.
2. Vacuum indexes more frequently to reduce the CPU and I/O wasted on heap visits when index entries pointing to dead tuples must be rechecked for visibility. Increase autovacuum_work_mem so that all dead tuple CTIDs fit in memory and each index is scanned only once per vacuum.
Optimization method:
• Configure the following parameters: autovacuum_naptime and autovacuum_work_mem (maintenance_work_mem in older versions); illustrative values appear in the sketch after this list.
3. Use parallel vacuum by configuring max_parallel_maintenance_workers.
4. Configure vacuum to prefetch blocks, reducing vacuum runs prolonged by I/O latency. (Suitable for disks with high per-I/O latency but no throughput bottleneck.)
5. Fetch multiple orders at a time and process them in batches (see the batch sketch after this list).
6. Use a local NVMe SSD with high IOPS and low per-I/O latency.
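A minimal sketch of the hash-mod sharding idea from point 1, assuming 256 consumers and 256 shards. The expression index and the use of pgbench's built-in client_id variable are illustrative assumptions, not part of the original demo:
create index on t_order_q (mod(id, 256), id); -- expression index so each shard's scan stays narrow
begin;
select id as v_id from t_order_q where mod(id, 256) = :client_id and pg_try_advisory_xact_lock(id) order by id limit 1 \gset
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
For point 2, illustrative parameter values (assumptions to be tuned per workload, not recommendations from the original):
# postgresql.conf
autovacuum_naptime = '5s'              # wake autovacuum more often than the 1min default
autovacuum_work_mem = '1GB'            # hold all dead-tuple CTIDs in a single pass
max_parallel_maintenance_workers = 4   # also enables parallel index vacuuming (point 3)
And a sketch of the batch consumption from point 5, dequeuing 100 orders per transaction; skip locked keeps concurrent consumers from blocking each other (the batch size of 100 is an arbitrary assumption):
with tmp as (
  delete from t_order_q
  where id in (select id from t_order_q order by id limit 100 for update skip locked)
  returning order_id, cts
)
insert into t_order_u (order_id,cts,uts,status)
select tmp.order_id, tmp.cts, now(), 1 from tmp;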