×
Community Blog PostgreSQL Optimized Case – Where a Field Range Order by B Field Sort Limit X

PostgreSQL Optimized Case – Where a Field Range Order by B Field Sort Limit X

This short article explains a PostgreSQL optimization case.

By digoal

Background Information

The goal of optimization is to minimize computing, IO, filter, recheck, etc. For example, the following query is difficult to optimize at first glance:

select * from tbl where ts between x and y   
order by z limit x;  
  1. The query condition of the ts field is a wide range with a large scale, which gives many results. A large number of sort records are displayed sorting by z.
  2. Records that meet the ts criteria are returned in the z-field, and qualified results are ranked behind. Therefore, a large number of filters are required.

Because ts and z are two fields, index (ts,z) or index (z,ts) cannot be used to completely perform indexing.

Optimization

First, let’s understand the background. z is pk. z and ts have linear correlation (positive correlation or negative correlation), and both have directionality, such as increasing or decreasing. Therefore, this query can be modified.

select * from tbl where ts between x and y
and z?
order by z limit x;

n records are queried each time, and all records meeting the conditions of ts between x and y are successively filtered and processed.

Optimization Ideas:

1.  Supplement the condition of z as the initial condition, so there is no need to filter too much.

select min(z), max(z) into id1,id2 from tbl where ts between x and y;  
  
select * from tbl where ts between x and y   
and z>=id1 and z<=id2   
order by z desc limit x;   
  
min(z) into id2  

2.  Each time a batch of records is found, the id is passed in as a new condition, so there is no need to filter too much every time.

select * from tbl where ts between x and y and z<id2 order by z desc limit x;   
  
min(z) into id2  

Example

create table tbl (id int primary key, info text, crt_time timestamp);  
insert into tbl select generate_series(1,100000000), md5(random()::text) , clock_timestamp();  

Original SQL statement:

postgres=# select min(crt_time), max(crt_time) from tbl;  
            min             |            max               
----------------------------+----------------------------  
 2020-07-10 14:04:46.600332 | 2020-07-10 14:08:06.737822  
(1 row)  
  
select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
order by id limit 100;  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
order by id limit 100;  
                                                                           QUERY PLAN                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.57..9.22 rows=100 width=45) (actual time=4003.046..4003.071 rows=100 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared hit=288650 read=63193  
   ->  Index Scan using tbl_pkey on public.tbl  (cost=0.57..3208769.85 rows=37078282 width=45) (actual time=4003.045..4003.062 rows=100 loops=1)  
         Output: id, info, crt_time  
         Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))  
         Rows Removed by Filter: 29130419  
         Buffers: shared hit=288650 read=63193  
 Planning Time: 0.093 ms  
 Execution Time: 4003.096 ms  
(10 rows)  
  
-- 大量filter  

Optimize the initial SQL:

select min(id),max(id) from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00';  
   min    |   max      
----------+----------  
 29130420 | 66423729  
(1 row)  
  
select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
and id>=29130420 and id<=66423729   
order by id desc limit 100;   
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
and id>=29130420 and id<=66423729   
order by id desc limit 100;   
                                                                           QUERY PLAN                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.57..10.57 rows=100 width=45) (actual time=0.022..0.051 rows=100 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared hit=6  
   ->  Index Scan Backward using tbl_pkey on public.tbl  (cost=0.57..1374883.82 rows=13745329 width=45) (actual time=0.021..0.043 rows=100 loops=1)  
         Output: id, info, crt_time  
         Index Cond: ((tbl.id >= 29130420) AND (tbl.id <= 66423729))  
         Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))  
         Buffers: shared hit=6  
 Planning Time: 0.121 ms  
 Execution Time: 0.068 ms  
(10 rows)  
  

Optimize SQL statements for batch query:

select * from tbl where   
crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
and id<66423630  
order by id desc limit 100;   
  
select * from tbl where   
crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'  
and id<66423530  
order by id desc limit 100;   
                                                                           QUERY PLAN                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.57..9.90 rows=100 width=45) (actual time=0.042..0.074 rows=100 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared hit=8  
   ->  Index Scan Backward using tbl_pkey on public.tbl  (cost=0.57..2298576.97 rows=24640916 width=45) (actual time=0.041..0.065 rows=100 loops=1)  
         Output: id, info, crt_time  
         Index Cond: (tbl.id < 66423530)  
         Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))  
         Buffers: shared hit=8  
 Planning Time: 0.107 ms  
 Execution Time: 0.097 ms  
(10 rows)  

It has 40000 times better performance.

0 0 0
Share on

digoal

281 posts | 24 followers

You may also like

Comments