×
Community Blog PostgreSQL: How to Optimize Spatial Index-based Query Performance for Multipolygon Data

PostgreSQL: How to Optimize Spatial Index-based Query Performance for Multipolygon Data

In this article, the author discusses optimizing spatial index-based query performance for multipolygon data in PostgreSQL.

By digoal

Background

Currently, GiST indexes are used in PostgreSQL for spatial objects. The following figure shows the tree structure of spatial data, in which each data entry is a bounding box.

1

If an object is a polygon, its bounding box is stored in the index.

A non-box object usually causes bounding box amplification.

2

If the input is a polygon, this polygon's bounding box is used as an input criterion. The query's main task is finding a branch of the bounding box meeting the operators' conditions from the PostgreSQL database based on the indexes.

In this case, if the invalid area is too large, I/O amplification and CPU usage amplification will occur during index scanning.

Optimization method: Split the input polygon to reduce invalid areas and then use the UNION ALL command to merge the results.

For a multipolygon object, the bounding box can be huge.

In the following figure, the multipolygon's bounding box that comprises three separate polygons on the map is enormous.

3

select * from tbl where st_contains(multi_polygon, pos);  

When you run the preceding SQL statement to perform a GiST index-based search, all spatial objects in the bounding box are returned, instead of the spatial objects in these small polygons. Then "check filter" is used for filtering. This results in I/O and CPU usage amplification.

To sum it all up, split spatial objects and use the UNION ALL command to merge spatial search results to obtain the final result.

Example

1) Create a test table.

create table tbl (id int, pos geometry);  

2) Write 10 million spatial data points.

insert into tbl select id,   
st_setsrid(  
  st_makepoint(  
    round((random()*(135.085831-73.406586)+73.406586)::numeric,6),   
    round((random()*(53.880950-3.408477)+3.408477)::numeric,6)  
  ),   
  4326  
) from generate_series(1,10000000) t(id);  

3) Create spatial indexes.

create index idx_tbl_pos on tbl using gist(pos);  

4) Construct a multipolygon with multiple polygons.

select st_union(  
  array[  
    st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(75.15 29.53,77 29,77.6 29.5, 75.15 29.53)')), 4326),   
    st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(125.15 29.53,125 29,125.6 29.5, 125.15 29.53)')), 4326)  
  ]  
);  

5) Use the multipolygon as the input for the search.

explain (analyze,verbose,timing,costs,buffers) select * from tbl where st_contains(  
st_union(  
  array[  
    st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(75.15 29.53,77 29,77.6 29.5, 75.15 29.53)')), 4326),   
    st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(125.15 29.53,125 29,125.6 29.5, 125.15 29.53)')), 4326)  
  ]  
),  
pos  
);  
 Bitmap Heap Scan on public.tbl  (cost=156.65..13467.61 rows=3333 width=36) (actual time=41.020..5062.317 rows=2445 loops=1)  
   Output: id, pos  
   Recheck Cond: ('0106000020E610000002000000010300000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40010300000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry ~ tbl.pos)  
   Filter: _st_contains('0106000020E610000002000000010300000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40010300000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry, tbl.pos)  
   Rows Removed by Filter: 83589  
   Heap Blocks: exact=53874  
   Buffers: shared hit=823 read=53873 written=7641  
   ->  Bitmap Index Scan on idx_tbl_pos  (cost=0.00..155.82 rows=10000 width=0) (actual time=22.305..22.305 rows=86034 loops=1)  
         Index Cond: ('0106000020E610000002000000010300000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40010300000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry ~ tbl.pos)  
         Buffers: shared hit=822  
 Planning time: 0.268 ms  
 Execution time: 5062.947 ms  
(12 rows)  

6) Split the multipolygon into multiple polygons for the search and use the UNION ALL command to merge the results.

explain (analyze,verbose,timing,costs,buffers)   
select * from tbl where st_contains(st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(75.15 29.53,77 29,77.6 29.5, 75.15 29.53)')), 4326), pos)  
union all  
select * from tbl where st_contains(st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(125.15 29.53,125 29,125.6 29.5, 125.15 29.53)')), 4326), pos);  
 Append  (cost=156.65..27001.89 rows=6666 width=36) (actual time=1.594..11.143 rows=2445 loops=1)  
   Buffers: shared hit=5230  
   ->  Bitmap Heap Scan on public.tbl  (cost=156.65..13467.61 rows=3333 width=36) (actual time=1.594..8.473 rows=2016 loops=1)  
         Output: tbl.id, tbl.pos  
         Recheck Cond: ('0103000020E610000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40'::geometry ~ tbl.pos)  
         Filter: _st_contains('0103000020E610000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40'::geometry, tbl.pos)  
         Rows Removed by Filter: 2172  
         Heap Blocks: exact=4083  
         Buffers: shared hit=4133  
         ->  Bitmap Index Scan on idx_tbl_pos  (cost=0.00..155.82 rows=10000 width=0) (actual time=1.001..1.001 rows=4188 loops=1)  
               Index Cond: ('0103000020E610000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40'::geometry ~ tbl.pos)  
               Buffers: shared hit=50  
   ->  Bitmap Heap Scan on public.tbl tbl_1  (cost=156.65..13467.61 rows=3333 width=36) (actual time=0.429..2.227 rows=429 loops=1)  
         Output: tbl_1.id, tbl_1.pos  
         Recheck Cond: ('0103000020E610000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry ~ tbl_1.pos)  
         Filter: _st_contains('0103000020E610000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry, tbl_1.pos)  
         Rows Removed by Filter: 655  
         Heap Blocks: exact=1076  
         Buffers: shared hit=1097  
         ->  Bitmap Index Scan on idx_tbl_pos  (cost=0.00..155.82 rows=10000 width=0) (actual time=0.295..0.295 rows=1084 loops=1)  
               Index Cond: ('0103000020E610000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40'::geometry ~ tbl_1.pos)  
               Buffers: shared hit=21  
 Planning time: 0.247 ms  
 Execution time: 11.432 ms  
(24 rows)  

7) Write user-defined functions (UDFs) to simplify multiple UNION ALL commands.

create or replace function q_mp(VARIADIC arr geometry[]) returns setof record as $$  
declare  
  sql text := '';  
  var geometry;  
begin  
  foreach var in array arr loop  
    sql := sql || format(' select * from tbl where st_contains(''%s''::geometry, pos) union all', var);  
  end loop;  
  sql := rtrim(sql, 'union all');  
  return query execute sql;  
end;  
$$ language plpgsql strict;  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from q_mp  
(  
st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(75.15 29.53,77 29,77.6 29.5, 75.15 29.53)')), 4326),   
st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(125.15 29.53,125 29,125.6 29.5, 125.15 29.53)')), 4326)  
) as t(id int, pos geometry);  
 Function Scan on public.q_mp t  (cost=0.25..10.25 rows=1000 width=36) (actual time=11.451..11.707 rows=2445 loops=1)  
   Output: id, pos  
   Function Call: q_mp(VARIADIC '{0103000020E610000001000000040000009A99999999C9524048E17A14AE873D4000000000004053400000000000003D4066666666666653400000000000803D409A99999999C9524048E17A14AE873D40:0103000020E610000001000000040000009A99999999495F4048E17A14AE873D400000000000405F400000000000003D406666666666665F400000000000803D409A99999999495F4048E17A14AE873D40}'::geometry[])  
   Buffers: shared hit=5230  
 Planning time: 0.095 ms  
 Execution time: 11.975 ms  
(6 rows)  

Summary

1) UDFs greatly simplify SQL statements and improve performance.

2) You can reduce the search criteria (or invalid areas in the bounding box) as much as possible to reduce the I/O and CPU usage and significantly improve the performance.

3) There may be another optimizable problem. If the spatial object is very large, the PGLZ compression method is used. If the CPU usage on compression and decompression has become a bottleneck, a non-compressed storage format can reduce CPU usage.

You can use perf to observe bottlenecks.

4) Custom compression methods: https://commitfest.postgresql.org/21/1294/

Reference

For more information on how to split a multigeometry (multipolygon) into multiple polygons, see PostGIS documentation: http://postgis.net/docs/manual-2.4/ST_GeometryN.html

0 0 0
Share on

digoal

148 posts | 10 followers

You may also like

Comments