
PostgreSQL Random Sampling Application - table sample, tsm_system_rows, and tsm_system_time

This short article reviews random sampling applications in PostgreSQL, with examples.

By digoal

Background

Random sampling extracts a subset of records at random from a resource pool and returns them to different clients.

The built-in sampling method named system, shown below, randomly selects about 10% of the table's data blocks and scans them for records that meet the condition. It stops scanning as soon as the LIMIT of ten rows is satisfied. It scans at most about 10% of the video table's data blocks and returns even if fewer than ten matching rows are found.
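The examples in this article can be reproduced against a small test table. The schema below is an assumption for illustration; only the id and play_count columns actually appear in the plans that follow:

```sql
-- Hypothetical test table matching the columns used in the plans below
create table video (
    id serial primary key,
    play_count int
);

-- Populate with random play counts so some rows satisfy play_count >= 2000
insert into video (play_count)
select (random() * 3000)::int
from generate_series(1, 1000000);
```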

explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system(10)  where play_count>=2000 limit 10;  
                                                        QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.00..1.73 rows=10 width=12) (actual time=0.006..0.011 rows=10 loops=1)  
   Output: id, play_count  
   Buffers: shared hit=1  
   ->  Sample Scan on public.video v1  (cost=0.00..3856.95 rows=22351 width=12) (actual time=0.006..0.009 rows=10 loops=1)  
         Output: id, play_count  
         Sampling: system ('10'::real)  
         Filter: (v1.play_count >= 2000)  
         Buffers: shared hit=1  
 Planning Time: 0.151 ms  
 Execution Time: 0.024 ms  
(10 rows)  
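The built-in system method also accepts a REPEATABLE clause, which fixes the seed so that repeated calls sample the same blocks. This is useful for debugging, though it defeats the purpose of returning different rows to different clients:

```sql
-- Same seed => same sampled blocks across calls. REPEATABLE works with the
-- built-in SYSTEM method; tsm_system_rows and tsm_system_time do not support it.
select id, play_count
from video tablesample system (10) repeatable (42)
where play_count >= 2000
limit 10;
```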

Is there finer-grained control over the sampling threshold? If the call frequency and the sampling ratio are high but few records qualify, each query may scan a large fraction of the table, consuming resources and potentially causing an avalanche.

How can we optimize it?

PostgreSQL provides two block-level random sampling method extensions:

create extension tsm_system_rows ;  
CREATE EXTENSION  
  
create extension tsm_system_time ;  
CREATE EXTENSION  
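If either CREATE EXTENSION statement fails, you can check whether the modules are installed on the server. pg_available_extensions is a standard catalog view:

```sql
-- List the sampling extensions shipped with the server, if present
select name, default_version, installed_version
from pg_available_extensions
where name in ('tsm_system_rows', 'tsm_system_time');
```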

With system_time(10), sampling runs for at most ten milliseconds, and up to ten entries that meet the play_count>=2000 condition are returned. (If ten matching entries are found quickly, scanning stops early, so the query is fast.)

explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system_time(10)  where play_count>=2000 limit 10;  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.00..1.73 rows=10 width=12) (actual time=0.009..0.030 rows=10 loops=1)  
   Output: id, play_count  
   Buffers: shared hit=3  
   ->  Sample Scan on public.video v1  (cost=0.00..10.71 rows=62 width=12) (actual time=0.008..0.028 rows=10 loops=1)  
         Output: id, play_count  
         Sampling: system_time ('10'::double precision)  
         Filter: (v1.play_count >= 2000)  
         Buffers: shared hit=3  
 Planning Time: 0.052 ms  
 Execution Time: 0.042 ms  
(10 rows)  

With system_rows(100), at most 100 rows are sampled, and up to ten entries that meet the play_count>=2000 condition are returned. (If ten matching entries are found quickly, scanning stops early, so the query is fast.)

explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE  system_rows (100) where play_count>=2000 limit 10;  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.00..1.75 rows=10 width=12) (actual time=0.023..0.031 rows=10 loops=1)  
   Output: id, play_count  
   Buffers: shared hit=1  
   ->  Sample Scan on public.video v1  (cost=0.00..16.65 rows=95 width=12) (actual time=0.022..0.028 rows=10 loops=1)  
         Output: id, play_count  
         Sampling: system_rows ('100'::bigint)  
         Filter: (v1.play_count >= 2000)  
         Buffers: shared hit=1  
 Planning Time: 0.076 ms  
 Execution Time: 0.051 ms  
(10 rows)  

This keeps the time and the cost of each query bounded, preventing sampling avalanches while preserving the randomness of the sample.

If the percentage of records matching the WHERE condition is very small, the sampler may hit its upper limit before the LIMIT is satisfied. In that case, you can choose among several methods:

  1. Increase the upper sampling limit, taking care to prevent avalanches
  2. Relax the WHERE condition to increase the match rate
  3. Run garbage collection (VACUUM) to reduce the number of holes left by dead tuples in each block
  4. Partition the table to increase the percentage of matching records in the sampled partition
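As a sketch of the first option, a wrapper function can retry with a progressively larger system_rows budget while enforcing a hard ceiling to bound cost. The function below is hypothetical and assumes the video table used in the examples above; the escalation factor and cap are arbitrary choices:

```sql
-- Hypothetical helper: try to find p_limit qualifying rows, escalating the
-- system_rows budget up to p_max_rows so a single call cannot avalanche.
create or replace function sample_hot_videos(p_limit int, p_max_rows bigint)
returns setof video
language plpgsql as
$$
declare
    budget bigint := 100;   -- initial sampling budget (rows)
    n      int;
begin
    loop
        -- probe: count how many qualifying rows this budget yields
        select count(*) into n
        from (select 1
              from video tablesample system_rows (budget)
              where play_count >= 2000
              limit p_limit) s;
        exit when n >= p_limit or budget >= p_max_rows;
        budget := least(budget * 10, p_max_rows);  -- escalate, never exceed cap
    end loop;
    -- final pass re-samples with the chosen budget (blocks may differ,
    -- which is acceptable for a random-sampling use case)
    return query
        select *
        from video tablesample system_rows (budget)
        where play_count >= 2000
        limit p_limit;
end;
$$;
```

Example call: `select * from sample_hot_videos(10, 100000);` returns up to ten rows while never sampling more than 100,000 rows per attempt.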
