×
Community Blog PostgreSQL: Real-time User Positioning and Selection Based on Massive Amounts of Data

PostgreSQL: Real-time User Positioning and Selection Based on Massive Amounts of Data

In this article, the author discusses real-time user positioning and selection using PostgreSQL. The author also discusses various design methods and optimization for performance improvements.

By digoal

Background

We must take good care of the elderly and children, but what if they go missing?

Alibaba has a public help system called Reunion, which help to publish information about missing people. The police can publish photos of missing people and their last-seen locations.

Partners of Reunion (such as telecom operators and shopping apps) can send messages about missing persons to people near missing persons’ last-seen locations. In turn, this mobilizes society to help find the missing persons. To achieve this goal, we need to collect nearby people’s real-time locations with technologies such as cell tower- and GPS-based positioning.

Assume that 1 billion mobile phone users are changing their location in real time and real-time positioning information needs to be updated in the database, leading to hundreds of billions of location updates every day.

After publishing the missing persons' information, we need to find people near the locations where the missing persons were last seen in the database.

Simple and Straightforward Design 1: Geometry + GiST Spatial Index

1) Schema Design

create table tbl_pos(      
  id int primary key,  -- 用户ID      
  pos point  -- 用户实时位置      
);      

2) Spatial Indexes

create index idx_tbl_pos on tbl_pos using gist(pos);      

Performance Evaluation

Use the INSERT ON CONFLICT statement to update the locations of 1 billion users in real time.

vi test.sql      
      
\set id random(1,1000000000)      
insert into tbl_pos values (:id, point(random()*180,random()*90)) on conflict (id) do update set pos=excluded.pos;      

Use 32 parallel tasks to generate random user locations in real time.

nohup pgbench -M prepared -n -r -P 5 -f ./test.sql -c 32 -j 32 -T 120000 > ./pos.log 2>&1 &      

1) The Transactions Per Second (TPS) of Real-time Location Updates is About 180,000/s.

179799      

The load on the server is minimal and has sufficient resources for queries.

top - 01:52:34 up 76 days, 15:32,  2 users,  load average: 33.74, 33.56, 31.47      
Tasks: 1064 total,  34 running, 1030 sleeping,   0 stopped,   0 zombie      
%Cpu(s): 47.6 us,  5.4 sy,  0.0 ni, 46.9 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st      
KiB Mem : 52807456+total, 32911484+free, 10949652 used, 18801006+buff/cache      
KiB Swap:        0 total,        0 free,        0 used. 42997945+avail Mem 

2) Query Performance

Test the query performance when locations are updated.

For example, if a missing person’s last location is Hangzhou, we need to query points on a particular plane (Hangzhou in this case). We can obtain five million points (nearby users) within just 28 seconds.

The spatial index ensures a very fast return speed.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_pos where box(point(1,1), point(25.5,25.5)) @> pos limit 5000000;      
                                                                      QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------      
 Limit  (cost=0.55..412954.11 rows=407872 width=20) (actual time=1.433..27536.623 rows=5000000 loops=1)      
   Output: id, pos      
   Buffers: shared hit=6183117 dirtied=31842      
   ->  Index Scan using idx_tbl_pos on public.tbl_pos  (cost=0.55..412954.11 rows=407872 width=20) (actual time=1.431..26861.352 rows=5000000 loops=1)      
         Output: id, pos      
         Index Cond: ('(25.5,25.5),(1,1)'::box @> tbl_pos.pos)      
         Buffers: shared hit=6183117 dirtied=31842      
 Planning time: 0.353 ms      
 Execution time: 27950.171 ms      
(9 rows)      

In actual queries, we can use cursors to return the results through stream push. Example:

postgres=# begin;      
BEGIN      
postgres=# declare cur cursor for select * from tbl_pos where box(point(1,1), point(25.5,25.5)) @> pos;      
DECLARE CURSOR      
postgres=# fetch 10 from cur;      
    id     |                 pos                       
-----------+-------------------------------------      
 680844515 | (2.08381220698357,1.25674836337566)      
 498274514 | (2.23715107887983,1.27883949782699)      
  72310072 | (2.1013452205807,1.32945269811898)      
 301147261 | (2.12246049195528,1.33455505594611)      
 186462127 | (2.13169047608972,1.24054086394608)      
 726143191 | (2.27320306934416,1.31862969137728)      
 902518425 | (2.27059512399137,1.32658164482564)      
 534516939 | (2.18118946999311,1.29441328346729)      
 329417311 | (2.27630747482181,1.2547113513574)      
 853173913 | (2.28139906190336,1.33868838194758)      
(10 rows)      
      
postgres=# \timing      

The timing is on.

postgres=# fetch 10 from cur;      
    id     |                 pos                       
-----------+-------------------------------------      
 223759458 | (2.24917919375002,1.31508464924991)      
 215111891 | (2.10541740059853,1.26674327999353)      
 925178989 | (2.08201663568616,1.2974686967209)      
 954808979 | (2.10515496321023,1.32548315450549)      
 514021414 | (2.17867707833648,1.27732987515628)      
 872436892 | (2.22504794597626,1.31386948283762)      
 507169369 | (2.05484946258366,1.30171341821551)      
 317349985 | (2.25962312892079,1.30945896729827)      
 200956423 | (2.10705514065921,1.30409182514995)      
 598969258 | (1.98812280781567,1.30866004619747)      
(10 rows)      
      
Time: 0.306 ms      

A cursor allows a client to receive the returned information while sending text messages or pushing notices for missing persons.

Stream push saves valuable search time.

Simple and Straightforward Design 2: Geohash + B-tree Index

The Reunion system does not have the same high precision requirement on spatial data as some LBS social networking programs. Therefore, we can use GeoHash + B-tree index instead of geometry + GiST spatial index.

We can use the ST_GeoHash(geometry, int) function provided by PostGIS as the conversion method.

Optimized Design: Schemaless Partitioning

The query performance is good in a single table that contains 1 billion spatial data records. With frequent updates, however, the GiST index may expand. Therefore, PostgreSQL provides a parallel index creation method that allows us to create the same index on a column without causing congestion to maintain the index. However, it takes a long time to create indexes for 1 billion data records.

To solve this problem, we recommend partitioning the table. For example, we split the table into 64 partitions by hash ID, each with about 15 million data records.

In PostgreSQL, the pg_pathman module provides the best partitioning performance. Or we can use a schemaless partitioning method.

The geometry + GiST spatial index is used as an example below to describe the schemaless partitioning method.

Defining a Base Table

postgres=# create table tbl_pos(id int primary key, pos point);    
CREATE TABLE    
postgres=# create index idx_tbl_pos_1 on tbl_pos using gist(pos);    
CREATE INDEX    

Defining an Automatic Table Creation Function

create or replace function create_schemaless(    
  target name,   -- 目标表名    
  src name       -- 源表名    
) returns void as $$        
declare        
begin        
  execute format('create table if not exists %I (like %I including all)', target, src);        
  execute format('alter table %I inherit %I', target, src);        
exception when others then        
  return;        
end;        
$$ language plpgsql strict;        

Defining a Function that Writes Data in Schemaless Mode

Create a function that inserts data and uses dynamic SQL. Invoke the table creation function to create tables if an error occurs, indicating no tables.

create or replace function ins_schemaless(    
  id int,   -- id    
  md int,   -- 取模数    
  pos point -- 位置    
) returns void as $$        
declare       
  target name := 'tbl_pos_'||mod(id,md) ;      
begin        
  execute format('insert into %I values (%L, %L) on conflict (id) do update set pos=point_add(%I.pos, point(random()*10-5, random()*10-5))', target, id, pos, target);       
  -- 为了模拟真实情况,因为人的移动速度有限,即使驾车,飞机(少数情况),所以用了pos=point_add(%I.pos, point(random()*10-5, random()*10-5))这种方法模拟更真实的情况    
  -- 实际场景,请改成pos=excluded.pos    
  exception         
    WHEN SQLSTATE '42P01' THEN         
    perform create_schemaless(target, 'tbl_pos');        
    execute format('insert into %I values (%L, %L) on conflict (id) do update set pos=point_add(%I.pos, point(random()*10-5, random()*10-5))', target, id, pos, target);         
    -- 为了模拟真实情况,因为人的移动速度有限,即使驾车,飞机(少数情况),所以用了pos=point_add(%I.pos, point(random()*10-5, random()*10-5))这种方法模拟更真实的情况    
    -- 实际场景,请改成pos=excluded.pos    
end;        
$$ language plpgsql strict;  

The schemaless design on the database side is implemented at the cost of performance to some extent because bind variables cannot be used.

We recommend using the schemaless design in the business layer (automatically splitting and joining table names, automatically creating tables, and automatically writing data) to improve the performance.

The testing code is as follows:

postgres=# select ins_schemaless(2,32,point(1,2));    
 ins_schemaless     
----------------    
     
(1 row)    
    
postgres=# select ins_schemaless(1,32,point(1,2));    
 ins_schemaless     
----------------    
     
(1 row)    
    
postgres=# select tableoid::regclass,* from tbl_pos;    
 tableoid  | id |  pos      
-----------+----+-------    
 tbl_pos_2 |  2 | (1,2)    
 tbl_pos_1 |  1 | (1,2)    
(2 rows)    

Stress Testing for the Schemaless Design

vi ~/test.sql    
\set id random(1,1000000000)    
select ins_schemaless(:id, 32, point(random()*360-180, random()*180-90));    
    
    
nohup pgbench -M prepared -n -r -P 5 -f ./test.sql -c 32 -j 32 -T 120000 > ./pos.log 2>&1 &    

The performance is still excellent.

125977 tps 

Summary

1) Using the spatial data types and spatial indexes of PostgreSQL together with the INSERT ON CONFLICT statement, we can update the locations of 1 billion users in real time on a single server at a TPS of about 180,000/s and return 5 million points in less than 30 seconds. The timeliness of the Reunion public welfare system is ensured.

2) We can use the cursor to return results in streams and obtain data while sending notices for missing persons to all nearby persons.

3) Some location changes can be filtered out. For example, when a user goes to the bathroom from the office, the location changes. However, it is a very subtle change and can be filtered out (not updated) in this system. Filtering out such changes reduces data updates.
According to the existing test data, 15.5 billion updates can be achieved every day. Assuming that only 1 of every 10 updates is valid, PostgreSQL supports the collection of 155 billion location-change data records.

4) PostgreSQL can be used in public help scenarios.

5) In the future, when the StreamCompute engine PipelineDB is developed as a plug-in, PostgreSQL will integrate this StreamCompute engine. Theoretically, the StreamCompute engine easily achieves an update rate of 400,000 rows, supporting more than 30 billion real-time location updates per day.

6) Using StreamCompute not only improves performance but also reduces transaction ID (XID) consumption. With the current 32-bit XID, it can effectively reduce the burden from FREEZE. If StreamCompute is not used, we recommend merging the data before updating it. For example, if several records (for example, 100 records) are updated within one transaction, the number of updated data records per day maybe 150 million.

Reference

0 0 0
Share on

digoal

148 posts | 10 followers

You may also like

Comments