×
Community Blog Discover Target Customers Using Visual Mining with PostGIS Spatial Databases

Discover Target Customers Using Visual Mining with PostGIS Spatial Databases

This article discusses ad recommendation systems and explores optimization requirements for handling data and performing visual mining efficiently in real time.

By Digoal

Background

A recommendation system is the backbone of any advertising platform. A proper recommendation system must be precise, capable of delivering recommendations in real time, and be highly efficient.

But which advertisement platform is the best one? Which one has the most robust core?

1) Precision refers to the ability to create distinct and accurate user profiles based on massive user data. This profiling system is also called a tagging system. Effective recommendations often hinge on precise tagging. For example, you wouldn't recommend a pair of reading glasses to young people unless their purchase intention tags show interest.

2) Tags must be updated in real time, and many tags are time-sensitive. For example, marketing to certain customers may only be effective within a certain time window. Likewise, a certain product may only be appropriate in one minute and less so in the next minute. If you generate tags every other day or longer, you may miss recommending in the best possible way. Therefore, operating a recommendation system in real time is critical.

3) Efficiency refers to the ability to discover customers based on predefined tags quickly. After all, the advertisement industry sponsors need to obtain information as quickly as possible. If you have several clients purchasing advertisements, your system's capacity to discover customers promptly and deliver advertisements will be tested.

An advertisement platform's competitiveness depends on the above three core elements.

PostgreSQL arrays and generalized inverted indexes (GINs) perfectly support these business scenarios.

The above cases are suitable for a scenario where the number of visits to shops is classified into different levels, for example, 1-1000, 1001-5000, etc.

How can we do data mining if the visit count is not classified into levels but is displayed in detail?

Next, we discuss how to address the above scenario with a spatial database.

Business Background

Before buying an item on Taobao, a user may visit many shops and decide after a serious comparison.

Users may visit many shops every day. What does it mean if a user visits a shop several times?

This shop must have something that attracts the user. Therefore, if the shop pushes promotions to this user or takes appropriate marketing measures, the user may probably buy from the shop.

1

This is one of the ways to discover target customers.

The data structure may look like this:

User ID, shop ID 1, number of visits, shop ID 2, number of visits, etc., for example, "1:1, 2:1" means that the user visited Shop 1 and Shop 2 once.

With this data, the business system can identify target customers based on visits, for example, the number of visits to Shop A or Shop B.

What technology would you use in this situation?

Data Scale

Assume hundreds of millions of shop IDs and users with an uncertain number of visits to each shop.

Direct computation for the above business requirements requires significant CPU usage.

Visual Mining

PostGIS is a spatial database. After converting the data into spatial data, use a spatial function to identify target customers, which, for example, represent based on the intersection between a MultiPoint geometry and a LineString.

It is achievable using PostGIS spatial indexes and converting the visit trace data into a MultiPoint geometry to meet this business requirement. To construct a multipoint geometry, use the following two functions:

In a database, visit traces of users are stored as geometry objects, each comprising multiple points.

2

Identify target customers based on the intersection of two geometry shapes, for example, users who visited Shop 2 for 2-100 times or Shop 4 for 3-100 times. This is converted into calculating the result set of intersection between a multiline geometry and a multipoint geometry.

3

Notes

Assume that PostGIS supports finding the intersection (&&) between a MultiPoint geometry and a LineString geometry (currently not supported by PostGIS GA).

Currently, && applies to the intersection between bounding boxes instead of between Multipoint geometries. Therefore, the direct result of && is not desirable.

PostGIS supports many complicated geometric operations. So, finding the intersection (&&) between a MultiPoint geometry and a LineString geometry is not a geometric requirement.

&& — Returns TRUE if A's 2D bounding box intersects B's 2D bounding box.

Example

Creating a Table

test=> create table test(userid int8 primary key, feeds text);        
CREATE TABLE      

The data format is as follows:

shop ID:number of visits; shop ID:number of visits, …

Creating Spatial Function Indexes

Construct a MultiPoint geometry in the preceding format and create spatial indexes. Convert a null string into an origin (0, 0).

test=> create index idx_test_feeds on test using gist ((case when feeds='' then ST_MPointFromText('MULTIPOINT(0 0)') else ST_MPointFromText('MULTIPOINT('||replace(replace(feeds,':', ' '),';',',')||')') end));          
CREATE INDEX          
        
        
test=> \d+ test        
                 Table "public.test"        
 Column |  Type  | Modifiers | Storage  | Description         
--------+--------+-----------+----------+-------------        
 userid | bigint | not null  | plain    |         
 feeds  | text   |           | extended |         
Indexes:        
    "test_pkey" PRIMARY KEY, btree (userid)        
    "idx_test_feeds" gist ((        
CASE        
    WHEN feeds = ''::text THEN st_mpointfromtext('MULTIPOINT(0 0)'::text)        
    ELSE st_mpointfromtext(('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text)) || ')'::text)        
END))        
Has OIDs: no        

Inserting Test Data

Insert several pieces of test data to indicate user-visited shops and the visit count.

insert into test values (1, '1:1');          
insert into test values (2, '1:100');          
insert into test values (3, '2:1');          
insert into test values (4, '2:100');          
insert into test values (5, '1:100;2:100');   

Searching for Target Customers

1) Search for Shop 1 visitors at least twice and at most 100 times.

Construct a LineString ST_LineFromText('LINESTRING(1 2, 1 100)') and find the intersection.

select * from test where           
(case when feeds='' then ST_MPointFromText('MULTIPOINT(0 0)') else ST_MPointFromText('MULTIPOINT('||replace(replace(feeds,':', ' '),';',',')||')') end)          
&&          
ST_LineFromText('LINESTRING(1 2, 1 100)');          
          
 userid |    feeds            
--------+-------------        
      2 | 1:100        
      5 | 1:100;2:100        
(2 rows)        

Spatial indexes are used during execution.

 Index Scan using idx_test_feeds on test  (cost=0.41..48.47 rows=232 width=40)        
   Index Cond: (CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text        
)) || ')'::text)) END && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry)        
(2 rows)        

2) Search for users who visited Shop 1 or Shop 2 at least twice and most 100 times.

Construct a multi-line string ST_MLineFromText('MULTILINESTRING((1 2, 1 100), (2 2, 2 100))') and find the intersection.

select * from test where           
(case when feeds='' then ST_MPointFromText('MULTIPOINT(0 0)') else ST_MPointFromText('MULTIPOINT('||replace(replace(feeds,':', ' '),';',',')||')') end)          
&&          
ST_MLineFromText('MULTILINESTRING((1 2, 1 100), (2 2, 2 100))');          
          
 userid |    feeds            
--------+-------------        
      2 | 1:100        
      4 | 2:100        
      5 | 1:100;2:100        
(3 rows)         

Spatial indexes are used during execution.

 Index Scan using idx_test_feeds on test  (cost=0.41..48.47 rows=232 width=40)        
   Index Cond: (CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text        
)) || ')'::text)) END && '010500000002000000010200000002000000000000000000F03F0000000000000040000000000000F03F00000000000059400102000000020000000000000000000040000000000000004000000000000000400000000000005940'::geometry)        
(2 rows)        

3) Search for users who visited shop 1 and shop 2 at least twice and at most 100 times.

Find the intersection between ST_LineFromText('LINESTRING(1 2, 1 100)') and ST_LineFromText('LINESTRING(2 2, 2 100)').

select * from test where           
(case when feeds='' then ST_MPointFromText('MULTIPOINT(0 0)') else ST_MPointFromText('MULTIPOINT('||replace(replace(feeds,':', ' '),';',',')||')') end)          
&&          
ST_LineFromText('LINESTRING(1 2, 1 100)')          
and          
(case when feeds='' then ST_MPointFromText('MULTIPOINT(0 0)') else ST_MPointFromText('MULTIPOINT('||replace(replace(feeds,':', ' '),';',',')||')') end)          
&&          
ST_LineFromText('LINESTRING(2 2, 2 100)');          
          
 userid |    feeds            
--------+-------------        
      5 | 1:100;2:100        
(1 row)        

Spatial indexes are used during execution.

 Bitmap Heap Scan on test  (cost=5.14..39.98 rows=46 width=40)        
   Recheck Cond: ((CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::t        
ext)) || ')'::text)) END && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry) AND (CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry E        
LSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text)) || ')'::text)) END && '0102000000020000000000000000000040000000000000004000000000000000400000000000005940'::geometry))        
   ->  Bitmap Index Scan on idx_test_feeds  (cost=0.00..5.13 rows=46 width=0)        
         Index Cond: ((CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ',        
'::text)) || ')'::text)) END && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry) AND (CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geomet        
ry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text)) || ')'::text)) END && '0102000000020000000000000000000040000000000000004000000000000000400000000000005940'::geometry)        
)        
(4 rows)        
        
或        
        
 Index Scan using idx_test_feeds on test  (cost=0.67..45.59 rows=46 width=40)        
   Index Cond: ((CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELSE st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::tex        
t)) || ')'::text)) END && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry) AND (CASE WHEN (feeds = ''::text) THEN '010400000001000000010100000000000000000000000000000000000000'::geometry ELS        
E st_mpointfromtext((('MULTIPOINT('::text || replace(replace(feeds, ':'::text, ' '::text), ';'::text, ','::text)) || ')'::text)) END && '0102000000020000000000000000000040000000000000004000000000000000400000000000005940'::geometry))        
(2 rows)     

Visual Mining Solution When PostGIS Does Not Support MultiPoint && LineString

Which PostGIS visual and geometric operations this article discusses?

When processing MultiPoint data, most geometric operations of PostGIS automatically convert the data into a closed bounding box that occupies the largest area or volume.

4

1) All parts of B are inside A, and at least one internal point of B is inside A.

ST_Contains — Returns true if and only if no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A.

boolean ST_Contains(geometry geomA, geometry geomB); -- a包含b

A is marked blue, and B is marked gray.

TRUE:

5

FALSE: Not all points are inside A.

6

2) No point of B is outside A.

ST_Covers — Returns 1 (TRUE) if no point in Geometry B is outside Geometry A

boolean ST_Covers(geometry geomA, geometry geomB);    
    
boolean ST_Covers(geography geogpolyA, geography geogpointB);    

3) No point of A is outside B.

ST_CoveredBy — Returns 1 (TRUE) if no point in Geometry/Geography A is outside Geometry/Geography B

boolean ST_CoveredBy(geometry geomA, geometry geomB);    
    
boolean ST_CoveredBy(geography geogA, geography geogB);    

4) A and B have a common part that is not the whole of A or B (that is, the common part is only a part of A or B).

ST_Crosses — Returns TRUE if the supplied geometries have some, but not all, interior points in common.

boolean ST_Crosses(geometry g1, geometry g2);

7

5) A and B have no spatial intersection.

ST_Disjoint — Returns TRUE if the Geometries do not "spatially intersect," that is if they do not share any space.

boolean ST_Disjoint( geometry A , geometry B );

6) A and B have a spatial intersection, which is opposite to ST_Disjoint.

ST_Intersects — Returns TRUE if the Geometries/Geography "spatially intersect in 2D" - (share any portion of space) and FALSE if they don't (they are Disjoint).
For geography -- tolerance is 0.00001 meters (so any points that close are considered to intersect) -- geography类型允许0.00001 meters的误差

boolean ST_Intersects( geometry geomA , geometry geomB );    
    
boolean ST_Intersects( geography geogA , geography geogB );    

7) A and B have a spatial intersection, but neither A nor B fully contain each other; that is, there must be a part outside each other.

ST_Overlaps — Returns TRUE if the Geometries share space, are of the same dimension but are not completely contained by each other.

boolean ST_Overlaps(geometry A, geometry B);

8

8) Return the operation results of A and B under the specified geometric properties, or determine whether the two geometries conform to the specified geometric properties.

ST_Relate — Returns true if this Geometry is spatially related to another Geometry by testing for intersections between the Interior, Boundary, and Exterior of the two geometries as specified by the values in intersectionMatrixPattern. If no intersectionMatrixPattern is passed in, then it returns the maximum intersectionMatrixPattern that relates the 2 geometries.

boolean ST_Relate(geometry geomA, geometry geomB, text intersectionMatrixPattern);    
    
text ST_Relate(geometry geomA, geometry geomB);    
    
text ST_Relate(geometry geomA, geometry geomB, integer BoundaryNodeRule);

The operation style is as follows: https://en.wikipedia.org/wiki/DE-9IM

9

9) Determine whether Geometry A contains Geometry B.

ST_RelateMatch — Returns true if intersectionMattrixPattern1 implies intersectionMatrixPattern2

boolean ST_RelateMatch(text intersectionMatrix, text intersectionMatrixPattern);    
    
SELECT ST_RelateMatch('101202FFF', 'TTTTTTFFF') ;    
-- result --    
t    
    
--example of common intersection matrix patterns and example matrices    
-- comparing relationships of involving one invalid geometry and ( a line and polygon that intersect at interior and boundary)    
SELECT mat.name, pat.name, ST_RelateMatch(mat.val, pat.val) As satisfied    
    FROM    
        ( VALUES ('Equality', 'T1FF1FFF1'),    
                ('Overlaps', 'T*T***T**'),    
                ('Within', 'T*F**F***'),    
                ('Disjoint', 'FF*FF****') As pat(name,val)    
        CROSS JOIN    
            (    VALUES ('Self intersections (invalid)', '111111111'),    
                    ('IE2_BI1_BB0_BE1_EI1_EE2', 'FF2101102'),    
                    ('IB1_IE1_BB0_BE0_EI2_EI1_EE2', 'F11F00212')    
            ) As mat(name,val);    

10) A and B have at least one common point but no interior intersection.

ST_Touches — Returns TRUE if the geometries have at least one point in common, but their interiors do not intersect.

boolean ST_Touches(geometry g1, geometry g2);

10

11) A is completely inside B.

ST_Within — Returns true if geometry A is completely inside geometry B

boolean ST_Within(geometry A, geometry B);

11

Prerequisites for Visual Mining

Assume that A indicates the stored MultiPoint geometry (or single point) in storage and B indicates the conditional LineString geometry that we entered.

To perform visual mining, the following statements must be true:

1) A and B have common points but no interior intersection, or

2) A is completely inside B, or

3) A and B intersect internally.

SQL Conversion

Who has visited Shop 1 for 2 to 100 times?

select * from test1 where      
(  
st_touches( ST_LineFromText('LINESTRING(1 2, 1 100)'), feeds )  
or  
st_within( feeds, ST_LineFromText('LINESTRING(1 2, 1 100)') )  
or
st_crosses( feeds, ST_LineFromText('LINESTRING(1 2, 1 100)') )
)  
;   
  
 userid |                                                 feeds                                                    
--------+--------------------------------------------------------------------------------------------------------  
      2 | 0104000000010000000101000000000000000000F03F0000000000005940  
      5 | 0104000000020000000101000000000000000000F03F0000000000005940010100000000000000000000400000000000005940  
(2 rows)  
  
  
执行计划  
  
 Bitmap Heap Scan on public.test1  (cost=12.41..17.95 rows=1 width=40) (actual time=0.166..0.209 rows=2 loops=1)
   Output: userid, feeds
   Recheck Cond: (('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry && test1.feeds) OR ('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry ~ test1.feeds) OR (test1.feeds && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry))    -- 这个实现需要重新检查
   Filter: ((('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry && test1.feeds) AND _st_touches('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry, test1.feeds)) OR (('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry ~ test1.feeds) AND _st_contains('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry, test1.feeds)) OR ((test1.feeds && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry) AND _st_crosses(test1.feeds, '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry)))   -- 重新检测过滤的条件
   Heap Blocks: exact=1
   Buffers: shared hit=4
   ->  BitmapOr  (cost=8.28..8.28 rows=1 width=0) (actual time=0.020..0.020 rows=0 loops=1)  
         Buffers: shared hit=2  
         ->  Bitmap Index Scan on idx_test1_feeds  (cost=0.00..4.14 rows=1 width=0) (actual time=0.015..0.015 rows=2 loops=1)  
               Index Cond: ('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry && test1.feeds)  -- 这里可能会有大量符合条件的记录  
               Buffers: shared hit=1  
         ->  Bitmap Index Scan on idx_test1_feeds  (cost=0.00..4.14 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=1)  
               Index Cond: ('010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry ~ test1.feeds)   -- 针对单个点的  ~ — Returns TRUE if A's bounding box contains B's.  
               Buffers: shared hit=1  
         ->  Bitmap Index Scan on idx_test1_feeds  (cost=0.00..4.14 rows=1 width=0) (actual time=0.003..0.003 rows=2 loops=1)
               Index Cond: (test1.feeds && '010200000002000000000000000000F03F0000000000000040000000000000F03F0000000000005940'::geometry)
               Buffers: shared hit=1

Performance Optimization

Theoretically, it is easy to visually distinguish between MultiPoint and LineString data. However, PostGIS currently does not provide direct index filtering for this simple visual judgment. Instead, ST_Touches and ST_Within are used. In addition, MultiPoint data is converted into LineString data, which greatly increases the probability of intersection.

For optimization, map each user and shop to a record and represent the shops and number of visits with points.

In this case, use ST_Within for judgment, and use indexes.

In this way, efficiency is almost the same as that when two scalar fields are stored.

Summary

PostGIS is widely applied to various sectors, such as civil, scientific research, and military, covering surveying and mapping, astronomy, meteorology, vision, navigation, logistics, and IoT.

Almost all visual or map frameworks support PostGIS as a default component.

Many applications can be converted into visual processing. For example, if we store the data in MultiPoint format when discovering target customers based on visits to a shop, the operation can be converted into visual processing to find the intersection.

The spatial indexes of PostGIS undoubtedly enable millisecond-level response for such applications.

Let's embrace the millisecond era of real-time marketing.

References

0 0 0
Share on

digoal

201 posts | 12 followers

You may also like

Comments