All Products
Search
Document Center

ApsaraDB RDS:Implement real-time precision marketing and user group selection by using an ApsaraDB RDS for PostgreSQL instance

Last Updated:Dec 22, 2023

This topic describes how to use an ApsaraDB RDS for PostgreSQL instance to implement real-time precision marketing and identify intended users based on the characteristics of the users.

Prerequisites

  • An RDS instance is created. For more information, see Create an ApsaraDB RDS for PostgreSQL instance.

    Note

    This topic describes three solutions to implement real-time precision marketing and identify intended users. If you use Solution 3, the roaringbitmap extension is required, and you must make sure that your RDS instance runs PostgreSQL 12 or later. In this topic, PostgreSQL 12 is used. For more information about the roaringbitmap extension, see Use the roaringbitmap extension.

  • An IP address whitelist is configured for the RDS instance. For more information, see Configure an IP address whitelist.

  • An account is created on the RDS instance. For more information, see Create an account.

  • A database is created on the RDS instance. For more information, see Create a database.

Background information

Real-time precision marketing is required in most industries, such as the Internet, gaming, and education industries. Enterprises in these industries need to use a system to generate user profiles and identify intended user groups based on any combinations of industry-specific conditions. Examples:

  • In the e-commerce industry, before a merchant launches a promotion, the merchant identifies a group of intended users based on their characteristics, selects the users, and pushes advertisements to the users or determines whether the intended users can participate in the promotion.

  • In the gaming industry, bonus packages are often precisely distributed based on specific characteristics of players to increase active players.

  • In the education industry, various tailored exercises are pushed based on the characteristics of students to improve learning skills of the students.

  • In enterprises that are involved in online searching, portals, and video websites, content is pushed based on the concerns and recent behaviors of users.

These industries share the following pain points:

  • The huge volume of data requires a large number of computations.

  • A large number of tags and fields exist. In this case, a large number of storage resources are consumed.

  • The number of fields in a database may exceed the upper limit. In most cases, a database can contain up to 1,000 fields.

  • If you use arrays instead of multiple fields to store tags, the database must support Generalized Inverted Indexes (GINs). However, not all databases support GINs.

  • If you use arrays instead of multiple fields to store tags and use GINs, the consumption of storage resources is increased dramatically.

  • The combinations of selection conditions are diverse. Therefore, no fixed indexes are available to facilitate user selection. In this case, if you create an index for each field, the consumption of storage resources is increased dramatically.

  • High performance is required for quick responses of real-time marketing.

  • Quasi-real-time updates of user profiles are required to select users with precision. If a quasi-real-time update is not possible, the user selection result may be inaccurate. For example, if a user browsed mobile phones yesterday and placed an order later at night, but profiling data is not updated, the user is selected for sellers of mobile phones. However, the user is no longer in the intended user group.

Common database engines, such as MySQL, provide limited resources and cannot meet the requirements for real-time user group selection.

You can use one of the following PostgreSQL-based solutions to implement real-time precision marketing.

Solutions

Solution 1

Note

This solution is supported by both PostgreSQL and MySQL.

Overview

  • Schema:

    KEY: user ID
    Tag 1: 
    Tag 2: 
    ...
    Tag N:
  • Index: one index for each tag field

  • Search method:

    Combination of AND, OR, and NOT
    where Tag a and Tag b and ...
  • Disadvantages:

    • A large number of tags and fields exist. In this case, a large number of storage resources are consumed.

    • The number of fields in a database may exceed the upper limit. In most cases, a database can contain up to 1,000 fields. To resolve this issue, you can use a many-to-many schema to maintain one entry for each tag.

    • The combinations of selection conditions are diverse. Therefore, no fixed indexes are available to facilitate user selection. In this case, if you create an index for each field, the consumption of storage resources is increased dramatically.

    • A large amount of data must be updated when new group tags are added.

    • The query performance is poor.

Procedure

  1. Create a user group table, with each entry representing a user group. Sample code:

    create table t_tag_dict (
    tag int primary key,   -- The ID of the tag that identifies a user group.
    info text,  -- The user group description.
    crt_time timestamp  -- The time.
    ); 
  2. Create 100,000 user group tags. Sample code:

    insert into t_tag_dict values (1, 'Male', now());
    insert into t_tag_dict values (2, 'Female', now());
    insert into t_tag_dict values (3, 'Older than 24 years old', now());
    -- ...
    
    insert into t_tag_dict 
    select generate_series(4,100000), md5(random()::text), clock_timestamp();
  3. Create a user profile table, with each entry of a user representing a tag of the user. Sample code:

    create table t_user_tag (
    uid int8,   -- The user ID.
    tag int,            -- The tag of the user. A tag identifies a user group.
    mod_time timestamp,     -- The time.
    primary key (tag,uid)
    ); 
  4. Set 64 random tags for each of the 10 million male users and 10 million female users. A total of 1.28 billion entries are generated. Sample code:

    create or replace function gen_rand_tag(int,int) returns setof int as
    $$
      select case when random() > 0.5 then 1::int else 2::int end as tag
      union all
      select ceil(random()*$1)::int as tag from generate_series(1,$2);
    $$ language sql strict volatile;
    
    insert into t_user_tag
    select uid, gen_rand_tag(100000,63) as tag, clock_timestamp() 
    from generate_series(1,20000000) as uid on conflict (uid,tag) do nothing;
    
    -- You can also use the following method to import tags:
    create sequence seq;
    
    vi test.sql
    insert into t_user_tag
    select uid, gen_rand_tag(100000,63) as tag, clock_timestamp() 
    from nextval('seq'::regclass) as uid 
    on conflict(tag,uid) do nothing;
    
    pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 400000
  5. Query users who match tags 1 and 3. Sample code:

    -- Query the number of user groups.
    select count(*) from 
    (
    select uid from t_user_tag where tag=1 
    intersect
    select uid from t_user_tag where tag=3
    ) t;
    -- Time: 1494.789 ms (00:01.495)
    
    -- Obtain the user group ID.
    select uid from t_user_tag where tag=1 
    intersect
    select uid from t_user_tag where tag=3;
    -- Time: 3246.184 ms (00:03.246)
  6. Query users who match tags 1, 3, 10, or 200. Sample code:

    -- Query the number of user groups.
    select count(*) from 
    (
    select uid from t_user_tag where tag=1 
    union
    select uid from t_user_tag where tag=3
    union
    select uid from t_user_tag where tag=10
    union
    select uid from t_user_tag where tag=200
    ) t;
    -- Time: 3577.714 ms (00:03.578)
    
    -- Obtain the user group ID.
    select uid from t_user_tag where tag=1 
    union
    select uid from t_user_tag where tag=3
    union
    select uid from t_user_tag where tag=10
    union
    select uid from t_user_tag where tag=200;
    -- Time: 5682.458 ms (00:05.682)

Solution 2

Note

Solution 2 is supported only by PostgreSQL. MySQL does not support arrays or GINs.

Overview

  • Schema:

    KEY: user ID
    VALUES: array of tags
  • Index:

    Tag array field: GIN
  • Search method:

    AND, OR, and NOT
    where VALUES @> array[Tags] -- AND
    where VALUES && array[Tags] -- OR
    where not VALUES @> array [Tags] -- NOT
  • Disadvantages:

    • If you use arrays instead of multiple fields to store tags, the database must support GINs. However, not all databases support GINs.

    • If you use arrays instead of multiple fields to store tags and use GINs, the consumption of storage resources is increased dramatically.

    • A large amount of data must be updated when new group tags are added.

Procedure

  1. Create a user group table, with each entry representing a user group. Sample code:

    create table t_tag_dict (
    tag int primary key,   -- The ID of the tag that identifies a user group.
    info text,  -- The user group description.
    crt_time timestamp  -- The time.
    ); 
  2. Create 100,000 user group tags. Sample code:

    insert into t_tag_dict values (1, 'Male', now());
    insert into t_tag_dict values (2, 'Female', now());
    insert into t_tag_dict values (3, 'Older than 24 years old', now());
    -- ...
    
    insert into t_tag_dict 
    select generate_series(4,100000), md5(random()::text), clock_timestamp();
  3. Create a user profile table, with the single entry of each user representing an array of tags of the user. Sample code:

    create table t_user_tags (
    uid int8 primary key,   -- The user ID.
    tags int[],            -- The array of tags of the user. A tag identifies a user group.
    mod_time timestamp     -- The time.
    ); 
  4. Create a function to generate a random array of tags. Sample code:

    create or replace function gen_rand_tags(int,int) returns int[] as $$
      select array_agg(ceil(random()*$1)::int) from generate_series(1,$2);
    $$ language sql strict;
  5. Randomly select 8 tags from 100,000 tags. Sample code:

    select gen_rand_tags(100000, 8);
    
                       gen_rand_tags                   
    ---------------------------------------------------
     {43494,46038,74102,25308,99129,40893,33653,29690}
    (1 row)
  6. Tag 20 million users. Each user has 64 random tags. Half of the users are males and the other half are females. Sample code:

    insert into t_user_tags 
    select generate_series(1,10000000), 
    array_append(gen_rand_tags(100000, 63),1), now();
    
    insert into t_user_tags 
    select generate_series(10000001,20000000), 
    array_append(gen_rand_tags(100000, 63),2), now();
  7. Create a GIN for the group tag fields. Sample code:

    create index idx_t_user_tags_1 on t_user_tags using gin (tags);
  8. Query users who match tags 1 and 3. Sample code:

    -- Query the number of user groups.
    select count(uid) from t_user_tags where tags @> array[1,3];
    
    -- Obtain the user group ID.
    select uid from t_user_tags where tags @> array[1,3];
  9. Query users who match tags 1, 3, 10, or 200. Sample code:

    -- Query the number of user groups.
    select count(uid) from t_user_tags where tags && array[1,3,10,200];
    
    -- Obtain the user group ID.
    select uid from t_user_tags where tags && array[1,3,10,200];

Solution 3

Note

Solution 3 is supported only by PostgreSQL. MySQL does not support the roaringbitmap extension.

Solution 3 uses the roaringbitmap extension to implement fast queries. For more information, see Use the roaringbitmap extension.

Overview

  • Schema:

    KEY: tag ID
    VALUES: user bitmap
  • Index:

    Tag ID field: B-tree index
  • Search method:

    Aggregate bitmap: AND, OR, and NOT
    and_agg(bitmaps) where KEY in (Tags) -- AND
    or_agg(bitmaps) where KEY in (Tags) -- OR
    except(bitmap1,bitmap2) -- NOT 
  • Advantages:

    • Tables consume a small number of storage resources.

    • Indexes consume a small number of storage resources. Only one B-tree index is required to store index entries. The number of index entries is equal to the number of tags. In most cases, the number of tags is less than 1 million.

    • If you want to add a tag to a user group, you need to only add a group bitmap entry. In this case, you do not need to update a large amount of data.

    • The query performance is excellent.

  • Disadvantages:

    • The maximum length of a bitmap is 1 GB. If the number of users exceeds the maximum length, you must use an offset.

      offset0_bitmap, offset1gb_bitmap, ... 
    • A user ID must consist of digits, and consecutive numbers are recommended. If no numeric user IDs exist, you must create a mapping table.

Procedure

Note
  • When the number of user IDs exceeds 4 billion (INT4), you can use an offset to convert user IDs. For more information, see Troubleshooting for UID overflow.

  • For more information about how to use the roaringbitmap extension, see pg_roaringbitmap.

  1. Install the roaringbitmap extension. Sample code:

    create extension roaringbitmap;
  2. Create a bitmap table that contains user tags. Sample code:

    create table t_tag_users (
      tagid int primary key,   -- The tag ID or user group ID. 
      uid_offset int,          -- Convert the user ID from INT8 to INT4.    
      userbits roaringbitmap,     -- The bitmap of user IDs.  
      mod_time timestamp       -- The time. 
    );
  3. Insert data to generate a bitmap table that contains user ID tags. Sample code:

    insert into t_tag_users 
    select tagid, uid_offset, rb_build_agg(uid::int) as userbits from 
    (
    select 
      unnest(tags) as tagid, 
      (uid / (2^31)::int8) as uid_offset, 
      mod(uid, (2^31)::int8) as uid 
    from t_user_tags 
    ) t 
    group by tagid, uid_offset; 
  4. Query users who match tags 1 and 3. Sample code:

    -- Query the number of user groups.
    select sum(ub) from 
    (
    select uid_offset,rb_and_cardinality_agg(userbits) as ub 
    from t_tag_users 
    where tagid in (1,3) 
    group by uid_offset
    ) t;
    
    -- Obtain the user group ID.
    select uid_offset,rb_and_agg(userbits) as ub 
    from t_tag_users 
    where tagid in (1,3) 
    group by uid_offset;
  5. Query users who match tags 1, 3, 10, or 200. Sample code:

    -- Query the number of user groups.
    select sum(ub) from 
    (
    select uid_offset,rb_or_cardinality_agg(userbits) as ub 
    from t_tag_users 
    where tagid in (1,3,10,200) 
    group by uid_offset
    ) t;
    
    -- Obtain the user group ID.
    select uid_offset,rb_or_agg(userbits) as ub 
    from t_tag_users 
    where tagid in (1,3,10,200) 
    group by uid_offset;

Solution comparison

Item

Solution 1

(MySQL and PostgreSQL)

Solution 2

(PostgreSQL)

Solution 3

(PostgreSQL)

Advantages of Solution 3 over Solution 1

User selection speed of an AND query

1.5 seconds

0.042 seconds

0.0015 seconds

99900%

User selection speed of an OR query

3.6 seconds

3 seconds

0.0017 seconds

211665%

Storage used by tables

63,488 MB

3,126 MB

1390MB

4467%

Storage used by indexes

62,464 MB

3139 MB

2MB

3123100%

Index creation speed

N/A

20 minutes

Extremely fast (about 0 seconds)

N/A

Note

RDS instances that run MySQL 8.0 and PostgreSQL 12 are used in the preceding solutions. These instances have 8 CPU cores and 32 GB of memory and use an enhanced SSD (ESSD) of 1,500 GB.

Summary

RDS instances that run PostgreSQL 12 or later support the roaringbitmap extension. This extension allows you to generate, compress, or parse bitmap data in an efficient manner. This extension also supports the most common bitmap aggregation operations, such as AND, OR, NOT, and XOR. It meets the requirements of real-time precision marketing among hundreds of millions of users for which tens of millions tags are added.

Compared with the MySQL-based solution, the PostgreSQL-based solution is more cost-effective and efficient.