Real-time precision marketing (Audience Segmentation) - ApsaraDB RDS

Prerequisites

Before you begin, make sure you have:

An ApsaraDB RDS for PostgreSQL instance. See Create an ApsaraDB RDS for PostgreSQL instance.
An IP address whitelist configured for the RDS instance. See Configure an IP address whitelist.
A database account. See Create an account.
A database. See Create a database.

Solution 3 requires the roaringbitmap extension and PostgreSQL 12 or later. The examples in this topic use PostgreSQL 12. For extension details, see Use the roaringbitmap extension.

Background

Precision marketing systems need to query users by any combination of tags in real time — identifying, for example, all male users over 24 who browsed mobile phones in the last 24 hours but have not yet placed an order. The challenge is doing this quickly across tens of millions of users with tens of thousands of tags.

Common pain points:

Tag counts grow into the hundreds of thousands, pushing against database field limits (typically 1,000 fields per table).
Storing tags as arrays requires Generalized Inverted Index (GIN) support, which not all databases provide and which consumes significant storage.
Query condition combinations are unpredictable, so static indexes cannot cover all cases.
User profile data must stay near real-time — a user who placed an order last night should not receive ads for that same product today.

This topic describes three solutions built on PostgreSQL that address these pain points with different trade-offs.

Choose a solution

	Solution 1	Solution 2	Solution 3
Database support	PostgreSQL and MySQL	PostgreSQL only	PostgreSQL only
Schema model	One row per user per tag	One row per user, tag array	One row per tag, user bitmap
Index type	B-tree, one per tag field	GIN on tag array	B-tree on tag ID
AND query speed	1.5 s	0.042 s	0.0015 s
OR query speed	3.6 s	3 s	0.0017 s
Table storage	63,488 MB	3,126 MB	1,390 MB
Index storage	62,464 MB	3,139 MB	2 MB
Best for	Simple setups or MySQL constraints	Moderate scale with PostgreSQL	Large-scale, real-time requirements

Test environment: RDS instances running MySQL 8.0 and PostgreSQL 12, with 8 CPU cores, 32 GB of memory, and an Enhanced SSD (ESSD) of 1,500 GB. Data set: 20 million users, 100,000 tags, 64 tags per user (1.28 billion total tag records).

Use Solution 3 for production systems with millions of users and real-time query requirements. Use Solution 1 if you must support MySQL. Use Solution 2 as a middle ground when storage is more constrained than query speed.

Solution 1: One row per user per tag

This solution works with both PostgreSQL and MySQL.

Schema overview

KEY:    user ID
VALUES: one row per tag (tag 1, tag 2, ... tag N)

Index: B-tree index on each tag field
Search: WHERE tag = a AND tag = b (with INTERSECT/UNION)

Disadvantages

High storage usage: one index per tag field across 20 million users produces 62,464 MB of index storage alone.
Adding a new tag requires inserting rows for every user in the group.
Query performance degrades with more tags in the filter condition.

Procedure

Create a tag dictionary table, with one row per tag:

CREATE TABLE t_tag_dict (
  tag      int PRIMARY KEY,   -- Tag ID
  info     text,              -- Tag description
  crt_time timestamp          -- Creation time
);

Insert 100,000 tags:

INSERT INTO t_tag_dict VALUES (1, 'Male', now());
INSERT INTO t_tag_dict VALUES (2, 'Female', now());
INSERT INTO t_tag_dict VALUES (3, 'Older than 24 years old', now());
-- ... additional tags ...

INSERT INTO t_tag_dict
SELECT generate_series(4, 100000), md5(random()::text), clock_timestamp();

Create a user profile table, with one row per user per tag:

CREATE TABLE t_user_tag (
  uid      int8,       -- User ID
  tag      int,        -- Tag ID
  mod_time timestamp,  -- Last modified
  PRIMARY KEY (tag, uid)
);

Populate 20 million users with 64 random tags each (1.28 billion rows total):

CREATE OR REPLACE FUNCTION gen_rand_tag(int, int) RETURNS SETOF int AS
$$
  SELECT CASE WHEN random() > 0.5 THEN 1::int ELSE 2::int END AS tag
  UNION ALL
  SELECT ceil(random() * $1)::int AS tag FROM generate_series(1, $2);
$$ LANGUAGE sql STRICT VOLATILE;

INSERT INTO t_user_tag
SELECT uid, gen_rand_tag(100000, 63) AS tag, clock_timestamp()
FROM generate_series(1, 20000000) AS uid
ON CONFLICT (uid, tag) DO NOTHING;

Query users matching tags 1 AND 3:

-- Count matching users
SELECT count(*) FROM (
  SELECT uid FROM t_user_tag WHERE tag = 1
  INTERSECT
  SELECT uid FROM t_user_tag WHERE tag = 3
) t;
-- Time: 1,494 ms

-- Get user IDs
SELECT uid FROM t_user_tag WHERE tag = 1
INTERSECT
SELECT uid FROM t_user_tag WHERE tag = 3;
-- Time: 3,246 ms

Query users matching tag 1 OR 3 OR 10 OR 200:

-- Count matching users
SELECT count(*) FROM (
  SELECT uid FROM t_user_tag WHERE tag = 1
  UNION SELECT uid FROM t_user_tag WHERE tag = 3
  UNION SELECT uid FROM t_user_tag WHERE tag = 10
  UNION SELECT uid FROM t_user_tag WHERE tag = 200
) t;
-- Time: 3,578 ms

-- Get user IDs
SELECT uid FROM t_user_tag WHERE tag = 1
UNION SELECT uid FROM t_user_tag WHERE tag = 3
UNION SELECT uid FROM t_user_tag WHERE tag = 10
UNION SELECT uid FROM t_user_tag WHERE tag = 200;
-- Time: 5,682 ms

Solution 2: Tag arrays with GIN indexes

This solution requires PostgreSQL. MySQL does not support arrays or GIN indexes.

Schema overview

KEY:    user ID
VALUES: array of tag IDs

Index: GIN on the tag array field
Search operators: @> (AND/contains), && (OR/overlaps), NOT @> (NOT)

Disadvantages

GIN indexes consume substantially more storage than B-tree indexes for large data sets.
Adding tags to a user requires updating their row, which can be expensive at scale.
GIN writes use a deferred fastupdate mechanism: updates accumulate in a pending list and are flushed in batches. When the pending list reaches gin_pending_list_limit (default 4 MB), a flush occurs during a write operation, causing periodic latency spikes. For real-time user profile update workloads, monitor this behavior and consider tuning gin_pending_list_limit or disabling fastupdate if write latency consistency is critical.

Procedure

Create a tag dictionary table (same as Solution 1):

CREATE TABLE t_tag_dict (
  tag      int PRIMARY KEY,
  info     text,
  crt_time timestamp
);

Insert 100,000 tags:

INSERT INTO t_tag_dict VALUES (1, 'Male', now());
INSERT INTO t_tag_dict VALUES (2, 'Female', now());
INSERT INTO t_tag_dict VALUES (3, 'Older than 24 years old', now());

INSERT INTO t_tag_dict
SELECT generate_series(4, 100000), md5(random()::text), clock_timestamp();

Create a user profile table with one row per user and a tag array:

CREATE TABLE t_user_tags (
  uid      int8 PRIMARY KEY,  -- User ID
  tags     int[],             -- Array of tag IDs
  mod_time timestamp
);

Create a helper function to generate random tag arrays:

CREATE OR REPLACE FUNCTION gen_rand_tags(int, int) RETURNS int[] AS $$
  SELECT array_agg(ceil(random() * $1)::int) FROM generate_series(1, $2);
$$ LANGUAGE sql STRICT;

Tag 20 million users with 64 random tags each (10 million male, 10 million female):

INSERT INTO t_user_tags
SELECT generate_series(1, 10000000),
  array_append(gen_rand_tags(100000, 63), 1), now();

INSERT INTO t_user_tags
SELECT generate_series(10000001, 20000000),
  array_append(gen_rand_tags(100000, 63), 2), now();

Create a GIN index on the tag array:

CREATE INDEX idx_t_user_tags_1 ON t_user_tags USING gin (tags);
-- Index creation time: ~20 minutes

Query users matching tags 1 AND 3:

-- Count matching users
SELECT count(uid) FROM t_user_tags WHERE tags @> ARRAY[1, 3];

-- Get user IDs
SELECT uid FROM t_user_tags WHERE tags @> ARRAY[1, 3];

Query users matching tag 1 OR 3 OR 10 OR 200:

-- Count matching users
SELECT count(uid) FROM t_user_tags WHERE tags && ARRAY[1, 3, 10, 200];

-- Get user IDs
SELECT uid FROM t_user_tags WHERE tags && ARRAY[1, 3, 10, 200];

Solution 3: Roaring bitmap aggregation

This solution requires PostgreSQL with the roaringbitmap extension. MySQL does not support this extension.

How it works

The roaring bitmap model inverts the schema: instead of storing which tags a user has, it stores — for each tag — a compressed bitmap of all users who carry that tag. Think of a roaringbitmap as a very large bit string where each bit position represents a user ID; if bit *n* is set, user *n* has the tag.

The Roaring Bitmap algorithm divides 32-bit integers into 2^16 chunks based on the high 16 bits, and stores the low 16 bits in a container. Two container types are used automatically: an array container for sparse data (fewer than 4,096 integers) and a bitmap container for dense data. This adaptive structure is why the index shrinks from 62,464 MB in Solution 1 to just 2 MB here, and why AND/OR queries complete in under 2 milliseconds.

Schema overview

KEY:    tag ID
VALUES: compressed bitmap of user IDs

Index: B-tree on tag ID
Operations: rb_and_agg / rb_and_cardinality_agg (AND), rb_or_agg / rb_or_cardinality_agg (OR), rb_build_agg (build bitmap)

Advantages

Minimal storage: one B-tree index with one entry per tag (at most a few hundred thousand entries).
Adding a tag to a user group only requires inserting or updating one bitmap row — no mass row updates.
AND or OR queries run in under 2 milliseconds on 20 million users.

Limitations

User IDs must be integers. If your system uses non-integer IDs, create a mapping table.
The standard roaringbitmap type supports 32-bit integers (up to ~4 billion user IDs). If your user IDs exceed 4 billion, use the uid_offset approach described below.

Procedure

For more information about the roaringbitmap extension, see pg_roaringbitmap. For handling UID overflow, see Troubleshooting for UID overflow.

Install the roaringbitmap extension:
```
CREATE EXTENSION roaringbitmap;
```

Create a bitmap table that maps each tag to the set of users who have it:

CREATE TABLE t_tag_users (
  tagid      int PRIMARY KEY,   -- Tag ID
  uid_offset int,               -- Offset bucket (converts INT8 user IDs to INT4 range)
  userbits   roaringbitmap,     -- Compressed bitmap of user IDs
  mod_time   timestamp
);

Populate the bitmap table from the t_user_tags table created in Solution 2:

INSERT INTO t_tag_users
SELECT tagid, uid_offset, rb_build_agg(uid::int) AS userbits
FROM (
  SELECT
    unnest(tags) AS tagid,
    (uid / (2^31)::int8) AS uid_offset,   -- High bits as offset bucket
    mod(uid, (2^31)::int8) AS uid          -- Low bits as bitmap position
  FROM t_user_tags
) t
GROUP BY tagid, uid_offset;

Query users matching tags 1 AND 3:

-- Count matching users
SELECT sum(ub) FROM (
  SELECT uid_offset, rb_and_cardinality_agg(userbits) AS ub
  FROM t_tag_users
  WHERE tagid IN (1, 3)
  GROUP BY uid_offset
) t;
-- Time: 1.5 ms

-- Get the bitmap result (user IDs reconstructed from offset + bitmap)
SELECT uid_offset, rb_and_agg(userbits) AS ub
FROM t_tag_users
WHERE tagid IN (1, 3)
GROUP BY uid_offset;

Query users matching tag 1 OR 3 OR 10 OR 200:

-- Count matching users
SELECT sum(ub) FROM (
  SELECT uid_offset, rb_or_cardinality_agg(userbits) AS ub
  FROM t_tag_users
  WHERE tagid IN (1, 3, 10, 200)
  GROUP BY uid_offset
) t;
-- Time: 1.7 ms

-- Get the bitmap result
SELECT uid_offset, rb_or_agg(userbits) AS ub
FROM t_tag_users
WHERE tagid IN (1, 3, 10, 200)
GROUP BY uid_offset;

Summary

ApsaraDB RDS for PostgreSQL 12 and later supports the roaringbitmap extension, which enables efficient bitmap generation, compression, and aggregation with AND, OR, NOT, and XOR operations. For precision marketing systems with hundreds of millions of users and tens of millions of tags, Solution 3 delivers query times under 2 milliseconds and index storage under 2 MB — meeting both the speed and scale requirements that simpler schema models cannot satisfy.

ApsaraDB RDS:Real-time precision marketing (Audience Segmentation)

Prerequisites

Background

Choose a solution

Solution 1: One row per user per tag

Solution 2: Tag arrays with GIN indexes

Solution 3: Roaring bitmap aggregation

Summary

What's next