Tag-based user group selection is a core requirement across e-commerce, gaming, education, and content platforms. At scale — tens of millions of users, hundreds of thousands of tags — standard database approaches fail on storage, query speed, or both. This topic compares three PostgreSQL schema models for querying users by tag combinations and explains when to use each approach.
Prerequisites
Before you begin, make sure you have:
-
An ApsaraDB RDS for PostgreSQL instance. See Create an ApsaraDB RDS for PostgreSQL instance.
-
An IP address whitelist configured for the RDS instance. See Configure an IP address whitelist.
-
A database account. See Create an account.
-
A database. See Create a database.
Solution 3 requires the roaringbitmap extension and PostgreSQL 12 or later. The examples in this topic use PostgreSQL 12. For extension details, see Use the roaringbitmap extension.
Background
Precision marketing systems need to query users by any combination of tags in real time — identifying, for example, all male users over 24 who browsed mobile phones in the last 24 hours but have not yet placed an order. The challenge is doing this quickly across tens of millions of users with tens of thousands of tags.
Common pain points:
-
Tag counts grow into the hundreds of thousands, pushing against database field limits (typically 1,000 fields per table).
-
Storing tags as arrays requires Generalized Inverted Index (GIN) support, which not all databases provide and which consumes significant storage.
-
Query condition combinations are unpredictable, so static indexes cannot cover all cases.
-
User profile data must stay near real-time — a user who placed an order last night should not receive ads for that same product today.
This topic describes three solutions built on PostgreSQL that address these pain points with different trade-offs.
Choose a solution
| Solution 1 | Solution 2 | Solution 3 | |
|---|---|---|---|
| Database support | PostgreSQL and MySQL | PostgreSQL only | PostgreSQL only |
| Schema model | One row per user per tag | One row per user, tag array | One row per tag, user bitmap |
| Index type | B-tree, one per tag field | GIN on tag array | B-tree on tag ID |
| AND query speed | 1.5 s | 0.042 s | 0.0015 s |
| OR query speed | 3.6 s | 3 s | 0.0017 s |
| Table storage | 63,488 MB | 3,126 MB | 1,390 MB |
| Index storage | 62,464 MB | 3,139 MB | 2 MB |
| Best for | Simple setups or MySQL constraints | Moderate scale with PostgreSQL | Large-scale, real-time requirements |
Test environment: RDS instances running MySQL 8.0 and PostgreSQL 12, with 8 CPU cores, 32 GB of memory, and an Enhanced SSD (ESSD) of 1,500 GB. Data set: 20 million users, 100,000 tags, 64 tags per user (1.28 billion total tag records).
Use Solution 3 for production systems with millions of users and real-time query requirements. Use Solution 1 if you must support MySQL. Use Solution 2 as a middle ground when storage is more constrained than query speed.
Solution 1: One row per user per tag
This solution works with both PostgreSQL and MySQL.
Schema overview
KEY: user ID
VALUES: one row per tag (tag 1, tag 2, ... tag N)
-
Index: B-tree index on each tag field
-
Search:
WHERE tag = a AND tag = b(with INTERSECT/UNION)
Disadvantages
-
High storage usage: one index per tag field across 20 million users produces 62,464 MB of index storage alone.
-
Adding a new tag requires inserting rows for every user in the group.
-
Query performance degrades with more tags in the filter condition.
Procedure
-
Create a tag dictionary table, with one row per tag:
CREATE TABLE t_tag_dict ( tag int PRIMARY KEY, -- Tag ID info text, -- Tag description crt_time timestamp -- Creation time ); -
Insert 100,000 tags:
INSERT INTO t_tag_dict VALUES (1, 'Male', now()); INSERT INTO t_tag_dict VALUES (2, 'Female', now()); INSERT INTO t_tag_dict VALUES (3, 'Older than 24 years old', now()); -- ... additional tags ... INSERT INTO t_tag_dict SELECT generate_series(4, 100000), md5(random()::text), clock_timestamp(); -
Create a user profile table, with one row per user per tag:
CREATE TABLE t_user_tag ( uid int8, -- User ID tag int, -- Tag ID mod_time timestamp, -- Last modified PRIMARY KEY (tag, uid) ); -
Populate 20 million users with 64 random tags each (1.28 billion rows total):
CREATE OR REPLACE FUNCTION gen_rand_tag(int, int) RETURNS SETOF int AS $$ SELECT CASE WHEN random() > 0.5 THEN 1::int ELSE 2::int END AS tag UNION ALL SELECT ceil(random() * $1)::int AS tag FROM generate_series(1, $2); $$ LANGUAGE sql STRICT VOLATILE; INSERT INTO t_user_tag SELECT uid, gen_rand_tag(100000, 63) AS tag, clock_timestamp() FROM generate_series(1, 20000000) AS uid ON CONFLICT (uid, tag) DO NOTHING; -
Query users matching tags 1 AND 3:
-- Count matching users SELECT count(*) FROM ( SELECT uid FROM t_user_tag WHERE tag = 1 INTERSECT SELECT uid FROM t_user_tag WHERE tag = 3 ) t; -- Time: 1,494 ms -- Get user IDs SELECT uid FROM t_user_tag WHERE tag = 1 INTERSECT SELECT uid FROM t_user_tag WHERE tag = 3; -- Time: 3,246 ms -
Query users matching tag 1 OR 3 OR 10 OR 200:
-- Count matching users SELECT count(*) FROM ( SELECT uid FROM t_user_tag WHERE tag = 1 UNION SELECT uid FROM t_user_tag WHERE tag = 3 UNION SELECT uid FROM t_user_tag WHERE tag = 10 UNION SELECT uid FROM t_user_tag WHERE tag = 200 ) t; -- Time: 3,578 ms -- Get user IDs SELECT uid FROM t_user_tag WHERE tag = 1 UNION SELECT uid FROM t_user_tag WHERE tag = 3 UNION SELECT uid FROM t_user_tag WHERE tag = 10 UNION SELECT uid FROM t_user_tag WHERE tag = 200; -- Time: 5,682 ms
Solution 2: Tag arrays with GIN indexes
This solution requires PostgreSQL. MySQL does not support arrays or GIN indexes.
Schema overview
KEY: user ID
VALUES: array of tag IDs
-
Index: GIN on the tag array field
-
Search operators:
@>(AND/contains),&&(OR/overlaps),NOT @>(NOT)
Disadvantages
-
GIN indexes consume substantially more storage than B-tree indexes for large data sets.
-
Adding tags to a user requires updating their row, which can be expensive at scale.
-
GIN writes use a deferred
fastupdatemechanism: updates accumulate in a pending list and are flushed in batches. When the pending list reachesgin_pending_list_limit(default 4 MB), a flush occurs during a write operation, causing periodic latency spikes. For real-time user profile update workloads, monitor this behavior and consider tuninggin_pending_list_limitor disablingfastupdateif write latency consistency is critical.
Procedure
-
Create a tag dictionary table (same as Solution 1):
CREATE TABLE t_tag_dict ( tag int PRIMARY KEY, info text, crt_time timestamp ); -
Insert 100,000 tags:
INSERT INTO t_tag_dict VALUES (1, 'Male', now()); INSERT INTO t_tag_dict VALUES (2, 'Female', now()); INSERT INTO t_tag_dict VALUES (3, 'Older than 24 years old', now()); INSERT INTO t_tag_dict SELECT generate_series(4, 100000), md5(random()::text), clock_timestamp(); -
Create a user profile table with one row per user and a tag array:
CREATE TABLE t_user_tags ( uid int8 PRIMARY KEY, -- User ID tags int[], -- Array of tag IDs mod_time timestamp ); -
Create a helper function to generate random tag arrays:
CREATE OR REPLACE FUNCTION gen_rand_tags(int, int) RETURNS int[] AS $$ SELECT array_agg(ceil(random() * $1)::int) FROM generate_series(1, $2); $$ LANGUAGE sql STRICT; -
Tag 20 million users with 64 random tags each (10 million male, 10 million female):
INSERT INTO t_user_tags SELECT generate_series(1, 10000000), array_append(gen_rand_tags(100000, 63), 1), now(); INSERT INTO t_user_tags SELECT generate_series(10000001, 20000000), array_append(gen_rand_tags(100000, 63), 2), now(); -
Create a GIN index on the tag array:
CREATE INDEX idx_t_user_tags_1 ON t_user_tags USING gin (tags); -- Index creation time: ~20 minutes -
Query users matching tags 1 AND 3:
-- Count matching users SELECT count(uid) FROM t_user_tags WHERE tags @> ARRAY[1, 3]; -- Get user IDs SELECT uid FROM t_user_tags WHERE tags @> ARRAY[1, 3]; -
Query users matching tag 1 OR 3 OR 10 OR 200:
-- Count matching users SELECT count(uid) FROM t_user_tags WHERE tags && ARRAY[1, 3, 10, 200]; -- Get user IDs SELECT uid FROM t_user_tags WHERE tags && ARRAY[1, 3, 10, 200];
Solution 3: Roaring bitmap aggregation
This solution requires PostgreSQL with the roaringbitmap extension. MySQL does not support this extension.
How it works
The roaring bitmap model inverts the schema: instead of storing which tags a user has, it stores — for each tag — a compressed bitmap of all users who carry that tag. Think of a roaringbitmap as a very large bit string where each bit position represents a user ID; if bit *n* is set, user *n* has the tag.
The Roaring Bitmap algorithm divides 32-bit integers into 2^16 chunks based on the high 16 bits, and stores the low 16 bits in a container. Two container types are used automatically: an array container for sparse data (fewer than 4,096 integers) and a bitmap container for dense data. This adaptive structure is why the index shrinks from 62,464 MB in Solution 1 to just 2 MB here, and why AND/OR queries complete in under 2 milliseconds.
Schema overview
KEY: tag ID
VALUES: compressed bitmap of user IDs
-
Index: B-tree on tag ID
-
Operations:
rb_and_agg/rb_and_cardinality_agg(AND),rb_or_agg/rb_or_cardinality_agg(OR),rb_build_agg(build bitmap)
Advantages
-
Minimal storage: one B-tree index with one entry per tag (at most a few hundred thousand entries).
-
Adding a tag to a user group only requires inserting or updating one bitmap row — no mass row updates.
-
AND or OR queries run in under 2 milliseconds on 20 million users.
Limitations
-
User IDs must be integers. If your system uses non-integer IDs, create a mapping table.
-
The standard
roaringbitmaptype supports 32-bit integers (up to ~4 billion user IDs). If your user IDs exceed 4 billion, use theuid_offsetapproach described below.
Procedure
For more information about the roaringbitmap extension, see pg_roaringbitmap. For handling UID overflow, see Troubleshooting for UID overflow.
-
Install the
roaringbitmapextension:CREATE EXTENSION roaringbitmap; -
Create a bitmap table that maps each tag to the set of users who have it:
CREATE TABLE t_tag_users ( tagid int PRIMARY KEY, -- Tag ID uid_offset int, -- Offset bucket (converts INT8 user IDs to INT4 range) userbits roaringbitmap, -- Compressed bitmap of user IDs mod_time timestamp ); -
Populate the bitmap table from the
t_user_tagstable created in Solution 2:INSERT INTO t_tag_users SELECT tagid, uid_offset, rb_build_agg(uid::int) AS userbits FROM ( SELECT unnest(tags) AS tagid, (uid / (2^31)::int8) AS uid_offset, -- High bits as offset bucket mod(uid, (2^31)::int8) AS uid -- Low bits as bitmap position FROM t_user_tags ) t GROUP BY tagid, uid_offset; -
Query users matching tags 1 AND 3:
-- Count matching users SELECT sum(ub) FROM ( SELECT uid_offset, rb_and_cardinality_agg(userbits) AS ub FROM t_tag_users WHERE tagid IN (1, 3) GROUP BY uid_offset ) t; -- Time: 1.5 ms -- Get the bitmap result (user IDs reconstructed from offset + bitmap) SELECT uid_offset, rb_and_agg(userbits) AS ub FROM t_tag_users WHERE tagid IN (1, 3) GROUP BY uid_offset; -
Query users matching tag 1 OR 3 OR 10 OR 200:
-- Count matching users SELECT sum(ub) FROM ( SELECT uid_offset, rb_or_cardinality_agg(userbits) AS ub FROM t_tag_users WHERE tagid IN (1, 3, 10, 200) GROUP BY uid_offset ) t; -- Time: 1.7 ms -- Get the bitmap result SELECT uid_offset, rb_or_agg(userbits) AS ub FROM t_tag_users WHERE tagid IN (1, 3, 10, 200) GROUP BY uid_offset;
Summary
ApsaraDB RDS for PostgreSQL 12 and later supports the roaringbitmap extension, which enables efficient bitmap generation, compression, and aggregation with AND, OR, NOT, and XOR operations. For precision marketing systems with hundreds of millions of users and tens of millions of tags, Solution 3 delivers query times under 2 milliseconds and index storage under 2 MB — meeting both the speed and scale requirements that simpler schema models cannot satisfy.