Build a tag-based user preference recommendation system using the HLL (HyperLogLog) plug-in for ApsaraDB RDS for PostgreSQL. This guide covers the HLL-based design, shows how to store and query user listening history with millisecond-level response times, and compares the approach with a conventional design.
Prerequisites
Before you begin, make sure you have:
-
A database account and a database
-
The hll plug-in installed (a PostgreSQL extension for approximate cardinality estimation using HyperLogLog)
How the recommendation logic works
The recommendation system follows four steps using a music platform as the working example:
-
When a user (UID) listens to a song (vid), the system binds tags to that song. A song can have multiple tags, forming the mapping:
uid ->> tags ->> songs -
The system ranks tags for each user by the number of distinct songs associated with each tag.
-
The system identifies the top 5 tags and their relative weights — for example, tag1: 40%, tag2: 20%, tag3: 15%, tag4: 15%, tag5: 10%.
-
The system excludes songs the user has already heard, then recommends new songs from each tag's pool ordered by playback count.
This logic applies equally to e-commerce (purchase history), news (browsing habits), and app stores (download patterns).
Choose an approach: HLL-based vs. conventional
| Dimension | HLL-based design | Conventional design |
|---|---|---|
| Storage | Compact — stores approximate hash aggregations, not raw records | Full — stores every (uid, tagid, vid) row |
| Query performance | ~0.7 ms with index support, no runtime aggregation | ~4 ms requires GROUP BY aggregation at query time |
| Sliding window support | Yes — hll_union and add operations enable time-window queries |
Limited — requires timestamp filtering on large tables |
| Accuracy | Approximate (HLL trades precision for speed) | Exact |
| Best for | High-traffic platforms where speed and scale matter | Lower-traffic use cases requiring exact counts |
Use the HLL-based design when query latency and storage efficiency are priorities. Use the conventional design when exact counts are required and data volume is manageable.
Build the HLL-based recommendation system
Step 1: Create the user-tag listening table
Connect to your ApsaraDB RDS for PostgreSQL instance and create the t_like table. For each (user, tag) pair, the table stores one HLL per day of the week — each HLL holds the hash values of song IDs (vids) the user listened to on that day.
CREATE TABLE t_like (
uid INT,
tagid INT, -- Tag
w1 hll, w1_mod_time TIMESTAMP, -- Hash of vids listened to on Monday
w2 hll, w2_mod_time TIMESTAMP, -- Tuesday
w3 hll, w3_mod_time TIMESTAMP, -- Wednesday
w4 hll, w4_mod_time TIMESTAMP, -- Thursday
w5 hll, w5_mod_time TIMESTAMP, -- Friday
w6 hll, w6_mod_time TIMESTAMP, -- Saturday
w7 hll, w7_mod_time TIMESTAMP, -- Sunday
whole hll, -- Full listening history
PRIMARY KEY (uid, tagid)
);
The w1–w7 fields are optional. If you only need same-day data, use a single daily field instead of all seven.
Step 2: Record listening events
When a user listens to a song, write the event to the field for the current day of the week. The upsert logic handles two cases:
-
New day: the existing day's HLL is overwritten with the new hash.
-
Same day: the new hash is merged into the existing HLL using
hll_union.
The whole column always accumulates — it is never overwritten.
-- Record that user 1 listened to song 12346 (Friday, tagid 200)
INSERT INTO t_like (uid, tagid, w5, w5_mod_time, whole)
VALUES (
1, -- UID
200, -- Tag ID
hll_hash_integer(12346) || hll_empty(), -- Hash of the song vid
NOW(),
hll_hash_integer(12346) || hll_empty() -- Also add to full history
)
ON CONFLICT (uid, tagid) DO UPDATE
SET
w5 =
CASE
WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
THEN EXCLUDED.w5 -- New day: overwrite
ELSE hll_union(COALESCE(t_like.w5, hll_empty()),
EXCLUDED.w5) -- Same day: merge
END,
w5_mod_time = EXCLUDED.w5_mod_time,
whole = hll_union(COALESCE(t_like.whole, hll_empty()),
EXCLUDED.whole)
WHERE
hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5) <> COALESCE(t_like.w5, hll_empty())
OR
hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole, hll_empty());
In production, use batch merge updates — aggregate all events for a single (UID, tag) pair and apply a single hll_union per batch to reduce write amplification.
Step 3: Rank top tags for a user
To rank a user's top tags by listening activity over the last two days, compute the hll_cardinality of the union of the two daily HLLs:
SELECT
tagid,
hll_cardinality(
hll_union(
COALESCE(w4, hll_empty()),
COALESCE(w5, hll_empty())
)
) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;
Sample output (before load testing):
tagid | vids
-------+------
200 | 2
(1 row)
Step 4: Create an index for millisecond-level responses
Add a function-based index on the HLL cardinality expression so the top-tag query uses an index scan instead of a sequential scan:
CREATE INDEX idx_t_like_1
ON t_like (
uid,
hll_cardinality(
hll_union(
COALESCE(w4, hll_empty()),
COALESCE(w5, hll_empty())
)
)
);
Verify the index is used:
EXPLAIN
SELECT
tagid,
hll_cardinality(
hll_union(
COALESCE(w4, hll_empty()),
COALESCE(w5, hll_empty())
)
) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;
Expected execution plan:
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.11..0.15 rows=1 width=12)
-> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12)
Index Cond: (uid = 1)
(3 rows)
The planner uses Index Scan Backward, confirming the index is in effect.
Step 5: Load test data with pgbench
Use pgbench to simulate concurrent writes. This example runs from an ECS instance in the same VPC as the RDS instance.
pgbench is a PostgreSQL benchmarking tool. Make sure the PostgreSQL client is installed on the ECS instance. For command reference, see the official PostgreSQL documentation.
-
On the ECS instance, create a test SQL file named
test.sql:\set uid random(1, 50000) \set tagid random(1, 5000) \set vid random(1, 10000000) INSERT INTO t_like (uid, tagid, w5, w5_mod_time, whole) VALUES ( :uid, :tagid, hll_hash_integer(:vid) || hll_empty(), NOW(), hll_hash_integer(:vid) || hll_empty() ) ON CONFLICT (uid, tagid) DO UPDATE SET w5 = CASE WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE THEN EXCLUDED.w5 ELSE hll_union( COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5 ) END, w5_mod_time = EXCLUDED.w5_mod_time, whole = hll_union( COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole ) WHERE hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5) <> COALESCE(t_like.w5, hll_empty()) OR hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole, hll_empty()); -
Run pgbench with 32 concurrent clients for 120 seconds:
pgbench \ -M prepared \ -n \ -r \ -P 1 \ -c 32 \ -j 32 \ -T 120 \ -f ./test.sql \ -h pgm-****.pg.rds.aliyuncs.com \ # ApsaraDB RDS for PostgreSQL endpoint -p 5432 \ # Port -U testdbuser \ # Database account testdb # Database nameSample output:
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 32 number of threads: 32 duration: 120 s number of transactions actually processed: 24636321 latency average = 0.156 ms latency stddev = 0.339 ms tps = 205301.110313 (including connections establishing) tps = 205354.851711 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set uid random(1,5000000) 0.001 \set tagid random(1,5000) 0.000 \set vid random(1,10000000) 0.154 insert into t_like (The write throughput reaches ~205,000 transactions per second (tps).
Step 6: Verify top-tag rankings after load
Connect to your instance and run the top-tag query again:
SELECT
tagid,
hll_cardinality(
hll_union(
COALESCE(w4, hll_empty()),
COALESCE(w5, hll_empty())
)
) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;
Sample output after data load:
tagid | vids
-------+------
200 | 2
1413 | 1
1996 | 1
2642 | 1
3664 | 1
4340 | 1
(6 rows)
Time: 0.688 ms
The query returns in 0.688 ms, well within millisecond-level response targets.
Filter out songs a user has already heard
After building tag rankings, exclude songs the user has already heard before surfacing recommendations. The hll plug-in supports two modes for this check.
Approximate check
Use HLL membership testing for high-throughput exclusion filtering. Results are approximate — HLL can return false positives but not false negatives.
Check if song vid:1 is not in the user's history (returns f — not present):
SELECT
whole || hll_hash_integer(1) = whole
FROM t_like
WHERE uid = 1
AND tagid = 200; ?column?
----------
f
(1 row)
Check if song vid:12345 is in the user's history (returns t — present):
SELECT
whole || hll_hash_integer(12345) = whole
FROM t_like
WHERE uid = 1
AND tagid = 200; ?column?
----------
t
(1 row)
Exact check
For exact deduplication, maintain a separate lossless table alongside the HLL table:
CREATE TABLE t_like_lossless (
uid INT,
vid INT,
PRIMARY KEY (uid, vid)
);
Write song events to both t_like (for fast tag ranking) and t_like_lossless (for exact exclusion checks). Use t_like_lossless to generate the final exclusion list before surfacing recommendations.
Appendix: Conventional design
The conventional design works across all relational databases and returns exact counts. Its trade-off is higher query latency when data volume is large, because tag rankings require a full GROUP BY aggregation at query time.
Create the table
CREATE TABLE t_like (
uid INT, -- User ID
tagid INT, -- Song tag ID
vid INT, -- Song ID
mod_time TIMESTAMP, -- Last update time
PRIMARY KEY (uid, tagid, vid)
);
Load test data
Use the same pgbench setup as the HLL design. The test SQL file inserts rows and updates mod_time only when more than one day has passed since the last write:
\set uid random(1, 50000)
\set tagid random(1, 5000)
\set vid random(1, 10000000)
INSERT INTO t_like
VALUES (:uid, :tagid, :vid, NOW())
ON CONFLICT (uid, tagid, vid) DO UPDATE
SET mod_time = EXCLUDED.mod_time
WHERE
EXCLUDED.mod_time - t_like.mod_time > INTERVAL '1 day';
Query top tags
Retrieve the top 10 tags for a user from the last day:
SELECT
tagid,
COUNT(*)
FROM t_like
WHERE uid = 1
AND NOW() - mod_time < INTERVAL '1 day'
GROUP BY tagid
ORDER BY COUNT(*) DESC
LIMIT 10;
Sample output:
tagid | count
-------+-------
2519 | 4
3049 | 4
3648 | 4
1777 | 3
1352 | 3
1491 | 3
1064 | 3
572 | 3
692 | 3
301 | 3
(10 rows)
Time: 3.947 ms
At this data volume, the conventional query takes 3.947 ms — roughly 5.7x slower than the HLL-based approach (0.688 ms). The gap widens as row counts grow, because the GROUP BY scan scales with data volume while the HLL index scan does not.
What's next
-
Cardinality estimation (hll) — learn more about the hll plug-in and its functions