All Products
Search
Document Center

ApsaraDB RDS:User preference recommendation system (PostgreSQL similarity computing application)

Last Updated:Mar 30, 2026

Build a tag-based user preference recommendation system using the HLL (HyperLogLog) plug-in for ApsaraDB RDS for PostgreSQL. This guide covers the HLL-based design, shows how to store and query user listening history with millisecond-level response times, and compares the approach with a conventional design.

Prerequisites

Before you begin, make sure you have:

How the recommendation logic works

The recommendation system follows four steps using a music platform as the working example:

  1. When a user (UID) listens to a song (vid), the system binds tags to that song. A song can have multiple tags, forming the mapping: uid ->> tags ->> songs

  2. The system ranks tags for each user by the number of distinct songs associated with each tag.

  3. The system identifies the top 5 tags and their relative weights — for example, tag1: 40%, tag2: 20%, tag3: 15%, tag4: 15%, tag5: 10%.

  4. The system excludes songs the user has already heard, then recommends new songs from each tag's pool ordered by playback count.

This logic applies equally to e-commerce (purchase history), news (browsing habits), and app stores (download patterns).

Choose an approach: HLL-based vs. conventional

Dimension HLL-based design Conventional design
Storage Compact — stores approximate hash aggregations, not raw records Full — stores every (uid, tagid, vid) row
Query performance ~0.7 ms with index support, no runtime aggregation ~4 ms requires GROUP BY aggregation at query time
Sliding window support Yes — hll_union and add operations enable time-window queries Limited — requires timestamp filtering on large tables
Accuracy Approximate (HLL trades precision for speed) Exact
Best for High-traffic platforms where speed and scale matter Lower-traffic use cases requiring exact counts

Use the HLL-based design when query latency and storage efficiency are priorities. Use the conventional design when exact counts are required and data volume is manageable.

Build the HLL-based recommendation system

Step 1: Create the user-tag listening table

Connect to your ApsaraDB RDS for PostgreSQL instance and create the t_like table. For each (user, tag) pair, the table stores one HLL per day of the week — each HLL holds the hash values of song IDs (vids) the user listened to on that day.

CREATE TABLE t_like (
    uid           INT,
    tagid         INT,            -- Tag
    w1 hll, w1_mod_time TIMESTAMP, -- Hash of vids listened to on Monday
    w2 hll, w2_mod_time TIMESTAMP, -- Tuesday
    w3 hll, w3_mod_time TIMESTAMP, -- Wednesday
    w4 hll, w4_mod_time TIMESTAMP, -- Thursday
    w5 hll, w5_mod_time TIMESTAMP, -- Friday
    w6 hll, w6_mod_time TIMESTAMP, -- Saturday
    w7 hll, w7_mod_time TIMESTAMP, -- Sunday
    whole hll,                     -- Full listening history
    PRIMARY KEY (uid, tagid)
);
Note

The w1w7 fields are optional. If you only need same-day data, use a single daily field instead of all seven.

Step 2: Record listening events

When a user listens to a song, write the event to the field for the current day of the week. The upsert logic handles two cases:

  • New day: the existing day's HLL is overwritten with the new hash.

  • Same day: the new hash is merged into the existing HLL using hll_union.

The whole column always accumulates — it is never overwritten.

-- Record that user 1 listened to song 12346 (Friday, tagid 200)
INSERT INTO t_like (uid, tagid, w5, w5_mod_time, whole)
VALUES (
    1,                                        -- UID
    200,                                      -- Tag ID
    hll_hash_integer(12346) || hll_empty(),   -- Hash of the song vid
    NOW(),
    hll_hash_integer(12346) || hll_empty()    -- Also add to full history
)
ON CONFLICT (uid, tagid) DO UPDATE
SET
    w5 =
        CASE
            WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
            THEN EXCLUDED.w5                          -- New day: overwrite
            ELSE hll_union(COALESCE(t_like.w5, hll_empty()),
                           EXCLUDED.w5)               -- Same day: merge
        END,
    w5_mod_time = EXCLUDED.w5_mod_time,
    whole = hll_union(COALESCE(t_like.whole, hll_empty()),
                      EXCLUDED.whole)
WHERE
    hll_union(COALESCE(t_like.w5,    hll_empty()), EXCLUDED.w5)    <> COALESCE(t_like.w5,    hll_empty())
    OR
    hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole, hll_empty());
Note

In production, use batch merge updates — aggregate all events for a single (UID, tag) pair and apply a single hll_union per batch to reduce write amplification.

Step 3: Rank top tags for a user

To rank a user's top tags by listening activity over the last two days, compute the hll_cardinality of the union of the two daily HLLs:

SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;

Sample output (before load testing):

 tagid | vids
-------+------
   200 |    2
(1 row)

Step 4: Create an index for millisecond-level responses

Add a function-based index on the HLL cardinality expression so the top-tag query uses an index scan instead of a sequential scan:

CREATE INDEX idx_t_like_1
ON t_like (
    uid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    )
);

Verify the index is used:

EXPLAIN
SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;

Expected execution plan:

                                           QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=0.11..0.15 rows=1 width=12)
   ->  Index Scan Backward using idx_t_like_1 on t_like  (cost=0.11..0.15 rows=1 width=12)
         Index Cond: (uid = 1)
(3 rows)

The planner uses Index Scan Backward, confirming the index is in effect.

Step 5: Load test data with pgbench

Use pgbench to simulate concurrent writes. This example runs from an ECS instance in the same VPC as the RDS instance.

Note

pgbench is a PostgreSQL benchmarking tool. Make sure the PostgreSQL client is installed on the ECS instance. For command reference, see the official PostgreSQL documentation.

  1. On the ECS instance, create a test SQL file named test.sql:

    \set uid    random(1, 50000)
    \set tagid  random(1, 5000)
    \set vid    random(1, 10000000)
    
    INSERT INTO t_like (uid, tagid, w5, w5_mod_time, whole)
    VALUES (
        :uid,
        :tagid,
        hll_hash_integer(:vid) || hll_empty(),
        NOW(),
        hll_hash_integer(:vid) || hll_empty()
    )
    ON CONFLICT (uid, tagid) DO UPDATE
    SET
        w5 =
            CASE
                WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
                THEN EXCLUDED.w5
                ELSE hll_union(
                         COALESCE(t_like.w5, hll_empty()),
                         EXCLUDED.w5
                     )
            END,
        w5_mod_time = EXCLUDED.w5_mod_time,
        whole = hll_union(
                    COALESCE(t_like.whole, hll_empty()),
                    EXCLUDED.whole
                )
    WHERE
          hll_union(COALESCE(t_like.w5,    hll_empty()), EXCLUDED.w5)
        <> COALESCE(t_like.w5,    hll_empty())
       OR
          hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole)
        <> COALESCE(t_like.whole, hll_empty());
  2. Run pgbench with 32 concurrent clients for 120 seconds:

    pgbench \
      -M prepared \
      -n \
      -r \
      -P 1 \
      -c 32 \
      -j 32 \
      -T 120 \
      -f ./test.sql \
      -h pgm-****.pg.rds.aliyuncs.com \   # ApsaraDB RDS for PostgreSQL endpoint
      -p 5432 \                            # Port
      -U testdbuser \                      # Database account
      testdb                               # Database name

    Sample output:

    transaction type: ./test.sql
    scaling factor: 1
    query mode: prepared
    number of clients: 32
    number of threads: 32
    duration: 120 s
    number of transactions actually processed: 24636321
    latency average = 0.156 ms
    latency stddev = 0.339 ms
    tps = 205301.110313 (including connections establishing)
    tps = 205354.851711 (excluding connections establishing)
    statement latencies in milliseconds:
             0.001  \set uid random(1,5000000)
             0.001  \set tagid random(1,5000)
             0.000  \set vid random(1,10000000)
             0.154  insert into t_like (

    The write throughput reaches ~205,000 transactions per second (tps).

Step 6: Verify top-tag rankings after load

Connect to your instance and run the top-tag query again:

SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;

Sample output after data load:

 tagid | vids
-------+------
   200 |    2
  1413 |    1
  1996 |    1
  2642 |    1
  3664 |    1
  4340 |    1
(6 rows)

Time: 0.688 ms

The query returns in 0.688 ms, well within millisecond-level response targets.

Filter out songs a user has already heard

After building tag rankings, exclude songs the user has already heard before surfacing recommendations. The hll plug-in supports two modes for this check.

Approximate check

Use HLL membership testing for high-throughput exclusion filtering. Results are approximate — HLL can return false positives but not false negatives.

Check if song vid:1 is not in the user's history (returns f — not present):

SELECT
    whole || hll_hash_integer(1) = whole
FROM t_like
WHERE uid = 1
  AND tagid = 200;
 ?column?
----------
 f
(1 row)

Check if song vid:12345 is in the user's history (returns t — present):

SELECT
    whole || hll_hash_integer(12345) = whole
FROM t_like
WHERE uid = 1
  AND tagid = 200;
 ?column?
----------
 t
(1 row)

Exact check

For exact deduplication, maintain a separate lossless table alongside the HLL table:

CREATE TABLE t_like_lossless (
    uid INT,
    vid INT,
    PRIMARY KEY (uid, vid)
);

Write song events to both t_like (for fast tag ranking) and t_like_lossless (for exact exclusion checks). Use t_like_lossless to generate the final exclusion list before surfacing recommendations.

Appendix: Conventional design

The conventional design works across all relational databases and returns exact counts. Its trade-off is higher query latency when data volume is large, because tag rankings require a full GROUP BY aggregation at query time.

Create the table

CREATE TABLE t_like (
    uid       INT,          -- User ID
    tagid     INT,          -- Song tag ID
    vid       INT,          -- Song ID
    mod_time  TIMESTAMP,    -- Last update time
    PRIMARY KEY (uid, tagid, vid)
);

Load test data

Use the same pgbench setup as the HLL design. The test SQL file inserts rows and updates mod_time only when more than one day has passed since the last write:

\set uid    random(1, 50000)
\set tagid  random(1, 5000)
\set vid    random(1, 10000000)

INSERT INTO t_like
VALUES (:uid, :tagid, :vid, NOW())
ON CONFLICT (uid, tagid, vid) DO UPDATE
    SET mod_time = EXCLUDED.mod_time
WHERE
    EXCLUDED.mod_time - t_like.mod_time > INTERVAL '1 day';

Query top tags

Retrieve the top 10 tags for a user from the last day:

SELECT
    tagid,
    COUNT(*)
FROM t_like
WHERE uid = 1
  AND NOW() - mod_time < INTERVAL '1 day'
GROUP BY tagid
ORDER BY COUNT(*) DESC
LIMIT 10;

Sample output:

 tagid | count
-------+-------
  2519 |     4
  3049 |     4
  3648 |     4
  1777 |     3
  1352 |     3
  1491 |     3
  1064 |     3
   572 |     3
   692 |     3
   301 |     3
(10 rows)

Time: 3.947 ms

At this data volume, the conventional query takes 3.947 ms — roughly 5.7x slower than the HLL-based approach (0.688 ms). The gap widens as row counts grow, because the GROUP BY scan scales with data volume while the HLL index scan does not.

What's next