This topic describes how to use the hll plug-in to recommend related content based on user preferences.
Prerequisites
Background information
A recommendation system can be used to increase user stickiness and conversion rates in the following scenarios:
E-commerce websites recommend products based on user purchase habits.
Music websites recommend music based on user listening preferences.
News websites recommend news based on user browsing habits.
Application websites recommend applications based on user download and usage habits.
In this topic, a music website is used as an example to describe how to design a recommendation system and elaborate on the differences between the conventional design and the design that is based on HHL and similarity computing of PostgreSQL.
Design background
When a user (UID) listens to a song (vid), the recommendation system binds tags to that song. A song can have multiple tags, which forms the following mapping:
uid ->> tags ->> musicsThe recommendation system ranks the tags for a user based on the number of songs associated with each tag to determine the popularity of the tags:
tag(count distinct music) ...The recommendation system obtains the top five tags and their weights:
tag1:40% tag2:20% tag3:15% tag4:15% tag5:10%The recommendation system excludes songs that the user has already listened to from the tag's song library. Then, the system recommends new songs based on their recommendation weights, such as in descending order of playback counts.
Design based on HLL similarity computing
You can use the hll plug-in to store the IDs (vids) of songs that a user (UID) has listened to. Compared with the conventional design, this design has the following advantages:
It stores a small amount of data using approximate hll hash aggregations instead of actual values.
It provides high query efficiency. It supports indexes and does not require calculations, which allows for millisecond-level responses.
It supports operations such as hash union and add. These operations are suitable for sliding window calculations and can meet more business requirements.
For more information about how to use the hll plug-in, see Cardinality estimation (hll).
Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. For each tag, the table stores an hll that contains the hash values of the IDs of songs that a user has listened to.
CREATE TABLE t_like ( uid INT, tagid INT, -- Tag w1 hll, w1_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Monday. w2 hll, w2_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Tuesday. w3 hll, w3_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Wednesday. w4 hll, w4_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Thursday. w5 hll, w5_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Friday. w6 hll, w6_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Saturday. w7 hll, w7_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Sunday. whole hll, -- All PRIMARY KEY (uid, tagid) );NoteThe w1 to w7 fields are set as needed. If you are only concerned with data for a single day, you only need to set one field.
When a user listens to a song, the information is written to the field that corresponds to the current date. If the field already has a value and was last modified on a previous day, the value is overwritten. Otherwise, a hash value is appended.
This operation uses the
INSERT INTO ON CONFLICTsyntax. The following code provides an example:-- Insert the hash value of the user's viewing history. INSERT INTO t_like ( uid, tagid, w5, w5_mod_time, whole ) VALUES ( 1, -- uid 200, -- Tag ID hll_hash_integer(12346) || hll_empty(), -- The vid of the song that was watched (multiple can be concatenated). NOW(), hll_hash_integer(12346) || hll_empty() -- The vid of the song that was watched. ) ON CONFLICT (uid, tagid) DO UPDATE SET w5 = CASE WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE THEN EXCLUDED.w5 ELSE hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5) END, w5_mod_time = EXCLUDED.w5_mod_time, whole = hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole) WHERE hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5) <> COALESCE(t_like.w5, hll_empty()) OR hll_union(COALESCE(t_like.whole,hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole,hll_empty());NoteIn actual business scenarios, you can also perform batch merge updates. You can perform aggregate updates for a single tag of a single user and use
hll_unionto reduce the update rate.Query the top 10 tags for the user with UID 1 (uid=1) from the last two days. The following code provides an example:
SELECT tagid, hll_cardinality( hll_union( COALESCE(w4, hll_empty()), COALESCE(w5, hll_empty()) ) ) AS vids FROM t_like WHERE uid = 1 ORDER BY 2 DESC LIMIT 10;The following result is returned:
tagid | vids -------+------ 200 | 2 (1 row)Run the following command to create an index.
CREATE INDEX idx_t_like_1 ON t_like ( uid, hll_cardinality( hll_union( COALESCE(w4, hll_empty()), COALESCE(w5, hll_empty()) ) ) );Run the following command to view the execution plan.
EXPLAIN SELECT tagid, hll_cardinality( hll_union( COALESCE(w4, hll_empty()), COALESCE(w5, hll_empty()) ) ) AS vids FROM t_like WHERE uid = 1 ORDER BY 2 DESC LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=0.11..0.15 rows=1 width=12) -> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12) Index Cond: (uid = 1) (3 rows)The following result is returned:
QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=0.11..0.15 rows=1 width=12) -> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12) Index Cond: (uid = 1) (3 rows)Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.
Notepgbench is a simple tool for running benchmark tests on PostgreSQL. Make sure that the PostgreSQL client is installed on the destination ECS instance. For more information about this command, see the official PostgreSQL documentation.
On the ECS instance, run the
vi test.sqlcommand to create a test SQL file named test.sql and insert the following content.\set uid random(1, 50000) \set tagid random(1, 5000) \set vid random(1, 10000000) INSERT INTO t_like ( uid, tagid, w5, w5_mod_time, whole ) VALUES ( :uid, :tagid, hll_hash_integer(:vid) || hll_empty(), NOW(), hll_hash_integer(:vid) || hll_empty() ) ON CONFLICT (uid, tagid) DO UPDATE SET w5 = CASE WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE THEN EXCLUDED.w5 ELSE hll_union( COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5 ) END, w5_mod_time = EXCLUDED.w5_mod_time, whole = hll_union( COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole ) WHERE hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5) <> COALESCE(t_like.w5, hll_empty()) OR hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole, hll_empty());Run the following command to insert test data.
pgbench \ -M prepared \ -n \ -r \ -P 1 \ -c 32 \ -j 32 \ -T 120 \ -f ./test.sql \ -h pgm-****.pg.rds.aliyuncs.com \ # ApsaraDB RDS for PostgreSQL endpoint -p 5432 \ # Port -U testdbuser \ # Database account testdb # Destination database nameThe following result is returned:
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 32 number of threads: 32 duration: 120 s number of transactions actually processed: 24636321 latency average = 0.156 ms latency stddev = 0.339 ms tps = 205301.110313 (including connections establishing) tps = 205354.851711 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set uid random(1,5000000) 0.001 \set tagid random(1,5000) 0.000 \set vid random(1,10000000) 0.154 insert into t_like (
Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to query the tag popularity ranking for a specific UID.
SELECT tagid, hll_cardinality( hll_union( COALESCE(w4, hll_empty()), COALESCE(w5, hll_empty()) ) ) AS vids FROM t_like WHERE uid = 1 ORDER BY 2 DESC LIMIT 10;The following result is returned:
tagid | vids -------+------ 200 | 2 1413 | 1 1996 | 1 2642 | 1 3664 | 1 4340 | 1 (6 rows) Time: 0.688 msNoteThe response time is 0.688 ms.
Other requirement examples
Filter out songs that the user has already listened to. For example, you can determine whether a song ID (vid) is in the hash.
Inexact operation
Exclusions:
SELECT whole || hll_hash_integer(1) = whole FROM t_like WHERE uid = 1 AND tagid = 200; -- Returns false, indicating that vid:1 is not included. ?column? ---------- f (1 row)This includes the following:
SELECT whole || hll_hash_integer(12345) = whole FROM t_like WHERE uid = 1 AND tagid = 200; -- Returns true, indicating that vid:12345 is included. ?column? ---------- t (1 row)
Exact operation
Create a test table for songs that have been listened to. The following code provides an example:
CREATE TABLE t_like_lossless (
uid INT,
vid INT,
PRIMARY KEY (uid, vid)
); Conventional design
This design is applicable to all databases. However, its disadvantage is that it may require aggregate queries when the data volume is large, which can lead to lower efficiency.
Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. The following code provides an example of the table creation statement:
CREATE TABLE t_like ( uid INT, -- User ID tagid INT, -- Song tag ID vid INT, -- Song ID mod_time TIMESTAMP, -- The time of the last update. An update is triggered only when more than one day has passed since the last update. PRIMARY KEY (uid, tagid, vid) );Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.
Notepgbench is a simple tool for running benchmark tests on PostgreSQL. Make sure that the PostgreSQL client is installed on the destination ECS instance. For more information about this command, see the official PostgreSQL documentation.
On the ECS instance, run the
vi test.sqlcommand to create a test SQL file named test.sql and insert the following content.\set uid random(1, 50000) \set tagid random(1, 5000) \set vid random(1, 10000000) INSERT INTO t_like VALUES (:uid, :tagid, :vid, NOW()) ON CONFLICT (uid, tagid, vid) DO UPDATE SET mod_time = EXCLUDED.mod_time WHERE EXCLUDED.mod_time - t_like.mod_time > INTERVAL '1 day';Run the following command to insert test data.
pgbench \ -M prepared \ -n \ -r \ -P 1 \ -c 32 \ -j 32 \ -T 120 \ -f ./test.sql \ -h pgm-****.pg.rds.aliyuncs.com \ # ApsaraDB RDS for PostgreSQL endpoint -p 5432 \ # Port -U testdbuser \ # Database account testdb # Destination database name
Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to obtain statistics for the top 10 tags from the last day.
SELECT tagid, COUNT(*) FROM t_like WHERE uid = 1 AND NOW() - mod_time < INTERVAL '1 day' GROUP BY tagid ORDER BY COUNT(*) DESC LIMIT 10;The following sample result is returned:
tagid | count -------+------- 2519 | 4 3049 | 4 3648 | 4 1777 | 3 1352 | 3 1491 | 3 1064 | 3 572 | 3 692 | 3 301 | 3 (10 rows) Time: 3.947 ms