This topic describes how to use the HyperLogLog (HLL) plug-in to design a recommendation system, which recommends content to a user based on the preferences of the user. The recommendation system is implemented based on similarity computing of PostgreSQL.
Background information
A recommendation system can be used to increase user stickiness and conversion rate for the following websites:
- E-commerce websites
- Music websites
- News websites
- App websites
In this topic, a music website is used as an example to describe how to design a recommendation system and elaborate on the differences between the conventional design and the design that is based on HHL and similarity computing of PostgreSQL.
Design background
- After a user (uid) finishes listening to a song (vid), the recommendation system binds
a tag (tagid) to the song. More than one tag is allowed per song.
uid ->> tags ->> musics
- The recommendation system obtains the popularity of each tag based on the number of
songs to which the tag is bound.
tag(count distinct music) ...
- The recommendation system obtains the top 5 tags and their recommendation weights.
tag1:40% tag2:20% tag3:15% tag4:15% tag5:10%
- The recommendation system excludes the songs to which the user finishes listening. The recommendation system obtains a library of songs to which the user has not listened. Then, the recommendation system recommends new songs to the user based on the recommendation weights of the songs in the library. For example, the recommendation weights can be the rankings of the songs in descending order.
Conventional design
The conventional design is suitable for all types of databases. However, if your database system has a large amount of data, the conventional design may use aggregate queries that run slowly. Example:
create table t_like(
uid int, -- The ID of a user.
tagid int, -- The ID of the tag that is bound to a song.
vid int, -- The ID of a song.
mod_time timestamp, -- The time of the last update. An update is triggered only when one day has passed since the last update.
primary key (uid,tagid,vid)
);
insert into t_like values (:uid, :tagid, :vid, :mod_time)
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day'
;
-- Obtain the top 10 tags over the last day.
select tagid, count(*) from t_like
where uid=:uid
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
The following code snippet is an example of stress testing:
vi test.sql
\set uid random(1,50000)
\set tagid random(1,5000)
\set vid random(1,10000000)
insert into t_like values (:uid, :tagid, :vid, now())
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day';
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 240
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 240 s
number of transactions actually processed: 80975327
latency average = 0.095 ms
latency stddev = 0.340 ms
tps = 337396.279382 (including connections establishing)
tps = 337406.018908 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set uid random(1,50000)
0.000 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.094 insert into t_like values (:uid, :tagid, :vid, now())
db1=# select tagid, count(*) from t_like
where uid=1
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
tagid | count
-------+-------
2519 | 4
3049 | 4
3648 | 4
1777 | 3
1352 | 3
1491 | 3
1064 | 3
572 | 3
692 | 3
301 | 3
(10 rows)
Time: 3.947 ms
Design based on HLL and similarity computing of PostgreSQL
The design based on HLL and similarity computing of PostgreSQL stores the IDs of the songs to which each user finishes listening. This design has the following benefits over the conventional design:
- Stores a small amount of data by using approximate clustered hash values in place of actual values.
- Supports index-based queries without the need for computing. The index-based queries allow your database system to respond to queries within milliseconds.
- Supports operations that can be used for sliding window computing to meet more diversified business requirements. These operations include HASH UNION and HASH ADD.
- The recommendation system maintains one HLL per tag. The HLL contains a hash value
that consists of the IDs of all the songs to which each user finishes listening.
create table t_like ( uid int, tagid int, -- The ID of the tag. w1 hll, w1_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Monday within the tag. w2 hll, w2_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Tuesday within the tag. w3 hll, w3_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Wednesday within the tag. w4 hll, w4_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Thursday within the tag. w5 hll, w5_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Friday within the tag. w6 hll, w6_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Saturday within the tag. w7 hll, w7_mod_time timestamp, -- The hash value that consists of the IDs of the songs to which the user finishes listening on a Sunday within the tag. whole hll, -- The hash value that consists of the IDs of the songs to which the user finishes listening from a Monday to Sunday period within the tag. primary key (uid,tagid) );
Note You can specify the w1 to w7 day fields based on your business requirements. For example, if you want to view only the data of a specific day, you can specify only one of the w1 to w7 day fields. - After a user finishes listening to a song, the recommendation system writes the information
of the song to the current day field. If the field has a value but the value is not
last modified on the current day, the recommendation system overwrites the existing
value by using the new value. Otherwise, the recommendation system appends a hash
value to the current day field. The hash value consists of both the existing value
and the new value. The preceding logic is implemented by using the
INSERT INTO ON CONFLICT
syntax.-- Configure the hash value that consists of the IDs of all the songs to which each user finishes listening within a tag. insert into t_like ( uid, tagid, w5, w5_mod_time, whole ) values ( 1, -- uid 200, -- The ID of the tag. hll_hash_integer(12346)||hll_empty(), -- The ID of the song to which the user finishes listening. If the user finishes listening to more than one song, the subsequent coding continues. now(), hll_hash_integer(12346)||hll_empty() -- The ID of the song to which the user finishes listening. ) on conflict (uid,tagid) do update set w5= case when date(t_like.w5_mod_time) <> current_date then excluded.w5 else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) end, w5_mod_time = excluded.w5_mod_time, whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) where hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty()) or hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty()) ;
Note You can perform HLL UNION operations to merge all the data updates to the data records of a user within a tag at a time. - Query the top 10 tags of the user with the uid value being 1 over the last two days.
Example:
select tagid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids from t_like where uid = 1 order by 2 desc limit 10; tagid | vids -------+------ 200 | 2 (1 row)
- Create an index. Example:
create index idx_t_like_1 on t_like (uid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ));
- View the query plan. Example:
postgres=# explain select tagid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids from t_like where uid = 1 order by 2 desc limit 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=0.11..0.15 rows=1 width=12) -> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12) Index Cond: (uid = 1) (3 rows)
- Write tens of millions of data records and perform a stress test. Example:
vi test.sql \set uid random(1,50000) \set tagid random(1,5000) \set vid random(1,10000000) insert into t_like ( uid, tagid, w5, w5_mod_time, whole ) values ( :uid, :tagid, hll_hash_integer(:vid)||hll_empty(), now(), hll_hash_integer(:vid)||hll_empty() ) on conflict (uid,tagid) do update set w5= case when date(t_like.w5_mod_time) <> current_date then excluded.w5 else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) end, w5_mod_time = excluded.w5_mod_time, whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) where hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty()) or hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty()) ; pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 120 -f ./test.sql
The following test result is obtained:
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 32 number of threads: 32 duration: 120 s number of transactions actually processed: 24636321 latency average = 0.156 ms latency stddev = 0.339 ms tps = 205301.110313 (including connections establishing) tps = 205354.851711 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set uid random(1,5000000) 0.001 \set tagid random(1,5000) 0.000 \set vid random(1,10000000) 0.154 insert into t_like (
- Query the tags of a user based on the tag popularity. Example:
postgres=# select tagid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids from t_like where uid = 1 order by 2 desc limit 10; tagid | vids -------+------ 200 | 2 1413 | 1 1996 | 1 2642 | 1 3664 | 1 4340 | 1 (6 rows) Time: 0.688 ms
Note This query is responded in 0.688 milliseconds.
You can use the preceding design method to implement other business logic in the recommendation system. For example, you can use the following design to filter songs to which a user finishes listening:
- Determine whether the ID of a song is included in the specified hash value by using
fuzzy match. Example:
postgres=# select whole || hll_hash_integer(1) = whole from t_like where uid=1 and tagid=200; -- If the value false is returned, the song with the vid value being 1 is not included. ? column? ---------- f (1 row) postgres=# select whole || hll_hash_integer(12345) = whole from t_like where uid=1 and tagid=200; -- If the value true is returned, the song with the vid value being 12345 is included. ? column? ---------- t (1 row)
- Determine whether the ID of a song is included in the specified hash value by using
exact match. The following example shows how to create a table:
create table t_like_lossless ( uid int, vid int, primary key (uid,vid) );
Note If you run the query based on a primary key, the query runs fast.