By digoal
A recommendation system is a general requirement to improve user stickiness and conversion rate in internet applications.
For example:
In this article, a music website is used as an example to describe the design of a recommendation system database.
1) Songs have corresponding tags. One song can have multiple tags. Songs that the user has listened to or finished listening to form a one-to-many mapping relationship.
uid ->> tags ->> musics
2) According to the ranking result of the song quantity under each tag, tag hotness can be obtained.
tag(count distinct music)
3) The first five tags and their weights are as follows:
4) Exclude songs that the user has listened to from the song library with these tags. Further, recommend new songs to users in proportion based on the recommendation weights of these songs. For example, the number of times that songs are played in reverse order.
It applies to all types of databases.
create table t_like(
uid int, -- 用户id
tagid int, -- 歌曲标签id
vid int, -- 歌曲id
mod_time timestamp, -- 最后一次更新时间, 仅与上次时间超过1天时更新
primary key (uid,tagid,vid)
);
insert into t_like values (:uid, :tagid, :vid, :mod_time)
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day'
;
-- 根据tag里面歌曲id的歌手, 统计最近1天的top 10的tag
select tagid, count(*) from t_like
where uid=:uid
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
Perform Stress testing.
vi test.sql
\set uid random(1,50000)
\set tagid random(1,5000)
\set vid random(1,10000000)
insert into t_like values (:uid, :tagid, :vid, now())
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day';
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 240
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 240 s
number of transactions actually processed: 80975327
latency average = 0.095 ms
latency stddev = 0.340 ms
tps = 337396.279382 (including connections establishing)
tps = 337406.018908 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set uid random(1,50000)
0.000 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.094 insert into t_like values (:uid, :tagid, :vid, now())
db1=# select tagid, count(*) from t_like
where uid=1
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
tagid | count
-------+-------
2519 | 4
3049 | 4
3648 | 4
1777 | 3
1352 | 3
1491 | 3
1064 | 3
572 | 3
692 | 3
301 | 3
(10 rows)
Time: 3.947 ms
Disadvantages of conventional database design
In the case of a large amount of data, the conventional design relies on aggregated queries, resulting in low efficiency.
Use HLL to store the VID listened by UID in the form of (tag(hll), tag n(hll)). This design has significant advantages over a conventional design, such as:
This design uses the PostgreSQL HLL plug-in. For more details related to the HLL plug-in, please refer to this page.
1) Each tag stores an HLL, which stores the VID hash value of songs that the users have listened to.
create table t_like (
uid int,
tagid int, -- 标签
w1 hll, w1_mod_time timestamp, -- 周一听完的歌曲对应的vid 构成的hash, 周一
w2 hll, w2_mod_time timestamp, -- 周二 ...
w3 hll, w3_mod_time timestamp,
w4 hll, w4_mod_time timestamp,
w5 hll, w5_mod_time timestamp,
w6 hll, w6_mod_time timestamp,
w7 hll, w7_mod_time timestamp,
whole hll, -- 所有
primary key (uid,tagid)
);
If you're only concerned about the results of one day, then there is no need for so many fields. Fields, from w1 to w7, can vary as per your business requirements.
2) After a user finishes listening to a track, the recommendation system writes that song's information to the field corresponding to the current time. If the corresponding field already has a value, and the last modified time is not today, then overwrite. Otherwise, append the hash. You can implement the preceding logic by using the insert into conflict syntax.
-- 设置观看历史行为 hash
insert into t_like (
uid,
tagid,
w5,
w5_mod_time,
whole
)
values (
1, -- uid
200, -- 标签id
hll_hash_integer(12346)||hll_empty(), -- 观看过的vid, 多个则继续||
now(),
hll_hash_integer(12346)||hll_empty() -- 观看过的vid
)
on conflict (uid,tagid)
do update
set w5=
case
when date(t_like.w5_mod_time) <> current_date
then excluded.w5
else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5)
end,
w5_mod_time = excluded.w5_mod_time,
whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole)
where
hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty())
or
hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty())
;
Note: It is also possible to merge updates in batches or perform aggregate updates for the single tag of an individual user. To make it happen, perform HLL UNION operation while reducing the update rate.
3) Query the top 10 tags of UID 1 over the last two days.
Example:
select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
tagid | vids
-------+------
200 | 2
(1 row)
4) Support for indexes:
create index idx_t_like_1 on t_like (uid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ));
5) Scan indexes:
postgres=# explain select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.11..0.15 rows=1 width=12)
-> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12)
Index Cond: (uid = 1)
(3 rows)
6) Write tens of millions of data records and perform stress testing to evaluate performance.
vi test.sql
\set uid random(1,50000)
\set tagid random(1,5000)
\set vid random(1,10000000)
insert into t_like (
uid,
tagid,
w5,
w5_mod_time,
whole
)
values (
:uid,
:tagid,
hll_hash_integer(:vid)||hll_empty(),
now(),
hll_hash_integer(:vid)||hll_empty()
)
on conflict (uid,tagid)
do update
set w5=
case
when date(t_like.w5_mod_time) <> current_date
then excluded.w5
else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5)
end,
w5_mod_time = excluded.w5_mod_time,
whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole)
where
hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty())
or
hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty())
;
pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 120 -f ./test.sql
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 24636321
latency average = 0.156 ms
latency stddev = 0.339 ms
tps = 205301.110313 (including connections establishing)
tps = 205354.851711 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set uid random(1,5000000)
0.001 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.154 insert into t_like (
7) Execute for multiple rounds.
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 23988181
latency average = 0.160 ms
latency stddev = 0.335 ms
tps = 199900.214256 (including connections establishing)
tps = 199956.049571 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set uid random(1,50000)
0.000 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.158 insert into t_like (
8) The current number of records is 47.47 million.
postgres=# select count(*) from t_like ;
count
----------
47473788
(1 row)
9) Query the tag ranking result of a UID. The response time is 0.688 milliseconds.
postgres=# select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
tagid | vids
-------+------
200 | 2
1413 | 1
1996 | 1
2642 | 1
3664 | 1
4340 | 1
(6 rows)
Time: 0.688 ms
Other requirements: Determine whether a VID is in this hash, requiring no precise operations. This filters songs that the user has already listened to.
select whole || hll_hash_integer(:vid) = whole
from
t_like
where uid=:uid and tagid=:tagid;
For example:
postgres=# select whole || hll_hash_integer(1) = whole
from
t_like
where uid=1 and tagid=200; -- 返回false表示不包含vid:1
?column?
----------
f
(1 row)
postgres=# select whole || hll_hash_integer(12345) = whole
from
t_like
where uid=1 and tagid=200; -- 返回true表示包含vid:12345
?column?
----------
t
(1 row)
Separately store a copy of accurate values if you need precision.
create table t_like_lossless (
uid int,
vid int,
primary key (uid,vid)
);
This is a query for the primary key, and its speed is also very fast.
Alibaba Cloud will soon support the HLL plug-in. You are welcome to try it out. Now you can spend only 9.9 yuan to buy PostgreSQL for trial. Refer to the page here.
PostgreSQL HLL helps implement approximate hash computing. Millisecond-level recommendation queries are achieved based on hundreds of millions of relational data records. In contrast, this saves storage space and decreases the response time from 3.947 milliseconds to 0.688 milliseconds.
1) https://github.com/citusdata/postgresql-hll
2) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-3
3) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-2
4) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-1
Professional Recommendation Database - RecDB
5) https://github.com/DataSystemsLab/recdb-postgresql
Professional Graph Database
Read/Write Splitting Configuration for ApsaraDB RDS for PostgreSQL 12 using pgpool
ApsaraDB - October 30, 2020
ApsaraDB - November 16, 2020
Alibaba Clouder - June 22, 2018
digoal - January 30, 2022
Apache Flink Community China - January 11, 2021
ApsaraDB - January 28, 2022
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreAlibaba Cloud Function Compute is a fully-managed event-driven compute service. It allows you to focus on writing and uploading code without the need to manage infrastructure such as servers.
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreMore Posts by digoal