By digoal
Vector: A vector can be understood simply as an array of numbers. Unstructured objects (images, videos, audio, etc.) can be converted into an N-dimensional numeric array that captures the object's features. That array can then be searched for recognition, similarity matching, and so on (vector retrieval).
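As a concrete illustration (assuming the pgvector extension introduced below is installed), a 3-dimensional vector is just an ordered list of numbers, and similarity is a distance between two such lists:

```sql
-- A minimal sketch, assuming the pgvector extension is installed.
-- The <-> operator computes the Euclidean (L2) distance between two vectors:
SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance;
-- sqrt((4-1)^2 + (5-2)^2 + (6-3)^2) = sqrt(27) ≈ 5.196
```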
Vector retrieval is a query that consumes CPU and I/O even when an index is present. Therefore, the first thing that comes to mind is how to improve scalability, for example, how to use the resources of multiple machines to increase request-processing throughput.
This article introduces the combination of the PostgreSQL postgres_fdw and vector (pgvector) extensions. This combination helps accelerate similarity search for image recognition. However, some parallelism problems remain; it is worth raising the FDW issues with the PostgreSQL community. At present, the best-performing method appears to be the dblink asynchronous call.
If the vector extension and its indexes meet your needs on a single machine, you don't need to read any further.
For other approaches, you can also refer to Citus, dblink asynchronous calls, and PolarDB for PostgreSQL.
https://www.postgresql.org/docs/devel/postgres-fdw.html
The following postgres_fdw capabilities are relevant here:
1. WHERE-condition pushdown for SELECT statements (the condition may only use immutable functions and operators that the remote server can evaluate).
2. Support for FDW partition tables (foreign tables as partitions).
3. Asynchronous queries across FDW partitions, so that multiple FDW servers can be queried in parallel.
4. A switch to enable or disable the parallel (asynchronous) behavior.
5. Sort pushdown.
6. Merge sort append (a Merge Append over sorted foreign scans).
7. LIMIT pushdown, or fetch_size (100 by default).
Deploy the pgvector extension:
git clone https://github.com/pgvector/pgvector
cd pgvector
export PATH=/Users/digoal/pg14/bin:$PATH
export PGDATA=/Users/digoal/data14
export PGUSER=postgres
export PGPORT=1922
USE_PGXS=1 make
make install
Since this is a local test, configure password authentication:
pg_hba.conf
host all all 127.0.0.1/32 md5
pg_ctl reload
Create a role:
postgres=# create role test login encrypted password 'test123';
CREATE ROLE
Create several databases. (In a real scenario these would be separate instances on multiple machines; this test uses different databases in the same instance to simulate that.)
create database db0;
create database db1;
create database db2;
create database db3;
create database db4;
postgres=# grant all on database db0,db1,db2,db3,db4 to test;
GRANT
\c db0;
create extension postgres_fdw;
create extension vector ;
\c db1
create extension vector ;
\c db2
create extension vector ;
\c db3
create extension vector ;
\c db4
create extension vector ;
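As a quick sanity check (not part of the original steps), you can confirm the extensions are present in each database:

```sql
-- Run in each database; postgres_fdw is only needed in db0:
SELECT extname, extversion FROM pg_extension
WHERE extname IN ('postgres_fdw', 'vector');
```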
db0 is the query entry point, so the entry (partitioned) table, the postgres_fdw extension, the vector extension, and the FDW partition tables are all created in db0.
db0=# \c db0 test
You are now connected to database "db0" as user "test".
db0=>
CREATE TABLE tbl (id int, c1 vector(32), c2 text, c3 timestamp) PARTITION BY hash (id);
CREATE INDEX idx_tbl_1 ON tbl USING ivfflat (c1 vector_l2_ops);
SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;
Log in to db1, db2, db3, and db4 as the test role, and create the data table and index in each:
CREATE TABLE tbl (id int, c1 vector(32), c2 text, c3 timestamp);
CREATE INDEX idx_tbl_1 ON tbl USING ivfflat (c1 vector_l2_ops);
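The index above uses pgvector's defaults. As an illustrative variant (these values are not from the original article), ivfflat can be tuned: the `lists` storage parameter controls how many clusters the index builds, and the `ivfflat.probes` setting controls how many clusters are scanned per query, trading speed for recall:

```sql
-- Illustrative tuning, not part of the original setup:
CREATE INDEX idx_tbl_tuned ON tbl USING ivfflat (c1 vector_l2_ops) WITH (lists = 100);
SET ivfflat.probes = 10;  -- scan more clusters per query for better recall
```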
In db0, create the foreign servers with asynchronous requests enabled, and set the extensions option (so that vector operators can be pushed down). Then configure the user mappings:
\c db0 postgres
db0=# show port;
port
------
1922
(1 row)
create server db1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db1', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');
create server db2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db2', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');
create server db3 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db3', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');
create server db4 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db4', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');
grant all on FOREIGN server db1,db2,db3,db4 to test;
CREATE USER MAPPING FOR test SERVER db1 OPTIONS (user 'test', password 'test123');
CREATE USER MAPPING FOR test SERVER db2 OPTIONS (user 'test', password 'test123');
CREATE USER MAPPING FOR test SERVER db3 OPTIONS (user 'test', password 'test123');
CREATE USER MAPPING FOR test SERVER db4 OPTIONS (user 'test', password 'test123');
In db0, create the FDW partition tables; refer to the postgres_fdw manual for the parameters. You don't have to copy this article's configuration, but you should understand what each setting does (for example, increase fetch_size if the result set is large, and decrease it if the result set is small).
\c db0 test
CREATE FOREIGN TABLE tbl_0
PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 0)
SERVER db1 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');
CREATE FOREIGN TABLE tbl_1
PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 1)
SERVER db2 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');
CREATE FOREIGN TABLE tbl_2
PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 2)
SERVER db3 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');
CREATE FOREIGN TABLE tbl_3
PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 3)
SERVER db4 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');
Confirm that the vector Euclidean-distance operator maps to an immutable function:
\do+
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Function | Description
--------+------+---------------+----------------+------------------+-------------------------------+------------------------------
public | <#> | vector | vector | double precision | vector_negative_inner_product |
public | <-> | vector | vector | double precision | l2_distance |
public | <=> | vector | vector | double precision | cosine_distance |
postgres=# \df+ l2_distance
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code | Description
--------+-------------+------------------+---------------------+------+------------+----------+----------+----------+-------------------+----------+-------------+-------------
public | l2_distance | double precision | vector, vector | func | immutable | safe | postgres | invoker | | c | l2_distance |
(1 row)
Check whether all parameters that affect the parallel query for FDW partition tables are enabled:
postgres=# show enable_gathermerge ;
enable_gathermerge
--------------------
on
(1 row)
postgres=# show enable_async_append ;
enable_async_append
---------------------
on
(1 row)
db0=> show enable_parallel_append ;
enable_parallel_append
------------------------
on
(1 row)
Check the execution plan:
db0=> explain (verbose) SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=400.04..400.36 rows=5 width=84)
Output: tbl.id, tbl.c1, tbl.c2, tbl.c3, ((tbl.c1 <-> '[1,2,3]'::vector))
-> Merge Append (cost=400.04..610.20 rows=3276 width=84)
Sort Key: ((tbl.c1 <-> '[1,2,3]'::vector))
-> Foreign Scan on public.tbl_0 tbl_1 (cost=100.00..140.26 rows=819 width=84)
Output: tbl_1.id, tbl_1.c1, tbl_1.c2, tbl_1.c3, (tbl_1.c1 <-> '[1,2,3]'::vector)
Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST
-> Foreign Scan on public.tbl_1 tbl_2 (cost=100.00..140.26 rows=819 width=84)
Output: tbl_2.id, tbl_2.c1, tbl_2.c2, tbl_2.c3, (tbl_2.c1 <-> '[1,2,3]'::vector)
Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST
-> Foreign Scan on public.tbl_2 tbl_3 (cost=100.00..140.26 rows=819 width=84)
Output: tbl_3.id, tbl_3.c1, tbl_3.c2, tbl_3.c3, (tbl_3.c1 <-> '[1,2,3]'::vector)
Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST
-> Foreign Scan on public.tbl_3 tbl_4 (cost=100.00..140.26 rows=819 width=84)
Output: tbl_4.id, tbl_4.c1, tbl_4.c2, tbl_4.c3, (tbl_4.c1 <-> '[1,2,3]'::vector)
Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST
Query Identifier: -3107671033622996886
(17 rows)
db0=> explain (verbose) select count(*) from tbl;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Aggregate (cost=951.95..951.96 rows=1 width=8)
Output: count(*)
-> Append (cost=100.00..917.82 rows=13652 width=0)
-> Async Foreign Scan on public.tbl_0 tbl_1 (cost=100.00..212.39 rows=3413 width=0)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_1 tbl_2 (cost=100.00..212.39 rows=3413 width=0)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_2 tbl_3 (cost=100.00..212.39 rows=3413 width=0)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_3 tbl_4 (cost=100.00..212.39 rows=3413 width=0)
Remote SQL: SELECT NULL FROM public.tbl
Query Identifier: -7696835127160622742
(12 rows)
Check the plan again after inserting 1 million rows:
db0=> insert into tbl select generate_series(1,1000000), '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();
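As a side check (a sketch, not in the original article), tableoid shows how hash partitioning routed the rows across the four foreign tables:

```sql
-- Each row's tableoid identifies the partition it was read from:
SELECT tableoid::regclass AS partition, count(*)
FROM tbl
GROUP BY 1
ORDER BY 1;
```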
db1=> explain (analyze,verbose) select count(*) from tbl;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=8924.86..8924.87 rows=1 width=8) (actual time=51.989..51.991 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.tbl (cost=0.00..8300.89 rows=249589 width=0) (actual time=0.016..34.891 rows=249589 loops=1)
Output: id, c1, c2, c3
Query Identifier: 5443052778932622058
Planning Time: 2.148 ms
Execution Time: 52.154 ms
(7 rows)
db0=> explain (analyze,verbose) select count(*) from tbl;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=951.95..951.96 rows=1 width=8) (actual time=528.009..528.016 rows=1 loops=1)
Output: count(*)
-> Append (cost=100.00..917.82 rows=13652 width=0) (actual time=6.367..467.490 rows=1000000 loops=1)
-> Async Foreign Scan on public.tbl_0 tbl_1 (cost=100.00..212.39 rows=3413 width=0) (actual time=1.964..85.057 rows=249589 loops=1)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_1 tbl_2 (cost=100.00..212.39 rows=3413 width=0) (actual time=1.430..82.288 rows=250376 loops=1)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_2 tbl_3 (cost=100.00..212.39 rows=3413 width=0) (actual time=1.422..79.067 rows=249786 loops=1)
Remote SQL: SELECT NULL FROM public.tbl
-> Async Foreign Scan on public.tbl_3 tbl_4 (cost=100.00..212.39 rows=3413 width=0) (actual time=1.557..76.345 rows=250249 loops=1)
Remote SQL: SELECT NULL FROM public.tbl
Query Identifier: 6515815319459192952
Planning Time: 2.976 ms
Execution Time: 582.248 ms
(14 rows)
The whole query does not finish in the roughly 50 ms a single shard needs. The count(*) aggregate is not pushed down, so all 1 million rows are shipped to db0 and counted locally, taking about 580 ms.
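One workaround (a sketch, under the assumption that postgres_fdw can push an aggregate down to the remote server when the query targets a single foreign table directly) is to count each foreign table separately and sum the partial counts locally:

```sql
-- Each inner count(*) targets one foreign table, so it can be executed remotely;
-- only four partial counts travel over the network instead of 1 million rows:
SELECT (SELECT count(*) FROM tbl_0)
     + (SELECT count(*) FROM tbl_1)
     + (SELECT count(*) FROM tbl_2)
     + (SELECT count(*) FROM tbl_3) AS total;
```

Note that these subqueries still execute one after another; the dblink asynchronous approach mentioned at the beginning avoids that as well.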
db0=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;
id | c1 | c2 | c3
----+------------------------------------------------------------------------------------------+------+----------------------------
1 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
12 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
14 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
16 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
17 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
(5 rows)
db1=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;
id | c1 | c2 | c3
----+------------------------------------------------------------------------------------------+------+----------------------------
1 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
12 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
14 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
16 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
17 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
(5 rows)
db2=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;
id | c1 | c2 | c3
----+------------------------------------------------------------------------------------------+------+----------------------------
3 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
5 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
8 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
9 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
11 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479
(5 rows)
Several problems were found. For example, when the foreign tables are combined with a manual UNION ALL, the foreign scans do not run asynchronously; the elapsed time grows roughly in proportion to the number of foreign tables, as the timings below show. To demonstrate, insert one distinctive row into db1 and db2:
\c db1
insert into tbl select generate_series(1,1), '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();
\c db2
insert into tbl select generate_series(1,1), '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();
db0=> select * from
(select * from tbl_0
union all select * from tbl_1 ) t order by c1 <-> '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5;
id | c1 | c2 | c3
--------+------------------------------------------------------------------------------------------+------+----------------------------
1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:27:54.132848
862019 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
862021 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
862024 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
862029 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
(5 rows)
Time: 215.103 ms
db0=> select * from
(select * from tbl_0
union all
select * from tbl_1
union all
select * from tbl_2 ) t
order by c1 <-> '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5;
id | c1 | c2 | c3
--------+------------------------------------------------------------------------------------------+------+----------------------------
1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:27:54.132848
1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:28:52.980465
864133 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
864134 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
864136 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
(5 rows)
Time: 348.644 ms
Note that count is itself an immutable, parallel-safe aggregate, which makes the lack of pushdown surprising:
db0=> \df+ count
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code | Description
------------+-------+------------------+---------------------+------+------------+----------+----------+----------+-------------------+----------+-----------------+-----------------------------------------------------------------
pg_catalog | count | bigint | | agg | immutable | safe | postgres | invoker | | internal | aggregate_dummy | number of input rows
pg_catalog | count | bigint | "any" | agg | immutable | safe | postgres | invoker | | internal | aggregate_dummy | number of input rows for which the input expression is not null
(2 rows)
I hope that the asynchronous query capability of FDW will continue to improve.
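The dblink asynchronous pattern mentioned at the beginning can be sketched as follows for two shards (connection strings and the query text are illustrative, not from the original article):

```sql
-- A sketch of the dblink asynchronous call pattern; extend to all shards as needed.
CREATE EXTENSION IF NOT EXISTS dblink;

-- Open one named connection per shard:
SELECT dblink_connect('c1', 'host=127.0.0.1 port=1922 dbname=db1 user=test password=test123');
SELECT dblink_connect('c2', 'host=127.0.0.1 port=1922 dbname=db2 user=test password=test123');

-- Send the per-shard top-5 query to both shards without waiting for results:
SELECT dblink_send_query('c1',
  $$SELECT id, c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]'::vector
    FROM tbl ORDER BY 2 LIMIT 5$$);
SELECT dblink_send_query('c2',
  $$SELECT id, c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]'::vector
    FROM tbl ORDER BY 2 LIMIT 5$$);

-- Collect both result sets and merge the global top 5 locally:
SELECT * FROM dblink_get_result('c1') AS t(id int, dist float8)
UNION ALL
SELECT * FROM dblink_get_result('c2') AS t(id int, dist float8)
ORDER BY dist LIMIT 5;
```

Because both queries are in flight at the same time, the total latency is close to that of the slowest shard rather than the sum of all shards.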