
PostgreSQL + FDW + Vector Plug-in Accelerate Vector Retrieval

This article discusses vector retrieval and introduces the combination of PostgreSQL postgres_fdw and vector plug-ins.

By digoal

Background

Vector: A vector can be simply understood as an array of numbers. Unstructured objects (images, videos, audio, etc.) can be converted into N-dimensional numeric arrays that capture the characteristics of those objects. These arrays can then be searched for recognition, similarity, and other purposes (vector retrieval).
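
For example, with the pgvector extension installed (it is built and created later in this article), a vector is written as a bracketed list of numbers, and the <-> operator returns the Euclidean (L2) distance between two vectors. A tiny illustration:

-- illustration only; requires the vector extension
select '[1,2,3]'::vector <-> '[2,3,4]'::vector as l2_distance;
-- returns sqrt(3), approximately 1.732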

Vector retrieval is a CPU- and I/O-intensive query even when an index is available. Therefore, the first thing that comes to mind is how to improve scalability (for example, how to use the resources of multiple machines to increase request-processing throughput).

This article introduces the combination of PostgreSQL postgres_fdw and vector plug-ins. This combination helps accelerate the similarity search used in image recognition. However, there are still some issues with parallelism; it is suggested to raise these FDW issues with the PostgreSQL community. Currently, the best-performing method appears to be asynchronous calls through dblink.

If the vector plug-ins and vector indexes meet your needs in the single-machine case, you don't need to read any further.

  • Note: You must understand how the ivfflat index of the vector plug-in works. When generating test data, consider the characteristics of the real data set: the data should be neither completely random nor all identical. When creating the index, you need to decide how many lists (buckets) to build, and at query time you need to decide how many of those lists to probe for filtering (see the sketch below).
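
A minimal sketch of these two knobs in pgvector (the table name items, the column name embedding, and the values 100 and 10 are placeholders, not recommendations):

-- the number of buckets (lists) is fixed when the index is built
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
-- the number of buckets scanned per query is a session parameter
SET ivfflat.probes = 10;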

You can also refer to citus, dblink asynchronous call, and PolarDB for PostgreSQL for other methods.

How Does postgres_fdw Implement Parallel Query of SELECT?

https://www.postgresql.org/docs/devel/postgres-fdw.html

1.  The WHERE clauses of a SELECT statement are pushed down to the remote server only if they meet the following requirements (see the sketch after this list):

  • They use only built-in data types, operators, and functions, or data types, operators, and functions that belong to an extension listed in the foreign server's extensions option.
  • Operators and functions in such clauses must be IMMUTABLE as well.

2.  It supports the FDW partition table.

  • Supported

3.  It supports asynchronous queries on FDW partitions, i.e., querying multiple FDW servers simultaneously in parallel.

  • Supported, but not all requests will enable async foreign scan. Please study the postgres_fdw code for more information.

4.  It supports the parallel query switch.

  • Supported

5.  Sort pushdown

  • Supported

6.  Merge sort append

  • Supported

7.  LIMIT pushdown or fetch_size (100 by default)

  • Supported
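
Regarding point 1, whether the operators of an extension (such as the <-> operator of vector) can be shipped to the remote side is controlled by the foreign server's extensions option. A hedged sketch, assuming the foreign server db1 and the foreign table tbl_0 that are created later in this article:

-- declare 'vector' as shippable on this server (use SET instead of ADD if the option already exists)
ALTER SERVER db1 OPTIONS (ADD extensions 'vector');
-- the Remote SQL line of the plan shows whether the ORDER BY was pushed down
EXPLAIN (VERBOSE) SELECT * FROM tbl_0
ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;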

Example

Deploy the vector plug-in:

git clone https://github.com/pgvector/pgvector  
cd pgvector  
  
export PATH=/Users/digoal/pg14/bin:$PATH  
export PGDATA=/Users/digoal/data14  
export PGUSER=postgres  
export PGPORT=1922  
  
USE_PGXS=1 make  
make install  
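
As a quick sanity check (not part of the original steps), confirm that the server can see the freshly installed extension before running CREATE EXTENSION:

select name, default_version, installed_version
from pg_available_extensions
where name = 'vector';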

Since this is a local test, configure password authentication in pg_hba.conf:

pg_hba.conf  
host    all             all             127.0.0.1/32            md5  
  
pg_ctl reload  

Create a role:

postgres=# create role test login encrypted password 'test123';  
CREATE ROLE  

Create several databases. (In a real scenario, these would be separate instances on multiple machines; this test uses different databases in the same instance to simulate that.)

create database db0;  
create database db1;  
create database db2;  
create database db3;  
create database db4;  
postgres=# grant all on database db0,db1,db2,db3,db4 to test;  
GRANT  


\c db0;  
create extension postgres_fdw;  
create extension vector ;  

\c db1  
create extension vector ;  
\c db2  
create extension vector ;  
\c db3  
create extension vector ;  
\c db4  
create extension vector ;  

db0 is the query entry point, so the postgres_fdw plug-in, the vector plug-in, the entry (partitioned) table, and the FDW partitions are all created in db0.

db0=# \c db0 test  
You are now connected to database "db0" as user "test".  
db0=>   
  
CREATE TABLE tbl (id int, c1 vector(32), c2 text, c3 timestamp) PARTITION BY hash (id);  
CREATE INDEX idx_tbl_1 ON tbl USING ivfflat (c1 vector_l2_ops);  
SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;  

Log in to db1, db2, db3, and db4 as the test role and create the shard table and index:

CREATE TABLE tbl (id int, c1 vector(32), c2 text, c3 timestamp);  
CREATE INDEX idx_tbl_1 ON tbl USING ivfflat (c1 vector_l2_ops);  

In db0, create the foreign servers with asynchronous requests enabled and set the extensions option (so that the vector operators can be pushed down). Then configure the user mappings:

\c db0 postgres  
db0=# show port;  
 port   
------  
 1922  
(1 row)  
  
  
create server db1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db1', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');  
create server db2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db2', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');  
create server db3 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db3', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');  
create server db4 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', dbname 'db4', port '1922', async_capable 'true', extensions 'vector', batch_size '200', use_remote_estimate 'true');  
  
grant all on FOREIGN server db1,db2,db3,db4 to test;  
  
  
CREATE USER MAPPING FOR test SERVER db1 OPTIONS (user 'test', password 'test123');  
CREATE USER MAPPING FOR test SERVER db2 OPTIONS (user 'test', password 'test123');  
CREATE USER MAPPING FOR test SERVER db3 OPTIONS (user 'test', password 'test123');  
CREATE USER MAPPING FOR test SERVER db4 OPTIONS (user 'test', password 'test123');  
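
To double-check what was created above, the foreign servers, their options, and the user mappings can be listed with psql meta-commands or from the catalogs (output omitted; this is only a verification step, not part of the original setup):

\des+
\deu+
select srvname, srvoptions from pg_foreign_server;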

In db0, create the foreign tables as partitions of tbl; refer to the postgres_fdw manual for the available options. You don't have to copy the configuration used here, but you should understand what each option does (for example, fetch_size can be increased when queries return many rows and decreased when they return few).

\c db0 test  
  
CREATE FOREIGN TABLE tbl_0  
    PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 0)  
    SERVER db1 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');  
  
CREATE FOREIGN TABLE tbl_1  
    PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 1)  
    SERVER db2 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');  
  
CREATE FOREIGN TABLE tbl_2  
    PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 2)  
    SERVER db3 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');  
  
CREATE FOREIGN TABLE tbl_3  
    PARTITION OF tbl FOR VALUES WITH ( MODULUS 4, REMAINDER 3)  
    SERVER db4 OPTIONS (schema_name 'public', table_name 'tbl', async_capable 'true', fetch_size '1');  
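
The partition layout can be verified at this point; \d+ tbl should show tbl_0 through tbl_3 attached as foreign partitions, and \det+ lists the foreign tables with their server options (output omitted):

\d+ tbl
\det+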

Confirm that the vector Euclidean distance operator (<->) corresponds to an IMMUTABLE function:

\do+  
                                                        List of operators  
 Schema | Name | Left arg type | Right arg type |   Result type    |           Function            |         Description            
--------+------+---------------+----------------+------------------+-------------------------------+------------------------------  
 public | <#>  | vector        | vector         | double precision | vector_negative_inner_product |   
 public | <->  | vector        | vector         | double precision | l2_distance                   |   
 public | <=>  | vector        | vector         | double precision | cosine_distance               |   
  
  
postgres=# \df+ l2_distance  
                                                                               List of functions  
 Schema |    Name     | Result data type | Argument data types | Type | Volatility | Parallel |  Owner   | Security | Access privileges | Language | Source code | Description   
--------+-------------+------------------+---------------------+------+------------+----------+----------+----------+-------------------+----------+-------------+-------------  
 public | l2_distance | double precision | vector, vector      | func | immutable  | safe     | postgres | invoker  |                   | c        | l2_distance |   
(1 row)  

Check whether all the parameters that affect parallel and asynchronous queries on FDW partitioned tables are enabled:

postgres=# show enable_gathermerge ;  
 enable_gathermerge   
--------------------  
 on  
(1 row)  
  
postgres=# show enable_async_append ;  
 enable_async_append   
---------------------  
 on  
(1 row)  
  
db0=> show enable_parallel_append ;  
 enable_parallel_append   
------------------------  
 on  
(1 row)  

Check the execution plan:

db0=> explain (verbose) SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;  
                                                                 QUERY PLAN                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=400.04..400.36 rows=5 width=84)  
   Output: tbl.id, tbl.c1, tbl.c2, tbl.c3, ((tbl.c1 <-> '[1,2,3]'::vector))  
   ->  Merge Append  (cost=400.04..610.20 rows=3276 width=84)  
         Sort Key: ((tbl.c1 <-> '[1,2,3]'::vector))  
         ->  Foreign Scan on public.tbl_0 tbl_1  (cost=100.00..140.26 rows=819 width=84)  
               Output: tbl_1.id, tbl_1.c1, tbl_1.c2, tbl_1.c3, (tbl_1.c1 <-> '[1,2,3]'::vector)  
               Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST  
         ->  Foreign Scan on public.tbl_1 tbl_2  (cost=100.00..140.26 rows=819 width=84)  
               Output: tbl_2.id, tbl_2.c1, tbl_2.c2, tbl_2.c3, (tbl_2.c1 <-> '[1,2,3]'::vector)  
               Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST  
         ->  Foreign Scan on public.tbl_2 tbl_3  (cost=100.00..140.26 rows=819 width=84)  
               Output: tbl_3.id, tbl_3.c1, tbl_3.c2, tbl_3.c3, (tbl_3.c1 <-> '[1,2,3]'::vector)  
               Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST  
         ->  Foreign Scan on public.tbl_3 tbl_4  (cost=100.00..140.26 rows=819 width=84)  
               Output: tbl_4.id, tbl_4.c1, tbl_4.c2, tbl_4.c3, (tbl_4.c1 <-> '[1,2,3]'::vector)  
               Remote SQL: SELECT id, c1, c2, c3 FROM public.tbl ORDER BY (c1 OPERATOR(public.<->) '[1,2,3]'::public.vector) ASC NULLS LAST  
 Query Identifier: -3107671033622996886  
(17 rows)  
  
  
db0=> explain (verbose) select count(*) from tbl;  
                                          QUERY PLAN                                             
-----------------------------------------------------------------------------------------------  
 Aggregate  (cost=951.95..951.96 rows=1 width=8)  
   Output: count(*)  
   ->  Append  (cost=100.00..917.82 rows=13652 width=0)  
         ->  Async Foreign Scan on public.tbl_0 tbl_1  (cost=100.00..212.39 rows=3413 width=0)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_1 tbl_2  (cost=100.00..212.39 rows=3413 width=0)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_2 tbl_3  (cost=100.00..212.39 rows=3413 width=0)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_3 tbl_4  (cost=100.00..212.39 rows=3413 width=0)  
               Remote SQL: SELECT NULL FROM public.tbl  
 Query Identifier: -7696835127160622742  
(12 rows)  

Check the plan again after writing 1 million rows:

db0=> insert into tbl select generate_series(1,1000000), '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();  
  
db1=> explain (analyze,verbose) select count(*) from tbl;  
                                                       QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=8924.86..8924.87 rows=1 width=8) (actual time=51.989..51.991 rows=1 loops=1)  
   Output: count(*)  
   ->  Seq Scan on public.tbl  (cost=0.00..8300.89 rows=249589 width=0) (actual time=0.016..34.891 rows=249589 loops=1)  
         Output: id, c1, c2, c3  
 Query Identifier: 5443052778932622058  
 Planning Time: 2.148 ms  
 Execution Time: 52.154 ms  
(7 rows)  
  
db0=> explain (analyze,verbose) select count(*) from tbl;   
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=951.95..951.96 rows=1 width=8) (actual time=528.009..528.016 rows=1 loops=1)  
   Output: count(*)  
   ->  Append  (cost=100.00..917.82 rows=13652 width=0) (actual time=6.367..467.490 rows=1000000 loops=1)  
         ->  Async Foreign Scan on public.tbl_0 tbl_1  (cost=100.00..212.39 rows=3413 width=0) (actual time=1.964..85.057 rows=249589 loops=1)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_1 tbl_2  (cost=100.00..212.39 rows=3413 width=0) (actual time=1.430..82.288 rows=250376 loops=1)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_2 tbl_3  (cost=100.00..212.39 rows=3413 width=0) (actual time=1.422..79.067 rows=249786 loops=1)  
               Remote SQL: SELECT NULL FROM public.tbl  
         ->  Async Foreign Scan on public.tbl_3 tbl_4  (cost=100.00..212.39 rows=3413 width=0) (actual time=1.557..76.345 rows=250249 loops=1)  
               Remote SQL: SELECT NULL FROM public.tbl  
 Query Identifier: 6515815319459192952  
 Planning Time: 2.976 ms  
 Execution Time: 582.248 ms  
(14 rows)  

Unlike the direct query on db1, which finishes in about 50 ms, the FDW query takes about 580 ms: count(*) is not pushed down, so all rows are fetched from the shards and aggregated locally.
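
Until aggregate pushdown works through the partitioned parent, a possible workaround (a sketch only, not benchmarked here) is to aggregate each foreign partition separately and combine the partial counts locally; postgres_fdw can usually push a plain count(*) down when a single foreign table is queried directly, so only four numbers travel over the network:

select sum(cnt) as total
from (
  select count(*) as cnt from tbl_0
  union all
  select count(*) as cnt from tbl_1
  union all
  select count(*) as cnt from tbl_2
  union all
  select count(*) as cnt from tbl_3
) t;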

db0=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;  
 id |                                            c1                                            |  c2  |             c3               
----+------------------------------------------------------------------------------------------+------+----------------------------  
  1 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 12 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 14 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 16 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 17 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
(5 rows)  
  
  
db1=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;  
 id |                                            c1                                            |  c2  |             c3               
----+------------------------------------------------------------------------------------------+------+----------------------------  
  1 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 12 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 14 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 16 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 17 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
(5 rows)  
  
  
db2=> SELECT * FROM tbl ORDER BY c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' LIMIT 5;  
 id |                                            c1                                            |  c2  |             c3               
----+------------------------------------------------------------------------------------------+------+----------------------------  
  3 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
  5 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
  8 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
  9 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
 11 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-02 15:55:26.214479  
(5 rows)  

Several problems are found:

  • Not all requests can use Async Foreign Scan. In the example below, after the request is sent to the first foreign table, the first batch of results must be returned before the request is sent to the second foreign table.
\c db1
insert into tbl select generate_series(1,1), '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();  

\c db2
insert into tbl select generate_series(1,1), '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]', 'test', now();  


db0=> select * from
(select * from tbl_0
union all
select * from tbl_1 ) t
order by c1 <-> '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5;
   id   |                                            c1                                            |  c2  |             c3             
--------+------------------------------------------------------------------------------------------+------+----------------------------
      1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:27:54.132848
 862019 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
 862021 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
 862024 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
 862029 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
(5 rows)

Time: 215.103 ms

db0=> select * from
(select * from tbl_0
union all
select * from tbl_1
union all
select * from tbl_2 ) t
order by c1 <-> '[2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5;
   id   |                                            c1                                            |  c2  |             c3             
--------+------------------------------------------------------------------------------------------+------+----------------------------
      1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:27:54.132848
      1 | [2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:28:52.980465
 864133 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
 864134 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
 864136 | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] | test | 2022-03-03 11:11:52.261664
(5 rows)

Time: 348.644 ms
  • Although count is a built-in, parallel-safe aggregate function, it is not pushed down.
db0=> \df+ count
                                                                                   List of functions
   Schema   | Name  | Result data type | Argument data types | Type | Volatility | Parallel |  Owner   | Security | Access privileges | Language |   Source code   |                           Description                           
------------+-------+------------------+---------------------+------+------------+----------+----------+----------+-------------------+----------+-----------------+-----------------------------------------------------------------
 pg_catalog | count | bigint           |                     | agg  | immutable  | safe     | postgres | invoker  |                   | internal | aggregate_dummy | number of input rows
 pg_catalog | count | bigint           | "any"               | agg  | immutable  | safe     | postgres | invoker  |                   | internal | aggregate_dummy | number of input rows for which the input expression is not null
(2 rows)

I hope that the asynchronous query capability of FDW becomes more complete. Until then, the dblink asynchronous approach mentioned at the beginning of this article is an alternative; a minimal sketch follows.
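
A minimal sketch of the dblink asynchronous fan-out, reusing the connection parameters from this test (connection names are arbitrary; shards db3 and db4, error handling, and connection pooling are omitted for brevity):

create extension if not exists dblink;

-- open one named connection per shard
select dblink_connect('c1', 'host=127.0.0.1 port=1922 dbname=db1 user=test password=test123');
select dblink_connect('c2', 'host=127.0.0.1 port=1922 dbname=db2 user=test password=test123');

-- send the per-shard top-5 query to all shards without waiting for results
select dblink_send_query('c1', $$select id, c1, c2, c3 from tbl order by c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5$$);
select dblink_send_query('c2', $$select id, c1, c2, c3 from tbl order by c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5$$);

-- collect the partial results and merge the global top 5 locally
select * from (
  select * from dblink_get_result('c1') as t(id int, c1 vector(32), c2 text, c3 timestamp)
  union all
  select * from dblink_get_result('c2') as t(id int, c1 vector(32), c2 text, c3 timestamp)
) s
order by c1 <-> '[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]' limit 5;

select dblink_disconnect('c1');
select dblink_disconnect('c2');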
