The vector analysis feature of AnalyticDB for PostgreSQL is used to retrieve and analyze unstructured data. This feature offers the following advantages over conventional vector analysis systems:
  • Hybrid analysis of both structured and unstructured data

    For example, you can search for items that resemble the dress shown in an input picture, are priced between USD 100 and USD 200, and are released within the past month.

  • Real-time data updates

    Conventional vector analysis systems do not support real-time write operations: data that you insert does not become available for queries until the next day. Vector analysis of AnalyticDB for PostgreSQL supports real-time data updates and queries.

  • Vector analysis collision

    Vector analysis of AnalyticDB for PostgreSQL supports the k-nearest neighbor join (kNN join) operation, which is similar to the kNN join operation in Spark. This operation compares the similarities between two sets of vectors, which requires computations over huge volumes of data. AnalyticDB for PostgreSQL has been optimized to handle these heavy computation workloads.

    A typical scenario for vector analysis collision is item deduplication, in which AnalyticDB for PostgreSQL finds new items that resemble items in the historical item database. Another scenario is face clustering, in which AnalyticDB for PostgreSQL analyzes database data across a period of time to identify faces that belong to the same person.

  • Ease of use

    You can use vector analysis of AnalyticDB for PostgreSQL after you purchase an AnalyticDB for PostgreSQL instance. To simplify application development, vector analysis supports standard SQL statements. In addition, vector analysis is embedded with common feature and attribute extraction services and can be integrated with third-party feature extraction services.

  • Cost-effectiveness

    Vector data occupies a large amount of storage space: a single 512-dimensional FLOAT vector consumes 2 KB of storage space. Vector analysis of AnalyticDB for PostgreSQL supports data compression from FP32 to FP16, which can reduce storage costs by half.
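The kNN join operation described under "Vector analysis collision" can be sketched in Python. This is a conceptual brute-force sketch only; the function name and sample data are illustrative, and the database distributes and optimizes this computation across segments:

```python
import math

def l2(a, b):
    # Euclidean (L2) distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_join(left, right, k):
    # For each vector in `left`, find the indexes of the k nearest
    # vectors in `right`. This is the O(len(left) * len(right)) core
    # that makes kNN join computationally heavy at scale.
    return {
        i: sorted(range(len(right)), key=lambda j: l2(v, right[j]))[:k]
        for i, v in enumerate(left)
    }

new_items = [[0.0, 0.0], [1.0, 1.0]]            # e.g. vectors of new items
history = [[0.1, 0.1], [5.0, 5.0], [0.9, 1.1]]  # historical item vectors
print(knn_join(new_items, history, k=1))        # {0: [0], 1: [2]}
```

In the item-deduplication scenario, each new item would be compared against the historical item database in exactly this set-versus-set fashion.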

1. DDL

This section describes how to create a vector column and an index when you create a table.
Note The syntax is compatible with standard SQL 98 syntax.

1.1 Create a table

Syntax:

CREATE TABLE [TABLE_NAME]
(  
    C1 DATATYPE,  
    C2 DATATYPE,  
    ......,  
    CN REAL[], 
    PRIMARY KEY(one or multiple columns)
);
Note The table name must be unique among the names of tables, sequences, indexes, views, and external tables in the same schema. An empty table that is owned by you is created after you execute the preceding statement.

Create a table named FACE_TABLE. Specify the C2 column as a vector column and the C1 column as the primary key. NOT NULL indicates that the column cannot contain null values.

CREATE TABLE FACE_TABLE (  
    C1 INT,  
    C2 REAL[] NOT NULL,  
    C3 TIMESTAMP NOT NULL,  
    C4 VARCHAR(20) NOT NULL,  
    PRIMARY KEY (C1)
);
Note Vector columns store vectors of the data types built into PostgreSQL. In this example, the C2 column stores a vector array of the REAL data type.

1.2 Create an index

An index is a data structure that speeds up data retrieval. A vector index references the vector data in a table and is traversed to find vectors that are similar to the input vector. An index speeds up SELECT statements and WHERE clauses during vector retrieval, but may reduce the write and update performance of INSERT and UPDATE statements.

Syntax:

CREATE INDEX [INDEX_NAME]
ON [SCHEMA_NAME].[TABLE_NAME]
USING ANN(COLUMN_NAME) 
WITH (DIM=$DIMENSION);

Description of each field:

  • INDEX_NAME: the name of the index.
  • SCHEMA_NAME: the name of the schema or namespace.
  • TABLE_NAME: the name of the table.
  • COLUMN_NAME: the name of the vector column on which the index is created.
  • DIMENSION: the length of the feature vector. Valid values: 64 to 8192. This parameter is used to determine whether a vector can be inserted. The system displays an error message when you attempt to insert a vector of an invalid dimension.

1.3 DDL examples

In an item database, the products table contains the following fields.

Field   | Type         | Description
--------|--------------|---------------------------------------------
Id      | serial       | The ID of an item.
Name    | varchar(256) | The name of an item.
Price   | real         | The price of an item.
InTime  | timestamp    | The time when an item was released.
Url     | varchar(256) | The URL of the picture showing an item.
Feature | real[]       | The features of the picture showing an item.
The DDL statement to create the table is as follows:

create table products (
    Id serial primary key, 
    Name varchar(256), 
    Price real, 
    InTime timestamp, 
    Url varchar(256),
    Feature real[]
);

You need to create a vector index to speed up vector queries.

create index on products using ann(feature) with (dim=512);

The access method is specified as approximate nearest neighbor (ann). The ann method is provided by the FastANN plug-in. When you create a vector index, you must use the WITH clause to specify the vector dimension by using the dim parameter.

Note Only Euclidean Distance (L2 Distance) is supported for vector indexes. To use vector indexes, you must first convert other distance types to Euclidean Distance.

For other distance types, only the corresponding user-defined functions (UDFs) are provided for brute-force calculation. The following UDFs are provided to calculate distances:

  • dp_distance: calculates the dot-product distance.
  • hm_distance: calculates the Hamming distance.
  • l2_distance: calculates the Euclidean distance.
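As a point of reference, the following Python sketch shows what each of these distance measures conventionally computes. The exact sign convention of the dp_distance UDF is not specified in this document, so writing it as the negated dot product (smaller value = more similar, consistent with the other measures) is an assumption here:

```python
import math

def l2_distance(a, b):
    # Euclidean distance: the only measure that vector indexes accelerate
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_distance(a, b):
    # Dot-product distance, written here as the negated dot product so
    # that smaller values mean "more similar" (an assumed convention)
    return -sum(x * y for x, y in zip(a, b))

def hm_distance(a, b):
    # Hamming distance: the number of positions whose entries differ
    return sum(1 for x, y in zip(a, b) if x != y)

print(l2_distance([3.0, 4.0], [0.0, 0.0]))      # 5.0
print(dp_distance([1.0, 2.0], [3.0, 4.0]))      # -11.0
print(hm_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # 2
```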

You can create indexes on commonly used structured columns to speed up integrated queries of vectors and structured data. For a combination of conditions that you frequently analyze, you can create a composite index. Example:

create index on products(price, intime);

2. Insert data

You can use the INSERT statement to insert data. A vector value is written by using the ARRAY constructor:

insert into products values (default, 'white shirt', 666.233, '2020-03-10', 'aaa.bbb.ccc/xxx.jpg', 
ARRAY[0.1, 0.2, 0.1, 0.3, ... 0.9]);

3. Execute a query

The syntax varies slightly depending on whether an index is used.

3.1 Execute an explicit brute-force query

Assume that you need to search for items that resemble the item shown in an input picture, are priced between USD 100 and USD 200, and are released within the past month.

The SQL query statement is as follows:

select id, price from products where 
    price > 100 and price <= 200 
    and InTime > '2019-03-01' and InTime <= '2019-03-31' 
order by 
    l2_distance(array[10,2.0,..., 512.0], feature) 
limit 100;
Note This is a standard SQL statement. The execution procedure is as follows: filter out the items that meet the price and time conditions, calculate the L2 distance between each candidate vector and the input vector, and then sort the results by distance and return the first 100 rows. This type of SQL statement is suitable for scenarios with small amounts of data but high recall requirements. It is called an explicit brute-force query because it does not generate execution plans that use vector indexes.

3.2 Use vector indexes to speed up vector queries

To use vector indexes to speed up vector queries, you must replace the distance calculation UDF in the ORDER BY clause with the "vector column <-> input vector" expression. For example, you can use the following SQL statement to speed up the vector query in section 3.1:

select id, price from products where
    price > 100 and price <= 200 
    and InTime > '2019-03-01' and InTime <= '2019-03-31' 
order by 
    feature <-> array[10,2.0,..., 512.0] 
limit 100;
Note To use a vector index, the ORDER BY clause must follow these rules:
  • You must sort the vector column in ascending order or omit the sort order when you use the ORDER BY clause together with the vector index. If you need to sort results in descending order of distance, or sort the output based on other columns after the vector index is used, write the vector query as a subquery and add the ORDER BY clause to the parent query.
  • The ORDER BY clause must contain only the "vector column <-> input vector" expression.
If you do not follow the preceding rules, vector queries are still performed, but the vector indexes cannot speed them up.

3.3 Execute integrated queries of vectors and structured data

In addition to basic vector queries, vector analysis of AnalyticDB for PostgreSQL also supports the integrated queries of vectors and structured data. The following SQL statement is used as an example:

select id, price from products where 
    price > 100 and price <=200 
    and InTime > '2019-03-01 00:00:00' and InTime <= '2019-03-31 00:00:00' 
order by 
    feature <-> array[10,2.0,..., 512.0] 
limit 100;

To execute the preceding SQL statement, vector analysis of AnalyticDB for PostgreSQL may generate five execution plans that fall into three categories, and it selects the optimal plan to meet user requirements for performance and recall. The categories are determined based on whether the structured condition columns (price and intime) have an index and on the selectivity of the structured conditions.

First category: brute-force queries

Brute-force queries obtain all rows that meet the specified structured conditions, sort the rows by vector distance, and return the 100 rows with the smallest distances. This type of execution plan has a recall rate of 100% but is the slowest of all execution plans. It delivers poor performance when the underlying database is large or when a large number of rows meet the structured conditions. The execution plan with simplified output information is as follows:

                                       QUERY PLAN
-----------------------------------------------------------------------------------
 Limit
     ->  Gather Motion 3:1  (slice1; segments: 3)   
         Merge Key: (l2_distance($0, feature))     
             ->  Limit
                 ->  Sort
                     Sort Key: (l2_distance($0, feature))
                     ->  Index Scan using products_price_idx on products
                         Index Cond: the filter condition of the price.
                         Filter: the filter condition of the time.

 Optimizer: Postgres query optimizer

Second category: combination of vector queries and structured filtering

To speed up a query, you can use a vector index to retrieve the N rows of data that are most similar to the input vector, and then filter the data based on the structured conditions. This type of execution plan is the fastest of all types. However, if only a low proportion of data rows meet the structured filter conditions, the number of returned rows may be smaller than the specified limit. The execution plan with simplified output information is as follows:

                                       QUERY PLAN
-----------------------------------------------------------------------------------
 Limit 
    ->  Gather Motion 3:1  (slice1; segments: 3) 
        Merge Key: ((feature <-> $0))     
            ->  Limit
                ->  Ann Index Scan using products_feature_idx on products
                    Order By: (feature <-> $0)
                    Filter: the filter conditions of the price and time columns.

 Optimizer: Postgres query optimizer

Third category: integrated queries of vectors and structured data

Integrated queries of vectors and structured data combine the advantages of the first and second categories. They use vector indexes while avoiding the problem of insufficient returned rows seen in the second category. If a structured condition column has an index, a bitmap can be generated from that index and used to further accelerate the integrated query. The execution plan with simplified output information is as follows:

                                       QUERY PLAN
-----------------------------------------------------------------------------------
 Limit
     ->  Gather Motion 3:1  (slice1; segments: 3)   
         Merge Key: ((feature <-> $0))   
         ->  Limit
             ->  Fusion Ann Scan
                 ->  Bitmap Index Scan on products_price_idx
                     Index Cond: the filter condition of the price.
                     ->  Ann Index Scan with filter using products_feature_idx on products
                         Order By: (feature <-> $0)
                         Filter: the filter condition of the time.

 Optimizer: Postgres query optimizer
Note If an execution plan displayed by the EXPLAIN statement contains a Fusion Ann Scan or Ann Index Scan with filter node, the plan incorporates an integrated query. The two nodes are described as follows:
  • Ann Index Scan with filter: This node pushes filter conditions down to a vector index so that the filter conditions and the index are used together.
  • Fusion Ann Scan: This node may be displayed when certain structured condition columns have an index. This node speeds up the evaluation of structured conditions by generating a bitmap as a left subtree and pushing the bitmap down to a vector index as a right subtree. When a Fusion Ann Scan node is displayed, the right subtree may be either Ann Index Scan or Ann Index Scan with filter. If the Ann Index Scan node is displayed, only the bitmap is pushed down. If the Ann Index Scan with filter node is displayed, filter conditions in addition to the bitmap are pushed down.

AnalyticDB for PostgreSQL selects the most cost-efficient execution plan while ensuring a high recall rate.

4. Advanced features

Vector analysis of AnalyticDB for PostgreSQL provides a series of advanced features for ease of use at lower costs.

4.1 Feature extraction service

To help you use and experience vector analysis, AnalyticDB for PostgreSQL provides a feature extraction service, which is available after you purchase an AnalyticDB for PostgreSQL instance. You can also use your own feature extraction service. To better manage unstructured data, you can store the unstructured data in Object Storage Service (OSS) or on a picture server and store only its URL in AnalyticDB for PostgreSQL. The following figure shows the overall architecture of vector analysis with OSS.

Example of vector analysis of unstructured data based on AnalyticDB for PostgreSQL
  1. In the AnalyticDB for PostgreSQL console, register the feature extraction service.
  2. Store unstructured data in OSS, which then returns the URL used to access the data.
  3. AnalyticDB for PostgreSQL stores the URL.
  4. Use a web application to call a custom function of AnalyticDB for PostgreSQL to generate vector features. AnalyticDB for PostgreSQL calls the feature extraction service to read unstructured data from OSS, extract features, and then save the feature vector. The preceding operations are conducted using a single SQL statement that functions similar to the following statements:
    select feature_extractor('clothes','https://xxx/1036684144_687583347.jpg');  
    insert into product(id, url, feature) values(0, 'http://xxx/1036684144_687583347.jpg', feature_extractor('clothes','https://xxx/1036684144_687583347.jpg')); 
    feature_extractor is a custom function that takes the item type and the URL of the specified picture as input and extracts the feature vector of the item. This vector is used to retrieve items and extract attributes.
    Note The first parameter in the feature_extractor function specifies the type of the attribute to extract. The type can be face, clothes, or text.

    Common services for extracting facial features, text features (BERT model), and clothing features are also integrated into AnalyticDB for PostgreSQL. You can also use your own feature extraction service.

4.2 FLOAT2 data type for compressed vector storage

This section describes compressed columns in the half-precision floating-point format and related operations. Conventional vector retrieval systems convert pictures, audio, and text into high-dimensional floating-point arrays, which consume a large amount of storage space. To reduce storage costs, AnalyticDB for PostgreSQL provides the FLOAT2 data type for compressed storage.

Note The system converts data between the FLOAT2 and FLOAT4 data types, which may affect the read/write performance of databases.

4.2.1 FLOAT2 data type

The half-precision floating-point format (FLOAT2) is a binary floating-point data type that occupies two bytes (16 bits) to store data that would occupy four bytes (32 bits) as FLOAT4. The IEEE 754 standard refers to this format as binary16, which has the following layout:

  • Sign bit: 1 bit
  • Exponent width: 5 bits
  • Significand precision: 11 bits (10 explicitly stored)
The following figure shows the format layout. The format is assumed to have an implicit leading significand bit with value 1 unless the exponent field is stored with all zeros. Therefore, only 10 bits of the significand appear in the memory format, but the total precision is 11 bits. This corresponds to log10(2^11) ≈ 3.311 decimal digits of precision.
0 01111 0000000000 = 1
0 01111 0000000001 = 1 + 2^-10 = 1.0009765625 (smallest number greater than one)
1 10000 0000000000 = −2

0 11110 1111111111 = 65504 (largest half-precision number)

0 00001 0000000000 = 2^-14 ≈ 6.10352 × 10^-5 (smallest positive normal number)
0 00000 1111111111 = 2^-14 − 2^-24 ≈ 6.09756 × 10^-5 (largest subnormal number)
0 00000 0000000001 = 2^-24 ≈ 5.96046 × 10^-8 (smallest positive subnormal number)

0 00000 0000000000 = 0
1 00000 0000000000 = −0

0 11111 0000000000 = infinity
1 11111 0000000000 = −infinity

0 01101 0101010101 = 0.333251953125 ≈ 1/3
By default, 1/3 rounds down, as in double precision, because of the odd number of bits in the significand.
To convert data between the FLOAT2 and FLOAT4 types, the system shifts bits and converts the exponent bias between 15 and 127. For example, to convert data from FLOAT2 to FLOAT4, the system performs the following operations:
  1. Shift the sign bit to the left by 16 bits.
  2. Convert the exponent bias from 15 to 127 by adding 112 to the exponent, and shift the exponent field to the left by 13 bits into its binary32 position.
  3. Shift the significand to the left by 13 bits.
Note The opposite operations are performed to convert data from FLOAT4 to FLOAT2.
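The steps above can be sketched in Python for normal (non-zero, non-subnormal, finite) binary16 values. This is an illustrative sketch, not the database's implementation:

```python
import struct

def f16_bits_to_f32(h):
    # Decode a normal binary16 bit pattern by re-packing its fields into
    # binary32, following the three steps above. Zeros, subnormals, and
    # infinities would need extra cases and are omitted from this sketch.
    sign = (h >> 15) & 0x1   # 1-bit sign
    exp = (h >> 10) & 0x1F   # 5-bit exponent, bias 15
    frac = h & 0x3FF         # 10-bit stored significand
    # re-bias the exponent (15 -> 127, i.e. add 112) and shift both the
    # exponent and significand left by 13 bits into binary32 position
    bits32 = (sign << 31) | ((exp + 112) << 23) | (frac << 13)
    return struct.unpack('>f', struct.pack('>I', bits32))[0]

# spot-check against the bit-pattern table in this section
print(f16_bits_to_f32(0b0011110000000000))  # 1.0
print(f16_bits_to_f32(0b0011110000000001))  # 1.0009765625
print(f16_bits_to_f32(0b1100000000000000))  # -2.0
print(f16_bits_to_f32(0b0111101111111111))  # 65504.0
```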

The floating-point compression causes some loss of precision during queries and calculations. In most business scenarios, this amount of precision loss is acceptable.

The FLOAT2 data type compresses four-byte data into two bytes and occupies half of the disk space compared with the FLOAT4 data type. The compression ratio for vector columns is 0.5.

Values stored as the FLOAT2 data type must be in the range of -65519.99 to 65519.99. If a value is greater than 65519, the system displays Infinity. If a value is less than -65519, the system displays -Infinity. Vectors must be normalized to a range from 0 to 1 before vector retrieval. Otherwise, the calculated vector distances are likely to exceed the FLOAT2 value range and produce inaccurate results.
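One way to satisfy the normalization requirement is to scale each vector to unit L2 norm before insertion, which keeps every pairwise L2 distance within [0, 2]. Whether the service expects unit-norm or min-max scaling is not specified here, so this sketch assumes unit-norm scaling:

```python
import math

def normalize(vec):
    # Scale a vector to unit L2 norm. The distance between two unit
    # vectors is at most 2.0, comfortably inside the FLOAT2 value range.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```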

The conversion between FLOAT2 and FLOAT4 incurs some performance overhead. Two conversion algorithms are implemented to convert the data types of FLOAT2 arrays:
  • Use C code to convert the FLOAT2 elements in an array. Only one element is converted at a time.
  • For hardware that supports Advanced Vector Extensions (AVX) and Streaming SIMD Extensions 2 (SSE2), use hardware-specific intrinsics. Four FLOAT2 elements can be converted at a time.
Because queries traverse data through indexes, only a small number of entries need to be converted for each query.

4.2.2 Create a table that uses the FLOAT2 data type

FLOAT2 is an internally defined data type. Conversions between FLOAT2 and other data types, as well as the related operators, are managed by the system. Therefore, you can use FLOAT2 as if it were a primitive data type.

Syntax:

CREATE TABLE [TABLE_NAME]
(  
    C1 INT,  
    C2 FLOAT2[],  
    C3 VARCHAR(20),  
    PRIMARY KEY(C1)
);
Note The C2 column stores FLOAT2 vectors.

Example:

Create a table named FACE_TABLE. Specify the C2 column as a column that stores FLOAT2 vectors.
CREATE TABLE FACE_TABLE (  
    C1 INT PRIMARY KEY,  
    C2 FLOAT2[],  
    C3 VARCHAR(20)
);

4.2.3 Insert data

You can insert FLOAT2 arrays into a table by using the following methods: explicitly define a FLOAT2 array and insert it into the table, as shown in sql1, or define a FLOAT4 array, which the system converts into a FLOAT2 array before inserting it into the table, as shown in sql2 and sql3.

Example:

Insert three entries into the FACE_TABLE table:
sql1 = INSERT INTO FACE_TABLE (C1, C2, C3)
    VALUES (1, ARRAY[1.3, 2.4, 5.6]::FLOAT2[], 'name1'); 

sql2 = INSERT INTO FACE_TABLE (c1, c2, c3) 
    VALUES (2, ARRAY [3.4, 6.1, 7.6]::REAL[], 'name2'); 

sql3 = INSERT INTO FACE_TABLE (c1, c2, c3) 
    VALUES (3, ARRAY [9.5, 1.2, 0.6]::FLOAT4[],'name3');

4.2.4 Query data

Storing data as FLOAT2 results in a loss of precision during queries. For example, the inserted value 1.3 is returned as 1.2998, and the inserted value 5.6 is returned as 5.60156. This level of precision loss can be ignored for vector retrieval.

Example:

SELECT * FROM FACE_TABLE; 
c1  |            c2             |  c3   
----+---------------------------+-------
  1 | {1.2998,2.40039,5.60156}  | name1
  2 | {3.40039,6.10156,7.60156} | name2
  3 | {9.5,1.2002,0.600098}     | name3
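The values above can be reproduced with Python's struct module, whose 'e' format implements the same IEEE 754 binary16 encoding that FLOAT2 uses:

```python
import struct

def round_trip_f16(x):
    # Encode to binary16 ('e' format) and decode back, mimicking what
    # happens when a value is stored in a FLOAT2 column and read again
    return struct.unpack('<e', struct.pack('<e', x))[0]

for x in (1.3, 2.4, 5.6):
    print(x, '->', round_trip_f16(x))
# 1.3 -> 1.2998046875
# 2.4 -> 2.400390625
# 5.6 -> 5.6015625
```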

4.2.5 Compression ratio of FLOAT2 data

The following statements create one table that stores FLOAT4 vectors and another table that stores FLOAT2 vectors, and then compare the sizes of the two tables.

--CREATE TABLE 
CREATE TABLE TAB1(A FLOAT4[]);
CREATE TABLE TAB2(A FLOAT2[]); 

--INSERT DATA 
INSERT INTO TAB1 
SELECT GEN_RAND_F2_ARR (1, 1024) FROM GENERATE_SERIES (1,10000);
INSERT INTO TAB2 
SELECT GEN_RAND_F2_ARR (1, 1024) FROM GENERATE_SERIES (1,10000); 

--QUERY SIZE
SELECT PG_SIZE_PRETTY (PG_RELATION_SIZE('tab1'));
 PG_SIZE_PRETTY 
----------------
 45 MB
(1 row)
 
SELECT PG_SIZE_PRETTY (PG_RELATION_SIZE('tab2')); 
 PG_SIZE_PRETTY
----------------
 21 MB
(1 row)

The FLOAT4 table occupies 45 MB of storage space, and the FLOAT2 table occupies 21 MB. The compressed FLOAT2 data occupies roughly half of the storage space occupied by the FLOAT4 data.
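A back-of-the-envelope check of those numbers: the raw element payload accounts for most of each table's size, and the remainder is per-row and per-page overhead (the exact overhead depends on the storage layout and is not specified here):

```python
ROWS, DIM = 10_000, 1_024

raw_f4 = ROWS * DIM * 4  # bytes of FLOAT4 element payload
raw_f2 = ROWS * DIM * 2  # bytes of FLOAT2 element payload

print(raw_f4 / 2**20)    # 39.0625 MiB raw; the FLOAT4 table measured 45 MB
print(raw_f2 / 2**20)    # 19.53125 MiB raw; the FLOAT2 table measured 21 MB
print(raw_f2 / raw_f4)   # 0.5 -- the element payload halves exactly
```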

4.2.6 Performance comparison between FLOAT2 data compression and decompression

The system provides two functions for data conversion. The array_f16_to_f32 function converts FLOAT2 vectors to FLOAT4 vectors, and the array_f32_to_f16 function converts FLOAT4 vectors to FLOAT2 vectors. Each test vector has 1,024 dimensions, and the test is performed on a machine that supports AVX and SSE2.

Example:

--CREATE TABLE 
CREATE TABLE TAB1(A FLOAT4[]);
CREATE TABLE TAB2(A FLOAT2[]); 

--INSERT TABLE
INSERT INTO TAB1 SELECT GEN_RAND_F2_ARR(1, 1024) FROM GENERATE_SERIES (1,10000);
INSERT INTO TAB2 SELECT GEN_RAND_F2_ARR(1, 1024) FROM GENERATE_SERIES (1,10000); 

\timing
--CONVERSION TIME
SELECT ARRAY_F32_TO_F16(a) FROM TAB1;  
    Time: 5998.832 ms (00:05.999)
SELECT ARRAY_F16_TO_F32(a) FROM TAB2;
    Time: 5507.388 ms (00:05.507) 

4.2.7 Distance calculation

The system provides the L2 distance calculation method, which converts FLOAT2 data to FLOAT4 data and then calculates the distance.

Example:

Calculate the L2 distance.

SELECT L2_DISTANCE(ARRAY[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]::FLOAT2[],
    ARRAY[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]::FLOAT2[]); 

SELECT L2_DISTANCE(ARRAY[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]::FLOAT4[], 
    ARRAY[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]::FLOAT2[]); 

SELECT L2_DISTANCE(ARRAY[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]::FLOAT2[], 
    ARRAY[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]::FLOAT2[]);

4.2.8 Usage scenarios of FLOAT2 data

Security systems store captured facial data in face tables at a scheduled time every day. Public security officers can search for relevant monitoring data based on an image of a face. The following section describes how to use FLOAT2 data during query.

  1. Create a table to store data related to facial recognition.
    CREATE TABLE FACE_TABLE (
      C1 INT PRIMARY KEY,
      C2 FLOAT2[],
      C3 VARCHAR(20)
    );
    Note
    • C1: the column that stores the face numbers
    • C2: the column that stores the face vectors
    • C3: the column that stores the face names
  2. Create a vector index in the FACE_TABLE table.
    CREATE INDEX FACE_TABLE_IDX 
    ON FACE_TABLE 
    USING ANN(C2) WITH(dim=10);
  3. Import monitoring data to the FACE_TABLE table.
    INSERT INTO FACE_TABLE (C1, C2, C3)  
    VALUES (1, ARRAY[1.3, 2.4, 5.6]::FLOAT2[], 'name1'); 
    
    INSERT INTO FACE_TABLE (c1, c2, c3) 
    VALUES (2, ARRAY[3.4, 6.1, 7.6]::REAL[], 'name2'); 
    
    INSERT INTO FACE_TABLE (c1, c2, c3) 
    VALUES (3, ARRAY[9.5, 1.2, 0.6]::FLOAT4[],'name3');
  4. Enter a face vector and query relevant information.
    SELECT * 
    FROM FACE_TABLE 
    ORDER BY C2 <-> ARRAY[2.81574,9.84361,8.07218]::FLOAT2[] 
    LIMIT 10;
    Note ARRAY[2.81574,9.84361,8.07218]::FLOAT2[] indicates the face vector to be queried. The system searches for relevant information in the underlying database.