All Products
Search
Document Center

PolarDB:Perform vector retrieval using the MySQL protocol

Last Updated:Nov 21, 2025

PolarDB for MySQL supports efficient vector similarity search using SQL through its built-in vector data type, distance functions, and vector indexes. This feature is suitable for applications such as recommendation systems, chatbots, and image retrieval. The vector computation capability is integrated into the database kernel, supports atomicity, consistency, isolation, and durability (ACID) transactions, and uses the In-Memory Column Index (IMCI) and algorithms such as HNSW to provide both exact k-nearest neighbor (KNN) search with 100% recall and high-performance approximate nearest neighbor (ANN) search.

Scope

Before you use this feature, ensure that your cluster meets the following version requirements: The kernel version must be MySQL 8.0.2, and the revision version must be 8.0.2.2.30 or later.

Version update history

Features vary by kernel version. We recommend that you upgrade your cluster to the latest version.

Feature

Minimum kernel version requirement

Description

Basic vector features and the HNSW index algorithm.

8.0.2.2.30

Provides the basic capabilities for vector retrieval.

Adds the FAISS_HNSW_FLAT and FAISS_HNSW_PQ index algorithms.

8.0.2.2.31

Introduces index algorithms based on the FAISS implementation.

Modify or delete vector indexes.

8.0.2.2.32

Lets you dynamically manage vector indexes using ALTER TABLE.

Vector type

Type definition

You can use the CREATE TABLE or ALTER TABLE statements with the VECTOR(N) type to define a vector column.

  • N represents the vector dimension, which can range from 1 to 16,383.

  • Each dimension of the vector is a single-precision floating-point number (4 bytes).

  • The vector type supports equality comparison only with other vector types and cannot be compared with any other type.

  • A vector column cannot be a primary key, foreign key, unique key, or partition key.

Example

-- The following example defines the vector dimension of column v1 in table t1 as 4.
CREATE TABLE t1 (id INT PRIMARY KEY, v1 VECTOR(4));
Note

If you define a dimension greater than 16,383, the error Data size (xxx Bytes, xxx dimensions) exceeds VECTOR max (65532 Bytes, 16383 dimensions) for column: 'xxx' is returned.

Type conversion

PolarDB provides the STRING_TO_VECTOR and VECTOR_TO_STRING functions to convert between string and binary formats.

Convert a string to binary (STRING_TO_VECTOR)

This function converts a vector from text format to the internal binary format of the database.

  • Input format: A sequence of floating-point numbers enclosed in square brackets ([]) and separated by commas (,), such as '[1.2, 3.4, 5.6]'.

    Note

    If the input string has an incorrect format, the error Data cannot be converted to a valid vector: 'xxx' is returned.

  • Byte order: The converted binary data uses the little-endian byte order. For example, the floating-point number 1.0, which has a hexadecimal representation of 0x3F800000, is stored in the database with a binary HEX value of 0000803F. Take note of this when you serialize data across systems.

Example

SELECT STRING_TO_VECTOR('[1,2,3,4]');
+-------------------------------+
| STRING_TO_VECTOR('[1,2,3,4]') |
+-------------------------------+
|   ?   @  @@  @              |
+-------------------------------+
SELECT HEX(STRING_TO_VECTOR('[1,2,3,4]'));
+------------------------------------+
| HEX(STRING_TO_VECTOR('[1,2,3,4]')) |
+------------------------------------+
| 0000803F000000400000404000008040   |
+------------------------------------+

Convert binary to a string (VECTOR_TO_STRING)

This function converts binary vector data stored in the database to a string format.

Input format: Every 4 bytes correspond to the binary representation of a floating-point number for one dimension of the vector, using the little-endian byte order. For example, the floating-point number 1.0 has a hexadecimal representation of 0x3F800000. When stored in the database, its binary HEX value is 0000803F. Take note of this when you serialize data across systems.

Note

If the input binary has an incorrect format, the error Data cannot be converted to a valid vector: '' is returned.

Example

SELECT VECTOR_TO_STRING(0x0000803F000000400000404000008040);
+------------------------------------------------------+
| VECTOR_TO_STRING(0x0000803F000000400000404000008040) |
+------------------------------------------------------+
| [1.00000e+00,2.00000e+00,3.00000e+00,4.00000e+00]    |
+------------------------------------------------------+

Related operation examples

Create a new table that contains a vector column

CREATE TABLE t1 (id INT PRIMARY KEY, v1 VECTOR(4));

Add a vector column to an existing table

ALTER TABLE t1 ADD COLUMN v1 VECTOR(4);

Insert a vector

  • Insert a vector as a string using the STRING_TO_VECTOR function.

    INSERT INTO t1 (id, v1) VALUES (1, STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]'));
  • Insert a vector in binary format.

    INSERT INTO t1 VALUES (2, 0x0000803F000000400000404000008040);

Insert or update a vector

Insert or update a vector as a string using STRING_TO_VECTOR.

INSERT INTO t1 VALUES (1, STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]')) ON DUPLICATE KEY UPDATE v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]');

Update a vector

Update a vector as a string using STRING_TO_VECTOR. You can also update the vector directly in binary format.

UPDATE t1 SET v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]') WHERE id = 1;

Delete a vector

Delete a vector as a string using STRING_TO_VECTOR. You can also delete the vector directly in binary format.

DELETE FROM t1 WHERE v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]');

Using vector indexes

PolarDB creates vector indexes by defining a specially formatted string in a column's COMMENT.

Prerequisites

Before you can use vector indexes, you must add a read-only node for the In-Memory Column Index (IMCI) and create an IMCI for the column or the entire table. Only IMCI read-only nodes can build vector indexes to support vector search.

Create a vector index

You can define a vector index by modifying a column's COMMENT in a CREATE TABLE or ALTER TABLE statement.

Syntax

COMMENT 'imci_vector_index=vector_index_algorithm(index_parameter_list) [other_comments]'
  • imci_vector_index=: A fixed prefix that declares a vector index definition.

  • vector_index_algorithm: The index algorithm, such as HNSW, FAISS_HNSW_FLAT, or FAISS_HNSW_PQ.

  • index_parameter_list: The index building parameters, enclosed in parentheses.

    • You can separate multiple parameters with spaces, commas, or semicolons. To avoid confusion, we recommend that you use commas.

    • The format for a single parameter is parameter_name=parameter_value or parameter_name:parameter_value. For example: HNSW(metric=COSINE,max_degree=16,ef_construction=300)

    • If all parameters use their default values, you must still include the parentheses. For example: COMMENT "imci_vector_index=HNSW() other_comments"

Index algorithms

Algorithm

Scenarios

HNSW

Implemented based on VSAG. This algorithm provides high precision but consumes a large amount of memory.

FAISS_HNSW_FLAT

Implemented based on FAISS. This algorithm provides high precision but consumes a large amount of memory.

FAISS_HNSW_PQ

Implemented based on FAISS. This algorithm uses product quantization (PQ) to compress vectors. This reduces memory usage but causes some precision loss.

Index parameters

Parameter

Available alias

Applicable algorithms

Description

Valid values/Range

Default value

metric

distance

  • HNSW

  • FAISS_HNSW_FLAT

  • FAISS_HNSW_PQ

The vector similarity measure.

  • COSINE (cosine distance)

  • INNER_PRODUCT (inner product)

  • EUCLIDEAN (Euclidean distance)

Note

Only uppercase values are supported.

COSINE

max_degree

M

  • HNSW

  • FAISS_HNSW_FLAT

  • FAISS_HNSW_PQ

The maximum number of connections for each node in the graph.

A positive integer. A larger value creates a denser graph, which slows down index building and increases memory usage, but may improve query precision. The recommended range is 16 to 64.

16

ef_construction

-

  • HNSW

  • FAISS_HNSW_FLAT

  • FAISS_HNSW_PQ

The number of neighbor nodes to search during index building.

A positive integer. A larger value improves index quality but increases the build time. The recommended range is 100 to 500.

200

pq_m

-

FAISS_HNSW_PQ

The number of subspaces for product quantization (PQ).

A positive integer that is a divisor of the vector dimension. The vector dimension must be divisible by pq_m. A larger value provides higher precision but also increases memory usage.

1

pq_nbits

-

FAISS_HNSW_PQ

The number of bits used for the quantized representation of each subspace in product quantization (PQ).

A positive integer less than or equal to 24. This parameter is typically set to 8, which means each subspace has 256 centroids.

8

Example 1: Define a vector index when creating a table

-- Method 1: Enable columnstore for the entire table, and then define a vector index on the v1 column.
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    v1 VECTOR(4) COMMENT 'imci_vector_index=HNSW(metric=COSINE, max_degree=16)'
) COMMENT 'COLUMNAR=1';

-- Method 2: Enable columnstore and a vector index only for the v1 column.
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    description TEXT,
    v1 VECTOR(4) COMMENT 'COLUMNAR=1, imci_vector_index=HNSW(metric=COSINE)'
);

Example 2: Add a vector index to an existing table

-- Assume that the existing table is t1.
CREATE TABLE t1 (
    id int auto_increment PRIMARY KEY,
    v1 vector(4)
);

-- Method 1: Add a table-level columnstore index and modify the COMMENT of the VECTOR column to specify the vector index algorithm and parameters.
ALTER TABLE t1 comment "COLUMNAR=1", MODIFY COLUMN v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)";


-- Method 2: Create a columnstore index only for the v1 column. When the columnstore index is initialized, a vector index is automatically created based on the COMMENT of the v1 column.
ALTER TABLE t1 MODIFY COLUMN v1 vector(4) COMMENT "COLUMNAR=1 imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)";

Parameters

Parameter

Description

imci_enable_inline_vector_index

Specifies whether to allow the creation of vector indexes.

  • ON (Default): When enabled, the IMCI read-only node creates a vector index based on the vector column's comment during IMCI creation.

  • OFF: When disabled, the IMCI read-only node no longer creates vector indexes for vector columns during IMCI creation. Existing vector indexes continue to be built, but are not loaded or built after a restart.

imci_vector_index_dump_rows_threshold

Controls the incremental write size for vector indexes. A background task periodically checks the increment between the current vector index snapshot offset and the IMCI snapshot offset. When the number of incremental rows exceeds this threshold, the background task appends the new data rows to the vector index.

  • Valid values: 1 to 4294967295.

  • Default value: 1000.

  • Unit: rows.

Modify a vector index

Note

Versions 8.0.2.2.32 and later support modifying the algorithm and parameters of a vector index by changing the comment on a vector column.

  • Modify the vector index algorithm.

    CREATE TABLE t1 (
        id int auto_increment PRIMARY KEY,
        v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
    ) COMMENT 'COLUMNAR=1';
    
    -- Modify the index algorithm from HNSW to the FAISS-based HNSW algorithm.
    ALTER TABLE t1 modify column v1 vector(4) COMMENT "imci_vector_index=FAISS_HNSW_FLAT(metric=COSINE,max_degree=16,ef_construction=300)";
  • Modify the vector index parameters.

    CREATE TABLE t1 (
        id int auto_increment PRIMARY KEY,
        v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
    ) COMMENT 'COLUMNAR=1';
    
    -- Modify the index parameters. Change the max_degree of HNSW from 16 to 32.
    ALTER TABLE t1 modify column v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=32,ef_construction=300)";

Delete a vector index

  • Delete the vector index when deleting the columnstore index.

    CREATE TABLE t1 (
        id int auto_increment PRIMARY KEY,
        v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
    ) COMMENT 'COLUMNAR=1';
    
    -- A vector index cannot exist independently of a columnstore index and is deleted by cascade delete.
    ALTER TABLE t1 COMMENT 'COLUMNAR=0';
  • Delete only the vector index.

    Note

    Versions 8.0.2.2.32 and later support deleting a vector index by deleting the comment on a vector column.

    CREATE TABLE t1 (
        id int auto_increment PRIMARY KEY,
        v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
    ) COMMENT 'COLUMNAR=1';
    
    -- Delete the vector index by deleting the comment on the vector type column.
    ALTER TABLE t1 modify column v1 vector(4) COMMENT "";

Vector queries

Approximate Nearest Neighbor (ANN) search

Approximate Nearest Neighbor (ANN) search uses vector indexes to accelerate queries.

Limits

The optimizer uses a vector index only when it detects a specific SQL pattern. The query must meet all of the following conditions:

  • The query must include both the ORDER BY and LIMIT clauses.

  • The first expression in the ORDER BY clause must be DISTANCE(...).

  • The sorting direction for ORDER BY DISTANCE(...) must be ASC.

  • The field in the DISTANCE(...) expression must have an active vector index.

  • The metric parameter in the DISTANCE(...) expression must be the same as the metric parameter used when you created the vector index. For example, if the vector index was built with the COSINE distance, DISTANCE(...) must also specify COSINE.

  • If the query includes a JOIN, the execution plan must use a Right Deep Join Tree structure. The field referenced in the DISTANCE(...) expression must be from the driving table of the JOIN.

  • The query cannot contain a GROUP BY clause. To perform aggregation, you can complete the vector retrieval in a subquery.

Steps to enable ANN search

  1. Enable vector index acceleration.
    Execute the following command in the session to allow the optimizer to use vector indexes.

    SET imci_enable_vector_search = ON;
  2. Force queries to use a column-based execution plan.
    To ensure that queries can access vector indexes built on the column store, you must force the query to use the column store path.

    SET cost_threshold_for_imci = 0;
    Important

    SET cost_threshold_for_imci = 0; is a session-level setting that forces all queries in the session to use a columnstore index. This can degrade the performance of queries that would otherwise run efficiently on the row store, such as point queries that use a primary key. Therefore, use this setting only in sessions that perform vector retrieval. Do not set it as a global parameter.

  3. Write a query that meets the ANN trigger conditions.
    Execute an SQL statement that meets the constraints to perform an ANN search with a vector index.

    -- Find the top 5 most similar records
    SELECT id, v1
    FROM t1
    WHERE category_id = 123 -- Additional conditions can be added
    ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC
    LIMIT 5;

    Correcting a non-compliant query (using GROUP BY as an example):

    -- Incorrect: GROUP BY conflicts with ORDER BY DISTANCE()
    -- SELECT category_id, COUNT(*) FROM t1 GROUP BY category_id ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC LIMIT 5;
    
    -- Correct: First perform ANN search in a subquery, then aggregate the results
    SELECT category_id, COUNT(*)
    FROM (
        SELECT id, category_id
        FROM t1
        ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC
        LIMIT 100 -- First retrieve 100 nearest neighbors
    ) AS nearest_neighbors
    GROUP BY category_id;
  4. You can use the EXPLAIN statement to check whether vector index acceleration for ANN search is active for your SQL statement. If the execution plan contains Vector Search, the feature is active.

    EXPLAIN SELECT * FROM t1 ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE') LIMIT 2;

    Sample output:

    +----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+
    | ID | Operator                  | Name | E-Rows | E-Cost | Extra Info                                                                            |
    +----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+
    |  1 | Select Statement          |      |        |        | IMCI Execution Plan (max_dop = 2, max_query_mem = 429496729)                          |
    |  2 | └─Compute Scalar          |      | 2      | 0.00   |                                                                                       |
    |  3 |   └─Limit                 |      | 2      | 0.00   | Offset=0 Limit=2                                                                      |
    |  4 |     └─Sort                |      | 2      | 0.00   | Sort Key: VECTOR_DISTANCE(t1.v1,"[1.200000,2.300000,3.400000,4.500000]","COSINE") ASC |
    |  5 |       └─Vector Search     | t1   | 2      | 0.00   |                                                                                       |
    +----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+

Parameter descriptions

Parameter

Description

imci_enable_vector_search

A switch that controls whether to use vector indexes to accelerate ANN search.

  • ON (default): Enables the feature that uses vector indexes to accelerate ANN search.

  • OFF: Disables the feature that uses vector indexes to accelerate ANN search.

imci_hnswpq_k_factor

The recall expansion factor for vector retrieval when the vector index uses the FAISS_HNSW_PQ algorithm.

  • Value range: 1 to UINT32_MAX.

  • The default value is 1.

  • Unit: Multiplier.

Note

When performing a vector query to retrieve K approximate nearest neighbors (ANNs), the executor retrieves K × imci_hnswpq_k_factor results from the HNSW PQ index. It then performs an exact sort on these results using the original vectors and returns the top K results in order.

View vectors

SELECT VECTOR_TO_STRING(v1) FROM t1;

Distance calculation

You can use the DISTANCE function to calculate the similarity between two vectors in a specified way.

DISTANCE(vector1, vector2, '<metric>')

metric parameter

Parameter

Description

COSINE

Cosine similarity. Measures the directional similarity between two vectors. The result is the cosine of the angle between the two vectors. A smaller value indicates greater similarity.

EUCLIDEAN

Euclidean distance. Measures the straight-line distance between two vectors or points in Euclidean space. A smaller value indicates greater similarity.

DOT

Dot product. The result of multiplying the corresponding components of two vectors and then adding them up. A smaller value indicates greater similarity.

Example

SELECT DISTANCE(v1, STRING_TO_VECTOR('[1.2,2.3,3.4,4.5]'), 'COSINE') FROM t1;

Exact Nearest Neighbor (KNN) search

An exact search traverses all data to calculate distances, which guarantees a 100% recall rate. It does not rely on vector indexes and is suitable for scenarios with small datasets.

-- Find the 10 records most similar to the target
SELECT id, VECTOR_TO_STRING(v1)
FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE')
LIMIT 10;

Example

SELECT id, VECTOR_TO_STRING(v1) FROM t1 ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2,2.3,3.4,4.5]'), 'COSINE') LIMIT 1;

Vector queries with predicates

You can use additional columns as predicates to filter vector query results. For example:

SELECT * FROM t1 WHERE id < 10 ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE') LIMIT 2;

For vector queries that include predicates, you can adjust specific parameters to control the order of vector index retrieval and predicate filtering.

Execution strategies

Vector search often involves filtering by other properties. The optimizer intelligently selects an execution strategy based on the selectivity of the filter condition:

  • Prefilter (prefilter): When a filter condition can significantly reduce the amount of data (for example, when the number of matching rows is less than imci_vector_search_prefilter_rows, which defaults to 10,000), the system first applies a predicate filter to the entire table to retrieve matching records, sorts these records, and then returns the number of records specified by the LIMIT clause.

  • Post-filtering (postfilter): When the filter condition has poor selectivity (for example, if the match rate is higher than imci_vector_search_filter_pct, which is 20% by default), the system first uses the vector index to recall results and then performs predicate filtering on the recalled results. If the number of filtered results is less than the number specified by the LIMIT clause, this procedure is repeated.

  • Inline filter: If the optimizer does not select the prefilter or postfilter strategy, it uses the inline filter strategy. The system first filters the entire table using the predicate and builds a bitmap filter from the RowIDs of the matching records. Then, it uses the vector index for retrieval. During the vector index retrieval process, the system uses the bitmap filter to ensure that the most similar records returned meet the predicate condition.

Adaptive execution optimization

The optimizer's estimate of the number of rows matched by a predicate filter can have a bias. This can lead to a suboptimal choice of execution strategy. An adaptive execution mechanism dynamically switches the execution strategy during runtime based on the actual number of matching rows.

When running a vector query with predicates using the inline filter strategy, if the actual number of matching records is less than imci_vector_search_prefilter_rows after the predicate is applied to the entire table, the system dynamically switches to the prefilter strategy.

Partition optimization

Vector queries often include predicate conditions such as tenant IDs or tags. For these queries, you can use the LIST DEFAULT HASH partitioning method. You can create separate LIST partitions for tenants or tags with large data volumes, and DEFAULT HASH partitions for other tenants or tags. The optimizer can detect these queries. Based on the results of partition pruning, the optimizer selects the postfilter strategy for vector queries on large tenants or tags. For other tenants, it uses the prefilter or inline filter strategy. This helps select the optimal execution strategy for tenants or tags of different sizes.

Parameter descriptions

Parameter

Description

imci_vector_search_filter_pct

If the estimated selectivity of the predicate filter is greater than or equal to this value, the system prioritizes vector index retrieval.

  • Value range: 0 to 100.

  • Default value: 20.

  • Unit: %.

The default value of 20 means that if the estimated selectivity of the predicate is 20% or higher, the system first retrieves results using the vector index and then applies the predicate filter.

imci_enable_vector_search_inline_filter

Controls whether to use the inline filter strategy for vector queries that include predicates.

  • OFF (default): Disabled.

  • ON: Enabled.

imci_vector_search_prefilter_rows

If the estimated number of matching rows from the predicate filter is less than this value, the system prioritizes the prefilter retrieval method.

  • Value range: 0 to UINT64_MAX.

  • Default value: 10000.

  • Unit: Rows.

The default value of 10,000 means that if the estimated number of matching rows from the predicate filter is less than 10,000, the system prioritizes the prefilter method. It retrieves matching records, sorts them, and then returns the number of rows specified by the LIMIT clause.

Monitor vector index status

Building vector index data and advancing the columnstore index snapshot are separate background tasks. With continuous writes, the vector index snapshot offset might lag behind the columnstore data snapshot. Before you run a vector search, you must confirm that the vector index is available. The vector index status is shown in system views. This includes the snapshot offset and the status of the index on the vector column.

Vector index snapshot offset

The VECTOR_ROWS column in the INFORMATION_SCHEMA.IMCI_INDEX_STATS view shows the number of vectors in the snapshot loaded by the columnstore index. A value of 0 means no vector index is available on the current columnstore index. This can happen if the table has no vector index, or if a new index has not yet generated a valid snapshot.

You can also compare this value with the ROW_ID column in the INFORMATION_SCHEMA.IMCI_INDEXES view to estimate the build latency for the vector index data. The vector index works even with latency. During a search, the system automatically handles any vector data that has not been written to the index.

Physical vector index status

The INFORMATION_SCHEMA.IMCI_VECTOR_INDEX_STATS system view shows the status of each physical vector index inside a vector index snapshot. The table schema is as follows:

Column name

Description

SCHEMA_NAME

DB name.

TABLE_NAME

Table name.

COLUMN_NAME

Vector column name.

VECTORS

The actual number of vectors in the vector index. This value is usually smaller than VECTOR_ROWS because NULL values and rows that were deleted during the build are excluded.

MEMORY_USAGE

Memory usage.

STORAGE_USAGE

Size of the persistent file.

INDEX_TYPE

The index type. Currently, only HNSW is supported.

INDEX_PARAMETERS

The vector index build parameters in JSON format.

This view outputs the memory status and statistics for the vector index on each column where a vector index is enabled. This information is not directly related to the availability of the vector index.

Note
  • Approximate vector searches cannot be performed on any column that is not included in the query results of this view.

  • In this view, a value of 0 in the VECTORS column does not mean the vector index is unavailable. This usually happens after a cluster restart when an incremental build or an approximate query has not yet run. The persistent data for the vector index has not been loaded, so the exact number of vectors is temporarily unavailable. For accurate vector snapshot information, use the VECTOR_ROWS column in the INFORMATION_SCHEMA.IMCI_INDEX_STATS system view.

Examples

  1. To ensure that vector indexes can be built normally with a small amount of data, you must set the imci_vector_index_dump_rows_threshold parameter to 1 in the PolarDB console.

  2. Create a test table and insert data:

    CREATE TABLE t1 (
        id int auto_increment PRIMARY KEY,
        v1 vector(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
    ) comment "COLUMNAR=1";
    
    INSERT INTO t1 VALUES(1, STRING_TO_VECTOR('[1.1,1.2,1.3,1.4]')), (2, STRING_TO_VECTOR('[2,2.2,2.4,2.6]'));
  3. Check the index snapshot status and the vector index status:

    SELECT schema_name,table_name,row_id FROM information_schema.imci_indexes;
    +-------------+------------+--------+
    | schema_name | table_name | row_id |
    +-------------+------------+--------+
    | test        | t1         |      2 |
    +-------------+------------+--------+
    1 row in set (0.00 sec)
    
    SELECT schema_name,table_name,vector_rows FROM information_schema.imci_index_stats;
    +-------------+------------+-------------+
    | schema_name | table_name | vector_rows |
    +-------------+------------+-------------+
    | test        | t1         |           2 |
    +-------------+------------+-------------+
    1 row in set (0.00 sec)
    
    SELECT schema_name,table_name,column_name,vectors FROM information_schema.imci_vector_index_stats;
    +-------------+------------+-------------+---------+
    | schema_name | table_name | column_name | vectors |
    +-------------+------------+-------------+---------+
    | test        | t1         | v1          |       2 |
    +-------------+------------+-------------+---------+
    1 row in set (0.00 sec)
  4. Insert more data:

    INSERT INTO t1 VALUES(3, STRING_TO_VECTOR("[8.8,8.88,8.888,8.8888]"));

    The value of the VECTORS field in the query result for INFORMATION_SCHEMA.IMCI_VECTOR_INDEX_STATS is now 3, which is an increase of 1. This indicates that the background task wrote the inserted data to the vector index:

    SELECT schema_name,table_name,column_name,vectors FROM information_schema.imci_vector_index_stats;
    +-------------+------------+-------------+---------+
    | schema_name | table_name | column_name | vectors |
    +-------------+------------+-------------+---------+
    | test        | t1         | v1          |       3 |
    +-------------+------------+-------------+---------+
    1 row in set (0.00 sec)
  5. Restart the columnstore index read-only node and check the index status:

    Important

    Restarting the columnstore index read-only node makes the current node unavailable for a short time.

    SELECT schema_name,table_name,row_id,state FROM information_schema.imci_indexes;
    +-------------+------------+--------+-----------+
    | schema_name | table_name | row_id | state     |
    +-------------+------------+--------+-----------+
    | test        | t1         |      3 | COMMITTED |
    +-------------+------------+--------+-----------+
    1 row in set (0.00 sec)
    
    SELECT schema_name,table_name,vector_rows FROM information_schema.imci_index_stats;
    +-------------+------------+-------------+
    | schema_name | table_name | vector_rows |
    +-------------+------------+-------------+
    | test        | t1         |           3 |
    +-------------+------------+-------------+
    1 row in set (0.00 sec)
    
    SELECT schema_name,table_name,column_name,vectors FROM information_schema.imci_vector_index_stats;
    +-------------+------------+-------------+---------+
    | schema_name | table_name | column_name | vectors |
    +-------------+------------+-------------+---------+
    | test        | t1         | v1          |       0 |
    +-------------+------------+-------------+---------+
    1 row in set (0.00 sec)

    The results show that VECTORS is 0 in the IMCI_VECTOR_INDEX_STATS system view, while the vector index snapshot offset is 3. You can use the execution plan to confirm that the vector index is available. Then, you can run an approximate query and check the vector index status again:

    SET cost_threshold_for_imci = 0;
    
    
    EXPLAIN SELECT * FROM t1 ORDER BY DISTANCE(v1, STRING_TO_VECTOR("[1.2, 2.3, 3.4, 4.5]"), "COSINE") LIMIT 1;
    +----+-----------------------+------+--------+--------+---------------------------------------------------------------------------------------+
    | ID | Operator              | Name | E-Rows | E-Cost | Extra Info                                                                            |
    +----+-----------------------+------+--------+--------+---------------------------------------------------------------------------------------+
    |  1 | Select Statement      |      |        |        | IMCI Execution Plan (max_dop = 4, max_query_mem = 858993459)                          |
    |  2 | +-Compute Scalar      |      |        |        |                                                                                       |
    |  3 |   +-Limit             |      |        |        | Offset=0 Limit=1                                                                      |
    |  4 |     +-Sort            |      |        |        | Sort Key: VECTOR_DISTANCE(t1.v1,"[1.200000,2.300000,3.400000,4.500000]","COSINE") ASC |
    |  5 |       +-Vector Search | t1   |        |        |                                                                                       |
    +----+-----------------------+------+--------+--------+---------------------------------------------------------------------------------------+
    5 rows in set (0.00 sec)
    
    SELECT id, vector_to_string(v1) FROM t1 ORDER BY DISTANCE(v1, STRING_TO_VECTOR("[1.2, 2.3, 3.4, 4.5]"), "COSINE") LIMIT 1;
    +----+---------------------------------------------------+
    | id | vector_to_string(v1)                              |
    +----+---------------------------------------------------+
    |  2 | [2.00000e+00,2.20000e+00,2.40000e+00,2.60000e+00] |
    +----+---------------------------------------------------+
    1 row in set (0.00 sec)
    
    SELECT schema_name,table_name,column_name,vectors FROM information_schema.imci_vector_index_stats;
    +-------------+------------+-------------+---------+
    | schema_name | table_name | column_name | vectors |
    +-------------+------------+-------------+---------+
    | test        | t1         | v1          |       3 |
    +-------------+------------+-------------+---------+
    1 row in set (0.00 sec)

    After the approximate query runs, the vector index is loaded into memory, and the vector count is restored to 3.

FAQ

Why is the index not used in a vector query?

Follow these steps to troubleshoot:

  1. Confirm that you have run SET imci_enable_vector_search = ON; and SET cost_threshold_for_imci = 0; in the session.

  2. You can use EXPLAIN to check the execution plan. Confirm that the Extra Info column contains the Vector Search operator.

  3. Verify that the SQL statement meets all constraints listed in Approximate Nearest Neighbor (ANN) search. Common errors include not using LIMIT or using the wrong sorting direction in ORDER BY.

  4. Verify that the metric parameter in the DISTANCE(...) expression is identical to the metric parameter used to create the vector index. For example, if the vector index was built using the COSINE distance, DISTANCE(...) must also specify COSINE.

Why do I receive the error ERROR 9040 (HY000): Data size exceeds VECTOR max when I create a vector?

The defined vector dimension exceeds the maximum limit of 16,383. You can reduce the value of N in VECTOR(N).

After I restart a read-only node of a columnstore index, the monitoring data shows that the VECTORS count is 0. Is the index lost?

No, the index is not lost. This is the expected behavior. The index data is persisted to disk. After a restart, the first query triggers the data to be loaded into memory. For more information, see the diagnostic flow in Monitor vector index status.

The pq_m parameter requires that the vector dimension must be divisible by pq_m. What should I do if the dimension is a prime number, such as 1023?

In this case, you cannot use the FAISS_HNSW_PQ index. You have the following options:

  1. Choose another index. You can use HNSW or FAISS_HNSW_FLAT. These indexes do not have this limitation, but they consume more memory.

  2. Adjust the data dimension. At the application layer, you can use dimension reduction or padding to adjust the vector dimension to a value divisible by pq_m, such as 8, 16, or 32.