Perform vector retrieval using the MySQL protocol - PolarDB

PolarDB for MySQL supports vector similarity search through SQL, using a built-in vector data type, distance functions, and vector indexes. The vector computation runs inside the database kernel, supports ACID transactions, and uses the In-Memory Column Index (IMCI) with HNSW-based algorithms to deliver both exact k-nearest neighbor (KNN) search with 100% recall and high-performance approximate nearest neighbor (ANN) search.

Common use cases include recommendation systems, chatbots, and image retrieval.

Prerequisites

Before you begin, ensure that you have:

A PolarDB for MySQL cluster with kernel version 8.0.2, revision 8.0.2.2.30 or later
An IMCI read-only node added to the cluster (required for vector indexes)

Version history

Features vary by kernel version. Upgrade your cluster to the latest version to access all features.

Feature	Minimum version	Description
Basic vector features and the HNSW index algorithm	8.0.2.2.30	Core vector retrieval capabilities
`FAISS_HNSW_FLAT` and `FAISS_HNSW_PQ` index algorithms	8.0.2.2.31	Additional index algorithms based on the FAISS library
Modify or delete vector indexes	8.0.2.2.32	Dynamically manage vector indexes using `ALTER TABLE`

Quick start

The following steps walk through creating a table with a vector index, inserting data, and running an ANN search.

Step 1: Set up the table.

-- Enable columnstore for the entire table and define a vector index on the v1 column
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    v1 VECTOR(4) COMMENT 'imci_vector_index=HNSW(metric=COSINE, max_degree=16)'
) COMMENT 'COLUMNAR=1';

Step 2: Insert vectors.

INSERT INTO t1 (id, v1) VALUES
  (1, STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]')),
  (2, STRING_TO_VECTOR('[2.0, 2.2, 2.4, 2.6]')),
  (3, STRING_TO_VECTOR('[8.8, 8.88, 8.888, 8.8888]'));

Step 3: Enable ANN search and run a query.

-- Enable vector index acceleration (session-level)
SET imci_enable_vector_search = ON;
SET cost_threshold_for_imci = 0;

-- Find the 2 most similar vectors to the query
SELECT id, VECTOR_TO_STRING(v1)
FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE') ASC
LIMIT 2;

Step 4: Confirm index usage.

EXPLAIN SELECT id, VECTOR_TO_STRING(v1)
FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE') ASC
LIMIT 2;

If the execution plan contains Vector Search, the vector index is active.

+----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+
| ID | Operator                  | Name | E-Rows | E-Cost | Extra Info                                                                            |
+----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+
|  1 | Select Statement          |      |        |        | IMCI Execution Plan (max_dop = 2, max_query_mem = 429496729)                          |
|  2 | └─Compute Scalar          |      | 2      | 0.00   |                                                                                       |
|  3 |   └─Limit                 |      | 2      | 0.00   | Offset=0 Limit=2                                                                      |
|  4 |     └─Sort                |      | 2      | 0.00   | Sort Key: VECTOR_DISTANCE(t1.v1,"[1.200000,2.300000,3.400000,4.500000]","COSINE") ASC |
|  5 |       └─Vector Search     | t1   | 2      | 0.00   |                                                                                       |
+----+---------------------------+------+--------+--------+---------------------------------------------------------------------------------------+

Vector type

Define a vector column

Use VECTOR(N) in a CREATE TABLE or ALTER TABLE statement to define a vector column, where N is the number of dimensions (1–16,383). Each dimension is a single-precision floating-point number (4 bytes).

-- Define a 4-dimensional vector column
CREATE TABLE t1 (id INT PRIMARY KEY, v1 VECTOR(4));

Constraints:

A vector column supports equality comparison only with another vector column. It cannot be compared with other data types.
A vector column cannot be a primary key, foreign key, unique key, or partition key.
If N exceeds 16,383, the following error is returned: Data size (xxx Bytes, xxx dimensions) exceeds VECTOR max (65532 Bytes, 16383 dimensions) for column: 'xxx'

Convert between string and binary formats

PolarDB provides two functions for converting vector data between text and binary formats.

`STRING_TO_VECTOR` — converts a string to the internal binary format.

Input format: comma-separated floating-point numbers enclosed in square brackets, for example '[1.2, 3.4, 5.6]'.
The output uses little-endian byte order. For example, 1.0 (hex 0x3F800000) is stored as 0000803F.
If the input format is invalid, the error Data cannot be converted to a valid vector: 'xxx' is returned.

SELECT HEX(STRING_TO_VECTOR('[1,2,3,4]'));
-- Output: 0000803F000000400000404000008040

`VECTOR_TO_STRING` — converts binary vector data back to a readable string.

Input: binary data where every 4 bytes represent one dimension in little-endian byte order.
If the input is invalid, the error Data cannot be converted to a valid vector: '' is returned.

SELECT VECTOR_TO_STRING(0x0000803F000000400000404000008040);
-- Output: [1.00000e+00,2.00000e+00,3.00000e+00,4.00000e+00]

Work with vector data

Add a vector column to an existing table:

ALTER TABLE t1 ADD COLUMN v1 VECTOR(4);

Insert a vector:

-- As a string using STRING_TO_VECTOR
INSERT INTO t1 (id, v1) VALUES (1, STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]'));

-- In binary format directly
INSERT INTO t1 VALUES (2, 0x0000803F000000400000404000008040);

Upsert a vector:

INSERT INTO t1 VALUES (1, STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]'))
ON DUPLICATE KEY UPDATE v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]');

Update a vector:

UPDATE t1 SET v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]') WHERE id = 1;

Delete a row by vector value:

DELETE FROM t1 WHERE v1 = STRING_TO_VECTOR('[1.0, 2.0, 3.0, 4.0]');

View vector values as strings:

SELECT VECTOR_TO_STRING(v1) FROM t1;

Vector indexes

Vector indexes are defined in a vector column's COMMENT. Only IMCI read-only nodes build and serve vector indexes.

Create a vector index

Define a vector index by setting the column COMMENT in a CREATE TABLE or ALTER TABLE statement.

Syntax:

COMMENT 'imci_vector_index=<algorithm>(<parameters>) [other_comments]'

imci_vector_index= is a fixed prefix that declares the vector index.
Always include parentheses after the algorithm name, even when using all defaults: HNSW().
Separate multiple parameters with commas.
Each parameter uses the format name=value or name:value.

Index algorithms:

Algorithm	Implementation	Characteristics
`HNSW`	VSAG	High precision; higher memory usage
`FAISS_HNSW_FLAT`	FAISS	High precision; higher memory usage
`FAISS_HNSW_PQ`	FAISS	Uses product quantization (PQ) to compress vectors — lower memory usage, some precision loss

Index build parameters:

All three algorithms share the following build-time parameters:

Parameter	Alias	Description	Range	Default
`metric`	`distance`	Vector similarity measure. Only uppercase values are accepted.	`COSINE`, `INNER_PRODUCT`, `EUCLIDEAN`	`COSINE`
`max_degree`	`M`	Maximum connections per node in the graph. A larger value creates a denser graph — slower to build but may improve precision.	Positive integer; recommended 16–64	`16`
`ef_construction`	—	Candidate neighbors to evaluate during index building. A larger value improves index quality but increases build time.	Positive integer; recommended 100–500	`200`

FAISS_HNSW_PQ has two additional parameters for controlling product quantization:

Parameter	Description	Range	Default
`pq_m`	Number of PQ subspaces. Must be a divisor of the vector dimension. A larger value improves precision but increases memory usage.	Positive integer that divides the vector dimension	`1`
`pq_nbits`	Bits used for the quantized representation of each subspace. Typically set to 8, which gives 256 centroids per subspace.	Positive integer ≤ 24	`8`

Example 1: Define a vector index when creating a table.

-- Method 1: Enable columnstore for the entire table
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    v1 VECTOR(4) COMMENT 'imci_vector_index=HNSW(metric=COSINE, max_degree=16)'
) COMMENT 'COLUMNAR=1';

-- Method 2: Enable columnstore only for the vector column
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    description TEXT,
    v1 VECTOR(4) COMMENT 'COLUMNAR=1, imci_vector_index=HNSW(metric=COSINE)'
);

Example 2: Add a vector index to an existing table.

-- Method 1: Enable table-level columnstore and define the vector index in the same statement
ALTER TABLE t1
  COMMENT "COLUMNAR=1",
  MODIFY COLUMN v1 VECTOR(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)";

-- Method 2: Enable columnstore only for the vector column
-- The vector index is created automatically based on the column COMMENT
ALTER TABLE t1 MODIFY COLUMN v1 VECTOR(4)
  COMMENT "COLUMNAR=1 imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)";

Modify a vector index

Requires version 8.0.2.2.32 or later.

Change the index algorithm or parameters by modifying the column COMMENT.

-- Change the algorithm from HNSW to FAISS_HNSW_FLAT
ALTER TABLE t1 MODIFY COLUMN v1 VECTOR(4)
  COMMENT "imci_vector_index=FAISS_HNSW_FLAT(metric=COSINE,max_degree=16,ef_construction=300)";

-- Change max_degree from 16 to 32
ALTER TABLE t1 MODIFY COLUMN v1 VECTOR(4)
  COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=32,ef_construction=300)";

Delete a vector index

Delete the vector index along with the columnstore index:

A vector index cannot exist independently of a columnstore index. Disabling columnstore deletes the vector index by cascade.

ALTER TABLE t1 COMMENT 'COLUMNAR=0';

Delete only the vector index (requires version 8.0.2.2.32 or later):

Clear the column COMMENT to remove the vector index while keeping the columnstore index.

ALTER TABLE t1 MODIFY COLUMN v1 VECTOR(4) COMMENT "";

Vector index parameters

Parameter	Description
`imci_enable_inline_vector_index`	Controls whether IMCI creates vector indexes during initialization. ON (default): the IMCI read-only node builds a vector index based on the vector column's `COMMENT`. OFF: no new vector indexes are built. Existing vector indexes continue to be built, but are not loaded or built after a restart.
`imci_vector_index_dump_rows_threshold`	Controls when the background task writes incremental data to the vector index. When the number of new rows since the last snapshot exceeds this threshold, the background task flushes the data. Valid values: 1–4,294,967,295. Default: 1,000.

Vector queries

Distance calculation

Use the DISTANCE function to calculate the similarity between two vectors.

DISTANCE(vector1, vector2, '<metric>')

Metric	Description
`COSINE`	Cosine distance. Measures directional similarity between two vectors. A smaller value means greater similarity.
`EUCLIDEAN`	Euclidean distance. Measures the straight-line distance between two points. A smaller value means greater similarity.
`DOT`	Dot product. Multiplies corresponding components and sums the results. A smaller value means greater similarity.

SELECT DISTANCE(v1, STRING_TO_VECTOR('[1.2,2.3,3.4,4.5]'), 'COSINE') FROM t1;

ANN search

ANN search uses a vector index to retrieve approximate nearest neighbors efficiently. All of the following conditions must be met for the optimizer to use the vector index:

The query includes both ORDER BY and LIMIT clauses.
The first expression in ORDER BY is DISTANCE(...).
The sort direction is ASC.
The vector column in DISTANCE(...) has an active vector index.
The metric in DISTANCE(...) matches the metric used to build the index.
If the query includes a JOIN, the execution plan must use a Right Deep Join Tree, and the DISTANCE(...) field must come from the driving table.
The query cannot contain GROUP BY. To aggregate results, wrap the vector search in a subquery.

Enable ANN search:

Run these two session-level commands before executing ANN queries.

-- Allow the optimizer to use vector indexes
SET imci_enable_vector_search = ON;

-- Force queries to use the column store execution path
SET cost_threshold_for_imci = 0;

Important

SET cost_threshold_for_imci = 0 forces all queries in the session to use the columnstore index. This can degrade point queries that run efficiently on the row store. Use this setting only in sessions dedicated to vector retrieval — do not set it globally.

Run an ANN query:

-- Find the top 5 most similar records
SELECT id, v1
FROM t1
WHERE category_id = 123
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC
LIMIT 5;

Using `GROUP BY` with vector search:

GROUP BY cannot appear in the same query as ORDER BY DISTANCE(...). Run the vector search in a subquery first, then aggregate.

-- Incorrect: GROUP BY conflicts with ORDER BY DISTANCE()
-- SELECT category_id, COUNT(*) FROM t1
-- GROUP BY category_id
-- ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC
-- LIMIT 5;

-- Correct: perform ANN search in a subquery, then aggregate
SELECT category_id, COUNT(*)
FROM (
    SELECT id, category_id
    FROM t1
    ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[...]'), 'COSINE') ASC
    LIMIT 100
) AS nearest_neighbors
GROUP BY category_id;

ANN search parameters:

Parameter	Description
`imci_enable_vector_search`	Controls whether vector indexes accelerate ANN search. ON (default): enabled. OFF: disabled.
`imci_hnswpq_k_factor`	Recall expansion factor for `FAISS_HNSW_PQ`. When retrieving K nearest neighbors, the executor fetches `K × imci_hnswpq_k_factor` candidates from the index, sorts them exactly using the original vectors, and returns the top K. Valid values: 1–UINT32_MAX. Default: 1.

KNN (exact) search

KNN search traverses all data to calculate distances, guaranteeing 100% recall. It does not require a vector index and is suitable for small datasets.

-- Find the 10 most similar records
SELECT id, VECTOR_TO_STRING(v1)
FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2,2.3,3.4,4.5]'), 'COSINE')
LIMIT 10;

Vector queries with predicates

Add WHERE conditions to filter results alongside vector similarity.

SELECT * FROM t1
WHERE id < 10
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE')
LIMIT 2;

The optimizer selects one of three execution strategies based on the estimated selectivity of the filter:

Strategy	When it applies
Prefilter	Estimated matching rows < `imci_vector_search_prefilter_rows` (default: 10,000). Applies the predicate to the full table first, then sorts the matching rows.
Postfilter	Estimated match rate >= `imci_vector_search_filter_pct` (default: 20%). Uses the vector index to retrieve candidates first, then applies the predicate. Repeats if the filtered results fall below the `LIMIT`.
Inline filter	Neither prefilter nor postfilter applies. Builds a bitmap from rows matching the predicate, then uses the bitmap during vector index traversal to ensure results satisfy the condition.

Adaptive execution: If the optimizer's row count estimate is inaccurate, the system can switch strategies at runtime. For example, when using the inline filter strategy, if the actual number of matching rows turns out to be less than imci_vector_search_prefilter_rows, the system switches to prefilter dynamically.

Partition optimization: For predicates based on tenant IDs or tags, use LIST DEFAULT HASH partitioning. The optimizer uses partition pruning to select the best strategy per partition — postfilter for large partitions, prefilter or inline filter for smaller ones.

Predicate parameters:

Parameter	Description
`imci_vector_search_filter_pct`	If the estimated selectivity is at or above this value, the system prioritizes vector index retrieval. Valid values: 0–100. Default: 20 (%).
`imci_enable_vector_search_inline_filter`	Controls whether the inline filter strategy is available. OFF (default): disabled. ON: enabled.
`imci_vector_search_prefilter_rows`	If the estimated number of matching rows is below this value, the system uses the prefilter strategy. Valid values: 0–UINT64_MAX. Default: 10,000.

Monitor vector index status

Vector index building and columnstore snapshot advancement run as separate background tasks. With continuous writes, the vector index snapshot may lag behind the columnstore data. Check the following system views to confirm the index is ready before running vector searches.

Index snapshot offset

Query INFORMATION_SCHEMA.IMCI_INDEX_STATS to see how many vectors the columnstore index has loaded into its snapshot.

SELECT schema_name, table_name, vector_rows
FROM information_schema.imci_index_stats;

A VECTOR_ROWS value of 0 means no vector index snapshot is available — either the table has no vector index, or a newly created index has not yet generated a valid snapshot.

Compare VECTOR_ROWS with ROW_ID in INFORMATION_SCHEMA.IMCI_INDEXES to estimate build latency. The vector index still works with some latency — any vectors not yet written to the index are handled automatically during a search.

Physical vector index status

Query INFORMATION_SCHEMA.IMCI_VECTOR_INDEX_STATS to inspect the physical state of each vector index.

SELECT schema_name, table_name, column_name, vectors, memory_usage, storage_usage
FROM information_schema.imci_vector_index_stats;

Column	Description
`SCHEMA_NAME`	Database name
`TABLE_NAME`	Table name
`COLUMN_NAME`	Vector column name
`VECTORS`	Number of vectors in the index. Excludes NULL values and rows deleted during build.
`MEMORY_USAGE`	Memory used by the index
`STORAGE_USAGE`	Size of the persisted index file on disk
`INDEX_TYPE`	Index type. Currently only `HNSW` is supported.
`INDEX_PARAMETERS`	Index build parameters in JSON format

ANN search is unavailable for any column not returned by this view. A VECTORS value of 0 after a cluster restart does not mean the index is lost. The index is persisted to disk. The first query after a restart triggers the data to load into memory. For accurate snapshot information, use VECTOR_ROWS in INFORMATION_SCHEMA.IMCI_INDEX_STATS.

Example: verify an index end to end

Set imci_vector_index_dump_rows_threshold to 1 in the PolarDB console to allow the index to build with a small number of rows. For instructions, see Specify cluster and node parameters.

Create a test table and insert data:

CREATE TABLE t1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    v1 VECTOR(4) COMMENT "imci_vector_index=HNSW(metric=COSINE,max_degree=16,ef_construction=300)"
) COMMENT "COLUMNAR=1";

INSERT INTO t1 VALUES
  (1, STRING_TO_VECTOR('[1.1,1.2,1.3,1.4]')),
  (2, STRING_TO_VECTOR('[2,2.2,2.4,2.6]'));

Check the index snapshot status:

SELECT schema_name, table_name, row_id FROM information_schema.imci_indexes;
-- Expected: row_id = 2

SELECT schema_name, table_name, vector_rows FROM information_schema.imci_index_stats;
-- Expected: vector_rows = 2

SELECT schema_name, table_name, column_name, vectors FROM information_schema.imci_vector_index_stats;
-- Expected: vectors = 2

Insert a third row:
```
INSERT INTO t1 VALUES (3, STRING_TO_VECTOR('[8.8,8.88,8.888,8.8888]'));
```
Query IMCI_VECTOR_INDEX_STATS again — VECTORS should now be 3, confirming the background task flushed the new row to the index.
Restart the IMCI read-only node. After the restart, VECTORS in IMCI_VECTOR_INDEX_STATS returns 0, while VECTOR_ROWS in IMCI_INDEX_STATS still shows 3. This is expected — the index data is on disk but not yet loaded into memory.
Important
Restarting the IMCI read-only node causes a brief service interruption.

Run an approximate query to trigger the index load:

SET cost_threshold_for_imci = 0;

EXPLAIN SELECT * FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE')
LIMIT 1;
-- The execution plan should contain Vector Search

SELECT id, VECTOR_TO_STRING(v1) FROM t1
ORDER BY DISTANCE(v1, STRING_TO_VECTOR('[1.2, 2.3, 3.4, 4.5]'), 'COSINE')
LIMIT 1;

Query IMCI_VECTOR_INDEX_STATS again — VECTORS returns to 3, confirming the index is loaded and fully operational.

FAQ

The vector index is not being used in my query. What should I check?

Run SET imci_enable_vector_search = ON and SET cost_threshold_for_imci = 0 in the session, then use EXPLAIN to check whether the plan contains Vector Search. If it does not, verify that your query meets all ANN search conditions: it must include ORDER BY DISTANCE(...) ASC and LIMIT, and the metric in DISTANCE(...) must match the metric used to build the index.

I get `ERROR 9040 (HY000): Data size exceeds VECTOR max` when creating a vector column.

The dimension N in VECTOR(N) exceeds the 16,383 limit. Reduce the value of N.

After restarting a read-only node, `VECTORS` in `IMCI_VECTOR_INDEX_STATS` shows 0. Is the index gone?

No. The index data is persisted to disk. After a restart, the count temporarily shows 0 until the first query triggers the data to load into memory. Use VECTOR_ROWS in INFORMATION_SCHEMA.IMCI_INDEX_STATS to confirm the index snapshot is intact. For a complete diagnostic flow, see Monitor vector index status.

My vector dimension is a prime number (for example, 1023). Can I use `FAISS_HNSW_PQ`?

No, because pq_m must be a divisor of the vector dimension. For a prime dimension, use HNSW or FAISS_HNSW_FLAT instead — these algorithms have no such constraint but use more memory. Alternatively, adjust the dimension at the application layer (using dimension reduction or padding) to a value divisible by pq_m, such as 8, 16, or 32.

PolarDB:Perform vector retrieval using the MySQL protocol