vector search similarity search - AnalyticDB - Alibaba Cloud Documentation Center

Prerequisites

Your cluster must run kernel version 3.1.4.0 or later.

The vector index feature is more stable on clusters that run kernel version 3.1.5.16, 3.1.6.8, 3.1.8.6, or a later version.
If your cluster is not running one of the stable versions listed, set the CSTORE_PROJECT_PUSH_DOWN and CSTORE_PPD_TOP_N_ENABLE parameters to false before you use the vector index feature.

Before you begin, make sure you have:

An AnalyticDB for MySQL cluster running V3.1.4.0 or later
(Recommended) Minor version 3.1.5.16, 3.1.6.8, 3.1.8.6, or later for optimal vector search performance

If your cluster is not on one of the recommended minor versions, set the CSTORE_PROJECT_PUSH_DOWN and CSTORE_PPD_TOP_N_ENABLE parameters to false before using vector search. To check your minor version, see How do I query the version of an AnalyticDB for MySQL cluster? To upgrade the minor version, contact technical support.

How it works

Use an AI model to extract features from your unstructured data and encode them as fixed-length float arrays (feature vectors). Store those vectors in AnalyticDB for MySQL alongside your structured columns.

At query time, pass a query vector to L2_DISTANCE. The database computes the squared Euclidean distance between the query vector and each stored vector, then returns the closest matches. A smaller distance means greater similarity.

The full workflow:

Create a table with one or more vector columns (ARRAY<FLOAT>, ARRAY<BYTE>, or ARRAY<SMALLINT>).
Add an ANN index on each vector column—at table creation time or with ALTER TABLE.
Insert rows, including the vector values.
Query with L2_DISTANCE and ORDER BY … LIMIT (k-nearest neighbor) or add a distance threshold in WHERE (radius nearest neighbor).

Key concepts

Term	Definition
Feature vector	A fixed-length numeric array that represents an entity in vector space. Distance between two vectors measures their similarity. For example, a 512-dimensional float array can represent a human face.
Vector index (ANN INDEX)	A specialized index that accelerates approximate nearest-neighbor search on vector columns.
k-nearest neighbor (KNN)	Returns the k rows whose vectors are closest to a query vector. Use `ORDER BY l2_distance(...) LIMIT k`.
Radius nearest neighbor (RNN)	Returns all rows whose vectors fall within a specified distance of a query vector. Use `WHERE l2_distance(...) < threshold`.
L2_DISTANCE	The SQL function for squared Euclidean distance: (x₁−y₁)² + (x₂−y₂)² + … + (xₙ−yₙ)². A lower value means vectors are more similar.

Create a table with a vector index

Define vector columns as ARRAY<FLOAT>, ARRAY<BYTE>, or ARRAY<SMALLINT> and add ANN INDEX declarations in the same CREATE TABLE statement.

Syntax:

ANN INDEX [index_name] (column_name) [algorithm=HNSW_PQ] [distancemeasure=SquaredL2]

Parameter	Description
`index_name`	Name of the index. See Naming limits.
`column_name`	Name of the vector column. Supported data types: `ARRAY<FLOAT>`, `ARRAY<BYTE>`, `ARRAY<SMALLINT>`. See Naming limits.
`algorithm`	Set to `HNSW_PQ` (hierarchical navigable small world with product quantization). This is the only supported algorithm.
`distancemeasure`	Set to `SquaredL2` (squared Euclidean distance). This is the only supported distance metric. Formula: (x₁−y₁)² + (x₂−y₂)² + … + (xₙ−yₙ)².

Example:

The following statement creates a table named vector with two indexed vector columns: float_feature (ARRAY<FLOAT>, 4-dimensional) and short_feature (ARRAY<SMALLINT>, 4-dimensional).

CREATE TABLE vector (
  xid bigint not null,
  cid bigint not null,
  uid varchar not null,
  vid varchar not null,
  wid varchar not null,
  float_feature array < float >(4),
  short_feature array < smallint >(4),
  ANN INDEX idx_short_feature(short_feature),
  ANN INDEX idx_float_feature(float_feature),
  PRIMARY KEY (xid, cid, vid)
) DISTRIBUTED BY HASH(xid);

Add a vector index to an existing table

Use ALTER TABLE to add an ANN index to a column in an existing table.

Syntax:

ALTER TABLE table_name ADD ANN INDEX [index_name] (column_name) [algorithm=HNSW_PQ] [distancemeasure=SquaredL2]

Example:

Given the table below (created without ANN indexes):

CREATE TABLE vector (
  xid BIGINT not null,
  cid BIGINT not null,
  uid VARCHAR not null,
  vid VARCHAR not null,
  wid VARCHAR not null,
  float_feature array < FLOAT >(4),
  short_feature array < SMALLINT >(4),
  PRIMARY KEY (xid, cid, vid)
) DISTRIBUTED BY HASH(xid);

Add indexes on both vector columns:

ALTER TABLE vector ADD ANN INDEX idx_float_feature(float_feature);
ALTER TABLE vector ADD ANN INDEX idx_short_feature(short_feature);

ALTER TABLE table_name ADD ANN INDEX [index_name] (column_name) [algorithm=HNSW_PQ ] [distancemeasure=SquaredL2]

Query vector data

All examples below use L2_DISTANCE to rank rows by similarity to a query vector.

Insert sample data:

INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (1,2,'A','B','C','[1,1,1,1]','[1.2,1.5,2,3.0]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (2,1,'e','v','f','[2,2,2,2]','[1.5,1.15,2.2,2.7]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (0,6,'d','f','g','[3,3,3,3]','[0.2,1.6,5,3.7]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'j','b','h','[4,4,4,4]','[1.0,4.15,6,2.9]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (8,5,'Sj','Hb','Dh','[5,5,5,5]','[1.3,4.5,6.9,5.2]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'x','g','h','[3,4,4,4]','[1.0,4.15,6,2.9]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'j','r','k','[6,6,4,4]','[1.0,4.15,6,2.9]');
INSERT into vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'s','i','q','[2,2,4,4]','[1.0,4.15,6,2.9]');

KNN: return the top k nearest neighbors

Return the 3 rows in short_feature closest to [1,1,1,1]:

SELECT xid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM vector ORDER BY 2 LIMIT 3;

Result:

+-------+--------------+
| xid   |   dis        |
+-------+--------------+
| 1     |   0.0        |
+-------+--------------+
| 2     |   4.0        |
+-------+--------------+
| 0     |   16.0       |
+-------+--------------+

KNN with a structured filter

Return the 4 closest rows where xid = 5 and cid = 4:

SELECT uid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM vector WHERE xid = 5 AND cid = 4 ORDER BY 2 LIMIT 4;

Result:

+-------+--------------+
| uid   |   dis        |
+-------+--------------+
| s     |   20.0       |
+-------+--------------+
| x     |   31.0       |
+-------+--------------+
| j     |   36.0       |
+-------+--------------+
| j     |   68.0       |
+-------+--------------+

RNN: filter by distance threshold

Return up to 3 rows where distance is less than 50.0 and xid = 5:

SELECT uid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM vector WHERE l2_distance(short_feature, '[1,1,1,1]') < 50.0 AND xid = 5 ORDER BY 2 LIMIT 3;

Result:

+-------+---------------+
| uid   |   dis         |
+-------+---------------+
| s     |   20.0        |
+-------+---------------+
| x     |   31.0        |
+-------+---------------+
| j     |   36.0        |
+-------+---------------+

Use cases

Image-based search — Encode product or scene images as vectors and find visually similar results.
Facial recognition — Store face embeddings and match query faces against a database.
Voiceprint matching — Compare speaker voice embeddings for authentication or identification.
Text search — Encode sentences with a language model and retrieve semantically similar passages.

Performance

AnalyticDB for MySQL uses a massively parallel processing (MPP) architecture to deliver millisecond-level vector search. Benchmark results using 512-dimensional face vectors:

Scenario	Data volume	QPS	Response time	Recall
High recall, moderate load	10 billion entries	100	50 ms	99%
High throughput	200 million entries	1,000	1 s	99%

The database supports high-concurrency real-time writes. Data is immediately queryable after insertion.