All Products
Search
Document Center

AnalyticDB for MySQL:Vector search

Last Updated:Dec 22, 2023

AnalyticDB for MySQL provides the vector search feature to help you implement similarity search on unstructured data. This topic describes the vector search feature and how to create and use vector indexes.

Prerequisites

An AnalyticDB for MySQL cluster of V3.1.4.0 or later is created.

Note
  • To use the vector search feature, we recommend that you use the following minor version: 3.1.5.16, 3.1.6.8, 3.1.8.6, or later.

  • If your cluster is not of the preceding versions, we recommend that you set the CSTORE_PROJECT_PUSH_DOWN and CSTORE_PPD_TOP_N_ENABLE parameters to false before you use the vector search feature.

  • For information about how to query the minor version of an AnalyticDB for MySQL cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.

Background information

Overview

You can use AI algorithms to extract features from unstructured data, encode the features into feature vectors, and then store the feature vectors in AnalyticDB for MySQL. When you use feature vectors to identify unstructured data, the distance between vectors can measure the similarity between unstructured data. AnalyticDB for MySQL provides the efficient vector search feature for scenarios such as image-based search, voiceprint matching, face recognition, and text search.

Architecture

image.png

Benefits

  • High dimensionality, high performance, and high recall for vector data

    In this example, a 512-dimensional vector that represents a human face is used. AnalyticDB for MySQL can provide a 99% recall for 10 billion entries of vector data in scenarios that require 100 queries per second (QPS) and a 50-millisecond response time or 0.2 billion entries of vector data in scenarios that require 1,000 QPS and a one-second response time.

  • Integrated query of structured and unstructured data

    AnalyticDB for MySQL supports integrated query by using k-nearest neighbor (KNN) and radius nearest neighbor (RNN) algorithms. For example, you can compare the similarity between two sets of vectors.

  • Real-time updates

    AnalyticDB for MySQL supports high-concurrency real-time writes and updates. You can query data immediately after the data is written.

  • Real-time search

    AnalyticDB for MySQL uses a massively parallel processing (MPP) architecture to provide millisecond-level data search and improve search efficiency.

  • Ease of use

    AnalyticDB for MySQL supports standard SQL statements to simplify the development process without the need for additional complex configurations.

Terms

feature vector

A vector is an algebraic representation of entities or applications. Vectors can be used to represent the relationships between entities in a vector space. The distance between two entities in the vector space represents their similarity. Examples: height, age, gender, and region. AnalyticDB for MySQL supports feature vectors only in the fixed-length array format. The following data types are supported: FLOAT[], BYTE[], and SMALLINT[].

vector search

A method for fast search and matching in a feature vector dataset.

vector index

An index of a specific type.

distance calculation

A custom function of a specific type. Each distance calculation formula corresponds to a custom function. Example: L2_DISTANCE.

KNN

An algorithm that is used to find k points in a feature vector dataset that are the nearest to a query point.

RNN

An algorithm that is used to find all points in a feature vector dataset that are within a specific radius of a query point.

Create a vector index

Syntax

You can create a vector index when you create a table. Syntax:

ann index [index_name] (column_name)] [algorithm=HNSW_PQ ] [distancemeasure=SquaredL2]

Parameters

  • ann index: the vector index keyword.

  • index_name: the name of the index. For information about the naming conventions of indexes, see the "Naming limits" section of the Limits topic.

  • column_name: the name of the vector column. For information about the naming conventions of columns, see the "Naming limits" section of the Limits topic.

  • algorithm: the algorithm that is used to calculate the vector distance. Default value: HNSW_PQ.

  • distancemeasure: the formula that is used to calculate the vector distance. Default value: SquaredL2. Calculation formula of SquaredL2: (x1-y1)2+(x2-y2)2+...(xn-yn)2.

Examples

In this example, a table named tbl_vector is created. The table contains two vector columns: float_feature and short_feature. The float_feature column is of the ARRAY<FLOAT> type and contains four-dimensional data. The short_feature column is of the ARRAY<SMALLINT> type and contains four-dimensional data. When the table is created, vector indexes are created on the two columns.

CREATE TABLE tbl_vector (
  xid bigint not null,
  cid bigint not null,
  uid varchar not null,
  vid varchar not null,
  wid varchar not null,
  float_feature array < float >(4),
  short_feature array < smallint >(4),
  ann index idx_short_feature(short_feature),
  ann index idx_float_feature(float_feature),
  PRIMARY KEY (xid, cid, vid)
) DISTRIBUTE BY HASH(xid) PARTITION BY VALUE(cid) LIFECYCLE 4;

Insert data

INSERT into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (1,2,'A','B','C','[1,1,1,1]','[1.2,1.5,2,3.0]');
INSERT into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (2,1,'e','v','f','[2,2,2,2]','[1.5,1.15,2.2,2.7]');
INSERT into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (0,6,'d','f','g','[3,3,3,3]','[0.2,1.6,5,3.7]');
INSERT into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'j','b','h','[4,4,4,4]','[1.0,4.15,6,2.9]');
INSERT into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (8,5,'Sj','Hb','Dh','[5,5,5,5]','[1.3,4.5,6.9,5.2]');
insert into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'x','g','h','[3,4,4,4]','[1.0,4.15,6,2.9]');
insert into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'j','r','k','[6,6,4,4]','[1.0,4.15,6,2.9]');
insert into tbl_vector (xid,cid,uid,vid,wid,short_feature,float_feature) VALUES (5,4,'s','i','q','[2,2,4,4]','[1.0,4.15,6,2.9]');

Query data

  • Query the top three entries in the short_feature column that have the shortest distance to the vector '[1,1,1,1]'.

    SELECT xid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM tbl_vector ORDER BY 2 LIMIT 3;

    Sample result:

    +-------+--------------+
    | xid   |   dis        |
    +-------+--------------+
    | 1     |   0.0        |
    +-------+--------------+
    | 2     |   4.0        |
    +-------+--------------+
    | 0     |   16.0       |
    +-------+--------------+
  • Query the top four entries in the short_feature column that have the shortest distance to the vector '[1,1,1,1]' when xid is 5 and cid is 4.

    SELECT uid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM tbl_vector WHERE xid = 5 AND cid = 4 ORDER BY 2 LIMIT 4;

    Sample result:

    +-------+--------------+
    | uid   |   dis        |
    +-------+--------------+
    | s     |   20.0       |
    +-------+--------------+
    | x     |   31.0       |
    +-------+--------------+
    | j     |   36.0       |
    +-------+--------------+
    | j     |   68.0       |
    +-------+--------------+
  • Query the top three entries in the short_feature column that have the shortest distance to the vector '[1,1,1,1]' when the distance is less than or equal to 50.

    SELECT uid, l2_distance(short_feature, '[1,1,1,1]') as dis FROM tbl_vector WHERE l2_distance(short_feature, '[1,1,1,1]') < 50.0 AND xid = 5 ORDER BY 2 LIMIT 3;

    Sample result:

    +-------+---------------+
    | uid   |   dis         |
    +-------+---------------+
    | s     |   20.0        |
    +-------+---------------+
    | x     |   31.0        |
    +-------+---------------+
    | j     |   36.0        |
    +-------+---------------+