AliSQL Vector Technology Analysis (1): Storage Format and Algorithm Implementation

This article details the storage format and HNSW algorithm implementation behind AliSQL’s native vector indexing capability for high-dimensional AI workloads.

By Shi Mengchu and Wang Zihao

Background

As AI and large language model (LLM) applications gain traction, so does the demand for efficient storage, search, and retrieval of high-dimensional vectors, which are key representations for complex data (such as text, images, and speech), in sectors including recommendation systems, image retrieval, and natural language processing (NLP). This has posed new challenges for database technology. Vector application scenarios have spawned dedicated vector databases, and also promoted the trend of adding vector capabilities to traditional databases.

In this context, PostgreSQL uses the pgvector extension for efficient vector search, whereas the MySQL ecosystem has long lacked native support. Although MySQL 9.0 supports the VECTOR data type, it lacks general vector index capabilities, and its built-in vector distance function is only supported within HeatWave. This technological gap results in enterprises' dependence on self-deployed vector databases or data migration for hyperdimensional computing.

Built on MySQL 8.0, AliSQL adds native support for enterprise-class vector data processing, providing an out-of-the-box vectorization solution. By seamlessly integrating high-precision vector matching and complex business logic using standard SQL interfaces, AliSQL helps enterprises quickly implement innovative AI applications with a cost-efficient and highly compatible architecture.

Based on AliSQL 8.0 20251031, this article focuses on the key implementation of vector indexes, from the storage format to algorithm implementation, to help readers better understand and use vector capabilities.

Vector Use Example

AliSQL natively supports the storage and computing of vector data with up to 16,383 dimensions. AliSQL integrates mainstream vector functions such as cosine similarity (COSINE) and Euclidean distance (EUCLIDEAN), and supports Hierarchical Navigable Small World (HNSW)-based creation of vector indexes on columns containing full-dimensional vectors.

Create a table with a vector index, insert data, and perform a vector search:

# Create a table with a vector index.
CREATE TABLE `t1` (
  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `animal` VARCHAR(10), 
  `vec` VECTOR(2) NOT NULL, # Add a column of the VECTOR data type
  VECTOR INDEX `vi`(`vec`) m=6 distance=cosine # Explicitly specify m and distance
  );

# Insert data.
INSERT INTO `t1`(`animal`, `vec`) VALUES
  ("Frog", VEC_FROMTEXT("[0.1, 0.2]")),
  ("Dog", VEC_FROMTEXT("[0.6, 0.7]")),
  ("Cat", VEC_FROMTEXT("[0.6, 0.6]"));

# Perform a vector search.
SELECT `animal`, VEC_DISTANCE(`vec`, VEC_FROMTEXT("[0.1, 0.1]")) AS `distance` FROM t1 ORDER BY `distance`;

Sample response:

|--------|---------------------------|
| animal | distance                  |
|--------|---------------------------|
| Cat    | 0.00000001552204198507212 |
| Dog    |     0.0029455257170004634 |
| Frog   |       0.05131670194948623 |
|--------|---------------------------|

You can also create a vector index on a table that already contains vectors.

# Convert a column to the VECTOR type.
ALTER TABLE `t1` MODIFY COLUMN `vec` VECTOR(2) NOT NULL;
# Create a vector index.
CREATE VECTOR INDEX vi ON t1(v); # If m and distance are not explicitly specified, the default values are used.

Detailed Guide on Vector Indexes

As one of the most popular Approximate Nearest Neighbor (ANN) algorithms, HNSW has been widely recognized and verified in institutional evaluations and engineering implementations. AliSQL prioritizes support for HNSW-based vector indexes. The following figure shows the overall architecture of vector search.

• Following the estimation of the ANN query cost, an appropriate index is selected for search. You can also use a hint, such as FORCE INDEX, to specify a vector index for search.

• Logically, a complete HNSW graph exists. The information in this graph is organized into an auxiliary table that is stored persistently on the disk. Each row of data in the table represents a node in the HNSW graph. The HNSW algorithm operates on this graph representation to support vector insertion and retrieval.

• Unlike common indexes that directly access the storage engine, an HNSW-based index uses a vector index extension to maintain a Nodes Cache for the HNSW graph in memory to improve query efficiency.

HNSW Algorithm

HNSW is an efficient ANN search algorithm that uses a multi-layer graph structure, proposed in the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs" ("the Paper"). Here is a breakdown of key elements:

• [Layered skip lists] The graph at layer 0 has information on all nodes. From bottom to top, the graph at the higher layer is like a "snapshot" that contains only some nodes of the graph at the lower layer. The top layer is used for fast search, and the bottom layer is used for accurate search.

• [Connections to neighbors] Each layer represents a proximity graph where each node is connected to its neighbors (nearest nodes) based on vector distance.

NNS Layer

First, here is a step-by-step guide on how to perform Nearest Neighbor Search (NNS) in a Navigable Small World (NSW, a single-layer graph), which corresponds to Algorithm 2 in the Paper.

[Input] Query point q, number of returned results ef, candidate set C (including one or more nodes of the current layer), and result set W (initially empty).
[Search loop] Extract node c closest to q from the candidate set C. Traverse each neighbor e of c. If e has not been visited and is closer to q, add e to the result set W and also to the candidate set C. Repeat this process.
[Termination condition] When the distance between q and node c in the candidate set C is not less than that between q and the farthest node f in the result set W, terminate the search loop.
[Output] The first ef nearest neighbor nodes in the result set W.

Insert Algorithm

The insert algorithm corresponds to Algorithm 1 in the Paper. Parameter M indicates the maximum number of neighbors of a node in each layer, and L indicates the maximum number of layers of the current graph.

[Initialization] Determine the maximum layer l to which the new node q is assigned.
[Search from layer L down to layer l+1] Start from a random node at the topmost layer L of the HNSW graph. Execute the Search Layer algorithm with ef = 1. This means performing a greedy search on the NSW graph at each layer to find a locally nearest neighbor to q, which then serves as the entry point for the next lower layer. Repeat this process layer by layer until the search at layer l+1 is complete.
[Insertion from layer l down to layer 0] Use the nearest neighbor found at layer l+1 as the entry point for layer l, and execute the Search Layer algorithm with ef = M. At this layer, insert node q and create bidirectional connections between q and the neighbor nodes discovered during the search. Repeat this process layer by layer until reaching layer 0. Take note that layer 0 differs from higher layers as it uses ef = 2M for its search.
[Shrink] At each layer where q is inserted, if the number of connections to all neighbor nodes exceeds M (2M for layer 0), the farthest connection is removed through the shrink operation to keep the graph compact.

kNN Search Algorithm

The k-nearest neighbor (kNN) search algorithm corresponds to Algorithm 5 in the Paper.

[Fast search] Execute the Search Layer algorithm with ef = 1 from layer L down to layer 1.
[Accurate search] Search for k nearest neighbors at layer 0 and return them.

Vector Index Storage Format

The HNSW graph structure must be stored in the database to implement the HNSW algorithm. The structure of the auxiliary table is shown below. Each row of data corresponds to all information of a node in the graph. A node contains all its representations from layer 0 up to the maximum layer at which this node exists.

Index	Column name	Type	Description	Use
PRIMARY KEY	ROW_ID	System columns	Primary key, named gref	Neighbor search
	TRX_ID ROLL_PTR	System columns	Transaction information	Ensuring indexes are MVCC-aware
KEY	layer	TINYINT NOT NULL	Maximum layer	Algorithm traversal
Unique KEY	tref	VARBINARY(?)	Primary key column of the base table, stored in the server-layer binary format	Table lookup
	vec	BLOB NOT NULL	Quantized vector, with each element using the Int16 format and occupying 2 bytes	Vector distance calculation
	neighbors	BLOB NOT NULL	Neighbors of the node at each layer	Algorithm traversal

The following example shows some key points.

gref (graph reference) stores the primary key value of the auxiliary table and is used to search for neighbor nodes. It uses the InnoDB system column ROW_ID.
layer stores a key for quickly locating the entry point to the graph, which is typically a node at the highest layer.
tref (table reference) stores the primary key value of the base table in the server-layer binary format, and is used for table lookup following vector search. For example, if the primary key of the base table is an Int column (value = 1), it is stored as 80 00 00 01 at the InnoDB engine layer. However, it is stored as a little-endian binary sequence 01 00 00 00 (server-layer binary format) in the auxiliary table.
vec values (vectors in the auxiliary table) are quantized to the Int16 format. A scale value (stored in the float32 format occupying the first 4 bytes) must also be recorded for de-quantization. The quantization significantly reduces storage space and improves search efficiency, despite some loss of precision.
neighbors include separate neighbor lists of each node at every layer from layer 0 to its maximum layer. Each layer stores 1 byte of neighbor count and each neighbor's gref value, allowing the search algorithm to perform greedy navigation and search between same-layer neighbors.

[Quantization] When a float32 vector element is quantized to Int16, the linear mapping formula Int16 = round (float32/scale) is used: WHERE, scale is the ratio of the largest absolute value of a vector's elements to 32767 (maximum Int16 value). For example, the vector [0.6, 0.7] provided in the above figure is quantized in the following way:

[9A 99193F | 3333333F]

Primitive vector [0.6, 0.7] big-endian
    | 1. Find the largest absolute value: 0.7
    | 2. Calculate scale: 0.7/32767
    v
Scale = 2.13629555e-05
    | 3. Quantization: Each element is divided by the scale value and rounded off.
    v
Quantized vector: [28086, 32767]
    | 4. Stored as: scale (float32) + dims (Int16 array)
    v
Structure of stored vector: [scale: 2.13629555e-05] [dims: 28086] [dims: 32767] little-endian

[2860 C2 37 | B6 6D | FF 7F]

Overall, the auxiliary table with a vector index employs a well-designed format and quantization for efficient storage of HNSW graph node information. The proper indexing and a carefully designed neighbors column structure fully support the HNSW algorithm, which operates by successively loading node data from this auxiliary table.

DD Integration and DDL Atomicity Guarantee

Different from ordinary tables, auxiliary tables with a vector index are invisible to users and cannot be directly accessed. The following figure shows the relationship between the base table and the auxiliary table.

• An auxiliary table is named vidx_<table_id>_00 using the base table ID. The base table can use the auxiliary table's name to find the auxiliary table's metadata from the data dictionary (DD).

• To perform a vector search or update, you can open an auxiliary table through the base table.

The auxiliary table's TABLE_SHARE is constructed from its DD and anchored to the base table through the hlindex pointer. Additionally, a shared Nodes Cache is created for the auxiliary table and accessible through the hlindex_data pointer.
The auxiliary table's TABLE structure is also anchored to the base table through the hlindex pointer, and a context object is created to hold the vector search result set.

• The auxiliary table's TABLE_SHARE is not stored in the table cache. It is closed together with the base table.

During DDL execution, structural changes to both the base and auxiliary tables are encapsulated within the same transaction. Through MySQL's transaction system and DDL log, atomic and Crash Safe DDL operations are guaranteed.

Summary

By designing a structured auxiliary table with a vector index, AliSQL completely realizes the efficient HNSW graph storage and traversal. Combined with the integration of MySQL 8.0 data dictionary (DD), it ensures the consistency of metadata and the atomicity of DDL operations. This addresses MySQL ecosystem's long-term weakness in vector processing and provides out-of-the-box high-dimensional search capabilities for existing AliSQL instances.

This article focuses on the vector index storage and algorithm implementation. However, transactional guarantees and concurrency control are still required to realize database integrity and reliability. In AliSQL Vector Technology Analysis (2): Nodes Cache and Concurrency Control, we will discuss how Nodes Cache optimizes queries and how vector operations meet production-grade requirements through concurrency control and transaction isolation.

Community

AliSQL Vector Technology Analysis (1): Storage Format and Algorithm Implementation

Background

Vector Use Example

Detailed Guide on Vector Indexes

HNSW Algorithm

NNS Layer

Insert Algorithm

kNN Search Algorithm

Vector Index Storage Format

DD Integration and DDL Atomicity Guarantee

Summary

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

Alibaba Cloud Model Studio

Qwen

AI Acceleration Solution

Network Intelligence Service