
Reshaping AI Data Streams: A Deep Dive into PolarDB IMCI's Native Vector Capabilities

This article introduces PolarDB IMCI's native vector capabilities, which integrate vector indexing and embedding into the database kernel to simplify AI application development.

By Nanlong Yu

Introduction: The Hidden Challenges in AI Application Development

In the current wave of AI driven by large models, vectors play a crucial role in everything from building RAG applications to implementing long-term memory for Agents. A typical vector data flow involves two core stages:

1. Ingestion: Unstructured data, such as text, is converted into vectors by an embedding model, and the vectors are stored in a vector index.

2. Retrieval: A user's query is also converted into a vector, which is then used to retrieve relevant information from the index.

However, in real-world AI application development, developers often face a fragmented technology stack. Vector indexes, embedding models, and business databases are typically three separate systems. This separation introduces numerous "hidden costs":

1. Development Hurdles: Developers must select and integrate different software and cloud services, writing a large amount of glue code that increases complexity.

2. Operational Hurdles: Business data and vector data must be synchronized manually or through ETL tools. This not only increases the O&M burden but also leads to data latency, affecting the real-time performance of AI applications.

3. Usability Hurdles: Different vector databases and services have inconsistent APIs and lack a unified standard. When performing hybrid queries involving both vectors and scalars (such as product prices or tenant IDs), capabilities are often limited, leading to high learning and implementation costs.

Figure 1: PolarDB's Multi-Modal Hybrid Retrieval Architecture

To solve these problems, PolarDB IMCI (In-Memory Column Index) proposes a new approach: integrating vector indexing and embedding capabilities directly into the database kernel to build a multi-modal hybrid retrieval architecture, providing developers with integrated, full-lifecycle vector management.

Figure 2: Vector Lifecycle Management Process

Seeing is Believing: Powering a RAG Application with a Single SQL Statement

Before diving into the technical details, let's use a simple example to see the changes that PolarDB brings. Assume we want to build a technical Q&A bot for PolarDB IMCI, and our knowledge base contains the following three documents:

• "PolarDB IMCI supports accessing both row-store and column-store data in a single SQL statement by using Hybrid Plans."

• "HashMatch is a join operator in PolarDB IMCI."

• "PolarDB IMCI provides built-in vector indexing and embedding services."

In the past, this required coordinating multiple systems. Now you only need a few lines of SQL.

Step 1: Create and populate the knowledge base

-- Create a table where the vec column is automatically generated from the doc column by using the EMBEDDING expression.
-- At the same time, declare an HNSW vector index by using the COMMENT syntax.
CREATE TABLE t1 (
    doc TEXT,
    vec VECTOR(1024) AS (EMBEDDING(doc, "text-embedding-v4", 1024)) STORED
        COMMENT 'imci_vector_index=HNSW(metric=cosine,max_degree=16,ef_construction=300)'
) COMMENT 'COLUMNAR=1';

-- Insert the raw text data. The database automatically handles vector generation and index construction.
INSERT INTO t1 (doc) VALUES
    ("PolarDB IMCI can use Hybrid Plan to access both row store and column store in an SQL statement"),
    ("HashMatch is a join operator in PolarDB IMCI"),
    ("PolarDB IMCI provides built-in vector indexes and embedding services");


Step 2: Use a single SQL statement for query, vectorization, retrieval, and prompt engineering

When a user asks, "What is HashMatch?", you can retrieve information from the knowledge base and generate a prompt as follows:

SELECT CONCAT("Please refer to the following: ", GROUP_CONCAT(doc), " and answer the user's question in the appropriate tone: What is HashMatch?")
FROM (
    SELECT doc
    FROM t1
    ORDER BY DISTANCE(vec, EMBEDDING("What is HashMatch?", "text-embedding-v4", 1024), 'COSINE')
    LIMIT 1
) AS t;


This example perfectly demonstrates the advantages of PolarDB:

  1. Integrated: The EMBEDDING expression is seamlessly integrated with the vector index. By using materialized generated columns, text vectorization requires no application-level intervention.
  2. Automated: Simply declare the index in the table definition, and a background database task will automatically build and maintain the vector index, completely eliminating the hassle of data synchronization.
  3. Standardized: All operations are performed in the SQL ecosystem that developers are most familiar with. EMBEDDING and DISTANCE expressions are as easy to use as standard SQL functions, resulting in a minimal learning curve.

Next, we will take a deep dive into the underlying technical design.

Detailed Design: Vector Capabilities Native to an HTAP Database

We have deeply integrated vector capabilities into PolarDB IMCI, an architectural decision that provides unique technical advantages. Instead of "reinventing the wheel," we've built our vector index "on the shoulders of giants."

Architectural Core: A Secondary Index Based on Columnar Storage

We implement the HNSW vector index as a secondary index on columnar storage: it stores the mapping from each vector to its data row ID (RowID). This design offers several benefits:

Leverage Mature Capabilities: Standalone vector indexes typically lack mature, enterprise-grade features such as transactions and backup and recovery (checkpoint/recover). By integrating with columnar storage, these capabilities are reused directly. For example, data visibility checks can use the columnar store's delete bitmap, which natively supports transactions.

Unified Data Management: The vector index is responsible only for vector retrieval, while scalar data (such as tenant IDs, timestamps, and categories) is stored in the columnar store. During a query, RowIDs can be used to quickly link the two.

High-Performance Scalar Filtering: During a hybrid retrieval of "vector + scalar" data, the system can fully leverage the columnar engine's powerful I/O pruning (reading only required columns) and vectorized executor to achieve highly efficient scalar filtering.
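
To make this concrete, here is a minimal sketch of such a hybrid query. The products table and its scalar columns are hypothetical; the EMBEDDING, DISTANCE, and index declaration syntax follows the example shown earlier in this article.

-- Hypothetical schema: scalar columns live in the columnar store, while the
-- HNSW index maps each vector back to its RowID.
CREATE TABLE products (
    tenant_id INT,
    price DECIMAL(10, 2),
    doc TEXT,
    vec VECTOR(1024) AS (EMBEDDING(doc, "text-embedding-v4", 1024)) STORED
        COMMENT 'imci_vector_index=HNSW(metric=cosine,max_degree=16,ef_construction=300)'
) COMMENT 'COLUMNAR=1';

-- Hybrid retrieval: the columnar engine reads only the referenced columns and
-- filters tenant_id with its vectorized executor; RowIDs then link the
-- surviving rows to the vector index for the similarity search.
SELECT doc
FROM products
WHERE tenant_id = 42
ORDER BY DISTANCE(vec, EMBEDDING("wireless earbuds", "text-embedding-v4", 1024), 'COSINE')
LIMIT 5;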

Vector Index Construction: An Asynchronous Mechanism for Efficiency and Timeliness

Building a vector index is resource-intensive. To avoid impacting the physical replication of columnar storage, we use an asynchronous build process. However, this introduces new challenges:

  1. If the foreground write traffic is heavy, incremental data may not be written to the vector index in time, causing a backlog. This backlogged data cannot be accelerated by the index and can only be retrieved by using a brute-force search, which affects performance.
  2. Columnar storage uses mark-and-delete, which creates holes in the baseline vector index. This not only increases space overhead but also degrades performance.
  3. When using quantization, the quality of the pre-trained codebook degrades with new writes, affecting recall.

To this end, we adopted the design principles of the LSM-Tree and broke the asynchronous build task down into two sub-tasks:

  • Incremental Data Sync (similar to a Flush): This task dynamically monitors the latency of index building.

    • Low Latency: Incremental data is written to the baseline index row by row. This makes new data available as soon as possible and avoids generating too many small indexes.
    • High Latency: The accumulated incremental data is quickly built into a small index via bulk load to prevent the backlog from affecting queries.
  • Baseline Index Merge (similar to a Compaction): This task merges small indexes and cleans up holes left by deleted data.

    • Row-by-row merge: When the data volume is large, the deletion rate is low, and recall remains high, small indexes are merged into the baseline index row by row to reduce overhead.
    • Full rebuild: When there are a large number of updates or deletes, a full rebuild is triggered to produce a compact and efficient baseline index, ensuring query performance and recall.

Vector Retrieval: Collaboration Between the Optimizer and Executor

PolarDB IMCI supports both exact and approximate search. Exact search is done via a simple brute-force scan, so our focus is on high-performance approximate nearest neighbor (ANN) search.

Optimizer: Intelligent Path Selection for Hybrid Retrieval

In hybrid retrieval scenarios, the order of filtering scalars and retrieving vectors has a significant impact on performance.

  • Pre-filter (scalar-first): If a scalar condition (such as price < 100) filters out a large amount of data, scalar filtering is performed first; brute-force vector computation on the small surviving result set is very fast.
  • Post-filter (vector-first): If the scalar condition has poor selectivity, the K nearest neighbors are recalled from the vector index first, and the scalar condition is then applied to them.

The PolarDB optimizer dynamically selects a pre-filter or post-filter execution plan based on statistics, as illustrated below. In the future, we will also introduce execution feedback and adaptive execution to make these decisions more intelligent.
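
As a minimal sketch (reusing the hypothetical products table from the previous section), the user always writes the same declarative query; the comments paraphrase the two plans the optimizer chooses between.

-- One declarative hybrid query; the optimizer picks the plan.
SELECT doc
FROM products
WHERE price < 100
ORDER BY DISTANCE(vec, EMBEDDING("wireless earbuds", "text-embedding-v4", 1024), 'COSINE')
LIMIT 10;

-- Pre-filter plan (selective predicate): evaluate price < 100 first, then
-- brute-force the distance computation over the small surviving set.
-- Post-filter plan (non-selective predicate): recall the top-K candidates from
-- the HNSW index first, then apply price < 100 to those candidates.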

Executor: Transaction-Level, Real-Time Retrieval

To ensure both data visibility (transactions) and data freshness (real-time), the executor uses a two-phase recall policy:

  • Baseline Index Retrieval: Retrieves candidates from the vector index, using the columnar store's delete bitmap for visibility checks, which naturally supports transaction isolation.
  • Incremental Data Retrieval: The latest data that has not yet been written to the index is scanned and processed with a brute-force search.
  • Result Merging: Finally, an upstream operator performs a merge sort on the results from both stages to produce the final top-K result.

Additionally, through Sideways Information Passing, if an upstream Filter operator filters out some of the retrieved vector results, the executor can dynamically fetch more candidates from the vector index to ensure the quantity and quality of the final result set.
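
Conceptually, the final result is equivalent to unioning the two recall paths and re-sorting. In the sketch below, baseline_candidates and incremental_candidates are illustrative stand-ins for the executor's internal operators, not user-visible tables:

SELECT doc, dist
FROM (
    SELECT doc, dist FROM baseline_candidates     -- ANN hits from the HNSW index,
                                                  -- visibility-checked via the delete bitmap
    UNION ALL
    SELECT doc, dist FROM incremental_candidates  -- brute-force scan over rows not yet indexed
) AS merged
ORDER BY dist   -- merge sort across both stages
LIMIT 10;       -- final top-K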

EMBEDDING: Embracing an Open Ecosystem

Embedding model technology is evolving rapidly. We believe that the core of a database is data management and computation, not the models themselves. Therefore, PolarDB IMCI adopts a more open and flexible approach:

• API calls to external embedding services (such as Alibaba Cloud's Bailian) are encapsulated as built-in SQL expressions.

• Users can call EMBEDDING directly in SQL, as simply as calling AVG or SUM.

This design allows users to always have access to state-of-the-art (SOTA) models while remaining compatible with the simple SQL ecosystem, perfectly integrating the AI and database worlds.
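
For example, EMBEDDING composes anywhere an ordinary expression can appear. A minimal sketch, reusing the model name from the earlier examples and assuming DISTANCE accepts any two vector expressions, as its signature above suggests:

-- Vectorize a string ad hoc, exactly like calling a built-in function.
SELECT EMBEDDING("What is HashMatch?", "text-embedding-v4", 1024);

-- Combine it with DISTANCE to compare two texts directly.
SELECT DISTANCE(
    EMBEDDING("HashMatch join operator", "text-embedding-v4", 1024),
    EMBEDDING("hash join", "text-embedding-v4", 1024),
    'COSINE'
);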

Performance Testing: The Data Speaks for Itself

We compared the performance of PolarDB IMCI with open-source PGVector and MariaDB on the public GIST-960 dataset. Under the same hardware specifications (see Appendix), the results show that across various recall rates, the vector retrieval throughput (QPS) of PolarDB IMCI is 2 to 3 times that of the other two products.


Summary

By integrating vector retrieval and embedding capabilities into the database kernel, PolarDB IMCI fundamentally solves the pain points of fragmented tech stacks, data silos, and complex O&M in traditional AI application development. It not only provides a high-performance, transactional, and real-time vector database but, more importantly, unifies everything under the familiar SQL language. This dramatically lowers the barrier to developing and maintaining AI applications. We believe this "all-in-one" design will become the new paradigm for the convergence of AI and databases, providing a solid and efficient data foundation for developers building the next generation of intelligent applications.

Appendix

The tests were performed on an Intel(R) Xeon(R) Platinum 8357C CPU @ 2.70GHz. PolarDB was configured with 128 GB of memory. The parameters for open-source PGVector and MariaDB are as follows:

1. MariaDB:

innodb_buffer_pool_size = 256G
mhnsw_max_cache_size = 128G

2. Open source PGVector:

shared_buffers = 128GB
work_mem = 1GB
maintenance_work_mem = 128GB
effective_cache_size = 128GB