
Rebuild Search Pipelines: An Analysis of PolarDB IMCI Capabilities

The article introduces PolarDB IMCI’s native columnar full-text indexing for efficient, integrated text and hybrid vector search—eliminating the need for external search engines.

By Bao Kai

1. Background

As the Internet and AI develop rapidly, the need to store and efficiently search unstructured text data in modern application systems has become more urgent than ever. Full-text indexing is a key technology for text search and plays a critical role in fuzzy match, content search, and multimodal search.

Traditional relational databases such as MySQL and PostgreSQL also offer full-text indexing. However, because they typically do not adopt mainstream inverted-index technology, they commonly have the following shortcomings:

  1. Performance bottlenecks: poor concurrent query throughput and severe write performance degradation under frequent update workloads.
  2. Limited features: lack of support for advanced tokenization strategies, such as Chinese tokenizers.

To address these limits, many systems choose to deploy a dedicated full-text search engine such as Elasticsearch or OpenSearch, and synchronize data to the engine for efficient text search. However, using an external search engine introduces several notable disadvantages:

  1. Challenges in timeliness and consistency: Data must be synchronized between the database and the search engine, making it difficult to guarantee timeliness. Search engines also typically do not support atomicity, consistency, isolation, and durability (ACID) transactions, so the atomicity of index updates and database operations cannot be guaranteed.
  2. Fragmented query semantics: Application code needs to maintain separate SQL and search-engine query logic, increasing code complexity and coupling.
  3. Higher costs and greater architectural complexity: Additional computing and storage resources are required, data is redundantly stored, and the overall architecture and system O&M become more complex.


To address the preceding issues, PolarDB In-Memory Column Index (IMCI) introduces an entirely new approach: database-native columnar full-text indexing. This design significantly improves query performance in fuzzy search scenarios such as e-commerce product search, log analysis, document and knowledge base retrieval, and user behavior and profile analysis. Together with built-in vector indexes, IMCI also enables multimodal hybrid search, positioning PolarDB as a unified data platform that supports online transactions, real-time analysis, full-text search, and vector search.

2. Inverted Indexes

The columnar inverted-index technology in PolarDB, part of the Alibaba Cloud ApsaraDB ecosystem, primarily incorporates tokenizers, the Single-Pass In-Memory Indexing (SPIMI) algorithm, the Roaring Bitmap (RBM) compression algorithm for postings lists, and a Finite State Transducer (FST)-based dictionary implementation.

2.1 Tokenizers

PolarDB's columnar full-text indexing supports multiple tokenizers, including token, ngram, jieba, ik, and json. It also provides tools such as dbms_imci.fts_tokenize for testing tokenization effects and managing custom dictionaries.
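
To make the role of a tokenizer concrete, the following minimal Python sketch implements a toy n-gram tokenizer. It is purely illustrative and is not PolarDB's implementation; the actual behavior of the built-in tokenizers can be inspected with dbms_imci.fts_tokenize.

```python
# Toy n-gram tokenizer: split text into overlapping fixed-length fragments,
# a common strategy for text without clear word delimiters (illustrative only).
def ngram_tokenize(text: str, n: int = 3) -> list[str]:
    stripped = "".join(ch for ch in text.lower() if not ch.isspace())
    return [stripped[i:i + n] for i in range(len(stripped) - n + 1)]

print(ngram_tokenize("Full text search"))  # ['ful', 'ull', 'llt', 'lte', ...]
```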

2.2 Postings Lists

Postings lists

A postings list is essentially a collection of document IDs. Its core technical challenges lie in achieving extreme compression and efficient computation. In PolarDB's columnar full-text indexing, a postings list stores the set of row IDs in all columnar indexes containing a given term, along with optional term frequency and document frequency information. Each term corresponds to a postings list.
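
The sketch below shows this mapping conceptually with a few hypothetical rows: each term points to the sorted list of row IDs whose text contains it (term frequency and document frequency are omitted).

```python
# Conceptual postings lists: term -> sorted row IDs that contain the term.
from collections import defaultdict

rows = {0: "polar column index", 1: "column store search", 2: "full text search"}

postings: dict[str, list[int]] = defaultdict(list)
for row_id in sorted(rows):
    for term in sorted(set(rows[row_id].split())):
        postings[term].append(row_id)

print(postings["search"])  # [1, 2]
print(postings["column"])  # [0, 1]
```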

RBM algorithm

The compression of postings lists primarily relies on the Frame of Reference (FOR) algorithm and the RBM algorithm, with the bitmap-based RBM being the more widely adopted of the two; it is used in systems such as Elasticsearch. As a high-performance compressed bitmap data structure, RBM uses different types of containers to efficiently handle both sparse and dense numeric arrays, achieving strong compression while enabling extremely fast set operations. For more information, see Better bitmap performance with Roaring Bitmaps.
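
As a rough Python sketch of the container idea (simplified from the real CRoaring implementation, which also has run containers, SIMD code paths, and keeps array containers sorted): 32-bit values are partitioned by their high 16 bits, and each chunk is stored as a small array while sparse, or as a 65,536-bit bitmap once it holds more than 4,096 values.

```python
# Simplified Roaring Bitmap sketch: one container per high-16-bit chunk,
# switching from a sparse array to a dense bitmap past 4,096 values.
ARRAY_LIMIT = 4096


class RoaringSketch:
    def __init__(self):
        # high 16 bits -> ("array", list of low values) or ("bitmap", bytearray)
        self.containers = {}

    def add(self, value: int) -> None:
        high, low = value >> 16, value & 0xFFFF
        kind, data = self.containers.get(high, ("array", []))
        if kind == "array":
            if low not in data:
                data.append(low)
            if len(data) > ARRAY_LIMIT:        # sparse -> dense conversion
                bitmap = bytearray(8192)       # 65,536 bits
                for v in data:
                    bitmap[v >> 3] |= 1 << (v & 7)
                kind, data = "bitmap", bitmap
        else:
            data[low >> 3] |= 1 << (low & 7)
        self.containers[high] = (kind, data)

    def contains(self, value: int) -> bool:
        high, low = value >> 16, value & 0xFFFF
        if high not in self.containers:
            return False
        kind, data = self.containers[high]
        if kind == "array":
            return low in data
        return bool(data[low >> 3] & (1 << (low & 7)))


rb = RoaringSketch()
rb.add(70_000)
print(rb.contains(70_000), rb.contains(70_001))  # True False
```

In practice, an existing library such as CRoaring (or a binding such as pyroaring) would be used rather than a hand-rolled structure.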


The RBM implementation for PolarDB's columnar inverted-index technology is extended from CRoaring [1] and customized for different cardinalities. It supports on-disk space compaction, single instruction multiple data (SIMD) acceleration during computation, and value-range pre-filtering and iterator optimizations during query processing.

[1] https://github.com/RoaringBitmap/CRoaring

2.3 Term Dictionary

Dictionary

The core idea of an inverted index is to use a dictionary to quickly find the postings list corresponding to a given term. Common dictionary implementations include trie structures, B+ trees, and FSTs.

FST algorithm

The FST algorithm provides a good balance between time and space complexity, and is widely used in systems such as Elasticsearch. For more information about the FST algorithm, see Direct Construction of Minimal Acyclic Subsequential Transducers.

PolarDB IMCI's full-text indexing uses an FST-based dictionary to record the relative offsets from terms to their postings lists. For example, given the document "I am Chinese, and I love China", the terms China, Chinese, and love, together with their corresponding offsets (5, 10, and 15), are inserted in sorted order to build the dictionary.
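
The sketch below mimics only the dictionary's interface with a sorted array, reusing the offsets from the example above; a real FST additionally shares common prefixes and suffixes across terms to compress the key space, which is why terms must be inserted in sorted order.

```python
# Toy stand-in for the FST dictionary interface: terms are inserted in sorted
# order, and a lookup returns the relative offset of the term's postings list.
import bisect


class ToyTermDict:
    def __init__(self):
        self.terms: list[str] = []
        self.offsets: list[int] = []

    def insert(self, term: str, offset: int) -> None:
        assert not self.terms or term > self.terms[-1], "insert terms in sorted order"
        self.terms.append(term)
        self.offsets.append(offset)

    def lookup(self, term: str) -> int | None:
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.offsets[i]
        return None


d = ToyTermDict()
for term, offset in [("China", 5), ("Chinese", 10), ("love", 15)]:
    d.insert(term, offset)
print(d.lookup("Chinese"))  # 10
```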


2.4 Inverted Index Building

SPIMI algorithm

Inverted index building generally involves tokenizing documents and generating postings lists and dictionaries. Common building algorithms include the block-based Blocked Sort-Based Indexing (BSBI) algorithm and the in-memory, hash-based SPIMI algorithm. The SPIMI algorithm uses a "single pass + memory partitioning + block-level flush + final merge" approach to efficiently build high-quality inverted indexes under limited memory. It avoids global sorting and is considered a classic and efficient indexing method in information retrieval systems. For more information, see Efficient Single-Pass Index Construction for Text Databases. PolarDB's columnar full-text indexing adopts the SPIMI algorithm to build its inverted indexes.

Inverted index building process

During full-text index building, PolarDB continuously scans the columnar data of a column. For each row, the text is tokenized to obtain a set of terms. Each term and its corresponding row ID are then inserted into an in-memory hash table, while memory usage is tracked. When the size of a segment exceeds a predefined threshold, the hash table is flushed to generate a dictionary and postings lists. This process repeats until all columnar data is processed. To build a dictionary from the hash table, the hash table is first sorted by term. All postings lists in the hash table are then written to disk, and their respective offsets are recorded. Next, each term and its corresponding offset are sequentially inserted into an FST to build the dictionary. Finally, the dictionary is compressed and flushed to disk, and the segment metadata is updated to reflect the completed block.
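
The following Python sketch condenses this build loop under strong simplifications: a plain hash map stands in for the FST dictionary, list positions stand in for on-disk byte offsets, and the memory threshold is a tiny term count chosen only for the demo. It is a conceptual illustration of the SPIMI-style flow, not PolarDB's code.

```python
# Simplified SPIMI-style build loop: tokenize rows into an in-memory hash
# table; when the table exceeds a threshold, sort it by term and flush it as
# one segment (dictionary + postings lists), then continue with a fresh table.
from collections import defaultdict

MEMORY_THRESHOLD = 2  # distinct terms per segment; tiny on purpose for the demo


def flush_segment(table, segments):
    segment = {"dictionary": {}, "postings": []}
    for position, term in enumerate(sorted(table)):  # sort terms before writing
        segment["dictionary"][term] = position       # stands in for an on-disk offset
        segment["postings"].append(table[term])      # "write" the postings list
    segments.append(segment)


def build_index(rows):
    segments, table = [], defaultdict(list)
    for row_id, text in rows:
        for term in set(text.split()):               # tokenize one row
            table[term].append(row_id)
        if len(table) >= MEMORY_THRESHOLD:           # memory budget exceeded
            flush_segment(table, segments)
            table = defaultdict(list)
    if table:                                        # flush the final partial segment
        flush_segment(table, segments)
    return segments


print(build_index([(0, "polar index"), (1, "polar search"), (2, "full text search")]))
```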


Thanks to its local building approach and mark-delete design, PolarDB's columnar full-text indexing avoids full index rebuilding under frequent updates and requires only incremental building. During updates, entries are simply marked as deleted, which incurs minimal overhead and does not impact write performance. In addition, background asynchronous merge operations produce progressively more compact inverted indexes.
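
A minimal sketch of the mark-delete idea, with hypothetical row IDs: a deletion only records the row ID in a delete set, the postings list itself is not rewritten, and deleted rows are filtered out at query time until a background merge compacts the segment.

```python
# Mark-delete sketch: filter deleted rows at query time instead of rewriting
# the postings list on every update.
postings_for_term = [3, 7, 12, 18]   # hypothetical postings list for one term
deleted_rows = {7, 18}               # rows marked as deleted after the build

visible = [row for row in postings_for_term if row not in deleted_rows]
print(visible)  # [3, 12]
```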

2.5 Inverted Retrieval

Inverted query

PolarDB's columnar full-text indexing supports the match function, as well as accelerated LIKE and JSON operations.

Queries can be run by using two methods: the FtsTableScan operator and the match expression. The columnar optimizer estimates the filter rate based on collected statistics and selects the appropriate method. If the operator method is selected, the optimizer rewrites the match function into an FtsTableScan + Filter operator.

Inverted retrieval process

PolarDB's columnar inverted indexes are organized into inverted segments, each consisting of independent metadata, a dictionary, and a set of postings lists. The metadata is kept resident in memory, whereas the dictionary is cached by using the least recently used (LRU) policy. Retrieval operates at the segment level: during a query, each inverted segment is traversed, and the starting address of the dictionary is obtained from the metadata. The dictionary data is then read to construct a dictionary object, which is used to look up the given term. If the term is found, its relative offset is retrieved and added to the starting address of the postings data to locate the complete postings list.
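
A simplified sketch of this per-segment lookup follows, using two hypothetical toy segments; the stored position stands in for the relative on-disk offset that is added to the postings area's starting address, and the dictionary probe corresponds to an FST lookup in the real system.

```python
# Per-segment retrieval sketch: probe every segment's dictionary for the term,
# resolve the stored position to its postings list, and merge the results.
segments = [
    {"dictionary": {"index": 0, "polar": 1}, "postings": [[0], [0]]},
    {"dictionary": {"polar": 0, "search": 1}, "postings": [[1, 2], [2]]},
]


def search(segments, term):
    rows = set()
    for segment in segments:                       # every segment is probed
        position = segment["dictionary"].get(term)
        if position is not None:
            rows.update(segment["postings"][position])
    return sorted(rows)


print(search(segments, "polar"))   # [0, 1, 2]
print(search(segments, "search"))  # [2]
```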

3. Performance Testing

3.1 Dataset

ESRally is the official benchmarking and stress-testing tool provided by Elasticsearch. The tests use the http_logs dataset, which contains approximately 247 million rows; its details can be found in the Elasticsearch Rally tracks repository [2].

[2] https://github.com/elastic/rally-tracks/blob/master/http_logs/README.md

3.2 Performance Testing Results

To evaluate retrieval performance, tests were conducted using query tokens ranging from high-frequency to low-frequency. Under hot-data conditions with single-threaded execution, PolarDB's match function delivered orders-of-magnitude improvements over plain LIKE queries, and its performance was largely unaffected by whether the data was hot or cold.

| Query method | High-frequency "HTTP" (247 million rows) | Relatively high-frequency "French" (15 million rows) | Relatively low-frequency "POST" (100 thousand rows) | Low-frequency "Mozilla" (100 rows) |
| --- | --- | --- | --- | --- |
| LIKE | 1 min 21.96 s | 1 min 18.44 s | 1 min 24.59 s | 1 min 31.19 s |
| SIMD LIKE | 25.46 s | 22.80 s | 21.98 s | 21.60 s |
| match (self-developed FTS library) | 2.43 s | 0.25 s | 0.01 s | 0.00 s |
| Doris match_any (CLucene library) | 3.49 s | 0.24 s | 0.03 s | 0.03 s |

4. Scenarios

Full-text indexing is widely used in business applications such as log analysis, product search, content search, and multimodal hybrid search.

4.1 Log Analysis

The primary requirement of log analysis systems is to quickly find logs containing specific keywords at low costs. Current industry solutions generally fall into two categories:

  1. Full-text search architectures, represented by Elasticsearch, offer high query performance for full-text indexes but suffer from low real-time write throughput and high storage costs, resulting in relatively high overall costs.
  2. Lightweight or no-index architectures, represented by Loki, provide high real-time write throughput and lower storage costs, but their query efficiency is limited and often cannot meet real-time requirements.

PolarDB's columnar full-text indexing, combined with the cold-data archiving feature of partitioned tables, addresses these challenges effectively. Its core technology, the inverted index, enables the rapid retrieval of logs containing matching keywords from massive datasets. The cold-data archiving feature allows older log partitions to be stored in low-cost OSS buckets without affecting query performance, because most queries touch only recent logs.

In summary, PolarDB's columnar full-text indexing, together with hot-and-cold data separation, fully meets the demands of log analysis while delivering several-fold improvements in overall cost-effectiveness.

4.2 Product Search

E-commerce platforms often need to search for products by name or description and quickly return a list of items containing the keyword, ranked by relevance. Common approaches include:

LIKE operator

The LIKE operator is simple and easy to use, but it offers poor performance and does not support relevance-based ranking. In contrast, PolarDB's full-text indexing supports BM25 relevance scoring.
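
For reference, the sketch below computes the standard textbook BM25 contribution of one query term to one document's score; the k1 and b values are common defaults used for illustration and are not necessarily the parameters PolarDB applies.

```python
# Textbook BM25 term score: idf weight times a saturated, length-normalized
# term-frequency component.
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# A term that appears twice in a 100-word document and in 50 of 10,000 documents:
print(bm25_term_score(tf=2, df=50, doc_len=100, avg_doc_len=120, num_docs=10_000))
```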

Traditional database full-text indexing

It is suitable for small, quasi-static datasets, but struggles to handle frequent updates and high-concurrency queries. It also lacks proper Chinese tokenizers.

PolarDB's full-text indexing supports multiple mainstream Chinese tokenizers and customizable dictionaries, allowing better matching for customized product data.

External full-text search engines

Such search engines can handle basic product searches, but add architectural complexity and costs. They are limited in handling complex multi-table joins and large-scale aggregate queries.

PolarDB's built-in columnar full-text indexing simplifies the architecture and accelerates queries by using columnar indexes.

4.3 Content Search

Content search is a traditional application of full-text indexing, primarily used for article keyword retrieval, customer feedback and comment analysis, and user behavior and profile analysis.

PolarDB's columnar full-text indexing provides diverse Chinese tokenizers and uses columnar concurrent execution, offering an efficient and precise solution for large-scale content search.

4.4 Hybrid Search

Hybrid search represents a new application of full-text indexing, combining full-text indexes and vector indexes to search documents.

Full-text indexing

Full-text indexing relies on keywords to accurately match relevant documents and rank them by relevance scores. It excels at exact match and is particularly effective when users know exactly what they are looking for.

Semantic search

Vector embeddings capture semantic similarity, where vectors close in the embedding space represent semantically similar content. Vector indexes use these embeddings to find documents with similar meanings, making them well-suited for natural language queries, similarity search, clustering, and retrieval-augmented generation (RAG) applications.

Hybrid search

Hybrid search integrates full-text search and semantic search, combining the strengths of both approaches to produce a unified ranking. This enhances search accuracy and robustness and is especially valuable in RAG applications.
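
The article does not specify how PolarDB fuses the two result lists, but a common, generic approach is reciprocal rank fusion (RRF), sketched below with hypothetical document IDs.

```python
# Reciprocal rank fusion: each result list contributes 1/(k + rank) per
# document, and documents ranked well in both lists rise to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


fulltext_hits = ["doc7", "doc2", "doc9"]   # ranked by BM25 relevance
vector_hits = ["doc2", "doc5", "doc7"]     # ranked by embedding similarity
print(rrf([fulltext_hits, vector_hits]))   # doc2 and doc7 rise to the top
```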


PolarDB's hybrid search, based on its built-in full-text indexes and vector indexes, enables precise and efficient recall in RAG systems that involve specialized knowledge domains.

5. Summary

PolarDB's columnar full-text indexing employs mainstream inverted-index technology along with a mark-delete design, enabling efficient text search with relevance scoring. PolarDB's built-in full-text indexing simplifies system architecture, and when combined with columnar real-time analysis, it effectively handles the high-concurrency retrieval of large text datasets under frequent updates, providing enterprises with a cost-effective platform for large-scale data search. With the rise of foundation models and new applications, PolarDB also offers hybrid search that integrates full-text indexing and vector indexing. Combined with the summarization and generation capabilities of the Qwen models built into PolarDB for AI, PolarDB effectively addresses the limitations of large language models (LLMs) in accuracy and real-time responsiveness.

Looking ahead, PolarDB will continue to expand its capabilities across online transactions, real-time analysis, full-text indexing, vector search, and stream processing, further deepening the integration of databases, search, stream processing, and AI. This empowers enterprises to build unified, all-in-one data platforms.
