PolarDB: Build a document retrieval system

Last Updated: May 26, 2025

This topic describes what a document retrieval system is and how to build one.

Overview

A document retrieval system retrieves document information quickly and accurately. Based on the content and structure of documents, the system uses natural language processing and information retrieval techniques to convert documents into a retrievable form, and then matches and ranks relevant documents against your query.

This system applies to a wide range of scenarios, for example:

  • In enterprises and organizations, this system is used to manage and retrieve a large number of documents and knowledge materials so that employees can quickly obtain information that they need.

  • In the field of academic research, this system is used to retrieve and obtain relevant academic literature to help researchers quickly understand and master the latest research results.

  • In the field of news and media, this system is used to retrieve and obtain relevant news reports to understand and analyze important news events in time.

This system can also be applied in the fields of law and medicine to retrieve relevant legal documents and medical literature.

System building procedure

Create a document table

You can execute the following SQL statement to create a document table:

CREATE TABLE enterprise_context(
    id bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key ID',
    content text DEFAULT NULL COMMENT 'Content',
    PRIMARY KEY (id)
) ENGINE=InnoDB COMMENT='Text data';

Insert document data into the table

You can execute the following SQL statement to insert document data into the table:

INSERT INTO enterprise_context(id,content) VALUES
(1, 'Sinosoft Company Limited with the stock abbreviation "Sinosoft Technology" and the stock code "603927", a listed company, is the product of the knowledge innovation pilot project implemented by the Institute of Software Chinese Academy of Sciences, and is the result of the transformation of the technology research and development subject of the Institute. This company is headquartered in Beijing and has the registered capital of 593.6 million CNY. It is a knowledge-intensive high-tech enterprise specialized in the research and development, application, and service of computer software. '),
(2, 'Sinosoft Technology focuses on large-scale application software development and computer system integration as its core competencies, integrates self-developed industry-generic software products, network information security software products, comprehensive network application platforms, middleware software products, and application tools under one roof, covers all levels of system software, supporting software, intelligent architectural engineering, industry application software, and provides all-round support for large-scale application system engineering. ');

Create a document vector table

You can execute the following SQL statement to create a document vector table:

/*polar4ai*/CREATE TABLE vector_table(
  id bigint,
  content text, 
  content_vector vector_768, 
  primary key(id)
);

Vectorize document data

You can vectorize document data offline or online based on your business scenarios.

  • You can execute the following SQL statement to vectorize document data offline:

    /*polar4ai*/select id, content from predict(model _polar4ai_text2vec, select id,content from enterprise_context) 
    with (
      x_cols='content',
      primary_key='id',
      mode='async',
      vec_col='content_vector'
    )
    into vector_table;

    _polar4ai_text2vec is a model designed to convert text into vectors. This model only converts text into vectors with 768 dimensions. The following table describes the parameters included in with().

    Parameter     Description                                                                             Example
    primary_key   The primary key of the vector table.                                                    id
    x_cols        The field that is used to store text.                                                   content
    mode          The write mode of the document data. Only the async (asynchronous) mode is supported.   async
    vec_col       The field that is used to store the vector in the vector table.                         content_vector

  • You can execute the following SQL statement to vectorize document data online:

    /*polar4ai*/SELECT * FROM predict(model _polar4ai_text2vec, SELECT 'What is the stock code of Sinosoft Technology') with();
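Because _polar4ai_text2vec produces 768-dimensional vectors and the vector table column is typed vector_768, client code that handles these vectors may want to check the dimension before using them. The following Python sketch is an illustration only: to_vector_literal and EXPECTED_DIM are hypothetical names that are not part of PolarDB, and the bracketed text format simply mirrors the array notation used in this topic.

```python
EXPECTED_DIM = 768  # dimension of the vector_768 column used in this topic


def to_vector_literal(vector):
    """Render a list of numbers as a bracketed literal such as [0.1,0.2].

    Hypothetical helper: it only validates the length and formats the text.
    It does not connect to PolarDB.
    """
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"expected {EXPECTED_DIM} dimensions, got {len(vector)}"
        )
    # float() rejects non-numeric items; repr() keeps full float precision.
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"
```

A vector of the wrong length raises ValueError instead of producing a literal that the vector column could not hold.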

Perform a vector search

You can execute the following SQL statement to perform a vector search. The array in the statement is an abbreviated placeholder for a 768-dimensional query vector:

/*polar4ai*/SELECT id,'distance(content_vector,[1,2,3,4,5……,768])' FROM vector_table LIMIT 10;
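The statement above computes a distance between each stored vector and the query vector. To make the idea concrete, the following Python sketch ranks rows by distance in memory. It is a minimal illustration under stated assumptions: this topic does not say which metric distance() uses, so cosine distance is assumed here, and cosine_distance, top_k, and the toy rows are hypothetical names and data rather than PolarDB functions or results.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=10):
    """Return the k (id, distance) pairs closest to the query vector.

    rows is an iterable of (id, vector) pairs standing in for vector_table.
    """
    scored = [(row_id, cosine_distance(query, vec)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 3-dimensional vectors; real rows would hold 768 values each.
rows = [(1, [1.0, 0.0, 0.0]), (2, [0.0, 1.0, 0.0]), (3, [0.9, 0.1, 0.0])]
nearest = top_k([1.0, 0.0, 0.0], rows, k=2)
print([row_id for row_id, _ in nearest])  # prints [1, 3]
```

Row 1 matches the query exactly and row 3 points in nearly the same direction, so both rank ahead of row 2, which is orthogonal to the query.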