This topic describes what a document retrieval system is and how to build one.
Overview
A document retrieval system retrieves document information quickly and accurately. Based on the content and structure of documents, the system uses natural language processing (NLP) and information retrieval techniques to convert documents into a retrievable form, and then matches and ranks relevant documents against your query.
This system applies to a wide range of scenarios, for example:
In enterprises and organizations, this system is used to manage and retrieve large numbers of documents and knowledge materials so that employees can quickly obtain the information that they need.
In academic research, this system is used to retrieve relevant academic literature to help researchers quickly understand the latest research results.
In news and media, this system is used to retrieve relevant news reports so that important news events can be understood and analyzed in a timely manner.
In law and medicine, this system is used to retrieve relevant legal documents and medical literature.
System building procedure
Create a document table
You can execute the following SQL statement to create a document table:
CREATE TABLE enterprise_context(
id bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key ID',
content text DEFAULT NULL COMMENT 'Content',
PRIMARY KEY (id)
) ENGINE=InnoDB COMMENT='Text data';
Insert document data into the table
You can execute the following SQL statement to insert document data into the table:
INSERT INTO enterprise_context(id,content) VALUES
(1, 'Sinosoft Company Limited with the stock abbreviation "Sinosoft Technology" and the stock code "603927", a listed company, is the product of the knowledge innovation pilot project implemented by the Institute of Software Chinese Academy of Sciences, and is the result of the transformation of the technology research and development subject of the Institute. This company is headquartered in Beijing and has the registered capital of 593.6 million CNY. It is a knowledge-intensive high-tech enterprise specialized in the research and development, application, and service of computer software.'),
(2, 'Sinosoft Technology focuses on large-scale application software development and computer system integration as its core competencies, integrates self-developed industry-generic software products, network information security software products, comprehensive network application platforms, middleware software products, and application tools under one roof, covers all levels of system software, supporting software, intelligent architectural engineering, industry application software, and provides all-round support for large-scale application system engineering.');
Create a document vector table
You can execute the following SQL statement to create a document vector table:
/*polar4ai*/CREATE TABLE vector_table(
id bigint,
content text,
content_vector vector_768,
primary key(id)
);
Vectorize document data
You can vectorize document data offline or online based on your business scenarios.
You can execute the following SQL statement to vectorize document data offline:
/*polar4ai*/select id, content from predict(model _polar4ai_text2vec, select id,content from enterprise_context) with ( x_cols='content', primary_key='id', mode='async', vec_col='content_vector' ) into vector_table;
_polar4ai_text2vec is a model designed to convert text into vectors. This model only converts text into vectors with 768 dimensions. The following table describes the parameters included in with().
Parameter | Description | Example
primary_key | The primary key of the vector table. | id
x_cols | The field that is used to store text. | content
mode | The write mode of the document data. Only the async (asynchronous) mode is supported. | async
vec_col | The field that is used to store the vector in the vector table. | content_vector
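To make the role of each with() parameter concrete, the following Python sketch mimics the offline flow in memory: it reads rows from a source table, embeds the text field, and writes the result to a vector table keyed by the primary key. It is only an illustration under stated assumptions. toy_text2vec is a hypothetical stand-in for _polar4ai_text2vec (a hash-based pseudo-embedding with 8 dimensions instead of 768), and vectorize_offline is an invented helper name, not part of PolarDB for AI.

```python
import hashlib
import math

DIM = 8  # toy dimension for readability; the real model outputs 768 dimensions


def toy_text2vec(text, dim=DIM):
    """Hypothetical stand-in for _polar4ai_text2vec: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]  # unit-length vector


def vectorize_offline(rows, x_cols, primary_key, vec_col):
    """Copy each source row, add its vector under vec_col, and key it by primary_key."""
    vector_table = {}
    for row in rows:
        out = dict(row)
        out[vec_col] = toy_text2vec(row[x_cols])
        vector_table[row[primary_key]] = out
    return vector_table


# Shortened sample rows that play the role of the enterprise_context table.
enterprise_context = [
    {"id": 1, "content": "Sinosoft Company Limited is a listed company."},
    {"id": 2, "content": "Sinosoft Technology focuses on application software."},
]

vector_table = vectorize_offline(
    enterprise_context, x_cols="content", primary_key="id", vec_col="content_vector"
)
```

In this sketch, x_cols names the input text field, primary_key names the field that identifies each output row, and vec_col names the field that receives the vector, which mirrors the parameter descriptions in the table.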
You can execute the following SQL statement to vectorize data online:
/*polar4ai*/SELECT * FROM predict(model _polar4ai_text2vec, SELECT 'What is the stock code of Sinosoft Technology') with();
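Because the question is embedded in the statement as a string literal, an unbalanced or unescaped single quote makes the statement invalid. The following Python sketch shows one way an application might assemble the online vectorization statement from user input. build_online_vectorize_sql is a hypothetical helper, and doubling single quotes is the standard SQL escaping convention; whether the polar4ai parser accepts it (and how it treats backslashes) is an assumption that should be verified before use.

```python
def build_online_vectorize_sql(question):
    """Build the online vectorization statement for a user question.

    Assumption: single quotes inside the literal are escaped by doubling them,
    as in standard SQL. This has not been verified against polar4ai.
    """
    escaped = question.replace("'", "''")
    return (
        "/*polar4ai*/SELECT * FROM predict(model _polar4ai_text2vec, "
        "SELECT '" + escaped + "') with();"
    )


statement = build_online_vectorize_sql("What is the stock code of Sinosoft Technology")
```

Where the client library supports parameterized statements, those are generally preferable to string concatenation.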
Perform a vector search
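You can execute the following SQL statement to perform a vector search. In the statement, [1,2,3,4,5……,768] stands for the 768-dimensional query vector, such as one obtained by vectorizing the query text online: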
/*polar4ai*/SELECT id,'distance(content_vector,[1,2,3,4,5……,768])' FROM vector_table LIMIT 10;
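Conceptually, a vector search compares the query vector with each stored vector and returns the closest rows. The following Python sketch illustrates that idea with 3-dimensional toy vectors. It assumes Euclidean distance and an ascending sort on distance. The source statement does not specify which metric distance() uses or how results are ordered, so both are assumptions, and euclidean_distance and top_k are illustrative names only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two vectors of equal dimension (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, rows, k=10):
    """Return up to k (id, distance) pairs, closest first."""
    scored = [(row_id, euclidean_distance(vec, query_vector)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Toy stand-in for vector_table: (id, content_vector) pairs.
rows = [
    (1, [0.0, 0.0, 1.0]),
    (2, [1.0, 0.0, 0.0]),
    (3, [0.9, 0.1, 0.0]),
]

result = top_k([1.0, 0.0, 0.0], rows, k=2)
```

Here the query vector is identical to the vector of row 2, so row 2 ranks first, followed by row 3.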