Build a vector-based document retrieval system for RAG - PolarDB

This topic describes what a document retrieval system is and how to build one.

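Overview

A document retrieval system is designed to retrieve document information quickly and accurately. Based on the content and structure of the documents, it uses natural language processing and information retrieval techniques to convert the documents into a searchable form, and then matches and ranks the relevant documents according to the query.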

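Such a system applies to a wide range of scenarios. Examples:

Enterprises and organizations use it to manage and retrieve large volumes of documents and knowledge materials so that employees can quickly find the information they need.
In academic research, it is used to retrieve relevant literature so that researchers can quickly catch up on the latest findings.
In news and media, it is used to retrieve relevant news reports so that important events can be understood and analyzed in a timely manner.

The system can also be used in the legal and medical fields to retrieve relevant legal documents and medical literature.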

Build the system

Create a document table

Execute the following SQL statement to create a document table:

CREATE TABLE enterprise_context(
    id bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key ID',
    content text DEFAULT NULL COMMENT 'Content',
    PRIMARY KEY (id)
) ENGINE=InnoDB COMMENT='Text data';

Insert document data into the table

Execute the following SQL statement to insert document data into the table:

INSERT INTO enterprise_context(id,content) VALUES
(1, 'Sinosoft Company Limited with the stock abbreviation "Sinosoft Technology" and the stock code "603927", a listed company, is the product of the knowledge innovation pilot project implemented by the Institute of Software Chinese Academy of Sciences, and is the result of the transformation of the technology research and development subject of the Institute. This company is headquartered in Beijing and has the registered capital of 593.6 million CNY. It is a knowledge-intensive high-tech enterprise specialized in the research and development, application, and service of computer software. '),
(2, 'Sinosoft Technology focuses on large-scale application software development and computer system integration as its core competencies, integrates self-developed industry-generic software products, network information security software products, comprehensive network application platforms, middleware software products, and application tools under one roof, covers all levels of system software, supporting software, intelligent architectural engineering, industry application software, and provides all-round support for large-scale application system engineering. ');

Create a document vector table

Execute the following SQL statement to create a document vector table:

/*polar4ai*/CREATE TABLE vector_table(
  id bigint,
  content text, 
  content_vector vector_768, 
  primary key(id)
);

Vectorize document data

You can vectorize document data offline or online, depending on your business scenario.

Execute the following SQL statement to vectorize document data offline:

/*polar4ai*/select id, content from predict(model _polar4ai_text2vec, select id,content from enterprise_context) 
with (
  x_cols='content', /* Field that stores the text */
  primary_key='id', /* Primary key of the vector table */
  mode='async', /* Write mode for the document data. Only async (asynchronous) mode is supported */
  vec_col='content_vector' /* Field of the vector table that stores the vectors */
)
into vector_table;

_polar4ai_text2vec is a model that converts text to vectors. It produces 768-dimensional vectors only. The following table describes the parameters in with().

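Parameter	Description	Example
primary_key	The primary key of the vector table.	id
x_cols	The field that stores the text.	content
mode	The write mode for the document data. Only async (asynchronous) mode is supported.	async
vec_col	The field of the vector table that stores the vectors.	content_vector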
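
As a minimal sketch of how a client application might assemble the offline vectorization statement from these parameters, the following Python helper formats the same statement as a string. The function name build_offline_vectorize_sql and its arguments are hypothetical and not part of PolarDB. The helper only builds text; it does not validate identifiers or execute anything.

```python
def build_offline_vectorize_sql(source_table, target_table, text_col="content",
                                key_col="id", vec_col="content_vector"):
    """Return the offline vectorization statement as a string.

    Hypothetical helper: it mirrors the statement in this topic and does
    not validate identifiers or run anything against the database.
    """
    return (
        "/*polar4ai*/select {key}, {text} from predict("
        "model _polar4ai_text2vec, select {key},{text} from {src}) "
        "with (x_cols='{text}', primary_key='{key}', "
        "mode='async', vec_col='{vec}') "
        "into {dst};"
    ).format(key=key_col, text=text_col, src=source_table,
             vec=vec_col, dst=target_table)


sql = build_offline_vectorize_sql("enterprise_context", "vector_table")
print(sql)
```

Keeping mode fixed at 'async' in the helper reflects the table: no other write mode is supported.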

Execute the following SQL statement to vectorize document data online:

/*polar4ai*/SELECT * FROM predict(model _polar4ai_text2vec, SELECT 'What is the stock code of Sinosoft Technology') with();
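
If the text to vectorize comes from user input, a client might build the online statement programmatically. The sketch below is one way to do that in Python. The function name build_online_vectorize_sql is hypothetical, and escaping single quotes by doubling them is the standard SQL convention, assumed here to be accepted by the PolarDB for AI parser; confirm that before relying on it.

```python
def build_online_vectorize_sql(text):
    """Return the online vectorization statement for one piece of text.

    Assumption: single quotes are escaped by doubling them, as in
    standard SQL string literals. This topic does not confirm that the
    PolarDB for AI parser accepts that form.
    """
    escaped = text.replace("'", "''")
    return (
        "/*polar4ai*/SELECT * FROM predict("
        "model _polar4ai_text2vec, SELECT '{0}') with();"
    ).format(escaped)


print(build_online_vectorize_sql("What is the stock code of Sinosoft Technology"))
```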

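Perform a vector search

Execute the following SQL statement to perform a vector search. In the statement, [1,2,3,4,5……,768] is a placeholder for a 768-dimensional query vector.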

/*polar4ai*/SELECT id,'distance(content_vector,[1,2,3,4,5……,768])' FROM vector_table LIMIT 10;
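
Conceptually, a vector search compares the query vector with each stored vector and returns the closest rows. The following Python sketch illustrates that idea on toy 3-dimensional vectors rather than 768-dimensional ones. It assumes Euclidean distance purely for illustration, because this topic does not state which metric distance() uses. The names euclidean_distance and top_k are hypothetical, and sorting closest-first is a choice of this sketch, not something the statement above does.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, rows, k=10):
    """Return up to k (id, distance) pairs, closest first.

    rows is an iterable of (id, vector) pairs, standing in for the
    id and content_vector columns of vector_table.
    """
    scored = [(row_id, euclidean_distance(query, vec)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data: 3 dimensions instead of 768, values made up for illustration.
rows = [(1, [0.0, 0.0, 0.0]), (2, [3.0, 4.0, 0.0]), (3, [1.0, 0.0, 0.0])]
result = top_k([0.0, 0.0, 0.0], rows, k=2)
print(result)
```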