ApsaraDB RDS for PostgreSQL provides powerful vector storage and efficient full-text search capabilities. These capabilities make ApsaraDB RDS for PostgreSQL an ideal database for retrieval-augmented generation (RAG) applications that combine vector-based and keyword-based retrieval.
Introduction
This topic uses a dedicated ticket chatbot as an example to demonstrate the advantages of ApsaraDB RDS for PostgreSQL in building RAG applications.
Procedure
Using ApsaraDB RDS for PostgreSQL to create a dedicated ticket chatbot involves several crucial steps: data processing, multi-path recall, merge sort, and Q&A analysis.
Data processing
The source data, including data from official documentation, knowledge bases, and historical tickets, is processed and stored in an ApsaraDB RDS for PostgreSQL instance. Data processing consists of data segmentation and embedding.
Multi-path recall
Document keyword-based recall: performs a similarity search to match user questions with the keywords in the document table and returns the top N documents that have the highest similarity.
Content keyword-based recall: performs a similarity search based on the keywords of the document content to match user questions with the keywords.
BM25-based recall: performs statistical analysis on the relevance scores based on the word frequency, word distance, and importance of the document modules to which the words belong, and returns the content with a high score.
Embedding-based recall: converts a user question into an embedding, calculates the cosine similarity between the question embedding and the stored content embeddings, and returns the content with the highest similarity.
Merge sort
The reciprocal rank fusion (RRF) algorithm and the open source bce-reranker-base_v1 model are used to accurately sort the results of multi-path recalls.
Q&A analysis
Questions and their answers are stored in the database. During the test phase, Q&A performance is scored to evaluate the effects of different retrieval policies.
Data processing
The source data, including data from official documentation, knowledge bases, and historical tickets, is processed and stored in an RDS instance. Data processing consists of data segmentation and embedding.
Data acquisition
Relevant data is obtained based on the purpose of a RAG application. This topic uses an ApsaraDB RDS for PostgreSQL-based DingTalk ticket chatbot as an example to collect data from help documents, knowledge bases, and historical tickets.
Data processing
The source data is processed and stored in an RDS instance in a specific format. For example, you can use the HTMLHeaderTextSplitter class provided by the LangChain framework to segment the text of a help document based on the HTML hierarchy, such as H1 and H2. You can specify the data segmentation size and data overlap size to flexibly control the text segmentation performance. For more information, see the various text splitters provided by LangChain.
If you need to process a Markdown document, you can use the MarkdownHeaderTextSplitter class and flags such as # and ## to perform hierarchical splitting.
Data storage
Data is stored in two core tables: document and embedding. The document table is used to store document information. The embedding table is used to store embedding information after the segmentation.
document table
Example:
\d document
                                       Table "public.document"
       Field       |            Type             | Collation | Nullable |               Default
-------------------+-----------------------------+-----------+----------+--------------------------------------
 id                | bigint                      |           | not null | nextval('document_id_seq'::regclass)
 title             | character varying(255)      |           |          |
 url               | character varying(255)      |           |          | ''::character varying
 key_word          | character varying(255)      |           |          | ''::character varying
 tag               | character varying(255)      |           |          | ''::character varying
 created           | timestamp without time zone |           | not null | now()
 modified          | timestamp without time zone |           | not null | now()
 key_word_tsvector | tsvector                    |           |          |
 product_name      | character varying(255)      |           |          | ''::character varying
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_key_word_tsvector_gin" gin (key_word_tsvector)
    "document_product_name_key" btree (product_name)
    "document_title_key" UNIQUE CONSTRAINT, btree (title)
Triggers:
    trigger_update_tsvector BEFORE INSERT OR UPDATE ON document FOR EACH ROW EXECUTE FUNCTION update_tsvector()
The table contains the following fields: id, title, url, key_word, tag, created, modified, key_word_tsvector, and product_name. The key_word field indicates the keyword. The created field indicates the creation time. The modified field indicates the modification time. The key_word_tsvector field indicates the weighted tsvector value of the keyword. The key_word_tsvector field is used for keyword matching and as a part of a content keyword-based recall.
Indexes
The indexes include document_pkey, document_title_key, document_product_name_key, and document_key_word_tsvector_gin. The document_pkey index is the primary key index. The document_title_key index is a unique index on document titles; titles must be unique because documents are updated based on the document title. The document_product_name_key index is an index on product names. The document_key_word_tsvector_gin index is a GIN index on the tsvector values. GIN indexes are supported in ApsaraDB RDS for PostgreSQL and accelerate full-text search.
Triggers
When the document table is updated or data is inserted into the table, the trigger automatically updates content in the key_word_tsvector column.
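The body of the update_tsvector() function is not shown in the schema dump. Based on the keyword weighting that is described later in this topic, a plausible sketch of the function is as follows. This is a hypothetical implementation for illustration; the actual function in your instance may differ.

```sql
CREATE OR REPLACE FUNCTION update_tsvector() RETURNS trigger AS $$
BEGIN
    -- Segment the keyword with pg_jieba and assign weight A (assumed configuration).
    NEW.key_word_tsvector := setweight(to_tsvector('jiebacfg', coalesce(NEW.key_word, '')), 'A');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```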
embedding table
Example:
\d embedding
                                       Table "public.embedding"
       Field       |            Type             | Collation | Nullable |                Default
-------------------+-----------------------------+-----------+----------+---------------------------------------
 id                | bigint                      |           | not null | nextval('embedding_id_seq'::regclass)
 doc_id            | integer                     |           |          | '-1'::integer
 content_chunk     | text                        |           | not null |
 content_embedding | vector(1536)                |           | not null |
 created           | timestamp without time zone |           | not null | now()
 modified          | timestamp without time zone |           | not null | now()
 ts_vector_extra   | tsvector                    |           |          |
Indexes:
    "embedding_pkey" PRIMARY KEY, btree (id)
    "embedding_content_embedding_idx" hnsw (content_embedding vector_cosine_ops) WITH (m='16', ef_construction='64')
    "embedding_doc_id_key" btree (doc_id)
    "embedding_rumidx" rum (ts_vector_extra)
Triggers:
    embedding_tsvector_update BEFORE INSERT OR UPDATE ON embedding FOR EACH ROW EXECUTE FUNCTION tsvector_update_trigger('ts_vector_extra', 'public.jiebacfg', 'content_chunk')
The embedding table contains the following fields: id, doc_id, content_chunk, content_embedding, created, modified, and ts_vector_extra. The doc_id field indicates the document ID. The content_chunk field indicates the chunk after segmentation. The content_embedding field indicates the embedding of the chunk obtained after word segmentation. The created field indicates the creation time. The modified field indicates the modification time. The ts_vector_extra field indicates the tsvector value of the chunk after segmentation.
Indexes
The indexes include embedding_pkey, embedding_doc_id_key, embedding_content_embedding_idx, and embedding_rumidx. The embedding_pkey index is the index for primary keys. The embedding_doc_id_key index is the index for document IDs. The embedding_content_embedding_idx index is a vector index for the content_embedding column. The embedding_rumidx index is the RUM index for tsvector values.
Triggers
When the embedding table is updated or data is inserted into the table, the trigger automatically updates content in the ts_vector_extra column.
Multi-path recall
Document keyword-based recall
A document keyword-based recall performs a similarity search to match user questions with the keywords in the document table and returns the top N documents that have the highest similarity.
Document keyword-based recall uses the GIN indexes that are predefined in ApsaraDB RDS for PostgreSQL to efficiently match the keywords in the RDS instance with user questions.
Convert the keywords in the document to tsvector values and assign weights for the tsvector values.
During the conversion, you must specify a word segmentation mode. In most cases, the Chinese word segmentation extensions pg_jieba and zhparser are used. For more information about how to install extensions, see Manage extensions.
You can use the to_tsvector function to segment a keyword and then use the setweight function to specify weights for the strings after segmentation. This adds weight information to the location information. For example, you use the pg_jieba extension, set the word segmentation mode to jiebacfg, and use the setweight function to set the weight of the document keyword to A.
SELECT setweight(to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库'), 'A');
                                   setweight
-------------------------------------------------------------------------------
 'postgresql':1A '世界':3A '先进':5A '关系':8A '型':9A '开源':7A '数据库':10A

Convert the user question into a tsquery value and match the tsquery value with the document keywords.
Use the full-text search capability of ApsaraDB RDS for PostgreSQL to match the user question with the document keyword. Sample statement:
SELECT
    id, title, url, key_word,
    ts_rank(
        key_word_tsvector,
        to_tsquery(replace(text(plainto_tsquery('jiebacfg', '%s')), '&', '|'))
    ) AS score
FROM public.document
WHERE key_word_tsvector @@ to_tsquery(replace(text(plainto_tsquery('jiebacfg', '%s')), '&', '|'))
    AND product_name = '%s'
ORDER BY score DESC
LIMIT 1;

to_tsquery function
Use the to_tsquery function to convert a user question into a tsquery value. Example:
SELECT to_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库');
                                   to_tsquery
--------------------------------------------------------------------------------
 'postgresql' <2> '世界' <2> '先进' <2> '开源' <-> '关系' <-> '型' <-> '数据库'

<2> indicates the distance between two words. <-> indicates an adjacent relationship. For example, 开源 and 关系 must be adjacent. Stopwords, such as 是 and 的, are automatically removed. An ampersand (&) indicates AND, a vertical bar (|) indicates OR, and an exclamation point (!) indicates NOT. You can also specify a weight. Example:

SELECT to_tsquery('jiebacfg', 'postgresql:A');
   to_tsquery
----------------
 'postgresql':A

The input to the to_tsquery function must meet the operator requirements of tsquery. To ensure that a user question can always be converted into a tsquery value, you can use the plainto_tsquery function instead of the to_tsquery function, which effectively ignores invalid operators. Example:
SELECT to_tsquery('jiebacfg','日志|&堆积');
ERROR:  syntax error in tsquery: "日志|&堆积"

SELECT plainto_tsquery('jiebacfg','日志|&堆积');
 plainto_tsquery
-----------------
 '日志' & '堆积'

Use the text function to convert the result returned by the plainto_tsquery function into a text value. Then, use the replace function to replace & with | to convert the match condition from AND to OR. Examples:

-- Use the plainto_tsquery function.
SELECT plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库');
                          plainto_tsquery
--------------------------------------------------------------------
 'postgresql' & '世界' & '先进' & '开源' & '关系' & '型' & '数据库'

-- Use the plainto_tsquery, text, and replace functions.
SELECT replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|');
                              replace
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系' | '型' | '数据库'

-- Use the to_tsquery, plainto_tsquery, text, and replace functions.
SELECT to_tsquery(replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|'));
                             to_tsquery
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系' | '型' | '数据库'

You can use the custom word segmentation feature of the pg_jieba extension to create custom word segmentation methods. For example, the segmentation result is '关系' & '型' before you add 关系型 to a custom dictionary and '关系型' after you add it.

-- Add 关系型 to custom dictionary 0 with a weight of 100000.
INSERT INTO jieba_user_dict VALUES ('关系型',0,100000);
-- Load custom dictionary 0. The first 0 indicates the sequence number of the custom dictionary, and the second 0 indicates that the default dictionary is loaded.
SELECT jieba_load_user_dict(0,0);
 jieba_load_user_dict
----------------------

-- Convert a user question into a tsquery value.
SELECT to_tsquery(replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|'));
                             to_tsquery
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系型' | '数据库'

Operators
ApsaraDB RDS for PostgreSQL provides various operators to perform operations on tsvector and tsquery values. For example, the @@ operator indicates whether a tsvector value matches a tsquery value.

SELECT to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库') @@ to_tsquery('jiebacfg', 'postgresql:A');
 ?column?
----------
 f

SELECT setweight(to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库'),'A') @@ to_tsquery('jiebacfg', 'postgresql:A');
 ?column?
----------
 t

The query condition is the word postgresql with a weight of A. Even if the tsvector value that you query contains the word postgresql, false is returned because the A weight is not assigned. If the A weight is assigned to the tsvector value by using the setweight function, true is returned.

ts_rank function
Use the ts_rank function to calculate the match degree between the tsquery value and the tsvector value that you query. In the following example, the query condition is postgresql | 开源, so a matched row must contain at least one of the postgresql and 开源 keywords. The first sentence contains both keywords, so its score is higher than that of the second sentence, which contains only the 开源 keyword. The third sentence is filtered out by the @@ operator because it contains neither postgresql nor 开源.

WITH sentence AS (
    SELECT 'PostgreSQL是世界上先进的开源关系型数据库' AS content
    UNION ALL
    SELECT 'MySQL是应用广泛的开源关系数据库'
    UNION ALL
    SELECT 'MySQL在全球非常流行'
)
SELECT content,
       ts_rank(to_tsvector('jiebacfg', content), to_tsquery('jiebacfg', 'postgresql | 开源')) AS score
FROM sentence
WHERE to_tsvector('jiebacfg', content) @@ to_tsquery('jiebacfg', 'postgresql | 开源')
ORDER BY score DESC;

                  content                   |    score
--------------------------------------------+-------------
 PostgreSQL是世界上先进的开源关系型数据库   |  0.06079271
 MySQL是应用广泛的开源关系数据库            | 0.030396355

In extreme cases, word segmentation problems or incorrect characters in the input may prevent any document from matching. To prevent this issue, ApsaraDB RDS for PostgreSQL allows you to use the pg_bigm extension to perform fuzzy match-based searches. You can use the bigm_similarity function to return the document that has the highest similarity. Example:
WITH sentence AS (
    SELECT 'PostgreSQL是世界上先进的开源关系型数据库' AS content
    UNION ALL
    SELECT 'MySQL是应用广泛的开源关系数据库'
    UNION ALL
    SELECT 'MySQL在全球非常流行'
)
SELECT content,
       bigm_similarity(content, 'postgres | 开源产品') AS score
FROM sentence
ORDER BY score DESC;

                  content                   |   score
--------------------------------------------+------------
 PostgreSQL是世界上先进的开源关系型数据库   | 0.23076923
 MySQL是应用广泛的开源关系数据库            | 0.05263158
 MySQL在全球非常流行                        |        0.0
(3 rows)

The bigm_similarity function splits the two input texts into 2-gram elements, which are combinations of two consecutive characters, and calculates the similarity based on the number of elements that the two texts share. Valid values: [0, 1]. The value 1 indicates that the texts are identical. If the segmentation is inaccurate, spelling errors exist, or abbreviations are used, you can use the pg_bigm extension to perform fuzzy match-based searches. For more information, see Use the pg_bigm extension to perform fuzzy match-based queries.
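The 2-gram idea behind bigm_similarity can be sketched in a few lines of Python. This is a simplified Jaccard-style version for illustration only; the exact formula that pg_bigm uses may differ.

```python
def bigrams(text):
    """Return the set of 2-gram elements (pairs of consecutive characters)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_similarity(a, b):
    """Similarity in [0, 1] based on shared 2-grams (Jaccard-style sketch)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Because matching works on raw character pairs rather than segmented words, this approach tolerates segmentation errors, typos, and abbreviations.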
Content keyword-based recall
To search document content, you can use the same method as for document keyword-based recall. Document content is usually much longer than a keyword, so if you perform a full-text search, we recommend that you use the RUM extension to accelerate the search. The following query plans compare different index choices for matching user questions against the keywords of the document content.

The RUM extension is based on the GIN index structure and stores additional information, such as the position and timestamp of a word. The RUM query plan uses the embedding_rumidx index to filter data, calculate the similarity, and sort the results. All operations are completed by the index, which ensures query efficiency. The query takes 3.219 ms.

The query plan that uses native GIN indexes built on the redundant ts_vector_extra column is more complex than the RUM plan. It starts two worker processes and uses a bitmap index scan instead of the index scan used by RUM. Specifically, the plan uses the index specified by embedding_ts_vector_gin to obtain a bitmap of all rows that meet the query condition, performs a secondary filter to find valid results, sorts the results by using the Top-N algorithm, and then performs a gather merge across the processes to combine the results. The query takes 14.234 ms.

GIN indexes can also be built directly on to_tsvector('jiebacfg'::regconfig, content_chunk). The execution plan of these GIN indexes is similar to that of the GIN indexes built on the ts_vector_extra column: the WHERE conditions are the same, and the number of rows that enter the ranking stage is also similar. The main difference lies in the calculation of ts_rank. The ts_rank function requires word position information, which is not stored when this kind of GIN index is created. As a result, the to_tsvector function is executed on each row when ts_rank is used, which is time-consuming. The query takes 1081.547 ms.
BM25-based recall
BM25 is a classic text matching algorithm that takes into account term frequency (TF) and inverse document frequency (IDF). A word with a large TF frequently appears in a document and is strongly related to that document. A word with a small IDF appears in many documents and therefore carries little discriminative importance. The BM25 algorithm introduces tunable parameters on top of the TF-IDF algorithm to improve the query effect.
BM25-based recall is used as a keyword-based retrieval method and is compared with the keyword-based retrieval method of ApsaraDB RDS for PostgreSQL. The keyword-based retrieval method of ApsaraDB RDS for PostgreSQL uses built-in functions such as ts_rank and considers the word frequency, word distance, and word importance in each part of the document. The BM25-based recall result can be regarded as a part of keyword-based retrieval results to improve the retrieval accuracy.
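To make the scoring concrete, the following is a minimal BM25 sketch over pre-tokenized documents. It is illustrative only; a production system would use a tuned implementation such as a search engine or a dedicated BM25 library.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms with BM25.

    docs: list of token lists. Returns one score per document.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        dl = len(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores
```

The k1 parameter dampens the effect of repeated terms, and b controls how strongly long documents are penalized; these are the tunable parameters that distinguish BM25 from plain TF-IDF.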
Embedding-based recall
ApsaraDB RDS for PostgreSQL supports the pgvector extension for high-dimensional vector similarity search and the rds_embedding extension for vector generation. The pgvector extension provides the vector data type and basic vector operations, such as calculating the distance and similarity between vectors. The rds_embedding extension calls external models to convert text into high-dimensional embeddings. For more information, see Use the pgvector extension to perform high-dimensional vector similarity searches and Use the rds_embedding extension to generate vectors.
In ApsaraDB RDS for PostgreSQL, a vector is stored much like an array, but the column must be defined with the vector data type. If you do not create an index on the vector column, vector calculations and ranking require full table scans, and the overhead significantly increases.

The pgvector extension supports the Hierarchical Navigable Small World (HNSW) and Inverted File with Flat Compression (IVFFlat) indexing methods for approximate nearest neighbor (ANN) search. Unlike IVFFlat, HNSW does not require existing data to build the index and delivers a faster query speed. As a result, HNSW indexing is used in this method.
SELECT
embedding.id,
doc_id,
content_chunk,
content_embedding <=> '%s' AS similarity
FROM
public.embedding
LEFT JOIN
document ON document.id = embedding.doc_id
WHERE
product_name = '%s'
ORDER BY
similarity
LIMIT %s;

For more information about the best practices for IVFFlat indexing, see Build an LLM-driven dedicated chatbot on top of ApsaraDB RDS for PostgreSQL.
For more information about the performance test of the pgvector extension, see Use the pgvector extension to perform high-dimensional vector similarity searches.
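As a plain-Python illustration of what the <=> cosine-distance ordering in the preceding SQL statement computes, the following sketch ranks chunks by cosine similarity. The chunk dictionary structure and the assumption that embeddings are available as float lists are illustrative; the production system performs this work inside the database with pgvector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_by_embedding(question_embedding, chunks, n=3):
    """Return the n chunks whose embeddings are most similar to the question."""
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(question_embedding, c["embedding"]),
        reverse=True,
    )[:n]
```

Note that pgvector's <=> operator returns the cosine distance (1 - cosine similarity), which is why the SQL statement sorts in ascending order while this sketch sorts similarity in descending order.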
Merge sort
This method uses the RRF algorithm and the open source bce-reranker-base_v1 model to accurately sort the results of document keyword-based recall, content keyword-based recall, BM25-based recall, and embedding-based recall.
The RRF algorithm is easy to understand. The score of a document d is calculated as follows:

RRFscore(d) = sum over all recall systems i of 1 / (k + rank_i(d))

where rank_i(d) indicates the ranking of document d in recall system i, and k is a constant that is typically set to 60. The reciprocal rankings of the document across the systems are added together to obtain the final result: the higher a document ranks, the larger its reciprocal rank, and therefore the larger the calculated RRF value. If a document ranks high in multiple systems, its RRF value significantly increases. The RRF algorithm can effectively sort the chunks returned by multi-path recall.

bce-reranker-base_v1 is a cross-language semantic representation model that aims to optimize semantic search results and relevance rankings. The model supports Chinese, English, Japanese, and Korean, and has strong fine-ranking capabilities. In practice, fine ranking with this model takes longer than performing the recalls, especially when a large number of chunks need to be sorted, which may affect session fluency. To achieve a faster response from the dedicated ticket chatbot, you can use the simple and efficient RRF algorithm to sort the results, or perform fine ranking after RRF ranking. If you require high accuracy, you can directly use the bce-reranker-base_v1 model to sort the results.
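The RRF calculation described above can be implemented in a few lines. The document IDs in the example are illustrative; in the chatbot, the ranked lists would be the chunk IDs returned by the four recall paths.

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists of document IDs with reciprocal rank fusion.

    rankings: list of ranked lists (best result first).
    Returns document IDs sorted by descending RRF score.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each appearance contributes 1 / (k + rank) to the document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document such as "b" below, which ranks first in several lists, accumulates the largest fused score.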
Q&A analysis
Different policies are defined for the dedicated ticket chatbot to respond to different data sources. This helps prevent unexpected processing of content in the prompt_content field when a large model is called.
The content from the knowledge base is not used as the input of the prompt_content field for the large model to process. Instead, the content is directly provided.
The content from official documentation is summarized and formatted by the large model because the obtained HTML file is segmented and may have layout issues and duplicate content.
The content from the large model is directly provided only when the knowledge base and official documentation cannot provide the required answers. A large model is directly called to resolve issues in a small number of scenarios. In most cases, a large model serves as a general knowledge assistant.
The content from historical tickets is provided only with the ticket name and URL.
Relevant documents and their links are provided for each part of the answer. If the problem cannot be resolved based on the answer, we recommend that you check the original document to obtain more comprehensive and accurate content.
In the testing phase, the score of the most recent question and answer can be used as a preliminary assessment of the effectiveness of the current policy. You can write code to create multiple policies and then determine the most suitable recall policy based on testing.
The following prompt (in Chinese) asks the large model to organize and format the retrieved content and then answer the user question based on its own ability:

prompt = f'''请整理并格式化下面的内容并整理输出格式,
```
{prompt_content}
```
基于自己的能力做出回答,我的问题是:{question}。
'''

Connect to a DingTalk chatbot
You can use Streamlit to create a web app or connect to the DingTalk chatbot. A web app is used for self-testing and document management. The DingTalk chatbot allows all users to access the service.
In each Q&A session in a DingTalk group, a new connection needs to be established at the database level. Frequent connection establishment may cause performance issues, including time and memory consumption. If connections are not promptly released, the number of connections may reach the upper limit and new connections cannot be established to the database. You can use the connection pool in your project to prevent frequent short-lived connections. You can also use the built-in PgBouncer connection pool of ApsaraDB RDS for PostgreSQL.
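To illustrate why pooling helps, the following is a minimal queue-based pool sketch. In practice you would use PgBouncer or your driver's built-in pool; the factory argument here is a placeholder for a real connection function such as one that opens a database connection.

```python
import queue

class SimpleConnectionPool:
    """A minimal fixed-size connection pool: connections are opened once and reused."""

    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # open all connections up front

    def acquire(self):
        # Blocks until a connection is available, which caps concurrent connections.
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)
```

Reusing a small fixed set of connections avoids the per-request cost of establishing and tearing down connections and keeps the total connection count below the instance limit.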
Example
This example uses multi-path recall to illustrate the powerful features and ease of use of ApsaraDB RDS for PostgreSQL in RAG. A user asks the question 介绍PostgreSQL, and results from three recall methods are obtained.
Prepare data
Use the privileged account to execute the following SQL statements on the required database to install the pg_jieba, pgvector, RUM, and rds_embedding extensions:
Important: Before you install the pg_jieba extension, you must add pg_jieba to the value of the shared_preload_libraries parameter. For more information about how to modify the shared_preload_libraries parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.

You can execute the SELECT * FROM pg_extension; statement to view the installed extensions.
CREATE EXTENSION IF NOT EXISTS pg_jieba;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS rum;
CREATE EXTENSION IF NOT EXISTS rds_embedding;

Create a NAT gateway for the virtual private cloud (VPC) in which the RDS instance resides to allow the RDS instance to access external models. For more information, see Use the rds_embedding extension to generate vectors.
Note: By default, you cannot connect to an RDS instance over the Internet. When you use an external large model, such as the text embedding model provided by Alibaba Cloud Model Studio, you must create a NAT gateway for the VPC in which the RDS instance resides so that the RDS instance can access external models.
Execute the following SQL statements on the required database to create the test tables named doc and embed, and create indexes for the tables:
-- Create the doc test table and an index for the table.
DROP TABLE IF EXISTS doc;
CREATE TABLE doc (
    id bigserial PRIMARY KEY,
    title character varying(255) UNIQUE,
    key_word character varying(255) DEFAULT ''
);
CREATE INDEX doc_gin ON doc USING GIN (to_tsvector('jiebacfg', key_word));

-- Create the embed test table and an index for the table.
DROP TABLE IF EXISTS embed;
CREATE TABLE embed (
    id bigserial PRIMARY KEY,
    doc_id integer,
    content text,
    embedding vector(1536),
    ts_vector_extra tsvector
);
CREATE INDEX ON embed USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

Execute the following SQL statements to create a trigger. When a row in the embed table is updated or data is inserted into the table, the ts_vector_extra column is automatically updated.
-- Convert the text into a tsvector value for full-text search based on keywords.
CREATE TRIGGER embed_tsvector_update
BEFORE UPDATE OR INSERT ON embed
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger('ts_vector_extra', 'public.jiebacfg', 'content');

Execute the following SQL statements. Each time an insert or update is performed on the embed table, an embedding is generated based on the inserted or updated content and stored in the embedding column.
Important: In this example, the text embedding model provided by Alibaba Cloud Model Studio is used. You must activate Alibaba Cloud Model Studio and obtain the required API key. For more information, see Obtain an API key.
-- Convert the inserted or updated content into an embedding. Specify the api_key parameter based on your business requirements.
CREATE OR REPLACE FUNCTION update_embedding() RETURNS TRIGGER AS $$
BEGIN
    NEW.embedding := rds_embedding.get_embedding_by_model('dashscope', 'sk-****', NEW.content)::real[];
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_embedding
BEFORE INSERT OR UPDATE ON embed
FOR EACH ROW
EXECUTE FUNCTION update_embedding();

Insert test data into the tables.
INSERT INTO doc(id, title, key_word) VALUES
(1, 'PostgreSQL介绍', 'PostgreSQL 插件'),
(2, 'MySQL介绍', 'MySQL MGR'),
(3, 'SQL Server介绍', 'SQL Server Microsoft');

INSERT INTO embed(doc_id, content) VALUES
(1, 'PostgreSQL是以加州大学伯克利分校计算机系开发的POSTGRES,版本 4.2为基础的对象关系型数据库管理系统(ORDBMS)。 POSTGRES领先的许多概念在很久以后才出现在一些商业数据库系统中'),
(1, 'PostgreSQL是最初的伯克利代码的开源继承者。 它支持大部分SQL标准并且提供了许多现代特性:复杂查询、外键、触发器、可更新视图、事务完整性、多版本并发控制,同样,PostgreSQL可以用许多方法扩展,比如,通过增加新的:数据类型、函数、操作符、聚集函数、索引方法、过程语言'),
(1, '并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、修改和分发PostgreSQL,不管是私用、商用还是学术研究目的。 '),
(1, 'Ganos插件和PostGIS插件不能安装在同一个Schema下'),
(1, '丰富的生态系统:有大量现成的插件和扩展可供使用,比如PostGIS(地理信息处理)、TimescaleDB(时间序列数据库)、pg_stat_statements(性能监控)等,能够满足不同场景的需要');

INSERT INTO embed(doc_id, content) VALUES
(2, 'MySQL名称的起源不明。 10多年来,我们的基本目录以及大量库和工具均采用了前缀“my”。 不过,共同创办人Monty Widenius的女儿名字也叫“My”。 时至今日,MySQL名称的起源仍是一个迷,即使对我们也一样'),
(2, 'MySQL软件采用双许可方式。 用户可根据GNU通用公共许可(http://www.fsf.org/licenses/)条款,将MySQL软件作为开放源码产品使用,或从MySQL AB公司购买标准的商业许可证。 关于我方许可策略的更多信息,请参见http://www.mysql.com/company/legal/licensing/。 '),
(2, '组复制MySQL Group Replication(简称MGR)是MySQL官方在已有的Binlog复制框架之上,基于Paxos协议实现的一种分布式复制形态。 RDS MySQL集群系列实例支持组复制。 本文介绍如何使复制方式为组复制。 使用了组复制的MySQL集群能够基于分布式Paxos协议自我管理,具有很强的数据可靠性和数据一致性。 相比传统主备复制方式,组复制具有以下优势:数据的强一致性,数据的强可靠性,全局事务强一致性');

INSERT INTO embed(doc_id, content) VALUES
(3, 'Microsoft SQL Server是一种关系数据库管理系统 (RDBMS)。 应用程序和工具连接到SQL Server实例或数据库,并使用Transact-SQL (T-SQL)进行通信。 '),
(3, 'SQL Server 2022 (16.x)在早期版本的基础上构建,旨在将SQL Server发展成一个平台,以提供开发语言、数据类型、本地或云环境以及操作系统选项。 '),
(3, 'SQL Server在企业级应用中广受欢迎,与其他Microsoft产品(如Excel、Power BI)无缝集成,便于数据分析');
Perform multi-path recall
Execute the following SQL statements to retrieve results for the query text 介绍一下postgresql by using multiple methods and sort the relevant documents by similarity.
-- The text that you want to query.
WITH query AS (
SELECT '介绍一下postgresql' AS query_text
),
-- Convert the question into an embedding. Replace sk-**** with the API key obtained from Alibaba Cloud Model Studio.
query_embedding AS (
SELECT rds_embedding.get_embedding_by_model('dashscope', 'sk-****', query.query_text)::real[]::vector AS embedding
FROM query
),
-- Run a search based on document keywords and sort the results by using ts_rank. A high similarity indicates a high score.
first_method AS (
SELECT
id,
title,
ts_rank(to_tsvector('jiebacfg', doc.key_word),
to_tsquery(replace(text(plainto_tsquery('jiebacfg', (SELECT query_text FROM query))), '&', '|'))) AS score,
'doc_key_word' AS method
FROM doc
WHERE
to_tsvector('jiebacfg', doc.key_word) @@
to_tsquery(replace(text(plainto_tsquery('jiebacfg', (SELECT query_text FROM query))), '&', '|'))
ORDER BY
score DESC
LIMIT 3
),
-- Run a full-text search based on content keywords and sort the results by using the <=> operator of RUM. A high similarity indicates a low score.
second_method AS (
SELECT
id,
doc_id,
content,
to_tsvector('jiebacfg', content) <=>
to_tsquery(replace(text(plainto_tsquery('jiebacfg', (SELECT query_text FROM query))), '&', '|')) AS score,
'content_key_word' AS method
FROM embed
WHERE
to_tsvector('jiebacfg', content) @@
to_tsquery(replace(text(plainto_tsquery('jiebacfg', (SELECT query_text FROM query))), '&', '|'))
ORDER BY
score
LIMIT 3
),
-- Run a full-text search based on embeddings and sort the results by using the <=> operator of pgvector. A high similarity indicates a low score.
third_method AS (
SELECT
embed.id,
embed.doc_id,
embed.content,
embedding <=> (SELECT embedding FROM query_embedding LIMIT 1) AS score,
'embedding' AS method
FROM embed
ORDER BY score
LIMIT 3
)
-- Perform a join query to obtain more field information.
SELECT
first_method.title,
embed.id AS chunk_id,
SUBSTRING(embed.content FROM 1 FOR 30),
first_method.score,
first_method.method
FROM first_method
LEFT JOIN embed ON first_method.id = embed.doc_id
-- Combine the output of the second_method field.
UNION
SELECT
doc.title,
second_method.id AS chunk_id,
SUBSTRING(second_method.content FROM 1 FOR 30),
second_method.score,
second_method.method
FROM second_method
LEFT JOIN doc ON second_method.doc_id = doc.id
-- Combine the output of the third_method field.
UNION
SELECT
doc.title,
third_method.id AS chunk_id,
SUBSTRING(third_method.content FROM 1 FOR 30),
third_method.score,
third_method.method
FROM third_method
LEFT JOIN doc ON third_method.doc_id = doc.id
ORDER BY method, score;
Output:
title | chunk_id | substring | score | method
----------------+----------+--------------------------------------------------------------+----------------------+------------------
PostgreSQL介绍 | 3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大 | 13.159472465515137 | content_key_word
PostgreSQL介绍 | 2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO | 16.4493408203125 | content_key_word
PostgreSQL介绍 | 4 | 并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、 | 16.4493408203125 | content_key_word
PostgreSQL介绍 | 6 | 丰富的生态系统:有大量现成的插件和扩展可供使用,比如Post | 0.020264236256480217 | doc_key_word
PostgreSQL介绍 | 5 | Ganos插件和PostGIS插件不能安装在同一个Schem | 0.020264236256480217 | doc_key_word
PostgreSQL介绍 | 3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大 | 0.020264236256480217 | doc_key_word
PostgreSQL介绍 | 2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO | 0.020264236256480217 | doc_key_word
PostgreSQL介绍 | 4 | 并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、 | 0.020264236256480217 | doc_key_word
PostgreSQL介绍 | 2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO | 0.2546271233144539 | embedding
PostgreSQL介绍 | 3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大 | 0.28679098231865074 | embedding
 PostgreSQL介绍 | 6        | 丰富的生态系统:有大量现成的插件和扩展可供使用,比如Post | 0.41783296077761967  | embedding

References
For more information about the best practices of ApsaraDB RDS for PostgreSQL in RAG, see the related best-practice topics in the ApsaraDB RDS for PostgreSQL documentation.