In a data-driven world, unstructured data such as text, images, audio, video, and logs, along with structured and semi-structured data such as JSON, form the core data assets of an enterprise. Unstructured data, in its raw and diverse forms, contains vast business insights, including user feedback, contract terms, and images of product defects. This topic describes how to retrieve and analyze PDF files, such as prospectuses and contracts, in a finance scenario to help you make fine-grained operational decisions.
Core capabilities
This best practice focuses on processing and retrieving unstructured PDF data using the following Hologres capabilities:
Object Table: Read unstructured data, such as PDF, image, and PPT files, from Object Storage Service (OSS) in a tabular format.
AI functions: Call AI functions powered by built-in large language models (LLMs) using standard SQL syntax to build AI services.
Data transformation: Use the Embed and Chunk operators to transform unstructured data into structured data for storage. You can automatically embed data without using external algorithms.
Data retrieval and analytics: Use operators such as
ai_genandai_summarizeto perform operations such as inference, summarization, and translation using SQL.
Dynamic Table: Use the incremental refresh mode to automatically process unstructured data. This mode calculates only incremental data, which reduces repeated calculations and resource usage.
Vector search: Use standard SQL for vector search to perform similarity searches and scenario recognition on unstructured data. You can combine vector and scalar searches in the same query.
Full-text search: Use mechanisms such as inverted indexes and tokenization to efficiently retrieve unstructured data. This feature supports various retrieval methods, such as keyword matching and phrase searching, for more flexible retrieval.
Solution benefits
Using the aforementioned core capabilities, multimodal AI retrieval and analysis in Hologres provides the following benefits:
Complete AI data processing flow: Covers the entire process from data embedding, chunking, and incremental processing to retrieval and analysis. Developers can easily build AI applications in the same way they would for a big data system.
Process and analyze unstructured data with standard SQL: Use pure SQL to extract and process unstructured data without needing a dedicated development language or external systems. This makes data processing simpler and more efficient, and reduces the learning curve for developers.
More accurate, flexible, and intelligent retrieval: Easily build a hybrid retrieval pipeline that combines keyword, semantic, and multimodal searches to meet all scenario requirements from exact searches to intention recognition. You can also use AI functions for in-depth understanding of user intent, semantic association, and contextual inference to enable more intelligent retrieval.
More secure because data is not exported: You do not need to export data to external systems. This solution seamlessly integrates with various security features of Hologres to ensure data security.
This topic describes how to use the aforementioned core capabilities to process and retrieve unstructured data in Hologres. This helps you build an enterprise-level multimodal AI data platform, break down data silos, and unlock the value of all your data.
Workflow
The workflow for this solution is as follows:
Prepare the dataset.
Upload the PDF files from the finance dataset to OSS.
Process the PDF data.
Use an Object Table to read the metadata of the PDF files. Then, create a dynamic table that supports incremental refresh to embed and chunk the data. Also, build vector and full-text indexes on the dynamic table to accelerate subsequent retrieval.
Use the
ai_embedoperator to embed natural language questions. Then, use full-text and vector dual retrieval to retrieve results and sort them. Finally, use an LLM's inference capability to output the answer with the highest similarity.
Preparations
Data preparation
This topic uses the 80 company prospectuses in the PDF folder of the public finance dataset on ModelScope.
Environment preparation
Purchase a Hologres instance of V4.0 or later and create a database.
This topic uses one large-96core-512GB-384GB node as an example.
Deploy models. The models and allocated resources for this solution are as follows:
Model name
Model category
Model description
vCPUs per replica
Memory per replica
GPU per replica
Number of replicas
to_doc
ds4sd/docling-models
Converts a PDF file into a document.
20
100 GB
1 card (48 GB)
1
chunk
recursive-character-text-splitter
Chunks documents. Chunking is recommended because each PDF file is large.
15
30 GB
0 cards (0 GB)
1
pdf_embed
BAAI/bge-base-zh-v1.5
Embeds documents.
7
30 GB
1 card (96 GB)
1
llm
Qwen/Qwen3-32B
Uses an LLM to perform inference on the retrieved document content based on a prompt.
7
30 GB
1 card (96 GB)
1
NoteThe models use the default resource allocations.
Procedure
Download the PDF files and upload them to OSS.
Download the 80 prospectuses (PDF) from the Bosera-JM 14B Challenge Dataset for Finance.
Log on to the OSS console, create a bucket, and upload the downloaded PDF files to the bucket path. For more information, see Simple upload.
Grant permissions.
Log on to the Resource Access Management (RAM) console. Create an Alibaba Cloud RAM role and grant it OSS permissions.
We recommend granting the AliyunOSSReadOnlyAccess permission.
Grant login and Hologres access permissions to the RAM role.
Alibaba Cloud account (root account)
Update the trust policy for the RAM role. Update these parameters:
Action: Set to
sts:AssumeRole.Service: Set to
hologres.aliyuncs.com.
{ "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "RAM": [ "acs:ram::1866xxxx:root" ], "Service": [ "hologres.aliyuncs.com" ] } } ], "Version": "1" }RAM user (sub-account)
Grant permissions to the RAM user.
On the page, click Create Policy, and select the Script Editor mode to create a policy. For more information, see Create a custom policy.
Hologres uses this policy to check whether the RAM user has permission to create the RAM role. Policy content:
{ "Version": "1", "Statement": [ { "Effect": "Allow", "Action": "hologram:GrantAssumeRole", "Resource": "<RoleARN>" } ] }On the page, click Add Permissions in the Actions column for the target RAM user to grant the access policy created in the preceding steps. For more information, see Manage permissions for RAM users.
Grant permissions to the RAM role.
Update the trust policy for the RAM role. Update these parameters:
Action: Set to
sts:AssumeRole.Service: Set to
hologres.aliyuncs.com.
{ "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "RAM": [ "acs:ram::1866xxxx:root" ], "Service": [ "hologres.aliyuncs.com" ] } } ], "Version": "1" }
Embed and chunk the PDF files.
Create an Object Table and a Dynamic Table to read and process the PDF metadata. To simplify this multi-step process, Hologres provides a stored procedure. This stored procedure provides the following capabilities:
Creates an Object Table to read the PDF metadata.
Creates a sink table for the dynamic table in incremental refresh mode to store the processed data. Vector and full-text indexes are configured on this table. The dynamic table must be refreshed manually.
During the dynamic table refresh, data is embedded and chunked using
ai_embedandai_chunk.
The code for the stored procedure is as follows:
CALL create_rag_corpus_from_oss( oss_path => 'oss://xxxx/bs_challenge_financial_14b_dataset/pdf', oss_endpoint => 'oss-cn-hangzhou-internal.aliyuncs.com', oss_role_arn => 'acs:ram::186xxxx:role/xxxx', corpus_table => 'public.dt_bs_challenge_financial' );Refresh the sink table.
To complete the data transformation, you must manually refresh the Object Table and Dynamic Table. Hologres provides the stored procedure for PDF processing, which provides the following capabilities:
Refreshes the Object Table to obtain the PDF metadata.
Refreshes the dynamic table to perform PDF embedding and chunking.
The following code calls this stored procedure:
CALL refresh_rag_corpus_table( corpus_table => 'public.dt_bs_challenge_financial' );Search PDF files.
After the data is processed, you can perform vector or full-text searches based on your business scenario. For example, you can query a company's performance trend from its prospectus to determine the company's future outlook. This information can inform future investment decisions.
Vector search
For vector search, Hologres provides a vector search function that encapsulates processes such as question embedding, prompt construction, and answer generation by an LLM. You can directly call this function to perform vector retrieval.
-- Vector-only retrieval + AI reranking SELECT qa_vector_search_retrieval( question => 'By what percentage did the operating income and net profit of Goke Microelectronics increase year-over-year in 2014, 2015, and 2016 during the reporting period?', corpus_table => 'dt_bs_challenge_financial', prompt => 'Please analyze whether the following performance trend is pessimistic or optimistic and provide reasons: ${question}\n\n Reference information:\n\n ${context}' )The retrieved answer is as follows:
qa_retrieval --------- "Based on the information provided, the analysis of the performance trend of Goke Microelectronics leads to the following conclusions: ### I. Performance Trend Analysis: Pessimistic #### 1. **Sluggish revenue growth** - Revenue in 2014 increased by **15.13%** year-over-year, but in 2015, it **decreased by 5.21%**. Data for 2016 is not provided, but it is clear that the revenue growth trend saw a significant decline in 2015. - The compound annual growth rate (CAGR) of revenue from 2012 to 2014 was only **4.47%**, indicating slow business expansion. #### 2. **Continuous decline in net profit growth** - Net profit grew by **5.43%** in 2014 but **decreased by 3.29%** in 2015. - After deducting non-recurring gains and losses, the net profit attributable to parent company shareholders decreased by **3.14%** in 2014 and further reduced by **5.60%** in 2015. This indicates that the profitability of the company's main business is continuously deteriorating. - The CAGR of net profit after deducting non-recurring gains and losses from 2012 to 2014 was **-4.38%**. This is much lower than the revenue growth, indicating that the main business is not profitable and growth relies on non-recurring gains and losses. #### 3. **High proportion of non-recurring gains and losses** - During the reporting period, non-recurring gains and losses accounted for a high proportion of net profit: **17.54%**, **10.25%**, and **8.06%** in 2014, 2013, and 2012, respectively. This shows that a portion of the company's profit comes from non-recurring factors such as policy support and government subsidies, rather than from the sustained growth of its core business. - Relying on non-recurring gains and losses to maintain profit growth is not conducive to the company's long-term stable development. #### 4. **Decline in return on equity (ROE)** - The weighted average ROE increased from **18.10%** in 2014 to **24.82%** in 2015, and then to **28.23%** in 2016. Although the data seems to show growth, note that this indicator is calculated based on net profit after deducting non-recurring gains and losses, and the net profit itself is declining. Therefore, this growth may be related to changes in the capital structure rather than a substantial improvement in profitability. ### II. Summary 1. **Sluggish main business growth**: Both revenue and net profit growth show a downward trend, especially the decline in net profit, which indicates weakening profitability. 2. **High reliance on non-recurring gains and losses**: Non-recurring gains and losses account for a high proportion of the company's profit, indicating that the main business is not sufficiently profitable and the sustainability of the company's performance is questionable. 3. **Intense market competition**: The market for industrial computers, displays, and power supplies that the company purchases is highly competitive, with stable prices and squeezed profit margins. 4. **Market environment**: Fluctuations in stainless steel market prices and raw material prices may have a certain impact on the company's operating income. Although the company has taken measures to reduce the impact, it still requires attention in the long run. ### III. Conclusion Overall, the performance trend of Goke Microelectronics is **pessimistic**. The company's main business growth is sluggish, net profit is continuously declining, and it relies heavily on non-recurring gains and losses. The sustainability of its future profitability is questionable. The company needs to strengthen the competitiveness of its core business, optimize its cost structure, and improve the profitability of its main business to achieve long-term stable development."Full-text search
For full-text search, Hologres provides a full-text search function that encapsulates processes such as question embedding, prompt construction, and answer generation by an LLM. You can directly call this function to perform full-text retrieval.
--Full-text search retrieval SELECT qa_text_search_retrieval( question => 'By what percentage did the operating income and net profit of Goke Microelectronics increase year-over-year in 2014, 2015, and 2016 during the reporting period?', corpus_table => 'dt_bs_challenge_financial', prompt => 'Please analyze whether the following performance trend is pessimistic or optimistic and provide reasons: ${question}\n\n Reference information:\n\n ${context}' );The retrieved answer is as follows:
qa_text_search_retrieval ---------------- "Based on the information provided, the overall performance trend of Goke Microelectronics in 2014, 2015, and 2016 is **pessimistic**. The specific reasons are as follows: ### 1. **Sluggish revenue growth** - The revenue growth rate in 2014 was **15.13%**, but in 2015, it turned to **-5.21%**, indicating negative growth. - The compound annual growth rate (CAGR) of revenue from 2012 to 2014 was only **4.47%**, indicating that the company's revenue growth was slow and its business development was not strong. - The revenue forecast for the first half of 2015 was nearly the same as the same period in 2014, but the net profit for the first half of 2015 **slightly decreased** compared to the same period of the previous year, indicating a decline in profitability. ### 2. **Poor growth in net profit and net profit excluding non-recurring items** - The net profit growth rate in 2014 was **5.43%**, but it decreased to **-3.29%** in 2015, meaning net profit declined. - The growth rate of net profit after deducting non-recurring gains and losses was **-3.14%** in 2014 and further decreased to **-5.60%** in 2015, indicating that the profitability of the company's main business continued to decline. - The CAGR of net profit excluding non-recurring items from 2012 to 2014 was **-4.38%**, which is significantly lower than the CAGR of revenue. This indicates that the company's main business profitability is weak. ### 3. **Fluctuations in cash flow from operating activities** - In 2014, the proportion of cash from the sale of goods and provision of services to operating income decreased compared to the previous two years. This was mainly due to the **delayed payment** of some revenue items, indicating problems in the company's cash flow management. - In 2013, the proportion of cash for the purchase of goods and services to operating costs was relatively high. This was mainly because **raw materials were purchased and production was completed** in that year, but some costs were not carried over until 2014, resulting in a lower proportion in 2014. This reflects that the company's procurement and production pace was not stable. ### 4. **Investment and profitability indicators** - The weighted average return on equity (ROE) was **18.10%** in 2014, rose to **24.82%** in 2015, and further increased to **28.23%** in 2016. Although there was an improvement, the increase in ROE may have been mainly attributed to **financial leverage** rather than an improvement in core business profitability. - Considering the continuous decline in net profit and net profit excluding non-recurring items, the increase in ROE does not fully reflect an improvement in the company's operational performance. ### 5. **Performance forecast for the first half of 2015** - The estimated operating revenue of H1 2015 is **CNY 85.05 million to 103.95 million**, which is similar to H1 2024's **CNY 101.2735 million**. However, the estimated net profit is **CNY 23.40 million to 28.60 million**, which is lower than the 2014's **CNY 29.1266 million**, indicating a decline in the company's profitability. ### Summary In summary, the performance trend of Goke Microelectronics from 2014 to 2016 is **pessimistic**. Although ROE has improved, the sluggish revenue growth, continuous decline in net profit and net profit excluding non-recurring items, and fluctuations in cash flow from operational activities indicate that the company's main business profitability is weak and its operational quality needs to be improved."Hybrid search
In a hybrid search scenario that combines vector search, full-text search, and ranking, Hologres provides a hybrid search function with ranking. This function has the following capabilities:
Retrieves the top 20 answers using vector calculations based on the question.
Retrieves the top 20 answers using full-text search based on the question.
Uses
ai_rankto sort the answers from the vector and full-text searches, and then outputs the top 1 answer.Uses
ai_genand an LLM to generate the final answer based on the prompt and the retrieved answer.
-- Full-text and vector dual-retrieval + AI reranking SELECT qa_hybrid_retrieval( question => 'By what percentage did the operating revenue and net profit of Hunan Goke Microelectronics Co., Ltd. increase year-over-year in 2014, 2015, and 2016 during the reporting period?', corpus_table => 'dt_bs_challenge_financial', prompt => 'Please analyze whether the following performance trend is pessimistic or optimistic and provide reasons: ${question}\n\n Reference information:\n\n ${context}' );The retrieved answer is as follows:
qa_hybrid_retrieval --- "Based on the information provided, we can analyze the performance trend of Goke Microelectronics and determine whether it is pessimistic or optimistic as follows: --- ### I. **Operating Revenue Trend Analysis** 1. **2012-2014 Compound Annual Growth Rate (CAGR)**: - The CAGR of operating income was **4.47%**, indicating that the company's revenue growth was relatively stable. - In 2014, operating income was **CNY 181.5406 million**, an increase of **15.13%** from 2013. - In 2015, operating income **decreased by 5.21%** year-over-year, showing negative growth. 2. **Conclusion**: - Revenue growth recovered in 2014 but saw a significant decline in 2015, indicating that the company's business expansion has encountered some resistance. --- ### II. **Net Profit Trend Analysis** 1. **2012-2014 CAGR**: - The CAGR of net profit after deducting non-recurring gains and losses was **-4.38%**, which is lower than the CAGR of operating income. This indicates a decline in the company's profitability. - In 2014, net profit excluding non-recurring items was **CNY 42.731 million**, a decrease of **3.14%** from 2013. - In 2015, net profit excluding non-recurring items further **decreased by 5.60%** year-over-year. 2. **Impact of Non-recurring Gains and Losses**: - The proportion of non-recurring gains and losses to net profit was **17.54%**, **10.25%**, and **8.06%** in 2014, 2013, and 2012, respectively, showing an upward trend. - The increase in non-recurring gains and losses mainly came from government subsidies and investment income, rather than sustained growth from the main business. 3. **Conclusion**: - The continuous two-year decline in net profit excluding non-recurring items indicates that the profitability of the company's main business is weakening, and performance growth relies on non-recurring gains and losses, which is a worrying signal. --- ### III. **Cash Flow and Operational Stability** 1. **Cash Flow from Operating Activities**: - In 2014, operating income was **CNY 181.5406 million**, but the cash from the sale of goods and provision of services was not clearly stated, making it impossible to determine if the cash flow is healthy. - During the reporting period, the company's bank deposits were **CNY 130.6338 million**, **CNY 41.5254 million**, and **CNY 98.6461 million**, respectively. The liquidity fluctuated significantly, but the main customers, suppliers, and business model remained stable. 2. **Conclusion**: - Although the company's cash flow fluctuates, its stable customer and supplier base and business model provide a certain guarantee for future development. --- ### IV. **Performance Forecast for the H1 2015** - From January to June 2015, the estimated operating income is **CNY 85.05 million to 103.95 million**, a significant increase from H1 2014's **CNY 46.4119 million**. - However, from January to March 2015, net profit decreased by **48.26%** year-over-year, mainly because the projects with recognized revenue had lower gross profit margins. --- ### V. **Comprehensive Analysis and Judgment** 1. **Optimistic Factors**: - Operating income grew rapidly in 2014, reaching **15.13%**. - Operating income is expected to grow significantly in H1 2015, indicating that the company may be gradually recovering. - The stability of major customers, suppliers, and the business model provides a good operational foundation for the company. 2. **Pessimistic Factors**: - In 2015, operating revenue **decreased by 5.21%** year-over-year, and net profit also declined. - The continuous two-year decline in net profit excluding non-recurring items indicates insufficient profitability of the main business. - The proportion of non-recurring gains and losses is increasing, and performance growth relies on government subsidies and investment income, lacking growth momentum. - Net profit from January to March 2015 dropped sharply by **48.26%**, indicating significant short-term performance fluctuations. --- ### **Final Conclusion: Overall Trend is Pessimistic** - Although the company's operating income recovered in 2014 and is expected to grow in the first half of 2015, the **continuous decline in net profit excluding non-recurring items**, **reliance on non-recurring gains and losses for net profit growth**, and **large short-term performance fluctuations** indicate that the company's current performance growth lacks sustainability and stability. - Therefore, from a long-term perspective, the company's performance trend is **pessimistic**. Attention should be paid to the profitability of its main business and its reliance on non-recurring gains and losses. --- ### **Recommendations** 1. Pay attention to whether the profitability of the company's main business will be improved. 2. Reduce reliance on non-recurring gains and losses and enhance growth momentum. 3. Stabilize customer and supplier relationships, optimize the business structure, and increase gross profit margins."Hybrid search with RRF
After using vector and full-text search, you can use Reciprocal Rank Fusion (RRF) to sort the retrieved results. For convenience, Hologres has encapsulated this process into a hybrid search function with RRF. For more information, see the appendix. This function has the following capabilities:
Retrieves the top 20 answers using vector calculations based on the question.
Retrieves the top 20 answers using full-text search based on the question.
Calculates RRF scores for the answers from the vector and full-text searches, and then outputs the top N answers.
Uses
ai_genand an LLM to generate the final answer based on the prompt and the retrieved answer.
-- Full-text and vector dual-retrieval + RRF reranking SELECT qa_hybrid_retrieval_rrf( question => 'By what percentage did the operating income and net profit of Goke Microelectronics increase year-over-year in 2014, 2015, and 2016 during the reporting period?', corpus_table => 'dt_bs_challenge_financial', prompt => 'Please analyze whether the following performance trend is pessimistic or optimistic and provide reasons: ${question}\n\n Reference information:\n\n ${context}' );The retrieved answer is as follows:
qa_hybrid_retrieval_rrf ------------------ "Based on the information provided, the analysis of the performance trend of Goke Microelectronics leads to the following conclusions: ### **Performance Trend Judgment: Pessimistic** #### **The reasons are as follows:** 1. **Net profit growth is lower than revenue growth:** - The provided information indicates that the company's **compound annual growth rate (CAGR) of operating income from 2012 to 2014 was 4.47%**, suggesting that the overall business growth was relatively stable. - However, the **CAGR of net profit attributable to parent company shareholders after deducting non-recurring gains and losses was -4.38%**, which is significantly lower than the revenue growth rate. This indicates that while revenue was growing, the company's profitability did not improve in sync and even declined. This could be due to factors such as rising costs, declining gross profit margins, or a reduction in non-recurring gains and losses. 2. **Significant fluctuations in net profit:** - Net profit from January to March 2015 decreased by 48.26% year-over-year. The main reason was the lower gross profit margin of the projects with recognized revenue (such as the Wuxi Metro Line 1 project, which was mainly based on module outsourcing). This indicates that the company's short-term performance is susceptible to changes in its business structure, showing a certain degree of instability. 3. **Decline in gross profit margin and profitability:** - It was mentioned that the "contribution of main business profit to the company's net profit" slightly decreased in 2014 compared to 2013, and 2013 was lower than 2012. This suggests that the profitability of the company's core business may be weakening, possibly due to increased market competition, rising costs, or changes in the product structure. 4. **Projected profit decline in the first half of 2015:** - The estimated operating income for H1 2015 is CNY 85.05 million to 103.95 million, which is similar to that in H1 2014. However, the estimated net profit is CNY 23.40 million to 28.60 million, which is lower than 2014 H1's CNY 29.1266 million. This indicates that the company's profitability is further declining and it may be facing certain operational pressures. 5. **Sluggish business growth:** - Although revenue growth is stable, the decline in net profit indicates that the quality of the company's business growth is not high and has not been effectively converted into profit. This may affect investors' confidence in the company's future development. ### **Summary:** The overall performance trend of Goke Microelectronics is **pessimistic**. Although operating revenue has maintained stable growth, the growth of net profit has significantly lagged or even shown negative growth. This indicates that the company's profitability is declining, the quality of its business development is not high, and there is a risk of short-term performance fluctuations. If the company cannot effectively increase its gross profit margin, control costs, or optimize its product structure, its future performance may continue to be under pressure."
Appendix: Function and procedure definitions
The definitions of the stored procedures and functions used in this topic are provided for your reference.
You cannot create functions in Hologres. The following procedures and functions are for reference only. You cannot modify and run them.
Stored procedures for PDF processing
Create an Object Table and a Dynamic Table
CREATE OR REPLACE PROCEDURE create_rag_corpus_from_oss( oss_path TEXT, oss_endpoint TEXT, oss_role_arn TEXT, corpus_table TEXT, embedding_model TEXT DEFAULT NULL, parse_document_model TEXT DEFAULT NULL, chunk_model TEXT DEFAULT NULL, chunk_size INT DEFAULT 300, chunk_overlap INT DEFAULT 50, overwrite BOOLEAN DEFAULT FALSE ) AS $$ DECLARE corpus_schema TEXT; corpus_name TEXT; obj_table_name TEXT; full_corpus_ident TEXT; full_obj_ident TEXT; embed_expr TEXT; chunk_expr TEXT; parse_expr TEXT; embedding_dims INT; BEGIN -- 1. Split the schema name and table name. IF position('.' in corpus_table) > 0 THEN corpus_schema := split_part(corpus_table, '.', 1); corpus_name := split_part(corpus_table, '.', 2); ELSE corpus_schema := 'public'; corpus_name := corpus_table; END IF; obj_table_name := corpus_name || '_obj_table'; full_corpus_ident := format('%I.%I', corpus_schema, corpus_name); full_obj_ident := format('%I.%I', corpus_schema, obj_table_name); -- 2. If overwrite is needed, drop the table and index first. IF overwrite THEN DECLARE dyn_table_exists BOOLEAN; rec RECORD; BEGIN -- Check if the dynamic table exists. SELECT EXISTS ( SELECT 1 FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relname = corpus_name AND n.nspname = corpus_schema ) INTO dyn_table_exists; IF dyn_table_exists THEN -- 2.1 Disable auto-refresh for the dynamic table. -- RAISE NOTICE 'Disabling auto refresh for %', full_corpus_ident; -- EXECUTE format('ALTER TABLE IF EXISTS %s SET (auto_refresh_enable=false)', full_corpus_ident); -- 2.2 Find and cancel RUNNING refresh tasks. FOR rec IN EXECUTE format( $f$ SELECT query_job_id FROM hologres.hg_dynamic_table_refresh_log(%L) WHERE status = 'RUNNING'; $f$, corpus_table ) LOOP RAISE NOTICE 'Found running refresh job: %', rec.query_job_id; IF hologres.hg_internal_cancel_query_job(rec.query_job_id::bigint) THEN RAISE NOTICE 'Cancel job % succeeded.', rec.query_job_id; ELSE RAISE WARNING 'Cancel job % failed.', rec.query_job_id; END IF; END LOOP; -- 2.3 Drop the Dynamic Table. EXECUTE format('DROP TABLE IF EXISTS %s;', full_corpus_ident); ELSE RAISE NOTICE 'Dynamic table % does not exist, skip cancel job and drop.', full_corpus_ident; END IF; -- 2.4 In any case, the Object Table must be dropped. EXECUTE format('DROP OBJECT TABLE IF EXISTS %s;', full_obj_ident); END; END IF; -- 3. Create an Object Table. RAISE NOTICE 'Create object table: %', obj_table_name; EXECUTE format( $f$ CREATE OBJECT TABLE %s WITH ( path = %L, oss_endpoint = %L, role_arn = %L ); $f$, full_obj_ident, oss_path, oss_endpoint, oss_role_arn ); COMMIT; -- 4. Refresh the Object Table. RAISE NOTICE 'Refresh object table: %', obj_table_name; EXECUTE format('REFRESH OBJECT TABLE %s;', full_obj_ident); COMMIT; -- 5. Select a document parsing model. IF parse_document_model IS NULL OR length(trim(parse_document_model)) = 0 THEN parse_expr := 'ai_parse_document(file, ''auto'', ''markdown'')'; ELSE parse_expr := format( 'ai_parse_document(%L, file, ''auto'', ''markdown'')', parse_document_model ); END IF; -- 6. Select a chunking model. IF chunk_model IS NULL OR length(trim(chunk_model)) = 0 THEN chunk_expr := format('ai_chunk(doc, %s, %s)', chunk_size, chunk_overlap); ELSE chunk_expr := format( 'ai_chunk(%L, doc, %s, %s)', chunk_model, chunk_size, chunk_overlap ); END IF; -- 7. Select an embedding model. IF embedding_model IS NULL OR length(trim(embedding_model)) = 0 THEN embed_expr := 'ai_embed(chunk)'; EXECUTE 'SELECT array_length(ai_embed(''dummy''), 1)' INTO embedding_dims; ELSE embed_expr := format('ai_embed(%L, chunk)', embedding_model); EXECUTE format( 'SELECT array_length(ai_embed(%L, ''dummy''), 1)', embedding_model ) INTO embedding_dims; END IF; RAISE NOTICE 'embedding dimension is: %', embedding_dims; -- 8. Create a dynamic table to store RAG outputs. RAISE NOTICE 'create dynamic table: %', corpus_name; EXECUTE format( $f$ CREATE DYNAMIC TABLE %s( CHECK(array_ndims(embedding_vector) = 1 AND array_length(embedding_vector, 1) = %s) ) WITH ( vectors = '{ "embedding_vector": { "algorithm": "HGraph", "distance_method": "Cosine", "builder_params": { "base_quantization_type": "sq8_uniform", "max_degree": 64, "ef_construction": 400, "precise_quantization_type": "fp32", "use_reorder": true } } }', auto_refresh_mode = 'incremental', freshness = '5 minutes', auto_refresh_enable = 'false' ) AS WITH parsed_doc AS ( SELECT object_uri, etag, %s AS doc FROM %s ), chunked_doc AS ( SELECT object_uri, etag, unnest(%s) AS chunk FROM parsed_doc ) SELECT object_uri, etag, chunk, %s AS embedding_vector FROM chunked_doc; $f$, full_corpus_ident, embedding_dims, parse_expr, full_obj_ident, chunk_expr, embed_expr ); COMMIT; -- 9. Create a full-text index (index name = table name || '_fulltext_idx'). EXECUTE format( 'CREATE INDEX %I ON %s USING FULLTEXT (chunk);', corpus_name || '_fulltext_idx', full_corpus_ident ); RAISE NOTICE ''; RAISE NOTICE 'Create RAG corpus success to table: %', corpus_table; RAISE NOTICE ' Vector index is: %.embedding_vector', corpus_table; RAISE NOTICE ' TextSearch index is: %.chunk', corpus_table; END; $$ LANGUAGE plpgsql;Stored procedure to refresh the Object Table and Dynamic Table
CREATE OR REPLACE PROCEDURE refresh_rag_corpus_table( corpus_table TEXT ) AS $$ DECLARE corpus_schema TEXT; corpus_name TEXT; obj_table_name TEXT; full_corpus_ident TEXT; full_obj_ident TEXT; BEGIN -- 1. Parse the schema and table name. IF position('.' in corpus_table) > 0 THEN corpus_schema := split_part(corpus_table, '.', 1); corpus_name := split_part(corpus_table, '.', 2); ELSE corpus_schema := 'public'; corpus_name := corpus_table; END IF; obj_table_name := corpus_name || '_obj_table'; full_corpus_ident := format('%I.%I', corpus_schema, corpus_name); full_obj_ident := format('%I.%I', corpus_schema, obj_table_name); -- 2. Refresh the Object Table. RAISE NOTICE 'Refreshing Object Table: %', obj_table_name; EXECUTE format('REFRESH OBJECT TABLE %s;', full_obj_ident); -- 3. Refresh the Dynamic Table. RAISE NOTICE 'Refreshing Dynamic Table: %', corpus_name; EXECUTE format('REFRESH TABLE %s;', full_corpus_ident); RAISE NOTICE 'Refresh complete for corpus table %', corpus_table; END; $$ LANGUAGE plpgsql;Stored procedure to drop the Object Table and Dynamic Table
CREATE OR REPLACE PROCEDURE drop_rag_corpus_table( corpus_table TEXT ) AS $$ DECLARE corpus_schema TEXT; corpus_name TEXT; obj_table_name TEXT; full_corpus_ident TEXT; full_obj_ident TEXT; rec RECORD; BEGIN -- 1. Parse the schema and table name. IF position('.' in corpus_table) > 0 THEN corpus_schema := split_part(corpus_table, '.', 1); corpus_name := split_part(corpus_table, '.', 2); ELSE corpus_schema := 'public'; corpus_name := corpus_table; END IF; obj_table_name := corpus_name || '_obj_table'; full_corpus_ident := format('%I.%I', corpus_schema, corpus_name); full_obj_ident := format('%I.%I', corpus_schema, obj_table_name); -- 2. Drop the table. -- 2.1 Disable auto-refresh for the dynamic table. -- RAISE NOTICE 'Disabling auto refresh for %', full_corpus_ident; -- EXECUTE format('ALTER TABLE IF EXISTS %s SET (auto_refresh_enable=false)', full_corpus_ident); -- 2.2 Find and cancel RUNNING refresh tasks. FOR rec IN EXECUTE format( $f$ SELECT query_job_id FROM hologres.hg_dynamic_table_refresh_log(%L) WHERE status = 'RUNNING'; $f$, corpus_table ) LOOP RAISE NOTICE 'Found running refresh job: %', rec.query_job_id; IF hologres.hg_internal_cancel_query_job(rec.query_job_id::bigint) THEN RAISE NOTICE 'Cancel job % succeeded.', rec.query_job_id; ELSE RAISE WARNING 'Cancel job % failed.', rec.query_job_id; END IF; END LOOP; -- 2.3 Drop the Dynamic Table. RAISE NOTICE 'Dropping Dynamic Table: %', corpus_name; EXECUTE format('DROP TABLE IF EXISTS %s;', full_corpus_ident); -- 2.4 Drop the Object Table. RAISE NOTICE 'Dropping Object Table: %', obj_table_name; EXECUTE format('DROP OBJECT TABLE IF EXISTS %s;', full_obj_ident); RAISE NOTICE 'Drop complete for corpus: %', corpus_table; END; $$ LANGUAGE plpgsql;
Vector search function
-- Vector search for Q&A
CREATE OR REPLACE FUNCTION qa_vector_search_retrieval(
question TEXT,
corpus_table TEXT,
embedding_model TEXT DEFAULT NULL,
llm_model TEXT DEFAULT NULL,
ranking_model TEXT DEFAULT NULL,
prompt TEXT DEFAULT 'Please answer the following question in ${language} based on the reference information.\n\n Question: ${question}\n\n Reference information:\n\n ${context}',
language TEXT DEFAULT 'Chinese',
vector_recall_count INT DEFAULT 20,
rerank_recall_count INT DEFAULT 5,
vector_col TEXT DEFAULT 'embedding_vector'
)
RETURNS TEXT AS
$$
DECLARE
final_answer TEXT;
sql TEXT;
embedding_expr TEXT;
ai_rank_expr TEXT;
ai_gen_expr TEXT;
embedding_model_valid BOOLEAN;
llm_model_valid BOOLEAN;
ranking_model_valid BOOLEAN;
BEGIN
embedding_model_valid := (embedding_model IS NOT NULL AND trim(embedding_model) != '');
llm_model_valid := (llm_model IS NOT NULL AND trim(llm_model) != '');
ranking_model_valid := (ranking_model IS NOT NULL AND trim(ranking_model) != '');
IF embedding_model_valid THEN
embedding_expr := 'ai_embed(' || quote_literal(embedding_model) || ', ' || quote_literal(question) || ')';
ELSE
embedding_expr := 'ai_embed(' || quote_literal(question) || ')';
END IF;
IF ranking_model_valid THEN
ai_rank_expr := 'ai_rank(' || quote_literal(ranking_model) || ', ' || quote_literal(question) || ', chunk)';
ELSE
ai_rank_expr := 'ai_rank(' || quote_literal(question) || ', chunk)';
END IF;
IF llm_model_valid THEN
ai_gen_expr := 'ai_gen(' || quote_literal(llm_model) ||
', replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || ') )';
ELSE
ai_gen_expr := 'ai_gen(replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || '))';
END IF;
sql := '
WITH
embedding_recall AS (
SELECT
chunk,
approx_cosine_distance(' || vector_col || ', ' || embedding_expr || ') AS distance
FROM
' || corpus_table || '
ORDER BY
distance DESC
LIMIT ' || vector_recall_count || '
),
rerank AS (
SELECT
chunk,
' || ai_rank_expr || ' AS score
FROM
embedding_recall
ORDER BY
score DESC
LIMIT ' || rerank_recall_count || '
),
concat_top_chunks AS (
SELECT string_agg(chunk, E''\n\n----\n\n'') AS merged_chunks FROM rerank
)
SELECT ' || ai_gen_expr || '
FROM concat_top_chunks;
';
EXECUTE sql INTO final_answer;
RETURN final_answer;
END;
$$ LANGUAGE plpgsql;Full-text search function
CREATE OR REPLACE FUNCTION qa_text_search_retrieval(
question TEXT,
corpus_table TEXT,
llm_model TEXT DEFAULT NULL,
ranking_model TEXT DEFAULT NULL,
prompt TEXT DEFAULT 'Please answer the following question in ${language} based on the reference information.\n\n Question: ${question}\n\n Reference information:\n\n ${context}',
language TEXT DEFAULT 'Chinese',
text_search_recall_count INT DEFAULT 20,
rerank_recall_count INT DEFAULT 5,
text_search_col TEXT DEFAULT 'chunk'
)
RETURNS TEXT AS
$$
DECLARE
final_answer TEXT;
sql TEXT;
ai_rank_expr TEXT;
ai_gen_expr TEXT;
llm_model_valid BOOLEAN;
ranking_model_valid BOOLEAN;
BEGIN
llm_model_valid := (llm_model IS NOT NULL AND trim(llm_model) != '');
ranking_model_valid := (ranking_model IS NOT NULL AND trim(ranking_model) != '');
IF ranking_model_valid THEN
ai_rank_expr := 'ai_rank(' || quote_literal(ranking_model) || ', ' || quote_literal(question) || ', chunk)';
ELSE
ai_rank_expr := 'ai_rank(' || quote_literal(question) || ', chunk)';
END IF;
IF llm_model_valid THEN
ai_gen_expr := 'ai_gen(' || quote_literal(llm_model) ||
', replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) ||
'), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || ') )';
ELSE
ai_gen_expr := 'ai_gen(replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) ||
'), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || '))';
END IF;
sql := '
WITH
text_search_recall AS (
SELECT
chunk
FROM
' || corpus_table || '
ORDER BY
text_search(' || text_search_col || ', ' || quote_literal(question) || ') DESC
LIMIT ' || text_search_recall_count || '
),
rerank AS (
SELECT
chunk,
' || ai_rank_expr || ' AS score
FROM
text_search_recall
ORDER BY
score DESC
LIMIT ' || rerank_recall_count || '
),
concat_top_chunks AS (
SELECT string_agg(chunk, E''\n\n----\n\n'') AS merged_chunks FROM rerank
)
SELECT ' || ai_gen_expr || '
FROM concat_top_chunks;
';
EXECUTE sql INTO final_answer;
RETURN final_answer;
END;
$$ LANGUAGE plpgsql;Hybrid search function with ranking
CREATE OR REPLACE FUNCTION qa_hybrid_retrieval(
question TEXT,
corpus_table TEXT,
embedding_model TEXT DEFAULT NULL,
llm_model TEXT DEFAULT NULL,
ranking_model TEXT DEFAULT NULL,
prompt TEXT DEFAULT 'Please answer the following question in ${language} based on the reference information.\n\n Question: ${question}\n\n Reference information:\n\n ${context}',
language TEXT DEFAULT 'Chinese',
text_search_recall_count INT DEFAULT 20,
vector_recall_count INT DEFAULT 20,
rerank_recall_count INT DEFAULT 5,
vector_col TEXT DEFAULT 'embedding_vector',
text_search_col TEXT DEFAULT 'chunk'
)
RETURNS TEXT AS
$$
DECLARE
final_answer TEXT;
sql TEXT;
embedding_expr TEXT;
ai_rank_expr TEXT;
ai_gen_expr TEXT;
embedding_model_valid BOOLEAN;
llm_model_valid BOOLEAN;
ranking_model_valid BOOLEAN;
BEGIN
embedding_model_valid := (embedding_model IS NOT NULL AND trim(embedding_model) != '');
llm_model_valid := (llm_model IS NOT NULL AND trim(llm_model) != '');
ranking_model_valid := (ranking_model IS NOT NULL AND trim(ranking_model) != '');
IF embedding_model_valid THEN
embedding_expr := 'ai_embed(' || quote_literal(embedding_model) || ', ' || quote_literal(question) || ')';
ELSE
embedding_expr := 'ai_embed(' || quote_literal(question) || ')';
END IF;
IF ranking_model_valid THEN
ai_rank_expr := 'ai_rank(' || quote_literal(ranking_model) || ', ' || quote_literal(question) || ', chunk)';
ELSE
ai_rank_expr := 'ai_rank(' || quote_literal(question) || ', chunk)';
END IF;
IF llm_model_valid THEN
ai_gen_expr := 'ai_gen(' || quote_literal(llm_model) ||
', replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || ') )';
ELSE
ai_gen_expr := 'ai_gen(replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || '))';
END IF;
sql := '
WITH
embedding_recall AS (
SELECT
chunk
FROM
' || corpus_table || '
ORDER BY
approx_cosine_distance(' || vector_col || ', ' || embedding_expr || ') DESC
LIMIT ' || vector_recall_count || '
),
text_search_recall AS (
SELECT
chunk
FROM
' || corpus_table || '
ORDER BY
text_search(' || text_search_col || ', ' || quote_literal(question) || ') DESC
LIMIT ' || text_search_recall_count || '
),
union_recall AS (
SELECT chunk FROM embedding_recall
UNION
SELECT chunk FROM text_search_recall
),
rerank AS (
SELECT
chunk,
' || ai_rank_expr || ' AS score
FROM
union_recall
ORDER BY
score DESC
LIMIT ' || rerank_recall_count || '
),
concat_top_chunks AS (
SELECT string_agg(chunk, E''\n\n----\n\n'') AS merged_chunks FROM rerank
)
SELECT ' || ai_gen_expr || '
FROM concat_top_chunks;
';
EXECUTE sql INTO final_answer;
RETURN final_answer;
END;
$$ LANGUAGE plpgsql;Hybrid search function with RRF
CREATE OR REPLACE FUNCTION qa_hybrid_retrieval_rrf(
question TEXT,
corpus_table TEXT,
embedding_model TEXT DEFAULT NULL,
llm_model TEXT DEFAULT NULL,
ranking_model TEXT DEFAULT NULL,
prompt TEXT DEFAULT 'Please answer the following question in ${language} based on the reference information.\n\n Question: ${question}\n\n Reference information:\n\n ${context}',
language TEXT DEFAULT 'Chinese',
text_search_recall_count INT DEFAULT 20,
vector_recall_count INT DEFAULT 20,
rerank_recall_count INT DEFAULT 5,
rrf_k INT DEFAULT 60,
vector_col TEXT DEFAULT 'embedding_vector',
text_search_col TEXT DEFAULT 'chunk'
)
RETURNS TEXT AS
$$
DECLARE
final_answer TEXT;
sql TEXT;
embedding_expr TEXT;
ai_rank_expr TEXT;
ai_gen_expr TEXT;
embedding_model_valid BOOLEAN;
llm_model_valid BOOLEAN;
ranking_model_valid BOOLEAN;
BEGIN
embedding_model_valid := (embedding_model IS NOT NULL AND trim(embedding_model) <> '');
llm_model_valid := (llm_model IS NOT NULL AND trim(llm_model) <> '');
ranking_model_valid := (ranking_model IS NOT NULL AND trim(ranking_model) <> '');
IF embedding_model_valid THEN
embedding_expr := 'ai_embed(' || quote_literal(embedding_model) || ', ' || quote_literal(question) || ')';
ELSE
embedding_expr := 'ai_embed(' || quote_literal(question) || ')';
END IF;
IF ranking_model_valid THEN
ai_rank_expr := 'ai_rank(' || quote_literal(ranking_model) || ', ' || quote_literal(question) || ', chunk)';
ELSE
ai_rank_expr := 'ai_rank(' || quote_literal(question) || ', chunk)';
END IF;
IF llm_model_valid THEN
ai_gen_expr := 'ai_gen(' || quote_literal(llm_model) ||
', replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || ') )';
ELSE
ai_gen_expr := 'ai_gen(replace(replace(replace(' || quote_literal(prompt) ||
', ''${question}'', ' || quote_literal(question) || '), ''${context}'', merged_chunks), ''${language}'', ' || quote_literal(language) || '))';
END IF;
sql := '
WITH embedding_recall AS (
SELECT
chunk,
vec_score,
ROW_NUMBER() OVER (ORDER BY vec_score DESC) AS rank_vec
FROM (
SELECT
chunk,
approx_cosine_distance(' || vector_col || ', ' || embedding_expr || ') AS vec_score
FROM
' || corpus_table || '
) t
ORDER BY vec_score DESC
LIMIT ' || vector_recall_count || '
),
text_search_recall AS (
SELECT
chunk,
text_score,
ROW_NUMBER() OVER (ORDER BY text_score DESC) AS rank_text
FROM (
SELECT
chunk,
text_search(' || text_search_col || ', ' || quote_literal(question) || ') AS text_score
FROM
' || corpus_table || '
) ts
WHERE text_score > 0
ORDER BY text_score DESC
LIMIT ' || text_search_recall_count || '
),
rrf_scores AS (
SELECT
chunk,
SUM(1.0 / (' || rrf_k || ' + rank_val)) AS rrf_score
FROM (
SELECT chunk, rank_vec AS rank_val FROM embedding_recall
UNION ALL
SELECT chunk, rank_text AS rank_val FROM text_search_recall
) sub
GROUP BY chunk
),
top_chunks AS (
SELECT chunk
FROM rrf_scores
ORDER BY rrf_score DESC
LIMIT ' || rerank_recall_count || '
),
concat_top_chunks AS (
SELECT string_agg(chunk, E''\n\n----\n\n'') AS merged_chunks FROM top_chunks
)
SELECT ' || ai_gen_expr || '
FROM concat_top_chunks;
';
EXECUTE sql INTO final_answer;
RETURN final_answer;
END;
$$ LANGUAGE plpgsql;