This topic describes how to use AnalyticDB for PostgreSQL to help ColourData implement full-text search, data processing, and data analysis in a one-stop manner.
Background information
ColourData Shanghai Business Consulting Co., Ltd. is a wholly owned Chinese subsidiary of Cheil Worldwide, a Samsung Group company. ColourData collects and analyzes data such as social media data, e-commerce data, questionnaire data, and event tracking data from social media, news, and e-commerce websites, and uses this data to provide real-time marketing and consulting reports for large enterprises worldwide. Typical ColourData customers include Hyundai, BMW, Samsung, GlaxoSmithKline, and Bayer.
According to ColourData, an e-commerce platform has accumulated a large amount of historical comment data on Product A and receives new comment data on the product every day. The platform wants to write both the historical data and the daily incremental data to AnalyticDB for PostgreSQL for data processing and multi-dimensional analysis.
Data write or synchronization
Execute the following statement to create a table named product_customer_reply to store the comment data on Product A:
CREATE TABLE product_customer_reply (
customer_id INTEGER, -- User ID
gender INTEGER, -- Gender
age INTEGER, -- Age
---
--- Other information about the user
---
reply_time TIMESTAMP, -- Comment time
reply TEXT -- Comment text
) DISTRIBUTED BY(customer_id);
If business data is saved in a formatted data file, you can execute the COPY statement to load the data. For example, execute the following statement and specify a delimiter to load a data file to AnalyticDB for PostgreSQL:
\COPY product_customer_reply FROM '</path/localfile>' DELIMITER as '|';
You can batch load incremental data by executing the COPY statement or using AnalyticDB for PostgreSQL Client SDK. For more information, see Use AnalyticDB for PostgreSQL Client SDK to write data.
If business data is stored in an online transaction processing (OLTP) database, you can use Data Transmission Service (DTS) to perform schema synchronization or full data synchronization. You can also use DTS to perform incremental data synchronization to update data in real time.
Full-text search
If your data contains Chinese words, you must configure Chinese word segmentation before you use full-text search. By default, AnalyticDB for PostgreSQL provides basic configurations for Chinese word segmentation. You can use the default configurations or customize configurations based on your business requirements. For example, execute the following statements to add Product A and Brand A to a custom dictionary:
-- Add custom words.
INSERT INTO zhparser.zhprs_custom_word VALUES('Product A');
INSERT INTO zhparser.zhprs_custom_word VALUES('Brand A');
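After adding custom words, you can verify that the parser recognizes them. The following is a minimal sketch; it assumes the zhparser extension's sync_zhprs_custom_word() function is available to load the custom dictionary so that it takes effect for new connections, and the sample sentence is illustrative:
-- Load the newly added custom words into the parser.
SELECT sync_zhprs_custom_word();
-- Check how a sample comment is segmented; 'Product A' should now appear
-- as a single lexeme in the output.
SELECT to_tsvector('zh_cn', 'Product A is very good, I will purchase it again');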
As the amount of business data increases and more words are added to the custom dictionary, the execution of full-text search may become slower. For example, execute the following statement to query the users that give good comments on Product A and want to purchase Product A again:
SELECT count(*) FROM product_customer_reply WHERE to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 7625.684 ms (00:07.626)
As the amount of business data increases, this query takes longer to complete.
You can use one of the following methods to accelerate the query:
Method 1: Create a GIN index on the reply column to accelerate full-text search on that column.
CREATE INDEX ON product_customer_reply USING GIN (to_tsvector('zh_cn', reply));
Execute the query again. The execution duration is reduced.
SELECT count(*) FROM product_customer_reply WHERE to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 4539.930 ms (00:04.540)
Method 2: Create a TSVECTOR-type column to store the pre-segmented data of the reply column, which avoids recomputing to_tsvector for every query. For example, execute the following statements to add a TSVECTOR-type column named reply_ts and populate it with the segmented data of the reply column:
ALTER TABLE product_customer_reply ADD COLUMN reply_ts tsvector;
UPDATE product_customer_reply SET reply_ts = to_tsvector('zh_cn', reply);
Execute the following statement to create a GIN index on the reply_ts column:
CREATE INDEX ON product_customer_reply USING GIN (reply_ts);
Execute the query again. The execution duration is significantly reduced.
SELECT count(*) FROM product_customer_reply WHERE reply_ts @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND reply_ts @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 465.849 ms
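Note that if you use Method 2, newly loaded comments do not yet have reply_ts values. A minimal way to keep the column in sync (a sketch, assuming batch loads rather than row-by-row writes) is to segment only the unprocessed rows after each incremental load:
-- Populate reply_ts only for newly loaded rows; the WHERE clause avoids
-- re-segmenting rows that have already been processed.
UPDATE product_customer_reply
SET reply_ts = to_tsvector('zh_cn', reply)
WHERE reply_ts IS NULL;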
By configuring Chinese word segmentation, designing an appropriate schema, and using GIN indexes, full-text search performance is significantly improved.
Data processing
After a full-text search, you can process all comment data of the product to extract text and grouping characteristics for analysis. Data processing may involve a large number of SQL statements that scan the full data set. Therefore, you can use a stored procedure to control data processing. For example, execute the following statement to create a table named ts_search_detail to store the full-text search conditions:
CREATE TABLE ts_search_detail (
search_id INTEGER,
ts_search_text TEXT
) DISTRIBUTED BY(search_id);
Execute the following statement to create a table named proc_results to store the data processing results. The table contains columns such as the user ID, gender, and age.
CREATE TABLE proc_results (
id INTEGER,
gender INTEGER,
age INTEGER,
search_id INTEGER
) DISTRIBUTED BY(id);
Execute the following statement to create a stored procedure named ts_proc_jobs to process the full-text search conditions in the ts_search_text column one by one and store the processing results in the proc_results table:
CREATE OR REPLACE PROCEDURE ts_proc_jobs()
AS $$
DECLARE
ts_search record;
proc_query text;
BEGIN
-- Build and run one INSERT ... SELECT statement per search condition.
FOR ts_search IN (SELECT ts_search_text, search_id FROM ts_search_detail) LOOP
proc_query := 'INSERT INTO proc_results (id, gender, age, search_id)
SELECT customer_id, gender, age, '
|| ts_search.search_id
|| ' FROM product_customer_reply WHERE '
|| ts_search.ts_search_text;
EXECUTE proc_query;
-- Commit after each condition so completed jobs are not rolled back.
COMMIT;
RAISE NOTICE 'search id % finished', ts_search.search_id;
END LOOP;
END;
$$
LANGUAGE plpgsql;
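To run the jobs, register the search conditions in the ts_search_detail table and call the procedure. The following is a minimal sketch; the two conditions are illustrative examples based on the full-text search queries above, and dollar quoting ($$...$$) is used only to avoid escaping the single quotes inside the conditions:
-- Register two illustrative search conditions. Each value is a WHERE-clause
-- fragment that the procedure appends to its dynamic INSERT ... SELECT statement.
INSERT INTO ts_search_detail VALUES
(1, $$to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase')$$),
(2, $$to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good')$$);

-- Execute all registered searches; each result row is tagged with its search_id.
CALL ts_proc_jobs();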
After data processing, you can perform complex join queries and further full-text search analysis based on your business requirements.
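For example, a hypothetical multi-dimensional analysis on the processing results could count the matched users per search condition, broken down by gender and 10-year age band (a sketch, assuming the proc_results table has been populated by ts_proc_jobs):
-- Count matched users per search condition, gender, and 10-year age band.
SELECT search_id,
gender,
(age / 10) * 10 AS age_band, -- 10-year age bands: 20, 30, 40, ...
count(*) AS matched_users
FROM proc_results
GROUP BY search_id, gender, (age / 10) * 10
ORDER BY search_id, gender, age_band;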