This topic describes how to use AnalyticDB for PostgreSQL to help ColourData implement full-text search, data processing, and data analysis in a one-stop manner.
Background information
ColourData Shanghai Business Consulting Co., Ltd. is a wholly owned Chinese subsidiary of Cheil Worldwide, a Samsung Group company. ColourData collects and analyzes data such as social media data, e-commerce data, questionnaire data, and event tracking data from social media, news, and e-commerce websites, and uses this data to provide real-time marketing and consulting reports for large enterprises worldwide. Typical ColourData customers include Hyundai, BMW, Samsung, GlaxoSmithKline, and Bayer.
According to ColourData, an e-commerce platform has accumulated a large amount of historical comment data on Product A and receives new comment data on the product every day. The platform wants to write both the historical data and the daily incremental data to AnalyticDB for PostgreSQL for data processing and multi-dimensional analysis.
Data write or synchronization
Execute the following statement to create a table named product_customer_reply to store the comment data on Product A:
CREATE TABLE product_customer_reply (
customer_id INTEGER, -- User ID
gender INTEGER, -- Gender
age INTEGER, -- Age
---
--- Other information about the user
---
reply_time TIMESTAMP, -- Comment time
reply TEXT -- Comment text
) DISTRIBUTED BY(customer_id);
If business data is saved in a formatted data file, you can execute the COPY statement to load the data. For example, execute the following statement and specify a delimiter to load a data file to AnalyticDB for PostgreSQL:
\COPY product_customer_reply FROM '</path/localfile>' DELIMITER as '|';
You can batch load incremental data by executing the COPY statement or using AnalyticDB for PostgreSQL Client SDK. For more information, see Use AnalyticDB for PostgreSQL Client SDK to write data.
If business data is stored in an online transaction processing (OLTP) database, you can use Data Transmission Service (DTS) to perform schema synchronization or full data synchronization. You can also use DTS to perform incremental data synchronization to update data in real time.
Full-text search
If your data contains Chinese words, you must configure Chinese word segmentation before you use full-text search. By default, AnalyticDB for PostgreSQL provides basic configurations for Chinese word segmentation. You can use the default configurations or customize configurations based on your business requirements. For example, execute the following statements to add Product A and Brand A to a custom dictionary:
-- Add custom words.
INSERT INTO zhparser.zhprs_custom_word VALUES('Product A');
INSERT INTO zhparser.zhprs_custom_word VALUES('Brand A');
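After adding custom words, you can verify that the parser recognizes them. The following is a minimal sketch; it assumes the zhparser extension's sync_zhprs_custom_word() function is available to load the custom dictionary so that it takes effect for new connections, and the sample sentence is illustrative:
-- Load the newly added custom words into the parser.
SELECT sync_zhprs_custom_word();
-- Check how a sample comment is segmented; 'Product A' should now appear
-- as a single lexeme in the output.
SELECT to_tsvector('zh_cn', 'Product A is very good, I will purchase it again');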
As the amount of business data increases and more words are added to the custom dictionary, the execution of full-text search may become slower. For example, execute the following statement to query the users that give good comments on Product A and want to purchase Product A again:
SELECT count(*) FROM product_customer_reply WHERE to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 7625.684 ms (00:07.626)
As the amount of business data increases, this query takes longer to complete.
You can use one of the following methods to accelerate the query:
Method 1: Create a GIN index on the reply column to accelerate full-text search on that column.
CREATE INDEX ON product_customer_reply USING GIN (to_tsvector('zh_cn', reply));
Execute the query again. The execution duration is reduced.
SELECT count(*) FROM product_customer_reply WHERE to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 4539.930 ms (00:04.540)
Method 2: Create a TSVECTOR-type column to store the pre-segmented data of the reply column, which avoids recomputing to_tsvector for every query. For example, execute the following statements to add a TSVECTOR-type column named reply_ts and populate it with the segmented data of the reply column:
ALTER TABLE product_customer_reply ADD COLUMN reply_ts tsvector;
UPDATE product_customer_reply SET reply_ts = to_tsvector('zh_cn', reply);
Execute the following statement to create a GIN index on the reply_ts column:
CREATE INDEX ON product_customer_reply USING GIN (reply_ts);
Execute the query again. The execution duration is significantly reduced.
SELECT count(*) FROM product_customer_reply WHERE reply_ts @@ to_tsquery('zh_cn','Product A<1,10>Purchase') AND reply_ts @@ to_tsquery('zh_cn','Product A<1,10>Good');
 count
--------
 428571
(1 row)

Time: 465.849 ms
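Note that if you use Method 2, newly loaded comments do not yet have reply_ts values. A minimal way to keep the column in sync (a sketch, assuming batch loads rather than row-by-row writes) is to segment only the unprocessed rows after each incremental load:
-- Populate reply_ts only for newly loaded rows; the WHERE clause avoids
-- re-segmenting rows that have already been processed.
UPDATE product_customer_reply
SET reply_ts = to_tsvector('zh_cn', reply)
WHERE reply_ts IS NULL;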
By configuring Chinese word segmentation, designing an appropriate schema, and using GIN indexes, full-text search performance is significantly improved.
Data processing
After a full-text search, you can process all comment data of the product to extract text and grouping characteristics for analysis. Data processing may involve a large number of SQL statements that scan the full data set. Therefore, you can use a stored procedure to control data processing. For example, execute the following statement to create a table named ts_search_detail to store the full-text search conditions:
CREATE TABLE ts_search_detail (
search_id INTEGER,
ts_search_text TEXT
) DISTRIBUTED BY(search_id);
Execute the following statement to create a table named proc_results to store the data processing results. The table contains columns such as the user ID, gender, and age.
CREATE TABLE proc_results (
id INTEGER,
gender INTEGER,
age INTEGER,
search_id INTEGER
) DISTRIBUTED BY(id);
Execute the following statement to create a stored procedure named ts_proc_jobs to process the full-text search conditions in the ts_search_text column one by one and store the processing results in the proc_results table:
CREATE OR REPLACE PROCEDURE ts_proc_jobs()
AS $$
DECLARE
ts_search record;
proc_query text;
BEGIN
-- Build and run one INSERT ... SELECT statement per search condition.
FOR ts_search IN (SELECT ts_search_text, search_id FROM ts_search_detail) LOOP
proc_query := 'INSERT INTO proc_results (id, gender, age, search_id)
SELECT customer_id, gender, age, '
|| ts_search.search_id
|| ' FROM product_customer_reply WHERE '
|| ts_search.ts_search_text;
EXECUTE proc_query;
-- Commit after each condition so completed jobs are not rolled back.
COMMIT;
RAISE NOTICE 'search id % finished', ts_search.search_id;
END LOOP;
END;
$$
LANGUAGE plpgsql;
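To run the jobs, register the search conditions in the ts_search_detail table and call the procedure. The following is a minimal sketch; the two conditions are illustrative examples based on the full-text search queries above, and dollar quoting ($$...$$) is used only to avoid escaping the single quotes inside the conditions:
-- Register two illustrative search conditions. Each value is a WHERE-clause
-- fragment that the procedure appends to its dynamic INSERT ... SELECT statement.
INSERT INTO ts_search_detail VALUES
(1, $$to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Purchase')$$),
(2, $$to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn','Product A<1,10>Good')$$);

-- Execute all registered searches; each result row is tagged with its search_id.
CALL ts_proc_jobs();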
After data processing, you can perform complex join queries and further full-text search analysis based on your business requirements.
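For example, a hypothetical multi-dimensional analysis on the processing results could count the matched users per search condition, broken down by gender and 10-year age band (a sketch, assuming the proc_results table has been populated by ts_proc_jobs):
-- Count matched users per search condition, gender, and 10-year age band.
SELECT search_id,
gender,
(age / 10) * 10 AS age_band, -- 10-year age bands: 20, 30, 40, ...
count(*) AS matched_users
FROM proc_results
GROUP BY search_id, gender, (age / 10) * 10
ORDER BY search_id, gender, age_band;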