Perform full-text search data processing and analysis for PostgreSQL - AnalyticDB

ColourData Shanghai Business Consulting Co., Ltd. — a wholly owned subsidiary of Cheil Worldwide (Samsung group) — collects and analyzes data such as social media data, e-commerce data, questionnaire data, and event tracking data from social media, news, and e-commerce websites to deliver real-time marketing reports for global enterprises such as Hyundai, BMW, Samsung, GlaxoSmithKline, and Bayer.

This article walks through how ColourData uses AnalyticDB for PostgreSQL to ingest large volumes of e-commerce comment data, run Chinese full-text search at scale, and batch-process results with stored procedures.

Load data

Create the product_customer_reply table to store comment data:

CREATE TABLE product_customer_reply (
  customer_id   INTEGER,    -- User ID
  gender        INTEGER,    -- Gender
  age           INTEGER,    -- Age
  reply_time    TIMESTAMP,  -- Comment time
  reply         TEXT        -- Comment text
) DISTRIBUTED BY(customer_id);

Load historical data from a file

If the data is in a delimited flat file, use the COPY statement to load it:

\COPY product_customer_reply FROM '</path/localfile>' DELIMITER as '|';

Load incremental data

For ongoing incremental loads, use either the COPY statement or the AnalyticDB for PostgreSQL Client SDK. For details, see Use AnalyticDB for PostgreSQL Client SDK to write data.

Sync from an OLTP database

If your source data lives in an online transaction processing (OLTP) database, use Data Transmission Service (DTS) to perform schema synchronization, full data synchronization, and incremental data synchronization.

Run full-text search

PostgreSQL's built-in full-text search works natively for alphabet-based languages. For Chinese, it does not tokenize text correctly without an extension — characters in a word are treated as independent units, which breaks search accuracy. AnalyticDB for PostgreSQL includes the zhparser extension, which tokenizes Chinese text into meaningful lexemes for accurate search.

Configure Chinese word segmentation

AnalyticDB for PostgreSQL includes default zhparser configurations. To add product-specific terms to the custom dictionary:

-- Add custom words
INSERT INTO zhparser.zhprs_custom_word VALUES('Product A');
INSERT INTO zhparser.zhprs_custom_word VALUES('Brand A');

Query with full-text search

Use to_tsvector() and to_tsquery() to search for comments that mention both "Product A" and positive purchase intent:

to_tsvector('zh_cn', reply) — converts a text column into a searchable token vector using Chinese tokenization.
to_tsquery('zh_cn', '...') — converts a search term into a query object.
@@ — checks whether the vector matches the query.

SELECT count(*)
FROM product_customer_reply
WHERE to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn', 'Product A<1,10>Purchase')
  AND to_tsvector('zh_cn', reply) @@ to_tsquery('zh_cn', 'Product A<1,10>Good');

As data volume grows, this query becomes slower because to_tsvector() is recomputed on every row at query time. The baseline query time without an index:

count
--------
428571
(1 row)

Time: 7625.684 ms (00:07.626)

Accelerate queries with indexes

Two methods reduce query time significantly.

Method 1: GIN index on the expression

Create a Generalized Inverted Index (GIN) index on the to_tsvector() expression:

CREATE INDEX ON product_customer_reply USING GIN (to_tsvector('zh_cn', reply));

The index stores precomputed token vectors, avoiding full-table recomputation on each query.

Query time after indexing:

count
--------
428571
(1 row)

Time: 4539.930 ms (00:04.540)

Method 2: Dedicated tsvector column + GIN index (recommended)

Add a tsvector-type column to store precomputed token vectors:

ALTER TABLE product_customer_reply ADD COLUMN reply_ts tsvector;

Create a GIN index on the new column:

CREATE INDEX ON product_customer_reply USING GIN (reply_ts);

Query using the precomputed column instead of recomputing at query time:

SELECT count(*)
FROM product_customer_reply
WHERE reply_ts @@ to_tsquery('zh_cn', 'Product A<1,10>Purchase')
  AND reply_ts @@ to_tsquery('zh_cn', 'Product A<1,10>Good');

Query time after indexing:

count
--------
428571
(1 row)

Time: 465.849 ms

The reply_ts column must be kept in sync with the reply column. Populate it with an UPDATE statement after bulk loads, and maintain it with a trigger or application logic for ongoing writes.

Performance summary

Method	Query time
No index (compute at query time)	7,626 ms
GIN index on expression	4,540 ms
Dedicated tsvector column + GIN index	466 ms

Method 2 reduces query time by approximately 16x compared to the baseline.

Process data with stored procedures

After full-text search identifies relevant comments, use stored procedures to batch-process the results and extract user attributes for analysis.

Create a table to store full-text search conditions:

CREATE TABLE ts_search_detail (
  search_id       INTEGER,
  ts_search_text  TEXT
) DISTRIBUTED BY(id);

Create a table to store processing results:

CREATE TABLE proc_results (
  id        INTEGER,
  gender    INTEGER,
  age       INTEGER,
  search_id INTEGER
) DISTRIBUTED BY(id);

Create the stored procedure ts_proc_jobs to iterate over each search condition and insert matching user records into proc_results:

CREATE OR REPLACE PROCEDURE ts_proc_jobs()
AS $$
DECLARE
  ts_search   record;
  proc_query  text;
BEGIN
  FOR ts_search IN (SELECT ts_search_text, search_id FROM ts_search_detail) LOOP
    proc_query := '';
    proc_query := 'INSERT INTO proc_results (id, gender, age, search_id)
                   SELECT customer_id, gender, age, '
                   || ts_search.search_id
                   || ' FROM product_customer_reply WHERE '
                   || ts_search.ts_search_text;
    execute(proc_query);
    commit;
    raise notice 'search id % finish', ts_search.search_id;
  END LOOP;
END;
$$
LANGUAGE 'plpgsql';

The procedure reads each search condition from ts_search_detail, dynamically constructs an INSERT ... SELECT statement, commits after each iteration, and logs progress with RAISE NOTICE.

After processing completes, run complex joined queries and multi-dimensional analysis on the proc_results table.

AnalyticDB:Use AnalyticDB for PostgreSQL to help ColourData implement full-text search, data processing, and data analysis