Full-text search - AnalyticDB for PostgreSQL - Alibaba Cloud Documentation Center

Full-text search is a database capability of converting natural language text data to queryable data. AnalyticDB for PostgreSQL uses the PostgreSQL engine to provide a one-stop full-text search feature. This topic describes how to use the full-text search feature of AnalyticDB for PostgreSQL.

Background information

As the digital era unfolds, data comes from more sources and is generated by using a wider range of methods. This includes large amounts of text data. In most cases, databases or data warehouses are used to store text data. However, to extract and analyze valuable information from text data, you must use multiple data processing systems. This results in higher requirements and maintenance costs for users.

When you use a data warehouse to process and analyze text data, you must make sure that your data warehouse provides capabilities such as real-time data write, full-text search, and job scheduling. Achieving the preceding capabilities in a single data warehouse poses the following challenges:

Not all data warehouse engines provide the full-text search feature. If your data warehouse does not provide the full-text search feature, you must perform additional development work to prepare and process text data before you can import data to the data warehouse.
Job scheduling relies on the support for SQL standards of data warehouse engines and powerful support for external tools.
Full-text search involves processing large amounts of text data, but data warehouses do not provide the same level of performance for processing text data as they do for processing numeric data.
Specific data warehouse engines do not provide flexible configuration change capabilities.

AnalyticDB for PostgreSQL provides full-text search and data processing capabilities and can cope with the preceding challenges.

Overview

In most cases, a full-text search application finds specific words from text data that is stored in a database and sorts the words based on the number of occurrences.

Most databases provide a basic text search feature. For example, you can use expressions such as LIKE in a query to search text data. However, this method has the following disadvantages in modern database applications:

The expression-based query method that is commonly used for databases cannot process syntax such as derivatives. For example, satisfies is a derivative of the English word satisfy. If you use satisfy as a keyword to query data, the query results that are related to satisfies may be missing. This cannot meet the requirements of a full-text search. You can use the OR expression to match satisfy and satisfies at the same time. However, this operation is inefficient and prone to errors because specific words have a large number of derivatives.
The expression-based query method cannot efficiently sort the matching results. If a large number of query results are returned, the sorting result is inefficient.
The expression-based query method provides low query performance and cannot create efficient indexes. You must traverse the entire text data.

AnalyticDB for PostgreSQL provides a one-stop full-text search feature that resolves the preceding issues.

Functionality of full-text search

The full-text search feature of AnalyticDB for PostgreSQL provides fast query performance by precomputing text data. Precomputing involves the following operations:

Parse text data to symbols. Symbols are used to classify text words into different types, such as numbers, adjectives, and adverbs. Different types of symbols can be processed in different operations. The PostgreSQL engine uses the default parser to parse text data to symbols and allows you to use custom parsers for text data in different languages.
Convert symbols to words. Compared with symbols, words are normalized to a combination of different forms, such as satisfy and satisfies. This achieves efficient performance of full-text search based on semantics. The PostgreSQL engine uses dictionaries to convert symbols to words and allows you to configure custom dictionaries.
Optimize word storage and perform efficient queries. For example, the PostgreSQL engine provides the text search vector (TSVECTOR) type to parse and convert text data to sorted data that has word information. The PostgreSQL engine also provides the text search query (TSQUERY) type to achieve efficient full-text search.

TSVECTOR

A TSVECTOR value is a sorted list of distinct words that have location information. You can use the to_tsvector syntax of PostgreSQL to convert text data to a TSVECTOR value. For example, convert the sentence 'a fat cat jumped on a mat and ate two fat rats' to sorted data.

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats');
                          to_tsvector
---------------------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'jump':4 'mat':7 'rat':12 'two':10
(1 row)

The query result of the to_tsvector syntax contains a list of sorted words. Each word is followed by its location information in the sentence. For example, fat':2,11 indicates that the word fat resides in positions 2 and 11 of the sentence. In addition, the query result of the to_tsvector syntax omits the conjunctions in the sentence and normalizes specific words. For example, jumped is normalized to jump.

The to_tsvector syntax completes precomputing and conversion for text data.

TSQUERY

A TSQUERY value stores the words that you want to search. You can use the to_tsquery syntax of PostgreSQL to convert text data to a TSQUERY value, and then use the to_tsvector syntax and full-text search operators to implement full-text search.

The to_tsquery syntax supports a combination of the @@ (CONTAINS) operator and Boolean operators such as & (AND), | (OR), and ! (NOT).

For example, use the @@ operator to query whether a TSVECTOR value contains a TSQUERY value.

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat');
 ?column?
----------
 t
(1 row)

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cats');
 ?column?
----------
 t
(1 row)

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat | dog');
 ?column?
----------
 t
(1 row)

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat & dog');
 ?column?
----------
 f
(1 row)

The query result for the word cat is t, which indicates true. The query result for the word cats is also t because cats is the plural form of cat.

Distance search

Important

Only AnalyticDB for PostgreSQL V7.0 instances support distance search.

You can use the full-text search feature to query whether a text contains a word and perform further analysis on phrases. The to_tsquery method of PostgreSQL supports the phrase search operator <N>. The integer N specifies the distance between words. For example, if you want to query whether a text contains the word cat followed by a jump phrase, you can use the <1> operator.

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<1>jump');
 ?column?
----------
 t
(1 row)

To query a specific phrase, you can also use the distance search method.

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<2>mat');
 ?column?
----------
 f
(1 row)

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<4>mat');
 ?column?
----------
 t
(1 row)

AnalyticDB for PostgreSQL performs in-depth development based on the full-text search feature of PostgreSQL and supports the distance search operator <N,M> for full-text search. The integers M and N specify a range of distance between words from N to M. For example, if you want to query whether a text contains a phrase whose distance is less than or equal to 5 between the words cat and mat, you can use the <1,5> operator.

postgres=# SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<1,5>mat');
 ?column?
----------
 t
(1 row)

Chinese word segmentation: zhparser extension

Specific database applications store large amounts of Chinese text data, such as user comment tables and address information. The analysis of Chinese text data requires the full-text search feature. Chinese words are the minimum semantic unit and are not separated by spaces in the same manner as English words. The default full-text search engine of PostgreSQL cannot meet the word segmentation requirements of Chinese semantics.

Simple Chinese Word Segmentation (SCWS) is an open source Chinese word segmentation engine that is based on word frequency dictionaries. SCWS can segment a Chinese text into appropriate words. SCWS is developed in the C language and can be used as a dynamic-link library to access applications. Combined with the code extension capability of PostgreSQL, SCWS can be used to implement Chinese word segmentation in PostgreSQL databases.

Zhparser is a PostgreSQL Chinese word segmentation extension that is developed based on the capabilities of SCWS. Zhparser is compatible with the full-text search feature of PostgreSQL and provides a wide range of feature configuration options and custom dictionaries.

By default, the zhparser extension is built into AnalyticDB for PostgreSQL. You can configure the zhparser extension based on your Chinese word segmentation requirements. For example, create a Chinese word segmentation parser named zh_cn and configure word segmentation policies.

--- Create a Chinese word segmentation parser. 
CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);
--- Configure noun (n), verb (v), adjective (a), idiom (i), exclamation (e), temporary idiom (l), and custom (x) word segmentation policies. 
ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l,x WITH simple;

After you complete the basic configurations, you can use the Chinese word segmentation capability to develop Chinese text search.

You can use the to_tsquery syntax and the zhparser extension to perform text search.

For more information about the zhparser extension, see Use zhparser to perform Chinese word segmentation.

Custom dictionaries

The zhparser extension supports custom Chinese dictionaries. If the default dictionary cannot meet your word segmentation requirements, you can use a custom dictionary and optimize the query results in real time. The system table zhparser.zhprs_custom_word of zhparser is a user-oriented custom dictionary table. You can use a custom dictionary only by updating the system table. The zhparser.zhprs_custom_word table uses the following schema:

                   Table "zhparser.zhprs_custom_word"
 Column |       Type       | Collation | Nullable |        Default
--------+------------------+-----------+----------+-----------------------
 word   | text             |           | not null |
 tf     | double precision |           |          | '1'::double precision
 idf    | double precision |           |          | '1'::double precision
 attr   | character(1)     |           |          | '@'::bpchar
Indexes:
    "zhprs_custom_word_pkey" PRIMARY KEY, btree (word)
Check constraints:
    "zhprs_custom_word_attr_check" CHECK (attr = '@'::bpchar OR attr = '!'::bpchar)

The word column contains custom words. The tf and idf columns are used to specify the custom word weights. TF-IDF is short for term frequency–inverse document frequency. The attr column defines the word segment and stop word attributes for custom words.

AnalyticDB for PostgreSQL provides database-level custom dictionaries. The dictionaries are stored in the data directory of the database of each data node.

Full-text search indexes

Full-text search may involve large amounts of text data. Appropriate indexes can effectively improve query performance. An inverted index is a data structure that stores data and location relationships. You can use an inverted index to search large amounts of text data in data systems.

AnalyticDB for PostgreSQL provides the generalized inverted index (GIN) feature to improve the query performance of TSVECTOR data.

CREATE INDEX text_idx ON document USING GIN (to_tsvector('zh_cn',text));

References

Use AnalyticDB for PostgreSQL to help ColourData implement full-text search, data processing, and data analysis