This topic describes how to use the zhparser extension to run Chinese full-text searches on an ApsaraDB RDS for PostgreSQL instance.

Prerequisites

  • Your RDS instance runs one of the following PostgreSQL versions:
    • PostgreSQL 13
    • PostgreSQL 12
    • PostgreSQL 11
    • PostgreSQL 10
    • PostgreSQL 9.4
  • zhparser is added to the value of the shared_preload_libraries parameter of your RDS instance.

    For more information about how to add zhparser to the value of the shared_preload_libraries parameter, see Manage the parameters of an ApsaraDB RDS for PostgreSQL instance.

Note If the zhparser extension cannot be created, you must update the minor engine version of your RDS instance. For more information, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.

Background information

Open source PostgreSQL provides a built-in parser extension that can segment text in languages, such as English, that not only have punctuation marks but also have spaces between words. However, text in Chinese does not have spaces between words and needs to be segmented based on semantics. Therefore, we recommend that you use the zhparser extension rather than the parser extension to segment text in Chinese.

The zhparser extension of PostgreSQL can segment text in Chinese. After you install the zhparser extension on your RDS instance, you can run Chinese full-text searches on your RDS instance.

Enable the zhparser extension

Execute the following statements to enable the zhparser extension:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
--Optional parameter configuration
alter role all set zhparser.multi_short=on;
--Perform a simple test
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了 保 障 房 的 建 设 和 投 入 力 度 。 2011年,保障房进入了更大规模的建设阶段。 住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。 ');
SELECT to_tsvector('testzhcfg','“今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。”陈国强说。 在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。 ');
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Execute the following statements to enable the zhparser extension to run a full-text search:
-- Create a full-text index for the name field of the t1 table. You must replacet1 and name in the SQL statements with the actual table name and field value based on your business requirements. 
create index idx_t1 on t1 using gin (to_tsvector('testzhcfg',upper(name) ));
-- Use the full-text index
 select * from t1 where to_tsvector('testzhcfg',upper(t1.name)) @@ to_tsquery('testzhcfg','(防火)') ;

Customize a Chinese word segment dictionary

Execute the following statements to customize a Chinese word segment dictionary

-- The segmentation result
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Insert a new word segment to the dictionary
insert into pg_ts_custom_word values ('保障房资');
-- Make the inserted word segment take effect
select zhprs_sync_dict_xdb();
-- End the connection
\c
-- Requery to obtain new segmentation results
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Instructions to use custom word segments:
  • A maximum of 1 million custom word segments can be added. If the number of word segments exceed the limit, the word segments outside the limit are not processed. Ensure that the number of word segments is within this range. The custom and default word segmentation dictionaries take effect at the same time.
  • Each word segment can be a maximum of 128 bytes in length. The section after the 128th byte will be truncated.
  • After adding, deleting, or changing word segments, execute the select zhprs_sync_dict_xdb(); statement and re-establish a connection to make the operation take effect.