This topic describes how to use the zhparser extension to run Chinese full-text searches on an ApsaraDB RDS for PostgreSQL instance.
Prerequisites
- The RDS instance runs PostgreSQL 10 or later.
- The minor engine version of the RDS instance is 20230330 or later. For more information about how to view and update the minor engine version of an RDS instance, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance. Important This extension is supported in minor engine versions that are earlier than 20230330. However, the extensions that are supported for ApsaraDB RDS for PostgreSQL instances are changed. Starting April 17, 2023, some extensions can no longer be created for RDS instances that run minor engine versions earlier than 20230330. For more information, see [Notice] Starting April 17, 2023, some extensions can no longer be created for ApsaraDB RDS for PostgreSQL instances that run earlier minor engine versions.
- If you have created this extension for your RDS instance that runs a minor engine version earlier than 20230330, the extension is not affected.
- If this is the first time you create this extension for your RDS instance or re-create the extension, you must update the minor engine version of the RDS instance to 20230330 or later. For more information, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.
- The name of the zhparser extension is added to the value of the shared_preload_libraries parameter of your RDS instance.
For more information about how to add the name of the zhparser extension to the value of the shared_preload_libraries parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.
Background information
Open source PostgreSQL provides a built-in parser extension that can segment text in languages, such as English, that not only have punctuation marks but also have spaces between words. However, text in Chinese does not have spaces between words and needs to be segmented based on semantics. Therefore, we recommend that you use the zhparser extension rather than the parser extension to segment text in Chinese.
The zhparser extension of PostgreSQL can segment text in Chinese. After you install the zhparser extension on your RDS instance, you can run Chinese full-text searches on your RDS instance.
Enable the zhparser extension
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
--Optional parameter configuration
alter role all set zhparser.multi_short=on;
--Perform a simple test
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了 保 障 房 的 建 设 和 投 入 力 度 。 2011年,保障房进入了更大规模的建设阶段。 住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。 ');
SELECT to_tsvector('testzhcfg','“今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。”陈国强说。 在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。 ');
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Create a full-text index for the name field of the t1 table. You must replacet1 and name in the SQL statements with the actual table name and field value based on your business requirements.
create index idx_t1 on t1 using gin (to_tsvector('testzhcfg',upper(name) ));
-- Use the full-text index
select * from t1 where to_tsvector('testzhcfg',upper(t1.name)) @@ to_tsquery('testzhcfg','(防火)') ;
Customize a Chinese word segment dictionary
Execute the following statements to customize a Chinese word segment dictionary
-- The segmentation result
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Insert a new word segment to the dictionary
insert into pg_ts_custom_word values ('保障房资');
-- Make the inserted word segment take effect
select zhprs_sync_dict_xdb();
-- End the connection
\c
-- Requery to obtain new segmentation results
SELECT to_tsquery('testzhcfg', '保障房资金压力');
- A maximum of 1 million custom word segments can be added. If the number of word segments exceed the limit, the word segments outside the limit are not processed. Ensure that the number of word segments is within this range. The custom and default word segmentation dictionaries take effect at the same time.
- Each word segment can be a maximum of 128 bytes in length. The section after the 128th byte will be truncated.
- After adding, deleting, or changing word segments, execute the
select zhprs_sync_dict_xdb();
statement and re-establish a connection to make the operation take effect.