This topic describes how to use the zhparser extension to process Chinese text and how to customize a Chinese text search dictionary in PolarDB for PostgreSQL (Compatible with Oracle) clusters.
Use the zhparser extension
Install the zhparser extension, create a text search configuration that uses it, and run a test. Sample statements:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- Configure optional parameters. This setting enables compound segmentation of short words and applies to new sessions.
ALTER ROLE ALL SET zhparser.multi_short = on;
--Perform a test.
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大了保障房的建设和投入力度。2011年,保障房进入了更大规模的建设阶段。住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。');
SELECT to_tsvector('testzhcfg','“今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。”陈国强说。 在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。 ');
SELECT to_tsquery('testzhcfg', '保障房资金压力');
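To see which token types the parser emits and how a configuration handles each token, the built-in PostgreSQL functions ts_token_type and ts_debug can help. The following is a minimal sketch, assuming that the zhparser extension and the testzhcfg configuration from the preceding statements exist. The sample phrase is only illustrative.

```sql
-- List the token types that the zhparser parser can emit.
-- The aliases returned here are the names accepted by ADD MAPPING FOR.
SELECT tokid, alias, description FROM ts_token_type('zhparser');

-- Show, token by token, the assigned token type and the dictionary that handled it.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('testzhcfg', '保障房资金压力');
```

Tokens whose type has no mapping in the configuration are dropped from the to_tsvector output, so this check is useful when expected words do not appear in search results.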
Create a full-text index and use it to search Chinese text. Sample statements:
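-- Create a full-text index on the name field in the t1 table.
-- The configuration name must match an existing configuration, such as testzhcfg created earlier.
CREATE INDEX idx_t1 ON t1 USING gin (to_tsvector('testzhcfg', upper(name)));
-- Use the full-text index to perform a search. The expression must match the index expression.
SELECT * FROM t1 WHERE to_tsvector('testzhcfg', upper(t1.name)) @@ to_tsquery('testzhcfg', '(防火)');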
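The index statements assume that a table named t1 with a name column already exists. The following self-contained sketch shows the same pattern end to end. The table t1_demo, its columns, and the sample rows are hypothetical and only for illustration. The sketch assumes the testzhcfg configuration created earlier.

```sql
-- Hypothetical demo table; not part of the original example.
CREATE TABLE t1_demo (id serial PRIMARY KEY, name text);
INSERT INTO t1_demo (name) VALUES ('防火门'), ('保障房建设'), ('hello world');

-- Expression index: a query can use it only if it repeats the same expression.
CREATE INDEX idx_t1_demo ON t1_demo USING gin (to_tsvector('testzhcfg', name));

-- On a tiny table the planner usually prefers a sequential scan,
-- so discourage it for this session to make the index visible in the plan.
SET enable_seqscan = off;
EXPLAIN SELECT * FROM t1_demo
WHERE to_tsvector('testzhcfg', name) @@ to_tsquery('testzhcfg', '保障房');
RESET enable_seqscan;
```

If the plan still shows a sequential scan, check that the configuration name and the indexed expression in the WHERE clause match the index definition exactly.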
Customize a Chinese text search dictionary
Add custom tokens to the Chinese text search dictionary and verify the result. Sample statements:
-- View the initial tokenization results.
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Add new tokens to the Chinese text search dictionary.
insert into pg_ts_custom_word values ('保障房资');
-- Synchronize the dictionary so that the new tokens take effect.
select zhprs_sync_dict_xdb();
-- Reconnect to the database (\c is a psql meta-command).
\c
-- View the updated tokenization results.
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Guidelines for using a Chinese text search dictionary:
A dictionary supports up to one million custom tokens. Tokens added beyond this limit are not processed, so make sure that the limit is not exceeded. Custom tokens and default tokens are used together during text search.
Each token can be up to 128 bytes in length. Any bytes after the 128th byte are truncated.
After you add, delete, or change a token, you must execute the select zhprs_sync_dict_xdb(); statement and reconnect to the database for the changes to take effect.
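The limits in these guidelines can be checked before you add tokens. The following is a minimal sketch that uses only standard SQL functions. It assumes a UTF8 database, and the candidate token is only an example.

```sql
-- Number of custom tokens currently stored (limit: one million).
SELECT count(*) AS custom_tokens FROM pg_ts_custom_word;

-- Byte length of a candidate token in the server encoding (limit: 128 bytes).
SELECT octet_length('保障房资') AS token_bytes;
```

In UTF-8, a common Chinese character occupies 3 bytes, so the 128-byte limit corresponds to roughly 42 Chinese characters.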