This topic explains how to enable Chinese tokenization and customize a Chinese tokenization dictionary for PolarDB for PostgreSQL (Compatible with Oracle).
Enable Chinese tokenization
You can run the following commands to enable Chinese tokenization:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
--Configure optional parameters.
alter role all set zhparser.multi_short=on;
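--A setting made with ALTER ROLE ... SET applies to sessions started afterward.
--To try the parameter in the current session (a sketch; assumes zhparser.multi_short can be set at the session level):
SET zhparser.multi_short = on;
SHOW zhparser.multi_short;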
--Run a simple test.
SELECT * FROM ts_parse('zhparser', 'hello world! In 2010, the construction of subsidized housing was fully launched nationwide. Central and local governments increased construction and investment efforts for subsidized housing. In 2011, subsidized housing entered a larger phase of construction. At the end of last year, Jiang Weixin, the Party Secretary and Minister of Housing and Urban-Rural Development, stated at the National Housing and Urban-Rural Development Work Conference that the government would continue to advance affordable housing projects.');
SELECT to_tsvector('testzhcfg','"Although the number of new subsidized housing projects has decreased this year, the actual annual scale of projects under construction and completed will exceed previous years. The corresponding demand for funds will also set a record," said Chen Guoqiang. In his view, the funding pressure for subsidized housing construction in 2012 will be more severe than in 2011.');
SELECT to_tsquery('testzhcfg', 'subsidized & housing & funding & pressure');
You can create and use a full-text index based on tokenization as follows:
--Create a full-text index for the name field of the t1 table.
create index idx_t1 on t1 using gin (to_tsvector('testzhcfg',upper(name) ));
--Use the full-text index. The expression in the WHERE clause must match the indexed expression.
select * from t1 where to_tsvector('testzhcfg',upper(t1.name)) @@ to_tsquery('testzhcfg','(fire & prevention)') ;
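To check that a query can actually use such an index, the following minimal sketch may help. It assumes the testzhcfg configuration created earlier. The table t1_demo, its columns, the sample rows, and the index name idx_t1_demo are hypothetical and used only for illustration, because the source does not show the definition of t1.

```sql
-- Hypothetical table and sample data (not from the source).
CREATE TABLE t1_demo (id int, name varchar(200));
INSERT INTO t1_demo VALUES
  (1, 'fire prevention manual'),
  (2, 'flood control plan');

-- Expression index: the query must use the same expression to benefit from it.
CREATE INDEX idx_t1_demo ON t1_demo
  USING gin (to_tsvector('testzhcfg', upper(name)));

-- On a very small table the planner usually prefers a sequential scan,
-- so discourage it for this session only in order to see the index in the plan.
SET enable_seqscan = off;
EXPLAIN
SELECT * FROM t1_demo
WHERE to_tsvector('testzhcfg', upper(name)) @@ to_tsquery('testzhcfg', 'fire & prevention');
RESET enable_seqscan;
```

In to_tsquery, multiple search terms must be joined with an operator such as & (AND) or | (OR). Terms separated only by spaces cause a syntax error.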