Overview

This topic describes how to use Chinese word segmentation and custom dictionaries in ApsaraDB RDS for PostgreSQL.

 

Detail

Alibaba Cloud reminds you that:

  • Before you perform operations that may cause risks, such as modifying instance configurations or data, we recommend that you check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
  • You can modify the configurations and data of instances including but not limited to Elastic Compute Service (ECS) and Relational Database Service (RDS) instances. Before the modification, we recommend that you create snapshots or enable RDS log backup.
  • If you have authorized or submitted security information such as the logon account and password in the Alibaba Cloud Management console, we recommend that you modify such information in a timely manner.

The following sections describe how to enable Chinese word segmentation and how to use custom dictionaries.

 

Enable the zhparser plug-in

Run the following SQL statements to enable Chinese word segmentation:
Note:
  • The zhparser word breaker cannot be installed in the pg_catalog schema.
  • You must have the write permission on the target schema.
  • The alter role all set zhparser.multi_short=on; statement requires superuser privileges, which ApsaraDB RDS for PostgreSQL does not grant to users. To have the setting applied for all roles, submit a ticket. If you do not submit a ticket, run set zhparser.multi_short=on; to apply the setting at the session level.

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
alter role all set zhparser.multi_short=on;
SELECT * FROM ts_parse('zhparser', 'hello world! In 2010, the construction of safeguard houses was fully launched nationwide. From the central government to the local governments, the construction and investment of safeguard houses have been increased. In 2011, affordable housing entered a larger stage of construction. Jiang Weixin, Party secretary and minister of the Ministry of Housing and Urban-Rural Development, said at the National Housing and Urban-Rural Construction Work Conference at the end of last year that it is necessary to continue to promote the construction of affordable housing projects. ') ;
SELECT to_tsvector('testzhcfg', '"Although the number of new construction of affordable housing projects has been lowered this year, the actual scale of construction and completion in the year will exceed that in the past years, and the corresponding capital demand will set a historical record." Chen Guoqiang said. In his view, compared with 2011, the pressure on the security housing construction in 2012 will be more severe.');
SELECT to_tsquery('testzhcfg', 'Financial pressure of affordable housing is');
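The session-level alternative mentioned in the note can be sketched as follows, for the case where the alter role statement is rejected because the account lacks superuser privileges. This is a minimal sketch: it assumes the zhparser extension has already been created in the current database, and the sample string 保障房资金压力 is only an illustration. A setting made this way lasts until the session ends, so it has to be repeated after each reconnect.

```sql
-- Enable multi_short for the current session only (no superuser needed).
SET zhparser.multi_short = on;

-- Confirm the value that is in effect for this session.
SHOW zhparser.multi_short;

-- Re-run a segmentation to compare the tokens with and without the setting.
SELECT * FROM ts_parse('zhparser', '保障房资金压力');
```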
To create and use a full-text index that is based on word segmentation, run the following statements. The first SQL statement creates a full-text index on the name field of the t1 table. The second SQL statement runs a query that uses the full-text index.
create index idx_t1 on t1 using gin (to_tsvector('testzhcfg',upper(name)));
select * from t1 where to_tsvector('testzhcfg',upper(t1.name)) @@ to_tsquery('testzhcfg','(fire protection)');
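An end-to-end version of the indexing steps might look like the following. The t1 table definition, the sample rows, and the search term 防火 are assumptions made for illustration, since the original does not show them. The sketch uses the testzhcfg configuration created earlier. The expression in the where clause has to match the indexed expression, including upper(), for the planner to consider the GIN index.

```sql
-- Hypothetical table; only the name column comes from the text above.
CREATE TABLE t1 (id serial PRIMARY KEY, name text);

INSERT INTO t1 (name) VALUES ('防火墙配置'), ('保障房建设');

-- Expression index on the segmented, upper-cased name.
CREATE INDEX idx_t1 ON t1 USING gin (to_tsvector('testzhcfg', upper(name)));

-- The left-hand side repeats the indexed expression so the index can be used.
EXPLAIN
SELECT * FROM t1
WHERE to_tsvector('testzhcfg', upper(name)) @@ to_tsquery('testzhcfg', '防火');
```

On a table this small the planner will usually still choose a sequential scan. The index becomes relevant with realistic data volumes.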

 

Customize a Chinese word segmentation dictionary

The following example shows how to customize a Chinese word segmentation dictionary. The first SQL statement returns the word segmentation result before the change. The second SQL statement inserts a new word into the dictionary. The third SQL statement makes the new word take effect. The fourth SQL statement runs the query again to obtain the new word segmentation result.
SELECT to_tsquery('testzhcfg', 'Financial pressure of affordable housing is');
insert into pg_ts_custom_word values ('affordable housing');
select zhprs_sync_dict_xdb();
SELECT to_tsquery('testzhcfg', 'Financial pressure of affordable housing is');
Note: You can customize a Chinese word segmentation dictionary only when the kernel version is 20160801 or later. You can run the show rds_release_date; SQL statement to query the kernel version.
When you use custom word segmentation, note the following points:
  • A maximum of one million custom words can be added. If the number of custom words exceeds the limit, the words beyond the limit are not processed. Ensure that the number of custom words stays within this limit. The custom and default word segmentation dictionaries take effect at the same time.
  • Each word segment can be a maximum of 128 bytes in length. The section after the 128th byte will be truncated.
  • After you add, delete, or modify custom words, you must run the select zhprs_sync_dict_xdb(); SQL statement and then re-establish the connection for the changes to take effect.
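The limits above can be checked before and after adding words. The following is a sketch only. It assumes a UTF8-encoded database, where a Chinese character typically occupies 3 bytes, so 128 bytes is roughly 42 characters. It uses 保障房资 as a hypothetical sample word, and it assumes that synced changes are picked up by new connections.

```sql
-- Kernel version; custom dictionaries need 20160801 or later.
SHOW rds_release_date;

-- Byte length of a candidate word; it must not exceed 128.
-- In a UTF8 database this returns 12 (4 characters x 3 bytes).
SELECT octet_length('保障房资'::text);

-- Current number of custom words; the limit is one million.
SELECT count(*) FROM pg_ts_custom_word;

-- Add the word, then sync the dictionary.
INSERT INTO pg_ts_custom_word VALUES ('保障房资');
SELECT zhprs_sync_dict_xdb();

-- Open a new connection before re-running to_tsquery
-- (assumption: the synced change applies to new sessions).
```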
 

Application scope

  • ApsaraDB RDS for PostgreSQL