This topic describes how to use zhparser to perform Chinese word segmentation during a full-text search in AnalyticDB for PostgreSQL.

Note Only AnalyticDB for PostgreSQL V6.0 supports full-text search.

Full-text search overview

By default, PostgreSQL segments text on spaces and punctuation marks. This does not work for Chinese, which does not place spaces between words, so PostgreSQL cannot segment Chinese text on its own. AnalyticDB for PostgreSQL can be integrated with zhparser to support Chinese word segmentation.

In most cases, you can use one of the following methods to perform a full-text search:

  • Query data in a table:
    SELECT name FROM <table...>
    WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');
  • Create a Generalized Inverted Index (GIN) index:
    CREATE INDEX <idx_...> ON <table...> USING gin(to_tsvector('english', name));
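The two methods are typically combined: the GIN index accelerates queries whose WHERE clause contains the same to_tsvector expression as the index definition. The following is a minimal sketch; the table and index names are hypothetical.

    -- Hypothetical table and index names, for illustration only.
    CREATE TABLE articles (id serial PRIMARY KEY, name text);
    CREATE INDEX idx_articles_name ON articles
        USING gin(to_tsvector('english', name));

    -- The index can be used because the indexed expression
    -- matches the expression in the predicate.
    SELECT name FROM articles
    WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');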

Configure zhparser

  1. Execute the following statement to install zhparser:
    CREATE EXTENSION zhparser;
    Note The account that is used to install zhparser must be granted the rds_superuser permission.
  2. Execute the following statement to create a text search configuration named zh_cn that uses zhparser as the parser:
    CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);

     After the configuration is complete, you can run the \dF or \dFp meta-command in psql to view the configuration. Custom dictionaries are not supported.

  3. Query the token types that are used for word segmentation.
    • Execute the following statement to query the token types that zhparser supports:
      SELECT ts_token_type('zhparser');

      The following result is returned:

                ts_token_type
      ---------------------------------
       (97,a,"adjective,形容词")
       (98,b,"differentiation,区别词")
       (99,c,"conjunction,连词")
       (100,d,"adverb,副词")
       (101,e,"exclamation,感叹词")
       (102,f,"position,方位词")
       (103,g,"root,词根")
       (104,h,"head,前连接成分")
       (105,i,"idiom,成语")
       (106,j,"abbreviation,简称")
       (107,k,"tail,后连接成分")
       (108,l,"tmp,习用语")
       (109,m,"numeral,数词")
       (110,n,"noun,名词")
       (111,o,"onomatopoeia,拟声词")
       (112,p,"prepositional,介词")
       (113,q,"quantity,量词")
       (114,r,"pronoun,代词")
       (115,s,"space,处所词")
       (116,t,"time,时语素")
       (117,u,"auxiliary,助词")
       (118,v,"verb,动词")
       (119,w,"punctuation,标点符号")
       (120,x,"unknown,未知词")
       (121,y,"modal,语气词")
       (122,z,"status,状态词")
      (26 rows)
                                  
    • Execute the following statement to query the configuration of zh_cn:
      SELECT * FROM pg_ts_config_map 
       WHERE mapcfg=(SELECT oid FROM pg_ts_config WHERE cfgname='zh_cn');
  4. Add or remove token types.
    • Add token types.

      Execute the following statement to add nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), and idiomatic phrases (l) as token types that are used for word segmentation:

      ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l WITH simple;
    • Remove token types.

      Execute the following statement to remove nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), and idiomatic phrases (l) from the token types that are used for word segmentation:

      ALTER TEXT SEARCH CONFIGURATION zh_cn DROP MAPPING IF EXISTS FOR n,v,a,i,e,l;
  5. Use the following two functions to test the Chinese word segmentation feature during a full-text search:
    • to_tsvector:
      SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');

      The following result is returned:

                    to_tsvector
      ---------------------------------------
       '全文检索':4 '方法':2 '有':1 '进行':3
      (1 row)
    • to_tsquery:
      SELECT to_tsquery('zh_cn', '有两种方法进行全文检索');

      The following result is returned:

                   to_tsquery
      -------------------------------------
       '有' & '方法' & '进行' & '全文检索'
      (1 row)
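You can also inspect how the zh_cn configuration processes each token by using the standard PostgreSQL ts_debug function. Token types that have no mapping in the configuration produce no lexemes:

    SELECT alias, token, lexemes
    FROM ts_debug('zh_cn', '有两种方法进行全文检索');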

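Putting the preceding steps together, a table that stores Chinese text can be indexed and queried with the zh_cn configuration in the same way as the English examples at the beginning of this topic. The following is a sketch; the table and index names are hypothetical, and zh_cn must already be configured as described above.

    -- Hypothetical table, for illustration only.
    CREATE TABLE docs (id serial PRIMARY KEY, body text);
    CREATE INDEX idx_docs_body ON docs
        USING gin(to_tsvector('zh_cn', body));

    INSERT INTO docs (body) VALUES ('有两种方法进行全文检索');

    SELECT body FROM docs
    WHERE to_tsvector('zh_cn', body) @@ to_tsquery('zh_cn', '全文检索');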