This topic describes how to use zhparser to perform Chinese word segmentation during a full-text search in AnalyticDB for PostgreSQL.

Important
  • Only AnalyticDB for PostgreSQL V6.0 supports full-text search.
  • To install or upgrade extensions on an instance that runs V6.3.8.9 or later, submit a ticket.

    For more information about how to view the minor version of an instance, see View the minor engine version.

Full-text search overview

By default, PostgreSQL segments text based on spaces and punctuation marks, so it cannot segment Chinese text, which is written without delimiters between words. AnalyticDB for PostgreSQL can be integrated with the zhparser extension to support Chinese word segmentation.
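
For example, a minimal sketch of the default behavior (the sample strings are arbitrary):

  SELECT to_tsvector('english', 'I love full-text search');
  -- Tokens are split on spaces and punctuation marks.

  SELECT to_tsvector('english', '我爱全文检索');
  -- The Chinese string contains no such delimiters, so it is
  -- typically returned as a single token.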

In most cases, you can use one of the following methods to perform a full-text search:

  • Query data in a table:
    SELECT name FROM <table...>
    WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');
  • Create a Generalized Inverted Index (GIN) to accelerate full-text searches (a combined example follows this list):
    CREATE INDEX <idx_...> ON <table...> USING gin(to_tsvector('english', name));
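
The following end-to-end sketch combines both methods. The messages table, its data, and the index name are hypothetical:

  -- Hypothetical sample table and data.
  CREATE TABLE messages (id int, name text);
  INSERT INTO messages VALUES (1, 'my best friend'), (2, 'a casual acquaintance');

  -- Index the tsvector expression so that full-text queries can use the index.
  CREATE INDEX idx_messages_name ON messages USING gin(to_tsvector('english', name));

  -- Returns 'my best friend' because 'friend' matches one of its tokens.
  SELECT name FROM messages
  WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');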

Configure zhparser

  1. Execute the following statement to install zhparser:
    CREATE EXTENSION zhparser;
    Note The account that is used to install zhparser must be granted the rds_superuser permission.
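
    After the installation, you can confirm that the extension is available by querying the pg_extension catalog:

    SELECT extname, extversion FROM pg_extension WHERE extname = 'zhparser';
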
  2. Execute the following statement to create a text search configuration named zh_cn that uses zhparser as its parser:
    CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);

    After the configuration is created, you can run the \dF command to view text search configurations or the \dFp command to view text search parsers. Custom dictionaries are not supported.
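
    If you are not using psql, a minimal check through the system catalogs (sketch):

    -- Confirm that the zh_cn configuration exists.
    SELECT cfgname FROM pg_ts_config WHERE cfgname='zh_cn';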

  3. Query the token types that are used for word segmentation.
    • Execute the following statement to query the token types that zhparser supports:
      SELECT ts_token_type('zhparser');

      The following result is returned:

                ts_token_type
      ---------------------------------
       (97,a,"adjective")
       (98,b,"differentiation")
       (99,c,"conjunction")
       (100,d,"adverb")
       (101,e,"exclamation")
       (102,f,"position")
       (103,g,"root")
       (104,h,"head")
       (105,i,"idiom")
       (106,j,"abbreviation")
       (107,k,"tail")
       (108,l,"tmp")
       (109,m,"numeral")
       (110,n,"noun")
       (111,o,"onomatopoeia")
       (112,p,"prepositional")
       (113,q,"quantity")
       (114,r,"pronoun")
       (115,s,"space")
       (116,t,"time")
       (117,u,"auxiliary")
       (118,v,"verb")
       (119,w,"punctuation")
       (120,x,"unknown")
       (121,y,"modal")
       (122,z,"status")
      (26 rows)
                                  
    • Execute the following statement to query the token mappings of the zh_cn configuration:
      SELECT * FROM pg_ts_config_map
       WHERE mapcfg=(SELECT oid FROM pg_ts_config WHERE cfgname='zh_cn');

      Because zh_cn was created with only a parser specified, this query returns no rows until you add token mappings in the next step.
  4. Add or remove token types. You can verify the changes with the ts_debug function, as shown at the end of this step.
    • Add token types.

      Execute the following statement to map nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), and temporary idioms (l) to the built-in simple dictionary so that these token types are used for word segmentation:

      ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l WITH simple;
    • Remove token types.

      Execute the following statement to remove nouns, verbs, adjectives, idioms, exclamations, and temporary idioms from the token types that are used for word segmentation:

      ALTER TEXT SEARCH CONFIGURATION zh_cn DROP MAPPING IF EXISTS FOR n,v,a,i,e,l;
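
    After you change the mappings, you can use the ts_debug function to inspect how a sample string is tokenized and which dictionary handles each token. The sample string is arbitrary:

    SELECT * FROM ts_debug('zh_cn', '白垩纪是地质时代');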
  5. Use the following functions to test the Chinese word segmentation feature during a full-text search (see the example after this list):
    • to_tsvector
    • to_tsquery
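
    For example, the following statements test segmentation with the zh_cn configuration. The sample sentence is arbitrary, and the exact tokens depend on the zhparser dictionary and on the token mappings that you configured in the previous step:

    SELECT to_tsvector('zh_cn', '白垩纪是地质时代中生代的最后一个纪');

    SELECT to_tsquery('zh_cn', '白垩纪');

    -- Should return t if the word is segmented consistently on both sides
    -- and its token types are mapped.
    SELECT to_tsvector('zh_cn', '白垩纪是地质时代中生代的最后一个纪')
        @@ to_tsquery('zh_cn', '白垩纪');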
