ApsaraDB RDS: Use the pg_jieba extension

Last Updated: Dec 26, 2023

This topic describes how to use the pg_jieba extension to run Chinese full-text searches on an ApsaraDB RDS for PostgreSQL instance.

Prerequisites

Methods to use the pg_jieba extension

  • Create the pg_jieba extension.

    CREATE EXTENSION pg_jieba;
    Note

    Only privileged accounts are authorized to run the preceding command.

  • Delete the pg_jieba extension.

    DROP EXTENSION pg_jieba;
    Note

    Only privileged accounts are authorized to run the preceding command.

  • Example 1:

    SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
                                                     to_tsvector
    --------------------------------------------------------------------------------------------------------------
     '中国科学院':5 '于':4 '后':8 '在':9 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 ',':7
    (1 row)
  • Example 2:

    SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
                                            to_tsvector
    -------------------------------------------------------------------------------------------
     '专家':11 '主任':5 '也':6 '云计算':8 '创新':3 '办':4 '方面':9 '是':2,7 '李小福':1 '的':10
    (1 row)
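
The to_tsvector output in the preceding examples can be combined with the standard PostgreSQL full-text search operators. The following is a minimal sketch, assuming a hypothetical table named articles with a body column. Only the jiebacfg configuration comes from this topic; the table, column, and index names are illustrative.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE articles (
    id   serial PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO articles (body) VALUES
    ('小明硕士毕业于中国科学院计算所,后在日本京都大学深造'),
    ('李小福是创新办主任也是云计算方面的专家');

-- GIN index on the segmented text so that searches do not have to re-segment every row.
CREATE INDEX articles_body_fts_idx
    ON articles USING gin (to_tsvector('jiebacfg', body));

-- Find rows whose segmented text contains the term 云计算.
SELECT id, body
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ plainto_tsquery('jiebacfg', '云计算');
```

For the planner to consider the index, the expression in the WHERE clause must match the indexed expression, including the configuration name.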

Extended features

The extended features that are available depend on the version of the pg_jieba extension that you have installed.

  • Execute the following SQL statement to query the version of the pg_jieba extension:

    SELECT * FROM pg_available_extensions WHERE name='pg_jieba';
  • View the default versions of the pg_jieba extension installed on RDS instances that run different major and minor engine versions.

    Major engine version                                             Minor engine version    Default version of the extension
    RDS PostgreSQL 15                                                20230630 or later       1.2.0
    RDS PostgreSQL 15                                                20221030 to 20230530    1.1.0
    RDS PostgreSQL 14                                                20230630 or later       1.2.0
    RDS PostgreSQL 14                                                20220730 to 20230530    1.1.0
    PostgreSQL 10, PostgreSQL 11, PostgreSQL 12, or PostgreSQL 13    20211130 or later       1.1.0
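
To compare the version installed in the current database with the default version listed above, you can narrow the query to the relevant columns. This sketch uses only the standard pg_available_extensions view, in which installed_version is NULL if the extension has not been created in the current database.

```sql
-- default_version: the version that CREATE EXTENSION installs on this instance.
-- installed_version: the version created in the current database, or NULL.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_jieba';
```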

Extended features in version 1.1.0

  • The pg_jieba extension allows you to configure multiple custom dictionaries and switch between the dictionaries.

    -- Insert words into the first custom dictionary, which is represented by 0.
    -- If you omit the dictionary number and the weight, the word is inserted into the first custom dictionary with a weight of 10.
    INSERT INTO jieba_user_dict VALUES ('阿里云');
    
    INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);
    
    
    -- Use the dictionary predefined in the pg_jieba extension to segment Chinese text.
    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
    to_tsvector
    ------------------------------------------------------
    'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
    (1 row)
    
    -- Switch to the first custom dictionary. The argument of jieba_load_user_dict() is the sequence number of the custom dictionary to load.
    SELECT jieba_load_user_dict(0);
    jieba_load_user_dict
    ----------------------
    
    (1 row)
    
    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
    to_tsvector
    --------------------------------------------
    'zth':1 '一个':5 '研发工程师':6 '阿里云':3
    (1 row)
  • The pg_jieba extension allows you to view the text segmentation results based on offsets.

    SELECT * FROM to_tsvector('jiebacfg_pos', 'zth是阿里云的一个研发工程师');
                                         to_tsvector
    --------------------------------------------------------------------------------------
     'zth:0':1 '一个:8':6 '云:6':4 '工程师:12':8 '是:3':2 '的:7':5 '研发:10':7 '阿里:4':3
    (1 row)
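
As a sketch of switching between custom dictionaries, the following statements add a word to a second custom dictionary and load it. This assumes that sequence number 1 identifies the second custom dictionary, by analogy with 0 for the first; the inserted word and the sample sentence are illustrative. ts_debug is a standard PostgreSQL function and is used here only to inspect how each token is produced.

```sql
-- Assumption: 1 is the sequence number of the second custom dictionary.
-- The weight of 10 mirrors the weight used in the earlier example.
INSERT INTO jieba_user_dict VALUES ('云计算专家', 1, 10);

-- Load the second custom dictionary.
SELECT jieba_load_user_dict(1);

-- Inspect the tokens that the jiebacfg configuration produces for a sample sentence.
SELECT alias, token, lexemes
FROM ts_debug('jiebacfg', '李小福是云计算专家');

-- Switch back to the first custom dictionary.
SELECT jieba_load_user_dict(0);
```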

Extended features in version 1.2.0

  • The jieba_load_user_dict() function is optimized to reduce its CPU utilization and memory usage.

  • A new parameter is added to the jieba_load_user_dict() function to specify whether to use custom dictionaries during retrieval.

    • Syntax

      jieba_load_user_dict(parameter1, parameter2)
    • Parameter description

      Parameter     Description
      parameter1    Specifies the sequence number of the custom dictionary that you want to load.
      parameter2    Specifies whether to load the default dictionary.
                    • 0: loads the default dictionary.
                    • 1: does not load the default dictionary.

    • Examples

      INSERT INTO jieba_user_dict VALUES ('阿里云');
      INSERT 0 1
      INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);
      INSERT 0 1
      
      -- The first 0 indicates the sequence number of the custom dictionary, and the second 0 indicates that the default dictionary is loaded.
      SELECT jieba_load_user_dict(0,0);
      jieba_load_user_dict
      ----------------------
      
      (1 row)
      SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
      to_tsvector
      --------------------------------------------
      'zth':1 '一个':5 '研发工程师':6 '阿里云':3
      (1 row)
      
      SELECT jieba_load_user_dict(0,1);
      jieba_load_user_dict
      ----------------------
      
      (1 row)
      
      SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                           to_tsvector
      ------------------------------------------------------
      'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
      (1 row)
      Note

      If the jieba_user_dict table or the jieba_load_user_dict() function does not exist, you must update the minor engine version of your RDS instance to 20220730 and reinstall the extension.

      1. For more information about how to update the minor engine version, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.

      2. Execute the following statements to reinstall the extension:

        DROP EXTENSION pg_jieba;
        CREATE EXTENSION pg_jieba;
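
Before you reinstall the extension, you can check whether the objects mentioned in the note exist. This is a minimal sketch that uses only standard PostgreSQL catalog objects. The table check assumes that jieba_user_dict, if present, is visible on the current search_path.

```sql
-- Returns NULL if the jieba_user_dict table cannot be found.
SELECT to_regclass('jieba_user_dict') AS user_dict_table;

-- Lists every signature of jieba_load_user_dict(); no rows means that the function does not exist.
SELECT p.oid::regprocedure AS signature
FROM pg_proc AS p
WHERE p.proname = 'jieba_load_user_dict';
```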

References

For more information about how to use the pg_jieba extension, see the official pg_jieba documentation.