Analysis mechanism and usage of the general-purpose Chinese text analyzer

chn_standard is the general-purpose Chinese text analyzer for OpenSearch Retrieval Engine Edition. It tokenizes Chinese text based on semantics and is suitable for general web content in any industry. Unlike basic character-splitting approaches, chn_standard produces both base search units and extended terms, which improves search recall without manual synonym configuration.

How it works

chn_standard splits text into search units, the smallest units used during analysis. In addition to base search units, the analyzer generates extended terms, which are semantically related expansions of a base unit that broaden search recall.

Example

Input: juhuacha (Chinese for "chrysanthemum tea")

Token	Type
`juhua (chrysanthemum)`	Base search unit
`cha (tea)`	Base search unit
`huacha (flower tea)`	Extended term (derived from `cha`)
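
The relationship between base search units and extended terms in the table above can be sketched as a toy model. This is not the chn_standard algorithm: the analysis result is hard-coded from the table, the names Token, analyze, and index_terms are hypothetical, and the sketch assumes that extended terms are indexed alongside base search units.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Token:
    text: str
    kind: str  # "base" or "extended"


# Hard-coded analysis result copied from the example table.
# The real analyzer derives these tokens semantically.
ANALYSIS = {
    "juhuacha": [
        Token("juhua", "base"),
        Token("cha", "base"),
        Token("huacha", "extended"),
    ],
}


def analyze(text: str) -> List[Token]:
    """Return the tokens for a known input, or an empty list (hypothetical helper)."""
    return ANALYSIS.get(text, [])


def index_terms(text: str) -> Set[str]:
    """Collect every term a document would be findable by (assumption: base + extended)."""
    return {token.text for token in analyze(text)}


print(sorted(index_terms("juhuacha")))  # ['cha', 'huacha', 'juhua']
```

Under this assumption, a query for huacha would match a document that contains juhuacha even though no synonym was configured by hand.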

Customize tokenization with a dictionary

To control how chn_standard tokenizes specific terms, add intervention entries to its dictionary.

An intervention entry is a medium-granularity entry. When a user searches for an intervention entry, the engine looks it up in the chn_standard.dict dictionary, then converts it into its constituent search units for matching.

Example

Add "search engine" as an intervention entry. When a user searches for "search engine", the engine:

1. Finds "search engine" in the chn_standard.dict dictionary.
2. Converts it into two search units: "search" and "engine".
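
The lookup described above can be illustrated with a minimal sketch. The in-memory mapping below stands in for chn_standard.dict, whose actual file format is not described here, and the function name to_search_units and the fallback behavior are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for chn_standard.dict:
# intervention entry -> its constituent search units.
INTERVENTION_DICT: Dict[str, List[str]] = {
    "search engine": ["search", "engine"],
}


def to_search_units(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Convert a query into search units using intervention entries.

    If the query is an intervention entry, return its configured units.
    Otherwise return the query unchanged; this fallback is a placeholder,
    since the real analyzer would segment the text semantically.
    """
    units = dictionary.get(query)
    if units is not None:
        return list(units)  # copy so callers cannot mutate the dictionary
    return [query]


print(to_search_units("search engine", INTERVENTION_DICT))  # ['search', 'engine']
```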

To add an intervention entry:

1. Go to Advanced settings and open the chn_standard.dict dictionary.
2. Add the term as an intervention entry.
3. Publish the modified configuration as a new version.

Constraints

chn_standard applies only to fields of the TEXT data type.
To enable it, set the analyzer field to chn_standard when configuring a schema.
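
The two constraints above can be checked before a schema is published. The following is a sketch only: the key names field_name and field_type and the list-of-dicts layout are assumptions rather than the documented schema format, and validate_analyzer_fields is a hypothetical helper. Only the analyzer field and the TEXT type come from this page.

```python
from typing import Dict, List

# Hypothetical schema fragment; only "analyzer" and the TEXT type come from this page.
FIELDS: List[Dict[str, str]] = [
    {"field_name": "title", "field_type": "TEXT", "analyzer": "chn_standard"},
    {"field_name": "category", "field_type": "STRING", "analyzer": "chn_standard"},
    {"field_name": "price", "field_type": "INT32"},
]


def validate_analyzer_fields(fields: List[Dict[str, str]]) -> List[str]:
    """Return the names of fields that set chn_standard on a non-TEXT type."""
    invalid = []
    for field in fields:
        uses_analyzer = field.get("analyzer") == "chn_standard"
        if uses_analyzer and field.get("field_type") != "TEXT":
            invalid.append(field["field_name"])
    return invalid


print(validate_analyzer_fields(FIELDS))  # ['category']
```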