chn_standard is the general-purpose Chinese text analyzer for Open Search Retrieval Engine Edition. It tokenizes Chinese text based on semantics and is suitable for general web content across all industries. Unlike basic character-splitting approaches, chn_standard produces both base search units and extended terms, improving search recall without requiring manual synonym configuration.
How it works
chn_standard splits text into search units, the minimum granularity used during analysis. In addition to base search units, the analyzer generates extended terms — semantically related expansions of a base unit — to broaden search recall.
Example
Input: 菊花茶
| Token | Type |
|---|---|
| 菊花 | Base search unit |
| 茶 | Base search unit |
| 花茶 | Extended term (derived from 茶) |
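
To illustrate why extended terms help recall, the tokens in the table above can be modeled as a small data structure. This is a minimal sketch, not the analyzer itself: the `Token` class, the `TOKENS` list, and the `matches` function are hypothetical names, and the token list is copied by hand from the example rather than produced by chn_standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    text: str
    kind: str  # "base" for a base search unit, "extended" for an extended term


# Hand-written from the example table; NOT output of the real analyzer.
TOKENS = [
    Token("菊花", "base"),
    Token("茶", "base"),
    Token("花茶", "extended"),
]


def matches(query_term: str, tokens: List[Token], use_extended: bool = True) -> bool:
    """Return True if query_term equals one of the indexed tokens.

    With use_extended=False, only base search units are considered.
    """
    return any(
        t.text == query_term and (use_extended or t.kind == "base")
        for t in tokens
    )
```

In this toy model, a query for 花茶 matches the document only when extended terms are indexed, which is the recall gain the section describes.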
Customize tokenization with a dictionary
To control how chn_standard tokenizes specific terms, add intervention entries to its dictionary.
An intervention entry is a medium-granularity entry. When a user searches for an intervention entry, the engine looks it up in the chn_standard.dict dictionary, then converts it into its constituent search units for matching.
Example
Add 搜索引擎 as an intervention entry. When a user searches for 搜索引擎, the engine:
1. Finds 搜索引擎 in the chn_standard.dict dictionary.
2. Converts it into two search units: 搜索 and 引擎.
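
The lookup-then-convert behavior can be sketched as a dictionary lookup. This is only an illustration under stated assumptions: `INTERVENTIONS` is a hypothetical in-memory stand-in for chn_standard.dict (the real file format is not shown here), and `to_search_units` is an invented helper. Returning the query unchanged for terms that are not intervention entries is a placeholder; the real analyzer would tokenize them semantically.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for chn_standard.dict.
INTERVENTIONS: Dict[str, List[str]] = {
    "搜索引擎": ["搜索", "引擎"],
}


def to_search_units(query: str, interventions: Dict[str, List[str]]) -> List[str]:
    """Return the search units used for matching a query term.

    If the query is an intervention entry, return its constituent search
    units. Otherwise return the query as a single unit (a placeholder for
    the analyzer's normal tokenization, which is not modeled here).
    """
    return list(interventions.get(query, [query]))
```

For example, `to_search_units("搜索引擎", INTERVENTIONS)` yields the two search units 搜索 and 引擎, mirroring the steps above.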
To add an intervention entry:
1. Go to Advanced settings and open the chn_standard.dict dictionary.
2. Add the term as an intervention entry.
3. Publish the modified configuration as a new version.
Constraints
- chn_standard applies only to fields of the TEXT data type.
- To enable it, set the analyzer field to chn_standard when configuring a schema.
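
As a rough sketch of these constraints, the check below flags fields that set chn_standard on a non-TEXT type. Only the analyzer setting and the TEXT data type come from this page. The key names `field_name` and `field_type`, the `INT32` type, and the `check_analyzer_fields` helper are assumptions made for illustration, so confirm the actual schema format in the console before relying on them.

```python
from typing import Dict, List


def check_analyzer_fields(fields: List[Dict[str, str]]) -> List[str]:
    """Return the names of fields that use chn_standard on a non-TEXT type."""
    return [
        f["field_name"]
        for f in fields
        if f.get("analyzer") == "chn_standard" and f.get("field_type") != "TEXT"
    ]


# Assumed field layout; key names other than "analyzer" are hypothetical.
FIELDS = [
    {"field_name": "title", "field_type": "TEXT", "analyzer": "chn_standard"},
    {"field_name": "price", "field_type": "INT32"},
]
```

An empty result from `check_analyzer_fields(FIELDS)` means every field that uses chn_standard is of the TEXT type.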