OpenSearch: Text analyzers

Last Updated: Nov 24, 2023

Keyword analyzer

Description: This analyzer does not segment text into terms. It is suitable for exact searches. For example, it can be applied to tags, keywords, strings that need to be processed as a whole, and numbers.

Usage notes: This analyzer applies to fields of the LITERAL, INT, LITERAL_ARRAY, and INT_ARRAY types.

Examples:

If the value of a field is "chrysanthemum tea" in a document and the keyword analyzer is used, the document can be retrieved only when you search for "chrysanthemum tea".

General analyzer for Chinese

Description: This analyzer segments text into search units based on Chinese semantics. It is a general-purpose analyzer that suits most industries, and it belongs to the industry-specific analyzer category.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

Examples:

If the value of a field is "菊花茶" in a document and the general analyzer for Chinese is used, the document can be retrieved when you search for "菊花茶", "菊花", "茶", or "花茶".

E-commerce analyzer for Chinese

Description: This analyzer is applicable to the E-commerce industry.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

Examples:

If the value of a field is "大宝SOD蜜" in a document and the E-commerce analyzer for Chinese is used, the document can be retrieved when you search for "大宝", "sod", "sod蜜", "SOD蜜", or "蜜".

Single character analyzer for Chinese

Description: This analyzer segments text into individual Chinese characters and treats each English word or number as a single term. It is suitable for searches that are not based on Chinese semantics, such as searches for author names or store names.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

Examples:

If the value of a field is "菊花茶" in a document and the single character analyzer for Chinese is used, the document can be retrieved when you search for "菊花茶", "菊花", "茶", "花茶", "菊", "花", or "菊茶".
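The following Python sketch models the behavior described above. It is a minimal illustration, not the OpenSearch implementation: the function names are hypothetical, and the handling of punctuation is an assumption. Each Chinese character becomes its own term, each run of ASCII letters or digits stays whole, and a document matches when every query term appears among the field's terms.

```python
import re

# Runs of ASCII letters or digits stay whole; any other non-space
# character becomes its own term (assumption for punctuation).
_TOKEN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def single_char_terms(text: str) -> list[str]:
    """Toy model: one term per Chinese character, one per English word or number."""
    return _TOKEN.findall(text)

def matches(field_value: str, query: str) -> bool:
    """A document matches when every query term is one of the field's terms."""
    field_terms = set(single_char_terms(field_value))
    query_terms = single_char_terms(query)
    return bool(query_terms) and all(t in field_terms for t in query_terms)

print(single_char_terms("菊花茶"))  # ['菊', '花', '茶']
print(matches("菊花茶", "菊茶"))     # True
print(matches("hello", "he"))       # False
```

Because an English word is kept as one term, a partial word such as "he" does not match "hello" in this model.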

Fuzzy analyzer

Description: This analyzer supports searches by pinyin, by prefix or suffix, and by single character or single letter. Prefix and suffix searches are supported for letters, numbers, and pinyin, but not for Chinese text. This analyzer supports only fields whose size does not exceed 100 bytes. For more information, see Fuzzy searches.

Usage notes: This analyzer applies to fields of the SHORT_TEXT type.

Examples:

If the value of a field is "菊花茶" in a document and the fuzzy analyzer is used, the document can be retrieved when you search for "菊花茶", "菊花", "茶", "花茶", "菊", "花", "菊茶", "ju", "juhua", "juhuacha", "j", "jh", or "jhc". 
If the value of a field is the mobile number "138****5678" in a document and the fuzzy analyzer is used, the document can be retrieved when you search for "^138" or "5678$". "^138" instructs the system to search for all mobile numbers that start with "138". "5678$" instructs the system to search for all mobile numbers that end with "5678". 
If the value of a field is "OpenSearch" in a document and the fuzzy analyzer is used, the document can be retrieved when you search for a single letter that is contained in the value or a combination of consecutive letters that are contained in the value.
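The prefix and suffix syntax in the preceding examples can be sketched in Python as follows. This is a toy model under stated assumptions: the function name is hypothetical, pinyin matching is not modeled, and matching is case-sensitive here, which may differ from the actual behavior of OpenSearch.

```python
def fuzzy_match(field_value: str, query: str) -> bool:
    """Toy model of ^prefix, suffix$, and consecutive-character matching."""
    if query.startswith("^"):
        # "^138" matches values that start with "138".
        return len(query) > 1 and field_value.startswith(query[1:])
    if query.endswith("$"):
        # "5678$" matches values that end with "5678".
        return len(query) > 1 and field_value.endswith(query[:-1])
    # Without an anchor, any run of consecutive characters matches.
    return bool(query) and query in field_value

print(fuzzy_match("138****5678", "^138"))   # True
print(fuzzy_match("138****5678", "5678$"))  # True
print(fuzzy_match("138****5678", "^5678"))  # False
print(fuzzy_match("OpenSearch", "Sea"))     # True
```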

Word stemming analyzer for English

Description: This analyzer stems each English word to its root form. It is suitable for searches based on English semantics.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types. This analyzer does not support query analysis.

Examples:

If the value of a field is "英文分词器 english analyzer" in a document and the word stemming analyzer for English is used, the document can be retrieved when you search for "英文分词器", "english", "analyz", "analyzer", "analyzers", "analyze", "analyzed", or "analyzing". 
Note: An English text analyzer analyzes consecutive Chinese characters as one word.
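To illustrate why all of these query forms retrieve the same document, the following Python sketch reduces both the field value and the query to stems before comparing them. The suffix list and function names are hypothetical, and this toy suffix stripper is not the stemming algorithm that OpenSearch uses.

```python
# Toy suffix stripper for illustration only.
SUFFIXES = ("ers", "ing", "er", "ed", "es", "e", "s")  # checked in this order

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)]
    return w

def stemmed_terms(text: str) -> set[str]:
    """Split on whitespace, then stem each term. Consecutive Chinese
    characters contain no spaces, so they stay together as one term."""
    return {toy_stem(t) for t in text.split()}

def matches(field_value: str, query: str) -> bool:
    query_terms = stemmed_terms(query)
    return bool(query_terms) and query_terms <= stemmed_terms(field_value)

field = "英文分词器 english analyzer"
for q in ("analyz", "analyzers", "analyzed", "analyzing", "英文分词器"):
    print(q, matches(field, q))  # every line ends with True
```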

Unstemmed word analyzer for English

Description: This analyzer segments text based on spaces and punctuation marks. It is suitable for searches that are not based on English semantics, such as searches for book titles or author names.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types. This analyzer does not support query analysis.

Examples:

If the value of a field is "英文分词器 english analyzer" in a document and the unstemmed word analyzer for English is used, the document can be retrieved when you search for "英文分词器", "english", or "analyzer". 
Note: An English text analyzer analyzes consecutive Chinese characters as one word.
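Segmentation by spaces and punctuation marks can be sketched in Python as follows. The function name is hypothetical, and the set of punctuation marks (ASCII punctuation from the standard library) is an assumption. The actual analyzer may recognize a different set.

```python
import re
import string

# Split on whitespace and ASCII punctuation marks (assumed set).
_SEPARATORS = re.compile(f"[\\s{re.escape(string.punctuation)}]+")

def unstemmed_terms(text: str) -> list[str]:
    """Split text into terms without stemming; empty pieces are dropped."""
    return [t for t in _SEPARATORS.split(text) if t]

print(unstemmed_terms("英文分词器 english analyzer"))
# ['英文分词器', 'english', 'analyzer']
```

Because no stemming is applied, a query for "analyzers" would not match the term "analyzer" in this model.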

Analyzer for fine-grained analysis for English

Description: This analyzer segments text by search unit based on English semantics. It is an analyzer that applies to English text analysis in general industries.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "dataprocess" in a document and the analyzer for fine-grained analysis for English is used, the analysis result is "data process". In this case, the document can be retrieved when you search for "dataprocess", "data process", "data", or "process".
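One simple way to picture this kind of splitting is forward maximum matching against a vocabulary, sketched below in Python. This is only an illustration: the vocabulary and function names are hypothetical, and the actual analyzer is semantics-based and is not documented to work this way.

```python
VOCAB = {"data", "process"}  # hypothetical vocabulary

def split_compound(word: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy forward maximum matching; returns [word] if no full split exists."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # no vocabulary entry starts at position i
    return parts

def matches(field_value: str, query: str) -> bool:
    field_terms = {p for w in field_value.split() for p in split_compound(w)}
    query_terms = [p for w in query.split() for p in split_compound(w)]
    return bool(query_terms) and all(t in field_terms for t in query_terms)

print(split_compound("dataprocess"))  # ['data', 'process']
for q in ("dataprocess", "data process", "data", "process"):
    print(q, matches("dataprocess", q))  # every line ends with True
```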

Full pinyin spelling analyzer

Description: This analyzer allows you to search for Chinese characters in short text by using full pinyin or abbreviated pinyin, that is, the first letter of the pinyin of each character. It is suitable for scenarios such as searches for movie names or author names. When you search by full pinyin, you must enter the complete pinyin of each character. A partial syllable does not match.

Usage notes: This analyzer applies to fields of the SHORT_TEXT type.

Examples:

If the value of a field is "大内密探007" in a document and the full pinyin spelling analyzer is used, the document can be retrieved when you search for "d", "dn", "dnm", "dnmt", "dnmt007", "da", "danei", "daneimi", or "daneimitan". The document cannot be retrieved when you search for "an" or "anei".

Abbreviated pinyin spelling analyzer

Description: This analyzer allows you to search for Chinese characters in short text by using abbreviated pinyin, that is, the first letter of the pinyin of each character. It is suitable for scenarios such as searches for movie names or author names.

Usage notes: This analyzer applies to fields of the SHORT_TEXT type.

Examples:

If the value of a field is "大内密探007" in a document and the abbreviated pinyin spelling analyzer is used, the document can be retrieved when you search for "d", "dn", "dnm", "dnmt", "dnmt0", "dnmt007", "m", "mt", "mt007", or "007".

Simple analyzer

Description: This analyzer allows you to fully control searches. It is suitable for special scenarios in which other built-in analyzers cannot meet the requirements. Tab characters (\t) are used to separate terms in field values and search queries. Make sure that field values and search queries are segmented in the same way. Otherwise, documents cannot be retrieved.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types. This analyzer does not support query analysis.

Examples:

If the value of a field is "菊\t花茶\thao" in a document and the simple analyzer is used, the document can be retrieved only when you search for "菊", "花茶", "菊\t花茶", "花茶\thao", "菊\thao", or "菊\t花茶\thao".
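The requirement that field values and search queries be segmented in the same way can be illustrated with a short Python sketch. The function names are hypothetical, and the matching rule (every query term must be one of the field's terms) is inferred from the preceding example.

```python
def simple_terms(text: str) -> list[str]:
    """Split on tab characters only; no other segmentation is applied."""
    return [t for t in text.split("\t") if t]

def simple_match(field_value: str, query: str) -> bool:
    """A document matches when every query term is one of the field's terms."""
    field_terms = set(simple_terms(field_value))
    query_terms = simple_terms(query)
    return bool(query_terms) and all(t in field_terms for t in query_terms)

field = "菊\t花茶\thao"
print(simple_terms(field))                # ['菊', '花茶', 'hao']
print(simple_match(field, "菊\thao"))     # True
print(simple_match(field, "菊花茶"))       # False: the query is one unsplit term
```

The last call fails because the query "菊花茶" was not tab-separated in the same way as the field value, so no term matches.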

Numerical value analyzer

Description: This analyzer is suitable for searches based on time intervals or numerical value ranges.

Usage notes: This analyzer applies to fields of the INT and TIMESTAMP types.

Examples:

query=default:'OpenSearch' AND index:[number1,number2]
// In this example, index is the name of the index for which the numerical value analyzer is configured.
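A range clause of this shape can be assembled with a small Python helper. This is a sketch under stated assumptions: the function name and the index name price_index are hypothetical, and the helper neither escapes quotation marks in the keyword nor URL-encodes the result.

```python
def range_query(keyword: str, index_name: str, low: int, high: int) -> str:
    """Build a query string that combines a keyword clause with a range clause."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"query=default:'{keyword}' AND {index_name}:[{low},{high}]"

# price_index is a hypothetical index that uses the numerical value analyzer.
print(range_query("OpenSearch", "price_index", 10, 100))
# query=default:'OpenSearch' AND price_index:[10,100]
```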

Geo-location analyzer

Description: This analyzer is suitable for searches based on geographical locations.

Usage notes: This analyzer applies to fields of the GEO_POINT type.

Examples:

query=spatial_index:'circle(116.5806 39.99624, 1000)'
// Query geographical locations within a circle whose radius can be several kilometers.
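The circle clause can be built the same way in Python. The function name is hypothetical, and the helper only checks that the longitude and latitude are within valid ranges. It reproduces the argument order used in the preceding example (longitude, then latitude, then radius).

```python
def circle_query(index_name: str, lon: float, lat: float, radius: int) -> str:
    """Build a circle query: center as 'longitude latitude', then the radius."""
    if not -180 <= lon <= 180 or not -90 <= lat <= 90:
        raise ValueError("longitude or latitude out of range")
    if radius <= 0:
        raise ValueError("radius must be positive")
    return f"query={index_name}:'circle({lon} {lat}, {radius})'"

print(circle_query("spatial_index", 116.5806, 39.99624, 1000))
# query=spatial_index:'circle(116.5806 39.99624, 1000)'
```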

IT content analyzer

Description: This analyzer is suitable for technical content in the IT industry. This analyzer is an industry-specific analyzer. Compared with the general analyzer, the IT content analyzer segments IT-specific terms differently. For example, it keeps "c++" as a single term.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

Examples:

Text: c++数组使用注意事项
General analyzer: c ++数组使用注意事项
IT content analyzer: c++数组使用注意事项

General analyzer for E-commerce for Chinese

Description: This analyzer is applicable to the E-commerce industry. It is an industry-specific analyzer. With the industry experience accumulated over years and the natural language processing (NLP) technology of Alibaba DAMO Academy, OpenSearch provides query analysis capabilities dedicated to the E-commerce industry to resolve the pain points and meet the needs of the industry.

Usage notes:

This analyzer applies to fields of the TEXT type.

This analyzer is specific to exclusive applications of Industry-specific Enhanced Edition for E-commerce.

Examples:

Text: 小金管遮瑕膏
General analyzer: 小金管 遮瑕膏
General analyzer for E-commerce for Chinese: 小金管 遮瑕 膏

General analyzer for Thai

Description: This analyzer segments Thai text by search unit. It is a general analyzer that applies to Thai text analysis in general industries.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "แหล่งดึงดูดนักท่องเที่ยว" in a document and the general analyzer for Thai is used, the analysis result is "แหล่ง ดึง ดูด นักท่องเที่ยว". In this case, the document can be retrieved when you search for "นักท่องเที่ยว" or "แหล่งดึงดูดนักท่องเที่ยว".

E-commerce analyzer for Thai

Description: This analyzer is applicable to Thai text analysis in the E-commerce industry.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "หน้าจอโทรศัพท์" in a document and the E-commerce analyzer for Thai is used, the analysis result is "หน้าจอ โทรศัพท์". In this case, the document can be retrieved when you search for "หน้าจอโทรศัพท์", "หน้าจอ", or "โทรศัพท์".

General analyzer for Vietnamese

Description: This analyzer is applicable to Vietnamese text analysis in general industries.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

General analyzer for gaming

Description: This analyzer is applicable to the gaming industry.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications of Industry-specific Enhanced Edition for gaming.

Examples:

If the value of a field is "原神装备" in a document and the general analyzer for gaming is used, the analysis result is "原神 装备". In this case, the document can be retrieved when you search for "原神装备", "原神", or "装备".

General analyzer for E-commerce for English

Description: This analyzer is applicable to English text analysis in the E-commerce industry.

Usage notes: This analyzer applies only to fields of the TEXT type.

This analyzer is specific to exclusive applications of Industry-specific Enhanced Edition for E-commerce.

Character analyzer for Chinese

Description: This analyzer segments text based on Chinese characters, numbers, English letters, and punctuation marks. This analyzer is applicable to searches that are not based on Chinese semantics.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "开放搜索OpenSearch123." in a document and the character analyzer for Chinese is used, the document can be retrieved when you search for "开", "放", "搜", "索", "O", "p", "e", "n", "S", "e", "a", "r", "c", "h", or ".".

General analyzer for Korean

Description: This analyzer is applicable to Korean text analysis in general industries.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "인제군의교육" in a document and the general analyzer for Korean is used, the analysis result is "인제군 의 교육". In this case, the document can be retrieved when you search for "인제군의교육", "의", or "교육".

E-commerce analyzer for Korean

Description: This analyzer is applicable to Korean text analysis in the E-commerce industry.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "스포츠캐주얼신발" in a document and the E-commerce analyzer for Korean is used, the analysis result is "스포츠 캐주얼 신발". In this case, the document can be retrieved when you search for "스포츠", "캐주얼", or "신발".

General analyzer for Japanese

Description: This analyzer is applicable to Japanese text analysis in general industries.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "メキシコアグーチ" in a document and the general analyzer for Japanese is used, the analysis result is "メキシコ アグーチ". In this case, the document can be retrieved when you search for "メキシコ" or "アグーチ".

E-commerce analyzer for Japanese

Description: This analyzer is applicable to Japanese text analysis in the E-commerce industry.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

This analyzer is specific to exclusive applications.

Examples:

If the value of a field is "ラウンドネックスーツ" in a document and the E-commerce analyzer for Japanese is used, the analysis result is "ラウンド ネック スーツ". In this case, the document can be retrieved when you search for "ラウンド", "ネック", or "スーツ".

Custom text analyzer

Description: This analyzer combines an industry-specific analyzer, such as a general analyzer, an E-commerce analyzer, or the person name analyzer, with custom intervention entries. For more information, see Custom text analyzers.

Usage notes: This analyzer applies to fields of the TEXT and SHORT_TEXT types.

Test an analyzer

You can test an industry-specific analyzer or a custom analyzer to check its analysis result. Log on to the OpenSearch Industry Algorithm Edition console. In the left-side navigation pane, choose Search Algorithm Center > Retrieval Configuration. On the Retrieval Configuration page, click Analyzer Management in the left-side pane. On the Analyzer Management page, find the analyzer that you want to test and click Analysis Test in the Actions column. Test the analyzer in the Analyzer Effect Test panel. The following figure provides an example.

[Figure: example of the Analyzer Effect Test panel]

Scenarios

  • In scenarios in which searches are based on Chinese semantics, we recommend that you use a semantics-based analyzer for Chinese, such as the general analyzer for Chinese or the E-commerce analyzer for Chinese.

  • Short Chinese text searches and searches that are not based on Chinese semantics do not require strict ranking. In these scenarios, we recommend that you use the single character analyzer for Chinese to increase the number of documents that can be retrieved.

  • In search scenarios based on pinyin, use the fuzzy analyzer.

  • In English search scenarios, use the word stemming analyzer for English.

  • In some scenarios, you can use a semantics-based analyzer for Chinese and the single character analyzer for Chinese together to obtain better search results. For example, the following query string is used: query=title_index:'菊花茶' OR sws_title_index:'菊花茶'. The following fine sort expression is used: text_relevance(title)*5+field_proximity(sws_title). In this example, you can retrieve all documents that contain "xx菊xx花xx茶xx". In addition, documents that contain "菊花茶" are ranked first.
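The combined query and fine sort expression from the last scenario can be generated with a short Python sketch. The function names are hypothetical. The index names, field names, and the weight of 5 are taken from the example above.

```python
def combined_query(keyword: str, semantic_index: str, single_char_index: str) -> str:
    """Search the semantics-based index and the single character index together."""
    return f"query={semantic_index}:'{keyword}' OR {single_char_index}:'{keyword}'"

def fine_sort(semantic_field: str, single_char_field: str, weight: int = 5) -> str:
    """Weight relevance on the semantic field above proximity on the other field."""
    return f"text_relevance({semantic_field})*{weight}+field_proximity({single_char_field})"

print(combined_query("菊花茶", "title_index", "sws_title_index"))
# query=title_index:'菊花茶' OR sws_title_index:'菊花茶'
print(fine_sort("title", "sws_title"))
# text_relevance(title)*5+field_proximity(sws_title)
```

The OR clause widens recall through the single character index, while the weighted text_relevance term keeps documents that contain the full phrase at the top.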

Usage notes

  • Fields of the following types can be configured as index fields:

    INT, INT_ARRAY, TEXT, SHORT_TEXT, LITERAL, LITERAL_ARRAY, TIMESTAMP, and GEO_POINT

    Fields of the following types cannot be configured as index fields:

    FLOAT, FLOAT_ARRAY, DOUBLE, and DOUBLE_ARRAY

  • If the search result summary is configured for a field of the TEXT type, some terms in the extended search units, such as "菊花茶" in the preceding example, are not wrapped in the HTML tags that are used for highlighting.

  • The single character analyzer for Chinese considers a number or an English word as a single character. For example, if a document contains a field whose value is "hello world" and the single character analyzer for Chinese is used, the document can be retrieved when you search for "hello". However, the document cannot be retrieved when you search for "he". To allow the system to return documents when you search for a part of an English word, use the fuzzy analyzer.

  • By default, the primary key of the primary table in the application schema is configured as an index field, and the name of the index field is id. You cannot modify this index field.