
Application schema and index schema

Last Updated: Sep 09, 2021

Application schema

After you upload data to OpenSearch, the data is stored in offline data tables. To facilitate data uploads, OpenSearch allows you to create multiple data tables based on your business requirements and provides data processing plug-ins. If you create multiple data tables, you must associate them by using related fields. After the data in the data tables is processed, the tables are joined to form an index table. The index table defines search attributes and can be used by the search engine for index building and data searches.

The following sections describe the fields in a data table and the fields in an index table.

Fields in a data table

Data tables receive the data that you upload. Different data processing plug-ins can process different field types. For more information about the limits on field values, see the Limits on fields section in Limits. If a field value is out of the valid range, the value overflows; if a field value exceeds the maximum length, the value is truncated. Therefore, select the correct field types.

INT: 64-bit integer.
INT_ARRAY: Array of 64-bit integers.
FLOAT: Single-precision floating-point number.
FLOAT_ARRAY: Array of single-precision floating-point numbers.
DOUBLE: Double-precision floating-point number.
DOUBLE_ARRAY: Array of double-precision floating-point numbers.
LITERAL: String constant. Supports only exact match.
LITERAL_ARRAY: Array of string constants. Each element supports only exact match.
SHORT_TEXT: Short text. A field of this type cannot exceed 100 bytes in length and supports multiple analysis methods.
TEXT: Long text. A field of this type supports multiple analysis methods.
TIMESTAMP: 64-bit unsigned integer that indicates a timestamp in milliseconds.
GEO_POINT: String constant that indicates a pair of latitude and longitude in the "Latitude value Longitude value" format.

Notes on reserved fields:

  • The following fields are reserved fields and you cannot use them: service_id, ops_app_name, inter_timestamp, index_name, pk, ops_version, and ha_reserved_timestamp.

Notes on fields of array types

  • If you create a field of an array type in an application, you can associate the field with fields of the VARCHAR or STRING type when you configure field mappings for a data source. In addition, you can use a data processing plug-in to process the source fields. For more information, see Use data processing plug-ins.

  • If you use an API operation or OpenSearch SDKs to upload a field of an array type, upload the field as an array instead of a string. Example: String[] literal_array = {"Alibaba Cloud","OpenSearch"};

Notes on fields of timestamp types

  • Fields of the INT or TIMESTAMP type can be mapped to fields of the DATETIME or TIMESTAMP type in a data source, and the field values are automatically converted to timestamps in milliseconds. You can use the range function to retrieve search results by time range, as shown in the following sample query. For more information, see Range queries.
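
The following sample query is a minimal sketch of a search by time range. The index name create_time_index and the millisecond values are assumptions for illustration:

# Retrieves documents whose create_time_index values fall within the specified range, in milliseconds.
query=default:'OpenSearch' AND create_time_index:[1609430400000,1612108800000]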

Index schema

An index schema consists of index fields and attribute fields. Index fields are used for data searches; fields of text types are analyzed before they are used to build indexes. Attribute fields are used for statistics, sorting, filtering, and aggregation.

Fields of the following types can be set as index fields:

INT, INT_ARRAY, TEXT, SHORT_TEXT, LITERAL, LITERAL_ARRAY, TIMESTAMP, GEO_POINT

Fields of the following types cannot be set as index fields:

FLOAT, FLOAT_ARRAY, DOUBLE, DOUBLE_ARRAY

Fields of the following types can be set as attribute fields:

INT, INT_ARRAY, LITERAL, LITERAL_ARRAY, FLOAT, FLOAT_ARRAY, DOUBLE, DOUBLE_ARRAY, TIMESTAMP, GEO_POINT

Fields of the following types cannot be set as attribute fields:

TEXT and SHORT_TEXT
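
The following sample query is a minimal sketch of how index fields and attribute fields are typically used together. The field names title_index and price are assumptions for illustration: the index field is searched in the query clause, and the attribute field is referenced in the filter clause. Attribute fields can be referenced in the sort and aggregate clauses in the same way.

# Searches the title_index index field and filters the results by the price attribute field.
query=title_index:'OpenSearch'&&filter=price>100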

Composite index

A composite index is created on multiple fields of text types. The terms of all these fields are built into one index, so a search on a composite index can match search terms across the fields. This differs from a search that uses multiple single-field indexes and the OR logical operator.

In the following example, two indexes and a composite index are created for an application:

title_index: the index on the title field.

body_index: the index on the body field.

union_index: the composite index on the title and body fields.

Content of a document in the application:

id:123456,title:Open,body:Search

The following code provides two search examples:

# When two indexes and the OR logical operator are used, the document cannot be retrieved.
query=title_index:'OpenSearch' OR body_index:'OpenSearch'

# When the composite index is used, the document can be retrieved.
query=union_index:'OpenSearch' 

Note: The fields in a composite index must be of the same type and cannot be mixed. For example, you cannot create a composite index that includes both SHORT_TEXT and TEXT fields.

Fields in an index table

For more information about fields of the INT or FLOAT type, see Limits. The following part describes different types of analyzers.

Built-in analyzers

Search effects depend on how text is segmented. OpenSearch supports multiple analysis methods. You can select analyzers based on your business scenarios.

The following part describes search effects and scenarios of different analyzers for different types of fields.

Analyzer for keywords

This analyzer does not segment field values and applies to scenarios that require exact match. For example, you can use this analyzer to search for tags, keywords, non-analyzed strings, and numerical values. This analyzer applies to fields of the LITERAL or INT type.

If a field value in a document is "chrysanthemum tea", the document can be retrieved only when you search for "chrysanthemum tea".
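
The following sample query is a minimal sketch, assuming an index named tag_index that is configured with the analyzer for keywords:

# The document is retrieved only when the search term exactly matches the field value.
query=tag_index:'chrysanthemum tea'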

Common analyzer for Chinese text

This analyzer segments Chinese text based on search units and Chinese semantic features, and applies to all industries. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "菊花茶", the document can be retrieved when you search for "菊花茶", "菊花", "茶", or "花茶".

E-commerce industry analyzer for Chinese text

This analyzer applies to the E-commerce industry. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "大宝SOD蜜", the document can be retrieved when you search for "大宝", "sod", "sod蜜", "SOD蜜", or "蜜".

Analyzer for single Chinese characters

This analyzer segments Chinese text into single characters or phrases and applies to the searches of non-semantic Chinese text, such as novelist names and store names. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "菊花茶", the document can be retrieved when you search for "菊花茶", "菊花", "茶", "花茶", "菊", "花", or "菊茶".

Analyzer for fuzzy searches

This analyzer applies only to fields of the SHORT_TEXT type. You can use this analyzer for pinyin searches, prefix and suffix searches of numbers, and searches of single characters or words.

Letters, numbers, and pinyin support prefix and suffix matching. Chinese characters do not.

The fields to be segmented cannot exceed 100 bytes in length. For more information about the description and usage notes of fuzzy searches, see Fuzzy searches.

If a field value in a document is "菊花茶", the document can be retrieved when you search for "菊花茶", "菊花", "茶", "花茶", "菊", "花", "菊茶", "ju", "juhua", "juhuacha", "j", "jh", or "jhc". 
If a field value in a document is a phone number "13812345678", the document can be retrieved when you search for "^138" or "5678$". 
If a field value in a document is "OpenSearch", the document can be retrieved when you search for a single letter or a combination of consecutive letters from the field value.
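
The following sample queries are a minimal sketch of prefix and suffix matching, assuming an index named phone_index that is configured with the analyzer for fuzzy searches:

# Retrieves documents whose field values start with 138.
query=phone_index:'^138'

# Retrieves documents whose field values end with 5678.
query=phone_index:'5678$'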

Analyzer for stemming English words

This analyzer applies to the searches of semantic English text. By default, the analyzer stems each English word to its root form. For example, the analyzer converts a plural noun to a singular noun or converts a verb to its original form. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "english analyzer", the document can be retrieved when you search for "english", "analyz", "analyzer", "analyzers", "analyze", "analyzed", or "analyzing". 
Note: In an analyzer for English text, consecutive Chinese characters are regarded as a single word.

Analyzer that retains roots of English words

This analyzer applies to the searches of English book titles and English names, and segments text by space and punctuation. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "english analyzer", the document can be retrieved when you search for "english" or "analyzer". 
Note: In an analyzer for English text, consecutive Chinese characters are regarded as a single word.

Analyzer for full pinyin spelling

This analyzer applies only to fields of the SHORT_TEXT type. You can use this analyzer to search for Chinese characters in short text by first letters and full pinyin spelling. This analyzer applies to the searches by simple pinyin spelling and full pinyin spelling, such as the searches of personal names and film names. If you search for a Chinese character by full pinyin spelling, you cannot use partial pinyin spelling.

If a field value in a document is "大内密探007", the document can be retrieved when you search for "d", "dn", "dnm", "dnmt", "dnmt007", "da", "danei", "daneimi", or "daneimitan". If you search for "an" or "anei", the document cannot be retrieved.

Analyzer for simple pinyin spelling

This analyzer applies only to fields of the SHORT_TEXT type. You can use this analyzer to search for Chinese characters in short text by first letters. This analyzer applies to the searches by simple pinyin spelling, such as the searches of personal names and film names.

If a field value in a document is "大内密探007", the document can be retrieved when you search for "d", "dn", "dnm", "dnmt", "dnmt0", "dnmt007", "m", "mt", "mt007", or "007".

Analyzer for specific scenarios

This analyzer applies to specific search scenarios that cannot be implemented by the built-in analyzers of OpenSearch. You can fully control search effects by using this analyzer. When you push documents and when you search for documents, use tab characters (\t) to divide field values or search words. Make sure that you segment the text in the same way in both scenarios. Otherwise, documents cannot be retrieved. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

If a field value in a document is "chrysanthemum\tscented tea\thao", the document can be retrieved when you search for "chrysanthemum", "scented tea", "chrysanthemum\tscented tea", "scented tea\thao", "chrysanthemum\thao", or "chrysanthemum\tscented tea\thao".
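
The following sample query is a minimal sketch, assuming an index named custom_index that is configured with this analyzer. The search words must be segmented with tab characters in the same way as the pushed field value:

# Retrieves the document because the tab-separated search words match segments of the field value.
query=custom_index:'chrysanthemum\tscented tea'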

Analyzer for searches by range

This analyzer applies to searches by time range and numerical range. The fields to be segmented must be of the INT or TIMESTAMP type.

Sample QUERY clause: query=default:'OpenSearch' AND index:[number1,number2]
// index indicates the name of the index for which the analyzer for searches by range is configured.

For more information about the syntax of searches by range, see Range queries.

Analyzer for searches by the range of geographical locations

This analyzer applies to the searches by the range of geographical locations. The fields to be segmented must be of the GEO_POINT type.

Sample QUERY clause: query=spatial_index:'circle(116.5806 39.99624, 1000)'
// Queries documents whose locations fall within a circle that is centered at the specified point and has the specified radius.

For more information about the syntax of searches by the range of geographical locations, see Range queries.

Analyzer for IT-related text

This analyzer is industry-specific and applies to technical text from the IT industry. Compared with the common analyzer, this analyzer generates different results when it segments IT-related text. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

Original text: usage notes on c++ array
Common analyzer: usage notes on c ++ array
Analyzer for IT-related text: usage notes on c++ array

Custom analyzer

You can create a custom analyzer based on an industry-specific analyzer, such as a common analyzer, an e-commerce analyzer, or a personal name analyzer, together with custom intervention entries. For more information, see Custom analyzer. The fields to be segmented must be of the TEXT or SHORT_TEXT type.

Scenarios

  • If you want to search for semantic Chinese text, we recommend that you use an analyzer for Chinese text.

  • If you want to search for short text or non-semantic Chinese text, we recommend that you use the analyzer for single Chinese characters to increase the number of retrieved documents. Non-semantic Chinese text means that the order of the characters in the text does not matter.

  • If you want to search for data by pinyin, we recommend that you use the analyzer for fuzzy searches.

  • If you want to search for English text, we recommend that you use the analyzer for stemming English words.

  • In specific scenarios, better search results can be achieved by using the common analyzer for Chinese text together with the analyzer for single Chinese characters. For example, if you use the query=title_index:'菊花茶' OR sws_title_index:'菊花茶' clause together with the fine sort expression text_relevance(title)*5+field_proximity(sws_title), documents that contain "xx菊xx花xx茶xx" can be retrieved, and documents that contain "菊花茶" are ranked first.

Usage notes

  • If you configure search result summaries for fields of the TEXT type, phrases that are matched by extended search units, such as "花茶" in the preceding example, are not highlighted in red.

  • The analyzer for single Chinese characters considers a number or an English word as a character. If a field value in a document is "hello world", the document can be retrieved when you search for "hello" but cannot be retrieved when you search for "he". If you want to retrieve documents by letters in English words, use the analyzer for fuzzy searches.

  • By default, the primary key of the primary table that is defined in the application schema is set as an index field, and the field name is "id". You cannot change the default setting.
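
The following sample query is a minimal sketch of a search by primary key that uses the default id index. The primary key value 123456 is an assumption for illustration:

# Retrieves the document whose primary key is 123456.
query=id:'123456'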