Types of inverted indexes - Alibaba Cloud Documentation Center

PACK indexes

Introduction to PACK indexes

A PACK index is a multi-field index that is created on fields of the TEXT type. Unlike a TEXT index, which covers a single field, a PACK index merges multiple TEXT fields into one index for retrieval. A PACK index can also store section information, so that queries can return the section in which each search term occurs and the related information. You can use truncation, high-frequency word bitmaps, and tf bitmaps to improve retrieval performance.

Item	df	ttf	tf	fieldmap	section information	position	positionpayload	docpayload	termpayload
Supported or not	Supported	Optional	Optional	Not supported	Optional	Optional	Optional	Optional	Optional

Sample code for configuring a PACK index

{
    "index_name": "pack_index",
    "index_type": "PACK",
    "term_payload_flag": 1,
    "doc_payload_flag": 1,
    "position_list_flag": 1,
    "position_payload_flag": 1,
    "term_frequency_flag": 1,
    "term_frequency_bitmap": 1,
    "has_section_attribute": false,
    "section_attribute_config":
    {
        "has_section_weight": true,
        "has_field_id": true,
        "compress_type": "uniq|equal"
    },
    "high_frequency_dictionary": "bitmap1",
    "high_frequency_adaptive_dictionary": "df",
    "high_frequency_term_posting_type": "both",
    "index_fields":
    [
        {"field_name": "subject", "boost": 200000},
        {"field_name": "company_name", "boost": 600000},
        {"field_name": "feature_value", "boost": 600000},
        {"field_name": "summary", "boost": 600000}
    ],
    "index_analyzer": "taobao_analyzer",
    "file_compress": "simple_compress1",
    "format_version_id": 1
}

index_name: the name of the inverted index. Query statements must reference the index by this name. The index_name parameter cannot be set to summary.
index_type: the type of the index. Set the value to PACK.
term_payload_flag: specifies whether to store term_payload information (the payload of each term). The value 1 indicates that the information is stored, and the value 0 indicates that it is not stored. The values 1 and 0 have the same meaning for the other flag parameters below. By default, term_payload is not stored.
doc_payload_flag: specifies whether to store doc_payload information (the payload of each term in each document). By default, doc_payload is stored.
position_list_flag: specifies whether to store position information. By default, the position information is stored. This parameter depends on the term_frequency_flag parameter: you can set position_list_flag to 1 only if term_frequency_flag is set to 1.
position_payload_flag: specifies whether to store position_payload information (the payload of the term at each position in each document). By default, position_payload is not stored. This parameter depends on the position_list_flag parameter: you can set position_payload_flag to 1 only if position_list_flag is set to 1.
term_frequency_flag: specifies whether to store the term frequency. By default, the term frequency is stored.
term_frequency_bitmap: specifies whether to store the term frequency as a bitmap. The default value is 0. This parameter depends on the term_frequency_flag parameter: you can set term_frequency_bitmap to 1 only if term_frequency_flag is set to 1.
has_section_attribute: specifies whether to store section_attribute information. The default value is true. Text relevance can be calculated only if section_attribute information is stored.
section_attribute_config: the configuration of section_attribute. The configuration takes effect only when the has_section_attribute parameter is set to true. If section_attribute_config is not specified, the default values are used for all of the following parameters.
has_field_id: specifies whether to store field_id information. The default value is true. If field_id information is not stored, the query process treats all sections as belonging to the first field among the index fields.
has_section_weight: specifies whether to store weight information. The default value is true.
compress_type: the compression method for section_attribute. By default, compression is disabled. This parameter is configured in the same way as the compress_type parameter for multivalued attributes.
high_frequency_dictionary: the vocabulary that is used when you create a bitmap index. To create a bitmap index, specify this parameter. If you do not need to create a bitmap index, leave this parameter empty.
high_frequency_adaptive_dictionary: the name of the rule that is used to create an adaptive bitmap index. To create an adaptive bitmap index, specify this parameter. If you do not need to create an adaptive bitmap index, leave this parameter empty.
high_frequency_term_posting_type: the type of the bitmap index. If you set the parameter for creating a bitmap index or an adaptive bitmap index, you can set this parameter to both or bitmap to configure the type of the bitmap index. If you set this parameter to both, a bitmap index and an inverted index are created. If you set this parameter to bitmap, only a bitmap index is created. The default value is bitmap.
index_fields: the fields on which you want to create an index. These fields must be of the TEXT type and use the same analyzer.
boost: the weight of the field in an index. You can specify the name of the field on which you want to create the index and the boost value.
index_analyzer: the analyzer that is used during the query. If you specify this parameter, the specified analyzer converts query text into terms, and it can differ from the analyzers that are configured for the fields. If you do not specify this parameter, the analyzers that are configured for the fields are used, and these analyzers must be the same. An analyzer can be specified only for an index on fields of the TEXT type.
file_compress: the file compression method. OpenSearch Vector Search Edition V3.9.1 or later allows you to configure the file compression method. Set the file_compress parameter to an alias of a compression method to enable file compression for an inverted index. If you do not specify this parameter, files are not compressed.
format_version_id: the version ID of an inverted index. The default value is 0 and indicates the format of the inverted index used when IndexLib is migrated to the AIOS of the benchmark version. For OpenSearch Vector Search Edition V3.9.1 or later, you can set this parameter to 1. OpenSearch Vector Search Edition V3.9.1 or later supports a series of techniques to optimize the storage formats of the inverted index, such as variable byte compression, the optimized PForDelta algorithm (NewPForDelta), and the dictInline storage for continuous document IDs.
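
The flag dependencies in the parameter list above can be checked before a schema is submitted. The following Python snippet is a minimal sketch and is not part of OpenSearch Vector Search Edition. The function name check_flag_dependencies is hypothetical, and the default values it assumes for term_frequency_flag and position_list_flag are the ones stated in the descriptions above.

```python
# Hypothetical helper: checks the flag dependencies described in the parameter list.
# The defaults below are taken from the parameter descriptions (an assumption of this sketch).
REQUIRED_FLAG_DEFAULTS = {
    "term_frequency_flag": 1,
    "position_list_flag": 1,
}

# Each pair is (dependent flag, flag that must also be 1).
DEPENDENCIES = [
    ("position_list_flag", "term_frequency_flag"),
    ("position_payload_flag", "position_list_flag"),
    ("term_frequency_bitmap", "term_frequency_flag"),
]


def check_flag_dependencies(index_config):
    """Return a list of violations; the list is empty if the flags are consistent.

    Only flags that are explicitly set to 1 are checked as dependents.
    """
    errors = []
    for dependent, required in DEPENDENCIES:
        if index_config.get(dependent) != 1:
            continue
        required_value = index_config.get(required, REQUIRED_FLAG_DEFAULTS[required])
        if required_value != 1:
            errors.append(f"{dependent} is 1 but {required} is not 1")
    return errors
```

For example, a configuration that sets position_payload_flag to 1 and position_list_flag to 0 yields one violation, whereas the sample PACK configuration above yields none.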

Important

The index_name parameter cannot be set to summary.
The fields in the PACK index must be of the TEXT type.
If no analyzer is specified for the index, the analyzers specified for all fields in the same index must be the same.
The order of the fields in index_fields must be the same as the order in which the fields are defined in the schema.
If you set both the high_frequency_dictionary and high_frequency_adaptive_dictionary parameters, a bitmap index is created by using the high-frequency words specified in the high_frequency_dictionary parameter, regardless of the rule specified in the high_frequency_adaptive_dictionary parameter.
You can specify the format_version_id parameter for all inverted indexes that are created on non-primary key columns. Before the offline build starts to produce indexes in a new format, upgrade the online service of OpenSearch Vector Search Edition to a version that supports this format. Otherwise, the loading fails. After the upgrade, the online service can still read indexes in the old format.
You can create a PACK index on a maximum of 32 fields.
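
The notes above can be turned into a simple pre-flight check. The following Python snippet is a sketch under stated assumptions: check_pack_index and effective_bitmap_source are hypothetical helper names, and field_types and field_analyzers are hypothetical dictionaries that you would derive from the field definitions of your own schema.

```python
MAX_PACK_FIELDS = 32  # limit stated in the notes above


def check_pack_index(index_config, field_types, field_analyzers):
    """Check a PACK index configuration against the notes above.

    field_types and field_analyzers map a field name to its type and analyzer.
    Both are hypothetical inputs built from the schema's field definitions.
    """
    errors = []
    if index_config.get("index_name") == "summary":
        errors.append("index_name cannot be set to summary")
    names = [field["field_name"] for field in index_config.get("index_fields", [])]
    if len(names) > MAX_PACK_FIELDS:
        errors.append(f"a PACK index supports at most {MAX_PACK_FIELDS} fields")
    non_text = [name for name in names if field_types.get(name) != "TEXT"]
    if non_text:
        errors.append(f"fields must be of the TEXT type: {non_text}")
    if not index_config.get("index_analyzer"):
        # Without an index-level analyzer, all fields must share one analyzer.
        if len({field_analyzers.get(name) for name in names}) > 1:
            errors.append("fields use different analyzers and index_analyzer is not set")
    return errors


def effective_bitmap_source(index_config):
    """Return the parameter that drives the bitmap index, or None if none is set."""
    if index_config.get("high_frequency_dictionary"):
        return "high_frequency_dictionary"  # takes precedence over the adaptive rule
    if index_config.get("high_frequency_adaptive_dictionary"):
        return "high_frequency_adaptive_dictionary"
    return None
```

The second helper only mirrors the precedence rule described above: when both dictionary parameters are set, the explicit high-frequency dictionary is used.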

TEXT indexes

Introduction to TEXT indexes

A TEXT index is a single-field index that is created on a field of the TEXT type. An analyzer is used to convert text into multiple terms. A separate posting list is created for each term. You can use truncation, high-frequency word bitmaps, and tf bitmaps to improve retrieval performance.

Item	df	ttf	tf	fieldmap	section information	position	positionpayload	docpayload	termpayload
Supported or not	Supported	Optional	Optional	Not supported	Not supported	Optional	Optional	Optional	Optional

Sample code for configuring a TEXT index

{
    "index_name": "text_index",
    "index_type": "TEXT",
    "term_payload_flag": 1,
    "doc_payload_flag": 1,
    "position_payload_flag": 1,
    "position_list_flag": 1,
    "term_frequency_flag": 1,
    "index_fields": "title",
    "file_compress": "simple_compress1"
}

The following parameters have the same meaning in the configurations of TEXT and PACK indexes: index_name, index_type, term_payload_flag, doc_payload_flag, position_payload_flag, position_list_flag, term_frequency_flag, and file_compress. The exception is that the index_type parameter must be set to TEXT and the index_fields parameter supports only one field in the configuration of a TEXT index.

Important

The index_name parameter cannot be set to summary.

NUMBER indexes

Introduction to NUMBER indexes

A NUMBER index is a single-field index. It is an inverted index that is created for integer data of the INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, and UINT64 types. The field on which a NUMBER index is created can be a multivalued field. A separate posting list is created for each value in the multivalued field. A NUMBER index can store the following items.

Item	df	ttf	tf	fieldmap	section information	position	positionpayload	docpayload	termpayload
Supported or not	Supported	Optional	Optional	Not supported	Not supported	Not supported	Not supported	Optional	Optional

The following table describes the methods that can be used to improve the performance of NUMBER indexes.

Method	Description
Truncation	A separate inverted index is created for a subset of high-quality documents based on your configuration. This index is searched first, which prevents the system from retrieving unnecessary documents. In the main retrieval scenarios, this can double retrieval performance. For more information, see Cluster configuration.
High-frequency word bitmap	You can use bitmaps to store common high-frequency words. This helps reduce the space that is consumed by the index and improve retrieval performance. If you use bitmaps to store high-frequency words, only items ttf, df, termpayload, and docid can be retrieved. You can use bitmaps to store high-frequency words by configuring a high-frequency dictionary and an adaptive high-frequency dictionary.
tf bitmap	You can use bitmaps to store the term frequency information of each term in each document. You can use bitmaps to store the term frequency information of terms that have high document frequency. This way, no inverted index information is lost.

Sample code for configuring a NUMBER index

The following sample code provides an example on how to configure a NUMBER index in the schema.json file:

{
    "index_name": "number_index",
    "index_type": "NUMBER",
    "term_payload_flag": 0,
    "doc_payload_flag": 0,
    "term_frequency_flag": 0,
    "index_fields": "number_field",
    "file_compress": "simple_compress1"
}

The following parameters have the same meaning in the configurations of NUMBER and PACK indexes: index_name, index_type, term_payload_flag, doc_payload_flag, term_frequency_flag, and file_compress. The exception is that index_type must be set to NUMBER, and the index_fields parameter supports only one field of the INTEGER type in the configuration of the NUMBER index.

Best practice: To reduce the index size, we recommend that you set term_payload_flag, doc_payload_flag, and term_frequency_flag to 0.

Important

The index_name parameter cannot be set to summary.
A NUMBER index can be created on a multivalued integer field. In this case, a separate posting list is created for each value.
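
To illustrate what "a separate posting list for each value" means for a multivalued field, the following Python snippet builds toy posting lists in memory. It is a conceptual sketch only: the function name build_posting_lists is hypothetical, and the real on-disk format of a NUMBER index is not shown here.

```python
from collections import defaultdict


def build_posting_lists(docs):
    """Map each value to the sorted list of document IDs that contain it.

    docs maps a document ID to the list of values of a multivalued integer field.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Each value is recorded once per document; term frequency is not kept,
        # which matches setting term_frequency_flag to 0.
        for value in sorted(set(docs[doc_id])):
            postings[value].append(doc_id)
    return dict(postings)


# Three documents with a multivalued integer field.
example = build_posting_lists({0: [1, 2], 1: [2], 2: [2, 3, 3]})
```

In this toy model, the document frequency (df) of a value is the length of its posting list.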

STRING indexes

Introduction to STRING indexes

A STRING index is a single-field index. It is an inverted index that is created for data of the STRING type. Text is not converted into terms for fields of the STRING type. Each string value is used as a separate term for which a posting list is created. The field on which a STRING index is created can be a multivalued field, in which multiple values are separated by delimiters. A posting list is created for each string value. You can use truncation, high-frequency word bitmaps, and tf bitmaps to improve retrieval performance.

Item	df	ttf	tf	fieldmap	section information	position	positionpayload	docpayload	termpayload
Supported or not	Supported	Optional	Optional	Not supported	Not supported	Not supported	Not supported	Optional	Optional

Sample code for configuring a STRING index

{
    "index_name": "string_index",
    "index_type": "STRING",
    "term_payload_flag": 1,
    "doc_payload_flag": 1,
    "term_frequency_flag": 1,
    "index_fields": "user_name",
    "file_compress": "simple_compress1"
}

The following parameters have the same meaning in the configurations of STRING and PACK indexes: index_name, index_type, term_payload_flag, doc_payload_flag, term_frequency_flag, and file_compress. The exception is that index_type must be set to STRING, and the index_fields parameter supports only one field of the STRING type in the configuration of the STRING index. This field can be a multivalued field.

Best practice: To reduce the index size, we recommend that you set term_payload_flag, doc_payload_flag, and term_frequency_flag to 0.

Important

The index_name parameter cannot be set to summary.
A STRING index can be created on a multivalued string field. In this case, a separate posting list is created for each value.

PRIMARYKEY64 indexes and PRIMARYKEY128 indexes

Introduction to PRIMARYKEY64 indexes and PRIMARYKEY128 indexes

A PRIMARYKEY index is the primary key index of a document. You can configure only one PRIMARYKEY index. A PRIMARYKEY index supports all types of fields. It can store mappings between the hash values of index fields and the document IDs for removing duplicates. You can obtain the hash value for each document.

Item	df	ttf	tf	fieldmap	section information	position	positionpayload	docpayload	termpayload
Supported or not	Supported	Not supported	Not supported	Not supported	Not supported	Not supported	Not supported	Not supported	Not supported

Sample code for configuring a PRIMARYKEY64 index or a PRIMARYKEY128 index

{
    "index_name": "primary_key_index",
    "index_type": "PRIMARYKEY64",
    "index_fields": "product_id",
    "has_primary_key_attribute": true,
    "is_primary_key_sorted": true,
    "pk_storage_type": "sort_array",
    "pk_hash_type": "default_hash"
}

index_name: the name of an index.
index_type: the type of an index. You can set the index_type parameter to PRIMARYKEY64 or PRIMARYKEY128. The numbers 64 and 128 indicate the number of bits in the hash values. In most cases, 64 bits are sufficient.
index_fields: the field on which you want to create an index. Only one field is supported. All field types are supported. We recommend that you set this parameter to the field that corresponds to the primary key.
has_primary_key_attribute: specifies whether to store the primary key attribute, which is the mapping between the document ID and the hash value of the primary key. Set this parameter to true if duplicates need to be removed in the query or the hash value of the primary key needs to be returned in the phase-1 query. The default value is false.
is_primary_key_sorted: specifies whether the PRIMARYKEY index is optimized. If this parameter is set to true, the indexes that are dumped out are sorted by primary key. This accelerates the query. The default value is false.
pk_storage_type: specifies how the primary key is stored. Valid values are sort_array, hash_table, and block_array. The default value is sort_array.
sort_array: saves space.
hash_table: provides better performance.
block_array: allows you to configure the block cache and the mmap() function.
pk_hash_type: the method to calculate the hash value for the primary key field. Valid values are default_hash, murmur_hash, and number_hash. The default value is default_hash.
default_hash: indicates the default hash method of strings.
murmur_hash: uses the MurmurHash function that can provide better performance.
number_hash: can be used when the primary key field is of a numeric type. The number itself is used in place of a hash value. Parsing the number is faster than hashing, but numbers are more likely to be clustered than hash values.
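
The idea behind a sorted-array primary key index can be sketched in a few lines of Python. This is an illustration only, not the OpenSearch implementation: hash64 uses BLAKE2b from the Python standard library as a stand-in hash (the hash functions behind default_hash and murmur_hash are not reproduced here), and all function names are hypothetical.

```python
import bisect
import hashlib


def hash64(key):
    """Illustrative 64-bit hash of a string key (a stand-in, not the product's hash)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def number_key(key):
    """number_hash-style key: the numeric primary key itself replaces the hash."""
    return int(key)


def build_sorted_array(primary_keys, key_func):
    """Return (key, doc_id) pairs sorted by key, as a stand-in for sort_array storage."""
    return sorted((key_func(pk), doc_id) for doc_id, pk in enumerate(primary_keys))


def lookup(sorted_pairs, primary_key, key_func):
    """Binary-search the sorted pairs and return the document ID, or None if absent."""
    target = key_func(primary_key)
    # (target, -1) sorts before any real pair with the same key because doc IDs are >= 0.
    pos = bisect.bisect_left(sorted_pairs, (target, -1))
    if pos < len(sorted_pairs) and sorted_pairs[pos][0] == target:
        return sorted_pairs[pos][1]
    return None


pairs = build_sorted_array(["1003", "1001", "1002"], number_key)
```

A hash table would trade the compact sorted array for constant-time lookups, which is consistent with the trade-off between sort_array and hash_table described above.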

Important

A PRIMARYKEY64 index or PRIMARYKEY128 index supports all types of fields.
A PRIMARYKEY index is the primary key of a document. Therefore, you can configure at most one PRIMARYKEY index.
PRIMARYKEY indexes do not support fields that contain null values.
PRIMARYKEY indexes do not allow you to set the file_compress parameter to enable file compression.
The index_name parameter cannot be set to summary.

DATE indexes

Introduction to DATE indexes

A DATE index is created on date and time values and is used to query a time range.

Sample code for configuring a DATE index

"fields":
[
    {"field_name": "inputtime", "field_type": "UINT64", "binary_field": false},
    ...
]
...
"indexs":
[
    {
        "index_name": "inputtime",
        "index_type": "DATE",
        "index_fields": "inputtime",
        "build_granularity": "minute",
        "file_compress": "simple_compress1"
    },
    ...
]

index_name: the name of an inverted index. Query statements must reference the index by this name.
index_type: the type of an index. Set the value to DATE.
index_fields: the field on which you want to create an index. You can set the field_type parameter to UINT64, DATE, TIME, or TIMESTAMP to create a DATE index.
build_granularity: the granularity at which the term dictionary is built. If you set this parameter to minute, the second and millisecond components of each value are ignored and set to 0, and queries are resolved only to the minute.
file_compress: specifies how a postings file is compressed. For more information, see the PACK index section in this topic.
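
The effect of build_granularity can be pictured as truncating each timestamp before it is indexed. The following Python snippet is a sketch of that idea, assuming non-negative timestamps in milliseconds in UTC. The function name truncate_timestamp_ms is hypothetical, and the snippet does not claim to reproduce how the term dictionary is actually built.

```python
from datetime import datetime, timezone

# Granularities listed in this topic, from coarsest to finest.
GRANULARITIES = ["year", "month", "day", "hour", "minute", "second", "millisecond"]


def truncate_timestamp_ms(timestamp_ms, granularity):
    """Zero out every component finer than the granularity (UTC milliseconds in and out)."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity}")
    level = GRANULARITIES.index(granularity)
    seconds, millis = divmod(timestamp_ms, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    dt = dt.replace(
        month=dt.month if level >= 1 else 1,
        day=dt.day if level >= 2 else 1,
        hour=dt.hour if level >= 3 else 0,
        minute=dt.minute if level >= 4 else 0,
        second=dt.second if level >= 5 else 0,
    )
    millis = millis if level >= 6 else 0
    return int(dt.timestamp()) * 1000 + millis
```

With the minute granularity, two values that differ only in seconds or milliseconds map to the same truncated timestamp, which is why they cannot be told apart at query time.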

Query syntax

For information about the query syntax of DATE indexes, see query syntax of DATE indexes.

Important

DATE indexes do not support bitmaps.
The term dictionary can be built at one of the following seven granularities: year, month, day, hour, minute, second, and millisecond. The finer the granularity, the more storage space is required.
DATE, TIME, and TIMESTAMP values are parsed based on their formats into timestamps that are expressed as the number of milliseconds that have elapsed since 00:00:00.000 on January 1, 1970. The timestamps follow Greenwich Mean Time (GMT) or Coordinated Universal Time (UTC), and time zones are not taken into account. The inverted index is built on these timestamps, so the time range in a query must be expressed as timestamps on the same basis.
The query interface term supports null terms. The records about fields for which enable_null is set to true can be queried.
The index_name parameter cannot be set to summary.

RANGE indexes

Introduction to RANGE indexes

A RANGE index is created on integer values and is used to query documents whose values fall within a specific range. Replacing range filtering in a filter clause with a RANGE index can greatly improve query performance. The more documents the filter clause would otherwise filter out, the greater the improvement.
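
The difference between filtering and an index lookup can be sketched with a toy example in Python. The sketch below swaps in a plain sorted array with binary search as a stand-in for a range index. It does not describe the actual structure of a RANGE index, and the names filter_scan and ToyRangeIndex are hypothetical.

```python
import bisect


def filter_scan(prices, low, high):
    """Filter-style evaluation: test the value of every candidate document."""
    return [doc_id for doc_id, price in enumerate(prices) if low <= price <= high]


class ToyRangeIndex:
    """Toy stand-in for a range index: (value, doc_id) pairs sorted by value."""

    def __init__(self, prices):
        self._pairs = sorted((price, doc_id) for doc_id, price in enumerate(prices))
        self._values = [price for price, _ in self._pairs]

    def query(self, low, high):
        """Return the IDs of documents whose value lies in [low, high]."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return sorted(doc_id for _, doc_id in self._pairs[start:end])


prices = [50, 10, 30, 70, 30]
index = ToyRangeIndex(prices)
```

The scan touches every document, whereas the lookup touches only the matching slice, so the gap widens as the share of documents outside the range grows.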

Sample code for configuring a RANGE index

"fields":
[
    {"field_name": "price", "field_type": "INT64", "binary_field": false},
    ...
]
"indexs":
[
    {
        "index_name": "inputtime",
        "index_type": "RANGE",
        "index_fields": "price",
        "file_compress": "simple_compress1"
    },
    ...
]

Query syntax

For information about the query syntax of RANGE indexes, see query syntax of RANGE indexes.

Important

You can set the field_type parameter to INT64, INT32, UINT32, INT16, UINT16, INT8, or UINT8 to create a RANGE index.
RANGE indexes do not support fields that contain null values.
RANGE indexes do not support multivalued fields.
The index_name parameter cannot be set to summary.

SPATIAL indexes

Introduction to SPATIAL indexes

A SPATIAL index is created for the longitudes and latitudes of given points and is used for geospatial queries, including point range queries, line queries, and polygon queries.

Sample code for configuring a SPATIAL index

"fields":
[
    {"field_name": "location", "field_type": "LOCATION"},
    {"field_name": "line", "field_type": "LINE"},
    {"field_name": "polygon", "field_type": "POLYGON"},
    ...
]
...
"indexs":
[
    {
        "index_name": "inputtime",
        "index_type": "SPATIAL",
        "index_fields": "location",
        "max_search_dist": 10000,
        "max_dist_err": 20,
        "distance_loss_accuracy": 0.025,
        "file_compress": "simple_compress1"
    },
    ...
]

index_name: the name of an inverted index. Query statements must reference the index by this name.
index_type: the type of an index. Set the value to SPATIAL.
index_fields: the field on which you want to create an index. The field type must be LOCATION, LINE, or POLYGON.
LOCATION: You can specify values for fields of the LOCATION data type in the location={Longitude} {Latitude} format, such as location=116 40.
LINE: You can specify values for fields of the LINE data type in the line=location,location,location...^]location,location... format, such as line=116 40,117 41,118 42^].... If you use OpenSearch Vector Search Edition SDK to push data, specify the LINE field in the following format: line: ["location,location,location...", "location,location,location..."].
POLYGON: You can specify values for fields of the POLYGON data type in the polygon=location1,location2,...location1^].... format. If you use OpenSearch Vector Search Edition SDK to push data, specify the POLYGON field in the following format: line : ["location,location,location",...]. A polygon can be convex or concave. The start point and end point of a polygon must be the same point. Two adjacent edges cannot be collinear, and the edges of a polygon cannot intersect each other.
max_search_dist: the maximum distance (diameter) that is covered during the query. Unit: meters. The value of the max_search_dist parameter must be greater than the value of the max_dist_err parameter.
max_dist_err: the maximum distance (diameter) error value when the term dictionary is built. Unit: meters. The minimum value of this parameter is 0.05 meters.
distance_loss_accuracy: the loss of precision. OpenSearch Vector Search Edition improves the performance of polyline and polygon queries at the cost of loss of precision. The following method is used: Range of the distance to the outermost layer = Diagonal length of the circumscribed rectangle of the polyline or polygon × Value of the distance_loss_accuracy parameter. The default value of the distance_loss_accuracy parameter is 0.025.
file_compress: the file compression method. For more information, see PACK indexes.
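
The coordinate format and the distance_loss_accuracy formula above can be worked through in a short Python sketch. The helper names are hypothetical. The sketch treats coordinates as points on a flat plane and reports the result in the same units as the input, which is an assumption made for illustration rather than a statement about the units that OpenSearch Vector Search Edition uses internally.

```python
import math


def parse_points(text):
    """Parse "lon lat,lon lat,..." into a list of (longitude, latitude) tuples."""
    points = []
    for item in text.split(","):
        lon, lat = item.split()
        points.append((float(lon), float(lat)))
    return points


def is_closed(points):
    """A polygon must start and end at the same point (at least a triangle)."""
    return len(points) >= 4 and points[0] == points[-1]


def loss_range(points, distance_loss_accuracy=0.025):
    """Diagonal of the circumscribed rectangle multiplied by distance_loss_accuracy."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    diagonal = math.hypot(max(lons) - min(lons), max(lats) - min(lats))
    return diagonal * distance_loss_accuracy


def check_distances(max_search_dist, max_dist_err):
    """Apply the two constraints stated for max_search_dist and max_dist_err (meters)."""
    return max_dist_err >= 0.05 and max_search_dist > max_dist_err
```

For a shape whose circumscribed rectangle has a diagonal of 5 units, the default accuracy of 0.025 gives a range of 0.125 units.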

Query syntax

For information about the query syntax of SPATIAL indexes, see query syntax of SPATIAL indexes.

Important

The index_name parameter cannot be set to summary.
The point coordinates of lines and polygons are mapped to a flat world map to determine the scope of line queries and polygon queries. Shapes that cross the 180-degree meridian are not handled as a special case. The query result of the inverted index on a LOCATION field is accurate. The query results of the inverted index on LINE and POLYGON fields are not exact and need to be filtered.