Create a search index - Tablestore - Alibaba Cloud Documentation Center

You can create one or more search indexes for a table in the Tablestore console or by using the Tablestore CLI and Tablestore SDKs.

Optimal method to create search indexes

We recommend that you determine the number of search indexes that you want to create for a data table based on your query requirements.

If you have a data table that contains the id, name, age, city, and sex fields, you can create search indexes to query data by name, age, or city by using one of the following methods:

Method 1: Create a search index for each field
If you use this method, you must create the following search indexes: name_index, age_index, and city_index.
- To query students by city, use the city_index search index. To query students by age, use the age_index search index.
- However, if you want to query students who are younger than 12 years old and live in City A, this method does not work.
The implementation of this method is similar to that of the global secondary index feature. However, this method is not cost-effective. We recommend that you use Method 2 to create search indexes.
Method 2: Create one search index for multiple fields
In this method, a search index named student_index is created. The search index contains the following fields: name, age, and city.
- To query students by city, query the city field in the student_index search index. To query students by age, query the age field in the student_index search index.
- To query students who are younger than 12 years old and live in City A, query the age and city fields in the student_index search index.
This method makes full use of the advantages of search indexes and is more cost-effective. We recommend that you use this method to create a search index.

Limits

Timeliness of search index creation
After you create a search index, you must wait for a few seconds before you can use the search index. During this period, you can write data to the data table, but you cannot query the metadata of the index or query data by using the index.
When you create a search index, make sure that the types of fields in the search index match the types of fields in the data table for which the search index is created. For more information, see Data type mappings.
Number of search indexes
For more information, see Search index limits.

Parameters

When you create a search index, you must configure the tableName, indexName, and indexSchema parameters. You must also configure the fieldSchemas, indexSetting, and indexSort parameters in the indexSchema parameter. The following table describes the parameters.

Parameter	Description
tableName	The name of the data table.
indexName	The name of the search index.
fieldSchemas	The list of field schemas. In each field schema, configure the following parameters: fieldName: This parameter is required and specifies the name of the field in the search index. The value is used as a column name. Type: String. A column in a search index can be a primary key column or an attribute column of the data table. fieldType: This parameter is required and specifies the type of the field. Specify the type in the FieldType.XXX format. For more information, see Data type mappings. array: This parameter is optional and specifies whether the value is an array. Type: Boolean. If you set this parameter to true, the column stores data as an array. Data written to the column must be a JSON array. Example: ["a","b","c"]. The values of nested fields are arrays. If you set the fieldType parameter to Nested, skip this parameter. analyzer: This parameter is optional and specifies the type of analyzer that you want to use. If you set the fieldType parameter to Text, you can configure this parameter. If you do not configure this parameter, the default analyzer type single-word tokenization is used. For more information, see Tokenization. isVirtualField: This parameter is optional and specifies whether the field is a virtual column. Type: Boolean. Default value: false. You must set this parameter to true for virtual columns. For more information, see Virtual columns. sourceFieldName: This parameter is optional and specifies the name of the source field to which the virtual column is mapped in the data table. Type: String. If you set the isVirtualField parameter to true, you must configure this parameter. dateFormats: This parameter is optional and specifies the format of dates. Type: String. If you set the fieldType parameter to Date, you must configure this parameter. For more information, see Types of date data. vectorOptions: This parameter is optional and specifies the properties of vector fields. If you set the fieldType parameter to Vector, you must configure this parameter. You can use the following parameters to specify the properties of vector fields: dataType: the type of vector data. Only float32 is supported. If you want to use other types of vector data, submit a ticket. dimension: the dimension of the vector. For more information about the limits on the dimension of a vector, see Search index limits. vectorOptions: This parameter is optional and specifies the properties of vector fields. If you set the fieldType parameter to Vector, you must configure this parameter. You can use the following parameters to specify the properties of vector fields: dataType: the type of vector data. Only float32 is supported. If you want to use other types of vector data, submit a ticket. dimension: the dimension of the vector. For information about the limits on the dimension of a vector, see Search index limits. metricType: the algorithm that you want to use to measure the distance between vectors. Valid values: euclidean, cosine, and dot_product. euclidean: the Euclidean distance algorithm that measures the shortest path between two vectors in a multi-dimensional space. The Euclidean distance algorithm in Tablestore does not perform the final square root calculation to achieve better performance. A greater value that is obtained by using the Euclidean distance algorithm indicates a higher similarity between two vectors. cosine: the cosine similarity algorithm that calculates the cosine of the angle between two vectors in a vector space. A greater value that is obtained by using the cosine similarity algorithm indicates a higher similarity between two vectors. In most cases, the algorithm is used to calculate the similarity between text data. dot_product: the dot product algorithm that multiplies the corresponding coordinates of two vectors of the same dimension and adds the products. A greater value that is obtained by using the dot product algorithm indicates a higher similarity between two vectors. For more information, see Appendix: distance measurement algorithms for vectors.
indexSetting	The settings of the search index, including routingFields. routingFields: This parameter is optional and specifies custom routing fields. You can specify specific primary key columns as routing fields. Tablestore distributes data that is written to a search index across different partitions based on the specified routing fields. The data whose routing field values are the same is distributed to the same partition.
indexSort	The presorting settings of the search index, including sorters. If the indexSort parameter is left empty, field values are sorted by primary key. Note You can skip the presorting settings for search indexes that contain fields of the Nested type. sorters: This parameter is optional and specifies the presorting method for the search index. PrimaryKeySort and FieldSort are supported. For more information, see Perform sorting and paging.
timeToLive	This parameter is optional and specifies the retention period of data in the search index. Unit: seconds. Default value: -1. If the retention period of data exceeds the value of the timeToLive parameter, the data expires. Tablestore automatically deletes expired data. The value of this parameter must be greater than or equal to 86400. A value of 86400 specifies one day. You can also set this parameter to -1, which specifies that data never expires. For information about how to use the time to live (TTL) feature for search indexes, see Specify the TTL of a search index.

Usage notes

To perform full-text search, you can set the type of the field in a search index to Text and specify the type of the analyzer for the Text field. Then, you can use match query or match phrase query to query data. For information about tokenization, see Tokenization.
You can map columns in data tables to virtual columns in search indexes to query new fields and the data of new field types without the need to modify the storage schema and the data in the tables. For more information, see Virtual columns.
You can map fields of the String or Integer type in data tables to fields of the Date type in search indexes. This way, you can query data of the Date type. For more information, see Types of date data.
To perform approximate nearest neighbor searches by using vectors, you can set the type of the fields to Vector in search indexes and use the k-nearest neighbor (KNN) vector query feature to query data.
If you want to store data as arrays in a search index, you can store the data in fields of the String type in the data table, and set the type of the mapped fields in the search index to Array.
When you write data to the data table, the values of fields of the Array type must be JSON arrays. Example: ["a","b","c"].
Note
The fields of search indexes support the Array type. However, the fields of data tables do not support the Array type. For more information, see ARRAY and Nested field types.
If you want to use a field to store data that has a hierarchical relationship or one-to-many relationship, you can set the type of the field in the search index to Nested, and use nested query to query data.
When you write data to the data table, the values of the fields to which the fields of the Nested type in the search index is mapped must be JSON arrays. Example: [{"tagName":"tag1", "score":0.8}, {"tagName":"tag2", "score":0.2}].
If the data that you want to store in the data table is geographical location information, you can set the type of the fields in the search index to Geo-point, and use geo query, geo-bounding box query, or geo-polygon query to query data.
By default, the system uses the partition key as the routing key. You can specify custom routing fields to use other primary key columns as the routing key. This reduces the query latency because the partitions scanned by Tablestore are narrowed down. For more information, see How do I use routing fields?
By default, if you use a search index to query data, the query results are returned and sorted in ascending order of the primary key. You can specify a presorting method to sort the query results based on the values of specific columns or in descending order of the primary key. For more information, see Index presorting.
You can use the TTL feature to automatically delete historical data in a search index. This reduces storage usage and storage costs. For more information, see Specify the TTL of a search index.

Methods

You can create a search index in the Tablestore console or by using the Tablestore CLI or Tablestore SDKs.

Before you create a search index, make sure that the following preparations are complete:

You have an Alibaba Cloud account or a RAM user that has the permissions to perform operations on Tablestore. For information about how to grant Tablestore operation permissions to a RAM user, see Use a RAM policy to grant permissions to a RAM user.
If you want to use Tablestore SDKs or the Tablestore CLI to perform operations on Tablestore, an AccessKey pair is created for your Alibaba Cloud account or RAM user. For more information, see Create an AccessKey pair.
A data table is created. For more information, see Operations on tables.
If you want to use Tablestore SDKs to perform operations on Tablestore, an OTSClient instance is initialized. For more information, see Initialize an OTSClient instance.
If you want to use the Tablestore CLI to perform operations on Tablestore, the Tablestore CLI is downloaded and started, and information about the instance that you want to access is configured. For more information, see Download the Tablestore CLI and Start the Tablestore CLI and configure access information.

Use the Tablestore console

You can create a search index in the Tablestore console.

Go to the Indexes tab.
1. Log on to the Tablestore console.
2. In the top navigation bar, select a resource group and a region.
3. On the Overview page, click the name of the instance that you want to manage or click Manage Instance in the Actions column of the instance.
4. On the Tables tab of the Instance Details tab, click the name of the data table for which you want to create a search index or click Indexes in the Actions column of the data table.
On the Indexes tab, click Create Search Index.

In the Create Index dialog box, configure the parameters of the search index.

Retain the default index name that is generated by the system or enter a custom index name based on your business requirements.
Configure the schema of the search index.
Important
The values of the Field Name and Field Type parameters must match those of the data table. For more information about the mappings between the field types in data tables and those in search indexes, see Data type mappings.
- If you set the Schema Generation Type parameter to Manual, specify the names and types for the fields. Specify whether to turn on Array for each field.
- If you set the Schema Generation Type parameter to Auto Generate, the system automatically uses the primary key columns and attribute columns of the data table as index fields. You can select field types and specify whether to turn on Array based on your business requirements.
Note
To optimize performance in some cases, you can use virtual columns. For more information, see Virtual columns.

If you want to configure advanced settings, such as the Routing Key, Time to Live, and Pre-sorting parameters, turn on Advanced Settings. The following table describes the parameters.

Parameter	Description
Routing Key	The custom routing fields. You can select one or more primary key columns as the routing fields. Tablestore calculates the distribution of the index data based on the values of the routing fields. The records with the same values of the routing fields are distributed to the same data partition.
Time to Live	The retention period of data in the search index. The default value is -1, which indicates that the data never expires. The value of this parameter must be greater than or equal to 86400. A value of 86400 indicates one day. You can also set this parameter to -1. Specify a value that is greater than or equal to 86400 if you want the system to automatically clear the historical data in the search index. If the retention period exceeds the TTL value, the data expires and Tablestore automatically deletes expired data.
Pre-sorting	The default order in which the data is returned. Valid values: Default and Custom. A value of Default indicates that the data is sorted based on the primary key. A value of Custom indicates that the data is sorted based on the field that you specify. Configure the Pre-sorting parameter based on your business requirements. Important Search indexes that contain fields of the Nested type do not support index presorting.

Click OK.
After the search index is created, click Index Details in the Actions column of the search index. You can view the information about the search index, such as the metering information and index fields.

Use the Tablestore CLI

You can execute the create_search_index command by using the Tablestore CLI to create a search index. For more information, see Search index.

Run the create_search_index command to create a search index named search_index.
```
create_search_index -n search_index
```

Enter the index schema as prompted:

The index schema includes the settings of the search index (IndexSetting), the list of field schemas (FieldSchemas), and the presorting settings for the search index (IndexSort). For more information about the schema of a search index, see Create a search index.

 {

    "IndexSetting": {
        "RoutingFields": null
    },
    "FieldSchemas": [
        {
            "FieldName": "gid",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "uid",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col2",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col3",
            "FieldType": "TEXT",
            "Index": true,
            "Analyzer": "single_word",
            "AnalyzerParameter": {
                "CaseSensitive": true,
                "DelimitWord": null
            },
            "EnableSortAndAgg": false,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col1",
            "FieldType": "KEYWORD",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col3V",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": true,
            "SourceFieldNames": [
                "col3"
            ]
        }
    ]
}

Use Tablestore SDKs

You can create a search index by using the following Tablestore SDKs: Tablestore SDK for Java, Tablestore SDK for Go, Tablestore SDK for Python, Tablestore SDK for Node.js, Tablestore SDK for .NET, and Tablestore SDK for PHP. In this example, Tablestore SDK for Java is used to create a search index.

Create a search index by using the default configurations

The following sample code provides an example on how to create a search index by using the default configurations. In this example, the search index consists of the following columns: the Col_Keyword column of the Keyword type, the Col_Long column of the Long type, and the Col_Vector column of the Vector type. The data in the search index is presorted based on the primary key of the data table and data in the search index never expires.

private static void createSearchIndex(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            // Specify the names and types of the fields. 
            new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
            new FieldSchema("Col_Long", FieldType.LONG)));
            // Set the type of the field to Vector. 
            new FieldSchema("Col_Vector", FieldType.VECTOR).setIndex(true)
                    // Set the number of dimensions for the vector to 4 and the distance measurement algorithm for vectors to the dot product algorithm. 
                    .setVectorOptions(new VectorOptions(VectorDataType.FLOAT_32, 4, VectorMetricType.DOT_PRODUCT))
    ));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request); 
}

Create a search index with the indexSort parameter specified

The following sample code provides an example on how to create a search index with the indexSort parameter specified. In this example, the search index consists of the following columns: the Col_Keyword column of the String type, the Col_Long column of the Long type, the Col_Text column of the Text type, and the Timestamp column of the Long type. The data in the search index is presorted based on the Timestamp column.

private static void createSearchIndexWithIndexSort(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            new FieldSchema("Col_Keyword", FieldType.KEYWORD),
            new FieldSchema("Col_Long", FieldType.LONG),
            new FieldSchema("Col_Text", FieldType.TEXT),
            new FieldSchema("Timestamp", FieldType.LONG)
                    .setEnableSortAndAgg(true)));
    // Presort data based on the Timestamp column. 
    indexSchema.setIndexSort(new Sort(
            Arrays.<Sort.Sorter>asList(new FieldSort("Timestamp", SortOrder.ASC))));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request);
}

Create a search index with the TTL specified

Important

Make sure that update operations on the data table are disabled.

The following sample code provides an example on how to create a search index with the TTL specified. In this example, the search index consists of the following columns: the Col_Keyword column of the String type and the Col_Long column of the Long type. The TTL of the search index is seven days.

// Use Tablestore SDK for Java V5.12.0 or later to create a search index. 
public static void createIndexWithTTL(SyncClient client) {
    int days = 7;
    CreateSearchIndexRequest createRequest = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    createRequest.setTableName("<TABLE_NAME>");
    // Specify the name of the search index. 
    createRequest.setIndexName("<SEARCH_INDEX_NAME>");
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            // Specify the names and types of the fields. 
            new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
            new FieldSchema("Col_Long", FieldType.LONG)));
    createRequest.setIndexSchema(indexSchema);
    // Specify the TTL for the search index. 
    createRequest.setTimeToLiveInDays(days);
    // Call the client to create the search index. 
    client.createSearchIndex(createRequest);
}

Create a search index with virtual columns specified

The following sample code provides an example on how to create a search index that consists of the Col_Keyword and Col_Long columns. Each of the columns has a virtual column. The Col_Keyword_Virtual_Long column is mapped to the Col_Keyword column in the data table, and the Col_Long_Virtual_Keyword column is mapped to the Col_Long column in the data table.

private static void createSearchIndex(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
        // Specify the name and type of the field. 
        new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
        // Specify the name and type of the field. 
        new FieldSchema("Col_Keyword_Virtual_Long", FieldType.LONG) 
             // Specify whether the field is a virtual column. 
            .setVirtualField(true) 
             // Specify the name of the source field to which the virtual column is mapped in the data table. 
            .setSourceFieldName("Col_Keyword"), 
        new FieldSchema("Col_Long", FieldType.LONG),
        new FieldSchema("Col_Long_Virtual_Keyword", FieldType.KEYWORD)
            .setVirtualField(true)
            .setSourceFieldName("Col_Long")));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request); 
}

What to do next

After you create a search index, you can use the search index to query, analyze, and export data.

Operation	Available features
Query data	Select a query method based on your business requirement. Basic queries: match all query, term query, terms query, prefix query, wildcard query, fuzzy query, exists query, range query, collapse (distinct), geo query (suitable for fields of the Geo-point type, nested query (suitable for fields of the Nested type). Queries based on multiple conditions: Boolean query Full-text search (suitable for fields of the Text type): tokenization, match query, and match phrase query KNN vector query (suitable for fields of the Vector type): KNN vector query When you call the Search operation to query data, you can sort or paginate the rows that meet the query conditions. For more information, see Perform sorting and paging.
Data analysis	Aggregation
Data export	Parallel scan

References

After you create a search index, you can perform operations on the search index based on your business requirements.

You can specify the TTL for a search index to delete historical data in the search index or extend the retention period of data in the search index. For more information, see TTL of search indexes.
You can dynamically modify the schema of a search index to add, update, or remove index columns in the search index. For more information, see Dynamically modify the schema of a search index.
You can call the ListSearchIndex operation to query all search indexes that are created for a data table. For more information, see List search indexes.
You can call the DescribeSearchIndex operation to query the description of a search index, such as the field information and search index configurations. For more information, see Query the description of a search index.
You can delete a search index that you no longer require. For more information, see Delete search indexes.
You can also use the SQL query feature or compute engines, such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, Flink, and PrestoDB, to compute and analyze data in tables. For more information, see SQL query and Overview.

Appendix: distance measurement algorithms for vectors

The following table describes the algorithms that you can use to measure the distance for vectors. A greater value that is obtained by using the algorithm indicates a higher similarity.

MetricType	Formula	Performance	Description
Euclidean distance	$\frac{1}{1 + \sum _{i = 1}^{n} ( x _{i} - y _{i} ) ^{2}}$	Relatively high	Measures the shortest path between two vectors in a multi-dimensional space. The Euclidean distance algorithm in Tablestore does not perform the final square root calculation to achieve better performance. A greater value that is obtained by using the Euclidean distance algorithm indicates a higher similarity between two vectors.
Dot product	$\frac{1 + \sum _{i = 1}^{n} x _{i} y _{i}}{2}$	Highest	Multiplies the corresponding coordinates of two vectors of the same dimension and adds the products. A greater value that is obtained by using the dot product algorithm indicates a higher similarity between two vectors. Float32 vectors must be normalized before they are written to tables. For example, you can use the L2 norm to normalize Float32 vectors. If you do not normalize Float32 vectors before they are written to tables, you may encounter issues such as inaccurate query results, slow construction of vector indexes, and poor query performance.
Cosine similarity	$\frac{1 + \frac{\sum _{i = 1}^{n} x _{i} \times y _{i}}{\sum _{i = 1}^{n} x _{i}^{2} \times \sum _{i = 1}^{n} y _{i}^{2}}}{2}$	Relatively low	Calculates the cosine of the angle between two vectors in a vector space. A greater value that is obtained by using the cosine similarity algorithm indicates a higher similarity between two vectors. In most cases, the algorithm is used to calculate the similarity between text data. If 0 is used as a divisor, the cosine similarity cannot be calculated because 0 cannot be used as a divisor. Therefore, the sum of squares of Float32 vectors cannot be 0. It is complex to calculate the cosine similarity. We recommend that you normalize vectors before you write data to tables and use the dot product algorithm to measure the distance between vectors.

The following sample code provides an example on how to normalize vectors:

  public static float[] l2normalize(float[] v, boolean throwOnZero) {
    double squareSum = 0.0f;
    int dim = v.length;
    for (float x : v) {
      squareSum += x * x;
    }
    if (squareSum == 0) {
      if (throwOnZero) {
        throw new IllegalArgumentException("can't normalize a zero-length vector");
      } else {
        return v;
      }
    }
    double length = Math.sqrt(squareSum);
    for (int i = 0; i < dim; i++) {
      v[i] /= length;
    }
    return v;
  }