If you use *word* for a wildcard query (WildcardQuery), you can use fuzzy tokenization together with a match phrase query to obtain better query performance.

Background information

Fuzzy query is a common requirement in databases. For example, you can perform a fuzzy query to query file names and mobile numbers. To perform fuzzy queries in Tablestore, you can use the wildcard query feature of search indexes. The wildcard query feature is similar to the LIKE operator in MySQL. However, the wildcard query feature supports only up to 32 characters in the string that is used for a wildcard query, and the query performance decreases as the volume of data increases.

To resolve these issues, search indexes support fuzzy tokenization to ensure high performance in fuzzy queries. When you use fuzzy tokenization, Tablestore does not limit the length of the string that is used for a query. However, if the field value exceeds 1,024 characters in length, the system truncates the field value and performs tokenization only for the first 1,024 characters.

Scenarios

You can select a method that fits your scenario to perform a fuzzy query.

  • If you use *word* for a wildcard query, you can use fuzzy tokenization to perform a fuzzy query. For example, if you use "123" to query mobile numbers that contain 123 at any position, you can use fuzzy tokenization to perform a fuzzy query.

    In this case, the fuzzy tokenization improves the query performance by more than 10 times than the wildcard query.

    For example, a data table contains a column named file_name, and the field type is Text and the tokenization method is fuzzy tokenization (Fuzzy_Analyzer) for the column in the search index. If you use the search index to query the rows whose file_name column value is 2021 woRK@Hangzhou, you must perform a match phrase query (MatchPhraseQuery) and set the tokens to consecutive substrings for the query.

    • If the token for the query is 2021, 20, 21, work, WORK, @, Hang, zhou, Hangzhou, or @Hangzhou, the rows whose file_name column value is 2021 woRK@Hangzhou can match the token.
    • If the token for the query is 21work, 2021Hangzhou, 2120, or #Hangzhou, the rows whose file_name column value is 2021 woRK@Hangzhou cannot match the token.
  • For other complex queries, you can use wildcard queries for fuzzy queries. For more information about the wildcard query, see Wildcard query.

Use fuzzy tokenization for a fuzzy query

To use fuzzy tokenization for a fuzzy query, perform the following steps:

  1. Create a search index. When you create a search index, set the field type to Text and the tokenization method to fuzzy tokenization (Fuzzy Analyzer) for the specified column, and retain the default settings for other parameters.
    Note If a search index exists, you can add a virtual column for the specified column by dynamically modifying the schema of the search index. Then, set the field type to Text and the tokenization method to fuzzy tokenization for the virtual column. For more information about the specific operations, see Dynamically modify schemas and Virtual columns.
  2. Use the search index to query data. When you use the search index to query data, perform a match phrase query. For more information about the match phrase query, see Match phrase query.

Examples

The following test case shows how to use fuzzy tokenization to perform a fuzzy query:

package com.aliyun.tablestore.search.test;

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.*;
import com.alicloud.openservices.tablestore.model.search.*;
import com.alicloud.openservices.tablestore.model.search.query.QueryBuilders;
import org.junit.Test;

import java.util.Arrays;
import java.util.Collections;

import static org.junit.Assert.assertEquals;


public class Test {
    private static final Conf conf = Conf.newInstance("src/test/resources/conf.json");
    private static final SyncClient ots = new SyncClient(conf.getEndpoint(), conf.getAccessId(), conf.getAccessKey(), conf.getInstanceName());
    private static final String tableName = "analysis_test";
    private static final String indexName = "analysis_test_index";


    @Test
    public void testFuzzyMatchPhrase() {
        // Delete the existing data table and index. 
        TableStoreHelper.deleteTableAndIndex(ots, tableName);
        // Create a data table. 
        TableStoreHelper.createTable(ots, tableName);
        // Define the schema of the data table. 
        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Collections.singletonList(
                // Note: If you change the type of the name field for the query from Keyword to Text and set the tokenization method for the field, exceptions may occur in the query. 
                // If you want to retain the fields of both the Keyword and Text types, see the example provided in the "Virtual columns" topic. If you use *abc* to match the name field, only the name field of the Text type is required. The name field of the Keyword type is not required. 
                new FieldSchema("name", FieldType.TEXT).setAnalyzer(FieldSchema.Analyzer.Fuzzy)
        ));
        // Create a search index. 
        TableStoreHelper.createIndex(ots, tableName, indexName, indexSchema);

        // Write a row of data to the data table. 
        PrimaryKey primaryKey = PrimaryKeyBuilder.createPrimaryKeyBuilder()
                .addPrimaryKeyColumn("pk1", PrimaryKeyValue.fromString("1"))
                .addPrimaryKeyColumn("pk2", PrimaryKeyValue.fromLong(1))
                .addPrimaryKeyColumn("pk3", PrimaryKeyValue.fromBinary(new byte[]{1, 2, 3}))
                .build();
        RowPutChange rowPutChange = new RowPutChange(tableName, primaryKey);
        // Add an attribute column to the data table. 
        rowPutChange.addColumn("name", ColumnValue.fromString("TheLionKing1024x768P.mp4"));
        PutRowRequest request = new PutRowRequest(rowPutChange);
        ots.putRow(request);

        // Wait until the row of data is synchronized to the search index. 
        TableStoreHelper.waitDataSync(ots, tableName, indexName, 1);

        // Use *abc* for the query. 
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "The", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLion", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "The Lion", 0);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLionKing102", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLionKing1024", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLionKing1024x", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLionKing1024x7", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "TheLionKing1024x768P.mp4", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "24x768P.mp4", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "24x76 8P.mp4", 0);
        assertMatchPhraseQuery(ots, tableName, indexName, "name", "24x7 P.mp4", 0);
    }

    @Test
    // Use a virtual column. 
    public void testFuzzyMatchPhraseWithVirtualField() {
        // Delete the existing data table and index. 
        TableStoreHelper.deleteTableAndIndex(ots, tableName);
        // Create a data table. 
        TableStoreHelper.createTable(ots, tableName);
        // Define the schema of the data table. 
        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Arrays.asList(
                // Set the type of the name field to Keyword, which facilitates equivalent queries. 
                new FieldSchema("name", FieldType.KEYWORD).setIndex(true).setStore(true),
                // Create a virtual column named name_virtual_text and set the field type to Text and the tokenization method to Fuzzy for the virtual column. The data source of the virtual column is the name field. 
                new FieldSchema("name_virtual_text", FieldType.TEXT).setIndex(true).setAnalyzer(FieldSchema.Analyzer.Fuzzy).setVirtualField(true).setSourceFieldName("name")
        ));
        // Create a search index. 
        TableStoreHelper.createIndex(ots, tableName, indexName, indexSchema);

        // Write a row of data to the data table. 
        PrimaryKey primaryKey = PrimaryKeyBuilder.createPrimaryKeyBuilder()
                .addPrimaryKeyColumn("pk1", PrimaryKeyValue.fromString("1"))
                .addPrimaryKeyColumn("pk2", PrimaryKeyValue.fromLong(1))
                .addPrimaryKeyColumn("pk3", PrimaryKeyValue.fromBinary(new byte[]{1, 2, 3}))
                .build();
        RowPutChange rowPutChange = new RowPutChange(tableName, primaryKey);
        // Add an attribute column to the data table. 
        rowPutChange.addColumn("name", ColumnValue.fromString("TheLionKing1024x768P.mp4"));
        PutRowRequest request = new PutRowRequest(rowPutChange);
        ots.putRow(request);

        // Wait until the row of data is synchronized to the search index. 
        TableStoreHelper.waitDataSync(ots, tableName, indexName, 1);

        // Use *abc* for the query. 
        // Note: The field for the query is name_virtual_text instead of name. 
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "The", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLion", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "The Lion", 0);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLionKing102", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLionKing1024", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLionKing1024x", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLionKing1024x7", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "TheLionKing1024x768P.mp4", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "24x768P.mp4", 1);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "24x76 8P.mp4", 0);
        assertMatchPhraseQuery(ots, tableName, indexName, "name_virtual_text", "24x7 P.mp4", 0);
    }

    // Perform a match phrase query. 
    public static void assertMatchPhraseQuery(SyncClient ots, String tableName, String indexName, String fieldName, String searchContent, long exceptCount) {
        SearchRequest searchRequest = new SearchRequest();
        searchRequest.setTableName(tableName);
        searchRequest.setIndexName(indexName);
        SearchQuery searchQuery = new SearchQuery();
        // Perform a match phrase query to query data that matches the tokens. 
        searchQuery.setQuery(QueryBuilders.matchPhrase(fieldName, searchContent).build());
        searchQuery.setLimit(0);
        // Specify that the total number of matched rows is returned. If you are not concerned about the total number of matched rows, set this parameter to false for better performance. 
        searchQuery.setGetTotalCount(true);
        searchRequest.setSearchQuery(searchQuery);
        SearchResponse response = ots.search(searchRequest);
        assertEquals(String.format("field:[%s], searchContent:[%s]", fieldName, searchContent), exceptCount, response.getTotalCount());
    }
}