The search index feature provides multiple efficient index schemas to help you process complex queries in big data scenarios.
Data tables in Tablestore use distributed NoSQL data structures. Data such as monitoring data and log data can be stored, read, and written at a large scale.
In addition to queries based on primary keys including single-row read and range read, Tablestore provides the search index feature to meet your requirements for complex queries. These queries include Boolean query and queries based on non-primary key columns.
The search index feature is implemented by using inverted indexes and column stores. This feature provides query methods to solve problems in complex big data scenarios. The query methods include queries based on non-primary key columns, full-text search, prefix query, fuzzy query, Boolean query, nested query, and geo query. Aggregation can be implemented by using max, min, count, sum, avg, distinct_count, group_by, percentile statistics, and query by histogram.
Differences among indexes
Aside from queries based on primary keys in data tables, Tablestore provides two index schemas for accelerated queries: secondary index and search index. The following table describes the differences among the three types of indexes.
|Primary key of a data table||A data table is similar to a large map. Data tables support queries based only on primary keys.||You can specify a complete primary key or the prefix of a primary key.|
|Secondary index||You can create one or more index tables and perform queries by using the primary key columns of the index tables.||You can define the required columns in advance. Therefore, only a small number of columns are queried. You can also specify a complete primary key or the prefix of a primary key.|
|Search index||The search index feature uses inverted indexes, Bkd-trees, and column stores for various query scenarios.||The following scenarios do not support the use of a primary key of a data table and
the secondary index feature:
Compared with indexes of conventional database services such as MySQL, the search index feature is not subject to the leftmost matching principle. Therefore, the search index feature can be used in more scenarios. In most cases, only one search index is required for a data table. For example, a data table about student information may contain the student name, ID, gender, grade, class, and home address columns. When you create a search index, you can add these columns to the search index. When you use the search index, you can specify a combination of conditions. Examples: students named Tom in Grade Three, male students who live one kilometer away from their school, and students in Class Two, Grade Three who live in the specified residential community.
The search index feature provides the Search and ParallelScan API operations. The Search API operation is used for general queries. The ParallelScan API operation is used for data export.
|Search||Supports all features of search index.
|ComputeSplits+ParallelScan||Exports data in parallel. The query feature of search indexes is supported. However,
analysis features such as sorting and aggregation are not supported.
The query speed of a ParallelScan request that includes a parallel scan task is five times faster than the query speed of a Search request.
- Index synchronization
After a search index is created for a data table, data is written to the data table first. After the data is written to the data table, a success message is returned. At the same time, another asynchronous thread reads the newly written data from the data table and writes the data to the search index. The write performance of Tablestore is not affected when data is being asynchronously synchronized from a data table to a search index.
In most cases, the latency generated when data is synchronized to a search index is within 3 seconds. You can view the latency in real time in the Tablestore console.
- Time to live (TTL)
- If the UpdateRow operation is disabled for a data table, you can use the TTL feature of the search index that is created for the data table. For more information, see TTL of search indexes.
- If you want to retain the data only for a period of time and the time field does not
need to be updated, you can implement the TTL feature by splitting a data table into
several time-specific data tables.
This solution is implemented based on the following principles and rules, and has the following benefits:
- Principle: Split a data table based on fixed periods of time, such as day, week, month,
or year. Then, you can create a search index for each table. This way, tables for
the specified periods of time are retained.
For example, if you want to retain data for six months, you can store the data for each month in a data table (such as table_1, table_2, table_3, table_4, table_5, and table_6) and create a search index for each data table. Each data table and search index store the data only of a single month. Then, you need to only delete data tables that are retained for more than six months to implement the same feature as TTL.
When you query data by using a search index, you need to only query one table if data that meets the time range requirement is in that table. If data that meets the time range requirement is included in multiple tables, you need to query all these tables and then combine the query results.
- Rule: The size of a single table or search index cannot exceed 50 billion rows. The search index feature provides the optimal query performance if the size of a single table or search index does not exceed 20 billion rows.
- You can adjust the data storage duration based on the number of data tables retained.
- Query performance is directly proportional to data volumes. After a data table is split into multiple data tables, the data volume of each data table has an upper limit. This helps ensure better query performance and avoid high query latencies or timeouts.
- Principle: Split a data table based on fixed periods of time, such as day, week, month, or year. Then, you can create a search index for each table. This way, tables for the specified periods of time are retained.
- Max versions
You cannot create a search index for a data table for which you have specified the max versions parameter.
You can specify the timestamp when you write data to a column that allows only a single version. If you first write data with a greater version number and then write data with a smaller version number, the data with the greater version number may be overwritten by data with the smaller version number.
The results of the Search and ParallelScan requests may not include the timestamp attribute.
Search indexes can solve complex query problems in big data scenarios. Other systems such as databases and search engines can also solve data query problems. The following figure shows the differences between Tablestore, databases, and search engines.
Tablestore can provide all features of databases and search engines except for JOIN
operations, transactions, and relevance of search results. Tablestore also has high
data reliability of databases and supports advanced queries of search engines. Therefore,
Tablestore can be used to replace the common architecture that consists of
databases and search engines. If you do not need JOIN operations, transactions, and relevance of search results,
we recommend that you use the search index feature of Tablestore.
For more information, see Billable items of search indexes.