Search index - multi-dimensional query and analysis

Background information

Search indexes apply only to wide table models.

Search indexes, databases, and search engines all address complex query problems in big data, but differ in the following ways:

Except for joins, transactions, and relevance analysis, Tablestore provides the features of both databases and search systems. It combines the high data reliability of a database with the advanced query capabilities of a search system, replacing the common database + search engine architecture.

If your scenario does not involve joins, transactions, or complex relevance analysis, use a Tablestore search index.

Main differences between Tablestore, databases, and search systems

Index overview

A search index uses inverted indexes and column stores to solve multi-dimensional query and statistical analysis problems for big data. It supports non-primary key column queries, prefix queries, fuzzy queries, boolean queries, nested queries, geo queries, full-text search, vector search, and statistical aggregation (max, min, count, sum, avg, distinct_count, group_by, percentiles, and histogram).

The following figure shows the inverted index, column store, and multi-dimensional spatial index structures used by a search index.

Unlike traditional database indexes (such as MySQL), a search index is not limited by the leftmost prefix matching rule. In most cases, you only need one search index per table. For example, a student table with columns such as name, student ID, gender, grade, class, and home address requires only a single search index to support combined queries like students named John Doe in the third grade, male students whose home address is within 1 km, or students in Class 2 of Grade 3 who live in a specific residential area.
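The following is a minimal sketch, in plain Python, of why an inverted index is not bound by a leftmost prefix rule: every (column, value) pair gets its own posting set, so any combination of columns can be intersected. It does not use the Tablestore SDK; the student rows and the helper names `build_inverted_index` and `query_all` are assumptions made for illustration only.

```python
from collections import defaultdict

# Toy student rows keyed by primary key (student ID); all values are made up.
students = {
    "s001": {"name": "John Doe", "gender": "male", "grade": 3, "class": 2},
    "s002": {"name": "Jane Roe", "gender": "female", "grade": 3, "class": 2},
    "s003": {"name": "John Doe", "gender": "male", "grade": 4, "class": 1},
}

def build_inverted_index(rows):
    """Map every (column, value) pair to the set of primary keys containing it."""
    index = defaultdict(set)
    for pk, row in rows.items():
        for column, value in row.items():
            index[(column, value)].add(pk)
    return index

def query_all(index, conditions):
    """Return the primary keys that satisfy every column == value condition (AND)."""
    postings = [index.get(item, set()) for item in conditions.items()]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(students)

# Any column combination works; no column has to come "first".
print(query_all(index, {"name": "John Doe", "grade": 3}))   # {'s001'}
print(sorted(query_all(index, {"grade": 3, "class": 2})))   # ['s001', 's002']
```

A single index built this way answers both queries, whereas a composite B-tree index would need a matching leading column for each of them.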

Index comparison

Tablestore supports primary key queries on data tables, plus two index types for accelerated queries: secondary index and search index. The following table compares these three query methods.

Query method	Principle	Scenario
Primary key	A data table functions like a large map. You can only query data by primary key.	Suitable for scenarios where you know the full primary key or a key prefix.
Secondary index	Creates index tables whose primary key columns extend the query capability of the data table to different columns.	Suitable for scenarios where the query columns are predetermined, the column count is small, and you know the full primary key or a key prefix.
Search index	Uses structures such as inverted indexes, BKD trees, and column stores to provide rich query capabilities.	Suitable for all query and analysis scenarios beyond primary key and secondary index coverage: non-primary key column queries, boolean queries on any columns, relationship queries, full-text search, geo queries, fuzzy queries, nested queries, NULL value queries, and statistical aggregation.

Scenarios

Search indexes are widely used for data query and analysis across application systems. The following table lists common scenarios.

Application system	Example scenario
E-commerce platform	Implement product categorization and attribute filtering to help users quickly search and filter products.
Social application	Query user follow and friend relationships, or recommend and match users based on interest tags.
Log analysis	Perform keyword searches and time-range queries to quickly locate problems and analyze log data.
Internet of Things data analytics	Query and analyze device data. For example, filter and count data by device type or geographic location.
Application performance monitoring	Aggregate and query metric data. For example, filter and summarize data by time range or application name.
Location-based service	Perform geo queries and nearby searches to provide information about nearby shops, attractions, and services.
Text search engine	Perform full-text search and relevance sorting to quickly find documents, articles, and other content.

Features

Feature list

The following table lists search index features.

Feature	Description	Document
Query on any column (including primary key and non-primary key columns)	Query data by any column. Suitable for most query scenarios. If primary key or prefix queries cannot meet your needs, create a search index with the target fields and query by column values.	Any search index query, such as a basic query
Boolean query	Combine multiple fields for efficient filtering. Suitable for order systems, log analysis, and user personas. In a relational database, a table with dozens of fields may require hundreds of indexes to cover all field combinations. Missing combinations result in inefficient queries. With Tablestore, one search index covers all field combinations. Add the fields you might query to the index, then freely combine them using And, Or, and Not logic.	Boolean query
Geo query	Mobile devices have made geographic location data increasingly valuable. Applications for social networking, food delivery, sports, and the Internet of Vehicles (IoV) all require location-aware queries. Search indexes support the following geo query features: Near: Query points within a specified distance from an origin. Example: the "People Nearby" feature in social media. Within: Query points within a rectangular or polygonal area. If your application requires location-based queries, a Tablestore search index provides a one-stop solution without additional databases or search systems.	Geo-distance query Geo-bounding box query Geo-polygon query
Full-text index	Find data containing a specified phrase. Suitable for big data analytics, content search, knowledge management, social media analysis, log analysis, AI chat systems, compliance reviews, and personalized recommendations. Search indexes use tokenization for full-text search. They provide basic BM25 relevance but not custom relevance. For complex relevance search needs, use a dedicated search system; otherwise, a search index is sufficient. Five tokenization types are available: single-word, delimiter, minimum semantic, maximum semantic, and fuzzy. To highlight keywords in results, use the summary and highlighting feature.	Match query Match phrase query Tokenization Summary and highlighting
Vector search	Search indexes support vector search for efficient approximate nearest neighbor queries on large-scale datasets. Suitable for retrieval-augmented generation (RAG), recommendation systems, similarity detection (images, videos, and speech), and natural language processing.	AISearch
Fuzzy query	Search indexes provide wildcard, prefix, and suffix queries for fuzzy matching in different scenarios. Wildcard query: Similar to the `like` syntax in relational databases. Supports two wildcards: asterisk (*) and question mark (?). For `word*` patterns, use a tokenization-based wildcard query (fuzzy tokenization combined with match phrase query) for better performance. Prefix query: Matches content by prefix. For example, querying `apple` matches `apple6s` and `applexr`. Supports Chinese, English, and other languages. Suffix query: Matches content by suffix. For example, you can query for all mobile phone numbers that end with `1234`.	Wildcard query Tokenization-based wildcard query Prefix query Suffix query
Column existence query (NULL query)	Check whether a column has a null value. Suitable for data integrity checks and data cleaning.	Column existence query
Nested query	Beyond flat structures, application data often has multi-level nested structures. For example, an image tagging system stores images with multiple entities (houses, cars, people), each with a different position, size, and weight (score). Each image maps to multiple tags, and each tag has a name and a weight score. To filter images by tag conditions, use the nested type query. Image tags are stored in JSON format: `{ "tags": [ { "name": "car", "score": 0.78 }, { "name": "tree", "score": 0.24 } ] }` Nested type queries handle data with multi-level logical relationships, providing flexibility for complex data modeling. For complex nested data structures (such as JSON), use the summary and highlighting feature to precisely locate required information.	Array and nested types Nested query
Deduplication	Search indexes deduplicate query results to improve diversity. Deduplication limits how many times a specific attribute value appears in a single result set. For example, when searching for `laptop` on an e-commerce platform, deduplication prevents the first page from being dominated by a single brand.	Collapse (deduplication)
Sorting	Tablestore sorts data by primary key in alphabetical order by default. To sort by other fields, use the sorting feature of a search index. Search indexes support ascending or descending order, single-condition sorting, and multi-condition sorting. All sorting is global. By default, search index results are sorted by the primary key in alphabetical order.	Sorting and pagination Any search index query, such as a basic query
Total number of rows	When querying data with a search index, you can return the number of matching rows. This is useful for data validation and operations. An empty query condition matches all indexed data. The returned total equals the number of indexed rows in the data table. If data writing has stopped and all data is indexed, the total equals the number of rows in the data table.	Match all query Any search index query, such as a basic query
Statistical aggregation	Search indexes provide common aggregation functions: Max, Min, Avg, Sum, Count, DistinctCount, GroupBy, Percentile, and Histogram. These meet basic statistical needs for lightweight analysis.	Statistical aggregation
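To make the nested query row above more concrete, here is a small sketch in plain Python (not the Tablestore SDK). It shows the key property of a nested match: all conditions must hold for the same nested element, not for the row as a whole. The image IDs, the second sample document, and the helpers `nested_match` and `find_images` are hypothetical.

```python
import json

# img-1 reuses the tag JSON from the table; img-2 is made-up contrast data.
images = {
    "img-1": json.loads('{"tags": [{"name": "car", "score": 0.78}, {"name": "tree", "score": 0.24}]}'),
    "img-2": json.loads('{"tags": [{"name": "car", "score": 0.10}, {"name": "tree", "score": 0.91}]}'),
}

def nested_match(doc, path, predicate):
    """True if at least one element under `path` satisfies the whole predicate."""
    return any(predicate(element) for element in doc.get(path, []))

def find_images(docs, name, min_score):
    """Return IDs of images that have a tag `name` with score >= min_score."""
    return sorted(
        image_id
        for image_id, doc in docs.items()
        if nested_match(doc, "tags", lambda t: t["name"] == name and t["score"] >= min_score)
    )

print(find_images(images, "car", 0.5))  # ['img-1']
```

Note that `img-2` is excluded even though it contains a `car` tag and a score above 0.5, because those two values belong to different tags.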

Supported regions

Currently, the search index feature is available in the following regions: China (Hangzhou), China (Shanghai), China (Qingdao), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Guangzhou), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Philippines (Manila), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia), SAU (Riyadh - Partner Region), and . The vector search feature is not yet supported in the US (Silicon Valley) region.

Disaster recovery

In regions with zone-disaster recovery capabilities, search indexes provide zone-redundant storage by default. Data is stored across multiple zones within the region. If a single zone fails, read and write services continue without disruption.

Currently, search index supports zone-redundant storage in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and .

Data lifecycle

If your data table has no UpdateRow operations, you can use search index TTL. For more information, see Lifecycle management.

If you only need to retain data for a specific period and the time field does not need updates, implement TTL by sharding tables by time.

Dimension	Table sharding by time
Principle	Shard tables by a fixed interval (day, week, month, or year). Create a search index for each table and retain data tables for the required duration. For example, to retain data for six months, store each month's data in a separate table (table_1 through table_6) with its own search index. Each month, delete the table from six months ago. When querying, if the time range falls within a single table, query that table only. If it spans multiple tables, query each and merge the results.
Rule	A single table (single index) must not exceed 50 billion rows. Query performance is optimal when the row count stays below 20 billion.
Advantages	Control data retention duration by managing the number of retained tables. Query performance scales with data volume. Sharding caps each table's size, resulting in better performance and avoiding query timeouts.
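The routing logic behind time-based sharding can be sketched as follows. This is an illustrative outline only: the `events_YYYYMM` naming scheme, the six-month default, and the `search_table` callable are assumptions, and the actual per-table query would go through each table's own search index.

```python
from datetime import date

PREFIX = "events_"  # hypothetical naming scheme: one table per month, events_YYYYMM

def month_tables(start: date, end: date) -> list:
    """List the monthly tables whose data may overlap [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{PREFIX}{year:04d}{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names

def expired_tables(existing: list, today: date, keep_months: int = 6) -> list:
    """Return tables older than the retention window (current month included)."""
    months = today.year * 12 + (today.month - 1) - (keep_months - 1)
    oldest_kept = f"{PREFIX}{months // 12:04d}{months % 12 + 1:02d}"
    return sorted(n for n in existing if n.startswith(PREFIX) and n < oldest_kept)

def query_range(start: date, end: date, search_table) -> list:
    """Query every overlapping table and merge the rows, sorted by timestamp."""
    rows = []
    for table in month_tables(start, end):
        rows.extend(search_table(table, start, end))
    return sorted(rows, key=lambda row: row["ts"])

# In-memory stand-in for the per-table search; real code would query each index.
fake_tables = {
    "events_202312": [{"ts": date(2023, 12, 30), "msg": "a"}],
    "events_202401": [{"ts": date(2024, 1, 2), "msg": "b"}],
}

def search_table(table, start, end):
    return [row for row in fake_tables.get(table, []) if start <= row["ts"] <= end]

merged = query_range(date(2023, 12, 29), date(2024, 1, 5), search_table)
print([row["msg"] for row in merged])  # ['a', 'b']
```

Because the YYYYMM suffix sorts lexicographically in time order, a plain string comparison is enough to find tables that have aged out.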

Data versions

Search indexes do not support multiple data versions. You cannot create a search index for a data table with multiple versions enabled.

In a single-version table, if you customize the timestamp for each write, writing data with a smaller version number after a larger one may overwrite the larger version.

The data returned by Search and ParallelScan requests does not necessarily include the timestamp property.

Limits

Search indexes synchronize data from the data table asynchronously, so newly written data is not immediately visible to queries. The typical latency is within 3 seconds. For more information, see Search index limits.

Billing

Search indexes are billed for the storage space occupied by index data and the computing resources consumed for queries and analysis. For more information, see Billing overview.

Development and integration

API reference

Search indexes provide API operations for index management and data query. Data query includes the general-purpose Search API and the data-exporting ParallelScan API. ParallelScan sacrifices some features (sorting, aggregation) for higher performance and throughput.

Category	API	Description
Index management	CreateSearchIndex	Creates a search index.
Index management	UpdateSearchIndex	Updates the configuration of a search index, including its time to live (TTL) and index schema.
Index management	DescribeSearchIndex	Gets the detailed description of a search index.
Index management	ListSearchIndex	Lists the search indexes.
Index management	DeleteSearchIndex	Deletes a search index.
Data query	Search	Full-featured query API. Supports all search index features, including query functions, sorting, and statistical aggregation. Results are returned in the specified order. Supported features: query functions (non-primary key column query, column existence query, fuzzy query, boolean query, nested query, geo query, full-text search, and vector search); collapse (deduplication); sorting; statistical aggregation; total number of rows.
Data query	ParallelScan	Data-exporting API with parallel scan support. Includes all query functions but omits sorting and statistical aggregation. Returns all matched data at higher speed. At a concurrency of one, ParallelScan throughput is 5x that of the Search API. Query functions: non-primary key column query, column existence query, fuzzy query, boolean query, nested query, geo query, and full-text search. Supports multiple concurrent queries in a single request. When exporting data with multiple concurrent requests, use the ComputeSplits API to get the maximum concurrency for a single ParallelScan request.
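The export pattern described for ParallelScan (ask for the maximum concurrency, then let each worker page through its own share of the results) can be simulated as below. This sketch does not call Tablestore: `compute_splits` and `scan_partition` are hypothetical in-memory stand-ins for the ComputeSplits and ParallelScan operations, and the row data and page size are invented.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS = [{"pk": i} for i in range(10)]  # stand-in for all rows matched by a query

def compute_splits() -> int:
    """Hypothetical stand-in for ComputeSplits: returns the maximum concurrency."""
    return 3

def scan_partition(parallel_id: int, max_parallel: int, page_size: int = 2) -> list:
    """Hypothetical stand-in for one ParallelScan worker paging through its share."""
    share = ROWS[parallel_id::max_parallel]
    exported, token = [], 0
    while token < len(share):  # keep requesting pages until the share is exhausted
        exported.extend(share[token:token + page_size])
        token += page_size
    return exported

def export_all() -> list:
    """Run one worker per split and concatenate the results (no global order)."""
    max_parallel = compute_splits()
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        parts = pool.map(lambda i: scan_partition(i, max_parallel), range(max_parallel))
        return [row for part in parts for row in part]

exported = export_all()
print(len(exported))  # 10
```

The combined result contains every row exactly once but in no particular order, which mirrors the trade-off of giving up sorting in exchange for throughput.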

Integration methods

You can use the following SDKs or CLI tools to work with search indexes.

References

To query and analyze data with SQL, use the Tablestore SQL query feature.

Note
You can also analyze data in Tablestore using compute engines such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, or Flink. For more information, see Compute and analysis overview.

Appendix: SQL mapping

Some search index features map to SQL functions. The following table lists the mappings.

SQL	Search index	Search index documentation
Show	DescribeSearchIndex	Query search index description
Select	ColumnsToGet parameter in any query	Any search index query, such as a basic query
From	IndexName parameter in any query. Important: Only a single index is supported. Multiple indexes are not yet supported.	Any search index query, such as a basic query
Where	Conditions in any query	Any search index query, such as a basic query
Order by	sort parameter in any query	Sorting and pagination
Limit	limit parameter in any query	Sorting and pagination
Delete	Use any query to get the primary key of the row. Perform the DeleteRow operation.	Get the primary key of the row through any search index query, such as a basic query. Delete data by primary key.
Like	WildcardQuery	Wildcard query
And	operator = and in BoolQuery	Boolean query
Or	operator = or in BoolQuery	Boolean query
Not	BoolQuery(mustNotQueries)	Boolean query
Between	RangeQuery	Range query
Null	ExistsQuery	Column existence query
In	TermsQuery	Terms query
Min	Aggregation: min	Statistical aggregation
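Max	Aggregation: max	Statistical aggregation
Avg	Aggregation: avg	Statistical aggregation
Count	Aggregation: count	Statistical aggregation
Count(distinct)	Aggregation: distinctCount	Statistical aggregation
Sum	Aggregation: sum	Statistical aggregation
Group By	GroupBy	Statistical aggregation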
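As a rough illustration of the mapping above, the sketch below expresses the SQL condition `WHERE grade BETWEEN 3 AND 4 AND class IN (1, 2) AND NOT gender = 'female'` as a tree of query objects and evaluates it against sample rows. The dataclasses are plain Python stand-ins named after the query types in the table, not the SDK classes; their field names (`must`, `should`, `must_not`) and the sample rows are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TermQuery:          # column = value
    column: str
    value: object

@dataclass
class TermsQuery:         # SQL In
    column: str
    values: list

@dataclass
class RangeQuery:         # SQL Between (inclusive on both ends)
    column: str
    low: object
    high: object

@dataclass
class BoolQuery:          # SQL And / Or / Not
    must: list = field(default_factory=list)
    should: list = field(default_factory=list)
    must_not: list = field(default_factory=list)

def matches(query, row) -> bool:
    """Evaluate a query tree against one row (a dict of column values)."""
    if isinstance(query, TermQuery):
        return row.get(query.column) == query.value
    if isinstance(query, TermsQuery):
        return row.get(query.column) in query.values
    if isinstance(query, RangeQuery):
        value = row.get(query.column)
        return value is not None and query.low <= value <= query.high
    if isinstance(query, BoolQuery):
        return (
            all(matches(q, row) for q in query.must)
            and (not query.should or any(matches(q, row) for q in query.should))
            and not any(matches(q, row) for q in query.must_not)
        )
    raise TypeError(f"unsupported query: {query!r}")

where = BoolQuery(
    must=[RangeQuery("grade", 3, 4), TermsQuery("class", [1, 2])],
    must_not=[TermQuery("gender", "female")],
)

rows = [
    {"name": "A", "grade": 3, "class": 2, "gender": "male"},
    {"name": "B", "grade": 3, "class": 2, "gender": "female"},
    {"name": "C", "grade": 5, "class": 1, "gender": "male"},
    {"name": "D", "grade": 4, "class": 3, "gender": "male"},
]
print([r["name"] for r in rows if matches(where, r)])  # ['A']
```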