In specific scenarios, primary key indexes cannot be efficiently used in queries. In this case, ApsaraDB for ClickHouse may perform a full table scan, especially for queries with the WHERE clause conditions. You can use data skipping indexes to accelerate these queries.
Introduction
Data skipping indexes use specific data structures with the following core mechanisms:
Purpose: Skips data parts that do not meet the
WHEREclause conditions by using granules within the data part, thereby reducing I/O operations and computation. Granules are pre-generated marks within data parts.Method: Creates lightweight indexes on the granule within a
data partto quickly verify whether thedata partcontains the required data.Scenarios: Applies to scenarios where queries filter non-primary key columns or involve complex expressions.
For more information, see Use data skipping indices where appropriate.
Impact of overuse
In some cases, data skipping indexes can accelerate specific queries. However, overusing them can have negative impacts.
Decreased performance
Increased write overhead: Building indexes increases write latency and lowers throughput.
Reduced query efficiency: If index columns lack statistical correlation with the primary key, many data parts are still loaded for evaluation. This forces the system to perform both index checks and full table scans, leading to significant performance degradation.
Complicated table design
Bloated table structure: Multiple indexes increase metadata management complexity, affecting maintenance and readability.
Maintenance cost: Indexes need to be dynamically adjusted as data distribution changes. Otherwise, they might become ineffective.
Usage notes
When using data skipping indexes, you must carefully design them to maximize their effectiveness. Poorly designed indexes can complicate table design, degrade insert performance, and lead to limited improvements in query performance in most cases. The following section describes the principles for using data skipping indexes.
Golden rules for usage priority
Prioritize primary key optimization: Make sure that the primary key covers high-frequency queries. We recommend that you use no more than four to five primary key columns. For more information, see Primary key design best practices.
Use pre-computation as a secondary option: Use projections or materialized views to pre-aggregate data for high-frequency queries.
Consider data skipping indexes as a last resort.
Design principles
Strong correlation requirement: Index columns must have strong statistical correlation with the primary key, such as timestamps and device IDs in time-series scenarios.
High filtering rate verification: Indexes that skip more than 90% of data blocks are considered effective. Otherwise, they are not cost-effective.
Suggestions
Analysis and verification: After introducing indexes, analyze the number of skipped data blocks by using query logs and
EXPLAIN.Dynamic adjustment: Assess index performance and refine or remove ineffective indexes as data distribution changes.
For more information, see Data skipping index best practices.