In some scenarios, queries might not efficiently use the primary key index. In these cases, ClickHouse might need to perform a full table scan on each column, especially when applying WHERE clause conditions. You can use data skipping indexes to accelerate these queries.
Understanding data skipping indexes
Data skipping indexes use specific data structures with the following core mechanisms:
Functional goal: Skip data parts that do not meet the
WHEREclause conditions through granules (pre-generated marks) of the data part, reducing IO and computation.Implementation method: Create lightweight indexes on
data partgranuleto quickly determine if thedata partcontains the target data.Applicable scenarios: Column filtering not covered by the primary key and complex expression query scenarios.
For more information, see Use data skipping indices where appropriate.
Impact of overuse
In some cases, data skipping indexes can accelerate specific queries. However, in many cases, overusing them can have negative impacts.
Performance decreases instead of improving.
Increased write overhead: Index building increases write latency and reduces data throughput.
Reduced query efficiency: If index columns have no statistical correlation with the primary key, many data parts are still loaded for evaluation, causing the system to perform both index checks and full table scans, resulting in double overhead.
Table design becomes more complex.
Bloated table structure: Multiple indexes increase metadata management complexity, affecting maintenance and readability.
Maintenance cost: Indexes need to be dynamically adjusted as data distribution changes, or they might become ineffective.
Usage principles
When using data skipping indexes, you need careful design rather than simply creating indexes to maximize their effectiveness. Therefore, we often see them complicating table design and slowing down insert performance, while rarely (if at all) improving query performance. The following are some principles for using data skipping indexes.
Golden rules for usage priority
Prioritize primary key optimization: Ensure the primary key covers high-frequency queries (recommended primary key columns ≤ 4-5). For primary key best practices, see Primary key design best practices.
Consider pre-computation solutions: Use Projections or materialized views to pre-aggregate data for high-frequency queries.
Consider data skipping indexes as a last resort.
Design principles
Strong correlation requirement: Index columns must have clear statistical association with the primary key (such as timestamp and device ID in time-series scenarios).
High filtering rate verification: Indexes need to skip more than 90% of data blocks to be meaningful. Otherwise, the cost-performance ratio is negative.
Implementation recommendations
Analysis and verification: After introducing indexes, analyze the actual skip amount through query logs and
EXPLAIN.Dynamic adjustment: Regularly evaluate index effectiveness and optimize or delete them as data distribution changes.
For more information, see Data skipping index best practices.