How to use hop index - ApsaraDB for ClickHouse - Alibaba Cloud Documentation Center

In some scenarios, queries might not efficiently use the primary key index. In these cases, ClickHouse might need to perform a full table scan on each column, especially when applying WHERE clause conditions. You can use data skipping indexes to accelerate these queries.

Understanding data skipping indexes

Data skipping indexes use specific data structures with the following core mechanisms:

Functional goal: Skip data parts that do not meet the WHERE clause conditions through granules (pre-generated marks) of the data part, reducing IO and computation.
Implementation method: Create lightweight indexes on data part granule to quickly determine if the data part contains the target data.
Applicable scenarios: Column filtering not covered by the primary key and complex expression query scenarios.

For more information, see Use data skipping indices where appropriate.

Impact of overuse

In some cases, data skipping indexes can accelerate specific queries. However, in many cases, overusing them can have negative impacts.

Performance decreases instead of improving.
- Increased write overhead: Index building increases write latency and reduces data throughput.
- Reduced query efficiency: If index columns have no statistical correlation with the primary key, many data parts are still loaded for evaluation, causing the system to perform both index checks and full table scans, resulting in double overhead.
Table design becomes more complex.
- Bloated table structure: Multiple indexes increase metadata management complexity, affecting maintenance and readability.
- Maintenance cost: Indexes need to be dynamically adjusted as data distribution changes, or they might become ineffective.

Usage principles

When using data skipping indexes, you need careful design rather than simply creating indexes to maximize their effectiveness. Therefore, we often see them complicating table design and slowing down insert performance, while rarely (if at all) improving query performance. The following are some principles for using data skipping indexes.

Golden rules for usage priority

Prioritize primary key optimization: Ensure the primary key covers high-frequency queries (recommended primary key columns ≤ 4-5). For primary key best practices, see Primary key design best practices.
Consider pre-computation solutions: Use Projections or materialized views to pre-aggregate data for high-frequency queries.
Consider data skipping indexes as a last resort.

Design principles

Strong correlation requirement: Index columns must have clear statistical association with the primary key (such as timestamp and device ID in time-series scenarios).
High filtering rate verification: Indexes need to skip more than 90% of data blocks to be meaningful. Otherwise, the cost-performance ratio is negative.

Implementation recommendations

Analysis and verification: After introducing indexes, analyze the actual skip amount through query logs and EXPLAIN.
Dynamic adjustment: Regularly evaluate index effectiveness and optimize or delete them as data distribution changes.

For more information, see Data skipping index best practices.