The primary key design in ClickHouse fundamentally differs from traditional OLTP database B-tree indexes. Its core strength lies in optimizing analytical queries through sparse indexes and ordered storage. This topic introduces ClickHouse primary key concepts and best practices.
Primary key misconceptions
First-time ClickHouse users often struggle to fully understand its unique primary key concept. The following misconceptions are common.
Misconception 1: Primary key uniqueness

ClickHouse primary keys do not enforce uniqueness; rows with duplicate primary key values can coexist (see the sketch after these misconceptions). The ORDER BY clause determines the data storage order.

Misconception 2: Primary key equals index

The primary key only ensures data ordering. When filtering on non-leading primary key columns, scanning multiple data blocks might still be necessary, so it should be used in conjunction with skip indexes.
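To see misconception 1 in action, here is a minimal sketch; the table name pk_demo and its schema are hypothetical. A MergeTree table accepts two rows with the same primary key value without complaint.

```sql
-- Hypothetical table illustrating that MergeTree primary keys are not unique.
CREATE TABLE pk_demo
(
    id    UInt64,
    value String
)
ENGINE = MergeTree
ORDER BY id;

-- Both rows are accepted: ClickHouse does not enforce uniqueness on id.
INSERT INTO pk_demo VALUES (1, 'first'), (1, 'second');

SELECT count() FROM pk_demo WHERE id = 1; -- returns 2
```

If deduplication is required, engines such as ReplacingMergeTree can merge duplicate rows in the background, but even then uniqueness is eventual rather than guaranteed at insert time.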
Understanding primary keys
Basic concepts
Unlike OLTP databases built on B+Tree indexes, ClickHouse uses a sparse index designed for ingestion rates of millions of rows per second and PB-scale datasets. ClickHouse indexes have the following characteristics.
Ordered storage
Data is stored on disk in the order specified by the ORDER BY primary key. Each data part is strictly sorted internally, which makes it possible to quickly identify data that might match queries.

Sparse index
ClickHouse uses a sparse index that records primary key values at fixed granularity (such as every 8192 rows), forming a lightweight index. During queries, the index quickly locates potentially matching data parts, then sequentially scans within blocks to filter target rows.
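The granularity is configurable per table. A minimal sketch, assuming a hypothetical events table; index_granularity defaults to 8192 and is written out here only for illustration:

```sql
-- Each sparse index entry (mark) covers index_granularity rows.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 8192;  -- the default value
```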
Design characteristics

ClickHouse is optimized for massive data insertion (millions of rows per second) and PB-scale analytical queries, sacrificing point-query efficiency in exchange for high-throughput range queries and excellent compression ratios.
Understanding through examples
Building on the basic concepts above, let's further understand ClickHouse primary keys through an example that walks through designing a primary key, creating a table, and analyzing a query.
Design the primary key.
Table T primary key design background:
Columns A, B, and C should be sorted in ascending order of cardinality (low cardinality columns first). Cardinality refers to the number of distinct values in a column.
Queries frequently filter data using columns A, B, and C, indicating these columns are the most frequently used in queries.
Create table T.
Use a DDL statement to create table T, specifying ORDER BY (A, B, C). The (A, B, C) tuple serves as the sparse primary key, and table T data will be sorted by columns A, B, and C; a possible DDL sketch follows the list below.

Table T data storage:
Data is stored in the form of data parts, and these parts are sorted by the specified columns (A, B, C).
Each data part has a sparse primary key index created based on the data sorting order.
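A possible DDL for table T, written as a minimal sketch; the column types are assumptions chosen so that cardinality ascends from A to C:

```sql
CREATE TABLE T
(
    A LowCardinality(String),  -- lowest cardinality, leading sort-key column
    B UInt32,
    C DateTime                 -- highest cardinality, last sort-key column
)
ENGINE = MergeTree
ORDER BY (A, B, C);  -- with no explicit PRIMARY KEY, the sort key is also the primary key
```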
Analyze queries on table T to understand primary key functionality.
Example statement:
```sql
SELECT ... FROM T WHERE A = ... AND B = ... AND C BETWEEN ... AND ... GROUP BY ... ORDER BY ... LIMIT ...
```

When ClickHouse receives the query statement above, its query processing engine uses the primary key to quickly locate relevant data based on the query conditions. By streaming ordered data blocks, it reduces disk seek time, thereby improving query efficiency.
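You can verify how much data the primary key prunes with EXPLAIN. A sketch using the assumed schema from the DDL above; the literal values are placeholders:

```sql
-- indexes = 1 makes EXPLAIN report which index is used and how many
-- granules remain after pruning.
EXPLAIN indexes = 1
SELECT count()
FROM T
WHERE A = 'cn'
  AND B = 42
  AND C BETWEEN now() - INTERVAL 1 DAY AND now();
```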

Impact of improper primary key design
Significant decrease in query performance.
If the primary key does not include frequently filtered columns, queries cannot use the sparse index to quickly skip irrelevant data blocks. This leads to scanning large amounts of redundant data, or even full table scans, seriously affecting query performance (see the sketch after this list).
Reduced data compression ratio.
When primary key order is not arranged by ascending cardinality (low cardinality columns first), identical values will be scattered across different data blocks, preventing compression algorithms from effectively removing duplicates.
Inefficient merging.
Improper primary key design leads to disordered sorting within data blocks, requiring processing of more overlapping or fragmented data blocks during merges. Frequent merge operations consume I/O and CPU resources, also affecting write throughput.
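To illustrate the first point, a sketch again using the assumed schema for table T: filtering only on the last sort-key column cannot use the sparse index effectively, which shows up in the granule counts that EXPLAIN reports.

```sql
-- Prunes well: A is the leading primary key column.
EXPLAIN indexes = 1
SELECT count() FROM T WHERE A = 'cn';

-- Prunes poorly: C is last in the key, so most granules must be scanned.
EXPLAIN indexes = 1
SELECT count() FROM T WHERE C >= now() - INTERVAL 1 DAY;
```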
Key principles for primary key selection
Select frequently filtered columns
Column quantity limitation: Usually no more than three columns to avoid index bloat and sorting overhead.
Filter frequency priority: Primary key columns should be columns frequently used in WHERE, GROUP BY, or JOIN conditions.
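If you are unsure which columns are filtered most often, the query log can help. A rough sketch, assuming system.query_log is enabled (it is by default on most deployments); inspecting the returned statements for recurring filter columns is left to the reader:

```sql
-- Pull recent finished queries that contain a WHERE clause.
SELECT query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%WHERE%'
  AND event_time > now() - INTERVAL 7 DAY
LIMIT 100;
```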
Column order trade-offs
First column is critical:
It has the greatest impact on data pruning and should be a high-frequency column with high selectivity (such as a time field).
For example, if queries typically filter by date, using date as the first column in the primary key allows quickly skipping irrelevant data blocks.

Subsequent columns for compression optimization:
Arrange columns in ascending order of cardinality (low cardinality columns first) to improve compression ratio.
For example, with primary key (country, city, user_id), country has the lowest cardinality, and storing identical values consecutively provides the best compression effect. You can measure each column's cardinality before fixing the order, as sketched after this list.

Balance between query performance and compression:
If high cardinality columns need frequent querying, they can be placed first, but this might sacrifice compression ratio.
Compromise solution: Use skip indexes (such as bloom_filter) to supplement queries on high-frequency but non-primary-key columns; a sketch follows this list. However, skip indexes also need to be designed properly. For specific considerations, see Overuse of skip indexes.
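To order columns by cardinality, measure it first. A minimal sketch; the visits table follows the (country, city, user_id) example above and is otherwise hypothetical:

```sql
-- uniq() returns an approximate count of distinct values per column.
SELECT
    uniq(country) AS country_cardinality,
    uniq(city)    AS city_cardinality,
    uniq(user_id) AS user_id_cardinality
FROM visits;
```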
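And a sketch of the compromise itself, again using the hypothetical visits table: a bloom_filter skip index on a high-cardinality column that did not make it into the primary key.

```sql
-- 0.01 is the target false-positive rate; GRANULARITY 4 means one index
-- entry per four index granules.
ALTER TABLE visits
    ADD INDEX user_id_bf user_id TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for data parts that already exist.
ALTER TABLE visits MATERIALIZE INDEX user_id_bf;
```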
References
For more information about primary keys and indexes, see Primary keys and indexes.