All Products
Search
Document Center

PolarDB:Filter data by using IMCI

Last Updated:Mar 28, 2026

When you run analytical queries against large PolarDB tables, full table scans are the default: IMCI loads data packs from disk, decompresses them, and evaluates every row. For tables with hundreds of millions of rows, this is slow and puts pressure on the least recently used (LRU) cache, reducing overall queries per second (QPS).

The In-Memory Column Index (IMCI) feature of PolarDB solves this with pruning — a technique that filters out entire data packs before the query touches them. Pruning reduces disk I/O, compute overhead, and LRU cache churn. In the TPC-H benchmark at 100 GB, pruning cuts query time by up to 50% for numeric range queries and bitmap indexes reduce query time from seconds to milliseconds for low-cardinality equality queries.

This topic describes how each pruning method works, when to use it, its limitations, and the setup it requires.

How IMCI organizes data

IMCI organizes column data into row groups of approximately 64,000 rows each, stored in append order without a sort key. Because data has no inherent ordering, IMCI cannot filter records as precisely as InnoDB's ordered indexes. Without pruning, every query loads all relevant data packs.

After an IMCI pruner evaluates a data pack, the pack enters one of three states:

StateMeaningBenefit
Accept (AC)All rows matchSkip per-row evaluation; reduce compute overhead
Reject (RE)No rows matchSkip loading the pack from disk; reduce I/O
Partial accept (PA)Some rows may matchScan each row in the pack to find matches
image.png

IMCI supports four pruning methods: partition pruning, statistics pruning (minmax indexes and Bloom filters), Runtime filters, and bitmap indexes.

Choose a pruning method

Use the following table to select a method based on your data and query type.

MethodBest forData characteristicQuery typeSetup required
Partition pruningPartitioned tablesData splits into non-overlapping ranges by time or categoryQueries with partition key in WHERE clauseDDL to create partitioned table
Minmax indexContinuous numeric or timestamp valuesOrdered or clustered data (e.g., timestamps inserted sequentially)Range queries (>, <, BETWEEN)Automatic
Bloom filterHigh-cardinality discrete valuesRandomly distributed IDs or codesPoint queries (=, IN)Automatic
Runtime filterMulti-table joinsSmall dimension table joined to large fact tableJOIN, GROUP BY, ORDER BYNo setup; generated at query time
Bitmap indexLow-cardinality columns where statistics pruning is ineffectiveEvenly distributed values across data packsAND, OR predicates across multiple columns; COUNT(*) aggregationsExplicit creation per column

Minmax indexes and Bloom filters are created automatically by IMCI for supported column types. Partition pruning and bitmap indexes require explicit setup. Runtime filters require no setup and are generated dynamically at query time.

Partition pruning

Partition pruning filters entire partitions before reading any data. When a WHERE clause includes the partition key, IMCI uses partition metadata to identify matching partitions and skips all others.

IMCI supports three partitioning methods:

  • RANGE: Splits a table into ranges with defined boundary points. IMCI uses binary search to locate the matching partition.

  • LIST: Maps discrete values to partitions using <value, part_id> tuples, also searched with binary search.

  • HASH: Enumerates possible input values, hashes them, and determines which partition each falls into. Effective only for integer fields with a small number of distinct values.

image

Example: range partitioning on order date

If the orders table is partitioned by month, the following query touches only the January 2022 partition:

SELECT * FROM orders WHERE order_date = '2022-01-01';

IMCI also applies partition pruning through equivalence inference across joins. For example, if both R and S are partitioned on column a, the condition R.a = S.a AND R.a > 10 lets IMCI infer S.a > 10 and prune partitions of S as well.

When to use:

  • Most queries filter on the partition key.

  • You manage data lifecycle by partition — for example, drop old partitions instead of deleting rows.

  • You need level-1 or level-2 partitions based on business requirements.

When not to use:

  • Queries rarely include the partition key in their WHERE clause. Pruning has no effect in this case.

  • The table is large and you cannot perform DDL during off-peak hours. DDL operations on existing large tables are expensive.

Statistics pruning

Statistics pruning uses lightweight per-column metadata stored in memory to decide whether a data pack can be skipped. Because the statistics reside in memory, evaluating them requires no disk I/O. This approach is similar to data skipping indexes in ClickHouse and the Knowledge Grid of Infobright.

IMCI provides two statistics-based pruners: minmax indexes and Bloom filters. Both are created automatically.

Minmax index

A minmax index stores the minimum and maximum value of each data pack for a column. During a query, IMCI compares the query's range predicate against these bounds. If the query range does not overlap with a pack's [min, max] interval, the pack is rejected without loading it from disk.

Example: With the condition A > 15 AND B < 10 and three data packs per column, a minmax index rejects two of the three packs, reducing the scan workload by two thirds.

image.png

When to use:

  • Columns contain continuous values such as timestamps, dates, or numeric ranges.

  • Data is inserted in roughly sorted order so that min/max ranges per pack are tight and non-overlapping.

When not to use:

  • Column values are evenly distributed across all packs. Overlapping min/max ranges make the index ineffective.

Bloom filter

A Bloom filter is a probabilistic data structure that checks set membership using a bit array and multiple hash functions. When IMCI builds a Bloom filter for a data pack, it maps each value to several bits. At query time, if any mapped bit is 0, the value is definitively absent and the pack is rejected.

Bloom filters may produce false positives — reporting a value as present when it is not.

image.png

When to use:

  • Columns contain high-cardinality discrete values such as randomly generated IDs, order numbers, or product codes.

  • Queries use equality (=) or IN predicates.

When not to use:

  • Column values are evenly spread across packs. False positives increase and filtering effectiveness drops.

Combining minmax indexes and Bloom filters

IMCI evaluates both pruners together. A data pack is skipped if at least one pruner rejects it. Using them together improves pruning coverage compared to either alone.

Nullable column support

IMCI lets you create minmax indexes and Bloom filters on nullable columns, supporting predicates including IS NULL, IS NOT NULL, >, <, and =.

When building a pruner, IMCI excludes null values from min/max calculations. A pack containing {1, 2, 3, NULL} has min=1 and max=3. At query time, results are computed ignoring nulls, then adjusted based on whether each pack contains nulls:

  • A pack containing only nulls cannot be pruned by value predicates.

  • A pack partially containing nulls can still be pruned based on its non-null min/max range.

Example: With condition A > 15 and three packs where Pack A1 contains only nulls, Pack A2 and Pack A3 partially contain nulls:

  1. Evaluate without nulls: [PA, AC, RE]

  2. Adjust for null presence: Pack A1 (only nulls) cannot be filtered — convert to RE.

  3. Final result: [RE, PA, RE]. Packs A1 and A3 are pruned.

image.png

Improving pruning effectiveness on evenly distributed data

Statistics pruning is less effective when values are evenly distributed across data packs, because min/max ranges overlap and Bloom filters produce more false positives.

To improve effectiveness:

  • Reduce data pack size: Smaller packs produce finer-grained indexes and tighter min/max ranges. Note that smaller packs consume more memory.

  • Use a sort key: IMCI's sort key feature reorders data before building statistics, clustering similar values into the same packs and improving pruning accuracy.

  • Disable pruners: If statistics pruning adds overhead without improving query performance — for example, on uniformly distributed columns — disable the relevant pruners for those queries.

Runtime filters

A Runtime filter is generated dynamically as a query executes. It captures the result set from one side of a join (typically the smaller dimension table) and passes it as a filter to the other side (typically the larger fact table) before the fact table scan begins. As the query progresses and the filter tightens — for example, as a TopK query finds better candidates and raises the minimum threshold — more packs are pruned dynamically.

IMCI supports two types of Runtime filters:

  • Minmax Runtime filter: Generates a range expression (e.g., id > 1 AND id < 100) from the min and max values of the build-side result set.

  • Bloom filter Runtime filter: Builds a Bloom filter over the build-side result set for equality-based filtering of STRING-type columns.

Bloom filter Runtime filters increase memory overhead and are not suitable for large result sets.

Example: accelerating a hash join

SELECT * FROM sales JOIN items ON sales.item_id = items.id WHERE items.price > 1000

Here items is the dimension table (small) and sales is the fact table (large). IMCI builds a Runtime filter from the item_id values that satisfy price > 1000. If the resulting IDs range from 1 to 100, IMCI generates id > 1 AND id < 100 or IN(id1, id2, id3 ...) and passes it to the sales scan, pruning rows before any join probes occur.

If a minmax pruner exists on the filtered column, IMCI uses the Runtime filter to further prune data packs, reducing both rows scanned and packs loaded.

Runtime filters also apply to GROUP BY and ORDER BY queries. In massively parallel processing (MPP) scenarios, they reduce the volume of data shuffled across nodes.

image

Bitmap indexes

A bitmap index assigns a bit vector to each distinct value in a column. Each bit represents whether that value appears in a given row. To evaluate a filter predicate, IMCI performs bitwise AND/OR operations on the relevant bit vectors and retrieves matching row numbers directly — without loading full data packs.

Bitmap indexes are most effective when:

  • Columns have low cardinality (e.g., status, category, market segment) and values are evenly spread across data packs, making statistics pruning ineffective.

  • Queries combine multiple filter predicates with AND or OR.

  • Queries aggregate without materializing individual rows, such as SELECT COUNT(*) FROM t WHERE ....

Compared to B-tree indexes, bitmap indexes use less storage space for low-cardinality columns. However, global bitmap indexes require locking the entire table on row updates, making them best suited for read-heavy workloads.

image

Performance benchmark

The following results measure pruner and bitmap index acceleration on 100 GB of TPC-H data at concurrency 1, using in-memory storage. I/O-bound workloads show greater improvement.

Pruner acceleration (Q1–Q5)

Q1–Q3 test minmax index acceleration on partsupp.ps_suppkey (numeric range). Q4–Q5 test Bloom filter acceleration on orders.o_clerk (string equality and IN).

QueryWithout pruners (s)With pruners (s)
Q1: ps_suppkey = 411640.110.05
Q2: ps_suppkey IN (41164, 58321)0.140.07
Q3: ps_suppkey BETWEEN 40000 AND 500000.130.06
Q4: o_clerk = 'Clerk#000068170'0.890.43
Q5: o_clerk IN ('Clerk#000068170', 'Clerk#000087784')1.851.35
image.png

The acceleration ratio is proportional to the number of data packs filtered out.

Bitmap index acceleration (Q6–Q9)

The customer.c_mktsegment column has evenly distributed values, so pruners alone (Q6–Q8) provide no benefit. Adding bitmap indexes reduces query time dramatically.

QueryWithout pruners (s)With pruners only (s)With pruners + bitmap indexes (s)
Q6: c_mktsegment = 'AUTOMOBILE'1.431.430.03
Q7: c_mktsegment IN ('AUTOMOBILE', 'FURNITURE', 'BUILDING')3.493.490.07
Q8: c_mktsegment = 'AUTOMOBILE' OR c_phone = '18-338-906-3675'2.482.481.09
Q9: c_mktsegment = 'AUTOMOBILE' AND c_phone = '18-338-906-3675'1.250.050.05
image.png

Q9 benefits from pruners alone because the AND predicate allows minmax or Bloom filter pruning on at least one column. Q8 (OR) requires both columns to independently pass the filter, which only bitmap indexes handle effectively.

What's next

  • TopK operator implementation in IMCI — How IMCI uses statistics to accelerate TopK queries with the ORDER BY ... LIMIT pattern, including how Runtime filters tighten dynamically as the query executes.