Sort keys let AnalyticDB for PostgreSQL skip large portions of disk blocks during table scans, dramatically reducing query response time for range-restricted queries. This feature applies to column-oriented, append-only tables and is most effective when queries consistently filter on a predictable set of columns.
This feature applies to:
- Reserved mode instances with kernel version later than 20200826
- Elastic mode instances with kernel version later than 20200906
How it works
AnalyticDB for PostgreSQL stores column-oriented data in disk blocks. For each block, the database records the minimum and maximum values of every column — a structure called a rough set index. When a query includes a range predicate in the WHERE clause, the query processor compares the predicate against each block's min/max values and skips any block that falls outside the range.
The more tightly correlated your data is with the sort key, the more blocks can be eliminated. For example, if a table holds seven years of data sorted by date and a query filters on a single month, only 1/(7 × 12) of the data needs to be scanned — eliminating roughly 98.8% of disk blocks. Without sorting, all blocks may be scanned.
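As a sketch of that scenario, consider a hypothetical table of sales data sorted by date (the table and column names are illustrative, not from AnalyticDB documentation):

```sql
-- Hypothetical table: seven years of sales rows, sorted by sale_date.
-- The rough set index stores the min/max sale_date of each disk block,
-- so the range predicate below lets the scanner skip every block whose
-- [min, max] interval falls outside September 2017.
SELECT count(*), sum(amount)
FROM sales
WHERE sale_date >= '2017-09-01' AND sale_date <= '2017-09-30';
```

Without sorting, few blocks can be excluded: each unsorted block tends to contain a wide spread of dates, so almost every min/max interval overlaps the predicate.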
AnalyticDB for PostgreSQL supports two sorting methods:
| Method | Behavior | Best for |
|---|---|---|
| Compound sorting | Sorts data as an ordered tuple of all sort key columns, prioritizing the leading column | Queries that filter on the first (leading) column of the sort key |
| Interleaved sorting | Assigns equal weight to each column in the sort key | Queries that filter on any subset of sort key columns, including non-leading columns |
For a detailed performance comparison, see Performance comparison: compound vs. interleaved sorting.
When to use sort keys
Sort keys benefit tables that meet all of the following criteria:
- Selective queries: queries filter on a small subset of rows using a range or equality predicate in the WHERE clause.
- Consistent filter columns: a high percentage of queries filter on the same column or columns.
- Large table size: the performance gain grows with table size. Sort keys are most impactful on tables with hundreds of millions of rows or more.
Sort keys add maintenance overhead: after data is loaded, you must explicitly sort the table, and you must re-sort it periodically as new data accumulates. If your workload has a high write rate with infrequent, ad-hoc queries, the maintenance cost may outweigh the query speedup.
Choose a sorting method
Use these guidelines to select the right method for your query patterns:
- If most queries filter on the leading column of the sort key, use compound sorting. It produces the fastest response times for leading-column predicates.
- If queries filter on non-leading columns or on arbitrary subsets of the sort key, use interleaved sorting. An interleaved sort key supports up to eight columns. The more sort key columns a query references, the greater the performance benefit. Note that re-sorting with interleaved sorting takes longer than with compound sorting because it performs additional data analysis.
- If you are unsure, start with compound sorting. It is the simpler choice and performs best when queries have a clear leading filter column.
Define a sort key when creating a table
Use the ORDER BY clause in CREATE TABLE to define one or more columns as the sort key. The table must use column-oriented, append-only storage (APPENDONLY=true, ORIENTATION=column).
CREATE TABLE test (date text, time text, open float, high float, low float, volume int)
WITH (APPENDONLY=true, ORIENTATION=column) ORDER BY (volume);
Full syntax:
CREATE [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name (
column_name data_type [, ...]
)
[ DISTRIBUTED BY (column [, ...]) | DISTRIBUTED RANDOMLY ]
[ ORDER BY (column [, ...]) ]
If your kernel version is earlier than 20210326, use SORTKEY (column [, ...]) instead of ORDER BY (column [, ...]) to define the sort key.
Sort a table
Defining a sort key does not automatically sort the data. After writing data to the table, run a sort command to apply the sort order and build the rough set index.
Compound sorting:
SORT table_name;
Interleaved sorting:
MULTISORT table_name;
If your kernel version is earlier than 20210326, use VACUUM SORT ONLY table_name for compound sorting and VACUUM REINDEX table_name for interleaved sorting.
As new rows are added to a sorted table, unsorted data accumulates and rough set filtering becomes less effective. Run SORT or MULTISORT periodically to maintain query performance.
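For example, a periodic maintenance job for the test table defined earlier in this topic might look like the following (the load step is illustrative):

```sql
-- Incremental load: new rows land unsorted at the end of the table.
INSERT INTO test VALUES ('2024-01-02', '09:30', 187.2, 189.5, 186.9, 120000);

-- Re-apply the sort order and rebuild the rough set index.
SORT test;          -- compound sorting
-- MULTISORT test;  -- use this instead if the table relies on interleaved sorting
```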
Modify a sort key
To change the sort key of an existing column-oriented table:
ALTER TABLE table_name SET ORDER BY (column [, ...]);
This statement updates the catalog only — it does not sort the data. Run SORT table_name afterward to apply the new sort order.
Example:
ALTER TABLE test SET ORDER BY (high, low);
SORT test;
If your kernel version is earlier than 20210326, use ALTER TABLE test SET SORTKEY (high, low).
Limits
| Item | Limit |
|---|---|
| Maximum sort key columns (interleaved sorting) | 8 |
| Table storage type | Column-oriented, append-only (APPENDONLY=true, ORIENTATION=column) only |
| Kernel version for ORDER BY / SORT / MULTISORT syntax | Later than 20210326 |
| Kernel version for legacy SORTKEY / VACUUM SORT ONLY / VACUUM REINDEX syntax | Earlier than 20210326 |
Performance comparison: compound vs. interleaved sorting
TPC-H benchmark: sort key impact on range queries
This section demonstrates how compound sorting improves query performance for rough set indexes compared to a full table scan, using a TPC-H Lineitem table that stores seven years of data.
This implementation of TPC is derived from the TPC Benchmark and is not comparable to published TPC Benchmark results, as this implementation does not comply with all the requirements of the TPC Benchmark.
Test setup:
- Create a 32-node instance.
- Write 13 billion rows to the Lineitem table.
- Query data in the time range from 1997-09-01 to 1997-09-30, comparing results when data is sorted by l_shipdate versus unsorted.
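Using the syntax shown earlier in this topic, the sorted variant of this test can be sketched as follows. The query shape is an assumption based on the stated time range; only the Lineitem table and the l_shipdate column come from the setup above.

```sql
-- Sort Lineitem by l_shipdate so the rough set index can prune blocks by date.
ALTER TABLE lineitem SET ORDER BY (l_shipdate);
SORT lineitem;

-- Range-restricted query over one month of the seven-year data set.
SELECT count(*)
FROM lineitem
WHERE l_shipdate >= '1997-09-01' AND l_shipdate <= '1997-09-30';
```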
Compound vs. interleaved: performance across query shapes
The following example uses two tables with identical data and sort keys to show how the two methods perform across different query shapes.
Test setup:
- Two tables (test and test_multi), each with four columns: id, num1, num2, value
- Sort key: (id, num1, num2) on both tables
- 10 million rows per table
- test sorted with compound sorting (SORT test)
- test_multi sorted with interleaved sorting (MULTISORT test_multi)
Create tables and insert data:
CREATE TABLE test (id int, num1 int, num2 int, value varchar)
WITH (APPENDONLY=TRUE, ORIENTATION=column)
DISTRIBUTED BY (id)
ORDER BY (id, num1, num2);
CREATE TABLE test_multi (id int, num1 int, num2 int, value varchar)
WITH (APPENDONLY=TRUE, ORIENTATION=column)
DISTRIBUTED BY (id)
ORDER BY (id, num1, num2);
INSERT INTO test (id, num1, num2, value)
SELECT g,
(random() * 10000000)::int,
(random() * 10000000)::int,
(ARRAY['foo', 'bar', 'baz', 'quux', 'boy', 'girl', 'mouse', 'child', 'phone'])[(floor(random() * 9) + 1)::int]
FROM generate_series(1, 10000000) AS g;
INSERT INTO test_multi SELECT * FROM test;
SORT test;
MULTISORT test_multi;
Point query performance
All three queries filter on the sort key columns, but at different positions.
-- Q1: filter on the leading column (id)
SELECT * FROM test WHERE id = 100000;
SELECT * FROM test_multi WHERE id = 100000;
-- Q2: filter on the second column (num1)
SELECT * FROM test WHERE num1 = 8766963;
SELECT * FROM test_multi WHERE num1 = 8766963;
-- Q3: filter on the second and third columns (num1, num2)
SELECT * FROM test WHERE num1 = 100000 AND num2 = 2904114;
SELECT * FROM test_multi WHERE num1 = 100000 AND num2 = 2904114;
| Query | Filter columns | Compound sorting | Interleaved sorting |
|---|---|---|---|
| Q1 | Leading column (id) | 0.026s | 0.55s |
| Q2 | Second column (num1) | 3.95s | 0.42s |
| Q3 | Second + third columns (num1, num2) | 4.21s | 0.071s |
Range query performance
-- Q1: range filter on the leading column (id)
SELECT count(*) FROM test WHERE id > 5000 AND id < 100000;
SELECT count(*) FROM test_multi WHERE id > 5000 AND id < 100000;
-- Q2: range filter on the second column (num1)
SELECT count(*) FROM test WHERE num1 > 5000 AND num1 < 100000;
SELECT count(*) FROM test_multi WHERE num1 > 5000 AND num1 < 100000;
-- Q3: range filter on the second and third columns (num1, num2)
SELECT count(*) FROM test WHERE num1 > 5000 AND num1 < 100000 AND num2 < 100000;
SELECT count(*) FROM test_multi WHERE num1 > 5000 AND num1 < 100000 AND num2 < 100000;
| Query | Filter columns | Compound sorting | Interleaved sorting |
|---|---|---|---|
| Q1 | Leading column (id) | 0.07s | 0.44s |
| Q2 | Second column (num1) | 3.35s | 0.28s |
| Q3 | Second + third columns (num1, num2) | 3.64s | 0.047s |
Key takeaways
- Compound sorting wins on the leading column. Q1 results show compound sorting has a shorter query response time than interleaved sorting when the filter targets the first sort key column.
- Interleaved sorting wins on non-leading columns. Q2 and Q3 results show interleaved sorting significantly outperforms compound sorting when queries bypass the leading column.
- Interleaved sorting scales with column count. The more sort key columns a query references, the larger the performance advantage of interleaved sorting (Q3 vs. Q2).
This test uses 10 million rows — a modest size for AnalyticDB for PostgreSQL. The performance difference between the two methods is more pronounced with larger tables.