Sort keys let AnalyticDB for PostgreSQL skip large portions of disk blocks during table scans, dramatically reducing query response time for range-restricted queries. This feature applies to column-oriented, append-only tables and is most effective when queries consistently filter on a predictable set of columns.
This feature applies to:
- Reserved mode instances with kernel version later than 20200826
- Elastic mode instances with kernel version later than 20200906
How it works
AnalyticDB for PostgreSQL stores column-oriented data in disk blocks. For each block, the database records the minimum and maximum values of every column — a structure called a rough set index. When a query includes a range predicate in the WHERE clause, the query processor compares the predicate against each block's min/max values and skips any block that falls outside the range.
The more tightly correlated your data is with the sort key, the more blocks can be eliminated. For example, if a table holds seven years of data sorted by date and a query filters on a single month, only 1/(7 × 12) of the data needs to be scanned — eliminating roughly 98.8% of disk blocks. Without sorting, all blocks may be scanned.
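As a sketch of that scenario, consider a hypothetical table of sales data sorted by date (the table and column names are illustrative, not from AnalyticDB documentation):

```sql
-- Hypothetical table: seven years of sales rows, sorted by sale_date.
-- The rough set index stores the min/max sale_date of each disk block,
-- so the range predicate below lets the scanner skip every block whose
-- [min, max] interval falls outside September 2017.
SELECT count(*), sum(amount)
FROM sales
WHERE sale_date >= '2017-09-01' AND sale_date <= '2017-09-30';
```

Without sorting, few blocks can be excluded: each unsorted block tends to contain a wide spread of dates, so almost every min/max interval overlaps the predicate.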
AnalyticDB for PostgreSQL supports two sorting methods:
| Method | Behavior | Best for |
|---|---|---|
| Compound sorting | Sorts data as an ordered tuple of all sort key columns, prioritizing the leading column | Queries that filter on the first (leading) column of the sort key |
| Interleaved sorting | Assigns equal weight to each column in the sort key | Queries that filter on any subset of sort key columns, including non-leading columns |
For a detailed performance comparison, see Performance comparison: compound vs. interleaved sorting.
When to use sort keys
Sort keys benefit tables that meet all of the following criteria:
- Selective queries: queries filter on a small subset of rows using a range or equality predicate in the WHERE clause.
- Consistent filter columns: a high percentage of queries filter on the same column or columns.
- Large table size: the performance gain grows with table size. Sort keys are most impactful on tables with hundreds of millions of rows or more.
Sort keys add maintenance overhead: after data is loaded, you must explicitly sort the table, and you must re-sort it periodically as new data accumulates. If your workload has a high write rate with infrequent, ad-hoc queries, the maintenance cost may outweigh the query speedup.
Choose a sorting method
Use these guidelines to select the right method for your query patterns:
- If most queries filter on the leading column of the sort key, use compound sorting. It produces the fastest response times for leading-column predicates.
- If queries filter on non-leading columns or on arbitrary subsets of the sort key, use interleaved sorting. An interleaved sort key supports up to eight columns. The more sort key columns a query references, the greater the performance benefit. Note that re-sorting with interleaved sorting takes longer than with compound sorting because it performs additional data analysis.
- If you are unsure, start with compound sorting. It is the simpler choice and performs best when queries have a clear leading filter column.
Define a sort key when creating a table
Use the ORDER BY clause in CREATE TABLE to define one or more columns as the sort key. The table must use column-oriented, append-only storage (APPENDONLY=true, ORIENTATION=column).
CREATE TABLE test (date text, time text, open float, high float, low float, volume int)
WITH (APPENDONLY=true, ORIENTATION=column) ORDER BY (volume);
Full syntax:
CREATE [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name (
column_name data_type [, ...]
)
[ DISTRIBUTED BY (column [, ...]) | DISTRIBUTED RANDOMLY ]
[ ORDER BY (column [, ...]) ]
If your kernel version is earlier than 20210326, use SORTKEY (column [, ...]) instead of ORDER BY (column [, ...]) to define the sort key.
Sort a table
Defining a sort key does not automatically sort the data. After writing data to the table, run a sort command to apply the sort order and build the rough set index.
Compound sorting:
SORT table_name;
Interleaved sorting:
MULTISORT table_name;
If your kernel version is earlier than 20210326, use VACUUM SORT ONLY table_name for compound sorting and VACUUM REINDEX table_name for interleaved sorting.
As new rows are added to a sorted table, unsorted data accumulates and rough set filtering becomes less effective. Run SORT or MULTISORT periodically to maintain query performance.
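For example, a periodic maintenance job for the test table defined earlier in this topic might look like the following (the load step is illustrative):

```sql
-- Incremental load: new rows land unsorted at the end of the table.
INSERT INTO test VALUES ('2024-01-02', '09:30', 187.2, 189.5, 186.9, 120000);

-- Re-apply the sort order and rebuild the rough set index.
SORT test;          -- compound sorting
-- MULTISORT test;  -- use this instead if the table relies on interleaved sorting
```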
Modify a sort key
To change the sort key of an existing column-oriented table:
ALTER TABLE table_name SET ORDER BY (column [, ...]);
This statement updates the catalog only — it does not sort the data. Run SORT table_name afterward to apply the new sort order.
Example:
ALTER TABLE test SET ORDER BY (high, low);
SORT test;
If your kernel version is earlier than 20210326, use ALTER TABLE test SET SORTKEY (high, low).
Limits
| Item | Limit |
|---|---|
| Maximum sort key columns (interleaved sorting) | 8 |
| Table storage type | Column-oriented, append-only (APPENDONLY=true, ORIENTATION=column) only |
| Kernel version for ORDER BY / SORT / MULTISORT syntax | Later than 20210326 |
| Kernel version for legacy SORTKEY / VACUUM SORT ONLY / VACUUM REINDEX syntax | Earlier than 20210326 |
Performance comparison: compound vs. interleaved sorting
TPC-H benchmark: sort key impact on range queries
This section demonstrates how compound sorting improves query performance for rough set indexes compared to a full table scan, using a TPC-H Lineitem table that stores seven years of data.
This implementation of TPC is derived from the TPC Benchmark and is not comparable to published TPC Benchmark results, as this implementation does not comply with all the requirements of the TPC Benchmark.
Test setup:
- Create a 32-node instance.
- Write 13 billion rows to the Lineitem table.
- Query data in the time range from 1997-09-01 to 1997-09-30, comparing results when data is sorted by l_shipdate versus unsorted.
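Using the syntax shown earlier in this topic, the sorted variant of this test can be sketched as follows. The query shape is an assumption based on the stated time range; only the Lineitem table and the l_shipdate column come from the setup above.

```sql
-- Sort Lineitem by l_shipdate so the rough set index can prune blocks by date.
ALTER TABLE lineitem SET ORDER BY (l_shipdate);
SORT lineitem;

-- Range-restricted query over one month of the seven-year data set.
SELECT count(*)
FROM lineitem
WHERE l_shipdate >= '1997-09-01' AND l_shipdate <= '1997-09-30';
```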
Compound vs. interleaved: performance across query shapes
The following example uses two tables with identical data and sort keys to show how the two methods perform across different query shapes.
Test setup:
- Two tables (test and test_multi), each with four columns: id, num1, num2, value
- Sort key: (id, num1, num2) on both tables
- 10 million rows per table
- test sorted with compound sorting (SORT test)
- test_multi sorted with interleaved sorting (MULTISORT test_multi)
Create tables and insert data:
CREATE TABLE test (id int, num1 int, num2 int, value varchar)
WITH (APPENDONLY=TRUE, ORIENTATION=column)
DISTRIBUTED BY (id)
ORDER BY (id, num1, num2);
CREATE TABLE test_multi (id int, num1 int, num2 int, value varchar)
WITH (APPENDONLY=TRUE, ORIENTATION=column)
DISTRIBUTED BY (id)
ORDER BY (id, num1, num2);
INSERT INTO test (id, num1, num2, value)
SELECT g,
(random() * 10000000)::int,
(random() * 10000000)::int,
(ARRAY['foo', 'bar', 'baz', 'quux', 'boy', 'girl', 'mouse', 'child', 'phone'])[(floor(random() * 9) + 1)::int]
FROM generate_series(1, 10000000) AS g;
INSERT INTO test_multi SELECT * FROM test;
SORT test;
MULTISORT test_multi;
Point query performance
All three queries filter on the sort key columns, but at different positions.
-- Q1: filter on the leading column (id)
SELECT * FROM test WHERE id = 100000;
SELECT * FROM test_multi WHERE id = 100000;
-- Q2: filter on the second column (num1)
SELECT * FROM test WHERE num1 = 8766963;
SELECT * FROM test_multi WHERE num1 = 8766963;
-- Q3: filter on the second and third columns (num1, num2)
SELECT * FROM test WHERE num1 = 100000 AND num2 = 2904114;
SELECT * FROM test_multi WHERE num1 = 100000 AND num2 = 2904114;
| Query | Filter columns | Compound sorting | Interleaved sorting |
|---|---|---|---|
| Q1 | Leading column (id) | 0.026s | 0.55s |
| Q2 | Second column (num1) | 3.95s | 0.42s |
| Q3 | Second + third columns (num1, num2) | 4.21s | 0.071s |
Range query performance
-- Q1: range filter on the leading column (id)
SELECT count(*) FROM test WHERE id > 5000 AND id < 100000;
SELECT count(*) FROM test_multi WHERE id > 5000 AND id < 100000;
-- Q2: range filter on the second column (num1)
SELECT count(*) FROM test WHERE num1 > 5000 AND num1 < 100000;
SELECT count(*) FROM test_multi WHERE num1 > 5000 AND num1 < 100000;
-- Q3: range filter on the second and third columns (num1, num2)
SELECT count(*) FROM test WHERE num1 > 5000 AND num1 < 100000 AND num2 < 100000;
SELECT count(*) FROM test_multi WHERE num1 > 5000 AND num1 < 100000 AND num2 < 100000;
| Query | Filter columns | Compound sorting | Interleaved sorting |
|---|---|---|---|
| Q1 | Leading column (id) | 0.07s | 0.44s |
| Q2 | Second column (num1) | 3.35s | 0.28s |
| Q3 | Second + third columns (num1, num2) | 3.64s | 0.047s |
Key takeaways
- Compound sorting wins on the leading column. Q1 results show compound sorting has a shorter query response time than interleaved sorting when the filter targets the first sort key column.
- Interleaved sorting wins on non-leading columns. Q2 and Q3 results show interleaved sorting significantly outperforms compound sorting when queries bypass the leading column.
- Interleaved sorting scales with column count. The more sort key columns a query references, the larger the performance advantage of interleaved sorting (Q3 vs. Q2).
This test uses 10 million rows — a modest size for AnalyticDB for PostgreSQL. The performance difference between the two methods is more pronounced with larger tables.