All Products
Search
Document Center

AnalyticDB:Beam sorting optimization (V7.0)

Last Updated:Jan 30, 2024

If you frequently perform range queries or equivalence filtering on one or more columns of a Beam table, you can specify a compound sort key or an interleaved sort key to improve query performance.

When you create a table, Beam allows you to specify one or more columns as the sort key. After data is written to the table, the system sorts the data by sort key. When Beam scans the sorted data in the table, Beam can use the maximum and minimum values of columns to skip the data blocks that do not meet the filter conditions. This significantly reduces I/O overheads. In most cases, sort keys can help you achieve better data compression ratios. Beam supports compound sort keys and interleaved sort keys.

Usage notes

Only AnalyticDB for PostgreSQL V7.0 instances of V7.0.1.x or later support the Beam sorting optimization feature. For information about how to view the minor version of an instance, see View the minor engine version.

Compound sorting

Define a compound sort key

A compound sort key consists of all columns that are listed in the sort key definition. Data is sorted based on the order in which the columns are listed. A compound sort key is efficient when the filter conditions of a query contain a prefix of the sort key columns.

Execute the following statement to create a Beam table that contains a compound sort key:

CREATE TABLE beam_example (
    id integer,
    name text,
    ftime timestamp
) 
USING beam 
DISTRIBUTED BY (id) 
ORDER BY(id);

Maintain a compound sort key

After you create a table that contains a compound sort key, the written data is not immediately sorted based on the sort key columns. The background process automatically generates an optimal sort order based on the amount of data and the number of files. If you want to immediately sort the written data, you can execute the OPTIMIZE statement.

Execute the following statement to immediately sort the data in the beam_example table:

OPTIMIZE beam_example;
Important

The OPTIMIZE statement blocks DDL changes. After you execute the statement, DDL statements automatically continue.

Interleaved sorting

Usage notes

  • Only AnalyticDB for PostgreSQL V7.0 instances of V7.0.4.0 or later support Beam interleaved sorting. For information about how to view the minor version of an instance, see View the minor engine version.

  • You can create a Beam table that contains only two to eight interleaved sort key columns.

  • We recommend that you do not specify incremented columns, such as the date and timestamp columns, as an interleaved sort key.

Define an interleaved sort key

An interleaved sort key assigns an equal weight to each column in the sort key. This way, all sort key columns are equally used. An interleaved sort key is efficient for queries whose filter conditions contain a sort key column.

Execute the following statement to create a Beam table that contains an interleaved sort key:

CREATE TABLE beam_example_interleaved (
    id integer,
    name text,
    ftime timestamp,
    region varchar,
    age integer
) 
USING beam 
DISTRIBUTED BY (id) 
ZORDER BY(name, region, age);

Maintain an interleaved sort key

After you create a table that contains an interleaved sort key, the written data is not immediately sorted in an interleaved manner. The background process automatically generates an optimal sort order based on the amount of data and the number of files. As the amount of data that is written increases, interleaved sorting may cause data skew to occur due to data range changes. In this case, you must manually re-sort the table data.

For example, the beam_example_interleaved table is sorted based on the name and region columns in an interleaved manner. Execute the following statement to query data skew on the interleaved sort key columns of the table:

SELECT * FROM adbpg_toolkit.pg_get_interleaved_skew('beam_example_interleaved'::regclass)
 relid | colname  | skew |  suggestion   
-------+----------+------+---------------
 17139 | name     | 0.46 | 
 17139 | region   | 0.54 | NEED OPTIMIZE
 17139 | OVER ALL | 0.54 | NEED OPTIMIZE
(3 rows)

Execute the following statement to query data skew on each column of all interleaved sorting tables of the current database:

SELECT * FROM adbpg_toolkit.pg_stat_interleaved_skew;
 relid |         relname          | colname  | skew |  suggestion   
-------+--------------------------+----------+------+---------------
 17139 | beam_example_interleaved | name     | 0.46 | 
 17139 | beam_example_interleaved | region   | 0.54 | NEED OPTIMIZE
 17139 | beam_example_interleaved | OVER ALL | 0.54 | NEED OPTIMIZE
(3 rows)

Columns in the sample result:

  • skew: the skew ratio of the interleaved sort key column.

  • suggestion: the suggestion on the interleaved sort key column. NEED OPTIMIZE indicates that you must re-sort the column.

  • OVER ALL in the colname column indicates the overall data skew of all interleaved sort key columns. If you learn about the data skew of only specific interleaved sort key columns, you can sort the entire table based on the data skew of the columns.

If you want to immediately sort the table data, you can execute the OPTIMIZE statement.

Execute the following statement to immediately sort the data in the beam_example_interleaved table:

OPTIMIZE beam_example_interleaved;
Important

The OPTIMIZE statement blocks DDL changes. After you execute the statement, DDL statements automatically continue.

Add or modify a sort key

After you create a table, you can add, modify, or remove a sort key.

Usage notes

  • When you add a sort key to a table or modify a sort key in a table, the table is locked and data cannot be read or written in the table.

  • If you modify a sort key in a table, all data of the table is rewritten. If a table contains a large amount of data, an extended period of time is required to rewrite the data. Proceed with caution.

  • If you modify a compound sort key in a table, the table data is immediately re-sorted. If you modify an interleaved sort key in a table, the table data is not immediately re-sorted. After you modify an interleaved sort key in a table, we recommend that you execute the OPTIMIZE statement.

Examples

  • Add or modify a compound sort key.

    ALTER TABLE beam_example SET ORDER BY(id, name);
  • Add or modify an interleaved sort key.

    ALTER TABLE beam_example_interleaved SET ZORDER BY(name, region);
  • Remove a sort key.

    ALTER TABLE beam_example SET ORDER NONE;

Select a sort key

Suggestions on selecting a sort key for Beam tables:

  • If you frequently perform range queries or equivalence filtering on one or more columns of a Beam table, you can specify the columns as the sort key.

  • If you frequently perform range queries or equivalence filtering on multiple columns of a Beam table, the query frequency and filtering rate of the columns are approximately the same, and auto-increment columns are not included in the columns, we recommend that you select an interleaved sort key. If the query frequency and filtering rate of a column are high, we recommend that you select a compound sort key.

  • If you want to specify a compound sort key in which all columns are used at the same frequency, we recommend that you arrange low-cardinality columns in the front of the compound sort key.

  • If you want to specify an interleaved sort key, we recommend that you select low-cardinality columns that involve large amounts of data to improve query performance.

Query performance differences between compound sort keys and interleaved sort keys

The following table describes the query performance differences between a compound sort key and an interleaved sort key under different filtering conditions. In this example, the sort keys are specified for the lineorder_flat Beam table that contains 1 TB of data and is used for Star Schema Benchmark (SSB) performance testing. The sort key columns are LO_ORDERDATE and P_BRAND.

Filter condition

Query performance of compound sort keys (s)

Query performance of interleaved sort keys (s)

Point queries based on the first column

0.297

18.329

1% filtering rate based on the first column

1.268

19.224

10% filtering rate based on the first column

16.83

38.30

50% filtering rate based on the first column

65.62

76.99

Point queries based on the first column + point queries based on the second column

0.288

5.29

1% filtering rate based on the first column + 1% filtering rate based on the second column

7.36

6.46

10% filtering rate based on the first column + 10% filtering rate based on the second column

91.73

26.70

50% filtering rate based on the first column + 50% filtering rate based on the second column

376.22

87.82

50% filtering rate based on the first column + point queries based on the second column

71.83

19.16

10% filtering rate based on the first column + 1% filtering rate based on the second column

82.50

18.95

1% filtering rate based on the first column + 10% filtering rate based on the second column

7.98

6.43

Point queries based on the first column + 50% filtering rate based on the second column

0.50

31.48

Point queries based on the second column

87.04

19.67

1% filtering rate based on the second column

515.08

78.90

10% filtering rate based on the second column

567.85

131.39

50% filtering rate based on the second column

588.86

134.36

Important

The preceding query result displays only the relative performance differences between the two types of sort keys. The result does not represent the optimal performance of AnalyticDB for PostgreSQL for the SSB dataset.

Test conclusion: If queries are performed only based on the first sort key column or the first sort key column at a high filtering rate, the compound sort key is more efficient. If queries are performed based on the second sort key column or multiple sort key columns, the interleaved sort key is more efficient.

References

Beam sorting optimization is suitable only for AnalyticDB for PostgreSQL V7.0 instances. For information about how to use sorting optimization for AnalyticDB for PostgreSQL V6.0 instances, see Sorting optimization.