Guide on scenario-specific table creation and tuning - Hologres

In Hologres, you can configure a variety of table properties. Different table properties have different characteristics. This topic describes how to configure table properties based on query scenarios. Appropriate table properties help decrease the amount of data to be scanned, the number of files to be accessed, and the number of I/O operations in queries. This way, queries are faster, and the queries per second (QPS) is higher.

Determine the storage format of your table

Hologres supports row-oriented storage, column-oriented storage, and row-column hybrid storage. For more information about these storage formats, see Storage models of tables: row-oriented storage, column-oriented storage, and row-column hybrid storage.

The following figure shows how to determine the storage format of a table. If your business scenario is not fully defined, we recommend that you use row-column hybrid storage to take into account more possible scenarios. image..png

Configure the query properties of your table

After you determine the storage format of your table, you need to configure the table properties based on query scenarios.

Note

This section provides examples on how to configure table properties in different scenarios.
In the examples, a Hologres instance with 64 compute units (CUs) is used to verify the performance of queries on the Lineitem and Orders tables in the TPC-H 100 GB dataset.
The TPC-H dataset simulates data in retail scenarios. The dataset contains the following tables:
- Orders table: It is an order table, in which the o_orderkey field uniquely identifies an order.
- Lineitem table: It is an order details table, in which the o_orderkey and l_linenumber fields uniquely identify a product in an order.
The TPC-H performance tests described in this topic are implemented based on the TPC-H benchmarking test but cannot meet all requirements of the TPC-H benchmarking test. Therefore, the test results described in this topic cannot be compared with the published results of the TPC-H benchmark test.
If your table is used in multiple query scenarios with different filter conditions or JOIN fields, you must configure the optimal table properties based on the query frequency and performance requirements.

Scenario 1: Point queries with ultra-high QPS

Scenario description
You want to perform tens of thousands of point queries per second on the Orders table in the TPC-H dataset. You can specify the o_orderkey field to locate a row of data. Sample SQL statement:
```
SELECT * FROM orders WHERE o_orderkey = ?;
```
Configuration suggestion
We recommend that you configure the filter field as the primary key. Hologres provides fixed plans for you to accelerate queries based on the primary key. This significantly improves the execution efficiency. For more information, see Accelerate the execution of SQL statements by using fixed plans.
Performance verification
Define the Orders table as a row-oriented table or a row-column hybrid table. Verify the query performance when the o_orderkey field is configured as the primary key and when no primary key is configured. For more information about the table creation statement, see DDL statements in scenario 1.
Verification result:
- Primary key configured: About 500 queries are triggered at a time, the average QPS is about 104,000, and the average latency is about 4 ms.
- No primary key configured: About 500 queries are triggered at a time, the average QPS is about 16,000, and the average latency is about 30 ms.

Scenario 2: Prefix scans on a small amount of data with high QPS

Scenario description
You want to perform queries in scenarios with the following requirements:
- Multiple fields of the table that you want to query constitute the primary key.
- Tens of thousands of QPS need to be processed. Point queries are performed based on a field in the primary key, and the returned dataset contains several or dozens of data entries.
For example, you want to query all products in an order specified by the l_orderkey field with high QPS from the Lineitem table. The Lineitem table is an order details table in the TPC-H dataset. In this table, you can specify the l_orderkey and l_linenumber fields to identify a product in an order. Sample statement:
```
SELECT * FROM lineitem WHERE l_orderkey = ?;
```
Configuration suggestion
1. Configure the field that is used for equivalence filtering as the leftmost field of the primary key. In this example, set the primary key of the Lineitem table to (l_orderkey, l_linenumber) but not (l_linenumber, l_orderkey).
2. Configure the field that is used for equivalence filtering as the distribution key to ensure that all the data that needs to be scanned is stored in the same shard. This helps reduce the number of shards to be accessed and improve QPS. In this example, set the distribution key of the Lineitem table to l_orderkey.
3. Configure the field that is used for equivalence filtering as the clustering key if your table uses the column-oriented storage or row-column hybrid storage format. This ensures that data entries to be scanned are sorted in the file based on the field and helps reduce the number of I/O operations. In this example, set the clustering key of the Lineitem table to l_orderkey.
With the preceding configurations, only a prefix scan on a single shard is required to query data. Hologres provides fixed plans for you to accelerate prefix scans. If you want to enable this feature, set the Grand Unified Configuration (GUC) parameter hg_experimental_enable_fixed_dispatcher_for_scan to on. For more information, see Accelerate the execution of SQL statements by using fixed plans.
Performance verification
Define the Lineitem table as a row-oriented table or a row-column hybrid table. Verify the query performance when the preceding configurations are used and when the preceding configurations are not used. For more information about the table creation statement, see DDL statements in scenario 2.
Verification result:
- Preceding configurations used: About 500 queries are triggered at a time, the average QPS is about 37,000, and the average latency is about 13 ms.
- Preceding configurations not used: Only 1 query is triggered at a time, the average QPS is about 60, and the average latency is about 16 ms.

Scenario 3: Queries with time-related filter conditions

Scenario description

You want to perform queries with time-related filter conditions. In the example, you want to query data in the Lineitem table of the TPC-H dataset based on the filter condition specified by the l_shipdate field. For more information about the query statements, see Q1. Sample SQL statements:

-- Original query statement
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '120' day
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

-- Modified query statement
SELECT ... FROM lineitem
WHERE
    l_year='1992' AND -- The time-related filter condition needs to be added only to the desired partitioned table.
    l_shipdate <= date '1992-12-01' -- The time range is narrowed down to better show the effect.
...;

Configuration suggestion
1. Partition the Lineitem table based on time. In this example, add the l_year column to the Lineitem table and configure the l_year field as the partition key to partition the Lineitem table by year. Determine whether to partition your table or only configure the event_time_column property based on your data volume and business requirements. For more information about limits and usage notes on partitioned tables, see CREATE PARTITION TABLE.
2. Configure the time filter field as the event_time_column property to ensure that data entries in files in the shard are sorted in order based on the event_time_column property. This reduces the number of files to be scanned. In this example, configure the l_shipdate field as the event_time_column property for the Lineitem table. For more information about the principle and usage of the event_time_column property, see Event Time Column (Segment Key).
Performance verification
Define the Lineitem table as a column-oriented table. Verify the query performance when partitions and the event_time_column property are configured as suggested and when no partitions are configured and another field is configured as the event_time_column property. For more information about table creation statements, see DDL statements in scenario 3.
Verification result:
- Partitions and event_time_column configured as suggested: 80 files in 1 partition are scanned.
- No partition configured and event_time_column set to another field: Data is not filtered by partition and 320 files are scanned.
Note
You can execute the EXPLAIN ANALYZE statement to view the number of scanned partitions specified by the Partitions selected parameter and the number of scanned files specified by the dop parameter.

Scenario 4: Queries with non-time-related single-value filter conditions

Scenario description
You want to perform queries with non-time-related single-value filter conditions. In the example, you want to query data in the Lineitem table in the TPC-H dataset based on the non-time-related single-value filter condition specified by the l_shipmode field for data aggregation. For more information about the query statements, see Q1. Sample SQL statement:
```
SELECT
    ...
FROM
    lineitem
WHERE
    l_shipmode IN ('FOB', 'AIR');
```
Configuration suggestion
1. Configure the single-value field as the clustering key to ensure that data entries with the same values are consecutively sorted in the file. This helps reduce I/O operations. In this example, configure the l_shipmode field of the Lineitem table as the clustering key.
2. Build a bitmap index for the single-value field to accelerate the process to locate desired data. In this example, configure the l_shipmode field of the Lineitem table as the bitmap column property.
Performance verification
Define the Lineitem table as a column-oriented table. Verify the query performance when table properties are configured as suggested and when the l_shipmode field is not configured as the clustering key or bitmap column property. For more information about table creation statements, see DDL statements in scenario 4.
Verification result:
- Table properties configured as suggested: 170 million rows of data are read, and the query duration is 0.71s.
- l_shipmode not configured as the clustering key or bitmap column property: The full table that contains 600 million rows of data is read, and the query duration is 2.41s.
  Note
  You can check the number of rows of read data based on the read_rows parameter in slow query logs. For more information, see Query and analyze slow query logs.
  You can use the execution plan to check whether data is filtered based on the bitmap index. If the execution plan contains the Bitmap Filter keywords, data is filtered based on the bitmap index.

Scenario 5: Aggregate queries based on a field

Scenario description
You want to perform aggregate queries based on a field. In the example, you want to perform an aggregate query on the Lineitem table in the TPC-H dataset based on the l_suppkey field. Sample SQL statement:
```
SELECT
    l_suppkey,
    sum(l_extendedprice * (1 - l_discount))
FROM
    lineitem
GROUP BY
    l_suppkey;
```
Configuration suggestion
Configure the aggregate field as the distribution key to prevent a large amount of data from being shuffled across shards.
Performance verification
Define the Lineitem table as a column-oriented table. Verify the query performance when the aggregate field l_suppkey is configured as the distribution key and when another field is configured as the distribution key. For more information about table creation statements, see DDL statements in scenario 5.
Verification result:
- Aggregate field configured as the distribution key: About 0.21 GB of data is shuffled, and the execution duration is 2.30s.
- Another field configured as the distribution key: About 8.16 GB of data is shuffled, and the execution duration is 3.68s.
Note
You can check the amount of shuffled data based on the shuffle_bytes parameter in slow query logs. For more information, see Query and analyze slow query logs.

Scenario 6: Join queries on multiple tables

Scenario description

You want to perform join queries on multiple tables. In the example, you want to perform join queries on the Lineitem and Orders tables in the TPC-H dataset. For more information about the query statement, see Q4. Sample SQL statement:

SELECT
    o_orderpriority,
    count(*) AS order_count
FROM
    orders
WHERE
    o_orderdate >= date '1996-07-01'
    AND o_orderdate < date '1996-07-01' + interval '3' month
    AND EXISTS (
        SELECT
            *
        FROM
            lineitem
        WHERE
            l_orderkey=o_orderkey -- Join query
            AND l_commitdate < l_receiptdate)
GROUP BY
    o_orderpriority
ORDER BY
    o_orderpriority;

Configuration suggestion
Configure the fields based on which join operations are performed as distribution keys to prevent a large amount of data from being shuffled across shards.
Performance verification
Define the Lineitem and Orders tables as column-oriented tables. Verify the query performance when the l_orderkey and o_orderkey fields that are used for join operations are configured as distribution keys and when other fields are configured as the distribution keys. For example, the l_linenumber field is configured as the distribution key of the Lineitem table, and no distribution key is configured for the Orders table. For more information about table creation statements, see DDL statements in scenario 6.
Verification result:
- Distribution keys configured as suggested for the tables: About 0.45 GB of data is shuffled, and the execution duration is 2.19s.
- Distribution keys not configured as suggested for the tables: About 6.31 GB of data is shuffled, and the execution duration is 5.55s.
Note
You can check the amount of shuffled data based on the shuffle_bytes parameter in slow query logs. For more information, see Query and analyze slow query logs.

(Optional) Specify the table group to which your table belongs

If your Hologres instance contains more than 256 cores, and the instance is used in a variety of business scenarios, we recommend that you configure multiple table groups and specify the table group to which a table belongs when you create the table. For more information, see Best practices for specifying table groups.

Appendix: Table creation statements

DDL statements in scenario 1:

-- Data definition language (DDL) statement for creating a table with a primary key. If you want to create a table without configuring the primary key, delete PRIMARY KEY that is specified for the O_ORDERKEY field. 
DROP TABLE IF EXISTS orders;
BEGIN;
CREATE TABLE orders(
    O_ORDERKEY       BIGINT         NOT NULL PRIMARY KEY
    ,O_CUSTKEY       INT            NOT NULL
    ,O_ORDERSTATUS   TEXT           NOT NULL
    ,O_TOTALPRICE    DECIMAL(15,2)  NOT NULL
    ,O_ORDERDATE     TIMESTAMPTZ    NOT NULL
    ,O_ORDERPRIORITY TEXT           NOT NULL
    ,O_CLERK         TEXT           NOT NULL
    ,O_SHIPPRIORITY  INT            NOT NULL
    ,O_COMMENT       TEXT           NOT NULL
);
CALL SET_TABLE_PROPERTY('orders', 'orientation', 'row');
CALL SET_TABLE_PROPERTY('orders', 'clustering_key', 'o_orderkey');
CALL SET_TABLE_PROPERTY('orders', 'distribution_key', 'o_orderkey');
COMMIT;

DDL statements in scenario 2:

-- Create a table named Lineitem and configure required properties for the table. 
DROP TABLE IF EXISTS lineitem;
BEGIN;
CREATE TABLE lineitem
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER)
);
CALL set_table_property('lineitem', 'orientation', 'row');
-- CALL set_table_property('lineitem', 'clustering_key', 'L_ORDERKEY,L_SHIPDATE');
CALL set_table_property('lineitem', 'distribution_key', 'L_ORDERKEY');
COMMIT;

DDL statements in scenario 3:

-- Create a partitioned table named Lineitem. The statement that is used to create a non-partitioned table is the same as that used in scenario 2. 
DROP TABLE IF EXISTS lineitem;
BEGIN;
CREATE TABLE lineitem
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    L_YEAR          TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER,L_YEAR)
)
PARTITION BY LIST (L_YEAR);
CALL set_table_property('lineitem', 'clustering_key', 'L_ORDERKEY,L_SHIPDATE');
CALL set_table_property('lineitem', 'segment_key', 'L_SHIPDATE');
CALL set_table_property('lineitem', 'distribution_key', 'L_ORDERKEY');
COMMIT;

DDL statements in scenario 4:

-- Create a table named Lineitem and configure table properties not as suggested. In comparison scenarios, configure the clustering key and bitmap column property as suggested. 
DROP TABLE IF EXISTS lineitem;
BEGIN;
CREATE TABLE lineitem
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER)
);
CALL set_table_property('lineitem', 'segment_key', 'L_SHIPDATE');
CALL set_table_property('lineitem', 'distribution_key', 'L_ORDERKEY');
CALL set_table_property('lineitem', 'bitmap_columns', 'l_orderkey,l_partkey,l_suppkey,l_linenumber,l_returnflag,l_linestatus,l_shipinstruct,l_comment');
COMMIT;

DDL statements in scenario 5:

-- Create a table named Lineitem and configure the field that you want to specify in the Group By clause as the distribution key. 
DROP TABLE IF EXISTS lineitem;
BEGIN;
CREATE TABLE lineitem
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER,L_SUPPKEY)
);
CALL set_table_property('lineitem', 'segment_key', 'L_COMMITDATE');
CALL set_table_property('lineitem', 'clustering_key', 'L_ORDERKEY,L_SHIPDATE');
CALL set_table_property('lineitem', 'distribution_key', 'L_SUPPKEY');
COMMIT;

DDL statements in scenario 6:

DROP TABLE IF EXISTS LINEITEM;

BEGIN;
CREATE TABLE LINEITEM
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER)
);
CALL set_table_property('LINEITEM', 'clustering_key', 'L_SHIPDATE,L_ORDERKEY');
CALL set_table_property('LINEITEM', 'segment_key', 'L_SHIPDATE');
CALL set_table_property('LINEITEM', 'distribution_key', 'L_ORDERKEY');
CALL set_table_property('LINEITEM', 'bitmap_columns', 'L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER,L_RETURNFLAG,L_LINESTATUS,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT');
CALL set_table_property('LINEITEM', 'dictionary_encoding_columns', 'L_RETURNFLAG,L_LINESTATUS,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT');
COMMIT;

DROP TABLE IF EXISTS ORDERS;

BEGIN;
CREATE TABLE ORDERS
(
    O_ORDERKEY      BIGINT      NOT NULL PRIMARY KEY,
    O_CUSTKEY       INT         NOT NULL,
    O_ORDERSTATUS   TEXT        NOT NULL,
    O_TOTALPRICE    DECIMAL(15,2) NOT NULL,
    O_ORDERDATE     timestamptz NOT NULL,
    O_ORDERPRIORITY TEXT        NOT NULL,
    O_CLERK         TEXT        NOT NULL,
    O_SHIPPRIORITY  INT         NOT NULL,
    O_COMMENT       TEXT        NOT NULL
);
CALL set_table_property('ORDERS', 'segment_key', 'O_ORDERDATE');
CALL set_table_property('ORDERS', 'distribution_key', 'O_ORDERKEY');
CALL set_table_property('ORDERS', 'bitmap_columns', 'O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT');
CALL set_table_property('ORDERS', 'dictionary_encoding_columns', 'O_ORDERSTATUS,O_ORDERPRIORITY,O_CLERK,O_COMMENT');
COMMIT;