All Products
Search
Document Center

Hologres:Optimize the performance of queries on Hologres internal tables

Last Updated:Nov 17, 2023

This topic describes best practices for optimizing the performance of queries on Hologres internal tables.

Update statistics

Statistics determine whether a suitable execution plan can be generated. Hologres collects the following statistics for analysis on sample data that you want to query: data distribution and characteristics, table statistics, column statistics, number of rows, number of columns, column length, cardinality, frequency, maximum value, minimum value, long key value, and bucket distribution and characteristics. These statistics can help update the execution cost information that is estimated by the query optimizer and narrow down the query range. These statistics also help estimate the optimal join order, the memory overhead, and the degree of parallelism (DOP). This way, the query optimizer can generate an optimal execution plan. For more information about the statistics, see Using EXPLAIN.

Statistics collection has the following drawback: If statistics are collected manually or periodically instead of in real time, you cannot obtain the most accurate data characteristics from the statistics. You must execute the EXPLAIN statement and check whether the statistics in the execution result of the EXPLAIN statement are correct. In the statistics returned by the EXPLAIN statement, the rows parameter indicates the number of rows returned by an operator, and the width parameter indicates the average length of columns returned by an operator.

Check whether statistics are correct

  • By viewing execution plans

    If the statistics are outdated, a poor execution plan may be generated.

    In this example, the tmp1 table contains 10 million rows of data. The tmp table contains 1,000 rows of data. In Hologres, the default value of the rows parameter is 1000. The following execution plan returned by the EXPLAIN statement indicates that the number of rows in the tmp1 table is 1,000, which is incorrect. This indicates that the statistics are not updated.

    Seq Scan on tmp1 (cost=0.00..5.01 rows=1000 width=1)
    示例
  • By viewing system tables

    You can check whether the statistics such as the number of rows and the column length are correct by viewing system tables.

    • Query the settings of the analyze_tuple column in the system table named hologres.hg_table_properties to check whether the number of rows is correct. You can also view the value of the rows parameter in the Scan node.

    • Query the system table named hologres.hg_stats and obtain its information, such as the histogram of each column, average column length, and number of distinct values, as shown in the following figure.查看系统表示例

Update statistics

If the tmp1 and tmp tables are joined, the following result of the EXPLAIN statement is expected: The tmp1 table is processed before the tmp table, and the tmp table is used in a HASH JOIN operation. This is because the data volume of the tmp1 table is larger than that of the tmp table. However, the statistics for the tmp1 table are not updated. As a result, the tmp1 table is used in a HASH JOIN operation. This leads to low efficiency or an out of memory (OOM) exception. Therefore, you must execute the ANALYZE statement to collect statistics for both tables that are involved in the JOIN operation.

analyze tmp;
analyze tmp1;

After you execute the ANALYZE statement, the tables that are involved in the JOIN operation are processed in the correct sequence. The following figure shows the execution result of the EXPLAIN statement after the ANALYZE statement is executed. The tmp1 table is processed before the tmp table. The tmp table is used to create a hash table. The number of rows in the tmp1 table is 10 million. This indicates that the statistics are updated.顺序

If rows=1000 is displayed for the tmp1 table in the execution result of the EXPLAIN statement, the statistics are not synchronized to the query optimizer. Low query performance is usually caused by a lack of statistics. You can execute the ANALYZE statement to collect statistics and synchronize the statistics to the query optimizer in time.

Scenarios for updating statistics

We recommend that you execute the ANALYZE statement in the following situations:

  • Data is imported.

  • A large number of INSERT, UPDATE, and DELETE operations are performed.

  • Statistics need to be collected for both internal and foreign tables.

  • Statistics need to be collected for a parent table.

  • If you encounter one of the following issues, you must execute the ANALYZE statement before you run an import task. This can improve efficiency.

    • An OOM exception occurs when multiple tables are joined.

      In most cases, the "Query executor exceeded total memory limitation xxxxx: yyyy bytes used" error message is reported.

    • Import efficiency is low.

      When you query or import data in Hologres, the efficiency of the system decreases and tasks take longer than expected to complete.

Set an appropriate number of shards

The number of shards affects the DOP of queries. This setting is critical to query performance. If only a few shards are available, the DOP is low. Excessive shards may lead to increased query startup overheads and decreased query efficiency. In addition, a large number of small files occupy much metadata management space in the memory. You can improve query efficiency and reduce memory overheads by setting an appropriate number of shards based on the instance types.

The number of shards is pre-allocated based on the type of a Hologres instance. The default number of shards for an instance is approximately the number of CPU cores that are used for core queries in the instance. The number of CPU cores that are used for core queries is smaller than the number of purchased CPU cores. The purchased CPU cores are allocated to different nodes, such as query nodes, access nodes, control nodes, and scheduling nodes. For more information about the default number of shards for different instance types, see Instance types. After an instance is scaled up, the number of shards is not automatically changed for a database created before the instance is scaled up. You must change the number of shards based on your business requirements. For a database created after the instance is scaled up, the number of shards is determined based on the new instance type. The default number of shards is determined by considering the requirements after an instance is scaled up. We recommend that you change the number of shards only if the resources are increased by more than five times. If the resources are increased by less than five times, you can still enjoy high query efficiency without the need to change the number of shards. For more information, see Best practices for specifying table groups.

You must change the number of shards in the following scenarios:

  • After a Hologres instance is scaled up, the business continues to grow and the number of shards cannot keep up with the demand. You need to improve query efficiency. In this case, you must create a table group and specify a larger number of shards for the table group. The table and data are still in the original table group. You need to import them to the new table group for resharding.

  • After a Hologres instance is scaled up, new business needs to be launched and the original business is not affected. In this case, you can create a table group and specify an appropriate number of shards for the table group without modifying the schema of the original table.

Note

You can create multiple table groups in a database. Make sure that the total number of shards for all table groups does not exceed the default number of shards recommended for the instance. This way, you can make the most out of CPU resources.

Optimization for a JOIN scenario

If two or more tables need to be joined, the following optimization methods can be used to improve JOIN performance:

Update statistics

If you do not update the statistics about tables that are involved in the JOIN operation in a timely manner, a hash operation may be performed on a table with a large amount of data. As a result, the JOIN efficiency becomes low. In this case, you can update the statistics about the tables to improve SQL performance.

analyze <tablename>;

Select an appropriate distribution key

A distribution key is used to divide data into multiple shards evenly. This can help prevent data skew. If you specify the join keys of multiple tables to constitute a distribution key and perform a local JOIN operation on the tables based on the distribution key, queries can be accelerated. When you create a table, you can select an appropriate distribution key based on the following rules:

  • Suggestions on distribution key settings

    • Specify join keys to constitute a distribution key.

    • Select columns that are frequently used in the GROUP BY clause to constitute a distribution key.

    • Select columns in which data is evenly distributed and discrete to constitute a distribution key.

    • For more information about the principles and usage of a distribution key, see Distribution key.

  • Sample scenario of setting an appropriate distribution key

    In this example, the execution result of the EXPLAIN statement indicates that the execution plan for querying the tmp and tmp1 tables contains a redistribution operation but no local JOIN operation. As a result, query efficiency is low. You must re-create the two tables and specify the join keys to constitute the distribution key. This prevents additional overheads caused by data redistribution when multiple tables are joined. motionYou can execute the following DDL statements to re-create the two tables and specify the join keys to constitute the distribution key:

    begin;
    create table tmp(a int, b int, c int);
    call set_table_property('tmp', 'distribution_key', 'a');
    commit;
    
    begin;
    create table tmp1(a int, b int, c int);
    call set_table_property('tmp1', 'distribution_key', 'b');
    commit;
    
    -- Specify the join keys to constitute the distribution key. 
    select count(1) from tmp join tmp1 on tmp.a = tmp1.b ;

    After you reset the distribution key for the two tables, execute the EXPLAIN statement again. The execution result of the statement indicates that data is evenly distributed to shards by hash key. In addition, the Redistribute Motion operator, as framed in red in the preceding figure, is optimized. This indicates that data is not redistributed and additional network overheads are prevented.设置DK

Use the Runtime Filter feature

Hologres V2.0 and later support the Runtime Filter feature. This feature is suitable for scenarios where two or more tables are joined for queries, especially for scenarios where large tables are joined with small tables. After the Runtime Filter feature is enabled, the optimizer and execution engine automatically optimize the filter operation during queries, without the need for manual configurations. After the optimization, the I/O overhead is reduced and the performance of join queries is improved because a smaller amount of data is scanned. For more information, see Runtime Filter.

Select a join order algorithm for the query optimizer

  • If complex SQL join operations are involved or many tables are joined, the query optimizer takes an extended period of time to select the optimal connection. If the join order policy is adjusted, the amount of time required to optimize queries may be reduced. The following syntax is used to select a join order algorithm for the query optimizer:

    set optimizer_join_order = '<value>'; 
  • Parameter description

    Parameter

    Description

    value

    The join order algorithm to be used by the query optimizer. Valid values:

    • exhaustive (default value): uses a dynamic planning algorithm to obtain the optimal join order. This generates the optimal execution plan. However, this algorithm has the highest query optimizer overheads.

    • query: uses the join order that is specified in the SQL statement. This algorithm has the lowest query optimizer overheads.

    • greedy: uses a greedy algorithm to obtain a better join order. This algorithm has moderate query optimizer overheads.

  • Usage notes

    You can use the exhaustive algorithm to obtain the globally optimal join order. The amount of time required for optimization depends on the number of tables that you want to join. For example, if more than 10 tables are to be joined, more time may be required to obtain the optimal join order. If you use the query or greedy algorithm to obtain a join order, the required time can be reduced, but the generated execution plan is not optimal.

Optimize Motion operators

The following table describes the four types of Motion operators in Hologres.

Item

Description

Redistribute Motion

Data is hashed or randomly distributed to one or more shards.

Broadcast Motion

Data is copied to all shards.

The Broadcast Motion operator is superior only in scenarios in which both the number of shards and the number of broadcast tables are small.

Gather Motion

Data is summarized into one shard.

Forward Motion

Data is transmitted between an external data source or execution engine and a Hologres execution engine. This operator is used for federated queries.

When you view the execution result of the EXPLAIN statement, take note of the following items:

  • If the Motion operators are time-consuming, you can redesign the distribution key.

  • If a Gather Motion or Broadcast Motion operator is generated due to invalid statistics, you can execute the ANALYZE <tablename> statement to help generate the more efficient Redistribute Motion operator.

  • You can execute the following statements to prohibit the generation of Motion operators and then compare the query time of different Motion operators:

    -- Prohibit the generation of the Broadcast Motion operator. 
    set optimizer_enable_motion_broadcast = off; 
    
    -- Prohibit the generation of the Redistribute Motion operator. 
    set optimizer_enable_motion_redistribute = off; 
  • The Broadcast Motion operator is superior only in scenarios in which both the number of shards and the number of broadcast tables are small. If the Broadcast Motion operator is to be performed on a small table, we recommend that you reduce the number of shards in the table to improve query efficiency. Try to keep the shard count proportional to the number of workers. For more information about shard counts, see User guide of table groups and shard counts.

Disable dictionary encoding

For STRING-based queries, including TEXT-, CHAR-, and VARCHAR-based queries, dictionary encoding or decoding reduces the amount of time consumed to compare strings, but causes a large amount of decoding or encoding overheads.

By default, Hologres enables dictionary encoding for all columns of the STRING type. You can leave the dictionary_encoding_columns property empty or disable dictionary encoding for specific columns. If you modify dictionary encoding settings, data files are re-encoded for storage, which consumes CPU and memory resources. We recommend that you modify dictionary encoding settings during off-peak hours.

If the Decode operator is time-consuming, disable the Decode operator. Disabling dictionary encoding can improve performance.

If a table contains a large number of columns of the STRING type, you can determine whether to enable dictionary encoding for the columns based on your business requirements. For example, you can execute the following statements:

begin;
create table tbl (a int not null, b text not null, c int not null, d int);
call set_table_property('tbl', 'dictionary_encoding_columns', '');
commit;

Common performance optimization methods

You can optimize SQL statements to improve query efficiency by using the following methods:

Optimize data writes and updates by using fixed plans

Fixed plans of Hologres are specific for high-throughput data write, update, and delete scenarios. Simplified execution paths help improve performance and throughput by multiple times. For more information about fixed plans, see Accelerate the execution of SQL statements by using fixed plans.

Change PostgreSQL Query Engine (PQE) operators

Hologres supports multiple execution engines at the underlying layer, such as Hologres Query Engine (HQE) and PQE. If an SQL statement contains an operator that is not supported by HQE, Hologres sends the operator to PQE for execution. In this case, query performance is not optimal. You must optimize the SQL statement.

You can execute the EXPLAIN statement to check the execution plan of an SQL statement. If the execution plan contains External SQL(Postgres), the SQL statement is executed in PQE.

Example: HQE does not support the NOT IN operation. In this case, Hologres uses PQE to perform the NOT IN operation. We recommend that you change NOT IN to NOT EXISTS. Before optimization, execute the following statement to check the execution plan of the SQL statement:

explain select * from tmp where a not in (select a from tmp1);

The External operator indicates that the SQL statement is executed in PQE.post

After optimization, execute the following statement to check the execution plan of the SQL statement. The execution plan indicates that the SQL statement is not executed in PQE.

explain select * from tmp where not exists (select a from tmp1 where a = tmp.a);
优化后的SQL

You can change functions to send operators to HQE. The following table describes the recommended functions that you can use. Each version of Hologres continuously iterates PQE functions to send more functions to HQE. If a function is already supported by HQE, you can upgrade the version of the function. For more information, see Function release notes.

Function not supported by HQE

Recommended function

Example

Remarks

not in

not exists

select * from tmp where not exists (select a from tmp1 where a = tmp.a);

N/A.

regexp_split_to_table(string text, pattern text)

unnest(string_to_array)

select name,unnest(string_to_array(age,',')) from demo;

The regexp_split_to_table function supports regular expressions.

In Hologres V2.0.4 and later, HQE supports regexp_split_to_table. You must execute the following statement to set the Grand Unified Configuration (GUC) parameter hg_experimental_enable_hqe_table_function to on: set hg_experimental_enable_hqe_table_function = on;

substring

extract(hour from to_timestamp(c1, 'YYYYMMDD HH24:MI:SS'))

select cast(substring(c1, 13, 2) as int) AS hour from t2;

Change it to:

select extract(hour from to_timestamp(c1, 'YYYYMMDD HH24:MI:SS')) from t2;

Some of the Hologres V0.10.x versions and all the earlier versions do not support the substring function. In Hologres V1.3 and later, HQE supports non-regular expressions as input parameters of the substring function.

regexp_replace

replace

select regexp_replace(c1::text,'-','0') from t2;

Change it to:

select replace(c1::text,'-','') from t2;

The replace function does not support regular expressions.

at time zone 'utc'

Delete the at time zone 'utc' function.

select date_trunc('day',to_timestamp(c1, 'YYYYMMDD HH24:MI:SS')  at time zone 'utc') from t2

Change it to:

select date_trunc('day',to_timestamp(c1, 'YYYYMMDD HH24:MI:SS') ) from t2;

N/A.

cast(text as timestamp)

to_timestamp

select cast(c1 as timestamp) from t2;

Change it to:

select to_timestamp(c1, 'yyyyMMdd hh24:mi:ss') from t2;

Hologres V2.0 and later support HQE.

timestamp::text

to_char

select c1::text from t2;

Change it to:

select to_char(c1, 'yyyyMMdd hh24:mi:ss') from t2;

Hologres V2.0 and later support HQE.

Do not perform fuzzy searches

No index is created in a fuzzy search operation such as LIKE.

Disable caching of query results

By default, Hologres caches query or subquery results. Cached results are returned for repeated queries. You can execute the following statement to disable the caching of query results to prevent the impact on query performance:

set hg_experimental_enable_result_cache = off;

Reduce OOM exceptions

If the memory used for computing on an instance is insufficient, an OOM exception may occur. The following example shows a common OOM exception. OOM exceptions can be caused by multiple reasons, such as complex computing and high concurrency. You can perform optimization accordingly to reduce OOM exceptions. For more information, see FAQ about OOM.

Total memory used by all existing queries exceeded memory limitation. 
memory usage for existing queries=(2031xxxx,184yy)(2021yyyy,85yy)(1021121xxxx,6yy)(2021xxx,18yy)(202xxxx,14yy); Used/Limit: xy1/xy2 quota/sum_quota: zz/100

Optimize the ORDER BY LIMIT operation

In versions earlier than Hologres V1.3, the Merge Sort operator is not supported in the ORDER BY LIMIT operation. In the execution plan of a statement that contains the ORDER BY LIMIT operation, the final results are sorted before the results are exported. This results in relatively low query performance. In Hologres V1.3 and later, the ORDER BY LIMIT operation is optimized to support the Merge Sort operator. The optimization merges and sorts multiple data streams without additional sorting. This improves query performance.

Examples:

  • DDL statements

  • begin;
    create table test_use_sort_1
    (
              uuid           text not null,
              gpackagename   text not null,
              recv_timestamp text not null
    );
    call set_table_property('test_use_sort_1', 'orientation', 'column');
    call set_table_property('test_use_sort_1', 'distribution_key', 'uuid');
    call set_table_property('test_use_sort_1', 'clustering_key', 'uuid:asc,gpackagename:asc,recv_timestamp:desc');
    commit;
    
    --Insert data into the table.
    insert into test_use_sort_1 select i::text, i::text, '20210814' from generate_series(1, 10000) as s(i);
    
    -- Update statistics.
    analyze test_use_sort_1;
  • Query statement

  • set hg_experimental_enable_reserve_gather_exchange_order =on 
    set hg_experimental_enable_reserve_gather_motion_order =on
    select uuid from test_use_sort_1 order by uuid limit 5;
  • Comparison of execution plans

    • The following figure shows the execution plan in Hologres V1.1. The execution plan in other Hologres versions earlier than V1.3 is similar to that in Hologres V1.1.执行计划1.1

    • The following figure shows the execution plan in Hologres V1.3.执行计划1.3

  • The comparison of the execution plans indicates that the execution plan in Hologres V1.3 contains one less sort operation before the results are exported. Multiple data streams are merged and sorted. This improves query performance.

Optimize the COUNT DISTINCT function

  • Change COUNT DISTINCT to APPROX_COUNT_DISTINCT.

    The COUNT DISTINCT function removes duplicate data precisely. The records with the same key must be distributed to the same node for deduplication by using the function. This is resource-consuming. Hologres supports the extended function APPROX_COUNT_DISTINCT. The HyperLogLog algorithm is used to approximate the number of distinct values to perform imprecise COUNT DISTINCT computing. This improves query performance. The average error rate can be controlled between 0.1% and 1%. You can change COUNT DISTINCT to APPROX_COUNT_DISTINCT based on your business requirements. For more information, see APPROX_COUNT_DISTINCT.

  • Use the UNIQ function

    In Hologres V1.3 and later, you can use the UNIQ function to precisely remove duplicate data. If the cardinality of the GROUP BY key is high, the UNIQ function performs better and saves more memory than the COUNT DISTINCT function. If an OOM exception occurs when you use the COUNT DISTINCT function, you can use the UNIQ function instead. For more information, see UNIQ.

  • Select an appropriate distribution key

    If multiple COUNT DISTINCT functions with the same key exist, and data is evenly distributed and discrete, we recommend that you specify the key of the COUNT DISTINCT functions as the distribution key. This way, the same data can be distributed to the same shard. This prevents data shuffling.

  • Optimize the COUNT DISTINCT function

    In Hologres V2.1 and later, Hologres optimizes the performance for COUNT DISTINCT scenarios, including a single COUNT DISTINCT function, multiple COUNT DISTINCT functions, data skew, and SQL statements that do not contain the GROUP BY clause. To improve query performance, you do not need to manually change the COUNT DISTINCT function to the UNIQ function. If you want to improve the performance of the COUNT DISTINCT function, we recommend that you upgrade the version of your Hologres instance to V2.1 or later.

Optimize the GROUP BY operation

The GROUP BY Key operation allows data to be redistributed based on the key of the column in the GROUP BY clause. If the GROUP BY operation is time-consuming, you can use the column in the GROUP BY clause as the distribution key.

-- If data is distributed based on the values in Column a, data redistribution during runtime is reduced. The parallel computing capability of shards is fully utilized. 
select a, count(1) from t1 group by a; 

Resolve the issue of data skew

Queries slow down if data is unevenly distributed among multiple shards. You can execute the following statement to check whether data is evenly distributed: For more information, see Query the shard allocation among workers.

-- The hg_shard_id column is the built-in hidden column of each table. This column provides the ID of the shard where each row of data is located.
select hg_shard_id, count(1) from t1 group by hg_shard_id;
  • If data is unevenly distributed, you must select a column in which data is evenly distributed and discrete as the distribution key.

    Note

    To change the distribution key, you must create another table and import data.

  • If data skew is irrelevant to a distribution key, we recommend that you optimize the data from a business perspective to prevent data skew.

Optimize WITH queries (beta)

Hologres is compatible with PostgreSQL and supports common table expressions (CTEs). CTEs are commonly used in recursive queries in a WITH clause. The implementation is the same as that in PostgreSQL, in which CTEs are inlined. Therefore, repeated computing occurs when a CTE is used multiple times. In Hologres V1.3, you can set the GUC parameter optimizer_cte_inlining to off to enable the CTE reuse feature. This way, a CTE can be computed only once and referenced multiple times to save computing resources and improve query performance. If the version of your Hologres instance is earlier than V1.3, upgrade your instance.

set optimizer_cte_inlining=off;
Note
  • The CTE reuse feature is still in public preview and is not enabled by default. By default, all CTEs are inlined and repeatedly computed. You can manually set the GUC parameter optimizer_cte_inlining to off to enable this feature.

  • The CTE reuse feature depends on shuffle spilling. Downstream users consume CTEs at different paces. Therefore, query performance is affected if a large amount of data is processed.

  • Example

    create table cte_reuse_test_t
    (
        a integer not null,
        b text,
        primary key (a)
    );
    
    insert into cte_reuse_test_t values(1, 'a'),(2, 'b'), (3, 'c'), (4, 'b'), (5, 'c'), (6, ''), (7, null);
    
    
    set optimizer_cte_inlining=off;
    
    explain with c as (select b, max(a) as a from cte_reuse_test_t group by b)
    select a1.a,a2.a,a1.b, a2.b
    from c a1, c a2
    where a1.b = a2.b
    order by a1.b
    limit 100;
                                        
  • Comparison of execution plans

    • The following figure shows the execution plan in Hologres V1.1. The execution plan in other Hologres versions earlier than V1.3 is similar to that in Hologres V1.1.执行计划_11with

    • The following figure shows the execution plan in Hologres V1.3.执行计划_13with

    The comparison of the execution plans indicates that the execution plan in a Hologres version earlier than V1.3 contains multiple HashAggregate operations. In the execution plan in Hologres V1.3, the results are computed only once and then reused. This improves query performance.

Optimize single-stage aggregation to multi-stage aggregation.

If aggregate operators are time-consuming, check whether local shard-level pre-aggregation is performed.

You can perform aggregation operations on individual shards to decrease the volume of data that is involved in the final aggregation operation and improve query performance. The following aggregation operations are supported:

  • Three-stage aggregation: Aggregate data by file, aggregate data in each shard, and then summarize the aggregation results of all shards.三阶段聚合

  • Two-stage aggregation: Aggregate data in each shard and summarize the aggregation results of all shards.两阶段聚合

You can execute the following statement to forcibly perform multi-stage aggregation in Hologres:

set optimizer_force_multistage_agg = on;

Optimize table attributes when you create a table

Select a storage mode

Hologres supports row-oriented, column-oriented, and row-column hybrid storage modes. You can select a storage mode that best suits your business scenario. The following table describes the storage modes.

Storage mode

Scenario

Disadvantage

Row-oriented storage

  • Point query scenarios in which data is queried with high QPS performance based on primary keys.

  • Scenarios in which all columns of data are read at the same time and high performance of the UPDATE, DELETE, and INSERT operations is required.

The performance of operations such as large-scale data queries, full table scanning, and aggregation is low.

Column-oriented storage

Data analysis scenarios, such as multi-column data query by range, single-table aggregation, and multi-table join.

The performance of the UPDATE and DELETE operations in this mode is lower than that in the row-oriented storage mode. The performance of point queries in this mode is lower than that in the row-oriented storage mode in scenarios without specified indexes.

Row-column hybrid storage

Scenarios in which the row-oriented and column-oriented storage modes are applicable.

The memory overheads are high.

Select a data type

Hologres supports a variety of data types. You can select a data type that best meets your business requirements. The following rules apply:

  • Select a type that requires minimal storage space.

    • Preferentially use the INT type rather than the BIGINT type.

    • Preferentially use the DECIMAL or NUMERIC type with minimal PRECISION or SCALE. To prevent statistical errors, do not use the FLOAT or DOUBLE PRECISION type.

    • Do not use the FLOAT or DOUBLE type for columns in the GROUP BY clause.

    • Preferentially use the TEXT type rather than the VARCHAR(N) and CHAR(N) types. Set the value of N as small as possible.

    • Use the TIMESTAMPTZ and DATE types rather than the TEXT type to represent a date.

  • Use consistent data types for JOIN operations.

    When you join tables, make sure that the columns that match each other use the same data type. This prevents additional overheads caused if Hologres implicitly converts the data type of columns to different data types.

  • Do not use the FLOAT or DOUBLE type in operations such as UNION and GROUP BY.

    Operations such as UNION and GROUP BY do not support the DOUBLE PRECISION or FLOAT type. You must use the DECIMAL type.

Select a primary key

Primary keys are used to uniquely identify data. They are applicable to data import scenarios with duplicate primary key values. You can specify the option parameter to select a deduplication mode when you import data. Valid values:

  • ignore: ignores new data.

  • update: uses new data to overwrite existing data.

Appropriate settings of primary keys can help the query optimizer generate appropriate execution plans in specific scenarios. For example, GROUP BY pk,a,b,c is executed based on an appropriate primary key to accelerate data queries.

In column-oriented storage mode, primary keys may degrade the performance of data writing. In most cases, the performance of data writing if no primary keys are configured is two times higher than that if primary keys are configured.

Select partition table

Hologres tables support only one level of partitioning. Appropriate partition settings accelerate queries. Inappropriate partition settings such as a large number of partitions cause excessively small files and degrade query performance.

Note

For data imported by day, we recommend that you create a partitioned table for the data for each day. The data is separately stored. Only the data of the current day is accessed.

You can configure partitions in the following scenarios:

  • When you delete all partitions of a child table, data in the partitions of other child tables is not affected. The DROP or TRUNCATE statement has higher performance than the DELETE statement.

  • For the queries of partition key columns in predicate conditions, you can find corresponding partitions based on the indexes of the partition key columns. You can also directly query child partitions.

  • A partitioned table is required for real-time data that is imported on a periodic basis. If new data is imported every day, you can use the date column as a partition key column and import data to a child partition every day. Sample statements:

  • begin;
    create table insert_partition(c1 bigint not null, c2 boolean, c3 float not null, c4 text, c5 timestamptz not null) partition by list(c4);
    call set_table_property('insert_partition', 'orientation', 'column');
    commit;
    create table insert_partition_child1 partition of insert_partition for values in('20190707');
    create table insert_partition_child2 partition of insert_partition for values in('20190708');
    create table insert_partition_child3 partition of insert_partition for values in('20190709');
    
    select * from insert_partition where c4 >= '20190708';
    select * from insert_partition_child3;

Select an index

Hologres uses various types of indexes to implement different features. You can select an index that best suits your business scenario to improve query performance. You must design the table schema before data is written based on your business scenario. The following table describes the supported indexes.

Index type

Index name

Description

Suggestion

Sample statement

clustering_key

Clustering key

The clustering key is used in files. Data in a file is sorted by this index.

For range queries, Hologres can use the data orderliness attributes of the clustering key to filter data.

Use a range or filter query column as a clustering key. Index-based data filtering complies with the leftmost prefix matching principle. We recommend that you specify no more than two columns to constitute a clustering key.

select sum(a) from tb1 where a > 100 and a < 200;

bitmap_columns

Bitmap index

The bitmap index is used in files. Bitmaps are created for data in a file based on this index.

For equivalent queries, Hologres can encode data for each row by value, and perform bit operations to find the rows where data resides. The time complexity is O(1).

Use an equivalent query column as a bitmap index.

select * from tb1 where a =100;

segment_key (also known as event_time_column)

Segment key

A file-level index. Data is written to files in Append Only mode. Then, small files are merged based on the index.

The segment key identifies the boundary of a file. You can use the segment key to find the required file.

The segment key is designed for ordered, range-specific data such as timestamps and dates. Therefore, the segment key is strongly correlated with the data write time.

Use this index to filter files. Then, use the bitmap index or clustered index for range or equivalent queries in files. Index-based data filtering complies with the leftmost prefix matching principle. We recommend that you specify only one column.

We recommend that you specify the first non-empty timestamp column as a segment key.

select sum(a) from tb1 where ts > '2020-01-01' and a < '2020-03-02';

Both the clustering key and segment key must comply with the leftmost prefix matching principle of traditional databases such as MySQL. This way, data is sorted based on the values in the leftmost column of an index. If the values in the leftmost column are ordered, use the second leftmost column to sort data. Sample statements:

call set_table_property('tmp', 'clustering_key', 'a,b,c');
select * from tmp where a > 1 ;  -- The clustering key can be used. 
select * from tmp where a > 1 and c > 2 ;   -- The clustering key can be used only for Column a. 
select * from tmp where a > 1 and b > 2 ;  -- The clustering key can be used for both Column a and Column b. 
select * from tmp where a > 1 and b > 2 and c > 3 ; -- The clustering key can be used for Column a, Column b, and Column c. 
select * from tmp where b > 1 and c > 2 ;   -- The clustering key cannot be used for Column b or Column c.

A bitmap index supports AND or OR queries among multiple columns. Sample statements:

call set_table_property('tmp', 'bitmap_columns', 'a,b,c');
select * from tmp where a = 1 and b = 2 ;  --The bitmap index can be used. 
select * from tmp where a = 1 or b = 2 ;  --The bitmap index can be used.
Note

You can specify a bitmap index after you create a table. A clustering key and a segment key are specified only when you create a table and cannot be specified after a table is created.

Check whether an index is used

Create a table named tmp and specify index columns by executing the following statements:

begin;
create table tmp(a int not null, b int not null, c int not null);
call set_table_property('tmp', 'clustering_key', 'a');
call set_table_property('tmp', 'segment_key', 'b');
call set_table_property('tmp', 'bitmap_columns', 'a,b,c');
commit;
  • To check whether the clustering key is used, execute the following statement:

    explain select * from tmp where a > 1;
    cluster
  • To check whether the bitmap index is used, execute the following statement:

    explain select * from tmp where c = 1;
    bitmap
  • To check whether the segment key is used, execute the following statement:

    explain select * from tmp where b > 1;
    segment