This topic describes best practices to optimize query performance on Hologres internal tables.

Update statistics

Statistics determine whether a suitable execution plan can be generated. Hologres samples the data that you query and collects statistics that describe its distribution and characteristics, including table statistics, column statistics, the number of rows, the number of columns, column widths, cardinalities, frequencies, maximum values, minimum values, long key values, and bucket distributions. These statistics help the optimizer update the estimated cost of operator execution, narrow down the query range, and estimate the optimal join order, memory overhead, and degree of parallelism (DOP). This way, the optimizer can generate an optimal execution plan. For more information about the statistics, see Using EXPLAIN.

Statistics collection has a drawback: statistics are collected manually or periodically rather than in real time, so they may not reflect the most accurate data characteristics. You must execute the EXPLAIN statement and check whether the statistics in its output are correct. In the output, the rows parameter indicates the estimated number of rows that an operator returns, and the width parameter indicates the estimated row width.

  • Check whether statistics are correct
    • By viewing execution plans

      If the statistics are outdated, a poor execution plan may be generated, as demonstrated in the following example:

      The tmp1 table contains 10 million rows of data. The tmp table contains 1,000 rows of data. In Hologres, the default value of the rows parameter is 1000 in the statistics. In the following execution result of the EXPLAIN statement, the number of rows in the tmp1 table is not the actual value. This indicates that the statistics are not up to date.
      Seq Scan on tmp1 (cost=0.00..5.01 rows=1000 width=1)
    • By viewing system tables
      You can check whether the statistics such as the number of rows and the column width are correct by viewing system tables.
      • Query the settings of the analyze_tuple column in the system table named hologres.hg_table_properties to check whether the number of rows is correct. Alternatively, you can view the value of the rows parameter in the Scan node.
      • Query the system table named hologres.hg_stats to obtain information about each column, such as its histogram, average width, and number of distinct values.
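        For example, the following hedged sketch queries the two system tables for the tmp1 table. The filter column names (table_name and tablename) are assumptions based on common usage; adjust them to the actual schema of your Hologres version:
        select analyze_tuple from hologres.hg_table_properties where table_name = 'tmp1';
        select * from hologres.hg_stats where tablename = 'tmp1';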
  • Update statistics
    When the tmp1 and tmp tables are joined, the following result of the EXPLAIN statement is expected: the tmp1 table is processed before the tmp table, and the tmp table is used to build the hash table for the HASH JOIN operation, because the data volume of the tmp1 table is larger than that of the tmp table. However, the statistics for the tmp1 table are not up to date, so the tmp1 table is used to build the hash table instead. This leads to low efficiency or even an out of memory (OOM) exception. Therefore, you must execute the ANALYZE statement to collect statistics for both tables that are involved in the JOIN operation.
    analyze tmp;
    analyze tmp1;
    After you execute the ANALYZE statement, the tables that are involved in the JOIN operation are processed in the correct sequence. The execution result of the EXPLAIN statement then shows that the tmp1 table is processed before the tmp table, that the tmp table is used to build the hash table, and that the number of rows in the tmp1 table is 10 million. This indicates that the statistics are up to date.

    If rows=1000 is displayed for the tmp1 table in the execution result of the EXPLAIN statement, the statistics are not synchronized to the optimizer. Low query performance is usually caused by a lack of statistics. You can execute the ANALYZE <tablename> statement to collect statistics and synchronize the statistics to the optimizer in time.

  • Scenarios for updating statistics
    We recommend that you execute the ANALYZE <tablename> statement in the following scenarios:
    • Data is imported.
    • A large number of INSERT, UPDATE, and DELETE operations are performed.
    • Statistics need to be collected for both internal and foreign tables.
    • Statistics need to be collected for a parent table.
    • If you encounter one of the following issues, you must execute the ANALYZE <tablename> statement before you run an import task. This can improve efficiency.
      • An OOM exception occurs when multiple tables are joined.

        The Query executor exceeded total memory limitation xxxxx: yyyy bytes used error is reported.

      • Import efficiency is low.

        When you query or import data in Hologres, tasks take an unexpectedly long time to complete. This results in low efficiency.

  • Select a join order algorithm for the optimizer
    • If complex JOIN operations are performed on a large number of tables, the optimizer takes a long time to select the optimal join order. Adjusting the join order policy may reduce the amount of time required to optimize queries. The following syntax is used to select a join order algorithm for the optimizer:
      set optimizer_join_order = '<value>'; 
    • Parameter
      value: the join order algorithm to be used by the optimizer. Valid values:
      • query: uses the join order that is specified in the SQL statement. This algorithm has the lowest optimizer overheads.
      • greedy: uses a greedy algorithm to obtain a better join order. This algorithm has moderate optimizer overheads.
      • exhaustive: uses a dynamic programming algorithm to obtain the optimal join order. This generates the optimal execution plan. However, this algorithm has the highest optimizer overheads.
      Default value: exhaustive.
    • Usage notes

      You can use the exhaustive algorithm to obtain the globally optimal join order. The amount of time required for optimization depends on the number of tables that you want to join. For example, if more than 10 tables are joined, obtaining the optimal join order may take a long time. Hologres takes less time to determine a join order if the query or greedy algorithm is used, but neither algorithm guarantees the optimal join order.
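      For example, if you have already written the joins of a many-table query in a reasonable order, the following sketch reduces optimization time by instructing the optimizer to keep that order. The table names t1, t2, and t3 are hypothetical:
      set optimizer_join_order = 'query';
      select *
      from t1
      join t2 on t1.a = t2.a
      join t3 on t2.b = t3.b;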

Set an appropriate number of shards

The number of shards affects the DOP of queries. This setting is critical to query performance. If only a few shards are available, the DOP is low. Excessive shards may lead to increased query startup overheads and decreased query efficiency. In addition, a large number of small files occupy much metadata management space in the memory. You can improve query efficiency and reduce memory overheads by setting an appropriate number of shards based on the instance types.

The number of shards is pre-allocated based on the instance type of a Hologres instance. The default number of shards for an instance is approximately the number of CPU cores that are used for core queries in the instance. This number is smaller than the number of purchased CPU cores because the purchased CPU cores are also allocated to other nodes, such as query nodes, access nodes, control nodes, and scheduling nodes. For more information about the default numbers of shards for instances of different types, see Instance types.

After an instance is scaled up, the number of shards is not automatically changed for databases that were created before the scale-up. You must change the number of shards based on your business requirements. For a database created after the instance is scaled up, the number of shards is determined based on the new instance type. The default number of shards already accounts for scale-up requirements, so we recommend that you change the number of shards only if the resources are increased by more than five times. If the resources are increased by less than five times, you can still achieve high query efficiency without changing the number of shards. For more information, see Best practices for specifying table groups.

You must change the number of shards in the following scenarios:
  • After a Hologres instance is scaled up, the business continues to grow, and the number of shards can no longer meet demand. To improve query efficiency, you must create a table group and specify a larger number of shards for the table group. Existing tables and data remain in the original table group; you must import them into the new table group to complete resharding.
  • After a Hologres instance is scaled up, new business needs to be launched without affecting the original business. In this case, you can create a table group and specify an appropriate number of shards for the table group without modifying the schema of the original table.
Note You can create multiple table groups in a database. Make sure that the total number of shards for all table groups does not exceed the default number of shards recommended for the instance. This way, you can make the most out of CPU resources.
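For example, the following sketch creates a table group with more shards and places a new table in it. It assumes the hg_create_table_group procedure and the table_group table property; verify the exact syntax against the table group documentation of your Hologres version:
call hg_create_table_group('new_tg', 40); -- hypothetical table group with 40 shards
begin;
create table tbl_in_new_tg (a int, b text);
call set_table_property('tbl_in_new_tg', 'table_group', 'new_tg');
commit;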

Select an appropriate distribution key

A distribution key is used to evenly divide data into multiple shards. This can help prevent data skew. If you specify the join keys of multiple tables to constitute a distribution key and perform a local JOIN operation on the tables based on the distribution key, queries can be accelerated. When you create a table, you can select an appropriate distribution key based on the following rules:
  • Suggestions on distribution key settings
    • Specify join keys to constitute a distribution key.
    • Select columns that are frequently used in the GROUP BY clause to constitute a distribution key.
    • Select columns in which data is evenly distributed and discrete to constitute a distribution key.
  • Sample scenario of setting an appropriate distribution key
    For example, the execution result of the EXPLAIN statement indicates that the execution plan for querying the tmp and tmp1 tables contains a redistribution operation but no local JOIN operation. In this case, query efficiency is low. You must re-create the two tables and specify the join keys to constitute the distribution key. This prevents additional overheads caused by data redistribution when multiple tables are joined. You can execute the following DDL statements to re-create the two tables and specify the join keys to constitute the distribution key:
    begin;
    create table tmp(a int, b int, c int);
    call set_table_property('tmp', 'distribution_key', 'a');
    commit;
    
    begin;
    create table tmp1(a int, b int, c int);
    call set_table_property('tmp1', 'distribution_key', 'b');
    commit;
    
    -- Specify the join keys to constitute the distribution key. 
    select count(1) from tmp join tmp1 on tmp.a = tmp1.b ;
    After you reset the distribution key for the two tables, execute the EXPLAIN statement again. The execution result indicates that data is evenly distributed to shards by hash key and that the Redistribute Motion operator is no longer generated. This means that data is not redistributed and that additional network overheads are prevented.
    Hologres has four types of Motion operators that correspond to four data redistribution scenarios:
    • Redistribute Motion: Data is hashed or randomly distributed to one or more shards.
    • Broadcast Motion: Data is copied to all shards.
    • Gather Motion: Data is summarized to one shard.
    • Forward Motion: Data is transmitted between an external data source or execution engine and a Hologres execution engine. This operator is used for federated queries.
    When you view the execution result of the EXPLAIN statement, take note of the following items:
    • If the Motion operators are time-consuming, you can redesign the distribution key.
    • The Broadcast Motion operator is superior only in scenarios where both the number of shards and number of broadcast tables are small.
    • If a Gather Motion or Broadcast Motion operator is generated due to invalid statistics, you can execute the ANALYZE <tablename> statement to use the more efficient Redistribute Motion operator.
    • You can execute the following statements to prohibit the generation of Motion operators and then compare the query time of different Motion operators:
      -- Prohibit the generation of the Broadcast Motion operator. 
      set optimizer_enable_motion_broadcast = off; 
      
      -- Prohibit the generation of the Redistribute Motion operator. 
      set optimizer_enable_motion_redistribute = off; 
  • Optimize the GROUP BY operation
    A GROUP BY operation redistributes data based on the columns in the GROUP BY clause. If the GROUP BY operation is time-consuming, you can use the columns in the GROUP BY clause as the distribution key.
    -- Distribute data based on the values of column a to reduce data redistribution and make full use of concurrent computing of shards. 
    select a, count(1) from t1 group by a; 
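    For example, the following sketch re-creates the t1 table with column a as the distribution key so that the preceding GROUP BY query aggregates locally on each shard:
    begin;
    create table t1 (a int, b int);
    call set_table_property('t1', 'distribution_key', 'a');
    commit;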
  • Process data skew
    Queries slow down if data is unevenly distributed among multiple shards. You can execute the following statement to check whether data is evenly distributed:
    -- The hg_shard_id column is a built-in hidden column of each table. This column indicates the shard where the data of the corresponding row resides.
    select hg_shard_id, count(1) from t1 group by hg_shard_id;
    If data is unevenly distributed, you must select a column in which data is evenly distributed and discrete as the distribution key.
    Note To change the distribution key, you must create another table and import data.

Disable dictionary encoding

For queries on columns of STRING types (TEXT, CHAR, and VARCHAR), dictionary encoding reduces the amount of time consumed to compare strings but incurs significant encoding and decoding overheads.

By default, Hologres enables dictionary encoding for all columns of the STRING type. You can leave the dictionary_encoding_columns property empty or disable dictionary encoding for specific columns. If you modify the dictionary encoding setting, data files are re-encoded for storage, which consumes CPU and memory resources. We recommend that you modify dictionary encoding settings during off-peak hours.

You can query the statistics to check whether the Decode operator is time-consuming. If it is, disable dictionary encoding for the related columns. If a table contains a large number of columns of the STRING type, we recommend that you disable dictionary encoding to improve performance.

For example, you can execute the following statements to disable dictionary encoding for all columns of a table:
begin;
create table tbl (a int not null, b text not null, c int not null, d int);
call set_table_property('tbl', 'dictionary_encoding_columns', '');
commit;
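If only specific columns are queried by equivalence and benefit from dictionary encoding, you may be able to keep encoding for those columns only. The following sketch assumes the column:on/off suffix syntax for the dictionary_encoding_columns property; verify it against the table property documentation of your Hologres version:
begin;
create table tbl2 (a int not null, b text not null, c text);
call set_table_property('tbl2', 'dictionary_encoding_columns', 'b:on,c:off');
commit;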

Optimize SQL statements

You can optimize SQL statements to improve query efficiency by using the following methods:
  • Optimize data writes and updates by using fixed plans

    Fixed plans are an optimization that Hologres provides for high-throughput data write, update, and delete scenarios. Their simplified execution paths improve performance and throughput by several times. For more information about fixed plans, see Accelerate the execution of SQL statements by using fixed plans.

  • Change PostgreSQL Query Engine (PQE) operators

    Hologres supports both the Alibaba Cloud-developed Hologres Query Engine (HQE) and PQE. If an SQL statement contains an operator that is not supported by HQE, Hologres sends the operator to PQE for execution. In this case, query performance is low. You must optimize the SQL statement.

    You can execute the EXPLAIN statement to check the execution plan of an SQL statement. If the execution plan contains External SQL(Postgres), the SQL statement is executed in PQE.

    For example, HQE does not support the NOT IN operation, so the system uses PQE to perform it. We recommend that you change NOT IN to NOT EXISTS. Before optimization, execute the following statement to check the execution plan of the SQL statement:
    explain select * from tmp where a not in (select a from tmp1);
    The External SQL(Postgres) operator in the execution plan indicates that the SQL statement is executed in PQE.
    After optimization, execute the following statement to check the execution plan. The plan no longer contains the External SQL(Postgres) operator, which indicates that the SQL statement is no longer executed in PQE.
    explain select * from tmp where not exists (select a from tmp1 where a = tmp.a);
    You can change functions to send operators to HQE. The following changes are recommended:
    • not in: Change it to not exists.
      Example: select * from tmp where not exists (select a from tmp1 where a = tmp.a);
    • regexp_split_to_table(string text, pattern text): Change it to unnest(string_to_array(...)).
      Example: select name, unnest(string_to_array(age, ',')) from demo;
      Remarks: The regexp_split_to_table function supports regular expressions, but the replacement does not.
    • substring: Change it to extract(hour from to_timestamp(...)) when you extract time parts.
      Example: Change
      select cast(substring(c1, 13, 2) as int) as hour from t2;
      to:
      select extract(hour from to_timestamp(c1, 'YYYYMMDD HH24:MI:SS')) from t2;
      Remarks: Some Hologres V0.10.x versions and all earlier versions do not support the substring function in HQE. In Hologres V1.3 and later, HQE supports the substring function if the input parameters are not regular expressions.
    • regexp_replace: Change it to replace.
      Example: Change
      select regexp_replace(c1::text, '-', '') from t2;
      to:
      select replace(c1::text, '-', '') from t2;
      Remarks: The replace function does not support regular expressions.
    • at time zone 'utc': Delete the at time zone 'utc' clause.
      Example: Change
      select date_trunc('day', to_timestamp(c1, 'YYYYMMDD HH24:MI:SS') at time zone 'utc') from t2;
      to:
      select date_trunc('day', to_timestamp(c1, 'YYYYMMDD HH24:MI:SS')) from t2;
    • cast(text as timestamp): Change it to to_timestamp.
      Example: Change
      select cast(c1 as timestamp) from t2;
      to:
      select to_timestamp(c1, 'yyyyMMdd hh24:mi:ss') from t2;
    • timestamp::text: Change it to to_char.
      Example: Change
      select c1::text from t2;
      to:
      select to_char(c1, 'yyyyMMdd hh24:mi:ss') from t2;
  • Do not perform fuzzy searches

    Fuzzy search operations such as LIKE cannot use indexes, so a fuzzy search triggers a full table scan.

  • Disable caching of query results
    By default, Hologres caches query and subquery results and returns the cached results for repeated queries. If you do not want cached results to affect your query results or performance measurements, run the following command to disable caching of query results:
    set hg_experimental_enable_result_cache = off;
  • Prevent OOM exceptions
    The following commands can help prevent OOM exceptions:
    • set hg_experimental_scan_node_dop=<number>;
      Description: Sets the number of tables that are concurrently scanned by the scan operator in a shard. The maximum value is the number of tables in a shard. By default, the number is set to the optimal value for query performance. This command is applicable to scenarios where an OOM exception occurs when you write data to tables.
      Valid values: 0 to 512.
      Note: If a large value is set, an OOM exception may occur. As a result, the import or query may fail, or the instance may even restart, which causes services to become unavailable. If a small value is set, query performance may be low.
      Recommendation:
      1. Execute the SHOW hg_experimental_scan_node_dop; statement to query the current number of tables that are concurrently scanned by the scan operator in a shard.
      2. Divide the queried number by two and use the result as the value of the hg_experimental_scan_node_dop parameter.
      Note: You can also use other values based on your business requirements.
    • set hg_experimental_dml_bulkload_dop=<number>;
      Description: Sets the number of tables that are concurrently written or updated. The maximum value is the number of tables in a shard. The minimum value is 1. By default, the number is set to the optimal value for query performance. This command is applicable to scenarios where an OOM exception occurs when you write data to tables.
      Valid values: 0 to 512.
      Note: If a large value is set, an OOM exception may occur. As a result, the import or query may fail, or the instance may even restart, which causes services to become unavailable. If a small value is set, query performance may be low.
      Recommendation:
      1. Execute the SHOW hg_experimental_dml_bulkload_dop; statement to query the current number of tables that are concurrently written or updated in a shard.
      2. Divide the queried number by two and use the result as the value of the hg_experimental_dml_bulkload_dop parameter.
      Note: You can also use other values based on your business requirements.
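    For example, the following sketch halves both parameters after checking their current values. The values 16 and 8 are illustrative:
    show hg_experimental_scan_node_dop;        -- suppose this returns 16
    set hg_experimental_scan_node_dop = 8;     -- half of the queried value
    show hg_experimental_dml_bulkload_dop;     -- suppose this returns 8
    set hg_experimental_dml_bulkload_dop = 4;  -- half of the queried value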
  • Optimize the ORDER BY LIMIT operation

    In versions earlier than Hologres V1.3, the Merge Sort operator is not supported in the ORDER BY LIMIT operation. In the execution plan of an SQL statement that contains the ORDER BY LIMIT operation, the final results are sorted before the results are exported. This results in relatively low query performance. In Hologres V1.3 and later, the ORDER BY LIMIT operation is optimized to support the Merge Sort operator. The optimization merges and sorts multiple data streams without additional sorting. This improves query performance. If the version of your Hologres instance is earlier than V1.3, submit a ticket to apply for an instance upgrade.

    Example:
    • DDL statements
      begin;
      create table test_use_sort_1
      (
                uuid           text not null,
                gpackagename   text not null,
                recv_timestamp text not null
      );
      call set_table_property('test_use_sort_1', 'orientation', 'column');
      call set_table_property('test_use_sort_1', 'distribution_key', 'uuid');
      call set_table_property('test_use_sort_1', 'clustering_key', 'uuid:asc,gpackagename:asc,recv_timestamp:desc');
      commit;
      
      --Insert data into the table.
      insert into test_use_sort_1 select i::text, i::text, '20210814' from generate_series(1, 10000) as s(i);
      
      -- Update statistics.
      analyze test_use_sort_1;
    • Query statement
      select uuid from test_use_sort_1 order by uuid limit 5;
    • Execution plan comparison
      • The following figure shows the execution plan in Hologres V1.1.
      • The following figure shows the execution plan in Hologres V1.3.
      The comparison shows that the execution plan in Hologres V1.3 contains one less sort operation before the results are exported: multiple data streams are merged and sorted, which improves query performance.
  • Optimize the COUNT DISTINCT function
    • Change COUNT DISTINCT to APPROX_COUNT_DISTINCT.

      The COUNT DISTINCT function precisely removes duplicate data. The records with the same key must be distributed to the same node for deduplication by using the function. This is resource-consuming. Hologres supports the extended function APPROX_COUNT_DISTINCT. The HyperLogLog algorithm is used to approximate the number of distinct values to perform imprecise COUNT DISTINCT computing. This improves query performance. The average error rate can be controlled between 0.1% and 1%. You can change COUNT DISTINCT to APPROX_COUNT_DISTINCT based on your business requirements. For more information, see APPROX_COUNT_DISTINCT.
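      For example, the following sketch shows the rewrite. The tbl table and userid column are hypothetical:
      -- Before: precise deduplication, which is resource-intensive.
      select count(distinct userid) from tbl;
      -- After: approximate deduplication with an average error rate of about 0.1% to 1%.
      select approx_count_distinct(userid) from tbl;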

    • Use the UNIQ function

      In Hologres V1.3 and later, you can use the UNIQ function to precisely remove duplicate data. If the cardinality of the GROUP BY key is high, the UNIQ function performs better and saves more memory than the COUNT DISTINCT function. If an OOM exception occurs when you use the COUNT DISTINCT function, you can use the UNIQ function instead. For more information, see UNIQ.
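      For example, the following sketch assumes that the UNIQ function accepts the same argument as COUNT DISTINCT. The tbl table and its columns are hypothetical:
      -- Before: may cause an OOM exception if the cardinality of the GROUP BY key is high.
      select ds, count(distinct userid) from tbl group by ds;
      -- After: more memory-efficient for high-cardinality GROUP BY keys.
      select ds, uniq(userid) from tbl group by ds;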

    • Select an appropriate distribution key

      If multiple COUNT DISTINCT functions with the same key exist, and data is evenly distributed and discrete, we recommend that you set the key of the COUNT DISTINCT functions as the distribution key. This way, the same data can be distributed to the same shard. This prevents data shuffling.

    • Optimize the operation that uses multiple COUNT DISTINCT functions
      If an SQL statement contains multiple COUNT DISTINCT functions, each function computes values separately, and the results are then joined. If the FROM clause is a subquery, the subquery is repeatedly computed, which consumes large amounts of resources and deteriorates performance. In Hologres V1.3 and later, the operation that uses multiple COUNT DISTINCT functions is optimized: if more than one COUNT DISTINCT function is used, the functions are automatically rewritten to use the UNIQ function. The results and semantics are the same as those of the original implementation, and the change is transparent to the upper layer. This reduces repeated subquery computing and improves query performance. If the version of your Hologres instance is earlier than V1.3, submit a ticket to apply for an instance upgrade. Example:
      • DDL statements
        create table test_count_distinct_implementation
        (
            id int
            ,dim1 int
            ,dim2 text
            ,dim3 int8
            ,userid text
            ,deviceid text
            ,price float8
            ,ds text
        );
        
        insert into test_count_distinct_implementation select i, i % 17, (i % 13)::text, i % 37, 'user_' || round(i % 97723)::text, 'device_' || floor(i % 179357)::text, (sqrt(i % 24658)), '1' from generate_series(1, 5000)i;
        insert into test_count_distinct_implementation select i, i % 19, (i % 13)::text, i % 37, 'user_' || round(i % 87723)::text, 'device_' || floor(i % 139557)::text, (sqrt(i % 38658)), '2' from generate_series(1, 4000)i;
        
        analyze test_count_distinct_implementation;
      • Query statement
        select count(1), count(distinct userid), count(distinct deviceid), sum(distinct price::numeric) from test_count_distinct_implementation;
      • Execution plan comparison
        • The following figure shows the execution plan in Hologres V1.1.
        • The following figure shows the execution plan in Hologres V1.3.
        The comparison shows that the execution plan in Hologres V1.3 reduces repeated computing, which improves query performance.
  • Optimize WITH queries (Beta)
    Hologres is compatible with PostgreSQL and supports common table expressions (CTEs), which are commonly used in recursive queries in a WITH clause. The implementation is the same as that in PostgreSQL: CTEs are inlined, so repeated computing occurs when a CTE is referenced multiple times. In Hologres V1.3, you can set the optimizer_cte_inlining GUC parameter to enable CTE reuse. This way, a CTE is computed only once and referenced multiple times, which saves computing resources and improves query performance. If the version of your Hologres instance is earlier than V1.3, upgrade your instance.
    set optimizer_cte_inlining=off;
    Note
    • This feature is still in the Beta phase and is not enabled by default. By default, all CTEs are inlined and repeatedly computed. You can manually set the GUC parameter to enable this feature.
    • This feature depends on shuffle spilling because downstream consumers read the CTE at different paces. As a result, query performance may be affected when a large amount of data is processed.
    • Example
      create table cte_reuse_test_t
      (
          a integer not null,
          b text,
          primary key (a)
      );
      
      insert into cte_reuse_test_t values(1, 'a'),(2, 'b'), (3, 'c'), (4, 'b'), (5, 'c'), (6, ''), (7, null);
      
      
      set optimizer_cte_inlining=off;
      
      explain with c as (select b, max(a) as a from cte_reuse_test_t group by b)
      select a1.a,a2.a,a1.b, a2.b
      from c a1, c a2
      where a1.b = a2.b
      order by a1.b
      limit 100;
    • Execution plan comparison
      • The following figure shows the execution plan of the WITH clause in Hologres V1.1.
      • The following figure shows the execution plan of the WITH clause in Hologres V1.3.
      The comparison shows that the execution plan in Hologres V1.1 contains multiple HashAggregate operations, whereas in Hologres V1.3 the CTE results are computed only once and then reused, which improves query performance.
  • Optimize single-stage aggregation to multi-stage aggregation

    If aggregate operators are time-consuming, check whether local shard-level pre-aggregation is performed.

    You can perform aggregation operations on individual shards to decrease the volume of data that is involved in the final aggregation operation and improve performance. The following aggregation operations are supported:
    • Three-stage aggregation: aggregates data by file, aggregates data in each shard, and then summarizes the aggregation results of all shards.
    • Two-stage aggregation: aggregates data in each shard and then summarizes the aggregation results of all shards.
    You can execute the following statement to forcefully perform multi-stage aggregation in Hologres:
    set optimizer_force_multistage_agg = on;
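    For example, the following sketch forces multi-stage aggregation for an aggregate query on the tmp table that is used in the preceding examples:
    set optimizer_force_multistage_agg = on;
    -- Each shard pre-aggregates its own data before the final aggregation summarizes the results.
    select b, count(1) from tmp group by b;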

Optimize table attributes when you create a table

  • Select an appropriate storage model
    Hologres supports row-oriented, column-oriented, and row-column hybrid storage models. You can select a storage model that best suits your business scenario. The following table describes the storage models.
    • Row-oriented storage
      Scenarios:
      • Point query scenarios in which data is queried with high QPS based on primary keys.
      • Scenarios in which entire rows are read at the same time and high performance of the UPDATE, DELETE, and INSERT operations is required.
      Disadvantage: The performance of operations such as large-scale data queries, full table scans, and aggregation is low.
    • Column-oriented storage
      Scenarios: Data analysis scenarios such as multi-column range queries, single-table aggregation, and multi-table joins.
      Disadvantage: The performance of the UPDATE and DELETE operations is lower than that in the row-oriented storage model. The performance of point queries is lower than that in the row-oriented storage model in scenarios without specified indexes.
    • Row-column hybrid storage
      Scenarios: Scenarios in which both the row-oriented and column-oriented storage models are applicable.
      Disadvantage: The memory overheads are high.
  • Select an appropriate data type
    Hologres supports a variety of data types. You can select a data type that best suits your business needs. The following rules apply:
    • Select a type that requires minimal storage space.
      • Preferentially use the INT type rather than the BIGINT type.
      • Preferentially use the DECIMAL or NUMERIC type with minimal PRECISION or SCALE. To prevent statistical errors, do not use the FLOAT or DOUBLE PRECISION type.
      • Do not use the FLOAT or DOUBLE type for columns in the GROUP BY clause.
      • Preferentially use the TEXT type rather than the VARCHAR(N) and CHAR(N) types. Set the value of N as small as possible.
      • Use the TIMESTAMPTZ and DATE types rather than the TEXT type to represent a date.
    • Use consistent data types for JOIN operations.

      When you join tables, make sure that the columns that match each other use the same data type. This prevents the additional overheads that are caused when Hologres implicitly converts column data types.

    • Do not use the FLOAT or DOUBLE type in operations such as UNION and GROUP BY.

      Operations such as UNION and GROUP BY do not support the DOUBLE PRECISION or FLOAT type. You must use the DECIMAL type.
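    The following sketch applies the preceding rules. The table and column names are illustrative:
    begin;
    create table type_demo (
        id         int,            -- INT instead of BIGINT when the value range allows
        name       text,           -- TEXT instead of VARCHAR(N) or CHAR(N)
        amount     decimal(12, 2), -- DECIMAL with minimal precision and scale instead of FLOAT or DOUBLE
        created_at timestamptz     -- TIMESTAMPTZ instead of TEXT for dates
    );
    commit;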

  • Select an appropriate primary key
    Primary keys are used to uniquely identify data. They are applicable to data import scenarios with duplicate primary key values. You can specify the option parameter to select a deduplication mode when you import data. Valid values:
    • ignore: ignores new data.
    • update: uses new data to overwrite existing data.

    Appropriate settings of primary keys can help the optimizer generate appropriate execution plans in specific scenarios. For example, GROUP BY pk,a,b,c is executed based on an appropriate primary key to accelerate data queries.

    In the column-oriented storage model, primary keys may degrade data write performance. In most cases, writing data without primary keys is about twice as fast as writing data with primary keys.
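    For example, in the following sketch, the GROUP BY columns start with the primary key, which helps the optimizer accelerate the query. The pk_demo table is hypothetical:
    begin;
    create table pk_demo (pk text not null, a int, b int, c int, primary key (pk));
    commit;
    select pk, a, b, c, count(1) from pk_demo group by pk, a, b, c;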

  • Select an appropriate partitioned table
    Hologres tables support only one level of partitioning. Appropriate partition settings accelerate queries. Inappropriate partition settings such as a large number of partitions cause excessive small files and degrade query performance.
    Note For data imported by day, we recommend that you create a partitioned table for the data of each day. The data is separately stored. Only the data of the current day is accessed.
    You can configure partitions in the following scenarios:
    • Data in a child partition can be deleted by dropping or truncating that partition without affecting the data in other child partitions. The DROP and TRUNCATE operations provide higher performance than the DELETE operation.
    • When predicate conditions contain a partition key column, the corresponding partitions can be located based on the index of the partition key column. You can also directly query child partitions.
    • A partitioned table is required for real-time data that is imported on a periodic basis. If new data is imported every day, you can use the date column as a partition key column and import data to a child partition every day. For example, you can execute the following statements:
      begin;
      create table insert_partition(c1 bigint not null, c2 boolean, c3 float not null, c4 text, c5 timestamptz not null) partition by list(c4);
      call set_table_property('insert_partition', 'orientation', 'column');
      commit;
      create table insert_partition_child1 partition of insert_partition for values in('20190707');
      create table insert_partition_child2 partition of insert_partition for values in('20190708');
      create table insert_partition_child3 partition of insert_partition for values in('20190709');
      
      select * from insert_partition where c4 >= '20190708';
      select * from insert_partition_child3;
  • Select an appropriate index
    Hologres uses various types of indexes to implement different features. You can select an index that best suits your business scenario to improve query performance. The following table describes the supported indexes.
    • clustering_key (clustered index)
      Description: The clustered index is used in files. Data in a file is sorted by this index. For range queries, Hologres can use the orderliness of the clustered index to filter data.
      Suggestion: Use range or filter query columns as the clustered index. Index-based data filtering complies with the leftmost prefix matching principle. We recommend that you specify no more than two columns to constitute a clustered index.
      Sample statement: select sum(a) from tb1 where a > 100 and a < 200;
    • bitmap_columns (bitmap index)
      Description: The bitmap index is used in files. Bitmaps are created for data in a file based on this index. For equivalent queries, Hologres encodes the data of each row by value and performs bit operations to locate the rows where the data resides. The time complexity is O(1).
      Suggestion: Use equivalent query columns as bitmap indexes.
      Sample statement: select * from tb1 where a = 100;
    • segment_key, also known as event_time_column (segment key)
      Description: A file-level index. Data is written to files in append-only mode, and small files are then merged based on this index. The segment key identifies the boundary of a file, and Hologres uses it to find the files that a query requires. The segment key is designed for ordered, range-specific data such as timestamps and dates, and is therefore strongly correlated with the data write time.
      Suggestion: Use this index to filter files, and then use the bitmap index or clustered index for range or equivalent queries within files. Index-based data filtering complies with the leftmost prefix matching principle, and we recommend that you specify only a single column. We recommend that you specify the first non-empty timestamp column as the segment key.
      Sample statement: select sum(a) from tb1 where ts > '2020-01-01' and a < '2020-03-02';
    Both the clustered index and the segment key must comply with the leftmost prefix matching principle that is used in traditional databases such as MySQL. Data is sorted based on the values in the leftmost column of an index; rows with equal values in the leftmost column are then sorted by the second leftmost column, and so on. For example, you can execute the following statements:
    call set_table_property('tmp', 'clustering_key', 'a,b,c');
    select * from tmp where a > 1 ;  -- The clustered index can be used. 
    select * from tmp where a > 1 and c > 2 ;   -- The clustered index can be used only for column a. 
    select * from tmp where a > 1 and b > 2 ;  -- The clustered index can be used for both columns a and b. 
    select * from tmp where a > 1 and b > 2 and c > 3 ; -- The clustered index can be used for columns a, b, and c. 
    select * from tmp where b > 1 and c > 2 ;   -- The clustered index cannot be used for column b or c. 
    A bitmap index supports AND queries among multiple columns. Sample statements:
    call set_table_property('tmp', 'bitmap_columns', 'a,b,c');
    select * from tmp where a = 1 and b = 2 ;  --The bitmap index can be used. 
    Note You can add a bitmap index after you create a table. The clustered index and segment key are specified only when you create a table and cannot be added after the table is created.
  • Check whether an index is used
    Create a table named tmp and specify index columns by executing the following statements:
    begin;
    create table tmp(a int not null, b int not null, c int not null);
    call set_table_property('tmp', 'clustering_key', 'a');
    call set_table_property('tmp', 'segment_key', 'b');
    call set_table_property('tmp', 'bitmap_columns', 'a,b,c');
    commit;
    • To check whether the clustered index is used, execute the following statement:
      explain select * from tmp where a > 1;
      The execution plan shows that the clustered index is used to filter data.
    • To check whether the bitmap index is used, execute the following statement:
      explain select * from tmp where c = 1;
      The execution plan shows that the bitmap index is used to filter data.
    • To check whether the segment key is used, execute the following statement:
      explain select * from tmp where b > 1;
      The execution plan shows that the segment key is used to filter files.