This topic describes how to use ANALYZE statements and the auto-analyze feature.

ANALYZE statement

Statistics determine whether a suitable execution plan can be generated. Hologres collects statistics on sample data to analyze the data to be queried with respect to the following aspects: data distribution and characteristics, table statistics, column statistics, number of rows, number of columns, column width, cardinality, frequency, maximum value, minimum value, high-frequency value, and bucket distribution and characteristics. These statistics can help update the estimated cost for operator execution, narrow down the query range, and estimate the optimal join order, the memory overheads, and the degree of parallelism (DOP). This way, the optimizer can generate an optimal execution plan.

You can use ANALYZE statements to collect statistics of table data in your database. Then, the query optimizer of Hologres generates an optimal query plan based on the statistics to improve query efficiency.
  • Syntax
    -- Update the statistics on a table. By default, statistics on all columns in the table are collected.
    analyze <tablename>;
    
    -- Update the statistics on a column. This requires more data to be sampled than updating the statistics on a table. Therefore, statistics obtained by executing this statement are more accurate. This statement is usually used to update the statistics on a column that stores conditions.
    analyze <tablename>(<colname>, <colname>);
  • Parameters

    The tablename parameter specifies the name of the table for which you want to update statistics, and the colname parameter specifies the name of the column for which you want to update statistics.

  • Syntax description
    The two ANALYZE statements have the following similarities and differences:
    • Similarities
      • Both statements can be used to collect statistics on one or more columns, such as the row count, the column width, the most common values in the columns, the histograms of the columns, and the number of distinct values (NDV) in the columns.
      • After you execute one of the two statements for one or more specific columns, existing statistics on the columns that are returned by using another statement are overwritten. However, existing statistics on other columns are retained. For example, if you execute the analyze <tablename>(<colname1>); statement, existing statistics on the colname1 column are overwritten, but existing statistics on the colname2 column are retained.
    • Differences
      • analyze <tablename>;: Statistics returned by the statement are collected based on sampled data.
      • analyze <tablename>(<colname>, <colname>);: If you execute the statement, Hologres runs the APPROX_COUNT_DISTINCT function to calculate the NDV in columns. In most cases, statistics obtained based on the NDV are more accurate than those based on sampled data, but cost more overheads. Therefore, we recommend that you execute the analyze <tablename>(<colname>, <colname>); statement only for specific columns. Statistics other than the NDV, such as the column width and the histograms of the columns, are collected based on sampled data.
      Therefore, for a table with two columns (colname1 and colname2), the analyze table; statement does not equal the analyze table(colname1, colname2); statement.

      We recommend that you execute the analyze <tablename>(<colname>, <colname>); statement for JOIN columns or GROUP BY columns to obtain finer-grained statistics on these columns.

  • Scenarios
    In the following scenarios, we recommend that you execute the analyze <tablename>; statement on a table:
    • A large number of INSERT, UPDATE, DELETE, or data import operations are performed on the table.
    • The performance of a table JOIN operation is deteriorated. In this case, you can execute the ANALYZE statement for the JOIN and GROUP BY columns before you perform the table JOIN operation.
    • After you execute the CREATE FOREIGN TABLE statement, you want to collect statistics on the foreign table.
    • After you execute the IMPORT FOREIGN SCHEMA statement, you want to collect statistics on the table to be queried.
  • Usage notes
    • In Hologres V0.10 and V1.1, if you want to query a parent partitioned table, execute an ANALYZE statement for the parent partitioned table. If you want to query a child partitioned table, execute an ANALYZE statement for the child partitioned table. If you want to query both the parent and child partitioned tables, execute an ANALYZE statement for both tables to prevent some statistics from being omitted.
    • If you encounter one of the following problems, you must execute an ANALYZE statement before you run an import task. This can improve efficiency.
      • An out-of-memory (OOM) exception occurs when multiple tables are joined. The Query executor exceeded total memory limitation xxxxx: yyyy bytes used error is reported.
      • Import efficiency is low. When you query or import data in Hologres, tasks time out without completing. This results in low efficiency.
    • If the table you want to query has ultra-wide columns, such as a column with bitmap indexes of the BYTEA type and a column of the TEXT type that is more than 1 KB in size, the collection of statistics on these columns has no use and consumes memories in data sampling. For such a table, we recommend that you do not execute the analyze <tablename>; statement. Instead, you can execute the analyze <tablename>(<colname>, <colname>); statement for columns except for ultra-wide columns. For example, you can execute the statement for the JOIN, GROUP BY, and FILTER columns.
      Note You can set the benchmark for ultra-wide columns based on your business requirements. Normally, columns whose data size is greater than 1 KB are considered ultra-wide.

Auto-analyze feature

Hologres V0.10 and later support the auto-analyze feature. After you enable this feature, you do not need to repeatedly and manually execute ANALYZE statements. After you enable this feature, Hologres determines whether to automatically perform ANALYZE operations based on the situation of table creation, data writes, and data modification. This reduces the complexity of operations and prevents omissions in statistics collection caused by omitted ANALYZE statements.

  • Syntax
    • The following statements are used to check whether the auto-analyze feature is enabled:
      SHOW hg_enable_start_auto_analyze_worker;  -- Syntax for Hologres V1.1.
      
      SHOW hg_experimental_enable_start_auto_analyze_worker;  -- Syntax for Hologres V0.10.
    • The following statements are used to enable or disable the auto-analyze feature for databases. You can execute the statements only as a superuser.
      -- Syntax for Hologres V1.1.
      ALTER DATABASE dbname SET hg_enable_start_auto_analyze_worker = ON;  -- Enable the auto-analyze feature for the specified database. ON is the default value.
      ALTER DATABASE dbname SET hg_enable_start_auto_analyze_worker = OFF; -- Disable the auto-analyze feature for the specified database.
      
      -- Syntax for Hologres V0.10.
      ALTER DATABASE dbname SET hg_experimental_enable_start_auto_analyze_worker = ON;  -- Enable the auto-analyze feature for the specified database. ON is the default value.
      ALTER DATABASE dbname SET hg_experimental_enable_start_auto_analyze_worker = OFF; -- Disable the auto-analyze feature for the specified database.
  • Limits
    When you use the auto-analyze feature of Hologres, take note of the following limits:
    • The auto-analyze feature is supported only in Hologres V0.10 and later. You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your instance is earlier than V0.10, submit a ticket to update your instance.
    • Only a superuser can enable or disable the auto-analyze feature.
    • The auto-analyze feature has the following limits on partitioned tables:
      • Before the auto-analyze feature is implemented on child partitioned tables that have been modified, an ANALYZE operation is performed on the relevant parent partitioned table.
      • By default, during data sampling, the system can scan a maximum of 224 (16,777,216) data entries at a time. If the total number of data entries in child partitioned tables exceeds 16,777,216, the data entries are distributed to different partitions, and only data in specific partitions is sampled.
        Note Partitioning does not affect statistics collection on the partition key column, but affects statistics collection on columns whose data is distributed in the same way as the partition key column, such as a column that has the same data with the partition key column. If such columns exist, some data in these columns may not be sampled and the number of entries may be inaccurate. For technical supports, submit a ticket. The designated technical engineer will adjust the maximum number of data entries that can be scanned at a time based on your business requirements.
    • By default, you can use the auto-analyze feature to collect statistics on up to 256 columns at a time. If a table has more than 256 columns, statistics on the first 256 columns are collected. You can change the value of the hg_experimental_auto_analyze_max_columns_count parameter to adjust the limit.
    • By default, each worker used by the auto-analyze feature can occupy 8 GB of memory. If a table has ultra-wide columns, an ANALYZE operation may fail due to the lack of memory resources for data sampling. You can change the value of the auto_analyze_work_memory_mb parameter to adjust the limit. However, you must pay attention to the size of remaining memory available for your Hologres instance. The more advanced instance type you select, the more workers your Hologres instance can have, and the more memory resources the auto-analyze feature can occupy.
  • How the auto-analyze feature works
    After you enable the auto-analyze feature, the system periodically searches for tables that require ANALYZE operations.
    • Regular tables, including internal tables such as non-partitioned tables and partitioned tables
      • The system checks whether operations, especially DML operations such as INSERT, UPDATE, and DELETE that may cause data changes, are performed on a table at intervals of 1 minute. If one of the following conditions is met, the system performs an ANALYZE operation to collect statistics on the table:
        • Non-partitioned table: A DML operation is performed on the table and more than 10% data entries are changed. Partitioned table: A DML operation is performed on the table and more than 10% data entries in the current partition are changed.
        • A TRUNCATE TABLE operation is performed to clear the table data.
        • A DDL operation is performed on the table. For example, a CREATE TABLE operation is performed to create a table or an ALTER TABLE operation is performed to modify the table schema. CALL SET_TABLE_PROPERTY operations that are used to modify table attributes are not included.
      • The system checks data changes of an internal table at intervals of 10 minutes. If more than 10% data entries are changed compared with the previous check result, an ANALYZE operation is performed on this table.
        Note This prevents the system from omitting data changes caused by implicit DML operations, such as real-time data writes by using Flink, Data Integration, or HoloClient.
    • Foreign tables
      • You can execute ANALYZE statements and implement the auto-analyze feature on a foreign table only if the foreign table maps to a MaxCompute table.
      • The system checks data changes and metadata changes of a foreign table at intervals of 4 hours. If the following condition is met, the system performs an ANALYZE operation to collect statistics on the table:

        The data of the MaxCompute table that the foreign table maps to is changed within 4 hours after the previous check. This condition is met if the value of the last_modify_time parameter of the MaxCompute table is changed within 4 hours after the previous check.

    Note A check and its corresponding ANALYZE operation are in the same scheduling task. A check can start only if the previous ANALYZE operation is complete and the time interval between two consecutive checks specified by the scheduling task elapses.
  • Set the parameters of the auto-analyze feature

    After you enable the auto-analyze feature, Hologres periodically checks tables to decide whether to perform ANALYZE operations on tables. If ANALYZE operations are performed, data is sampled and statistics are collected, which consumes resources of your Hologres instance.

    The default settings may not apply to specific business scenarios. For example, if data is not frequently written to your Hologres instance, you can decrease the frequency of automatic ANALYZE operations. You can modify the default settings of the auto-analyze feature based on your business requirements. This allows you to optimize the feature performance in a fine-grained manner.

    Note You can modify the default settings as a superuser. When you enable this feature, you need to set the required parameters at the database level. The settings take effect 1 minute after the feature is enabled.
    • Syntax
      -- Modify the default settings of the auto-analyze feature as a superuser.
      ALTER DATABASE <dbname> SET <GUC>=<values>;
      The dbname parameter specifies the database name, the GUC parameter specifies the parameter name, and the values parameter specifies the parameter value.
    • Parameters
      Parameter Description Supported version Default value Example
      autovacuum_naptime The interval between two consecutive checks of operations on tables. Unit: seconds (s). Hologres V1.1.0 and later
      Note You need to modify the settings of the parameter before use. To modify the settings, submit a ticket.
      60s
      • ALTER DATABASE <dbname> SET autovacuum_naptime = 60;
      • ALTER DATABASE <dbname> SET autovacuum_naptime = '60s';
      • ALTER DATABASE <dbname> SET autovacuum_naptime = '10min';
      hg_auto_check_table_changes_interval The interval between two consecutive checks of data changes in internal tables. Unit: seconds (s). Hologres V1.1.0 and later 600s
      -- Syntax in Hologres V1.1 and later
      ALTER DATABASE <dbname> SET hg_auto_check_table_changes_interval = '600s';
      -- Syntax in Hologres V0.10
      ALTER DATABASE <dbname> SET hg_experimental_auto_check_table_changes_interval = '600s';
      hg_auto_check_foreign_table_changes_interval The interval between two consecutive checks of data changes in foreign tables. Unit: seconds (s). Hologres V1.1.0 and later 14400s
      -- Syntax in Hologres V1.1 and later
      ALTER DATABASE <dbname> SET hg_auto_check_foreign_table_changes_interval = '14400s';
      -- Syntax in Hologres V0.10
      ALTER DATABASE <dbname> SET hg_experimental_auto_check_foreign_table_changes_interval = '14400s';
      hg_experimental_auto_analyze_max_columns_count The maximum number of columns whose statistics can be collected at a time. Hologres V1.1.0 and later 256
      ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_max_columns_count =300;
      auto_analyze_work_memory_mb The maximum size of memory that can be occupied by the auto-analyze feature for a single table. Unit: MB.
      Note This parameter is available only in Hologres V1.1 and later.
      Hologres V1.1.54 and later By default, each worker used by the auto-analyze feature can occupy 8,192 MB (8 GB) of memory. The more advanced instance type you select, the more workers your Hologres instance can have, and the more memory resources the auto-analyze feature can occupy. Change the upper limit of the memory size for a table to 9 GB:
      ALTER DATABASE <dbname> SETauto_analyze_work_memory_mb =9216;
      hg_experimental_auto_analyze_start_time The beginning of the time range when ANALYZE operations are performed every day.
      Note The specified beginning and end of the time range must be in the same time zone. The beginning of the time range must be equal to or earlier than the end of the time range.
      Hologres V1.1.54 and later 00:00 +0800 Request the system to perform ANALYZE operations from 00:00 to 06:00. In the daytime, data in foreign tables are not changed and ANALYZE operations are not required:
      • ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_start_time = '00:00 +0800';
      • ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_end_time = '06:00 +0800';
      hg_experimental_auto_analyze_end_time The end of the time range when ANALYZE operations are performed every day. Hologres V1.1.54 and later 23:59 +0800
      autovacuum_enabled Specifies whether to enable the auto-analyze feature for the specified database. Hologres V1.1.54 and later true, which means to enable this feature for all tables in the database. Disable the auto-analyze feature for the specified table. Then, ANALYZE operations are not performed on this table.
      ALTER TABLE <tablename> SET (autovacuum_enabled = false);

Query table statistics

Table statistics are recorded in a table named hologres_statistic.hg_table_statistic. You can execute the following statement to check the execution status of an ANALYZE operation:
Note If you want to query the information about the previous ANALYZE operation, sort the information by using the analyze_timestamp parameter.
SELECT schema_name,                -- The table schema.
       table_name,                 -- The table name.
       schema_version,             -- The table version.
       statistic_version,          -- The version of statistics obtained by using the previous ANALYZE operation.
       total_rows,                 -- The number of rows involved in the previous ANALYZE operation.
       analyze_timestamp           -- The end of the time range when the previous ANALYZE operation is performed.
FROM   hologres_statistic.hg_table_statistic
WHERE  table_name = '<tablename>'
ORDER BY analyze_timestamp DESC;
  • The table hologres_statistic.hg_table_statistic stores 0 to n statistical records for each table. If the table hologres_statistic.hg_table_statistic stores 0 statistical records, no ANALYZE operations have been performed. If the table stores one or more statistical records, ANALYZE operations have been performed.
  • If the table stores two or more statistical records, the values of the schema_version parameter in the statistical records are different. This is because the schema of the table is changed. For example, after you execute a statement such as the ADD COLUMN statement, the system generates a new statistical record and deprecates the previously generated statistical records.
  • The following code provides an example of query results. In this example, the value of the schema_version parameter in the first statistical record is greater than that in the second statistical record. Therefore, the second statistical record is deprecated.
     schema_name |    table_name    | schema_version | statistic_version | total_rows |  analyze_timestamp
    -------------+------------------+----------------+-------------------+------------+---------------------
     public      | tbl_name_example |             13 |              8580 |      10002 | 2022-04-29 16:03:18
     public      | tbl_name_example |             10 |              8576 |      10002 | 2022-04-29 15:41:20
    (2 rows)
  • Hologres V0.10 and V1.1 does not delete deprecated statistical records from the table hologres_statistic.hg_table_statistic. You do not need to pay attention to these records.

FAQ

In the following scenarios, the auto-analyze feature does not work properly. You can resolve the problems by following the solutions provided in this section.
  • Why does the table hologres_statistic.hg_table_statistic have no statistical records?
    • Problem description: When you open the table hologres_statistic.hg_table_statistic to view statistical records, no statistical records are displayed.
    • Possible causes:
      • The auto-analyze feature is not enabled or the conditions that trigger automatic ANALYZE operations are not met.
      • An error occurs in the auto-analyze feature. To troubleshoot the error, you can submit a ticket.
    • Solution: Manually trigger an ANALYZE operation.
  • Why is the value of the analyze_timestamp parameter too small?
    • Problem description: In the query results, the value of the analyze_timestamp parameter is much smaller than the current timestamp, which means no ANALYZE operations have been performed for a long period.
    • Possible causes:
      • The auto-analyze feature does not work because of a specific reason.
      • The auto-analyze feature was once manually disabled.
    • Solution: Manually trigger an ANALYZE operation, and then submit a ticket for troubleshooting.