This topic describes how to optimize the performance of querying MaxCompute tables in Hologres.

Hologres is seamlessly integrated with MaxCompute. You can query MaxCompute data in Hologres in an accelerated manner by using the following methods:
  • Create a foreign table that is mapped to the external table to be queried

    After a foreign table is created, you can directly query the mapped external table. This method does not require data export and import, or redundant storage. This method is applicable to a table whose size is less than 200 GB. You can determine whether to use this method based on the volume of the data to be scanned at a time, but not the size of the queried fields.

  • Import data from an external table to Hologres

    To analyze and compute large amounts of data from a MaxCompute table, you can create an internal table in Hologres and import data from the MaxCompute table to the internal table. You can specify an appropriate distribution key for the internal table to accelerate queries based on your business requirements.

    Compared with the preceding method, this method ensures faster queries. We recommend that you use this method to query the data of a table whose size is 200 GB or larger. This method is also applicable to complex queries such as queries by index, data updates, and data insertions.

    For more information about how to import data from a MaxCompute table to Hologres, see Import data from MaxCompute to Hologres by executing SQL statements.

Some methods are available for you to optimize the performance of querying MaxCompute tables in Hologres based on your business requirements. For example, you can optimize the query statement, optimize MaxCompute tables, allocate appropriate resources, and properly set parameters to obtain the optimal query performance.

Optimize the query statement

To prevent a full table scan when you query an external table, you can use the following methods to optimize the query statement:
  • Use the SELECT a FROM xx statement instead of the SELECT * FROM xx statement. This way, the query range is narrowed down.
  • Add partition filter conditions or reduce the number of partitions to be scanned.

Optimize MaxCompute tables

You can use the following methods to optimize MaxCompute tables to improve the performance of querying MaxCompute tables in Hologres:
  • Convert MaxCompute tables to hash cluster tables

    Hash cluster tables can help optimize bucket pruning, aggregation, and storage.

    When you create a MaxCompute table, use the CLUSTERED BY clause to specify the hash key. Then, MaxCompute performs the hash operation on the specified columns and distributes data to each bucket based on the hash values. We recommend that you select columns with fewer duplicate key values as the hash key.

    If no hash key is specified for an existing table, execute the following statement to specify the hash key:
    ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]

    The ALTER TABLE statement can be used to specify the hash key for existing tables.

    After the hash key is specified, data is stored in new partitions in compliance with hash clustering. Then, execute the INSERT OVERWRITE statement to convert the MaxCompute table to a hash cluster table.

    Note
    • The INSERT INTO statement is not supported by hash clustering. You can add data only by using the INSERT OVERWRITE statement.
    • MaxCompute Tunnel cannot be used to upload data to a range cluster table.
  • Merge small files

    Excessive small files may slow down the query.

    You can execute the following statement in Hologres to view the number of files that match the query:
    explain analyze <query>;
    The file_count parameter in the output indicates the number of MaxCompute files that match the query. If large numbers of small files are matched, you can merge the small files.

Properly set parameters

For queries of external tables, Hologres automatically sets some parameters to improve the concurrency of reading data. This way, the query efficiency can be improved. If you have special requirements, you can modify these parameters. The default values of these parameters are proven to be the most appropriate based on internal tuning and experiments. Therefore, we recommend that you do not change the default values unless necessary.
Note We recommend that you do not set the hg_foreign_table_executor_max_dop parameter to a small value such as 1, especially when write and query operations are performed on the Hologres instance at the same time. In this case, if you set this parameter to 1, the workload is unevenly distributed to compute workers, which may lead to an out of memory (OOM) error.
-- Set the number of partitions to be hit by each query. Default value: 512. Maximum value: 1024. We recommend that you do not use a large value, which affects the query performance.
set hg_foreign_table_max_partition_limit = 128;

-- Set the number of data entries to be read at a time in a MaxCompute table. Default value: 8192. 
set hg_experimental_query_batch_size = 4096;

-- Set the size of each shard of a MaxCompute table. Default value: 64. Unit: MB. The shard size affects the concurrency. If a table is large in size, you can increase the value of this parameter to prevent excessive shards from deteriorating query performance. This parameter takes effect in Hologres V1.1. 
set hg_foreign_table_split_size = 128; 

-- Set the maximum degree of parallelism (MAXDOP) to read a MaxCompute table. Default value: the number of CPU cores of the instance. Maximum value: 128. We recommend that you do not use a large value. This prevents your queries from affecting other queries and causing errors due to system overloading, especially in data import scenarios. This parameter takes effect in Hologres V1.1. 
set hg_foreign_table_executor_max_dop = 32;

-- Set the maximum number of DML statements that can be executed at the same time when you query a MaxCompute table. Default value: 32. This parameter is specifically provided for data import and export scenarios to prevent import operations from occupying excessive system resources. This parameter takes effect in Hologres V1.1. 
set hg_experimental_foreign_table_executor_dml_max_dop = 16;

Adopt a new engine for external table queries

Hologres V0.10 and later adopt a new engine to query MaxCompute data. Compared with the versions earlier than V0.10, Hologres V0.10 and later improve query performance by about 30% to 100%.

  • Limits
    • This feature is supported only by Hologres V0.10 and later. You can check the version of your Hologres instance on the instance details page in the Hologres console. If your instance version is earlier than V0.10, submit a ticket to update your instance.
    • In Hologres V1.1 and later, the new engine for external table queries is enabled by default.
    • This feature applies only to dedicated instances, not shared instances.
    • This feature applies only to the queries of MaxCompute ORC tables. Other files such as CFile files are not supported.
    • To achieve better acceleration performance, make sure that MaxCompute data types are properly mapped to Hologres data types. For more information, see Data type mappings between MaxCompute and Hologres when you create a foreign table.
  • Usage
    • After you update your Hologres instance to V0.10 or later, you can execute the following statements to enable the new engine for different objects:
      -- Enable the new engine for the current session.
      set hg_experimental_enable_access_odps_orc_via_holo = on;
      
      -- Enable the new engine for the specified database.
      alter database <databasename> set hg_experimental_enable_access_odps_orc_via_holo = on;
    • In Hologres V1.1 and later, the new engine for external table queries is enabled by default. You can also execute the following statements to manually enable the new engine for different objects:
      -- Enable the new engine for the current session.
      set hg_enable_access_odps_orc_via_holo = on;
      
      -- Enable the new engine for the specified database.
      alter database <databasename> set hg_enable_access_odps_orc_via_holo = on;