This topic describes how to optimize the performance of querying MaxCompute tables in Hologres.
Hologres is seamlessly integrated with MaxCompute. You can query MaxCompute data in Hologres in an accelerated manner by using the following methods:
Create a Hologres foreign table that is mapped to the MaxCompute table to be queried
After a Hologres foreign table is created, you can directly query the mapped MaxCompute table. This method does not require data export and import, or redundant storage. This method is applicable to a table whose size is less than 200 GB for a single query. You can determine whether to use this method based on the volume of the data to be scanned at a time, but not the size of the queried fields.
Import data from a MaxCompute table to Hologres
To analyze and compute a large amount of data from a MaxCompute table, you can create an internal table in Hologres and import data from the MaxCompute table to the internal table. You can specify an appropriate distribution key for the internal table to accelerate queries based on your business requirements.
Compared with the preceding method, this method ensures faster queries. We recommend that you use this method to query the data of a table whose size is 200 GB or larger at a time. This method is also applicable to complex queries such as queries by index, data updates, and data insertion.
For more information about how to import data from a MaxCompute table to Hologres, see Import data from MaxCompute to Hologres by executing SQL statements.
Some methods are available for you to optimize the performance of querying MaxCompute tables in Hologres based on your business requirements. For example, you can optimize the query statement, optimize MaxCompute tables, allocate appropriate resources, and properly set parameters to obtain the optimal query performance.
Optimize the query statement
To prevent a full table scan when you query a MaxCompute table, you can use the following methods to optimize the query statement:
SELECT a FROM xxstatement instead of the
SELECT * FROM xxstatement. This way, the query range is narrowed down.
Add partition filter conditions or reduce the number of partitions to be scanned.
Optimize MaxCompute tables
You can use the following methods to optimize MaxCompute tables to improve the performance of querying MaxCompute tables in Hologres:
Convert MaxCompute tables to hash-clustered tables
Hash cluster tables can help optimize bucket pruning, aggregation, and storage.
When you create a MaxCompute table, use the
CLUSTERED BYclause to specify the hash key. Then, MaxCompute performs the hash operation on the specified columns and distributes data to each bucket based on the hash values. We recommend that you select columns with fewer duplicate key values as the hash key.
If no hash key is specified for an existing table, execute the following statement to specify the hash key:
ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]
ALTER TABLEstatement can be used to specify the hash key for existing tables.
After the hash key is specified, data is stored in new partitions in compliance with hash clustering. Then, execute the
INSERT OVERWRITEstatement to convert the MaxCompute table to a hash-clustered table.Note
INSERT INTOstatement is not supported by hash clustering. You can add data only by using the
MaxCompute Tunnel cannot be used to upload data to a range cluster table.
Merge small files
Excessively small files in MaxCompute may slow down the query of table data.
You can execute the following statement in Hologres to query the number of files that match the query:
explain analyze <query>;
The file_count parameter in the output indicates the number of MaxCompute files that match the query. If a large number of small files are matched, you can merge the small files to accelerate the query.
Properly set parameters
For queries of MaxCompute tables, Hologres automatically sets some default parameters to improve the concurrency of reading data. This way, the query efficiency can be improved. If you have special requirements, you can change the values of these parameters based on your business scenario. The default values of these parameters are proven to be the most appropriate based on internal tuning and experiments. Therefore, we recommend that you do not change the default values unless necessary.
We recommend that you do not set the
hg_foreign_table_executor_max_dop parameter to a small value such as 1, especially when write and query operations are performed on the Hologres instance at the same time. In this case, if you set this parameter to 1, the workload is unevenly distributed to compute workers, which may lead to an out of memory (OOM) error.
-- Set the number of partitions to be hit by each query. Default value: 512. Maximum value: 1024. We recommend that you do not use a large value. Using a large value affects the query performance.
set hg_foreign_table_max_partition_limit = 128;
-- Set the number of data entries to be read at a time in a MaxCompute table. Default value: 8192.
set hg_experimental_query_batch_size = 4096;
-- Set the size of each shard of a MaxCompute table. Default value: 64. Unit: MB. The shard size affects the concurrency. If a table is large in size, you can increase the value of the parameter hg_foreign_table_split_size to prevent excessive shards from deteriorating the query performance. The parameter hg_foreign_table_split_size takes effect in Hologres V1.1.
set hg_foreign_table_split_size = 128;
-- Set the maximum degree of parallelism (MAXDOP) to read a MaxCompute table. Default value: the number of CPU cores for an instance. Maximum value: 128. We recommend that you do not use a large value. This prevents your queries from affecting other queries and causing errors due to system overloading, especially in data import scenarios. The parameter hg_foreign_table_executor_max_dop takes effect in Hologres V1.1.
set hg_foreign_table_executor_max_dop = 32;
-- Set the maximum number of DML statements that can be executed at the same time when you query a MaxCompute table. Default value: 32. The parameter hg_foreign_table_executor_dml_max_dop is specifically provided for data import and export scenarios to prevent import operations from occupying excessive system resources. The parameter hg_foreign_table_executor_dml_max_dop takes effect in Hologres V1.1.
set hg_foreign_table_executor_dml_max_dop = 16;
Adopt a new engine for MaxCompute table queries
Hologres V0.10 and later adopt a new engine to query MaxCompute data. Compared with the versions earlier than V0.10, Hologres V0.10 and later improve query performance by about 30% to 100%.
This feature is supported only by Hologres V0.10 and later. You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your Hologres instance is earlier than V0.10, manually upgrade your Hologres instance in the Hologres console or join a Hologres DingTalk group to apply for an instance upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join a Hologres DingTalk group, see Obtain online support for Hologres.
In Hologres V1.1 and later, the new engine for MaxCompute table queries is enabled by default.
This feature applies only to dedicated instances, not shared instances.
This feature applies only to the queries of MaxCompute tables in the Optimized Row Columnar (ORC) format. Other files such as CFile files are not supported.
To achieve better acceleration performance, make sure that MaxCompute data types are properly mapped to Hologres data types. For more information, see Data type mappings between MaxCompute and Hologres.
After you upgrade your Hologres instance to V0.10 or later, you can execute the following statements to enable the new engine for different objects:
-- Enable the new engine for the current session. set hg_experimental_enable_access_odps_orc_via_holo = on; -- Enable the new engine for the specified database. alter database <databasename> set hg_experimental_enable_access_odps_orc_via_holo = on;
In Hologres V1.1 and later, the new engine for MaxCompute table queries is enabled by default. You can also execute the following statements to manually enable the new engine for different objects:
-- Enable the new engine for the current session. set hg_enable_access_odps_orc_via_holo = on; -- Enable the new engine for the specified database. alter database <databasename> set hg_enable_access_odps_orc_via_holo = on;