Hologres: Optimize the performance of querying MaxCompute tables in Hologres

Last Updated: Jun 07, 2024

MaxCompute and Hologres are seamlessly integrated at the underlying layer. You can use the native vector engine of Hologres to directly read data from MaxCompute. You can also use the multi-level partition filtering and operator pushdown features of MaxCompute to accelerate queries. This topic describes how to accelerate queries on MaxCompute tables by using foreign tables or importing data from MaxCompute to Hologres.

Create a foreign table

You can create a foreign table in Hologres to accelerate queries on MaxCompute tables. This method requires no data import or export and does not create redundant storage. Compared with querying the data directly in MaxCompute, this method improves query performance by two to five times. The following sections describe methods that can further accelerate queries when you use foreign tables to query data from MaxCompute tables.
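
For reference, the following statement is a minimal sketch of how to map a MaxCompute table to a Hologres foreign table by using the built-in odps_server foreign server. The project name and table name are placeholders that you must replace with your own values.

-- Map a MaxCompute table to a foreign table in the public schema of Hologres.
-- if_table_exist 'update' refreshes the foreign table if it already exists.
IMPORT FOREIGN SCHEMA <maxcompute_project_name> LIMIT TO (<maxcompute_table_name>)
  FROM SERVER odps_server INTO public OPTIONS (if_table_exist 'update');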

Use HQE

Hologres V0.10 and later support Hologres Query Engine (HQE), which improves the performance of querying MaxCompute tables by about 30% to 100%.

When you use HQE, take note of the following limits:

  • You can use HQE only to accelerate queries on MaxCompute tables in the Optimized Row Columnar (ORC) format. Queries on other files, such as CFile files, are not accelerated.

  • You cannot use HQE to accelerate queries on MaxCompute transactional tables or MaxCompute tables on which schema evolution is performed. For these queries, the system automatically uses the Seahawks Query Engine (SQE).

  • Make sure that the data types of the MaxCompute table map to the data types of the Hologres foreign table that is sourced from the MaxCompute table. Otherwise, the query performance is negatively affected. For more information about data type mappings, see the "Data type mappings between MaxCompute and Hologres" section in Data types.

Prevent full table scans

To prevent a full table scan when you query a MaxCompute table, you can use one of the following methods to optimize the query statement:

  • Use a statement such as SELECT a FROM xx instead of SELECT * FROM xx to read only the columns that you need. This narrows down the query range.

  • Add partition filter conditions or reduce the number of partitions to be scanned. This reduces the amount of data to be scanned. A combined example is provided after this list.
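
The following query is a hypothetical sketch that combines both methods. The foreign table odps_sales, its columns, and the partition key column ds are placeholders.

-- Read only the required columns and restrict the scan to one week of partitions.
SELECT customer_id, amount
FROM odps_sales
WHERE ds BETWEEN '20240601' AND '20240607';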

Optimize SQE-based queries

The system automatically uses SQE if HQE is not supported. In this case, query performance may deteriorate. To optimize query performance, you can use one of the following methods:

  • Convert MaxCompute tables into hash-clustered tables

    Hash-clustered tables support bucket pruning, aggregation, and storage optimization. When you create a MaxCompute table, use the CLUSTERED BY clause to specify the hash key. MaxCompute performs the hash operation on the specified columns and distributes data to each bucket based on the hash values. We recommend that you select columns that have fewer duplicate values as the hash key columns.

    To convert a MaxCompute table into a hash-clustered table, execute the following statement:

    ALTER TABLE <table_name> [CLUSTERED BY (<col_name> [, <col_name>, ...])
                           [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                           INTO <number_of_buckets> BUCKETS];

    After you execute the ALTER TABLE statement to specify a hash key, data in new partitions is stored based on the hash key. For more information about hash clustering, see Hash clustering.

  • Convert MaxCompute tables into range-clustered tables

    Range clustering is a new data clustering method that distributes data in a globally sorted order. Range clustering can prevent data skew issues that may be caused by hash clustering. Range clustering also allows you to create two levels of indexes. Range clustering is suitable for scenarios such as range queries based on cluster keys and multi-key queries.

    To convert a MaxCompute table into a range-clustered table, execute the following statement:

    ALTER TABLE <table_name> [RANGE CLUSTERED BY (<col_name> [, <col_name>, ...])
                             [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                             INTO <number_of_buckets> BUCKETS];

    After you execute the ALTER TABLE statement to specify a cluster key, data in new partitions is stored based on the cluster key. For more information about range clustering, see Range clustering. Concrete examples of both clustering methods are provided after the following note.

    Note
    • You cannot execute the INSERT INTO statement to write data to hash-clustered tables or range-clustered tables. To write data to these tables, execute the INSERT OVERWRITE statement.

    • The data that is imported by using Tunnel commands is not arranged in order. Therefore, you cannot import data to a range-clustered table or a hash-clustered table by using Tunnel commands.
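
The following statements are hypothetical examples of the preceding templates. The table names, column names, and bucket count are placeholders.

-- Convert a table into a hash-clustered table with 1024 buckets.
ALTER TABLE sales_detail
CLUSTERED BY (customer_id)
SORTED BY (customer_id)
INTO 1024 BUCKETS;

-- Convert a table into a range-clustered table with 1024 buckets.
ALTER TABLE user_events
RANGE CLUSTERED BY (event_time)
SORTED BY (event_time)
INTO 1024 BUCKETS;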

Merge small files

The query speed slows down if a large number of small files exist in the MaxCompute table to be queried.

You can execute the following statement in MaxCompute to obtain the number of files in the table to be queried.

desc extended <table_name>;

In the command output, the FileNum parameter indicates the number of files in the desired MaxCompute table, and the Size parameter indicates the total size of files in bytes. If a MaxCompute table contains more than 100 files and the average file size is less than 64 MB, you can merge the small files in MaxCompute.
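
For example, the following statement is a sketch of how to merge the small files of a single partition by using the ALTER TABLE ... MERGE SMALLFILES command in MaxCompute. The table name and partition value are placeholders.

ALTER TABLE sales_detail PARTITION (ds = '20240601') MERGE SMALLFILES;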

Optimize parameter settings

For queries on MaxCompute tables, Hologres provides default values for specific parameters to improve the parallelism of reading data. This improves the query efficiency. You can adjust the parameter settings based on your business requirements.

Important
  • The default values of the parameters are proven to be optimal based on internal tuning and experiments. Therefore, we recommend that you do not change the default values unless necessary.

  • We recommend that you do not set the hg_foreign_table_executor_max_dop parameter to an excessively small value. For example, if you set this parameter to 1 and write and query operations are performed on a table at the same time, the workload is unevenly distributed to workers. As a result, an out of memory (OOM) error may occur.

  • To configure the maximum number of partitions that can be hit in each query, execute the following statement:

    -- The default value is 512. The maximum value is 1024. We recommend that you do not set this parameter to an excessively large value, which may negatively affect the query performance. 
    set hg_foreign_table_max_partition_limit = 128;
  • To configure the size of data that you want to read at the same time from a MaxCompute table, execute the following statement:

    -- The default value is 8192. 
    set hg_experimental_query_batch_size = 4096;
  • To configure the size of each shard of a MaxCompute table, execute the following statement. The shard size affects the parallelism of reading data entries.

    -- The default value is 64, in MB. If a table is large in size, you can increase the value of this parameter to prevent excessive shards from degrading query performance. This parameter is supported in Hologres V1.1 and later. 
    set hg_foreign_table_split_size = 128;
  • To configure the maximum parallelism of reading a MaxCompute table, execute the following statement. The default value is the same as the number of vCPU cores of the instance.

    -- The maximum value is 128. We recommend that you do not use a large value. This prevents your queries from affecting other queries and causing errors due to system overloading, especially in data import scenarios. This parameter is supported in Hologres V1.1 and later. 
    set hg_foreign_table_executor_max_dop = 32;
  • To configure the maximum number of DML statements that can be executed at the same time when you read data from a MaxCompute table, execute the following statement:

    -- The default value is 32. This parameter is designed for data import and export scenarios to prevent import operations from occupying excessive system resources. This parameter is supported in Hologres V1.1 and later. 
    set hg_foreign_table_executor_dml_max_dop = 16;
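
If you need a non-default value to persist beyond the current session, the following statement is a sketch that configures a parameter at the database level, assuming the PostgreSQL-style ALTER DATABASE ... SET syntax that Hologres supports. The database name mydb is a placeholder; new connections to the database inherit the value.

-- Persist the setting for all new connections to the database.
ALTER DATABASE mydb SET hg_foreign_table_executor_max_dop = 32;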

(Recommended) Import data from MaxCompute to Hologres

If you want to perform frequent analysis and computing operations on the data in a MaxCompute table, or you need to join the MaxCompute data with Hologres internal tables, we recommend that you create a table in Hologres and import the data from the MaxCompute table into it. You can also configure a distribution key for the Hologres table based on your business requirements to further accelerate queries.
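
The following statements are a minimal sketch of this method. The internal table holo_sales, its columns, the distribution key, and the foreign table odps_sales are placeholders that you must adapt to your own schema.

BEGIN;
CREATE TABLE holo_sales (
    customer_id BIGINT,
    amount      NUMERIC(38, 4),
    ds          TEXT
);
-- Distribute data by the column that is most frequently used in JOIN and GROUP BY clauses.
CALL set_table_property('holo_sales', 'distribution_key', 'customer_id');
COMMIT;

-- Import data from the MaxCompute table through the corresponding foreign table.
INSERT INTO holo_sales
SELECT customer_id, amount, ds
FROM odps_sales;

-- Collect statistics after the import so that the optimizer generates efficient plans.
ANALYZE holo_sales;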

Compared with the method of creating foreign tables, this method improves the query performance by 10 to 100 times. For more information about how to import data from a MaxCompute table to Hologres, see Import data from MaxCompute to Hologres by executing SQL statements.