This topic describes how to use Hologres to optimize the performance of querying MaxCompute foreign tables.

Hologres is seamlessly integrated with MaxCompute at the underlying layer. You can use the following methods to accelerate queries of MaxCompute data:
  • Method 1: Create a foreign table in Hologres to map the source table in MaxCompute.

    By using the foreign table, you can directly query the mapped source table in MaxCompute without migrating data. This method is applicable when the volume of data to be queried at a time is less than 100 GB.

  • Method 2: Import data from a MaxCompute foreign table to an internal table in Hologres.

    To analyze and compute large amounts of data from a MaxCompute foreign table, you can create an internal table in Hologres and import data from the foreign table to the internal table. You can specify appropriate partition keys for the internal table to accelerate queries based on your business requirements.

    Compared with method 1, this method provides faster queries. We recommend that you use this method to query 100 GB or more of data at a time. This method is also applicable to complex scenarios such as index-based queries, data updates, and data inserts.

    For more information about how to import data from a MaxCompute foreign table to Hologres, see Use SQL statements to import MaxCompute data to Hologres.
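As an illustrative sketch of the two methods, the following statements map a MaxCompute source table and then import its data into an internal table. The project name my_mc_project, the table name orders, and all column names are hypothetical placeholders:

```sql
-- Method 1: map the MaxCompute source table as a foreign table in Hologres.
-- my_mc_project and orders are placeholder names.
IMPORT FOREIGN SCHEMA my_mc_project LIMIT TO (orders)
FROM SERVER odps_server INTO public OPTIONS (if_table_exist 'update');

-- Method 2: create an internal table and import the foreign table data into it.
BEGIN;
CREATE TABLE orders_local (
    order_id BIGINT,
    ds       TEXT,
    amount   NUMERIC
);
-- The distribution key here is an example choice, not a general recommendation.
CALL set_table_property('orders_local', 'distribution_key', 'order_id');
COMMIT;

INSERT INTO orders_local
SELECT order_id, ds, amount FROM orders;
```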

In addition, you can optimize the performance of querying MaxCompute foreign tables based on your business requirements by optimizing query statements, optimizing MaxCompute source tables, allocating appropriate resources, and setting appropriate parameters, as described in the following sections.

Optimize the query statement

When you query a foreign table, you can use the following methods to optimize the query statement to avoid a full table scan:
  • Specify only the required columns, for example, SELECT a FROM xx, instead of executing the SELECT * statement.
  • Add partition filter conditions or reduce the number of partitions to be queried.
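For example, assuming a partitioned foreign table named orders with partition column ds (placeholder names), a query that avoids a full table scan reads only the required columns and filters on the partition:

```sql
-- Read only the needed columns instead of SELECT *,
-- and restrict the scan to a single partition.
SELECT order_id, amount
FROM orders
WHERE ds = '20230101';
```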

Optimize MaxCompute source tables

You can use the following methods to optimize MaxCompute source tables to improve the performance of querying MaxCompute foreign tables:
  • Convert MaxCompute source tables to hash cluster tables.

    Using hash cluster tables helps optimize bucket pruning, aggregation, and storage.

    When you create a MaxCompute table, use the CLUSTERED BY clause to specify the hash key. Then, MaxCompute performs the hash operation on the specified columns and distributes data to each bucket based on the hash values. If no hash key is specified for an existing table, execute the following statement to specify the hash key:
    ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]

    We recommend that you select columns with few duplicate values as the hash key.

    The ALTER TABLE statement specifies the hash key for existing tables and takes effect only for new partitions: after the hash key is specified, data written to new partitions is stored in the hash cluster layout. To convert the existing data of the source table, execute the INSERT OVERWRITE statement to rewrite the data.

    Note
    • Hash cluster tables do not support the INSERT INTO statement. You can write data only by using the INSERT OVERWRITE statement.
    • MaxCompute Tunnel cannot be used to directly upload data to a cluster table.
  • Reduce the number of small files.
    Excessive small files in a MaxCompute source table may slow down the query. You can use the following methods to reduce the number of small files:
    • Reduce the number of small files generated during the computing process.

      Execute the INSERT OVERWRITE statement to update data in the source table. Alternatively, create a table, add data to the new table, and then delete the source table.

    • Reduce the number of small files generated when Tunnel collects data.
      • When you call the Tunnel SDK, commit the buffered data each time its size reaches 64 MB.
      • When you use the Tunnel client, we recommend that you accumulate files until they reach a specific size and then upload them at a time. Do not frequently upload small files.
      • When you import data to a partitioned table, we recommend that you set a lifecycle for the partitions so that expired data is automatically cleared.
      • Execute the INSERT OVERWRITE statement to update data in the source table or partitioned table.
      • Execute the ALTER statement on the Tunnel client to merge small files.
    • Set a lifecycle for all temporary tables. When a temporary table expires, the garbage collection mechanism is automatically triggered to drop the table.
    • Excessive DataHub shards may generate a large number of small files. Follow these rules when you apply for DataHub shards:
      • Apply for a specific number of shards based on the throughput of a single shard. The default throughput of a shard is 1 Mbit/s.
      • When you split shards by hour, each shard generates 12 small files per hour. If you apply for more shards than the data volume requires, a large number of small files are generated.

        In MaxCompute, each DataHub shard is processed in an independent task. To accelerate queries of MaxCompute data, commit the task every 5 minutes or each time the size of buffered data reaches 64 MB.

      • Allocate shards based on your business requirements.
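The source-table optimizations above can be sketched as follows. The table name orders, the hash key order_id, the bucket count, the partition value, and the lifecycle value are hypothetical examples:

```sql
-- Specify a hash key for an existing table.
ALTER TABLE orders
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 1024 BUCKETS;

-- Rewrite existing data into the hash cluster layout.
-- INSERT INTO is not supported; INSERT OVERWRITE must be used.
INSERT OVERWRITE TABLE orders
SELECT * FROM orders;

-- Merge small files in a partition of the source table.
ALTER TABLE orders PARTITION (ds='20230101') MERGE SMALLFILES;

-- Set a lifecycle of 30 days so that expired data is cleared automatically.
ALTER TABLE orders SET LIFECYCLE 30;
```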

Allocate appropriate resources

After you purchase a Hologres instance, the system allocates resources based on the instance type. You can contact technical support or submit a ticket to allocate resources based on your business requirements.

Set appropriate parameters

Set appropriate parameters to optimize the performance of querying MaxCompute foreign tables, as shown in the following statements:
SET hg_experimental_query_batch_size = xx;  -- The number of rows to read from the MaxCompute table at a time. Default value: 8192.
SET hg_experimental_foreign_table_split_size = xx;  -- The split granularity that is used when the MaxCompute foreign table is read.
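For example, the parameters can be adjusted at the session level before a query is executed. The values below and the foreign table name orders are placeholders, not tuned recommendations:

```sql
SET hg_experimental_query_batch_size = 4096;
SET hg_experimental_foreign_table_split_size = 128;
SELECT order_id, amount FROM orders WHERE ds = '20230101';
```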