This topic describes how to optimize tables and data distribution to optimize the performance of a data warehouse.

Hash clustering

Hash cluster tables can optimize bucket pruning, aggregation, and data storage. When creating a table, you can use the CLUSTERED BY clause to specify the hash key. Then, MaxCompute performs the hash operation on the specified columns and distributes data to each bucket based on the hash values. We recommend that you select columns with fewer duplicate key values as the hash key.

To add the Hash Clustering attribute to a table, use the following statement:
ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]

The ALTER TABLE statement applies to existing tables. After the Hash Clustering attribute is added, data is stored to new partitions in compliance with hash clustering.

Then, you can use the INSERT OVERWRITE statement to convert the source table to a hash cluster table.

Note Hash cluster tables are subject to the following limits:
  • To add data to a table, you can use only the INSERT OVERWRITE statement, but not the INSERT INTO statement.
  • You cannot run the tunnel upload command to import data to a range cluster table. The data uploaded by using this command is not arranged in order.

Other table optimization methods

We recommend that you strictly follow table design specifications. In addition, you can use the following methods to optimize tables:
  • Use intermediate tables: This method is suitable for tables that contain a large amount of data and involve many descendant jobs.
  • Split tables: This method is applicable when data queries about specific fields are extremely slow. You can split such fields to form separate tables.
  • Merge tables: With the development of a data warehouse, you can merge jobs and data for tables with overlapping or duplicate business.
  • Use zipper tables: The proper use of zipper tables can save your storage space.
  • Use the special features of MaxCompute tables.