This topic describes how to design tables and distribute data to improve the performance of a data warehouse.
Hash cluster tables can optimize bucket pruning, aggregation, and data storage. When you create a table, use the CLUSTERED BY clause to specify the hash key. MaxCompute then performs the hash operation on the specified columns and distributes data to buckets based on the hash values. We recommend that you select columns with few duplicate values as the hash key.
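For example, a hash cluster table could be created as follows. This is a minimal sketch; the table name, columns, and bucket count are hypothetical and should be adapted to your data volume:

```sql
-- Hypothetical example: cluster an event table by user ID into 1024 buckets.
CREATE TABLE t_event (
    user_id    BIGINT,
    event_time DATETIME,
    event_type STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id)
SORTED BY (user_id)
INTO 1024 BUCKETS;
```

Here `user_id` is chosen as the hash key because it has many distinct values, which spreads data evenly across buckets.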
```sql
ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]
```
The ALTER TABLE statement applies to existing tables. After the hash clustering attribute is added, data written to new partitions is stored in compliance with hash clustering. You can then use the INSERT OVERWRITE statement to rewrite existing partitions and convert the source table to a hash cluster table.
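The conversion described above can be sketched as follows. The table and column names are hypothetical, carried over from an assumed `t_event` table:

```sql
-- Hypothetical example: add the hash clustering attribute to an existing table.
ALTER TABLE t_event
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 1024 BUCKETS;

-- New partitions now follow hash clustering automatically. To convert an
-- existing partition, rewrite it in place with INSERT OVERWRITE:
INSERT OVERWRITE TABLE t_event PARTITION (ds = '20240101')
SELECT user_id, event_time, event_type
FROM t_event
WHERE ds = '20240101';
```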
- To add data to a hash cluster table, you can use only the INSERT OVERWRITE statement. The INSERT INTO statement is not supported.
- You cannot run the tunnel upload command to import data into a range cluster table, because data uploaded by this command is not sorted.
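Given these limitations, new data must arrive through INSERT OVERWRITE rather than INSERT INTO or tunnel upload. A hypothetical load from a staging table (names assumed for illustration) might look like:

```sql
-- Hypothetical example: load a day's data into a hash cluster table by
-- overwriting the target partition; the clustering and sorting are applied
-- automatically during the write.
INSERT OVERWRITE TABLE t_event PARTITION (ds = '20240102')
SELECT user_id, event_time, event_type
FROM t_event_staging
WHERE ds = '20240102';
```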
Other table optimization methods
- Use intermediate tables: This method is suitable for tables that contain a large amount of data and involve many descendant jobs.
- Split tables: This method is applicable when data queries about specific fields are extremely slow. You can split such fields to form separate tables.
- Merge tables: As the data warehouse evolves, you can merge the jobs and data of tables whose business logic overlaps or is duplicated.
- Use zipper tables: The proper use of zipper tables can save storage space.
- Use the special features of MaxCompute tables.
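To illustrate the zipper-table approach mentioned above: instead of storing a full snapshot per day, each row carries a validity interval, so history is preserved at a fraction of the storage cost. The table and column names below are hypothetical:

```sql
-- Hypothetical zipper table: each row is valid from start_date (inclusive)
-- to end_date (exclusive); '99991231' marks the currently valid version.
CREATE TABLE dim_user_zipper (
    user_id    BIGINT,
    user_level STRING,
    start_date STRING,
    end_date   STRING
);

-- Query the state as of a given day by filtering on the validity interval:
SELECT user_id, user_level
FROM dim_user_zipper
WHERE start_date <= '20240101' AND end_date > '20240101';
```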