Performance optimization for a data warehouse primarily involves optimizing tables and data distribution.
Hash clustering
Hash clustering tables offer advantages such as bucket pruning optimization, aggregation optimization, and storage optimization. When you create a table, you can specify a Hash Key using the clustered by clause. MaxCompute then performs a hash operation on the specified column and distributes the data into different buckets based on the hash values. The best practice is to select a column with few repeated key values for the Hash Key. For more information about how to use hash clustering tables, see Table operations.
To convert a table to a hash clustering table:
ALTER TABLE table_name [CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS]You can use the ALTER TABLE statement on existing tables. After you add a new clustering property, data in new partitions is stored using hash clustering.
After you create a hash clustering table, use the INSERT OVERWRITE statement to populate it with data from a source table.
Hash clustering tables have the following limits:
The
INSERT INTOstatement is not supported. You must useINSERT OVERWRITEto write data.Do not use the tunnel upload command to directly import data into a range cluster table because data uploaded using Tunnel is not sorted.
Other table optimization techniques
Follow the Table design specifications. In addition, you can use the following techniques to optimize your tables:
Use intermediate tables for tables that have a large data volume and many downstream tasks.
Split tables. If generating specific fields is extremely slow, you can split these fields into separate tables.
Merge tables. As your data warehouse evolves, you can merge tables that have overlapping or duplicate business logic to consolidate tasks and data.
Zipper tables: You can use zipper tables to effectively reduce your storage consumption.
Use the special features of MaxCompute tables.