In large-scale data processing, operations like JOIN and GROUP BY often trigger massive data shuffling, leading to high CU consumption and slow job performance. To address this, optimize the physical layout of your data by using clustering, which can eliminate the need for shuffling. However, identifying the right tables and clustering keys can be complex. Clustering optimization recommendations automate this process by analyzing a table's historical read and write patterns to provide precise, actionable suggestions for implementing Hash Clustering. Applying these recommendations significantly reduces shuffle volume, which speeds up job execution and lowers CU costs.
Scope
Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), and China (Chengdu).
Clustering optimization recommendations are not supported for three-tier model projects.
The statistics cover the execution history of most jobs but may exclude complex jobs with multiple underlying execution tasks. Consequently, the recommendations might not cover every job.
For details on Hash Clustering, see Hash Clustering.
View clustering optimization recommendations
You can view recommended tables, the estimated benefits of clustering them, and detailed suggestions for any or all projects in the current region. This information helps you decide whether to adopt the clustering recommendations.
Log on to the MaxCompute console and select a region in the top-left corner.
In the navigation pane on the left, choose .
On the Clustering Optimization tab, click Estimated Benefits and use the filters to check for tables with clustering recommendations.
Parameter
Description
Project Name
Select a MaxCompute project. Defaults to All Projects if left unselected.
Table Name
Enter a table name. Fuzzy search is supported. Separate multiple table names with a comma (
,).Recommendation Generation Date
The date when the recommendation was generated. The default is the previous day.
Estimated benefit metrics
Metric
Description
Estimated Number of Beneficial Jobs/Day
The estimated number of jobs per day that would benefit from clustering the table.
Estimated Shuffle Reduction/Day
The estimated daily reduction in shuffle volume if the table is converted to a clustered table.
Reducing shuffle volume directly lowers the CU-hour consumption of jobs. As a rule of thumb, every 1 TB of shuffle reduction saves 2 to 4 CU-hours daily.Optimization recommendation list
Review the recommended values for the parameters in the list and check the suggestions for more optimization details about the table.
Column
Description
Project
The project that contains the recommended table.
Table Name
The name of the table recommended for clustering.
Type
The recommended clustering type for the table. Currently, only Hash Clustering recommendations are supported.
Suggested Cluster Key
The recommended ClusterKey for the table. This is primarily related to Shuffle Removal and data filtering for point lookups.
Suggested Sort Key
The recommended SortKey for the table. This is primarily related to data filtering and storage compression ratios.
Bucket Count
The recommended number of buckets for the table. This is primarily related to the parallelism of table write operations and table read jobs after Shuffle Removal.
Recommendation Index
A 1- to 5-star rating that indicates the potential impact of applying the recommendation. A higher score means a stronger recommendation. The score is calculated as follows:
Evaluation dimension
Deduction rule
Weight Coefficient
Timeliness of optimizable pattern
The observation window is less than 14 days.
Deduct 1 star
Shuffle data savings on read
The saved amount is less than 1 TB.
Deduct 1 star
Write Job Frequency
No write job records were found for the day.
Deduct 1 star
Number of optimizable partitions
If the number of optimizable partitions is greater than 3, deduct 1 star.
If the number of optimizable partitions is greater than 31, deduct 2 stars.
Dynamic adjustment
NoteThe score is reduced if no write jobs are recorded for the day, as the potential increase in write cost cannot be estimated.
A deduction is applied if there is a large number of optimizable partitions, as the clustering optimization will only take effect after rewriting a significant number of existing partitions with new data or through active rewriting.
Estimated Shuffle Reduction/Day
The estimated daily reduction in shuffle volume after converting the table to a clustered table.
Observation Interval
The number of consecutive days this recommendation has been generated.
Actions
Click a suggestion to go to the Details of Table with Optimization Recommendations page, which includes the following:
Optimization Recommendations
Current Status of Table
Overview of Estimated Benefits
Estimated Shuffle Reduction/Day
Beneficial Table Read Jobs
Full Table Write Jobs
Full Table Read Jobs
Apply clustering optimization recommendations
Apply recommendations directly to the original table
Using the UI
For partitioned tables, you can apply a recommendation in a single click to convert the original table into a clustered table.
In the navigation pane on the left, choose .
On the Clustering Optimization tab, click Estimated Benefits.
Click View Details in the Actions column for the target table to open the Details of Table with Optimization Recommendations page.
In the upper-right corner, click Apply Recommendations to complete the conversion.
Using SQL commands
Run the following command:
-- Change a table to a Hash Clustering table. ALTER TABLE <table_name> [CLUSTERED BY (<col_name> [, <col_name>, ...]) [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])] INTO <number_of_buckets> BUCKETS];After the conversion, inspect the clustered table and run its associated read jobs to verify that they operate as expected. If any issues arise, perform a rollback immediately by running the following command.
-- Change a Hash Clustering table back to a non-clustered table. ALTER TABLE <table_name> NOT CLUSTERED;
After converting a table to a clustered table, you can no longer perform incremental write operations such as
INSERT INTOor Tunnel uploads.You cannot directly apply clustering recommendations to a non-partitioned table.
Clustering increases the latency and CU cost of write operations but reduces them for read operations, leading to a net reduction in overall CU consumption.
For certain scenarios, rewrite partitions and verify the benefits for downstream optimizable jobs. For more information, see (Recommended) Rewrite partition data.
(Recommended) Rewrite partition data
Changes to clustering properties only affect new data. To apply optimizations to existing data, you must rewrite the relevant partitions. Rewriting partitions is necessary in the following scenarios to verify the benefits for downstream jobs:
The table is a dependency for high-priority, latency-sensitive jobs.
The table has large partitions, with single write operations exceeding 10 TB.
Downstream jobs read from multiple partitions. To gain the full optimization benefit, all relevant partitions read by the job must be rewritten.
If the table is a large table in a daily full-and-incremental merge-write job, rewrite the last day's partition. This mitigates the risk of increased costs and slower run times when the full-and-incremental job runs for the first time on the next day.
-- Assume the partition key column is ds, and new partitions such as 20241015 and 20241016 are added daily. The new partition data is derived by merging incremental data from the previous day's partition. INSERT OVERWRITE TABLE <table_name> PARTITION(ds) SELECT * FROM <table_name> WHERE ds = max_pt('<table_name>');If an optimizable job reads from multiple existing partitions, rewrite the existing partitions within the read range to apply the clustering optimization sooner.
-- Assume the partition key column is ds, and new partitions such as 20241015, 20241016, and so on are added daily. The read range starts from 20241015. INSERT OVERWRITE TABLE <table_name> PARTITION(ds) SELECT * FROM <table_name> WHERE ds >='20241015';After rewriting, run a trial of the optimizable job to verify that the optimization is effective.
Apply recommendations to a new table
For non-partitioned tables, you cannot directly modify the clustering attributes of the original table. Therefore, you must manually apply the changes by creating a new clustered table.
Create a cluster table.
-- Check the CREATE TABLE statement of the existing table. SHOW CREATE TABLE <orignal_table>; -- Insert the CLUSTER information into the appropriate position in the CREATE TABLE syntax to create a new clustered table. CREATE TABLE <new_table> [CLUSTERED BY (<col_name> [, <col_name>, ...]) [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])] INTO <number_of_buckets> BUCKETS];Load data into the new table.
INSERT OVERWRITE TABLE <new_table> SELECT * FROM <orignal_table>;Back up the original table.
ALTER TABLE <orignal_table> RENAME TO <orignal_table_backup>;Rename the new table to the original table name.
ALTER TABLE <new_table> RENAME TO <orignal_table>;
The new table now has the same name as the original table and will cluster data according to its new attributes.
Rollback for new table application
Inspect the new clustered table to ensure all jobs run as expected. If any issues arise, perform a rollback immediately by running the following commands.
Delete the new table (which has the same name as the original table).
DROP TABLE IF EXISTS <orignal_table>;Rename the backup table to the original table name.
ALTER TABLE <orignal_table_backup> RENAME TO <orignal_table>;
View clustering optimization benefits
On the Clustering Optimization tab, click Actual Benefits to view the gains from clustering optimization.
Log on to the MaxCompute console and select a region in the top-left corner.
In the navigation pane on the left, choose .
On the Clustering Optimization tab, click Actual Benefits.
Filter by Project Name and Analysis Time to view a summary and detailed breakdown of the benefits from tables with modified clustering attributes.
Benefit indicator descriptions
Metric
Description
Number of Benefited Jobs
The number of times that recently modified clustered tables were read during the benefit analysis period.
Saved CU-hours
The CU-hour reduction for jobs that read the modified clustered tables, compared to their consumption before the change.
Reduced Shuffle Amount
The shuffle volume reduction for jobs that read the modified clustered tables, compared to their consumption before the change.
Benefit statistics are calculated by comparing the average consumption of jobs with the same signature before and after the recommended changes were applied. The statistics cover clustered tables modified based on recommendations within the last 365 days.
Optimized list
Column
Description
Project
The project containing the clustered table with modified attributes.
Table Name
The name of the table with modified clustering attributes.
Clustering Attribute Modification Time
The date when the table's clustering attributes were last modified.
Number of Benefited Jobs
The number of times the table was read during the benefit analysis period after its clustering attributes were modified.
Saved Computing Duration
The reduction in computing duration for jobs that read this table during the analysis period, compared to before the attributes were modified.
Saved CU-hours
The reduction in CU-hour consumption for jobs that read this table during the analysis period, compared to before the attributes were modified.
Reduced Shuffle Amount
The reduction in shuffle volume for jobs that read this table during the analysis period, compared to before the attributes were modified.
Actions
Click a suggestion to go to the Details of Optimized Table page, which includes the following information:
Current Status of Table
Clustering Type
ClusterKey
SortKey
Number of Buckets
Partition Key
Benefit Overview
Number of Benefited Jobs
Saved CU-hours
Reduced Shuffle Amount
A list of benefited read jobs
Signature
Saved Computing Duration
Saved CU-hours
Reduced Shuffle Amount
ImportantThe daily job benefit statistics are updated on a T+1 basis. You can view real-time optimization effects in Job Monitoring or Logview.
Benefit statistics are based on historical job signatures and can be affected by daily performance fluctuations. If the benefits do not meet expectations, compare the job execution details from different dates to identify influencing factors.
These statistics are for reference only. Final CU savings are reflected in the monthly bill.