This topic describes the architecture used to optimize data organization in Append Delta Tables.
Overview
The innovative table format of Append Delta Tables uses a Range Clustering structure for underlying data organization. By default, Row_ID is the cluster key, and the number of buckets is dynamically allocated as data grows. After a cluster key is specified, a background clustering job performs incremental reclustering on the data. This process ensures that all data remains ordered.
Append Delta Tables perform well in complex business scenarios. Their significant performance improvements highlight the core value of optimizing data storage formats for big data analytics. The technical benefits and performance optimizations are summarized as follows:
Data autonomy: Achieves a dynamic balance between storage efficiency and query performance through background tasks such as Merge, Compaction, and Reclustering.
Scalability: Supports seamless scaling from terabytes to exabytes of data with dynamic bucketing and Auto-Split/Merge policies.
Real-time clustering: Incremental reclustering provides millisecond-level data freshness and accelerates clustered queries in the Operational Data Store (ODS) layer.
Dynamic bucketing
Challenges
To create a Range/Hash Cluster table, you must first estimate the data scale for your business. Then, you set an appropriate number of buckets and a cluster key based on the estimate. After the table is created, MaxCompute uses a clustering algorithm to route data to the correct buckets based on the cluster key.
This can lead to two problems:
Data skew: If the amount of data is too large and the number of buckets is too small, individual buckets become oversized. This reduces the effectiveness of data pruning during queries.
Data fragmentation: If the number of buckets is much larger than required for the data volume, each bucket contains too little data. This creates many small, fragmented files and harms query performance.
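The tradeoff behind these two problems can be shown with some toy arithmetic. The data volume and bucket counts below are illustrative assumptions, not recommended values:

```python
# Toy arithmetic for the two failure modes of a static bucket count.
# The numbers here are illustrative assumptions only.
data_mb = 1_000_000  # suppose the table grows to ~1 TB

few_buckets = 64          # bucket count chosen too small
many_buckets = 100_000    # bucket count chosen too large

per_bucket_few = data_mb / few_buckets    # ~15,625 MB per bucket: oversized
per_bucket_many = data_mb / many_buckets  # 10 MB per bucket: fragmented

print(per_bucket_few)   # oversized buckets weaken pruning (data skew)
print(per_bucket_many)  # tiny buckets create many small files (fragmentation)
```

Either extreme hurts query performance, and the "right" count shifts as the table grows.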
Explicitly specifying the number of buckets when creating a table is challenging for users. To set a suitable number of buckets based on data volume, a user must understand their business usage patterns and the underlying MaxCompute table format. Only then can they use clustering features correctly and maximize query performance.
For large-scale data migrations, the potential data volume of each table must be evaluated. This evaluation is manageable for a small number of tables. However, it becomes very difficult to execute when dealing with thousands or tens of thousands of tables.
Even if a user accurately assesses the current data scale of a table, the actual data scale will change as the business evolves. A bucket count that is suitable today may not be suitable in the future.
A static bucket count configuration cannot effectively support large-scale data migrations or rapidly changing business environments. A better approach is for the platform to dynamically set the number of buckets based on the actual data volume. This frees users from managing the underlying bucket count, which lowers the learning curve and allows the system to better adapt to changing data scales.
Solution
The Append Delta Table format was designed to support the dynamic allocation of buckets. All data in the table is automatically divided into buckets. Each bucket is a logically contiguous storage unit that contains about 500 MB of data.
You do not need to specify the number of buckets for a table when you create it and write data. As data is continuously written, new buckets are automatically created as needed. This eliminates concerns about data skew or fragmentation caused by buckets becoming too large or too small as the data volume changes.
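The idea can be sketched as follows. This is a minimal model of on-demand bucket allocation, assuming the ~500 MB target size stated above; the names and allocation policy are illustrative, not the MaxCompute implementation:

```python
# Sketch of dynamic bucket allocation: the bucket count tracks the actual
# data volume, so users never specify it. Names and the allocation policy
# are illustrative assumptions.

TARGET_BUCKET_MB = 500  # target size per bucket, per the table format


def allocate_buckets(total_mb: int) -> int:
    """Return the number of ~500 MB buckets needed for total_mb of data."""
    return max(1, -(-total_mb // TARGET_BUCKET_MB))  # ceiling division


class AppendTable:
    """Toy model: buckets are created on demand as data is appended."""

    def __init__(self) -> None:
        self.total_mb = 0

    def append(self, mb: int) -> int:
        """Append data and return the bucket count after the write."""
        self.total_mb += mb
        return allocate_buckets(self.total_mb)


table = AppendTable()
print(table.append(200))    # 200 MB total  -> 1 bucket
print(table.append(800))    # 1,000 MB total -> 2 buckets
print(table.append(9_000))  # 10,000 MB total -> 20 buckets
```

As writes accumulate, the bucket count grows with the data, so no bucket becomes oversized and no fixed count has to be guessed up front.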
The workflow is shown in the following diagram:

Incremental reclustering
Challenges
Clustering is a common data optimization method. A cluster key is a user-specified table property: the specified fields are sorted and stored contiguously. When a query filters on the cluster key, optimizations such as predicate pushdown and pruning narrow the data scan range, which improves query efficiency.
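The pruning effect can be sketched with a toy model. Here each bucket tracks the (min, max) range of its sorted cluster key, and a range predicate only scans overlapping buckets; the structure and names are assumptions for illustration:

```python
# Illustrative sketch of bucket pruning on a cluster key: each bucket keeps
# the (min, max) range of its sorted key, and a predicate only scans the
# buckets whose range overlaps. Names and structure are assumptions.

buckets = [
    {"id": 0, "min": 0,   "max": 99},
    {"id": 1, "min": 100, "max": 199},
    {"id": 2, "min": 200, "max": 299},
]


def prune(buckets: list, lo: int, hi: int) -> list:
    """Return only the buckets whose key range overlaps [lo, hi]."""
    return [b for b in buckets if b["max"] >= lo and b["min"] <= hi]


# A query on cluster_key BETWEEN 120 AND 150 scans just one bucket.
print([b["id"] for b in prune(buckets, 120, 150)])  # -> [1]
```

Because the data is sorted, two of the three buckets are skipped entirely; on a large table this shrinks the scan range dramatically.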
Previously, MaxCompute provided Range Clustering and Hash Clustering. These features support data bucketing by range or hash and sort the data within each bucket. This accelerates queries by pruning buckets and data within buckets during the query process. The process is shown in the following diagram:

Problem 1: High cost of appending data
A limitation of tables that use Range/Hash Clustering is that data must be bucketed and sorted during the write process to achieve a globally sorted state. This restricts how data can be written: data must be written in a single operation (INSERT INTO | INSERT OVERWRITE). After the initial write is complete, appending more data requires reading all existing data from the table, combining it with the new data (UNION), and rewriting the entire dataset. This makes appending data very expensive and inefficient.
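The cost of this write pattern can be made concrete with a toy model. This sketch assumes only the access pattern described above (read everything, merge, rewrite), not any MaxCompute internals:

```python
# Sketch of why appends to a globally sorted (Range/Hash Clustered) table
# are expensive: keeping global order requires reading the whole table,
# merging in the new rows, and rewriting everything. Toy model only.

def append_to_sorted(existing: list, new_rows: list) -> tuple:
    """Rewrite the entire dataset to keep it globally sorted.

    Returns the new table and the number of rows rewritten, which is
    len(existing) + len(new_rows) even for a tiny append.
    """
    merged = sorted(existing + new_rows)  # UNION + full re-sort
    return merged, len(merged)


table = list(range(0, 1_000_000, 2))  # 500,000 existing sorted rows
table, rewritten = append_to_sorted(table, [3, 7])
print(rewritten)  # 500,002 rows rewritten just to append 2 rows
```

The rewrite cost scales with the full table size, not the size of the append, which is why this model cannot sustain continuous ingestion.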
Businesses typically do not use clustering on tables in the Operational Data Store (ODS) layer. This is because data in the ODS layer is close to raw business data and is often imported continuously through external collection pipelines. This process requires high data import performance. The costly write model of traditional clustered tables cannot meet the low-latency, high-throughput write requirements.
Problem 2: Data freshness latency in the DW layer
Therefore, businesses tend to set cluster keys on tables in the data warehouse (DW) layer instead. Data from the previous data timestamp in ODS tables is cleaned and then imported into the more stable DW layer, which accelerates subsequent queries.
However, this approach introduces a delay in data freshness in the DW layer. To avoid write amplification from repeated updates, the DW layer is usually updated only after the ODS layer data stabilizes. This causes data queried from the DW layer to have a data timestamp lag. Some scenarios have extremely high requirements for both query performance and data freshness. These scenarios require clustering on the ODS layer to accelerate queries and obtain real-time information.
Therefore, the original MaxCompute solution of performing synchronous clustering during data writes cannot meet user demands for real-time performance.
Solution
The incremental clustering feature of Append Delta Tables uses a background data service to perform incremental clustering asynchronously. This achieves an optimal balance among data import performance, data freshness, and query performance.
As shown in the following diagram, data is imported into MaxCompute using streaming writes. During the write phase, data is written directly to disk unsorted and allocated to buckets. This method maximizes write throughput and minimizes latency. Because the newly written data is not yet clustered, the data ranges of the new buckets overlap with those of existing clustered buckets. When a query runs, the SQL engine prunes the clustered buckets as usual but must scan the not-yet-clustered incremental buckets in full.

A background data service in MaxCompute continuously monitors the Bucket Overlap Depth. When the overlap depth reaches a specific threshold, the service triggers incremental reclustering, which reclusters only the newly written buckets. Because the bulk of the data remains ordered, overall query performance stays stable.
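A minimal sketch of this overlap-depth check follows. The threshold value, names, and the sweep-line computation are illustrative assumptions; the source only states that reclustering triggers once overlap reaches a threshold:

```python
# Sketch of the background check described above: compute the maximum
# "overlap depth" (how many bucket key ranges cover the same point) and
# trigger incremental reclustering when it reaches a threshold. The
# threshold value and names are illustrative assumptions.

OVERLAP_THRESHOLD = 3


def max_overlap_depth(ranges: list) -> int:
    """Max number of (min, max) bucket ranges covering any single key."""
    events = []
    for lo, hi in ranges:
        events.append((lo, 1))   # range opens
        events.append((hi, -1))  # range closes (ends are inclusive)
    depth = best = 0
    # At equal coordinates, count opens before closes (inclusive ends).
    for _, delta in sorted(events, key=lambda e: (e[0], -e[1])):
        depth += delta
        best = max(best, depth)
    return best


# Clustered buckets are disjoint; freshly written buckets overlap them.
clustered = [(0, 99), (100, 199), (200, 299)]
incremental = [(50, 250), (120, 180)]

depth = max_overlap_depth(clustered + incremental)
print(depth)                       # -> 3
print(depth >= OVERLAP_THRESHOLD)  # -> True: trigger incremental reclustering
```

Once the incremental buckets are reclustered, the ranges become disjoint again, the overlap depth drops back to 1, and queries regain full pruning over the whole table.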