Incremental queries - MaxCompute - Alibaba Cloud Documentation Center

Transaction Table 2.0 supports incremental writes and incremental storage, which make incremental queries and incremental computing optimization possible. MaxCompute provides a dedicated SQL incremental query syntax on top of this capability to support near-real-time incremental data processing.

Process of an incremental query

The following figure shows the process of an incremental query on a Transaction Table 2.0 table.

After you enter an SQL statement, the engine parses the specified version range and locates all delta files within the specified time range. The engine then merges the data in those delta files to generate the output.
Clustering and compaction operations also generate new data files. These files only reorganize and optimize the original records and add no new logical records, so their records must not be output as new data. Incremental queries are designed accordingly: records that are generated by compaction and clustering are excluded from the result of an incremental query. As a result, MaxCompute does not read base files during an incremental query. It reads only the delta files within the specified time range and merges their data based on the specified policy to generate the output.
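The read path described above can be sketched as a small Python model. This is an illustration only, not the MaxCompute implementation: the `DataFile` class, its `kind` field, the integer versions, and the merge policy (the latest write for a given `pk` wins) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    kind: str      # "delta" (written by a transaction) or "base" (written by compaction)
    version: int   # hypothetical integer commit version
    rows: list     # list of (pk, val) tuples


def incremental_read(files, begin, end):
    """Merge rows from delta files whose version lies in (begin, end]."""
    # Base files are skipped: they only reorganize existing records.
    selected = sorted(
        (f for f in files if f.kind == "delta" and begin < f.version <= end),
        key=lambda f: f.version,
    )
    merged = {}
    for f in selected:
        for pk, val in f.rows:
            merged[pk] = val  # assumed policy: the latest write per pk wins
    return sorted(merged.items())


files = [
    DataFile("d1", "delta", 1, [(1, "a"), (2, "b")]),
    DataFile("d2", "delta", 2, [(2, "c")]),
    DataFile("b1", "base", 2, [(1, "a"), (2, "c")]),  # compaction output, no new logical records
    DataFile("d3", "delta", 3, [(3, "d")]),
]

print(incremental_read(files, 0, 2))
print(incremental_read(files, 2, 3))
```

In this model the base file `b1` never contributes to the output, even though its version falls inside the first range, which mirrors the rule that compaction results are not treated as new data.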

The preceding figure shows how to query data in a transaction table named src.

The schema of the table consists of a pk column and a val column.
The left part of the preceding figure shows the data change process. The time points t1 to t5 represent the time versions of transactions. Five data write transactions are performed, and five delta files are generated.
The compaction operation is performed at the time points t2 and t4, and two base files b1 and b2 are generated.
In this example, if the value in the Begin column is t1-1 and the value in the End column is t1, MaxCompute needs to read only the delta file d1, which was generated at t1, to generate an output. If the value in the End column is t2 instead, MaxCompute needs to read two delta files, d1 and d2. If the value in the Begin column is t1 and the value in the End column is t2-1, the query time range is (t1, t2). In this case, no rows are returned because no incremental data was inserted within that time range.
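The three ranges in this example can be checked with a short Python sketch. It assumes that the range excludes Begin and includes End, which is what the cases above imply. The numeric commit times and the `deltas_in_range` helper are hypothetical. The second case assumes that Begin stays at t1-1 while End moves to t2.

```python
# Hypothetical commit times for the five write transactions t1..t5.
T = {"t1": 10, "t2": 20, "t3": 30, "t4": 40, "t5": 50}

# Each write transaction produces one delta file at its commit time.
DELTAS = {"d1": T["t1"], "d2": T["t2"], "d3": T["t3"], "d4": T["t4"], "d5": T["t5"]}


def deltas_in_range(begin, end):
    """Return the names of delta files committed in (begin, end], oldest first."""
    ordered = sorted(DELTAS.items(), key=lambda item: item[1])
    return [name for name, ts in ordered if begin < ts <= end]


case_1 = deltas_in_range(T["t1"] - 1, T["t1"])   # Begin = t1-1, End = t1
case_2 = deltas_in_range(T["t1"] - 1, T["t2"])   # Begin = t1-1, End = t2
case_3 = deltas_in_range(T["t1"], T["t2"] - 1)   # Begin = t1,   End = t2-1

print(case_1)  # ['d1']
print(case_2)  # ['d1', 'd2']
print(case_3)  # []
```

The base files b1 and b2 do not appear in the model at all, because an incremental query never reads them.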