MaxCompute: Overview of Delta tables

Last Updated: May 31, 2024

As data processing scenarios become more complex, many business scenarios do not require second-level visibility of updated data or row-level updates. Instead, they require minute-level or hour-level near-real-time data processing combined with batch processing of large amounts of data. MaxCompute provides Delta tables to meet these requirements for storing and processing full and incremental data in near real time. This topic describes the business pain points in big data storage and computing, and the architecture and main features of Delta tables.

Current situation analysis

In low-timeliness scenarios in which large amounts of data are processed in batches, MaxCompute alone can meet business requirements. In high-timeliness scenarios that require second-level real-time or streaming processing, a dedicated real-time or streaming system is required. In comprehensive scenarios that combine minute-level or hour-level near-real-time processing with batch processing of large amounts of data, specific issues occur regardless of whether you use a single engine or multiple federated engines.

(Figure: issues of batch-only, real-time-only, and Lambda architectures in comprehensive business scenarios)

The preceding figure shows the issues that occur in these comprehensive scenarios:

  • If you use only MaxCompute for batch processing, scenarios in which minute-level incremental data must be continuously merged with full data incur additional computing and storage costs. Scenarios in which complex data processing links and processing logic must be converted into T+1 batch jobs increase link complexity, and the timeliness cannot meet business requirements.

  • If you use only a real-time data processing system, resource costs are high, cost efficiency is low, and batch processing of large-scale data is unstable.

  • In most cases, the Lambda architecture is used as a solution: MaxCompute performs batch processing of full data, and a real-time data processing system processes incremental data to meet high timeliness requirements. However, the Lambda architecture has known issues, such as data inconsistency between multiple sets of processing and storage engines, additional costs from redundant storage and computing of multiple copies of data, a complex architecture, and a long development cycle.

To address the preceding issues, the big data open source ecosystem has launched various solutions in recent years. The most popular solution deeply integrates an open source compute engine, such as Spark, Flink, or Presto, with an open source data lake format, such as Hudi, Delta Lake, or Iceberg, to unify the compute engine and data storage. This helps resolve the series of issues caused by the Lambda architecture. Following the same approach, an incremental data storage and processing architecture was developed based on the architecture of MaxCompute. The architecture provides an integrated solution for batch processing and near-real-time incremental processing: it retains the cost-effectiveness of batch processing while meeting the requirements for minute-level incremental data reading, writing, and processing. The architecture also provides practical features, such as the UPSERT operation and the time travel feature, to expand business scenarios. This helps reduce data computing, storage, and migration costs and improves user experience.

Integrated architecture for full and incremental data storage and processing

(Figure: integrated architecture for full and incremental data storage and processing)

The preceding figure shows the new architecture in which MaxCompute efficiently supports the preceding comprehensive business scenarios:

  • MaxCompute supports various data sources, so you can easily import incremental and full data into a unified storage system by using customized access tools.

  • The background data management service automatically optimizes the data storage structure.

  • A unified compute engine supports both near-real-time incremental data processing and batch processing of large-scale data.

  • A unified metadata service supports transaction management and file metadata management.

The new architecture resolves the redundant computing and storage and the low timeliness that occur when only a batch processing system is used, prevents the high resource consumption of real-time or streaming systems, eliminates the data inconsistency between the multiple systems of the Lambda architecture, and reduces the cost of storing multiple redundant copies of data and of migrating data between systems. This end-to-end integrated architecture meets the business requirements for computing and storage optimization of incremental data processing and for minute-level timeliness, ensures the overall efficiency of batch processing, and effectively reduces resource costs.

Core features

Before you use Delta tables, you can learn about the architecture of Delta tables to understand how to configure appropriate parameters in different business scenarios, as illustrated in the sketch that follows. This way, you can reduce storage and computing costs and improve overall link performance. For more information about how to use Delta tables, see Basic operations.
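
For reference, the following is a minimal sketch of how a Delta table is created. The table name and columns are illustrative assumptions, and the write.bucket.num and acid.data.retain.hours property values are configuration examples whose exact names and defaults you should verify in Basic operations:

    -- A minimal sketch, assuming the table properties described in
    -- Basic operations; property names and default values may differ.
    CREATE TABLE delta_demo (
        pk  BIGINT NOT NULL PRIMARY KEY,  -- primary key enables row-level updates
        val STRING
    )
    TBLPROPERTIES (
        "transactional" = "true",          -- marks the table as a Delta table
        "write.bucket.num" = "16",         -- assumed bucket count for parallel writes
        "acid.data.retain.hours" = "24"    -- assumed retention window for time travel
    );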

Delta tables provide the following core features:

  • Table data format: A Delta table is a new type of table that efficiently supports the storage, reading, and writing of incremental and full data. If you specify a primary key for a Delta table, data in the table can be updated in real time.

  • Near-real-time data update: You can perform the UPSERT or DELETE operation to import incremental data to a table within minutes.

  • Incremental queries: The SQL syntax can be used to query incremental data.

  • SQL syntax: A full set of SQL syntax supports all features of the new architecture, such as create, read, update, and delete (CRUD) statements, time travel, incremental queries, and the configuration of table primary keys and properties (see the sketch after this list).

  • Features related to data organization optimization:

    • Clustering: A large number of small incremental files can be automatically merged. This helps you prevent issues such as a heavy load on storage and low I/O efficiency.

    • Compaction: The intermediate historical data status can be manually or automatically compacted. This helps you effectively reduce data storage and computing costs and improve data query efficiency.

    • Data reclamation: MaxCompute can automatically reclaim expired data and operation logs. This helps you reduce storage costs.

  • Time travel: The SQL syntax can be used to query historical snapshots. This helps you trace the historical data status of your business and restore data that contains errors.

  • Transaction management: The Multi-Version Concurrency Control (MVCC) model supports snapshot isolation so that read and write operations do not affect each other. The transaction management mechanism supports conflict detection and automatic retry optimization for parallel write operations.
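
The following hedged sketch illustrates several of these features against the hypothetical delta_demo table from the earlier example. The timestamps are placeholders, and the exact time travel and incremental query clauses should be verified in Basic operations:

    -- Near-real-time updates: on a primary key table, INSERT INTO
    -- behaves as an UPSERT; UPDATE and DELETE work at row level.
    INSERT INTO delta_demo VALUES (1, 'a'), (2, 'b');
    UPDATE delta_demo SET val = 'c' WHERE pk = 1;
    DELETE FROM delta_demo WHERE pk = 2;

    -- Time travel: query a historical snapshot at a given point in time.
    SELECT * FROM delta_demo TIMESTAMP AS OF '2024-05-01 00:00:00';

    -- Incremental query: read only the changes between two points in time.
    SELECT * FROM delta_demo
    TIMESTAMP BETWEEN '2024-05-01 00:00:00' AND '2024-05-01 01:00:00';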

Benefits

The new architecture provides the common features of the open source data lake formats Hudi and Iceberg to support their business scenarios and to ease migration from them. In addition, the self-developed architecture provides the following benefits in terms of features, performance, stability, and integration:

  • Provides a unified design for storage, metadata, and compute engines so that the engines are deeply and efficiently integrated. This yields low storage costs, efficient data file management, and high query efficiency. In addition, a large number of optimization rules for MaxCompute batch queries can be reused by time travel and incremental queries.

  • Provides a full set of unified SQL syntax to support all features of the new architecture. This facilitates user operations.

  • Provides in-depth customized and optimized data import tools to support various complex business scenarios.

  • Seamlessly integrates with existing business scenarios of MaxCompute to reduce migration, storage, and computing costs.

  • Supports automatic management of data files to ensure better read and write stability and supports automatic optimization of storage efficiency and costs.

  • Is fully managed on MaxCompute. You can use the new architecture out of the box without additional access costs. You only need to create a Delta table to use the features of the new architecture.

  • Is a self-developed architecture on which you can manage data development based on your business requirements.