All Products
Search
Document Center

DataWorks:Overview of the full and incremental synchronization feature

Last Updated:Aug 28, 2023

DataWorks allows you to create different types of full and incremental synchronization tasks to synchronize data between different data sources in various data synchronization scenarios, such as real-time synchronization, batch full synchronization, and batch incremental synchronization. Full and incremental synchronization tasks help you migrate your business data to the cloud in an efficient and convenient manner.

Background information

In some business scenarios, data cannot be synchronized by using only one or more simple batch or real-time synchronization tasks. Instead, multiple batch synchronization tasks, real-time synchronization tasks, and data processing tasks are required to synchronize data. In this case, complex configurations are required.

To resolve this issue, DataWorks provides a scenario-specific synchronization solution that allows you to synchronize data between different types of data sources by using simple configurations. For example, you can create a one-click real-time synchronization task to easily synchronize data to Elasticsearch, Hologres, or MaxCompute. This simplifies data synchronization.

Note

For example, a large amount of data is stored in your database system, and you want to synchronize full and incremental data from your database to MaxCompute for analysis. The traditional data synchronization method allows you to perform full synchronization or perform incremental synchronization based on fields such as modify_time in database tables. However, these fields may not exist in database tables in an actual scenario. Therefore, you cannot use the Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. To synchronize the full and incremental data to MaxCompute, you can create a one-click real-time synchronization task. After the synchronization is complete, the full and incremental data is automatically merged in MaxCompute. This simplifies data synchronization.

A full and incremental synchronization task has the following benefits:

  • Synchronizes full data at a time.

  • Synchronizes incremental data in real time.

  • Automatically merges incremental and full data on a regular basis and writes the merged data to the related partition in a table that is used to store full data.

Overview

The following figure shows the capabilities of the full and incremental synchronization feature.Architecture

Capability

Description

Data synchronization from or to data sources that are deployed in complex network environments

The full and incremental synchronization feature supports data synchronization from or to Alibaba Cloud databases, data centers, self-managed data sources that are hosted on Elastic Compute Service (ECS) instances, and data sources that do not belong to Alibaba Cloud. You can select appropriate network connectivity solutions to establish network connections between your resource group and data sources based on the network environments in which the data sources are deployed. Before you configure a full and incremental synchronization task, you must make sure that network connections are established between your resource group for Data Integration and data sources. For more information about how to establish a network connection between a resource group and a data source, see Establish a network connection between a resource group and a data source.

Data synchronization scenarios

The full and incremental synchronization feature supports the synchronization of data from a single table to another single table, from tables in sharded databases to a single table, and from multiple tables in a database to multiple tables.

DataWorks allows you to create the following types of full and incremental synchronization tasks: batch synchronization of all data in a database (one-time full synchronization, periodic full synchronization, one-time full synchronization and periodic incremental synchronization, one-time incremental synchronization, and periodic incremental synchronization), and one-click real-time synchronization (one-time full synchronization and real-time incremental synchronization). For more information, see Supported data source types and data synchronization solutions.

Configurations for full and incremental synchronization tasks

For information about how to configure a full and incremental synchronization task, see Configure a synchronization task in Data Integration. For information about the capabilities that you can use when you configure a full and incremental synchronization task, see the Capabilities that you can use when you configure a full and incremental synchronization task section in this topic.

O&M for full and incremental synchronization tasks

  • The full and incremental synchronization feature allows you to easily add tables to or remove tables from a full and incremental synchronization task that is run. For more information, see Manage a full and incremental synchronization task.

  • Data Integration allows you to view the status and resource usage of a full and incremental synchronization task within a specified period of time. Data Integration also allows you to view the status distribution and data synchronization speed of the batch and real-time synchronization subtasks generated by the synchronization task, the amount of data and number of data records synchronized by the subtasks, and the latency of the real-time synchronization subtask that is running. For more information, see View the status overview of synchronization tasks.

  • A full and incremental synchronization task can generate a real-time synchronization subtask or multiple batch synchronization subtasks based on the attributes that you configure for the synchronization task. You can configure monitoring and alerting rules for the subtasks. For information about how to configure alert rules for a real-time synchronization subtask, see O&M for real-time synchronization nodes. For information about how to configure alert rules for a batch synchronization subtask, see O&M for batch synchronization nodes.

Capabilities that you can use when you configure a full and incremental synchronization task

Capability

Description

Refresh mappings

Click Refresh Mappings Between Source Tables and Destination Tables. Then, the system displays the mappings between source tables and destination tables. In the preceding figure, hudi_b.tb_order_1 is an existing destination Hologres table, and hudi_b.tb_order_2 and hudi_b.tb_order_3 are destination Hologres tables that are automatically created.

View or modify the schema of a destination table

Find a mapping record and click the name of the destination table to open the dialog box that displays the schema of the table. In the dialog box, you can modify the schema of the table based on your business requirements.

In the preceding figure, the add_col field is added to the automatically created destination Hologres table hudi_b.tb_order_3, the data type of the field is set to TEXT, and the description of the field is set to Add a field to an automatically created destination table. After the modification is complete, click Apply and Refresh Mapping to save the modifications.

Important

When you modify the schema of an automatically created destination table, you must take note of the following items about the fields that have the same names as fields in the source table:

  • You cannot delete these fields or adjust their order.

  • You cannot change the names of these fields.

  • You can modify the configurations such as the data types and descriptions of these fields. You can also specify whether to use one of the fields as a primary key.

In the preceding figure, the add_col field is added to the existing destination Hologres table hudi_b.tb_order_1, the data type of the field is set to TEXT, and the description of the field is set to Add a field to an existing destination table. After the modification is complete, click Apply and Refresh Mapping to save the modifications.

Important

When you modify the schema of an existing destination table, you must take note of the following items:

  • For existing fields in the destination table, you can modify only the descriptions of the fields.

  • You can add fields to the destination table.

Modify the schemas of multiple destination tables at a time

Select multiple mapping records and click Batch Modify Table Schema. In the dialog box that appears, you can modify the schemas of the destination tables at a time. After the modification is complete, click Apply and Refresh Mapping to save the modifications.

Important
  • After you click Batch Modify Table Schema, the system displays the fields contained in all the selected destination tables in the dialog box that appears.

  • In the dialog box that appears, you can add fields. The preceding figure shows that the batch_add_col field is added to all the selected destination tables.

Then, click the name of a destination table on which the batch operation is performed to view the new schema of the table.

Specify the name of a destination schema or table

By default, data is written to the destination schema and table that are named the same as the source database and table. If no such destination schema or table exists, the system automatically creates the schema or table in the destination. You can specify the name of the destination schema or table to which you want to write data. For more information, see Configure rules for mapping databases or tables.

Note
  • You can use a regular expression to map the names of the destination schemas or tables to the names of source schemas or tables.

  • You can also use built-in variables to specify the names of the destination tables and add a prefix or a suffix to the names of the destination tables.

Assign values to the fields that are newly added to a destination table

By default, source fields specified in a full and incremental synchronization task are mapped to destination fields that are named the same as the source fields. The values of the source fields are written to the destination fields that are named the same as the source fields. You can add fields to a destination table and assign constants or variables to the fields as values. You can find the desired mapping record and click Edit in the Value Assignment for Destination Field column. In the dialog box that appears, the new schema of the destination table to which fields are added is displayed.

  • The id and name fields in the hudi_b.tb_order_3 destination Hologres table are also contained in the mapped source table. In this case, the fields are automatically assigned the same values as the values of the mapped source fields, and the values assigned to the destination fields cannot be changed.

  • add_col and batch_add_col are the fields that are newly added to the hudi_b.tb_order_3 destination Hologres table. You can assign values to the fields.

    • Manually assign a value to a field: You can enter a constant string in the Assignment column of the add_col field. This way, the constant string is written to the add_col field during data synchronization.

    • Use a variable as the value of a field: You can select a built-in variable that represents specific business meaning from the Assignment column of the batch_add_col field. This way, the value that is indicated by the variable is written to the batch_add_col field during data synchronization.

Note

Descriptions for the supported variables:

  • DATASOURCE_NAME_SRC: the name of a source

  • DB_NAME_SRC: the name of a source database

  • TABLE_NAME_SRC: the name of a source table

  • DATASOURCE_NAME_DEST: the name of a destination

  • DB_NAME_DEST: the name of a destination database

  • TABLE_NAME_DEST: the name of a destination table

  • DB_NAME_SRC_TRANSED: the name of a source database after conversion

  • EXECUTE_TIME: the time when the event for a change operation on a source occurs

Configure rules to process DDL or DML messages

When you configure a full and incremental synchronization task, you can configure rules to process different types of DDL messages for destination tables. For information about the DDL and DML operations supported by different types of destinations, see Supported DML and DDL operations.

Note
  • When you configure a batch synchronization task used to synchronize all data in a database, you cannot configure rules to process DDL or DML messages.

  • DataWorks allows you to configure a logical deletion rule for a destination table to process DML messages for deletion operations. After you configure such a rule for a destination table, the system automatically adds the _data_integration_deleted_ field to the table and sets the data type of the field to BOOLEAN. If a deletion operation is performed on a source table in a synchronization task used to synchronize data to Hologres, Hologres automatically converts the original SQL statement that is used to write data to the mapped destination Hologres table to update tb_xxx set _data_integration_deleted_=true where id=xxx and executes the new statement to write data to the mapped destination Hologres table. Logical deletion is implemented during the data write process.

  • After you configure a logical deletion rule for a destination Hologres table, you can view the schema of the table.