DataWorks Data Integration allows you to create a synchronization task to synchronize full and incremental data in a database to MaxCompute in quasi real time. When the synchronization task runs, it first synchronizes the full data in a single pass and then synchronizes incremental data to MaxCompute in real time. This topic describes how to create such a synchronization task.
Prerequisites
The data sources that you want to use are prepared. Before you configure a synchronization task, you must prepare the data sources from which you want to read data and to which you want to write data. This way, when you configure a synchronization task, you can select the data sources. For information about the supported data source types and the addition of data sources, see Supported data source types and synchronization operations.
Note: For information about the items that you need to understand before you add a data source, see Overview.
The data source environments are prepared. Before you configure a synchronization task, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.
An exclusive resource group for Data Integration or a serverless resource group that meets your business requirements is purchased, and the resource group is associated with your workspace. For more information, see Use serverless resource groups or Create and use an exclusive resource group for Data Integration.
Network connections are established between the resource group and the data sources. For more information, see Network connectivity solutions.
Background information
After you run the synchronization task created in this example, a merge task is automatically generated to merge full and incremental data. Full data is written to MaxCompute base tables in a single pass, and incremental data is written to a MaxCompute log table in real time. Once a day, the system runs the merge task to merge the full and incremental data and writes the merged data to the MaxCompute base tables.
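The merge described above can be pictured with a minimal sketch. It is an illustration only, not the merge task that DataWorks generates: the function name `merge_base_and_log`, the `(operation, row)` log format, and the `id` primary key column are hypothetical, and the log entries are assumed to be ordered by change time.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
LogEntry = Tuple[str, Row]  # (operation, row); a hypothetical log format


def merge_base_and_log(base_rows: List[Row], log_entries: List[LogEntry],
                       key: str = "id") -> List[Row]:
    """Replay change-log entries onto a full snapshot, keyed by primary key."""
    # Start from the previous full snapshot (the base table).
    merged: Dict[Any, Row] = {row[key]: dict(row) for row in base_rows}
    # Apply each logged change in order; the latest change per key wins.
    for operation, row in log_entries:
        if operation in ("INSERT", "UPDATE"):
            merged[row[key]] = dict(row)
        elif operation == "DELETE":
            merged.pop(row[key], None)
        else:
            raise ValueError(f"unsupported operation: {operation}")
    return sorted(merged.values(), key=lambda r: r[key])


previous_snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
change_log = [
    ("UPDATE", {"id": 1, "name": "a2"}),
    ("DELETE", {"id": 2}),
    ("INSERT", {"id": 3, "name": "c"}),
]
new_snapshot = merge_base_and_log(previous_snapshot, change_log)
print(new_snapshot)  # [{'id': 1, 'name': 'a2'}, {'id': 3, 'name': 'c'}]
```

Because the sketch keeps one row per key, it also suggests why the merged result lags the source until the next merge runs.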
Item | Description |
Number of tables from which you can read data | |
Subtasks | The synchronization task generates multiple batch synchronization subtasks used to synchronize full data and a real-time synchronization subtask used to synchronize incremental data. The number of batch synchronization subtasks that are generated varies based on the number of tables from which data is read. |
Data write | You can synchronize both full and incremental data or only incremental data to MaxCompute. The process of synchronizing full and incremental data consists of the following stages:
Note: The following figure shows how to synchronize full and incremental data to a partitioned table. |
Precautions
If you run a synchronization task to synchronize full and incremental data in a database to MaxCompute in quasi real time by using a temporary AccessKey pair, the temporary AccessKey pair is valid for only seven days. After the period elapses, the temporary AccessKey pair automatically expires, and the synchronization task fails. If the system detects that the temporary AccessKey pair is expired, the system restarts the synchronization task. If a related alert rule is configured for the synchronization task, the system reports an error.
On the day you configure a synchronization task to synchronize full and incremental data in a database to MaxCompute in quasi real time, you can query only the historical full data. You can query the incremental data only after full and incremental data are merged on the next day. For more information, see the description of the data write item in the Background information section of this topic.
A synchronization task used to synchronize full and incremental data in a database to MaxCompute in quasi real time generates a partition for storing full data in a MaxCompute table every day. To prevent data from occupying excessive storage resources, the default lifecycle of a MaxCompute table that is automatically created is 30 days. If the lifecycle does not meet your business requirements, you can click the name of a MaxCompute table to modify the lifecycle of the table when you configure the related synchronization task. For more information, see Step 4: Configure settings related to destination tables.
Data Integration uses the channels provided by MaxCompute to upload and download data. You can select a channel based on your business requirements. For more information about the types of channels provided by MaxCompute, see Data upload scenarios and tools.
If you want to run the synchronization task created in this example to synchronize data in whole-instance mode on an exclusive resource group for Data Integration, the specifications of the resource group must be at least 8 vCPUs and 16 GiB of memory. If you want to run the synchronization task to synchronize data in whole-instance mode on a serverless resource group, the serverless resource group must contain at least two compute units (CUs).
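To reason about the lifecycle precaution above, the following sketch lists which daily partitions would be old enough to be reclaimed under a given lifecycle. It assumes that eligibility depends on the number of days since a partition was last modified. The helper name `partitions_past_lifecycle`, the partition names, and the exact boundary are illustrative assumptions, and MaxCompute itself decides when data is actually reclaimed.

```python
from datetime import date, timedelta
from typing import Dict, List


def partitions_past_lifecycle(last_modified: Dict[str, date],
                              lifecycle_days: int,
                              today: date) -> List[str]:
    """Return partition names unchanged for at least lifecycle_days days."""
    limit = timedelta(days=lifecycle_days)
    return sorted(
        name for name, modified in last_modified.items()
        if today - modified >= limit
    )


# Hypothetical daily partitions and their last-modified dates.
partitions = {
    "20240101": date(2024, 1, 1),
    "20240215": date(2024, 2, 15),
}
expired = partitions_past_lifecycle(partitions, lifecycle_days=30,
                                    today=date(2024, 3, 1))
print(expired)  # ['20240101']
```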
Limits
You can use only a self-managed MaxCompute data source that resides in the same region as your workspace. If you use a self-managed MaxCompute data source that resides in a different region from your workspace, the data source can be connected to the resource group that you use. However, an error indicating that the compute engine does not exist is reported when the system creates a MaxCompute table during the running of the synchronization task.
Note: If you use a self-managed MaxCompute data source, you must associate a MaxCompute compute engine with your workspace. Otherwise, a MaxCompute SQL node cannot be created. As a result, a node that is used to mark the end of full synchronization cannot be created.
You must configure a resource group for scheduling for batch synchronization subtasks generated by a synchronization task used to synchronize full and incremental data in a database to MaxCompute in quasi real time. The shared resource group for scheduling is not supported.
Billing
A synchronization task used to synchronize full and incremental data in a database to MaxCompute in quasi real time requires periodic merging of full and incremental data. Therefore, MaxCompute computing resources are consumed. The fees for MaxCompute computing resources are charged by MaxCompute and are positively correlated to the size of the full data and the merging cycle. For more information, see Billable items and billing methods.
Procedure
Step 1: Select a synchronization type
Log on to the DataWorks console and go to the Synchronization Task page in Data Integration.
Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
On the Synchronization Task page, select a source type and a destination type and click Create Synchronization Task to go to the Create Data Synchronization Solution page. In the Basic Settings section of the Create Data Synchronization Solution page, configure the following parameters:
Source And Destination: Select the desired source type and select MaxCompute as the destination type.
Note: The following source types are supported: ApsaraDB for OceanBase, MySQL, Oracle, and PolarDB.
New Node Name: Specify a name for the synchronization task based on your business requirements.
Synchronization Method: Select Full increment of the whole warehouse(quasi real time).
Step 2: Establish network connections
Select a source, a destination, and a resource group that you want to use to run the synchronization task. Test the network connectivity between the data sources and the resource group.

When you select a source, click the icon. In the Configure dialog box, configure the Encoding and Time Zone parameters based on your business requirements.
If the network connectivity test is successful, click Next.
Step 3: Select source tables and configure mapping rules
In the Basic Configurations section, configure parameters such as Solution Name and Storage Location.
In the Source section, confirm the information about the source.
In the Select Source Table for Synchronization section, select the tables from which you want to synchronize data in the Source Table list and click the icon to move the selected tables to the Selected Tables list. The Source Table list displays all tables in the source. You can select all or specific tables.
In the Set Mapping Rules for Table/Database Names section, click Add rule, select a rule type, and then configure a mapping rule of the selected type.
By default, data in source tables is written to destination tables that have the same names as the source tables. If no destination table that has the same name as a source table exists, the system automatically creates such a destination table. You can specify a destination table name in a mapping rule to synchronize data in multiple source tables to the same destination table. The following rule types are supported: Rule for Conversion Between Source Database Name and Destination Schema Name, Source table name and destination Table Name conversion rules, and Rule for Destination Table Name.
Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to use a regular expression to map source database names with a specific prefix to destination schema names with another specific prefix.
This type of mapping rule establishes mappings based on source database names. The destination schema name obtained after mapping can be represented by the built-in variable ${db_name_src_transed} and used when you configure a mapping rule of the Rule for Destination Table Name type.
Example: Synchronize data from source databases whose names start with the prefix doc_ to destination schemas whose names start with the prefix pre_.
Source table name and destination Table Name conversion rules: This type of mapping rule allows you to use a regular expression to map source table names with a specific prefix to destination table names with another specific prefix.
Example 1: Synchronize data from source tables whose names start with the prefix doc_ to destination tables whose names start with the prefix pre_.
Example 2: Synchronize data from multiple source tables to the same destination table.
To synchronize data from table_01, table_02, and table_03 to my_table, you can configure a mapping rule of the Source table name and destination Table Name conversion rules type, and set the Source parameter to table.* and the Destination parameter to my_table.
Rule for Destination Table Name: This type of mapping rule allows you to use a built-in variable to specify the names of the destination tables to which you want to write data and add a prefix and a suffix to the names of the destination tables. The following built-in variables are supported:
${db_table_name_src_transed}: the name of the destination table that is mapped based on a mapping rule of the Source table name and destination Table Name conversion rules type
${db_name_src_transed}: the name of the destination schema that is mapped based on a mapping rule of the Rule for Conversion Between Source Database Name and Destination Schema Name type
${ds_name_src}: the name of the source
For example, you can specify pre_${db_table_name_src_transed}_post to convert the table name my_table that is generated in the previous example to pre_my_table_post.
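The mapping rules above can be tried out with a small sketch. It is not how DataWorks evaluates rules: the helper `convert_name` is hypothetical, the capture-group form used for the prefix example is an assumed way to express that rule, and Python's `re` syntax may differ from the regular expression dialect that the console accepts.

```python
import re
from string import Template


def convert_name(name: str, source_pattern: str, destination: str) -> str:
    """Apply a conversion rule; names that do not fully match stay unchanged."""
    match = re.fullmatch(source_pattern, name)
    return match.expand(destination) if match else name


# Example 1: map the prefix doc_ to the prefix pre_.
print(convert_name("doc_orders", r"doc_(.*)", r"pre_\1"))  # pre_orders

# Example 2: map several source tables to one destination table.
sources = ["table_01", "table_02", "table_03"]
converted = {convert_name(t, r"table.*", "my_table") for t in sources}
print(converted)  # {'my_table'}

# Rule for Destination Table Name: substitute the built-in variable.
rule = Template("pre_${db_table_name_src_transed}_post")
final_name = rule.substitute(db_table_name_src_transed="my_table")
print(final_name)  # pre_my_table_post
```

Using a full match rather than a partial one keeps unrelated names, such as a table called other_table, from being rewritten by the table.* rule in this sketch.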
Step 4: Configure settings related to destination tables
Configure the Write Mode parameter.
You can set this parameter only to Real-time Write to Log Table. This way, incremental data in the source tables is written to a MaxCompute log table. Then, the system writes the incremental data in the MaxCompute log table to the MaxCompute base tables on a regular basis to merge the full and incremental data.
Configure the Automatic Partitioning by Time parameter.
You can set the Automatic Partitioning by Time parameter to Partitioned Table or Non-partitioned Table.
Note: If you set this parameter to Partitioned Table, you can click the icon to specify the partition key column.
Configure mappings between the source tables and destination tables.
Click Refresh Source table and MaxCompute table mapping to map the source tables to destination MaxCompute tables based on the mapping rules that you configured in Step 3. If no mapping rule is configured in Step 3, data in the source tables is written to the MaxCompute tables that are named the same as the source tables. If no such tables exist in the destination, the system automatically creates the tables in the destination. You can specify whether to create a destination table or use an existing table. You can also add additional fields to a destination table.
Note: The system maps the source tables to destination tables based on the mapping rule that you configured in Step 3.
Operation
Description
Select a primary key for synchronization
The synchronization task created in this example cannot be used to synchronize data from source tables that do not have primary keys. If a source table does not have a primary key, you can click the icon in the Synchronized Primary Key column of the source table to specify a primary key for the source table. You can use a field or a combination of multiple fields in the source table as the primary key. The system removes duplicate data based on the primary key during data synchronization.
Specify whether to create a destination table or use an existing table
You can select Create Table or Use Existing Table from the drop-down list in the Table creation method column.
If you select Use Existing Table from the drop-down list, all existing MaxCompute tables are displayed in the drop-down list in the MaxComputeBase Table Name column. You must select the name of the table that you want to use from the drop-down list.
If you select Create Table from the drop-down list, the name of the destination table that is automatically created appears in the MaxComputeBase Table Name column. You can click the table name to view and modify the data types and comments of fields, and the lifecycle of the table.
Specify whether to perform full synchronization
You can determine whether to turn on the switch in the Full Synchronization column of a source table to synchronize full data in the source table to the destination before real-time incremental synchronization starts.
If you turn off the switch, the full data in the source table is not synchronized before real-time incremental synchronization starts. If the full data in a source table has already been synchronized to the destination, you can turn off the switch for the source table.
Add additional fields to a destination table and assign values to the fields
You can click Edit additional fields in the Actions column of a destination table to add additional fields to the table and assign values to the fields. You can assign constants and variables to additional fields as values.
Note: You can add additional fields to a destination table only if Create Table is selected from the drop-down list in the Table creation method column of the table.
Data Integration allows you to assign the following variables to additional fields as values:
EXECUTE_TIME: the execution time
UPDATE_TIME: the update time
DB_NAME_SRC: the name of the source database
DB_NAME_SRC_TRANSED: the name of the mapped database
DATASOURCE_NAME_SRC: the name of the source
DATASOURCE_NAME_DEST: the name of the destination
DB_NAME_DEST: the name of the destination database
TABLE_NAME_DEST: the name of the destination table
TABLE_NAME_SRC: the name of the source table
Modify the schema of a destination table
By default, the lifecycle of MaxCompute tables that are automatically created is 30 days, and field type conversion may occur. For example, if the data types of the fields in a destination table are different from the data types of the fields in a source table, the synchronization task automatically maps the source fields to destination fields of compatible data types when it creates the destination table. You can click the name of a destination table in the MaxComputeBase Table Name column to modify the lifecycle or field types of the table.
Note: You can add additional fields to a destination table only if Create Table is selected from the drop-down list in the Table creation method column of the table.
After you complete and confirm the configurations in the Set Destination Table step, click Next.
Step 5: Configure rules for processing DML messages
DataWorks allows you to configure table-level DML processing rules for some synchronization tasks. You can configure processing rules for the messages that are generated for INSERT, UPDATE, and DELETE operations performed on a source table.
Support for synchronizing data changes generated by DML operations varies based on the destination type. You can check whether a synchronization task supports the configuration of DML processing rules when you configure the synchronization task in the DataWorks console. For more information, see Supported DML and DDL operations.
Step 6: Configure rules for processing DDL messages
DDL operations may be performed on a source. Data Integration provides default rules to process DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Configure DDL processing rules.
Step 7: Configure resource groups required to run the synchronization task
After you run the synchronization task created in this example, the synchronization task generates multiple batch synchronization subtasks used to synchronize full data and a real-time synchronization subtask used to synchronize incremental data. You must configure properties related to batch synchronization subtasks and the real-time synchronization subtask in the Configure Resource step.
The properties include the exclusive resource groups for Data Integration required by real-time incremental synchronization and batch full synchronization, and the resource group for scheduling required by batch full synchronization. The shared resource group for scheduling is not supported. You can also separately click Advanced configuration in the Real-time Incremental Synchronization and Batch Full Synchronization sections and configure parameters such as Allow Dirty Data Records, Task Expected Maximum Concurrency, Maximum Number of Parallel Threads Allowed for Destination, and Maximum Number of Connections Allowed for Source.
DataWorks uses resource groups for scheduling to issue batch synchronization tasks to resource groups for Data Integration and runs the tasks on the resource groups for Data Integration. Therefore, a batch synchronization task also occupies the resources of a resource group for scheduling. If you use an exclusive resource group for scheduling, you are charged for scheduling instances. You can read the Overview topic to understand the task issuing mechanism.
We recommend that you use different resource groups to run the generated batch and real-time synchronization subtasks. If you use the same resource group to run the subtasks, the subtasks compete for resources and affect each other. For example, CPU resources, memory resources, and networks used by the two types of subtasks may affect each other. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. The batch or real-time synchronization subtasks may even be terminated by the out of memory (OOM) killer due to insufficient resources.
Step 8: Run the synchronization task
Go to the Synchronization Task page in Data Integration. In the Tasks section of the Synchronization Task page, find the created synchronization task.
Click Submit and Run in the Operation column to start the synchronization task.
Click Description in the Execution Overview column of the synchronization task to view the running details of the synchronization task.
What to do next
After a synchronization task is configured, you can manage the task. For example, you can add source tables to or remove source tables from the task, configure alerting and monitoring settings for the task, and view information about the running of the subtasks. For more information, see Perform O&M on a full and incremental synchronization task.
Appendix: What do I do if data fails to be written to the base tables?
Data merging process | Problem description | Cause | Solution |
| A synchronization task is configured on the T day, and incremental data fails to be written to the T-1 partition in a MaxCompute log table. | An exception occurs when the real-time synchronization subtask is run. | |
| A synchronization task is configured on the T day, and full and incremental data fail to be written to the T-2 partitions in the MaxCompute base tables. | | |

