Data Integration supports full and incremental synchronization of entire databases from sources such as ApsaraDB for OceanBase, MySQL, Oracle, and PolarDB to MaxCompute. This synchronization link performs an initial full migration, then continuously syncs incremental changes and merges them into the destination daily (T+1). This topic uses MySQL-to-MaxCompute as an example to describe how to create a full and incremental task.
How it works
A full and incremental task combines batch and real-time subtasks to migrate and merge data into the destination base table. After the task is started, the system automatically creates and coordinates batch and real-time subtasks to merge data into the destination table (base table).
The process has three phases:
Initial full load: A batch task migrates the table schemas and existing data from all tables in the source database to MaxCompute. After completion, this batch task is frozen.
Incremental synchronization: A real-time task continuously captures changes (Insert, Update, Delete) from the source database binlog (such as MySQL binlog) and writes them to a temporary log table in MaxCompute in near-real time.
Periodic merge: A daily (T+1) merge task combines the previous day's incremental data from the log table with the base table, producing a full snapshot in a new partition. This merge task runs once daily.
Key features:
Many-to-many or many-to-one table mapping: Synchronize multiple source tables to corresponding destination tables, or merge multiple source tables into a single destination table using mapping rules.
Composition: A full and incremental task consists of a batch subtask (full load), a real-time subtask (incremental sync), and a merge task (data consolidation).
Destination table support: Data can be written to both partitioned and non-partitioned tables in MaxCompute.
Usage notes
Resource requirements: Tasks must run on a serverless resource group. When you synchronize data by instance, minimum specifications: 2 CUs for a serverless resource group.
Network connectivity: Ensure that the resource group for Data Integration has network connectivity to both the source (such as MySQL) and destination (such as MaxCompute) data sources. For more information, see Network connectivity.
Region restrictions: Only self-managed MaxCompute data sources in the same region as the current DataWorks workspace are supported. When you use a self-managed MaxCompute data source, you must still associate a MaxCompute compute resource in Data Studio. Otherwise, you cannot create MaxCompute SQL nodes, which causes the full synchronization done node to fail.
Scheduling resource group restrictions: The serverless resource group configured for the task is used as the scheduling resource group.
Destination table type restrictions: Synchronizing source data to MaxCompute external tables is not supported.
Notes
Primary key requirements: Tables without primary keys are not supported. For tables without primary keys, you must manually specify one or more columns as business primary keys (Specify Primary Key) during configuration.
Data visibility latency: On the day you configure a full and incremental task to MaxCompute, you can only query historical full data. Incremental data can be queried in MaxCompute only after the next day's merge is completed. For more information, see the periodic merge section in How it works.
Storage and lifecycle: A full and incremental task generates a full partition every day. To avoid excessive storage usage, the MaxCompute tables automatically created by the task have a default lifecycle of 30 days. If this does not meet your business needs, you can click the corresponding MaxCompute table name to modify the lifecycle when configuring the task. For more information, see Edit the destination table schema (optional).
SLA: Data Integration uses the MaxCompute data channel to upload and download data. For more information about the SLA of the data channel, see MaxCompute data channel SLA. Evaluate your data synchronization solution based on the MaxCompute data channel SLA.
Binlog retention policy: Real-time synchronization depends on the binlog of the source MySQL instance. Ensure that the binlog retention period is sufficient to prevent synchronization interruptions caused by missing start offsets when a task is paused for an extended period or retried after a failure.
Billing
A full and incremental task includes a batch synchronization task for the full load phase, a real-time synchronization task for the incremental phase, and a scheduled task for the periodic merge phase. These three phases are billed separately. All three phases consume CUs from the resource group. For more information about CU billing, see Resource group billing. The scheduled task also incurs task scheduling fees. For more information, see Task scheduling billing.
In addition, the full and incremental synchronization link to MaxCompute requires periodic full-incremental data merges, which consume MaxCompute compute resources. These costs are charged directly by MaxCompute and are proportional to the size of the full data and the merge cycle. For more information, see MaxCompute billing.
Procedure
Step 1: Select a synchronization task type
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Integration.
In the left-side navigation pane, click Synchronization Task, and then click Create Synchronization Task at the top of the page. In the dialog that appears, configure the following key settings:
Source Type:
MySQL.Destination Type:
MaxCompute.Specific Type:
Full and Incremental.Sync Procedure: Schema Migration, Incremental Sync, Full Synchronization, and Periodic Merge.
Step 2: Configure data sources and resource groups
In the Source Information section, select the
MySQLdata source that you have added. In the Destination section, select theMaxComputedata source that you have added.In the Running Resources section, select the Resource Group for the task and allocate Resource Group CU to it. You can configure CUs separately for full synchronization and incremental synchronization to precisely control resources and avoid waste.
NoteBatch synchronization tasks in DataWorks are dispatched by the scheduling resource group to the Data Integration resource group for execution. Therefore, batch synchronization tasks consume resources from both the Data Integration resource group and the scheduling resource group, which incurs scheduling instance fees.
Ensure that both the source and destination data sources pass the Connectivity Check.
Step 3: Select tables to synchronize
In the Source Tables area, select the tables to sync and click the
icon to move them to the Selected Tables list on the right.
Two table selection methods are available: Select Specific Tables and Select Tables by Regex. Switching between methods clears the results of the other method. You can select up to 5,000 tables. You can also use Batch Paste to Select Tables to quickly add tables.
If there are many databases and tables, you can use Database Filtering or Search for Tables, or configure regular expressions to select the tables to synchronize.
Step 4: Configure task settings
Log table time range: This parameter defines the time query range for merging data from the log table into the destination partition.
To avoid cross-day partition errors caused by data latency, extend this range appropriately to ensure that all data belonging to the partition is correctly merged.
Merge task schedule settings: Set the scheduling time for the daily merge task. For more information about how to configure the scheduling time, see Schedule settings.
Periodic scheduling parameters: Configure scheduling parameters. You can use the scheduling parameters to assign values to partitions in subsequent partition settings, which allows you to write data to partitions by date.
Table partition settings: Configure partitions for the destination table, including key parameters such as partition column names and value assignment methods. You can use scheduling parameters in the assignment column to automatically generate partitions by date.
Step 5: Configure destination table mapping
In this step, you define mapping rules between source tables and destination tables, and specify primary keys, dynamic partitions, DDL/DML configurations, and other rules to determine how data is written.
Action | Description |
Refresh | The system automatically lists the source tables you selected, but the destination table attributes take effect only after you refresh and confirm them.
|
Customize Mapping Rules for Destination Table Names (optional) | The system has a default table name generation rule:
The following scenarios are supported:
|
Edit column type mapping (optional) | The system provides a default mapping between source and destination column types. You can click Edit Mapping of Field Data Types in the upper-right corner of the table to customize the column type mapping between source and destination tables. After configuration, click Apply and Refresh Mapping. When you edit column type mapping, ensure that the type conversion rules are correct. Incorrect rules cause type conversion failures that generate dirty data and affect task execution. |
Edit the destination table schema (optional) | The system automatically creates destination tables that do not exist based on the custom table name mapping rules, or reuses existing tables with the same name. DataWorks automatically generates the destination table schema based on the source table schema. Manual intervention is not required in common scenarios. You can also modify the table schema in the following ways:
For existing tables, you can only add columns. For new tables, you can add columns, partition columns, and set the table type or table properties. See the editable areas on the page for details. |
Value assignment | Native columns are automatically mapped based on columns with the same name in the source and destination tables. The newly added columns in the preceding steps require manual value assignment. Perform the following operations:
You can assign constants or variables to columns. Switch the type in Value Type. The following methods are supported:
|
Source Split Key | In the source split key column, select a column from the source table in the drop-down list or select Not Split. During task execution, the data is split into multiple tasks based on this column for concurrent, batch data reading. We recommend that you use a column with evenly distributed data, such as the table primary key, as the source split key. String, floating-point, and date types are not supported. The source split key is currently supported only when the source is MySQL. |
Skip Full Synchronization | If full synchronization is configured in step 3, you can cancel the full data synchronization for individual tables. This is applicable when you have already synchronized full data to the destination through other methods. |
Full condition | Apply condition-based filtering to the source during the full load phase. Specify only the WHERE clause here without the WHERE keyword. |
Configure DML Rule | DML message processing filters and controls the change data ( |
Full Merge Cycle | Currently, only daily merges are supported. You can configure the specific scheduling time of the merge task in Custom Merge Time. |
Merge Primary Key | You can define a primary key by selecting one or more columns in the table.
|
Step 6: Advanced configuration
Advanced parameter configuration
If you need fine-grained task configuration to meet custom synchronization requirements, go to the Advanced Parameters tab to modify advanced parameters.
Click Advanced Configuration in the upper-right corner of the page to go to the advanced parameter configuration page.
Modify parameter values based on the parameter descriptions. The meaning of each parameter is explained next to the parameter name.
AI-assisted configuration is also supported. Enter natural language instructions, such as adjusting the concurrency of the task, and the large language model generates recommended parameter values. You can decide whether to accept the AI-generated parameters based on your actual needs.
Modify parameters only when you fully understand their meanings. Incorrect modifications may cause unexpected issues such as task latency, excessive resource consumption that blocks other tasks, or data loss.
DDL capability configuration
Some real-time synchronization links can detect metadata changes to source table schemas and notify the destination to synchronize the updates, or take other actions such as alerting, ignoring, or terminating the task.
Click Configure DDL Capability in the upper-right corner of the page to set a processing policy for each type of change. The supported policies vary depending on the channel.
Normal: The destination processes the DDL change information from the source.
Ignore: The change message is ignored and no modifications are made to the destination.
Error: The real-time synchronization task is terminated and its status is set to Error.
Alert: An alert is sent when this type of change occurs on the source. You must configure DDL notification rules in Configure Alert Rule.
After a new column is added on the source and synchronized to the destination through DDL, the system does not backfill existing data in the destination table for that column.
Step 7: Run the synchronization task
-
After you complete all configurations, click Save to save the task.
-
On the page, find the synchronization task that you created, and click Deploy in the Operation column. If you select Start immediately after deployment during deployment, the task runs immediately after you confirm. Otherwise, you must manually start the task.
NoteData Integration tasks must be deployed to the production environment before they can run. Therefore, you must perform the Publish operation after you create or edit a task for the changes to take effect.
-
Click the Name/ID of the corresponding task in Tasks to view the detailed execution process of the task.
Next step
After you complete the task configuration, you can manage the created task, add or remove tables, configure monitoring alerts for the task, and view key metrics of the task. For more information, see Manage synchronization tasks.
FAQ
Q: Why is the base table data not updated as expected?
A: The following provides root cause analysis and solutions:
Symptom | Cause | Solution |
The data production check for the T-1 partition of the incremental log table failed. | The real-time synchronization task failed, causing the T-1 partition data in the incremental log table to not be produced normally. |
|
The data production check for the T-2 partition of the destination base table failed. |
|
|
button to select Manual Input and Built-in Variable to concatenate the destination table name. The supported variables include the source data source name, source database name, and source table name.
button in the Destination Table Name column to add columns.