All Products
Search
Document Center

DataWorks:Synchronize MySQL to MaxCompute using a full and incremental task

Last Updated:Jun 24, 2026

Data Integration supports full and incremental synchronization of entire databases from sources such as ApsaraDB for OceanBase, MySQL, Oracle, and PolarDB to MaxCompute. This synchronization link performs an initial full migration, then continuously syncs incremental changes and merges them into the destination daily (T+1). This topic uses MySQL-to-MaxCompute as an example to describe how to create a full and incremental task.

How it works

A full and incremental task combines batch and real-time subtasks to migrate and merge data into the destination base table. After the task is started, the system automatically creates and coordinates batch and real-time subtasks to merge data into the destination table (base table).

The process has three phases:

  1. Initial full load: A batch task migrates the table schemas and existing data from all tables in the source database to MaxCompute. After completion, this batch task is frozen.

  2. Incremental synchronization: A real-time task continuously captures changes (Insert, Update, Delete) from the source database binlog (such as MySQL binlog) and writes them to a temporary log table in MaxCompute in near-real time.

  3. Periodic merge: A daily (T+1) merge task combines the previous day's incremental data from the log table with the base table, producing a full snapshot in a new partition. This merge task runs once daily.

image

Key features:

  • Many-to-many or many-to-one table mapping: Synchronize multiple source tables to corresponding destination tables, or merge multiple source tables into a single destination table using mapping rules.

  • Composition: A full and incremental task consists of a batch subtask (full load), a real-time subtask (incremental sync), and a merge task (data consolidation).

  • Destination table support: Data can be written to both partitioned and non-partitioned tables in MaxCompute.

Usage notes

  • Resource requirements: Tasks must run on a serverless resource group. When you synchronize data by instance, minimum specifications: 2 CUs for a serverless resource group.

  • Network connectivity: Ensure that the resource group for Data Integration has network connectivity to both the source (such as MySQL) and destination (such as MaxCompute) data sources. For more information, see Network connectivity.

  • Region restrictions: Only self-managed MaxCompute data sources in the same region as the current DataWorks workspace are supported. When you use a self-managed MaxCompute data source, you must still associate a MaxCompute compute resource in Data Studio. Otherwise, you cannot create MaxCompute SQL nodes, which causes the full synchronization done node to fail.

  • Scheduling resource group restrictions: The serverless resource group configured for the task is used as the scheduling resource group.

  • Destination table type restrictions: Synchronizing source data to MaxCompute external tables is not supported.

Notes

  • Primary key requirements: Tables without primary keys are not supported. For tables without primary keys, you must manually specify one or more columns as business primary keys (Specify Primary Key) during configuration.

  • Data visibility latency: On the day you configure a full and incremental task to MaxCompute, you can only query historical full data. Incremental data can be queried in MaxCompute only after the next day's merge is completed. For more information, see the periodic merge section in How it works.

  • Storage and lifecycle: A full and incremental task generates a full partition every day. To avoid excessive storage usage, the MaxCompute tables automatically created by the task have a default lifecycle of 30 days. If this does not meet your business needs, you can click the corresponding MaxCompute table name to modify the lifecycle when configuring the task. For more information, see Edit the destination table schema (optional).

  • SLA: Data Integration uses the MaxCompute data channel to upload and download data. For more information about the SLA of the data channel, see MaxCompute data channel SLA. Evaluate your data synchronization solution based on the MaxCompute data channel SLA.

  • Binlog retention policy: Real-time synchronization depends on the binlog of the source MySQL instance. Ensure that the binlog retention period is sufficient to prevent synchronization interruptions caused by missing start offsets when a task is paused for an extended period or retried after a failure.

Billing

  • A full and incremental task includes a batch synchronization task for the full load phase, a real-time synchronization task for the incremental phase, and a scheduled task for the periodic merge phase. These three phases are billed separately. All three phases consume CUs from the resource group. For more information about CU billing, see Resource group billing. The scheduled task also incurs task scheduling fees. For more information, see Task scheduling billing.

  • In addition, the full and incremental synchronization link to MaxCompute requires periodic full-incremental data merges, which consume MaxCompute compute resources. These costs are charged directly by MaxCompute and are proportional to the size of the full data and the merge cycle. For more information, see MaxCompute billing.

Procedure

Step 1: Select a synchronization task type

  1. Log on to the DataWorks console. In the target region, click Data Integration > Data Integration in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Integration.

  2. In the left-side navigation pane, click Synchronization Task, and then click Create Synchronization Task at the top of the page. In the dialog that appears, configure the following key settings:

    • Source Type: MySQL.

    • Destination Type: MaxCompute.

    • Specific Type: Full and Incremental.

    • Sync Procedure: Schema Migration, Incremental Sync, Full Synchronization, and Periodic Merge.

Step 2: Configure data sources and resource groups

  1. In the Source Information section, select the MySQL data source that you have added. In the Destination section, select the MaxCompute data source that you have added.

  2. In the Running Resources section, select the Resource Group for the task and allocate Resource Group CU to it. You can configure CUs separately for full synchronization and incremental synchronization to precisely control resources and avoid waste.

    Note

    Batch synchronization tasks in DataWorks are dispatched by the scheduling resource group to the Data Integration resource group for execution. Therefore, batch synchronization tasks consume resources from both the Data Integration resource group and the scheduling resource group, which incurs scheduling instance fees.

  3. Ensure that both the source and destination data sources pass the Connectivity Check.

Step 3: Select tables to synchronize

In the Source Tables area, select the tables to sync and click the image icon to move them to the Selected Tables list on the right.

Two table selection methods are available: Select Specific Tables and Select Tables by Regex. Switching between methods clears the results of the other method. You can select up to 5,000 tables. You can also use Batch Paste to Select Tables to quickly add tables.

If there are many databases and tables, you can use Database Filtering or Search for Tables, or configure regular expressions to select the tables to synchronize.

Step 4: Configure task settings

  • Log table time range: This parameter defines the time query range for merging data from the log table into the destination partition.

    To avoid cross-day partition errors caused by data latency, extend this range appropriately to ensure that all data belonging to the partition is correctly merged.

  • Merge task schedule settings: Set the scheduling time for the daily merge task. For more information about how to configure the scheduling time, see Schedule settings.

  • Periodic scheduling parameters: Configure scheduling parameters. You can use the scheduling parameters to assign values to partitions in subsequent partition settings, which allows you to write data to partitions by date.

  • Table partition settings: Configure partitions for the destination table, including key parameters such as partition column names and value assignment methods. You can use scheduling parameters in the assignment column to automatically generate partitions by date.

Step 5: Configure destination table mapping

In this step, you define mapping rules between source tables and destination tables, and specify primary keys, dynamic partitions, DDL/DML configurations, and other rules to determine how data is written.

Action

Description

Refresh

The system automatically lists the source tables you selected, but the destination table attributes take effect only after you refresh and confirm them.

  • Select the tables to synchronize in batch and click Batch Refresh Mapping.

  • Destination table name: The destination table name is automatically generated based on the Customize Mapping Rules for Destination Table Names rules. The default format is ${source_database_name}_${table_name}. If a table with the same name does not exist in the destination, the system automatically creates one.

Customize Mapping Rules for Destination Table Names (optional)

The system has a default table name generation rule: ${source_database_name}_${table_name}. You can also click Edit in the Customize Mapping Rules for Destination Table Names column to add custom destination table name rules.

  • Rule name: Define a rule name. We recommend that you specify a name with clear business meaning.

  • Destination table name: You can click the image button to select Manual Input and Built-in Variable to concatenate the destination table name. The supported variables include the source data source name, source database name, and source table name.

  • Edit built-in variables: You can perform string transformations on built-in variables.

The following scenarios are supported:

  1. Add prefixes or suffixes to table names: Add a prefix or suffix to the source table name by configuring constants.

  2. Uniform string replacement: Replace the string dev_ in source table names with prd_.

  3. Write multiple tables to a single table.

Edit column type mapping (optional)

The system provides a default mapping between source and destination column types. You can click Edit Mapping of Field Data Types in the upper-right corner of the table to customize the column type mapping between source and destination tables. After configuration, click Apply and Refresh Mapping.

When you edit column type mapping, ensure that the type conversion rules are correct. Incorrect rules cause type conversion failures that generate dirty data and affect task execution.

Edit the destination table schema (optional)

The system automatically creates destination tables that do not exist based on the custom table name mapping rules, or reuses existing tables with the same name.

DataWorks automatically generates the destination table schema based on the source table schema. Manual intervention is not required in common scenarios. You can also modify the table schema in the following ways:

  • Add columns to a single table: Click the image.png button in the Destination Table Name column to add columns.

  • Add columns in batch: Select all tables to synchronize and choose Batch Modify > Destination Table Schema - Batch Modify and Add Field at the bottom of the table.

  • Renaming column names is not supported.

For existing tables, you can only add columns. For new tables, you can add columns, partition columns, and set the table type or table properties. See the editable areas on the page for details.

Value assignment

Native columns are automatically mapped based on columns with the same name in the source and destination tables. The newly added columns in the preceding steps require manual value assignment. Perform the following operations:

  • Single-table assignment: Click the Configuration button in the Value assignment column to assign values to destination table columns.

  • Batch assignment: At the bottom of the list, choose Batch Modify > Value assignment to assign values in batch for columns with the same name in destination tables.

You can assign constants or variables to columns. Switch the type in Value Type. The following methods are supported:

  • Table columns

    • Manual input: Enter a constant value or scheduling parameter directly, such as abc or ${bizdate}.

    • Select variable: Select a system-provided variable as the column value. For variable types, see the descriptions on the page.

    • Source column: Use the value of a source table column as the column value. The value type can be a column value or a time value.

      • Column value: If the source column is not a time type, you can directly write the source column value to the destination column.

      • Time value: If the source column value is a time value, you can process it according to different formats and specify the Destination Format to format the partition value.

        • Time string: A string that represents a time, such as 2018-10-23 02:13:56 or 2021/05/18. Serialize it as a time by specifying the source and destination time formats. For the examples above, you can use the formats yyyy-MM-dd HH:mm:ss and yyyy/MM/dd respectively for serialization.

        • Time object: If the source value is a time type such as Date or Datetime, select the corresponding type directly.

        • Unix timestamp (seconds): A second-level timestamp. Numbers or strings in a 10-digit timestamp format are also supported, such as 1610529203 or "1610529203".

        • Unix timestamp (milliseconds): A millisecond-level timestamp. Numbers or strings in a 13-digit timestamp format are also supported, such as 1610529203002 or "1610529203002".

    • Function: Apply simple transformations to source columns using functions to generate partition values. For more information, see Function reference.

  • Partition columns: Dynamically create partitions based on the enumerated values or event time of source columns.

    • Manual input: Enter a constant value or a scheduling parameter that has been configured in Configure Scheduling Parameters for Periodical Scheduling, such as abc or ${bizdate}.

Source Split Key

In the source split key column, select a column from the source table in the drop-down list or select Not Split. During task execution, the data is split into multiple tasks based on this column for concurrent, batch data reading.

We recommend that you use a column with evenly distributed data, such as the table primary key, as the source split key. String, floating-point, and date types are not supported.

The source split key is currently supported only when the source is MySQL.

Skip Full Synchronization

If full synchronization is configured in step 3, you can cancel the full data synchronization for individual tables. This is applicable when you have already synchronized full data to the destination through other methods.

Full condition

Apply condition-based filtering to the source during the full load phase. Specify only the WHERE clause here without the WHERE keyword.

Configure DML Rule

DML message processing filters and controls the change data (Insert, Update, Delete) captured from the source before the data is written to the destination during real-time synchronization. This rule takes effect only during the incremental phase.

Full Merge Cycle

Currently, only daily merges are supported. You can configure the specific scheduling time of the merge task in Custom Merge Time.

Merge Primary Key

You can define a primary key by selecting one or more columns in the table.

  • Single-column primary key: Select a single column with unique values (such as id) as the primary key.

  • Composite primary key: If no single column is unique, select a combination of multiple columns. The combined values of these columns must uniquely identify each row to ensure data uniqueness during writes and prevent overwrites.

Step 6: Advanced configuration

Advanced parameter configuration

If you need fine-grained task configuration to meet custom synchronization requirements, go to the Advanced Parameters tab to modify advanced parameters.

  1. Click Advanced Configuration in the upper-right corner of the page to go to the advanced parameter configuration page.

  2. Modify parameter values based on the parameter descriptions. The meaning of each parameter is explained next to the parameter name.

  3. AI-assisted configuration is also supported. Enter natural language instructions, such as adjusting the concurrency of the task, and the large language model generates recommended parameter values. You can decide whether to accept the AI-generated parameters based on your actual needs.

Important

Modify parameters only when you fully understand their meanings. Incorrect modifications may cause unexpected issues such as task latency, excessive resource consumption that blocks other tasks, or data loss.

DDL capability configuration

Some real-time synchronization links can detect metadata changes to source table schemas and notify the destination to synchronize the updates, or take other actions such as alerting, ignoring, or terminating the task.

Click Configure DDL Capability in the upper-right corner of the page to set a processing policy for each type of change. The supported policies vary depending on the channel.

  • Normal: The destination processes the DDL change information from the source.

  • Ignore: The change message is ignored and no modifications are made to the destination.

  • Error: The real-time synchronization task is terminated and its status is set to Error.

  • Alert: An alert is sent when this type of change occurs on the source. You must configure DDL notification rules in Configure Alert Rule.

Note

After a new column is added on the source and synchronized to the destination through DDL, the system does not backfill existing data in the destination table for that column.

Step 7: Run the synchronization task

  1. After you complete all configurations, click Save to save the task.

  2. On the Data Integration > Sync Tasks page, find the synchronization task that you created, and click Deploy in the Operation column. If you select Start immediately after deployment during deployment, the task runs immediately after you confirm. Otherwise, you must manually start the task.

    Note

    Data Integration tasks must be deployed to the production environment before they can run. Therefore, you must perform the Publish operation after you create or edit a task for the changes to take effect.

  3. Click the Name/ID of the corresponding task in Tasks to view the detailed execution process of the task.

Next step

After you complete the task configuration, you can manage the created task, add or remove tables, configure monitoring alerts for the task, and view key metrics of the task. For more information, see Manage synchronization tasks.

FAQ

Q: Why is the base table data not updated as expected?

A: The following provides root cause analysis and solutions:

image

Symptom

Cause

Solution

The data production check for the T-1 partition of the incremental log table failed.

The real-time synchronization task failed, causing the T-1 partition data in the incremental log table to not be produced normally.

  • Check the failure cause of the real-time synchronization task and restart the task from the failed offset.

    Note

    RDS binlog retention is limited. If the binlog at the corresponding point in time is purged, the real-time synchronization task will report an error indicating that the offset cannot be found.

  • Configure monitoring alerts for the real-time synchronization subtask.

The data production check for the T-2 partition of the destination base table failed.

  1. The previous day's merge task failed.

  2. The previous day's partition of the base table was deleted.

  • Configure run status monitoring for the merge node. For more information, see Configure monitoring rules.

  • If you accidentally deleted the T-2 partition of the base table, use the backfill data feature to restore it. After the partition is restored, rerun the merge node. For more information, see Backfill data.