How to configure a full database offline sync task - DataWorks

The full database offline synchronization feature in DataWorks lets you synchronize all or part of the table schemas and data from a source database to a destination in batches. You can run full or incremental synchronization tasks on a recurring basis. This feature provides an efficient solution for data migration. This topic uses the migration of an entire MySQL database to MaxCompute as an example to describe the general process for configuring this type of task.

Preparations

Prepare data sources
- Create a source data source and a destination data source. For more information about how to configure data sources, see Data Source Management.
- Make sure that the data sources support full database offline synchronization. For more information, see Supported data sources.
Resource group: Purchase and configure a Serverless resource group.
Network connectivity: Establish network connectivity between the resource group and the data sources.

Access the feature

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

Configure the task

1. Create a sync task

You can create a sync task in one of the following ways:

Method 1: On the sync task page, select a Source and a Destination, and then click Create Synchronization Task. In this example, select MySQL for the source and MaxCompute for the destination.
Method 2: On the sync task page, if the task list is empty, click Create.

2. Configure basic information

Configure basic information, such as the task name, task description, and owner.
Select a synchronization type. Based on the source and destination database types, Data Integration displays the supported synchronization types. For this example, select Offline synchronization of the entire database.
Synchronization steps: Schema migration, full synchronization, and incremental synchronization are supported. Full synchronization and incremental synchronization are optional. The synchronization steps setting is linked to the Full and incremental control setting. You can combine these settings to create different synchronization solutions. For more information, see Full and incremental control.
Network and resource configuration: Select the Resource Group to run the sync task, and select the Source and Destination. Then, test the network connectivity between the resource group and the data sources.

3. Select the databases and tables to sync

In the Source Table area, select the tables to sync from the source data source. Click the icon to move the tables to the Selected Tables list.

If you have many databases and tables, you can use Database Filtering or Search For Tables to select the tables you want to sync by configuring a regular expression.
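
The following Python sketch illustrates how a regular expression narrows a long table list. It is not DataWorks code: the table names are hypothetical, and it assumes the expression must match the whole table name, which this topic does not specify.

```python
import re


def select_tables(table_names, pattern):
    """Return the table names that match the whole regular expression."""
    compiled = re.compile(pattern)
    return [name for name in table_names if compiled.fullmatch(name)]


# Hypothetical table names, used only for illustration.
tables = ["orders_2023", "orders_2024", "order_items", "users", "tmp_orders"]

# Select only the yearly order tables.
print(select_tables(tables, r"orders_\d{4}"))
```

Testing an expression like this against a sample of your table names before you enter it in the console can help you avoid selecting unintended tables.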

4. Set destination table properties

Click the Configuration button next to Partition Initialization Configuration to configure the initial partition settings for all new destination tables. Changes made here will overwrite the partition settings for all new destination tables. Existing destination tables are not affected.

5. Configure full and incremental control

Configure the full and incremental synchronization type for the task.

If you select Full Synchronization or Incremental Synchronization, you can choose to run the task as a One-time task or a Recurring task.

If you select both Full Synchronization and Incremental Synchronization, the system uses the built-in mode: One-time full sync first, then recurring incremental sync. This option cannot be changed.

Synchronization steps	Full and incremental control	Data write description	Scenarios
Full synchronization	One-time	After the task starts, all data from the source table is synchronized to the destination table or a specified partition at one time.	Data initialization, system migration
Full synchronization	Recurring	All data from the source table is periodically synchronized to the destination table or a specified partition based on the configured scheduling cycle.	Data reconciliation, T+1 full snapshot
Incremental synchronization	One-time	After the task starts, incremental data is synchronized to a specified partition at one time based on the incremental condition you specify.	Manually fix a batch of data
Incremental synchronization	Recurring	After the task starts, incremental data is periodically synchronized to a specified partition based on the configured scheduling cycle and incremental condition.	Daily extract, transform, and load (ETL), building zipper tables
Full synchronization & Incremental synchronization	(Built-in mode, cannot be selected)	First run: The system automatically performs an initial schema synchronization and a full synchronization of historical data. Subsequent runs: Incremental data is periodically synchronized to a specified partition based on the configured scheduling cycle and incremental condition.	One-click data warehousing/data lake ingestion
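
The relationship between the synchronization steps and the full and incremental control setting can be summarized as a small decision function. The following Python sketch only restates the table above. The function name is hypothetical, and the empty result for the case where neither step is selected is an assumption, because this topic does not describe that case.

```python
BUILT_IN_MODE = "One-time full sync first, then recurring incremental sync"


def resolve_run_modes(full_sync, incremental_sync):
    """Return the run modes available for a combination of synchronization steps."""
    if full_sync and incremental_sync:
        # Built-in mode: fixed, cannot be changed.
        return [BUILT_IN_MODE]
    if full_sync or incremental_sync:
        # A single data step can run once or on a recurring schedule.
        return ["One-time", "Recurring"]
    # Assumption: schema migration only, so there is no data run mode to choose.
    return []


print(resolve_run_modes(full_sync=True, incremental_sync=False))
print(resolve_run_modes(full_sync=True, incremental_sync=True))
```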

Note

For full database offline synchronization, instances for recurring schedules are generated immediately after you publish the task. For more information, see Instance generation method: Generate immediately after publishing.
You can define how partitions are generated in the Value assignment step. You can use constants or dynamically generate partitions using system-predefined variables and recurring schedule parameters.
The configurations for the scheduling cycle, incremental condition, and partition generation method are interconnected. For more information, see Incremental condition.

Configure recurring schedule parameters.
If your task involves recurring synchronization, click Configure Scheduling Parameters for Periodical Scheduling to configure them. You can use these parameters later when configuring the incremental condition and field value assignment in the destination table mapping.

6. Configure destination table mapping

In this step, you need to define the mapping rules between the source and destination tables. You also need to define the Recurring Schedule and Incremental Condition to specify how data is written.

Destination table configuration

Operation

Description

Refresh mapping

The system automatically lists the source tables you selected. However, you must refresh the mapping to confirm the specific properties of the destination tables before they take effect.

Select the tables to be synchronized in batches and click Batch Refresh Mapping.
Destination table name: The destination table name is automatically generated based on the Customize Mapping Rules for Destination Table Names rule. The default format is ${SourceDatabaseName}_${TableName}. If a table with this name does not exist at the destination, the system automatically creates one for you.

Customize destination table name mapping (Optional)

The system has a default rule for generating table names: ${SourceDatabaseName}_${TableName}. You can also click the Edit button in the Customize Mapping Rules for Destination Table Names column to add a custom rule for destination table names.

Rule Name: Define a name for the rule. We recommend giving the rule a name with a clear business meaning.
Destination Table Name: You can generate the destination table name by clicking the button and combining values from manual input and built-in variables. The variables include the source data source name, source database name, and source table name.
Edit Built-in Variables: You can perform string transformations on the built-in variables.

This allows for the following scenarios:

Add prefixes and suffixes to names: Add a prefix or suffix to the source table name by setting a constant.
Uniform string replacement: Replace the string dev_ in source table names with prd_.
Write multiple tables to a single table.
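
The following Python sketch imitates how these naming rules behave. It uses the standard string.Template class, which happens to share the ${...} placeholder syntax. The variable names SourceDatabaseName and TableName come from the default rule in this topic. The ods_ prefix, the all_orders table, and the sample database and table names are hypothetical.

```python
from string import Template


def destination_name(rule, database, table):
    """Fill a naming rule with the source database name and source table name."""
    return Template(rule).substitute(SourceDatabaseName=database, TableName=table)


# Default rule: ${SourceDatabaseName}_${TableName}
print(destination_name("${SourceDatabaseName}_${TableName}", "shop", "orders"))

# Prefix: a constant placed in front of the source table name.
print(destination_name("ods_${TableName}", "shop", "orders"))

# Uniform string replacement: transform the variable value before it is used.
print(destination_name("${TableName}", "shop", "dev_orders".replace("dev_", "prd_")))

# Many-to-one: a rule without variables maps every source table to one name.
merged = {destination_name("all_orders", "shop", t) for t in ["orders_01", "orders_02"]}
print(merged)
```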

Edit field type mapping (Optional)

The system has a default mapping between source and destination field types. You can click Edit Mapping of Field Data Types in the upper-right corner of the table to customize the field type mapping between source and destination tables. After configuration, click Apply and Refresh Mapping.

When editing field type mapping, ensure that the type conversion rules are correct. Otherwise, type conversion may fail, leading to dirty data and affecting task execution.
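
To see why an incorrect type mapping produces dirty data, consider the following Python sketch. It is a simplified illustration, not the conversion logic that Data Integration uses: each value is cast to the destination type, and any value that cannot be cast is set aside as dirty. The sample values are hypothetical.

```python
def convert_values(values, caster):
    """Cast each value with caster; collect values that fail as dirty data."""
    converted, dirty = [], []
    for value in values:
        try:
            converted.append(caster(value))
        except (TypeError, ValueError):
            dirty.append(value)
    return converted, dirty


# A string column mapped to an integer destination type.
converted, dirty = convert_values(["1", "42", "n/a"], int)
print(converted, dirty)
```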

Edit destination table schema (Optional)

The system automatically generates the destination table schema based on the source table schema. In most scenarios, no manual intervention is needed. If special handling is required, you can customize it in the following ways:

Add a field to a single table: Click the button in the Destination Table Name column.
Add fields in batches: Select all tables to be synchronized, and at the bottom of the table, choose Batch Modify > Destination Table Schema - Batch Modify and Add Field.
Renaming columns is not supported.

Assign values to destination table fields

Standard fields are automatically mapped based on matching names between the source and destination tables. You need to manually assign values for partition fields and any new fields added in the previous step. Perform the following operations:

Assign values for a single table: Click the Configure button in the Value assignment column to assign values to the destination table fields.
Assign values in batches: At the bottom of the list, choose Batch Modify > Value assignment to assign values to the same fields in multiple destination tables in batches.

You can assign constants or variables. Switch between assignment modes using the icon. You can use constants or dynamically generate values using system-predefined variables and recurring schedule parameters. Both variables and recurring schedule parameters in the code will be automatically replaced when the task is scheduled.

Set source sharding column

In the source sharding column, you can select a field from the source table or choose Not Split from the drop-down list. When the sync task runs, it will be sharded into multiple tasks based on this field to enable concurrent, batched data reading.

We recommend using the table's primary key as the source sharding column. String, float, date, and other types are not supported.

Currently, the source sharding column is supported only when the source is MySQL.
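
This topic does not describe how the task is split internally. As a rough mental model, the following Python sketch divides an integer primary key range into contiguous, non-overlapping ranges that could be read concurrently. The function name, the id column, and the key bounds are all assumptions made for illustration.

```python
def split_key_range(min_key, max_key, shard_count):
    """Split the inclusive range [min_key, max_key] into contiguous shards."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if min_key > max_key:
        raise ValueError("min_key must not exceed max_key")
    total = max_key - min_key + 1
    shard_count = min(shard_count, total)  # never create empty shards
    base, extra = divmod(total, shard_count)
    ranges = []
    start = min_key
    for index in range(shard_count):
        size = base + (1 if index < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each range becomes the filter of one concurrent read on a hypothetical id column.
for low, high in split_key_range(1, 10, 3):
    print(f"id >= {low} AND id <= {high}")
```

A range split like this works only when the column values can be ordered and divided arithmetically, which is consistent with the recommendation to use the primary key.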

Others

Table type: MaxCompute supports standard tables and Delta Tables. If the destination table status is "To be created", you can select the table type when editing the destination table schema. The type of an existing table cannot be changed.
If the table type is Delta Table, you can define the Number of Table Buckets and Queryable Time for Historical Data.

For more information about Delta Tables, see Delta Table.

Recurring schedule

If incremental synchronization is set to Recurring, you need to configure the Recurring Schedule for the destination table. This includes the Scheduling Cycle, Scheduling Time, and Resource Group. The scheduling configuration for this sync task is consistent with the node scheduling configuration in Data Development. For more information about the parameters, see Node scheduling configuration.

Note

If a one-time sync task involves many tables, we recommend staggering the execution times when you configure the schedule to prevent task backlogs and resource contention.
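
One simple way to stagger execution times is to offset each table by a fixed gap. The following Python sketch computes such a plan so that you can enter the resulting times as the Scheduling Time of each table. The table names, start time, and gap are hypothetical.

```python
from datetime import datetime, timedelta


def staggered_times(table_names, first_run, gap_minutes):
    """Assign each table a start time, gap_minutes after the previous table."""
    start = datetime.strptime(first_run, "%H:%M")
    return {
        name: (start + timedelta(minutes=index * gap_minutes)).strftime("%H:%M")
        for index, name in enumerate(table_names)
    }


plan = staggered_times(["orders", "users", "payments"], "00:30", 10)
for table, run_at in plan.items():
    print(table, run_at)
```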

Incremental condition

If the task needs to synchronize incremental data, you must configure an incremental condition. This condition determines which data each instance of a recurring schedule will synchronize.

Function and syntax
- Function: The incremental condition is essentially a WHERE clause that filters the source data.
- Syntax: When configuring the condition, you only need to enter the conditional expression that follows WHERE. Do not include the WHERE keyword itself.
Use scheduling parameters to achieve incremental synchronization
To implement periodic incremental synchronization, you can use scheduling parameters in the incremental condition. For example, you can set the condition to STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), INTERVAL 1 DAY) to synchronize the newly generated data from the previous day.
Write to a specific partition
By combining the incremental condition with the destination table's partition field, you can ensure that each batch of incremental data is written to the correct partition.
For example, with the incremental condition mentioned previously, you can set the partition field to ds=${bizdate} and set the destination table to be partitioned by day. This way, each day's instance will synchronize only the data from the corresponding date at the source and write it to the partition with the same name in the destination table.
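
The following Python sketch shows how the incremental condition and the partition value resolve for one scheduled instance. It assumes that ${bizdate} resolves to the day before the run date in yyyymmdd format, and it replaces the placeholder with plain string substitution. The orders table and the columnName column are placeholders, not real objects.

```python
from datetime import date, timedelta

CONDITION = (
    "STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName "
    "AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), INTERVAL 1 DAY)"
)
PARTITION = "ds=${bizdate}"


def render(text, run_date):
    """Replace ${bizdate} with the day before run_date (assumed semantics)."""
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return text.replace("${bizdate}", bizdate)


run_date = date(2024, 3, 15)
# The condition is appended after WHERE; the keyword itself is not part of it.
print("SELECT * FROM orders WHERE " + render(CONDITION, run_date))
print(render(PARTITION, run_date))
```

Because the condition and the partition value are rendered from the same parameter, the rows selected for a given day and the partition they are written to stay aligned.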

Important

By properly combining the time range from the incremental condition, the time interval for partition generation, and the scheduling cycle from the recurring schedule, you can create an automated T+n incremental ETL pipeline. In this pipeline, business rules strictly align with physical partitions.

7. Other configurations

Alert configuration

After the task runs, a scheduling task is generated in the Operation Center. To avoid business data synchronization delays caused by task errors, you can set an alert policy for the sync task.

In the Tasks list, find the running sync task. In the Actions column, click More > Edit to open the task configuration page.
Click Next, and then click Configure Alert Rule in the upper-right corner of the page to open the alert settings page.
In the Scheduling Information column, click the generated scheduling task to open the task details page in the Operation Center and retrieve the Task ID.
In the navigation pane on the left of the Operation Center, click Node Alarm > Alarm > Rule Management to open the rule management page.
Click Create Custom Rule. Set the Rule Object, Trigger Condition, and Alert Details. For more information, see Rule management.
You can search for the retrieved Task ID in the Rule Object section to find the target task and set an alert for it.

Resource group configuration

You can click Configure Resource Group in the upper-right corner of the interface to view and switch the resource group used by the current task.

Advanced parameter configuration

To fine-tune the task for custom synchronization requirements, modify its advanced parameters.

Click Configure Advanced Parameters in the upper-right corner of the interface to open the advanced parameter configuration page.
Modify the parameter values according to the prompts. The meaning of each parameter is explained after its name.

Important

Modify these parameters only if you fully understand their meanings to avoid unexpected issues such as task latency, excessive resource usage that blocks other tasks, or data loss.

8. Run the sync task

After you finish the configuration, click Complete at the bottom of the page.
On the Data Integration > Synchronization Task page, find the created sync task and click Deploy in the Operation column. If you select Start immediately after deployment, the task will execute immediately after you click Confirm; otherwise, you will need to start it manually.
Note
Data Integration tasks must be deployed to the production environment to run. Therefore, both newly created and edited tasks must be deployed to take effect.
In the Tasks list, click the Name/ID of the task to view the execution details.

Edit a task

On the Data Integration > Synchronization Task page, find the created sync task. In the Operation column, click More, and then click Edit to modify the task information. The steps are the same as for configuring a new task.
For a task that is not in the running state, you can directly modify the configuration and save it. The changes take effect after you publish the task.
For a task that is running, when you edit and publish it, if you do not select Start immediately after deployment, the original action button changes to Apply Updates. You must click this button for the changes to take effect in the production environment.
After you click Apply Updates, the system performs three steps on the changed content: Stop, Publish, and Restart.
- If you add a table:
  After you click Apply Updates, a sync subtask is added for the new table. The schema migration and one-time full synchronization for this subtask start immediately. Then, incremental synchronization proceeds according to the schedule.
- If you switch the destination table, which is equivalent to deleting the old table and adding a new one:
  After you click Apply Updates, the subtask for the old table is deleted and a new subtask for the new table is generated. The schema migration and one-time full synchronization for the new subtask start immediately. The new subtask then proceeds with incremental synchronization according to the schedule.
- If you modify other information:
  The schema migration and one-time full synchronization for the table are not affected. New instances that are generated for incremental synchronization will use the updated configuration. Instances that are already generated are not affected.
Unmodified tables are not affected and will not be re-run.
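
The effect of Apply Updates on each table can be restated as a lookup from the kind of change to the resulting actions. The following Python sketch only summarizes the rules above. The change labels and action strings are hypothetical names chosen for readability.

```python
NEW_SUBTASK = [
    "create subtask",
    "schema migration",
    "one-time full synchronization",
    "scheduled incremental synchronization",
]


def plan_update(change):
    """Return the actions that follow Apply Updates for one table."""
    if change == "table_added":
        return list(NEW_SUBTASK)
    if change == "destination_switched":
        # Equivalent to deleting the old table and adding a new one.
        return ["delete old subtask"] + NEW_SUBTASK
    if change == "other_modified":
        return ["new incremental instances use the updated configuration"]
    if change == "unmodified":
        return []  # not affected and not re-run
    raise ValueError(f"unknown change type: {change}")


print(plan_update("destination_switched"))
```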

View a task

After creating a sync task, you can view the list of created sync tasks and their basic information on the sync task page.

In the Actions column, you can Start or Stop a sync task. Under More, you can perform operations such as Edit and View.
For a running task, you can view its basic status in the Execution Overview section. You can also click the summary area to view execution details.

What to do next

After the task starts, you can click the task name to view its running details and perform task O&M and tuning.

FAQ

For frequently asked questions about offline full database sync tasks, see FAQ about full and incremental sync tasks.
