DataWorks:Batch synchronization from a full MySQL database to an OSS data lake

Last Updated:Feb 26, 2026

Data Integration supports the offline synchronization of entire databases from sources such as AnalyticDB for MySQL 3.0, MySQL, Oracle, PolarDB, and PostgreSQL to OSS. This topic describes how to synchronize data from an entire MySQL database to an OSS data lake offline, using MySQL as the source and OSS as the destination.

Prerequisites

Procedure

1. Select a sync task type

  1. Go to the Data Integration page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

  2. In the navigation pane on the left, click Sync Task. At the top of the page, click Create Sync Task to open the sync task creation page. Configure the basic information.

    • Source And Destination: MySQL to OSS.

    • New Task Name: Enter a custom name for the sync task.

    • Synchronization Type: Offline Full Database.

    • Synchronization Steps: Select Full Synchronization and Incremental Synchronization.

2. Configure network and resources

  1. In the Network And Resource Configuration section, select a Resource Group for the sync task. You can allocate CUs for Task Resource Usage.

  2. For Source Data Source, select your MySQL data source. For Destination Data Source, select your OSS data source. Then, click Test Connectivity.

  3. After you confirm that the source and destination data sources are connected, click Next.

3. Configure basic destination settings

The offline full database synchronization to OSS supports multiple write formats, such as Paimon, Iceberg, CSV, text, Parquet, and ORC.

The configuration parameters vary based on the write format. Configure the parameters as required:

Paimon, Iceberg

  • Storage Path Selection: Select the OSS path where the data is stored after it is ingested into the data lake.

  • Select Metadatabase Auto-build Location: Select whether to automatically build a metadatabase in DLF.

    Note: You can build a metadatabase only in a DLF instance that is in the same region.

CSV, text

  • Destination Root Path: Format: di_to_oss_path/${bizdate}/. The scheduling parameter ${bizdate} indicates the data timestamp, which is the day before the scheduling date. For more information about scheduling parameters, see Supported formats for scheduling parameters.

    Note: When you use the scheduling parameter $bizdate, it must be included in the Recurring Schedule Parameter Configuration of the Full And Incremental Control settings.

  • Column Delimiter: The character that separates fields in your data, such as a comma (,). If the delimiter is not a visible character, enter its Unicode escape, such as \u001b or \u007c.

  • Prefix Conflict: The action to take when the destination already contains an object with the same prefix as the object to be written:

    • Replace Original File: Deletes the files with the same prefix and creates new files to replace them.

    • Keep Original File: Keeps the original files and creates new files named with the original filenames plus a random suffix.

    • Exit With An Error: The sync task stops and reports an error.

  • Output Table Header: Specify whether to write the table header as the first row of the output.
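The data-timestamp arithmetic behind the Destination Root Path can be sketched as follows. This is a minimal illustration, not DataWorks code: the function name and template handling are assumptions, but it shows how ${bizdate} (the day before the scheduling date, formatted as yyyymmdd) expands into the path.

```python
from datetime import date, timedelta

def resolve_destination_root(template: str, scheduling_date: date) -> str:
    """Expand the ${bizdate} scheduling parameter in a destination root path.

    The data timestamp (bizdate) is the day before the scheduling date,
    formatted as yyyymmdd.
    """
    bizdate = (scheduling_date - timedelta(days=1)).strftime("%Y%m%d")
    return template.replace("${bizdate}", bizdate)

# A task run scheduled on 2023-01-02 writes data stamped with the previous day.
print(resolve_destination_root("di_to_oss_path/${bizdate}/", date(2023, 1, 2)))
# di_to_oss_path/20230101/
```

A run scheduled on 2023-01-02 therefore writes under di_to_oss_path/20230101/.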

Parquet, ORC

  • Destination Root Path: Format: di_to_oss_path/${bizdate}/. The scheduling parameter ${bizdate} indicates the data timestamp, which is the day before the scheduling date. For more information about scheduling parameters, see Supported formats for scheduling parameters.

    Note: When you use the scheduling parameter $bizdate, it must be included in the Recurring Schedule Parameter Configuration of the Full And Incremental Control settings.

  • Prefix Conflict: The action to take when the destination already contains an object with the same prefix as the object to be written:

    • Replace Original File: Deletes the files with the same prefix and creates new files to replace them.

    • Keep Original File: Keeps the original files and creates new files named with the original filenames plus a random suffix.

    • Exit With An Error: The sync task stops and reports an error.

4. Select the databases and tables to synchronize

In the Source Table section, select the tables to sync from the source data source, and then click the move icon to add them to the Selected Tables list.

5. Configure full and incremental control

  1. Configure the full and incremental sync type for the task.

    • If you select both Full initialization and Incremental synchronization as the Synchronization Mode, the task defaults to a one-time full sync followed by recurring incremental syncs. This setting cannot be changed.

    • If you select only Full initialization, you can configure the task as a one-time or recurring full sync.

    • If you select only Incremental synchronization, you can configure the task as a one-time or recurring incremental sync.

      Note

      The following steps use a one-time full sync and recurring incremental sync task as an example.

  2. Configure recurring schedule parameters.

    If you want the task to run on a recurring schedule, click Configure Scheduling Parameters for Periodical Scheduling.

6. Configure destination table mapping

After you select the tables to synchronize, they are automatically displayed on the current page. By default, the object file properties are in the Mapping To Be Refreshed state. You must define and confirm the mapping between the source tables and the destination object files, which determines how data is read and written. You can refresh the mapping directly, or customize the object file rules before you click Refresh Mapping.

Note
  • You can select the tables to synchronize and click Batch Refresh Mapping. If no mapping rule is configured, the default naming convention for the destination OSS object is ${Source Table Name}/data_${Data Timestamp}.

  • Because recurring scheduling is required, you must define the properties for the recurring scheduling task. These properties include Scheduling Cycle, Rerun Property, and Scheduling Resource Group. The scheduling configuration for this synchronization is consistent with the node scheduling configuration in Data Studio. For more information about the parameters, see Node scheduling configuration.

  • Based on the selected Sync Step, set the Incremental Condition and Full Condition. These conditions apply a WHERE clause to filter the source data. Enter only the content of the clause, not the WHERE keyword. If you enable a recurring schedule, you can use system parameters.

  • In the Custom Destination Path Mapping and Custom Destination Filename Mapping columns, click Configure to customize the storage path and naming convention for the destination OSS objects. For more information, see Appendix: Description of destination OSS file paths and names.
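The Incremental Condition and Full Condition described above are WHERE-clause bodies entered without the WHERE keyword. The following is a hedged sketch of how such a condition might look and how it would be embedded into a generated query; the column names gmt_create and gmt_modified are hypothetical, not part of this product's configuration.

```python
# Hypothetical filter conditions; enter only the clause body in the console,
# because the WHERE keyword is implied. Column names are assumptions.
full_condition = "gmt_create < '${bizdate}'"
incremental_condition = "gmt_modified >= '${bizdate}'"

def build_query(table: str, condition: str) -> str:
    """Illustrate how a condition body is embedded into a generated SELECT."""
    return f"SELECT * FROM {table} WHERE {condition}"

print(build_query("orders", incremental_condition))
# SELECT * FROM orders WHERE gmt_modified >= '${bizdate}'
```

With a recurring schedule, the ${bizdate} placeholder is resolved by the scheduling system at run time, so each incremental run picks up only rows changed since the data timestamp.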

1. Edit mapping of field data types

A sync task maps source field types to destination field types by default. To customize this mapping, click Edit Mapping of Field Data Types in the upper-right corner of the table. After you configure the mapping, click Apply and Refresh Mapping.

2. Add fields to the object file and assign values

You can add new fields to the object file that are not in the original table schema. To do this, perform the following steps:

  • Add a field and assign a value for a single table: Click Configure in the Add Field To Object File column. On the Add Field page, click Add Field to add a field to the object file and assign a value to it.

  • Assign values in batches: Select multiple tables. At the bottom of the list, choose Batch Modify > Add Field To Object File to add the same field to the destination tables and assign values in batches.

    Note

    You can assign constants and variables. Click the switch icon to change the assignment mode.

3. Custom advanced parameters

For fine-grained control over the task, click Configure in the Custom Advanced Parameters column.

Important

Modify these parameters only if you fully understand what they do. Incorrect settings can cause unexpected errors or data quality issues.

4. Set the source chunking column

In the Source Chunking Column drop-down list, select a field from the source table, or select Do Not Chunk.
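Conceptually, a chunking column lets the reader split the source table into value ranges that can be read in parallel. The following sketch is illustrative only (the actual chunking algorithm is internal to Data Integration); it assumes a numeric chunking column and splits its MIN..MAX value range into half-open intervals.

```python
def chunk_ranges(min_id: int, max_id: int, chunks: int):
    """Conceptual sketch: split a numeric chunking column's value range
    into evenly sized half-open [lo, hi) intervals for parallel reads."""
    span = max_id - min_id + 1
    step = -(-span // chunks)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step, max_id + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# A table whose chunking column spans 1..10, read in 3 parallel chunks.
print(chunk_ranges(1, 10, 3))
# [(1, 5), (5, 9), (9, 11)]
```

Choosing Do Not Chunk corresponds to reading the whole table as a single range, which is simpler but forgoes parallelism.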

7. Configure advanced parameters

The sync task provides several parameters that you can modify as needed. For example, you can limit the maximum number of connections to prevent the sync task from exerting too much pressure on your production database.

Note

Modify these parameters only if you fully understand what they do. Incorrect settings can cause unexpected errors or data quality issues.

  1. In the upper-right corner of the page, click Configure Advanced Parameters to go to the advanced parameter configuration page.

  2. On the Configure Advanced Parameters page, modify the parameter values.

8. Configure the resource group

In the upper-right corner of the page, click Resource Group Configuration to view or switch the resource group for the current task.

9. Run the sync task

  1. After you finish the configuration, click Complete at the bottom of the page.

  2. On the Data Integration > Synchronization Task page, find the created sync task and click Deploy in the Operation column.

  3. In the task list, click the Name/ID of the task to view the execution details.

10. Configure alerts

After the task runs, a scheduled job is generated in the Operation Center. To prevent task errors from causing data sync latency, you can set an alarm policy for the sync task.

  1. In the task list, find the running sync task. In the Actions column, choose More > Edit to open the task editing page.

  2. Click Next. Then, click Configure Alert Rule in the upper-right corner of the page to open the alarm settings page.

  3. In the Scheduling Information column, click the scheduled job to open the task details page in the Operation Center and retrieve the Task ID.

  4. In the navigation pane on the left of the Operation Center, choose Node Alarm > Alarm > Rule Management to go to the Rule Management page.

  5. Click Create Custom Rule and set Rule Object, Trigger Condition, and Alert Details. For more information, see Rule management.

    In the Rule Object field, search for the target task using the obtained Task ID and set an alert.

Sync task O&M

View task running status

After you create a sync task, you can view the list of created sync tasks and their basic information on the sync task page.


  • In the Operation column, you can Start or Stop the sync task. From the More menu, you can perform other operations, such as Edit and View.

  • For a running task, you can view its basic running status in the Execution Overview section. You can also click the corresponding overview area to view execution details.

    In an offline full database synchronization task from MySQL to OSS:

    • If your task's synchronization step is Full Synchronization, schema migration and full synchronization are displayed.

    • If your task's synchronization step is Incremental Synchronization, schema migration and incremental synchronization are displayed.

    • If your task's synchronization steps are Full Synchronization + Incremental Synchronization, schema migration, full synchronization, and incremental synchronization are displayed.

Rerun a task

  • Click Rerun to rerun the task without changing the task configuration.

    Effect: This operation reruns a one-time task or updates the properties of a recurring task.

  • To rerun a task after modifying it by adding or removing tables, edit the task and click Complete. The task status then changes to Apply Update. Click Apply Update to immediately trigger a rerun of the modified task.

    Effect: Only the new tables are synced. Tables that were previously synced are not synced again.

  • After you edit a task (for example, by changing a destination table name or switching to a different destination table) and click Complete, the available operation for the task changes to Apply Update. Click Apply Update to immediately trigger a rerun of the modified task.

    Effect: The modified tables are synced. Unmodified tables are not synced again.

Use cases

If you have downstream data dependencies and need to perform data development operations, you can refer to Node scheduling configuration to set the upstream and downstream nodes. The corresponding auto-triggered task node information can be viewed in the Recurring Configuration column.


Appendix: Description of final destination OSS file paths and names

DataWorks Data Integration provides custom rules for mapping the destination OSS path and destination OSS filename during step 6. Configure destination table mapping.

  • Built-in custom rule for mapping the destination OSS path: default_path_convert_rule.

    This rule uses the source database name as the destination OSS path. For example, if the source database name is di_ide_yufa, the storage path in OSS is di_ide_yufa.

  • Two built-in custom rules are available for mapping the destination OSS filename:

    • default_file_convert_rule_with_schedule_params: This rule is defined as ${srcTableName}/data_${bizdate}. The source table name ${srcTableName} is used as part of the OSS path, and the object file is named data_ followed by the value of the scheduling parameter ${bizdate}.

      Note

      For example, if the source table name is base_c_app_config and the scheduling date value is 20230101, the generated destination object name in OSS is base_c_app_config/data_20230101.

    • default_file_convert_rule: This rule is defined as ${srcTableName}/data. The source table name ${srcTableName} is used as part of the OSS path, and the default object file name is data.

      Note

      For example, if the source table name is base_c_app_config, the converted destination object name is base_c_app_config/data.

  • The final OSS file write path and filename are formed by concatenating the following three parts.

    • The destination root path.

    • The object file path that is obtained from the custom destination OSS path mapping.

    • The object file name that is obtained from the custom destination OSS filename mapping rule.

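Under the rules above, the three parts can be concatenated as in this illustrative sketch. The exact separator handling inside DataWorks is an assumption; the rule placeholders and example values come from this appendix.

```python
def final_oss_object(root: str, db_name: str, file_rule: str,
                     table: str, bizdate: str) -> str:
    """Assemble the final OSS object key from the three parts above:
    destination root path + path from default_path_convert_rule (the source
    database name) + filename from the filename mapping rule."""
    filename = (file_rule
                .replace("${srcTableName}", table)
                .replace("${bizdate}", bizdate))
    return "/".join(part.strip("/") for part in (root, db_name, filename))

print(final_oss_object("di_to_oss_path/20230101", "di_ide_yufa",
                       "${srcTableName}/data_${bizdate}",
                       "base_c_app_config", "20230101"))
# di_to_oss_path/20230101/di_ide_yufa/base_c_app_config/data_20230101
```

Swapping in default_file_convert_rule (${srcTableName}/data) would drop the trailing data timestamp, so recurring runs would overwrite or conflict on the same object name, which is where the Prefix Conflict setting applies.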