
DataWorks:Synchronize data

Last Updated:Mar 31, 2026

This tutorial walks you through synchronizing data from two heterogeneous sources — a MySQL database and an OSS flat file — into MaxCompute using DataWorks Data Integration. By the end, the ods_user_info_d and ods_raw_log_d tables in MaxCompute will be populated and ready for downstream processing.

In this tutorial, you will:

  1. Add MySQL and HttpFile data sources to your DataWorks workspace.

  2. Build a sync pipeline with three nodes: a Zero Load Node and two Batch Synchronization nodes.

  3. Configure each sync task, including the destination table DDL, partition parameters, and dirty data policy.

  4. Run the pipeline and verify that data is written to MaxCompute.

Prerequisites

Before you begin, make sure you have:

  • Completed Prepare the environment to set up your DataWorks workspace, Serverless resource group, and public network access

  • A Serverless resource group attached to your workspace. If none is attached, follow the prompts in the connection configuration section: click Purchase, then Associate Purchased Resource Group.

Step 1: Create data sources

Add the following two data sources to your DataWorks workspace so that subsequent sync tasks can reach the raw data.

If you have already created MySQL and HttpFile data sources for other user profile analysis tutorials, skip this step.

Create a MySQL data source

The basic user information for this tutorial is stored in a MySQL database. Create a MySQL data source named user_behavior_analysis_mysql to pull the ods_user_info_d table into MaxCompute.

  1. Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane, click Data Sources.

  3. Click Add Data Source, then search for and select MySQL.

  4. On the Add MySQL Data Source page, configure the parameters below. Use these values for both the development and production environments. Keep the default values for any parameters not listed.

    • Data Source Name: user_behavior_analysis_mysql
    • Data Source Description: This data source is for DataWorks tutorials. Read data from this data source when you configure an offline sync task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it.
    • Configuration Mode: Connection String Mode
    • Connection Address: Host IP rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com, port 3306
    • Database Name: workshop
    • Username: workshop
    • Password: workshop#2017
    • Authentication Method: No authentication
  5. In the Connection Configuration section, click Test Network Connectivity for both environments. Confirm that the status shows Connected.

    Important

    The resource group must be attached to the workspace with public network access enabled. Otherwise, data synchronization fails.

  6. Click Complete Creation.

Create an HttpFile data source

The user website access logs for this tutorial are stored in OSS (Object Storage Service). Create an HttpFile data source named user_behavior_analysis_httpfile to pull user_log.txt into MaxCompute.

  1. In the left-side navigation pane, click Data Sources.

  2. Click Add Data Source, then search for and select HttpFile.

  3. On the Add HttpFile Data Source page, configure the parameters below. Use these values for both the development and production environments. Keep the default values for any parameters not listed.

    • Data Source Name: user_behavior_analysis_httpfile
    • Data Source Description: This data source is for DataWorks tutorials. Read data from this data source when you configure an offline sync task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it.
    • URL: https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com
  4. In the Connection Configuration section, click Test Network Connectivity for both environments. Confirm that the status shows Connected.

    Important

    The resource group must be attached to the workspace with public network access enabled. Otherwise, data synchronization fails.

  5. Click Complete Creation.

Step 2: Build a sync pipeline

Build a workflow that orchestrates the two sync tasks and runs daily on a schedule.

  1. Click the icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Then switch to the workspace created for this tutorial.

  2. In the left-side navigation pane, click the Data Development icon to open the Data Development page. In the Workspace Directories area, click the create icon, select Create Workflow, and name the workflow user_profile_analysis.

  3. On the workflow canvas, drag one Zero Load Node and two Batch Synchronization nodes from the left pane onto the canvas. Name them as follows:

    • workshop_start (Zero Load Node): Dry-run ancestor node that manages the entire workflow. No code editing required.
    • ods_user_info_d (Batch Synchronization): Syncs basic user information from MySQL to the ods_user_info_d table in MaxCompute.
    • ods_raw_log_d (Batch Synchronization): Syncs user website access logs from OSS to the ods_raw_log_d table in MaxCompute.
  4. Drag connections to set workshop_start as the ancestor node of both Batch Synchronization nodes.

  5. Configure workflow scheduling. On the canvas, click Scheduling in the right-side pane and set the following parameters. Keep the default values for any parameters not listed.

    • Scheduling Parameters: bizdate=$[yyyymmdd-1], which passes the previous day's date to all inner nodes
    • Scheduling Cycle: Day
    • Scheduling Time: 00:30
    • Scheduling Dependencies: Click Use Workspace Root Node to attach the workflow to the root node (format: workspace_name_root).
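The $[yyyymmdd-1] expression resolves to the day before the scheduled run date, formatted as yyyymmdd. The following is a minimal Python sketch of that resolution; the function name and sample dates are illustrative, and DataWorks performs this substitution for you at run time:

```python
from datetime import date, timedelta

def resolve_bizdate(run_date: date) -> str:
    """Mimic DataWorks' $[yyyymmdd-1]: the day before the run date, as yyyymmdd."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")

# A run scheduled for 2025-02-24 targets the 20250223 partition.
print(resolve_bizdate(date(2025, 2, 24)))  # -> 20250223
```

Because the value is computed per run, each daily instance writes to its own dt partition without any manual edits.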

Step 3: Configure sync tasks

Configure the initial node

  1. On the workflow canvas, hover over workshop_start and click Open Node.

  2. In the right-side pane, click Scheduling and set the following parameters. Keep the default values for any parameters not listed.

    • Scheduling Type: Dry-run scheduling
    • Resource Group: Select the Serverless resource group created in Prepare the environment.
    • Scheduling Dependencies: Click Use Workspace Root Node to trigger execution from the workspace root node (workspace_name_root).

Configure the user data sync task (ods_user_info_d)

  1. On the workflow canvas, hover over ods_user_info_d and click Open Node.

  2. Configure the source and destination:

    • Source > Data source type: MySQL
    • Source > Data source name: user_behavior_analysis_mysql
    • Resource Group: Select the Serverless resource group from Prepare the environment.
    • Destination > Data destination type: MaxCompute(ODPS)
    • Destination > Data source name: Select the MaxCompute resource attached in Prepare the environment, for example MaxCompute_Source.
  3. Click Next to configure the sync task details.

    1. Configure source and destination. Set the following parameters. Keep the default values for any parameters not listed.

      Data source:
      • Table: ods_user_info_d
      • Split Key: uid (the primary key of the ods_user_info_d table in MySQL)

      Destination:
      • Tunnel Resource Group: Common transmission resources (default). If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information, see Purchase and use an exclusive resource group for Data Transmission Service.
      • Table: Click Generate Destination Table Schema. In the Statement for Creating Table box, paste the following DDL and click Create Table:

         CREATE TABLE IF NOT EXISTS ods_user_info_d (
             uid STRING COMMENT 'User ID',
             gender STRING COMMENT 'Gender',
             age_range STRING COMMENT 'Age range',
             zodiac STRING COMMENT 'Zodiac sign'
         )
         PARTITIONED BY (
             dt STRING
         )
         LIFECYCLE 7;

      • Partition Information: ${bizdate}

      Using ${bizdate} lets you pass a constant during testing and a dynamic value during scheduled runs. For supported variable formats, see Scheduling Parameters.

    2. Confirm field mapping and channel control. DataWorks maps source fields to destination fields based on the field mapping configuration. Set Policy for Dirty Data Records to Disallow Dirty Data Records. Keep the default values for all other settings. For details, see Configure a sync task using the codeless UI.

  4. Configure debug parameters. In the right-side pane, click Debugging Configurations and set the following:

    • Resource Group: Select the Serverless resource group from Prepare the environment.
    • Script Parameters: Leave blank. When you run the workflow in Step 4, set Value Used in This Run to a constant such as 20250223. The job replaces ${bizdate} with this value.
  5. (Optional) Configure scheduling properties. In the right-side pane, click Scheduling. The workflow-level scheduling parameters already apply to this node; no separate configuration is required. For details, see Node scheduling.

  6. In the node toolbar, click Save.
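The Split Key (uid) configured for this task lets Data Integration divide the MySQL read into ranges that parallel readers scan concurrently. The sketch below illustrates the idea with hypothetical numeric key bounds; the actual splitter is internal to Data Integration and also handles skew and non-numeric keys:

```python
def split_ranges(min_key: int, max_key: int, shards: int):
    """Divide [min_key, max_key] into contiguous ranges, one per parallel reader."""
    step = (max_key - min_key + 1) // shards
    ranges = []
    lo = min_key
    for i in range(shards):
        # The last shard absorbs any remainder so the full key space is covered.
        hi = max_key if i == shards - 1 else lo + step - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Four readers each scan one slice, e.g. WHERE uid BETWEEN lo AND hi.
print(split_ranges(1, 100, 4))  # -> [(1, 25), (26, 50), (51, 75), (76, 100)]
```

This is why a primary key such as uid is a good split key: values are unique and roughly evenly distributed, so each reader gets a comparable share of rows.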

Configure the user log sync task (ods_raw_log_d)

  1. On the workflow canvas, hover over ods_raw_log_d and click Open Node.

  2. Configure the source and destination:

    • Source > Data source type: HttpFile
    • Source > Data source name: user_behavior_analysis_httpfile
    • Resource Group: Select the Serverless resource group from Prepare the environment.
    • Destination > Data destination type: MaxCompute(ODPS)
    • Destination > Data source name: Select the MaxCompute resource from Prepare the environment, for example MaxCompute_Source.
  3. Click Next to configure the sync task details.

    1. Configure source and destination. Set the following parameters. Keep the default values for any parameters not listed.

      Important

      After you configure the data source settings, click Confirm Data Structure to verify that DataWorks can read the log file correctly before proceeding.

      Data source:
      • File path: /user_log.txt
      • Text type: text
      • Column Delimiter: |
      • Advanced configuration > Skip Header: No

      Destination:
      • Tunnel Resource Group: Common transmission resources (default). If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information, see Purchase and use an exclusive resource group for Data Transmission Service.
      • Table: Click Generate Destination Table Schema. In the Statement for Creating Table box, paste the following DDL and click Create Table:

         CREATE TABLE IF NOT EXISTS ods_raw_log_d
         (
             col STRING
         )
         PARTITIONED BY
         (
             dt STRING
         )
         LIFECYCLE 7;

      • Partition Information: ${bizdate}
    2. Confirm field mapping and channel control. Set Policy for Dirty Data Records to Disallow Dirty Data Records. Keep the default values for all other settings. For details, see Configure a sync task using the codeless UI.

  4. Configure debug parameters. In the right-side pane, click Debugging Configurations and set the following:

    • Resource Group: Select the Serverless resource group from Prepare the environment.
    • Script Parameters: Leave blank. When you run the workflow in Step 4, set Value Used in This Run to a constant such as 20250223.
  5. (Optional) Configure scheduling properties. In the right-side pane, click Scheduling. The workflow-level parameters apply to this node automatically. For details, see Node scheduling.

  6. In the node toolbar, click Save.
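With Policy for Dirty Data Records set to Disallow Dirty Data Records, a single unwritable row (for example, a failed type conversion) fails the whole task rather than being silently skipped. The toy illustration below uses a hypothetical integer destination column to show the rule; in this tutorial the destination columns are all STRING:

```python
def write_rows(rows, max_dirty=0):
    """Fail the batch as soon as dirty rows exceed the allowed threshold."""
    dirty = 0
    for uid, age in rows:
        try:
            int(age)  # hypothetical: destination column expects an integer
        except ValueError:
            dirty += 1
            if dirty > max_dirty:
                raise RuntimeError(f"dirty record limit exceeded at uid={uid}")
    return dirty

# A clean batch writes with zero dirty records.
print(write_rows([("u1", "25"), ("u2", "31")]))  # -> 0
```

Setting max_dirty above zero would correspond to allowing a bounded number of dirty records, the alternative policy the codeless UI offers.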

Step 4: Sync data

  1. In the workflow toolbar, click Run. Set Value Used in This Run for the bizdate variable — for example, 20250223 — and click OK. Wait for the run to complete.

  2. Verify the results.

    1. In the left-side navigation pane of DataStudio, click the Data Development icon to open the Data Development page. In the personal folder area, create a file with the .sql extension.

    2. At the bottom of the page, confirm that the language mode is MaxCompute SQL.

    3. Run the following queries to count the records loaded into each table. Replace your_data_timestamp with the bizdate value used in the run (for example, 20250223).

         -- Replace your_data_timestamp with the actual bizdate value (e.g., 20250223)
         SELECT count(*) FROM ods_user_info_d WHERE dt='your_data_timestamp';
         SELECT count(*) FROM ods_raw_log_d WHERE dt='your_data_timestamp';
      • If both queries return a non-zero count, data synchronization is complete.

      • If a query returns zero, confirm that the Value Used in This Run matches the dt value in the query. To check, click the workflow, open Running History in the right-side pane, and click View in the Actions column to inspect the partition value (partition=[pt=xxx]) in the run log.
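The partition check above can also be done programmatically: extract the partition value from the run log and compare it against the dt used in your query. A small sketch, using a hypothetical log excerpt in the partition=[pt=xxx] format described above:

```python
import re

def partition_from_log(log_line: str) -> str:
    """Extract the partition value a sync run actually wrote to."""
    m = re.search(r"partition=\[pt=(\w+)\]", log_line)
    if not m:
        raise ValueError("no partition value found in log line")
    return m.group(1)

# Hypothetical run-log excerpt; the extracted value should match your query's dt.
print(partition_from_log("... writer ... partition=[pt=20250223] ..."))  # -> 20250223
```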

What you learned

In this tutorial, you:

  • Added a MySQL data source and an HttpFile data source to your DataWorks workspace.

  • Built a user_profile_analysis workflow with a Zero Load Node and two Batch Synchronization nodes.

  • Configured source and destination tables, partition parameters, and dirty data policy for each sync task.

  • Ran the pipeline and verified that data was written to the ods_user_info_d and ods_raw_log_d tables in MaxCompute.

What's next

Proceed to Process data to transform and analyze the synchronized data in MaxCompute.