
DataWorks: Synchronize data

Last Updated: Mar 27, 2026

Use Data Integration batch synchronization tasks to load data from MySQL and Object Storage Service (OSS) into MaxCompute. This tutorial walks through creating destination tables, registering data sources, configuring two synchronization nodes, and verifying results — covering the data collection phase of a user profile analysis workflow.

What you'll learn

By the end of this tutorial, you'll know how to:

  • Create partitioned MaxCompute tables to receive synchronized data

  • Register an ApsaraDB RDS for MySQL data source and an HttpFile data source in DataWorks

  • Configure a batch synchronization node for structured relational data (MySQL to MaxCompute)

  • Configure a batch synchronization node for raw log files (OSS to MaxCompute)

  • Set up daily scheduling with $bizdate parameterization

  • Verify synchronization results using an ad hoc query

Prerequisites

Before you begin, complete the environment preparation steps from the previous tutorial, including purchasing a serverless resource group and configuring its network connectivity.

Important

The test data and data sources used in this tutorial are pre-provisioned mock data for DataWorks experiments. The data can only be read through Data Integration and is not intended for production use.

Data synchronization overview

This tutorial synchronizes data from two sources into MaxCompute:

| Source type | Data | Source table or object | Destination table |
| --- | --- | --- | --- |
| MySQL | Basic user information | ods_user_info_d | ods_user_info_d_odps |
| HttpFile (OSS) | Website access logs | user_log.txt | ods_raw_log_d_odps |

Why batch synchronization? Batch synchronization is appropriate here because the source data accumulates throughout the day and is processed as a daily snapshot. If you need to process data within minutes of it arriving, use a real-time synchronization approach instead.

Go to DataStudio

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list, then click Go to Data Development.

Step 1: Design the workflow

Create a workflow

Development in DataWorks is organized around workflows. Before creating nodes, create a workflow named User profile analysis_MaxCompute. For instructions, see Create a workflow.


Add nodes to the workflow

After creating the workflow, the workflow canvas opens automatically. Click Create Node in the upper part of the canvas, drag nodes onto the canvas, and draw dependency lines between them. For the full workflow design, see the workflow design reference.


The workflow for this tutorial uses the following three nodes:

| Node classification | Node type | Node name | Purpose |
| --- | --- | --- | --- |
| General | Zero load node | workshop_start_odps | Triggers the workflow at 00:30 daily and manages the overall execution order |
| Data Integration | Batch synchronization | ods_user_info_d_odps | Syncs basic user information from MySQL to MaxCompute |
| Data Integration | Batch synchronization | ods_raw_log_d_odps | Syncs website access logs from OSS to MaxCompute |

Name each synchronization node after the destination table it populates.

Because no data lineage exists between the zero load node and the synchronization nodes, configure their dependencies by drawing lines in the workflow canvas. For details, see Scheduling dependency configuration guide.

All nodes in a DataWorks workflow require a parent node. Both synchronization nodes in this tutorial depend on workshop_start_odps, so they start running after that node completes.
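
The dependency rules above can be modeled as a small DAG in which a node becomes runnable only after all of its parents complete. This is an illustrative sketch of the scheduling behavior, not DataWorks internals:

```python
# Illustrative model of this tutorial's workflow: each node maps to its parents.
dag = {
    "workshop_start_odps": [],
    "ods_user_info_d_odps": ["workshop_start_odps"],
    "ods_raw_log_d_odps": ["workshop_start_odps"],
}

def runnable(done):
    """Return nodes not yet finished whose parents have all completed."""
    return [node for node, parents in dag.items()
            if node not in done and all(p in done for p in parents)]

print(runnable(set()))                    # only the zero load node can start
print(runnable({"workshop_start_odps"}))  # then both sync nodes run in parallel
```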

Configure scheduling for the zero load node

The workshop_start_odps zero load node triggers the entire workflow. Configure its scheduling properties as follows. Leave the scheduling properties of the two synchronization nodes at their defaults for now — you'll configure them in Step 2.

| Configuration item | Setting | Description |
| --- | --- | --- |
| Scheduling time | 00:30 | The workflow runs daily at 00:30. |
| Scheduling dependencies | Root node of the workspace | The zero load node has no parent nodes, so it depends on the workspace root node. |

For more information about scheduling options, see Configure scheduling time for nodes in a workflow in different scenarios and the scheduling overview.

Step 2: Configure data synchronization tasks

Create the destination MaxCompute tables

Create the MaxCompute tables that will receive the synchronized data before configuring the synchronization tasks.

  1. Go to the table creation entry point.


  2. Create the ods_raw_log_d_odps table. In the Create Table dialog box, enter ods_raw_log_d_odps in the Name field. Click DDL, paste the following statement, and click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the existing configuration.

    CREATE TABLE IF NOT EXISTS ods_raw_log_d_odps
    (
     col STRING
    )
    PARTITIONED BY
    (
     dt STRING
    )
    LIFECYCLE 7;
  3. Create the ods_user_info_d_odps table. In the Create Table dialog box, enter ods_user_info_d_odps in the Name field. Click DDL, paste the following statement, and click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the existing configuration.

    CREATE TABLE IF NOT EXISTS ods_user_info_d_odps (
     uid STRING COMMENT 'The user ID',
     gender STRING COMMENT 'The gender',
     age_range STRING COMMENT 'The age range',
     zodiac STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 7;
  4. Commit and deploy both tables. For each table, click Commit to Development Environment and then Commit to Production Environment on the configuration tab. This creates the physical tables in the MaxCompute projects associated with your workspace in each environment.

    After committing, you can view the tables in the MaxCompute project for the corresponding environment. Tables committed to the development environment appear in the development MaxCompute project; tables committed to the production environment appear in the production MaxCompute project.
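
The LIFECYCLE 7 clause in both DDL statements tells MaxCompute to reclaim a table partition that has not been modified for more than 7 days. A rough sketch of that rule, for intuition only (actual reclamation is handled by MaxCompute):

```python
from datetime import date

LIFECYCLE_DAYS = 7  # matches LIFECYCLE 7 in the DDL statements above

def is_expired(last_modified, today):
    """A partition untouched for more than LIFECYCLE_DAYS is eligible for reclamation."""
    return (today - last_modified).days > LIFECYCLE_DAYS

print(is_expired(date(2026, 3, 1), date(2026, 3, 10)))  # True: 9 days without writes
print(is_expired(date(2026, 3, 6), date(2026, 3, 10)))  # False: 4 days without writes
```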

Register data sources

Register the data sources before configuring synchronization tasks, so you can select them by name during task configuration.

Important

The test data for both data sources is hosted on the internet. Make sure an Internet NAT gateway is configured for your DataWorks resource group as described in Step 2 of environment preparation. Without this, connectivity tests will fail with timeout errors.

To navigate to the Data Sources page:

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list, then click Go to Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

Register the MySQL data source

Add the ApsaraDB RDS for MySQL data source (user_behavior_analysis_mysql) to access the basic user information table.

  1. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, click MySQL.

  3. On the Add MySQL Data Source page, configure the following parameters:

    Set Environment to Development and Production. A data source must exist in both environments; otherwise, tasks fail when they run against the production environment.
    | Parameter | Value |
    | --- | --- |
    | Data Source Name | user_behavior_analysis_mysql |
    | Configuration Mode | Connection String Mode |
    | Environment | Development and Production |
    | Host IP Address | rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com |
    | Port Number | 3306 |
    | Database Name | workshop |
    | Username | workshop |
    | Password | workshop#2017 |
    | Authentication Method | No Authentication |


  4. In the Connection Configuration section, find your serverless resource group and click Test Network Connectivity in the Connection Status column. Test connectivity for both the development and production environments. When the test succeeds, the status changes to Connected.

Register the HttpFile data source

Add the HttpFile data source (user_behavior_analysis_httpfile) to access the website access log file stored in OSS.

  1. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, click HttpFile.

  3. On the Add HttpFile Data Source page, configure the following parameters:

    Set Environment to both Development Environment and Production Environment to avoid task failures when running in production.
    | Parameter | Value |
    | --- | --- |
    | Data Source Name | user_behavior_analysis_httpfile |
    | Environment | Development Environment and Production Environment |
    | URL Domain | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com |
  4. In the Connection Configuration section, click Test Network Connectivity for your serverless resource group. Test both environments. When the test succeeds, the status changes to Connected.

Configure the MySQL to MaxCompute synchronization task

This task reads the ods_user_info_d table from MySQL and writes the data to the ods_user_info_d_odps MaxCompute table.

  1. Double-click the ods_user_info_d_odps batch synchronization node to open its configuration tab.

  2. Configure the connection. Set the source, resource group, and destination, then click Next and complete the connectivity test.

    | Parameter | Value |
    | --- | --- |
    | Source | Type: MySQL; Data Source Name: user_behavior_analysis_mysql |
    | Resource Group | The serverless resource group you purchased during environment preparation |
    | Destination | Type: MaxCompute; Data Source Name: the MaxCompute data source associated with your workspace |


  3. Configure the source and destination settings. Use the following settings. Leave all other parameters at their defaults. For field mapping and Channel Control settings, the defaults work for this tutorial. For other configuration options, see Configure a task in the codeless UI.

    | Section | Parameter | Value |
    | --- | --- | --- |
    | Source | Table | ods_user_info_d |
    | Source | Split key | uid (primary key; used to parallelize data reads) |
    | Destination | Tunnel Resource Group | Common transmission resources (default) |
    | Destination | Schema | default |
    | Destination | Table | ods_user_info_d_odps |
    | Destination | Partition information | ${bizdate} |
    | Destination | Write Mode | Clean up existing data before writing (Insert Overwrite) |
    | Destination | Write by Converting Empty Strings into Null | No |
  4. Configure scheduling properties. Click Properties in the right-side navigation pane. Use the following settings and leave all other parameters at their defaults.

    The node runs after workshop_start_odps completes each day. DataWorks uses the node output to resolve scheduling dependencies for downstream SQL nodes through automatic data lineage parsing.
    | Section | Parameter | Value |
    | --- | --- | --- |
    | Scheduling Parameter | Parameter Name / Value | bizdate / $bizdate (date of the previous day in yyyymmdd format) |
    | Schedule | Scheduling Cycle | Day |
    | Schedule | Scheduled time | 00:30 |
    | Schedule | Rerun | Allow Regardless of Running Status |
    | Resource Group | — | The serverless resource group from environment preparation |
    | Dependencies | Parent node | workshop_start_odps |
    | Dependencies | Node output | Verify that an output named <MaxCompute project name in the production environment>.ods_user_info_d_odps exists. Add it manually if it is missing. |
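
The split key setting above lets Data Integration divide the source table into contiguous key ranges and read them in parallel. The sketch below illustrates the idea only, assuming numeric uid values; the actual splitting is performed by the service:

```python
def split_ranges(min_id, max_id, num_splits):
    """Divide [min_id, max_id] into contiguous ranges, one per parallel reader."""
    step = (max_id - min_id + 1) // num_splits
    ranges, lo = [], min_id
    for i in range(num_splits):
        hi = max_id if i == num_splits - 1 else lo + step - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each range becomes a bounded read executed by one parallel worker:
queries = [f"SELECT * FROM ods_user_info_d WHERE uid BETWEEN {lo} AND {hi}"
           for lo, hi in split_ranges(1, 1000, 4)]
print(split_ranges(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```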

Configure the HttpFile to MaxCompute synchronization task

This task reads the user_log.txt log file from OSS via HttpFile and writes the raw log data to the ods_raw_log_d_odps MaxCompute table.

  1. Double-click the ods_raw_log_d_odps batch synchronization node to open its configuration tab.

  2. Configure the connection. Set the source, resource group, and destination, then click Next and complete the connectivity test.

    | Parameter | Value |
    | --- | --- |
    | Source | Type: HttpFile; Data Source Name: user_behavior_analysis_httpfile |
    | Resource Group | The serverless resource group you purchased during environment preparation |
    | Destination | Type: MaxCompute; Data Source Name: the MaxCompute data source associated with your workspace |


  3. Configure the source and destination settings. Use the following settings. Leave all other parameters at their defaults. After setting the source parameters, click Confirm Data Structure to verify that DataWorks can read the log file. For field mapping and Channel Control settings, the defaults work for this tutorial. For other configuration options, see Configure a task in the codeless UI.

    | Section | Parameter | Value |
    | --- | --- | --- |
    | Source | File Path | /user_log.txt |
    | Source | File Type | text |
    | Source | Column Delimiter | \| |
    | Source (Advanced) | Coding | UTF-8 |
    | Source (Advanced) | Compression format | None |
    | Source (Advanced) | Skip Header | No |
    | Destination | Tunnel Resource Group | Common transmission resources (default) |
    | Destination | Schema | default |
    | Destination | Table | ods_raw_log_d_odps |
    | Destination | Partition information | ${bizdate} |
    | Destination | Write Mode | Clean up existing data before writing (Insert Overwrite) |
    | Destination | Write by Converting Empty Strings into Null | No |
  4. Configure scheduling properties. Click Properties in the right-side navigation pane. Use the following settings and leave all other parameters at their defaults.

    | Section | Parameter | Value |
    | --- | --- | --- |
    | Scheduling Parameter | Parameter Name / Value | bizdate / $bizdate (date of the previous day in yyyymmdd format) |
    | Schedule | Scheduling Cycle | Day |
    | Schedule | Scheduled time | 00:30 |
    | Schedule | Rerun | Allow Regardless of Running Status |
    | Resource Group | — | The serverless resource group from environment preparation |
    | Dependencies | Parent node | workshop_start_odps |
    | Dependencies | Node output | Verify that an output named <MaxCompute project name in the production environment>.ods_raw_log_d_odps exists. Add it manually if it is missing. |
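
The File Type, Column Delimiter, and Skip Header settings above describe how the reader interprets each line of the text file. The following simplified model illustrates that behavior; the sample line is hypothetical, and field mapping decides which parsed fields reach the destination columns:

```python
def read_text_source(lines, delimiter="|", skip_header=False):
    """Split each line of a text file on the delimiter, optionally dropping a header row."""
    rows = lines[1:] if skip_header else lines
    return [row.split(delimiter) for row in rows]

# Hypothetical pipe-delimited access log line:
sample = ["58.246.10.82|2014-02-12 13:12:25|GET /course HTTP/1.1|200"]
print(read_text_source(sample))
```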

Step 3: Run the workflow and verify results

Run the workflow

  1. In the DataStudio left-side navigation pane, under Business Flow, double-click the User profile analysis_MaxCompute workflow.

  2. On the workflow canvas, click the Run icon in the top toolbar to trigger all nodes based on their scheduling dependencies.

Check node status

  • Node state indicator: the status icon on each node shows whether it ran successfully.

  • Logs: Right-click the ods_user_info_d_odps or ods_raw_log_d_odps node and select View Logs. A successful run shows a completion message in the log output.


Verify the synchronization results

When the workflow completes successfully:

  • All records from ods_user_info_d in ApsaraDB RDS for MySQL are written to the previous day's partition in workshop2024_01_dev.ods_user_info_d_odps.

  • All access log records from user_log.txt in OSS are written to the previous day's partition in workshop2024_01_dev.ods_raw_log_d_odps.

To confirm the row counts, run an ad hoc query:

  1. In the DataStudio left-side navigation pane, click the Ad Hoc Query icon. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > ODPS SQL.

  2. Run the following queries, replacing Data timestamp with the $bizdate value for your run (the date of the previous day in yyyymmdd format — for example, if the node ran on June 21, 2023, the data timestamp is 20230620):

    select count(*) from ods_user_info_d_odps where dt='Data timestamp';
    select count(*) from ods_raw_log_d_odps where dt='Data timestamp';


Nodes run in DataStudio execute in the development environment by default. Query results reflect data in the workshop2024_01_dev MaxCompute project.
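
The data timestamp arithmetic used above can be sketched as follows: $bizdate resolves to the day before the scheduled run date, in yyyymmdd format, which is also the value substituted into the ${bizdate} partition expression.

```python
from datetime import date, timedelta

def resolve_bizdate(run_date):
    """$bizdate is the day before the scheduled run date, formatted as yyyymmdd."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")

# A node that ran on June 21, 2023 wrote to partition dt=20230620:
print(resolve_bizdate(date(2023, 6, 21)))  # 20230620
```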

What's next

Data collection is complete. The next tutorial covers processing the user information and access logs you just synchronized into MaxCompute. See Process data.