All Products
Search
Document Center

DataWorks:Synchronize data

Last Updated:Mar 26, 2026

In this tutorial, you build a data pipeline that pulls user behavior data from two sources — website access logs (via HttpFile) and basic user information (via MySQL) — syncs the data to an Object Storage Service (OSS) bucket, and creates E-MapReduce (EMR) Hive external tables to query the results. By the end, you have a scheduled workflow in DataWorks that runs automatically every day at 00:30.

Prerequisites

Before you begin, ensure that you have:

Step 1: Add data sources

Add three data sources to your DataWorks workspace:

Data source Type Purpose
user_behavior_analysis_httpfile HttpFile Read website access logs
user_behavior_analysis_mysql MySQL Read basic user information
test_g OSS Store the synced data

Add the HttpFile data source

  1. Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  3. In the upper-left corner of the Data Sources page, click Add Data Source. In the dialog box, click HttpFile.

  4. On the Add HttpFile Data Source page, configure the following parameters:

    Parameter Value
    Data Source Name user_behavior_analysis_httpfile
    Data Source Description Used as the source of a batch synchronization task to access website access log test data. Read-only.
    URL https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com (same for development and production environments)
  5. Find the resource group you want to use. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns, then confirm that Connected appears in both columns.

    Important

    At least one resource group must show Connected. Otherwise, you cannot use the codeless UI to configure a data synchronization task for this data source.

  6. Click Complete Creation.

Add the MySQL data source

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  2. In the dialog box, select MySQL.

  3. On the Add MySQL Data Source page, configure the following parameters:

    Parameter Value
    Data Source Name user_behavior_analysis_mysql
    Data Source Description Used as the source of a batch synchronization task to access basic user information test data. Read-only.
    Configuration Mode Connection String Mode
    Host IP Address rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com
    Port Number 3306
    Database Name workshop
    Username workshop
    Password workshop#2017
    Authentication Method No Authentication
  4. Find the resource group you want to use. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns, then confirm that Connected appears in both columns.

  5. Click Complete Creation.

Add the OSS data source

The MySQL data and the HttpFile access logs are both synced to a private OSS data source.

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  2. In the dialog box, select OSS.

  3. On the Add OSS Data Source page, configure the following parameters:

    Important

    The AccessKey secret is displayed only when you create it. Store it securely. If the AccessKey pair is leaked or lost, delete it and create a new AccessKey pair.

    Parameter Value
    Data Source Name test_g
    Data Source Description The description of the data source.
    Access Mode AccessKey Mode
    AccessKey ID Your AccessKey ID. Go to the AccessKey page to copy it.
    AccessKey Secret Your AccessKey secret.
    Endpoint http://oss-cn-shanghai-internal.aliyuncs.com
    Bucket dw-emr-demo (the OSS bucket you created during environment preparation)
  4. Find the resource group you want to use. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns, then confirm that Connected appears in both columns.

    At least one resource group must show Connected. Otherwise, you cannot use the codeless UI to configure a data synchronization task for this data source.
  5. Click Complete Creation.

Step 2: Create the workflow

  1. On the Data Sources page, click the 图标 icon in the upper-left corner and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, set Workflow Name to workshop_emr and click Create.

  4. Double-click the new workflow to open the configuration tab. Create three nodes on the canvas:

    1. Click Create Node and drag Zero-Load Node (in the General section) to the canvas. In the dialog box, set Name to workshop_start_emr and click Confirm.

    2. Click Create Node and drag Offline synchronization (in the Data Integration section) to the canvas. Create two nodes named ods_raw_log_d_2oss_emr and ods_user_info_d_2oss_emr. These nodes sync the HttpFile access logs and MySQL user information to OSS.

  5. Draw dependency lines from workshop_start_emr to both batch synchronization nodes, making it the ancestor node of both.

    Workflow dependency diagram

Step 3: Configure the nodes

Configure the initial node

The zero-load node workshop_start_emr controls when the workflow runs. It triggers the entire workflow daily at 00:30.

  1. In the Scheduled Workflow pane, double-click the workshop_start_emr node. In the right-side navigation pane, click Properties.

  2. Configure the scheduling properties:

    Section Setting
    Schedule Set the scheduling time to 00:30. Set Rerun to Allow Regardless of Running Status — this lets you re-run the workflow manually without waiting for it to finish or fail.
    Scheduling dependencies Configure workshop_start_emr as a descendant of the workspace root node (named in the Workspace name_root format). The root node triggers this workflow to run.
  3. Click the 保存 icon to save.

Configure the batch synchronization nodes

Sync MySQL user information to OSS

  1. On the DataStudio page, double-click the ods_user_info_d_2oss_emr node.

  2. Set the network connections between your resource group and the data sources: After completing the connection configuration, click Next and finish the connectivity test.

    Parameter Value
    Source MySQL / user_behavior_analysis_mysql
    Resource Group Your serverless resource group
    Destination OSS / test_g
  3. Configure the synchronization settings:

    Parameter Value
    Source > Table ods_user_info_d
    Source > Split key uid (primary key; only INTEGER fields are supported)
    Destination > Text type text
    Destination > Object Name (Path Included) ods_user_info_d/user_${bizdate}/user_${bizdate}.txtods_user_info_d is the folder you created in the OSS bucket; ${bizdate} resolves to the previous day's date.
    Destination > Column Delimiter |
  4. Click Properties in the right-side navigation pane and configure the scheduling settings:

    Section Setting
    Scheduling Parameter Click Add Parameter. Set Parameter Name to bizdate and Parameter Value to $[yyyymmdd-1]. This passes the previous day's date to the sync task at run time.
    Schedule Set Rerun to Allow Regardless of Running Status.
    Dependencies Make sure the output table is set to this node. The output table name follows the Workspace name.Node name format.
  5. Click the 保存 icon to save.

Sync HttpFile access logs to OSS

  1. On the DataStudio page, double-click the ods_raw_log_d_2oss_emr node.

  2. Set the network connections between your resource group and the data sources: After completing the connection configuration, click Next and finish the connectivity test.

    Parameter Value
    Source HttpFile / user_behavior_analysis_httpfile
    Resource Group Your serverless resource group
    Destination OSS / test_g
  3. Configure the synchronization settings:

    Parameter Value
    Source > File Path /user_log.txt
    Source > Text type text
    Source > Column Delimiter |
    Source > Compression format None
    Source > Skip Header No
    Destination > Text type text
    Destination > Object Name (Path Included) ods_raw_log_d/log_${bizdate}/log_${bizdate}.txtods_raw_log_d is the folder you created in the OSS bucket; ${bizdate} resolves to the previous day's date.
    Destination > Column Delimiter |
  4. Click Properties in the right-side navigation pane and configure the scheduling settings:

    Section Setting
    Scheduling Parameter Click Add Parameter. Set Parameter Name to bizdate and Parameter Value to $[yyyymmdd-1]. This passes the previous day's date to the sync task at run time.
    Schedule Set Rerun to Allow Regardless of Running Status.
    Dependencies Make sure the output table is set to this node. The output table name follows the Workspace name.Node name format.
  5. Click the 保存 icon to save.

Create EMR Hive tables

Create two EMR Hive external tables — ods_user_info_d_emr and ods_raw_log_d_emr — that point to the synced data in OSS. These tables let you query the synchronized data directly.

  1. In the Scheduled Workflow pane, click the workshop_emr workflow, right-click EMR, and then choose Create Node > EMR Hive.

  2. Create two nodes: ods_user_info_d_emr and ods_raw_log_d_emr. Then draw dependency lines on the workflow configuration tab so both EMR Hive nodes depend on the corresponding batch synchronization nodes.

    EMR Hive node dependencies

  3. Configure the ods_user_info_d_emr node:

    1. Double-click the node. On the configuration tab, enter the following SQL:

      CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
      (
          `uid` STRING COMMENT 'The user ID',
          `gender` STRING COMMENT 'The gender',
          `age_range` STRING COMMENT 'The age range',
          `zodiac` STRING COMMENT 'The zodiac sign'
      ) PARTITIONED BY (
        dt STRING
      )
      ROW FORMAT delimited fields terminated by '|'
      LOCATION 'oss://dw-emr-demo/ods_user_info_d/';
      
      ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
      LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
      Note

      The LOCATION path matches the Object Name (Path Included) value you set in the ods_user_info_d_2oss_emr synchronization node. dw-emr-demo is the OSS bucket you created during environment preparation.

    2. Click Properties and configure the scheduling settings:

      Section Setting
      Scheduling Parameter Set Parameter Name to bizdate, Parameter Value to $[yyyymmdd-1].
      Schedule Set Rerun to Allow Regardless of Running Status.
      Dependencies Set the output table to this node. The output table name follows the Workspace name.Node name format.
      Resource Group Select your serverless resource group.
    3. Click the image icon to save and run the node.

  4. Configure the ods_raw_log_d_emr node:

    1. Double-click the node. On the configuration tab, enter the following SQL:

      -- Create the table used to store website access logs.
      CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
      (
        `col` STRING
      ) PARTITIONED BY (
        dt STRING
      );
      ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
      LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
      Note

      The LOCATION path matches the Object Name (Path Included) value you set in the ods_raw_log_d_2oss_emr synchronization node. dw-emr-demo is the OSS bucket you created during environment preparation.

    2. Click Properties and configure the scheduling settings:

      Section Setting
      Scheduling Parameter Set Parameter Name to bizdate, Parameter Value to $[yyyymmdd-1].
      Schedule Set Rerun to Allow Regardless of Running Status.
      Dependencies Set the output table to this node. The output table name follows the Workspace name.Node name format.
      Resource Group Select your serverless resource group.
    3. Click the image icon to save and run the node.

Step 4: Run the workflow and verify results

Run the workflow

On the DataStudio page, double-click the workshop_emr workflow under Business Flow. On the workflow configuration tab, click the image.png icon in the toolbar to run all nodes in dependency order.

Check node status

  • Nodes in the image.png state are running normally.

  • If a node shows image and the error "java.net.ConnectException: Connection timed out (Connection timed out)" appears, add a security group rule in the ECS console. Use port 10000 of your ECS instance to connect to DataWorks, and set Authorization Object to the CIDR block of the vSwitch associated with your resource group. To find the vSwitch CIDR block, go to the Resource Groups page, find your resource group, click Network Settings, and check the VPC Binding tab. For details, see Add a security group rule.

Verify the sync results

To confirm data was written successfully, right-click the ods_user_info_d_emr or ods_raw_log_d_emr node and select View Logs. If the sync completed, the logs show a result similar to:

Sync completion log output

Then create an ad hoc query to count the imported records.

  1. In the left-side navigation pane of the DataStudio page, click the image.png icon. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > EMR Hive.

  2. Run the following queries to confirm the record counts. Replace Data timestamp with the actual partition date — if the workflow ran on November 7, 2019, the data timestamp is 20191106 (one day before).

    SELECT * FROM ods_user_info_d_emr WHERE dt = Data timestamp;
    SELECT * FROM ods_raw_log_d_emr WHERE dt = Data timestamp;

What's next

With data collection complete, proceed to the next tutorial to compute and analyze the synced data. See Process data.