
E-MapReduce: Synchronize data

Last Updated: Mar 26, 2026

This tutorial walks you through adding data sources to DataWorks, configuring batch synchronization tasks, and using E-MapReduce (EMR) Hive nodes to create tables and query the synchronized data.

In this tutorial, you will:

  1. Add three data sources (HttpFile, MySQL, OSS) to a DataWorks workspace

  2. Create a workflow with two batch synchronization nodes

  3. Configure the nodes to sync user data and access logs to OSS

  4. Create EMR Hive external tables that map to the OSS data

  5. Run the workflow and verify the results with ad hoc queries

Nodes created in this tutorial:

| Node name | Type | Purpose |
| --- | --- | --- |
| workshop_start_emr | Zero-Load Node | Triggers the workflow daily |
| ods_user_info_d_2oss_emr | Offline synchronization | Syncs MySQL user data to OSS |
| ods_raw_log_d_2oss_emr | Offline synchronization | Syncs HttpFile access logs to OSS |
| ods_user_info_d_emr | EMR Hive | Creates an external table for user data |
| ods_raw_log_d_emr | EMR Hive | Creates an external table for access logs |

Prerequisites

Before you begin, make sure you have:

  • Prepared the environment: a DataWorks workspace associated with an EMR cluster, and an OSS bucket named dw-emr-demo

  • A serverless resource group with network access to the data sources

  • An AccessKey pair to use for the OSS data source

Step 1: Add data sources

Add three data sources to your DataWorks workspace: an HttpFile source for website access logs, a MySQL source for user information, and an OSS destination to store the synced data.

Add an HttpFile data source

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select your workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.

  3. On the Add HttpFile Data Source page, configure the following parameters. Use the sample values for both the development and production environments.

    | Parameter | Value |
    | --- | --- |
    | Data Source Name | user_behavior_analysis_httpfile |
    | Data Source Description | This data source is read-only in data synchronization scenarios. It is used as the source of a batch synchronization task to access the provided test data. |
    | URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com |
  4. Find a desired resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.

    Important

    At least one resource group must show Connected. Otherwise, you cannot use the codeless UI to configure a data synchronization task for this data source.

  5. Click Complete Creation.

Add a MySQL data source

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  2. In the Add Data Source dialog box, select MySQL.

  3. On the Add MySQL Data Source page, configure the following parameters. Use the sample values for both environments.

    | Parameter | Value |
    | --- | --- |
    | Data Source Name | user_behavior_analysis_mysql |
    | Data Source Description | This data source is read-only in data synchronization scenarios. It is used as the source of a batch synchronization task to access the provided test data. |
    | Configuration Mode | Connection String Mode |
    | Connection Address | Host IP address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com; Port: 3306 |
    | Database Name | workshop |
    | Username | workshop |
    | Password | workshop#2017 |
    | Authentication Method | No Authentication |
  4. Find a desired resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.

  5. Click Complete Creation.

Add an OSS data source

This OSS data source serves as the destination for both synchronization tasks: user information from MySQL and access logs from HttpFile are written here.

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  2. In the Add Data Source dialog box, select OSS.

  3. On the Add OSS Data Source page, configure the following parameters.

    Important

    The AccessKey secret is shown only at creation time. If it is lost or compromised, delete the key pair and create a new one.

    | Parameter | Value |
    | --- | --- |
    | Data Source Name | test_g |
    | Access Mode | AccessKey Mode |
    | AccessKey ID | Your AccessKey ID. Go to the AccessKey page to copy it. |
    | AccessKey secret | Your AccessKey secret. |
    | Endpoint | http://oss-cn-shanghai-internal.aliyuncs.com |
    | Bucket | dw-emr-demo (the bucket you created when preparing the environment) |
  4. Find a desired resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.

    Note

    At least one resource group must show Connected. Otherwise, you cannot use the codeless UI to configure a data synchronization task for this data source.

  5. Click Complete Creation.

Step 2: Create the workflow

  1. On the Data Sources page, click the icon in the upper-left corner and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, set Workflow Name to workshop_emr and click Create.

  4. Double-click the new workflow to open its configuration tab. Create the following nodes:

    1. Click Create Node and drag Zero-Load Node from the General section to the canvas. Set Name to workshop_start_emr and click Confirm.

    2. Click Create Node and drag Offline synchronization from the Data Integration section to the canvas. Create two batch synchronization nodes: ods_raw_log_d_2oss_emr (for HttpFile logs) and ods_user_info_d_2oss_emr (for MySQL user data). Click Confirm for each.

  5. On the workflow canvas, draw directed lines to make workshop_start_emr the ancestor node of both batch synchronization nodes.


Step 3: Configure the nodes

Configure the workflow trigger node

The workshop_start_emr zero-load node triggers the workflow to run daily.

  1. In the Scheduled Workflow pane, double-click the workshop_start_emr node. In the right-side pane of the node configuration tab, click Properties.

  2. Configure the scheduling properties.

    | Section | Setting |
    | --- | --- |
    | Schedule | Set the scheduling time to 00:30. Set Rerun to Allow Regardless of Running Status. |
    | Scheduling Dependencies | Set workshop_start_emr as a descendant of the workspace root node (named Workspace name_root). The root node triggers workshop_start_emr to run. |
  3. Click the save icon to save.

Configure batch synchronization nodes

Sync MySQL user data to OSS (ods_user_info_d_2oss_emr)

  1. On the DataStudio page, double-click ods_user_info_d_2oss_emr to open the node configuration page.

  2. Set the source, resource group, and destination, then click Next and complete the connectivity test.

    | Parameter | Value |
    | --- | --- |
    | Source | MySQL — Data Source Name: user_behavior_analysis_mysql |
    | Resource Group | Your serverless resource group |
    | Destination | OSS — Data Source Name: test_g |
  3. Configure the synchronization details.

    | Parameter | Value |
    | --- | --- |
    | Source: Table | ods_user_info_d |
    | Source: Split key | uid (INTEGER type; use a primary key or indexed column) |
    | Destination: Text type | text |
    | Destination: Object Name (Path Included) | ods_user_info_d/user_${bizdate}/user_${bizdate}.txt. Here, ods_user_info_d is the folder in the OSS bucket, and ${bizdate} is replaced with the previous day's date at runtime (for example, 20191106 when the workflow runs on November 7, 2019). |
    | Destination: Column Delimiter | \| |
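    The ${bizdate} placeholder in the object name is filled in at runtime by the bizdate scheduling parameter configured in the next substep ($[yyyymmdd-1], the previous day's date). A minimal Python sketch of how the destination path resolves; the template string and date are taken from this tutorial, but the helper function itself is only illustrative:

    ```python
    from datetime import date, timedelta

    def resolve_object_name(template: str, run_date: date) -> str:
        """Substitute ${bizdate} with the previous day's date in yyyymmdd
        format, mirroring the $[yyyymmdd-1] scheduling parameter."""
        bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
        return template.replace("${bizdate}", bizdate)

    # If the workflow runs on November 7, 2019, ${bizdate} resolves to 20191106.
    path = resolve_object_name(
        "ods_user_info_d/user_${bizdate}/user_${bizdate}.txt", date(2019, 11, 7)
    )
    print(path)  # ods_user_info_d/user_20191106/user_20191106.txt
    ```

    The same substitution applies to the log-sync node's object name, ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt.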
  4. Click Properties in the right-side pane and configure the scheduling settings.

    | Section | Setting |
    | --- | --- |
    | Scheduling Parameter | Click Add Parameter. Set Parameter Name to bizdate and Parameter Value to $[yyyymmdd-1]. |
    | Schedule | Set Rerun to Allow Regardless of Running Status. |
    | Dependencies | Confirm that the output table follows the Workspace name.Node name format. |
  5. Click the save icon to save.

Sync HttpFile access logs to OSS (ods_raw_log_d_2oss_emr)

  1. On the DataStudio page, double-click ods_raw_log_d_2oss_emr to open the node configuration page.

  2. Set the source, resource group, and destination, then click Next and complete the connectivity test.

    | Parameter | Value |
    | --- | --- |
    | Source | HttpFile — Data Source Name: user_behavior_analysis_httpfile |
    | Resource Group | Your serverless resource group |
    | Destination | OSS — Data Source Name: test_g |
  3. Configure the synchronization details.

    | Parameter | Value |
    | --- | --- |
    | Source: File Path | /user_log.txt |
    | Source: Text type | text |
    | Source: Column Delimiter | \| |
    | Source: Compression format | None |
    | Source: Skip Header | No |
    | Destination: Text type | text |
    | Destination: Object Name (Path Included) | ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt. Here, ods_raw_log_d is the folder in the OSS bucket, and ${bizdate} is replaced with the previous day's date at runtime. |
    | Destination: Column Delimiter | \| |
  4. Click Properties in the right-side pane and configure the scheduling settings.

    | Section | Setting |
    | --- | --- |
    | Scheduling Parameter | Click Add Parameter. Set Parameter Name to bizdate and Parameter Value to $[yyyymmdd-1]. |
    | Schedule | Set Rerun to Allow Regardless of Running Status. |
    | Dependencies | Confirm that the output table follows the Workspace name.Node name format. |
  5. Click the save icon to save.

Create EMR Hive tables

Create two EMR Hive external tables — ods_user_info_d_emr and ods_raw_log_d_emr — that point to the OSS folders where the synced data is stored.

  1. In the Scheduled Workflow pane, click the workshop_emr workflow, right-click EMR, and choose Create Node > EMR Hive.

  2. Create two EMR Hive nodes: ods_user_info_d_emr and ods_raw_log_d_emr. Draw dependency lines so that both EMR Hive nodes run after the batch synchronization nodes complete.


  3. Configure and run each EMR Hive node.

    ods_user_info_d_emr — user information table

    Double-click the ods_user_info_d_emr node and enter the following SQL:

    Note

    The LOCATION path must match the Object Name (Path Included) value set in the ods_user_info_d_2oss_emr node. dw-emr-demo is the OSS bucket name you created when preparing the environment. ${bizdate} is replaced with the previous day's date at runtime.

    -- Create the table used to store user information.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
    (
        `uid` STRING COMMENT 'The user ID',
        `gender` STRING COMMENT 'The gender',
        `age_range` STRING COMMENT 'The age range',
        `zodiac` STRING COMMENT 'The zodiac sign'
    ) PARTITIONED BY (
      dt STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/';

    ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';

    Click Properties in the right-side pane and configure the scheduling settings:

    | Section | Setting |
    | --- | --- |
    | Scheduling Parameter | Set Parameter Name to bizdate and Parameter Value to $[yyyymmdd-1]. |
    | Schedule | Set Rerun to Allow Regardless of Running Status. |
    | Dependencies | Confirm that the output table follows the Workspace name.Node name format. |
    | Resource Group | Select your serverless resource group. |

    Click the save icon, then run the node.

    ods_raw_log_d_emr — access log table

    Double-click the ods_raw_log_d_emr node and enter the following SQL:

    Note

    The LOCATION path must match the Object Name (Path Included) value set in the ods_raw_log_d_2oss_emr node. dw-emr-demo is the OSS bucket name you created when preparing the environment.

    -- Create the table used to store website access logs.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
    (
      `col` STRING
    ) PARTITIONED BY (
      dt STRING
    );
    ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';

    Apply the same scheduling settings as above, then save and run the node.
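The external tables read the pipe-delimited text files that the synchronization nodes write to OSS, so the '|' delimiter in the Hive DDL must match the Column Delimiter set on the sync nodes. As a sanity check, here is a small Python sketch of how one row of the user file splits into the columns of ods_user_info_d_emr; the sample line is hypothetical and only illustrates the format:

```python
# Column order must match the Hive DDL: uid, gender, age_range, zodiac.
COLUMNS = ["uid", "gender", "age_range", "zodiac"]

def parse_user_row(line: str) -> dict:
    """Split one pipe-delimited row into the external table's columns."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))

# Hypothetical sample row, matching the '|' delimiter set on the sync node.
row = parse_user_row("0016359810821|male|30-40|Aries")
print(row["uid"], row["zodiac"])  # 0016359810821 Aries
```

The raw log table, by contrast, declares a single `col` column, so each log line is stored whole and split later during processing.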

Step 4: Run the workflow and verify the results

Run the workflow

  1. On the DataStudio page, double-click the workshop_emr workflow under Business Flow. On the workflow configuration tab, click the run icon in the toolbar. The workflow runs all nodes in dependency order.

  2. Check node status.

    • Nodes that show the success state have run normally.

    • If nodes fail with the error "java.net.ConnectException: Connection timed out (Connection timed out)", add a security group rule in the ECS console: allow port 10000 and set Authorization Object to the vSwitch CIDR block. To find the CIDR block, go to the Resource Groups page, find your resource group, and click Network Settings > VPC Binding. See Add a security group rule.
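    The security group rule works because the resource group's executors get their private IP addresses from the bound vSwitch, so authorizing the vSwitch CIDR block covers all of them. To check whether an address falls inside a candidate CIDR block, Python's standard ipaddress module can be used; the addresses below are hypothetical examples, not values from this tutorial:

    ```python
    import ipaddress

    def in_cidr(ip: str, cidr: str) -> bool:
        """Return True if the IP address belongs to the CIDR block."""
        return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

    # Hypothetical vSwitch CIDR block and executor IP addresses.
    print(in_cidr("192.168.1.20", "192.168.1.0/24"))  # True
    print(in_cidr("192.168.2.20", "192.168.1.0/24"))  # False
    ```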

Verify the results

  1. In the left-side navigation pane of the DataStudio page, click the Ad Hoc Query icon. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > EMR Hive.

  2. Run the following queries to confirm the data was written successfully. Replace <data_timestamp> with the previous day's date in yyyymmdd format. For example, if the workflow ran on November 7, 2019, use 20191106.

    SELECT * FROM ods_user_info_d_emr WHERE dt='<data_timestamp>';
    SELECT * FROM ods_raw_log_d_emr WHERE dt='<data_timestamp>';
  3. To view the execution logs, right-click the ods_user_info_d_emr or ods_raw_log_d_emr node and select View Logs to confirm that the synchronization succeeded.

What's next

Now that you have synced data into OSS and made it queryable through EMR Hive tables, proceed to the next tutorial to compute and analyze the data. See Process data.