DataWorks: Synchronize data

Last Updated: Feb 14, 2026

This tutorial shows how to create HttpFile and MySQL Data Sources to access user and website log data. You will configure a data synchronization pipeline to transfer this data to an Object Storage Service (OSS) bucket. You will then create E-MapReduce (EMR) Hive external tables to parse the data in OSS and run queries to verify the results.

Prerequisites

  • Ensure the required environment is prepared. For more information, see Prepare the environment.

  • To ensure network connectivity, go to the ECS console and add a security group rule to the Security Group associated with your EMR cluster. This rule must allow inbound traffic on port 10000 from the vSwitch CIDR block of the resource group's VPC. For more information, see Add a security group rule.

1. Create data sources

Add the following Data Sources to your DataWorks workspace to retrieve and store the sample data for subsequent processing.

  • MySQL Data Source: A sample Data Source provided by DataWorks that stores basic user information.

  • HttpFile Data Source: A sample Data Source provided by DataWorks that stores user website access logs.

  • OSS Data Source: The EMR OSS-HDFS storage that you created in Prepare the EMR environment. This is used to store the sample user information and website access logs synchronized from the MySQL and HttpFile Data Sources.

Create a MySQL data source (user_behavior_analysis_mysql)

The sample user data for this tutorial is stored in the ods_user_info_d table of a MySQL database. Create a MySQL Data Source so that DataWorks can retrieve this data.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. On the SettingCenter page, click Data Sources in the left-side navigation pane.

  3. Click Add Data Source, and then search for and select MySQL as the Data Source type.

  4. On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes only the key parameters. Use the default values for all other parameters.

    • Data source name: Enter a name for the Data Source. In this tutorial, enter user_behavior_analysis_mysql.

    • Data source description: This is a read-only Data Source for DataWorks tutorials. It provides sample data for Batch Synchronization nodes within Data Integration and is not supported in other modules.

    • Configuration mode: Select Connection String Mode.

    • Connection address:
      • Host IP address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com
      • Port number: 3306

    • Database name: Enter the database name. In this tutorial, enter workshop.

    • Username: Enter the username. In this tutorial, enter workshop.

    • Password: Enter the password. In this tutorial, enter workshop#2017.

    • Authentication method: Select No authentication.

  5. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If you do not have a resource group available, follow the prompts in the connection configuration section and click Purchase and Associate Purchased Resource Group.

  6. Click Complete Creation.
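
If you want to preview the sample user data before configuring synchronization, you can run a query such as the following from any MySQL client that can reach the connection address listed above, using the workshop database and credentials. This is a minimal sketch; the column names are assumptions based on the Hive table defined later in this tutorial and may differ in the source table.

    -- Optional check of the sample source data.
    -- Column names are assumed from the Hive table created later in this tutorial.
    SELECT uid, gender, age_range, zodiac
    FROM ods_user_info_d
    LIMIT 10;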

Create an HttpFile data source (user_behavior_analysis_httpfile)

The user website access logs for this tutorial are stored as the user_log.txt object in a sample OSS bucket provided by DataWorks. Create an HttpFile Data Source to retrieve these logs.

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. On the SettingCenter page, click Data Sources in the left-side navigation pane.

  2. Click Add Data Source, and then search for and select HttpFile as the Data Source type.

  3. On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes only the key parameters. Use the default values for all other parameters.

    • Data source name: Enter a name for the Data Source. In this tutorial, enter user_behavior_analysis_httpfile.

    • Data source description: This is a read-only Data Source for DataWorks tutorials. It provides sample data for Batch Synchronization nodes within Data Integration and is not supported in other modules.

    • URL: Set the URL to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com for both the development and production environments.

  4. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure the status is Connected.

    Important
    • Ensure a resource group with Public Access is associated with the workspace to prevent Data Synchronization failures. For configuration steps, see Prepare the environment.

    • If you do not have an available resource group, follow the on-screen instructions to click Purchase and Associate Purchased Resource Group.

  5. Click Complete Creation.

Create an OSS data source (test_g)

Add the OSS bucket you created in Prepare the EMR environment to DataWorks. This bucket will store the data synchronized from the MySQL and HttpFile Data Sources.

Note

When adding an OSS Data Source, ensure the current account has the AliyunOSSFullAccess permission for Object Storage Service (OSS). For detailed instructions, see View the permissions of a RAM user and Manage the permissions of a RAM user.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. On the SettingCenter page, click Data Sources in the left-side navigation pane.

  3. Click Add Data Source, and then search for and select OSS as the Data Source type.

  4. On the Add OSS Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    • Data source name: Enter a name for the Data Source. In this example, use test_g.

    • Data source description: Enter a brief description for the Data Source.

    • Access mode: Select AccessKey Mode.

    • AccessKey ID: The AccessKey ID of the current account. To obtain it, log on to the DataWorks console, hover over your profile picture in the upper-right corner, and go to the AccessKey page to view the AccessKey ID of a RAM user with the required permissions.

    • AccessKey Secret: Enter the AccessKey Secret of the current account.

      Important: The AccessKey Secret is displayed only when it is created and cannot be retrieved later. Keep it confidential. If your AccessKey is compromised or lost, delete it and create a new one.

    • Region: Select China (Shanghai).

    • Endpoint: Enter oss-cn-shanghai-internal.aliyuncs.com.

    • Bucket: The name of the OSS bucket that you configured when creating the EMR cluster. In this example, use dw-emr-demo.

  5. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If you do not have a resource group available, follow the prompts in the connection configuration section and click Purchase and Associate Purchased Resource Group.

  6. Click Complete Creation.

2. Build the synchronization pipeline

In this step, you build a synchronization pipeline to move user data and website access logs to an OSS bucket for subsequent processing.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. On the Data Studio page, click the Data Development icon in the left-side navigation pane to go to the Data Development page. Then, select Workspace Directories in the directory tree on the left.

  3. Create a directory named work and a workflow named workshop_emr. For more information, see Workspace Directories and Orchestrate periodic workflows.

  4. Click the workshop_emr workflow to open the workflow orchestration page. Then, create a Zero Load Node, two Batch Synchronization nodes, and two EMR Hive nodes.

    1. Select General > Zero Load Node and drag it to the editing canvas on the right. In the Create Node dialog box, enter workshop_start_emr for Node Name and click Confirm.

    2. Drag two Data Integration nodes to the editing canvas. In the Create Node dialog box, set the source type to MySQL and the destination type to OSS. Select Batch Synchronization as the specific node type. Name the nodes ods_user_info_d_2oss_emr and ods_raw_log_d_2oss_emr, respectively. These nodes will synchronize MySQL user information and HttpFile log data to your OSS bucket. Click Confirm.

    3. Drag two EMR > EMR Hive nodes to the editing canvas. In the Create Node dialog box, name the nodes ods_user_info_d_emr and ods_raw_log_d_emr, respectively. These nodes will create Hive tables to parse the data in your OSS bucket. Click Confirm.

    The following list describes the example nodes used in this tutorial.

    • Zero Load Node (workshop_start_emr): Orchestrates the workflow and defines the data flow. This Zero Load Node requires no code editing.

    • Batch Synchronization node (ods_user_info_d_2oss_emr): Synchronizes basic user information from the MySQL Data Source to your OSS Data Source (test_g).

    • Batch Synchronization node (ods_raw_log_d_2oss_emr): Synchronizes user website access logs from the HttpFile Data Source (OSS) to your OSS Data Source (test_g).

    • EMR Hive node (ods_user_info_d_emr): Creates the Hive table ods_user_info_d_emr to parse the basic user information stored in the OSS Data Source (test_g).

    • EMR Hive node (ods_raw_log_d_emr): Creates the Hive table ods_raw_log_d_emr to parse the user website access logs stored in the OSS Data Source (test_g).

  5. On the workflow orchestration page, drag connections so that the workshop_start_emr node becomes the upstream dependency of the two Batch Synchronization nodes.

3. Configure synchronization nodes

Follow these steps to synchronize user information from the MySQL Data Source and log data from the HttpFile Data Source to your OSS bucket. Then, create Hive External Tables to parse and query the data stored in OSS.

Configure the zero-load node

Configure the scheduling properties for the Zero Load Node as follows.

  1. Open the Zero Load Node.

    On the workflow orchestration page, hover over the Zero Load Node and click Open Node to go to the node editing page.

  2. Configure the node.

    In the right-side pane, click Properties and configure the following parameters.

    • Resource group for scheduling: Select the serverless resource group you created during environment preparation.

    • Scheduling dependencies: The workflow has no upstream dependencies, so no configuration is needed. For unified management, you can click Use Workspace Root Node to set the Workspace Root Node as the parent dependency. The Workspace Root Node is named in the format WorkspaceName_root.

  3. Save the node.

    After you complete the configuration, click the Save icon in the toolbar to save the node.

MySQL user data synchronization

This task synchronizes user data from the MySQL Data Source to your OSS bucket.

  1. Open the Batch Synchronization node.

    On the workflow orchestration page, hover over the ods_user_info_d_2oss_emr node and click Open Node to go to the node configuration page.

  2. Configure the Data Source and compute resource.

    Configure the source, destination, and resource group as follows.

    • Source:
      • Source: MySQL.
      • Data source name: user_behavior_analysis_mysql.

    • Destination:
      • Destination: OSS.
      • Data source name: Select the OSS Data Source that you created earlier. In this example, use test_g.

    • Resource group for Data Integration: Select the serverless resource group that you purchased.

  3. Configure the source and destination settings.

    1. Configure the following key parameters. Use the default values for all other parameters.

      • Source:
        • Table: Select ods_user_info_d.
        • Split key: Use a primary key or an indexed column as the split key. Only integer-type fields are supported. For this tutorial, set the split key to uid.

      • Destination:
        • Text type: Select text.
        • Object name (including path): Enter the path based on your OSS directory structure. Example: ods_user_info_d/user_${bizdate}/user_${bizdate}.txt. Here, ods_user_info_d is your OSS directory name, and ${bizdate} represents the data timestamp of the previous day.
        • Column delimiter: Enter |.

    2. Confirm the Field Mapping and Channel Control settings.

      DataWorks allows you to map source and destination fields to write data from specific source fields to specific destination fields. You can also set parameters such as task concurrency and a policy for dirty data records in the Channel Control settings. For this tutorial, set Policy for Dirty Data Records to Disallow Dirty Data Records and keep the default values for other settings. For more information, see Configure a synchronization node in wizard mode.

  4. Configure debugging parameters.

    In the right-side pane, click Debugging Configurations and configure the following parameters. These parameters are used for the test run in Step 5.

    • Resource group: Select the serverless resource group you purchased in Prepare the environment.

    • Script parameters: No configuration is needed. The example code in this tutorial uses ${bizdate} to represent the data timestamp. When you run the workflow in Step 5, set Value Used in This Run to a specific constant, such as 20250223. The node then uses this constant to replace the variable during the run.

  5. Save the Batch Synchronization node.

    After you complete the configuration, click the Save icon in the toolbar to save the node.

HttpFile log data synchronization

This task synchronizes log data from the HttpFile Data Source to your OSS bucket.

  1. Open the Batch Synchronization node.

    On the workflow orchestration page, hover over the ods_raw_log_d_2oss_emr node and click Open Node to go to the node configuration page.

  2. Configure the Data Source and compute resource.

    Configure the source, destination, and resource group as follows.

    • Source:
      • Source: HttpFile.
      • Data source name: user_behavior_analysis_httpfile.

    • Destination:
      • Destination: OSS.
      • Data source name: Select the OSS Data Source that you created earlier. In this example, use test_g.

    • Resource group for Data Integration: Select the serverless resource group that you purchased.

  3. Configure the source and destination settings.

    1. Configure the following key parameters. Use the default values for all other parameters.

      • Source:
        • File path: /user_log.txt.
        • Text type: Select text.
        • Column delimiter: Enter |.
        • Compression format: The options are None, Gzip, Bzip2, and Zip. For this tutorial, select None.
        • Skip header: Select No.

      • Destination:
        • Text type: Select text.
        • Object name (including path): Enter the path based on your OSS directory structure. Example: ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt. Here, ods_raw_log_d is your OSS directory name, and ${bizdate} represents the data timestamp of the previous day.
        • Column delimiter: Enter |.

    2. Configure debugging parameters.

      In the right-side pane, click Debugging Configurations and configure the following parameters. These parameters are used for the test run in Step 5.

      • Resource group: Select the serverless resource group you purchased in Prepare the environment.

      • Script parameters: No configuration is needed. The example code in this tutorial uses ${bizdate} to represent the data timestamp. When you run the workflow in Step 5, set Value Used in This Run to a specific constant, such as 20250223. The node then uses this constant to replace the variable during the run.

  4. Save the Batch Synchronization node.

    After you complete the configuration, click the Save icon in the toolbar to save the node.

4. Parse OSS data

After the Batch Synchronization nodes finish running, follow these steps to create Hive External Tables to parse the sample data stored in your test_g OSS Data Source.

Create the ods_user_info_d_emr table

  1. Develop the code.

    On the workflow orchestration page, hover over the ods_user_info_d_emr EMR Hive node and click Open Node to go to the EMR Hive node editing page. Enter the following sample SQL statement.

    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
    (
        `uid` STRING COMMENT 'User ID',
        `gender` STRING COMMENT 'Gender',
        `age_range` STRING COMMENT 'Age range',
        `zodiac` STRING COMMENT 'Zodiac sign'
    ) PARTITIONED BY (
      dt STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/';
    
    ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
    Note

    The LOCATION in the preceding code is an example path. It must be the same as the destination path you configured for the Batch Synchronization node. Replace dw-emr-demo with the name of the OSS bucket you created during environment preparation.

  2. Configure debugging parameters.

    In the right-side pane, click Debugging Configurations and configure the following parameters for the test run.

    • Computing resource: Select the EMR compute resource that you associated in the Prepare the environment step.

    • Resource group: Select the serverless resource group you purchased in Prepare the environment.

    • Script parameters: No configuration is needed. The example code in this tutorial uses ${bizdate} to represent the data timestamp. When you debug the node, set Value Used in This Run to a specific constant, such as 20250223. The node then uses this constant to replace the variable during the run.

  3. After you complete the configuration, save the node.
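
After the workflow has run (see section 5), you can optionally confirm that the partition of ods_user_info_d_emr points at the expected OSS path. The statements below are a minimal sketch that assumes the example data timestamp 20250223 used in this tutorial; the Location value in the DESCRIBE FORMATTED output should match the destination path that you configured for the ods_user_info_d_2oss_emr node.

    -- List the registered partitions of the table.
    SHOW PARTITIONS ods_user_info_d_emr;
    -- Show the OSS location and other metadata of one partition.
    DESCRIBE FORMATTED ods_user_info_d_emr PARTITION (dt='20250223');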

Create the ods_raw_log_d_emr table

  1. Develop the code.

    On the workflow orchestration page, hover over the ods_raw_log_d_emr EMR Hive node and click Open Node to go to the EMR Hive node editing page. Enter the following sample SQL statement.

    -- Create the destination table for the OSS logs
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
    (
      `col` STRING
    ) PARTITIONED BY (
      dt STRING
    );
    ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
    Note

    The LOCATION in the preceding code is an example path. It must be the same as the destination path you configured for the Batch Synchronization node. Replace dw-emr-demo with the name of the OSS bucket you created during environment preparation.

  2. Configure debugging parameters.

    In the right-side pane, click Debugging Configurations and configure the following parameters for the test run.

    • Computing resource: Select the EMR compute resource that you associated in the Prepare the environment step.

    • Resource group: Select the serverless resource group you purchased in Prepare the environment.

    • Script parameters: No configuration is needed. The example code in this tutorial uses ${bizdate} to represent the data timestamp. When you debug the node, set Value Used in This Run to a specific constant, such as 20250223. The node then uses this constant to replace the variable during the run.

  3. Save the node.

    After you complete the configuration, save the node.
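
Because ods_raw_log_d_emr stores each raw log record in the single col column, a small sample is enough to confirm that the data landed. After the workflow has run (see section 5), you can issue a query such as the following; the partition value 20250223 is the example data timestamp used in this tutorial, so replace it with the value you actually use.

    -- Optional sanity check: read a few raw log records from one partition.
    SELECT col FROM ods_raw_log_d_emr WHERE dt='20250223' LIMIT 5;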

5. Run the synchronization workflow

  1. Configure workflow scheduling.

    On the right side of the workflow orchestration page, click Properties and configure the following key parameters. Use the default values for all other parameters.

    • Scheduling parameters: Set scheduling parameters that are available to all nodes in the workflow. For this tutorial, set bizdate=$[yyyymmdd-1] to get the date of the previous day.

    • Recurrence: For this tutorial, set Recurrence to Day.

    • Start time: For this tutorial, set the start time to 00:30. The workflow then starts at 00:30 every day.

    • Scheduling dependencies: The workflow has no upstream dependencies, so no configuration is needed. For unified management, you can click Use Workspace Root Node to set the Workspace Root Node as the parent dependency. The Workspace Root Node is named in the format WorkspaceName_root.

  2. After configuring the workflow scheduling properties, click Run in the toolbar at the top of the workflow canvas. Set the parameter variables for this run. This tutorial uses 20250223 as an example, but you can use a different value. Click Confirm and wait for the run to complete.

    • If the ods_raw_log_d_emr and ods_user_info_d_emr nodes show a successful status, the synchronization succeeded.

    • If the ods_raw_log_d_emr and ods_user_info_d_emr nodes fail with a java.net.ConnectException: Connection timed out (Connection timed out) error, you must add a security group rule in the ECS console to open ECS port 10000. Set the authorized object to the vSwitch CIDR block of the resource group's VPC. You can obtain the vSwitch CIDR block by clicking Network Settings for the corresponding resource group in the resource group list. For more information about how to add a security group rule, see Add a security group rule.

Note

Data Synchronization takes approximately 24 minutes.

6. Verify data synchronization

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. On the Data Studio page, click the Data Development icon in the left-side navigation pane to go to the Data Development page, and find the Workspace Directories section in the navigation pane.

  3. Right-click the work directory you created and select Create Node... > EMR > EMR Hive. Enter a custom name for the EMR Hive node and press Enter to create the node.

    On the EMR Hive node editing page, replace your_data_timestamp with the data timestamp from your workflow run. Run the following queries to verify that data is correctly mapped to the ods_raw_log_d_emr and ods_user_info_d_emr tables.

    Note

    The partition column in the query must be updated to the data timestamp. For example, if the workflow runs on 20250223, the data timestamp is 20250222, which is the day before the run date.

    SELECT * FROM ods_user_info_d_emr WHERE dt='your_data_timestamp';
    SELECT * FROM ods_raw_log_d_emr WHERE dt='your_data_timestamp';
    • If the preceding queries return data, the data was synchronized and parsed correctly.

    • If no data is returned, ensure that the Value Used in This Run that is configured for the workflow run matches the data timestamp specified by dt in the query. You can click the workflow, click Running History in the right-side pane, and then click View in the Actions column for a run record to confirm the data timestamp value (partition=[pt=xxx]) in the workflow run log.
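
    You can also count the rows in each partition to see which partitions actually contain data. The following queries are an optional check that assumes the example table names used in this tutorial.

    -- Row count per partition for both tables.
    SELECT dt, COUNT(*) AS row_cnt FROM ods_user_info_d_emr GROUP BY dt;
    SELECT dt, COUNT(*) AS row_cnt FROM ods_raw_log_d_emr GROUP BY dt;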

Next steps

Now that you have synchronized the data, you can proceed to the next tutorial to learn how to process and analyze it. For more information, see Process data.