All Products
Search
Document Center

DataWorks:Synchronize data

Last Updated:Feb 19, 2025

This topic explains how to add HttpFile and MySQL data sources to access basic user information and website access logs provided in the example. It also covers configuring data synchronization tasks to transfer data to a private OSS data source and using EMR Hive nodes to create tables and query the synchronized data.

Prerequisites

  • An environment is prepared. For more information, see Prepare the environment.

  • To ensure network connectivity, add a security group rule in the ECS console to open port 10000 on the ECS instance and set the authorization object to the CIDR block of the VPC switch in the resource group. For more information, see Add a security group rule.

Step 1: Create data sources

For the scenario in this experiment, create and register the following three data sources in the DataWorks workspace:

  • HttpFile data source: Accesses website access log data provided in this tutorial.

  • MySQL data source: Accesses user information data provided in this tutorial.

  • OSS data source: Receives test data from the HttpFile and MySQL data sources.

Create an HttpFile data source

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. Click Add Data Source, and in the Add Data Source dialog box, search for and select HttpFile as the data source type.

  3. On the Create HttpFile Data Source page, configure the parameters using example values for both development and production environments.

    Parameter

    Description

    Data Source Name

    Enter the data source name. In this tutorial, enter user_behavior_analysis_httpfile.

    Data Source Description

    The description of the data source. Specify that the data source is exclusively provided for the use cases of DataWorks and is used as the source of a batch synchronization task to access the provided test data. In addition, you also need to specify that the data source can be used only for data reading in data synchronization scenarios.

    URL Domain Name

    The URL domain name for both the development and production environments ishttps://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.

  4. Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.

    Important

    Ensure that at least one resource group is in the Connected status. Otherwise, this data source cannot be used to create a sync task in codeless UI.

  5. Click Complete Creation.

Create a MySQL data source

  1. On the Management Center page, click to enter the Data Source page and then click Add Data Source.

  2. In the Add Data Source dialog box, search for and select MySQL as the data source type.

  3. On the Create MySQL Data Source page, configure the parameters using example values for both development and production environments.

    Parameter

    Description

    Data Source Name

    Enter the data source name. In this tutorial, enter user_behavior_analysis_mysql.

    Data Source Description

    The description of the data source. Specify that the data source is exclusively provided for the use cases of DataWorks and is used as the source of a batch synchronization task to access the provided test data. In addition, you also need to specify that the data source can be used only for data reading in data synchronization scenarios.

    Configuration Mode

    Select Connection String Mode.

    Connection Address

    • Host Address IP: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com

    • Port Number: 3306

    Database Name

    Enter the database name. In this tutorial, enter workshop.

    Username

    Enter the username. In this tutorial, enter workshop.

    Password

    Enter the password. In this tutorial, enter workshop#2017.

    Authentication Option

    Set this parameter to No Authentication.

  4. Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.

  5. Click Complete Creation.

Create an OSS data source

In this experiment, synchronize user information from the MySQL data source and log information from the HttpFile data source to the OSS data source.

  1. On the Management Center page, click to enter the Data Source page and then click Add Data Source.

  2. In the Add Data Source dialog box, search for and select OSS as the data source type.

  3. On the Create OSS Data Source page, configure the parameters using example values for both development and production environments.

    Parameter

    Description

    Data Source Name

    Enter the name of the data source. In this example, enter test_g.

    Data Source Description

    Provide a brief description of the data source.

    Access Mode

    Select Access Key Mode.

    AccessKey ID

    The AccessKey ID of the account that is used to log on. You can go to the AccessKey page to copy the AccessKey ID.

    AccessKey Secret

    Enter the AccessKey secret of the account that is used to log on.

    Important

    The AccessKey secret is displayed only during creation. You cannot view the AccessKey secret after you create it. Keep it confidential. If the AccessKey is leaked or lost, delete it and create a new AccessKey.

    Endpoint

    Enter http://oss-cn-shanghai-internal.aliyuncs.com.

    Bucket

    The name of the OSS bucket that is configured when you prepare the environment and create the EMR cluster. In this example, enter dw-emr-demo.

  4. Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.

    Note

    Ensure that at least one resource group is in the Connected status. Otherwise, this data source cannot be used to create a sync task in codeless UI.

  5. Click Complete Creation.

Step 2: Set up synchronization links

  1. Click the 图标 icon in the upper left corner, and select All Products > Data Development And Operations > Datastudio (data Development).

  2. In the Data Development panel, right-click Business Flow, and select Create Business Flow.

  3. In the Create Business Flow dialog box, enter Business Name as workshop_emr, and then click Create.

  4. Enter the business flow development panel, and create a zero load node and offline synchronization node.

    1. Click Create Node, select Zero Load Node, and drag it onto the editing page on the right. In the Create Node dialog box, enter Node Name as workshop_start_emr, and click Confirm.

    2. Click Create Node, select Offline Synchronization node, and drag it onto the editing page on the right. Set the Node Name to ods_raw_log_d_2oss_emr and ods_user_info_d_2oss_emr respectively, which are used to synchronize MySQL user information and HttpFile log information. Then click Confirm.

  5. In the business flow development panel, drag and connect lines to set the workshop_start_emr node as the ancestor node of the two offline synchronization nodes.

    image

Step 3: Configure synchronization tasks

Configure the initial node of the business flow

  1. On the Data Development page, double-click the zero load node under the corresponding business flow. Open the node's editing page, and click Scheduling Configuration on the right.

  2. Configure the scheduling properties.

    Scheduling Configuration

    Image Display

    Description

    Scheduling Time Configuration

    image

    • The scheduling time for the zero load node is set to 00:30. This zero load node will start the current business flow and execute it at 00:30 every day.

    • Set Rerun Property to Rerun On Success Or Failure.

    Scheduling Dependency Configuration

    image

    Since the zero load node workshop_start_emr has no upstream dependency, it can directly depend on the Workspace Root Node, which triggers the execution of the workshop_start_emr node.

    The workspace root node is named: WorkspaceName_root.

  3. After completing the configuration, click the 保存 icon in the upper left corner to save.

Configure offline synchronization nodes

  1. Synchronize user information from the MySQL data source to the self-built OSS.

    1. On the Data Development page, double-click the ods_user_info_d_2oss_emr node to enter the node configuration page.

    2. Establish network connections for data synchronization.

      After completing the network and resource configurations, click Next, and follow the interface prompts to complete the connectivity test.

      Parameter

      Description

      Data Source

      • Data Source: MySQL.

      • Data Source Name: user_behavior_analysis_mysql.

      My Resource Group

      Select the purchased Serverless Resource Group.

      Data Destination

      • Data Destination: OSS.

      • Data Source Name: Set it to test_g, which specifies the name of the private OSS data source that you added.

    3. Configure the data synchronization node.

      Parameter

      Description

      Data Source

      • Table: Select ods_user_info_d from the data source.

      • Shard Key: It is recommended to use the primary key or an indexed column as the shard key. Only fields of the integer type are supported. Set the shard key to uid.

      Data Destination

      • Text Type: Select text type.

      • File Name (including Path): Enter the path based on the directory of your self-built OSS. In this example, enter ods_user_info_d/user_${bizdate}/user_${bizdate}.txt. Here, ods_user_info_d is the directory name you created, and $bizdate indicates the date of the previous day.

      • Column Delimiter: Enter the column delimiter as |.

    4. Configure the scheduling properties.

      On the configuration page, click Scheduling Configuration on the right to enter the Scheduling Configuration panel to configure scheduling and node information. The following table describes the configurations.

      Configuration Item

      Configuration Content

      Image

      Scheduling Parameter

      Add the following in the Scheduling Parameter area:

      • Parameter Name: bizdate

      • Parameter Value: $[yyyymmdd-1]

      image

      Time Attribute

      Set Rerun Property to Rerun On Success Or Failure.

      image

      Scheduling Dependency

      In Scheduling Dependency, confirm that the output table is set as the output of this node.

      The format is WorkSpaceName.NodeName

      image

    5. After the configuration is complete, click the 保存 icon in the toolbar to save.

  1. Synchronize log information from the HttpFile data source to the self-built OSS.

    1. On the Data Development page, double-click the ods_raw_log_d_2oss_emr node to enter the node configuration page.

    2. Establish network connections for data synchronization.

      After completing the network and resource configurations, click Next, and follow the interface prompts to complete the connectivity test.

      Parameter

      Description

      Data Source

      • Data Source: HttpFile.

      • Data Source Name: user_behavior_analysis_httpfile.

      My Resource Group

      Select the purchased Serverless Resource Group.

      Data Destination

      • Data Destination: OSS.

      • Data Source Name: Set it to test_g, which specifies the name of the private OSS data source that you added.

    3. Configure the data synchronization node.

      Parameter

      Description

      Data Source

      • File Path: /user_log.txt.

      • Text Type: Select text type.

      • Column Delimiter: Enter the column delimiter as |.

      • Compression Format: Includes four types: None, Gzip, Bzip2, and Zip. Select None.

      • Skip Header: Select No.

      Data Destination

      • Text Type: Select text type.

      • File Name (including Path): Enter the path based on the directory of your self-built OSS. In this example, enter ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt. Here, ods_raw_log_d is the directory name you created, and $bizdate indicates the date of the previous day.

      • Column Delimiter: Enter the column delimiter as |.

    4. Configure the scheduling properties.

      On the configuration page, click Scheduling Configuration on the right to enter the Scheduling Configuration panel to configure scheduling and node information. The following table describes the configurations.

      Configuration Item

      Configuration Content

      Image

      Scheduling Parameter

      Add the following in the Scheduling Parameter area:

      • Parameter Name: bizdate

      • Parameter Value: $[yyyymmdd-1]

      image

      Time Attribute

      Set Rerun Property to Rerun On Success Or Failure.

      image

      Scheduling Dependency

      In Scheduling Dependency, confirm that the output table is set as the output of this node.

      The format is WorkSpaceName.NodeName

      image

    5. After the configuration is complete, click the 保存 icon in the toolbar to save.

Create EMR tables to obtain synchronized data

Create two tables ods_user_info_d_emr and ods_raw_log_d_emr to query the synchronized original MySQL user data and OSS log data respectively.

  1. On the Data Development page, open the newly created business flow, right-click EMR, and select Create Node > EMR Hive.

  2. In the Create Node dialog box, enter Node Name, and click Confirm.

    Create two EMR Hive nodes (ods_user_info_d_emr and ods_raw_log_d_emr) to establish the corresponding tables. Configure the upstream and downstream dependencies of the nodes in the business flow development panel by drawing lines, as shown in the figure below.

    image

  3. Enter the table creation statements in the EMR Hive nodes respectively, and in Scheduling Configuration, select Schedule Resource Group as the purchased Serverless Resource Group. Then click Save and Run each table creation statement.

    • Create the ods_user_info_d_emr table.

      Double-click the ods_user_info_d_emr node to enter the editing page and configure the node.

      1. Edit the SQL code.

        CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
        (
            `uid` STRING COMMENT 'User ID',
            `gender` STRING COMMENT 'Gender',
            `age_range` STRING COMMENT 'Age range',
            `zodiac` STRING COMMENT 'Zodiac sign'
        ) PARTITIONED BY (
          dt STRING
        )
        ROW FORMAT  delimited fields terminated by '|'
        LOCATION 'oss://dw-emr-demo/ods_user_info_d/';
        
        ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
        LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
        Note

        The location in the code above is an example path, which should match the data destination configured in the offline synchronization task. Replace 'dw-emr-demo' with the OSS bucket domain name you created during environment preparation.

      2. Configure the scheduling properties.

        Configuration Item

        Configuration Content

        Image

        Scheduling Parameter

        Add the following in the Scheduling Parameter area:

        • Parameter Name: bizdate

        • Parameter Value: $[yyyymmdd-1]

        image

        Time Attribute

        Set Rerun Property to Rerun On Success Or Failure.

        image

        Scheduling Dependency

        In Scheduling Dependency, confirm that the output table is set as the output of this node.

        The format is WorkSpaceName.NodeName

        image

      3. Once the configuration is complete, click image to save the node.

    • Create the ods_raw_log_d_emr table.

      Double-click the ods_raw_log_d_emr node to configure the ods_raw_log_d_emr node.

      1. Edit the SQL code

        --Create the target table corresponding to the OSS log
        CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
        (
          `col` STRING
        ) PARTITIONED BY (
          dt STRING
        );
        ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
        LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
        Note

        The location in the code above is an example path, which should match the data destination configured in the offline synchronization task. Replace 'dw-emr-demo' with the OSS bucket name you created during environment preparation.

      2. Configure the scheduling properties.

        Configuration Item

        Configuration Content

        Image

        Scheduling Parameter

        Add the following in the Scheduling Parameter area:

        • Parameter Name: bizdate

        • Parameter Value: $[yyyymmdd-1]

        image

        Time Attribute

        Set Rerun Property to Rerun On Success Or Failure.

        image

        Scheduling Dependency

        In Scheduling Dependency, confirm that the output table is set as the output of this node.

        The format is WorkSpaceName.NodeName

        image

      3. After completing the configuration, click image to save the node.

Step 4: Run and view results

Run the business flow

  1. On the Data Development page, double-click Business Flow under workshop_emr, open the business flow panel, and click the image.png icon in the toolbar to run the business flow according to the upstream and downstream dependencies.

  2. Confirm the status.

    • View the task execution status:

      • If the node is in the image.png status, it indicates that the synchronization process is successful.

      • If the node is in the image status and an error such as java.net.ConnectException: Connection timed out (Connection timed out) occurs, you need to add a security group rule in the ECS console to open port 10000 of the ECS instance and set the authorization object to the CIDR block of the VPC switch in the resource group. You can click Network Settings of the corresponding resource group in the resource group list to obtain the CIDR block. For more information about how to add a security group rule, see Add a security group rule.

    • View the task execution logs: Right-click the Ods_user_info_d_emr, Ods_raw_log_d_emr nodes, and select View Logs. If the information shown in the following figure appears in the logs, the node is run and data is synchronized.

      image.png

Query synchronization results

  1. Create an ad hoc query.

    On the Data Development page, click image.png in the left-side navigation pane to enter the ad hoc query panel. Right-click Ad Hoc Query, and select Create Node>EMR Hive.

  2. Query synchronization result tables.

    Execute the following SQL statements to confirm the data write results. View the number of records that are imported into the ods_raw_log_d_emr and ods_user_info_d_emr tables.

    --The partition key column in the query statement needs to be updated to the data timestamp. For example, if the node is run on November 7, 2019, the data timestamp is 20191106, which is one day before the node is run.
    SELECT * from ods_user_info_d_emr where dt=Data timestamp; 
    SELECT * from ods_raw_log_d_emr where dt=Data timestamp; 

What to do next

After familiarizing yourself with data synchronization in this tutorial, you can move on to the next one. The following tutorial will teach you how to compute and analyze the synchronized data. For more information, see Process data.