
DataWorks:Synchronize data

Last Updated: Aug 29, 2024

This topic describes how to add HttpFile and MySQL data sources to access the sample basic user information and website access logs provided for this example, how to configure data synchronization tasks that synchronize the data to a private Object Storage Service (OSS) data source, and how to use E-MapReduce (EMR) Hive nodes to create tables and query the synchronized data.

Prerequisites

An environment is prepared. For more information, see Prepare environments.

Step 1: Add data sources

Add an HttpFile data source, a MySQL data source, and an OSS data source that you own to your DataWorks workspace so that you can perform the subsequent operations. The HttpFile and MySQL data sources provide the test data used in this example: the basic user information and website access logs of users. Your OSS data source stores the synchronized test data.

Add an HttpFile data source

  1. Go to the Data Source page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Management Center in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources.

  2. On the Data Sources page, click Add Data Source.

  3. In the Add Data Source dialog box, select HttpFile.

  4. On the Add HttpFile Data Source page, configure the parameters. In this example, the sample values are used.

    • Data Source Name: The name of the data source. In this example, user_behavior_analysis_httpfile is used.

    • Data Source Description: The description of the data source. State that the data source is provided exclusively for DataWorks use cases, serves as the source of a batch synchronization task to access the provided test data, and can be used only for data reading in data synchronization scenarios.

    • Environment: Select Development Environment and Production Environment.

      Note: After you add the data source in the development environment, select Production Environment to add the same data source in the production environment. Otherwise, an error is reported when a task that uses this data source is run in the production environment.

    • URL: Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.

  5. Find a desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connectable appears in the corresponding column.

    Important

    Make sure that at least one resource group is connectable. Otherwise, you cannot use the codeless user interface (UI) to configure a data synchronization task for the data source.

  6. Click Complete Creation.

Add a MySQL data source

  1. On the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, select MySQL.

  3. On the Add MySQL Data Source page, configure the parameters. In this example, the sample values are used.

    • Configuration Mode: Set this parameter to Connection String Mode.

    • Data Source Name: The name of the data source. In this example, user_behavior_analysis_mysql is used.

    • Data Source Description: The description of the data source. State that the data source is provided exclusively for DataWorks use cases, serves as the source of a batch synchronization task to access the provided test data, and can be used only for data reading in data synchronization scenarios.

    • JDBC URL: Set this parameter to jdbc:mysql://rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/workshop.

      Note: The data source is added in public connection string mode so that the test MySQL data source provided by DataWorks can be accessed over a public IP address.

    • Username: The username. In this example, workshop is used.

    • Password: The password. In this example, workshop#2017 is used.

    • Authentication Method: Set this parameter to No Authentication.

  4. Find a desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connectable appears in the corresponding column.

  5. Click Complete Creation.
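
Optionally, before you configure the synchronization task, you can run a quick test query from any MySQL client to confirm that the connection information above works. The following is only a sketch: it assumes the source table ods_user_info_d and its columns (uid, gender, age_range, zodiac), which are described later in this tutorial.

    -- Optional check from any MySQL client, using the JDBC URL, username, and password above.
    -- Assumes the source table ods_user_info_d that is synchronized later in this tutorial.
    SELECT uid, gender, age_range, zodiac
    FROM ods_user_info_d
    LIMIT 10;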

Add a private OSS data source

In this example, the basic user information from the MySQL data source and the website access logs from the HttpFile data source are synchronized to a private OSS data source.

  1. On the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, select OSS.

  3. In the Add OSS Data Source dialog box, configure the parameters.

    • Data Source Name: The name of the data source. In this example, test_g is used.

    • Data Source Description: The description of the data source.

    • Endpoint: The endpoint of OSS. In this example, http://oss-cn-shanghai-internal.aliyuncs.com is used.

    • Bucket: The name of the OSS bucket that you created when you prepared the environment. In this example, dw-emr-demo is used.

    • AccessKey ID: The AccessKey ID of the account that is used to log on to DataWorks. You can go to the Security Management page to copy the AccessKey ID.

    • AccessKey Secret: The AccessKey secret of the account that is used to log on to DataWorks.

      Important: The AccessKey secret is displayed only when it is created and cannot be viewed afterward. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new one.

  4. Find a desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connectable appears in the corresponding column.

    Important

    Make sure that at least one resource group is connectable. Otherwise, you cannot use the codeless UI to configure a data synchronization task for the data source.

  5. Click Complete Creation.

Step 2: Configure data synchronization tasks

Create a workflow

  1. Click the icon in the upper-left corner, and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, configure the Workflow Name and Description parameters.

    Important

    The workflow name can be up to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

  4. Click Create.

  5. Double-click the new workflow to go to the workflow configuration tab. Click Create Node in the top toolbar and drag Zero-Load Node in the General section to the canvas on the right. In the Create Node dialog box, set the Name parameter to workshop_start_emr and click Confirm.

    Create two batch synchronization nodes named ods_raw_log_d_2oss_emr and ods_user_info_d_2oss_emr in the same manner. The two nodes are used to synchronize the basic user information from the MySQL data source and the website access logs from the HttpFile data source.

  6. Drag directed lines to configure the workshop_start_emr node as the ancestor node of the two batch synchronization nodes.


Configure the workshop_start_emr node

  1. In the Scheduled Workflow pane, double-click the zero load node in the workflow. In the right-side navigation pane of the node configuration tab, click Properties.

  2. In the Schedule section, set the Rerun parameter to Allow Regardless of Running Status.

    Scheduling configuration:

    • Scheduling time: If you set the scheduling time of the zero load node to 00:30, the zero load node triggers the current workflow to run at 00:30 every day. You can set the Rerun parameter to Allow Regardless of Running Status.

    • Scheduling dependencies: The zero load node workshop_start_emr does not have any ancestor nodes. In this case, configure the zero load node as a descendant node of the root node of the workspace so that the root node triggers workshop_start_emr to run. The root node of a workspace is named in the Workspace name_root format.

  3. After the configuration is complete, click the Save icon in the upper-left corner.

Step 3: Configure data synchronization nodes

Configure the batch synchronization nodes

  1. Synchronize the basic user information of the MySQL data source to the created OSS bucket.

    1. On the DataStudio page, double-click the ods_user_info_d_2oss_emr node to go to the node configuration page.

    2. Establish network connections between the resource group that you want to use and the data sources.

      After you complete the configuration of network connections and resources, click Next and complete the connectivity test as prompted.

      • Source: Set Source to MySQL and Data Source Name to user_behavior_analysis_mysql.

      • Resource Group: Select the serverless resource group that you purchased.

      • Destination: Set Destination to OSS and Data Source Name to test_g, which specifies the name of the private OSS data source that you added.

    3. Configure the data synchronization node.

      • Source:

        • Table: Select the ods_user_info_d table in the data source.

        • Split key: The split key for the data to be read. We recommend that you use the primary key or an indexed column as the split key. Only fields of the INTEGER type are supported. In this example, uid is used.

      • Destination:

        • Text type: Set it to text.

        • Object Name (Path Included): The path of the OSS object. Configure this parameter based on the folder that you created in the OSS bucket. In this example, ods_user_info_d/user_${bizdate}/user_${bizdate}.txt is entered. ods_user_info_d is the name of the folder that you created in the OSS bucket, and ${bizdate} resolves to the date of the previous day.

        • Column Delimiter: Enter |.

    4. Configure scheduling properties for the node.

      In the right-side navigation pane of the node configuration tab, click Properties. On the Properties tab, you can configure scheduling properties and basic information about the node. The following settings are used in this example:

      • Scheduling Parameter: Click Add Parameter in the Scheduling Parameter section. In the row that appears, add a parameter named bizdate (to match the ${bizdate} variable in the object path) and set Parameter Value to $[yyyymmdd-1]. For example, if the node runs on August 30, 2024, bizdate resolves to 20240829.

      • Dependencies: Make sure that the generated table is used as the output table of the current node. The output table is named in the Workspace name.Node name format.

    5. After the configuration is complete, click the Save icon in the toolbar.

  2. Synchronize the website access logs of the HttpFile data source to the created OSS bucket.

    1. On the DataStudio page, double-click the ods_raw_log_d_2oss_emr node to go to the node configuration page.

    2. Establish network connections between the resource group that you want to use and the data sources.

      After you complete the configuration of network connections and resources, click Next and complete the connectivity test as prompted.

      • Source: Set Source to HttpFile and Data Source Name to user_behavior_analysis_httpfile.

      • Resource Group: Select the serverless resource group that you purchased.

      • Destination: Set Destination to OSS and Data Source Name to test_g, which specifies the name of the private OSS data source that you added.

    3. Configure the data synchronization node.

      • Source:

        • Text type: Set it to text.

        • File Path: Set it to /user_log.txt.

        • Column Delimiter: Enter |.

        • Compression format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. Select None.

        • Skip Header: Set it to No.

      • Destination:

        • Text type: Set it to text.

        • Object Name (Path Included): The path of the OSS object. Configure this parameter based on the folder that you created in the OSS bucket. In this example, ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt is entered. ods_raw_log_d is the name of the folder that you created in the OSS bucket, and ${bizdate} resolves to the date of the previous day.

        • Column Delimiter: Enter |.

    4. Configure scheduling properties for the node.

      In the right-side navigation pane of the node configuration tab, click Properties. On the Properties tab, you can configure scheduling properties and basic information about the node. The following settings are used in this example:

      • Scheduling Parameter: Click Add Parameter in the Scheduling Parameter section. In the row that appears, add a parameter named bizdate (to match the ${bizdate} variable in the object path) and set Parameter Value to $[yyyymmdd-1].

      • Dependencies: Make sure that the generated table is used as the output table of the current node. The output table is named in the Workspace name.Node name format.

    5. After the configuration is complete, click the Save icon in the toolbar.

Create EMR tables

You need to create two EMR tables: ods_user_info_d_emr and ods_raw_log_d_emr. You can then use the tables to query the synchronized basic user information from the MySQL data source and the website access logs from the HttpFile data source.

  1. In the Scheduled Workflow pane, click the new workflow, right-click EMR, and then choose Create Node > EMR Hive.

  2. In the Create Node dialog box, configure the Name parameter and click Confirm.

    You need to create two EMR Hive nodes: ods_user_info_d_emr and ods_raw_log_d_emr, which are used to create the related tables. Then, draw directed lines to configure each batch synchronization node as the ancestor of the EMR Hive node with the matching name.

  3. On the configuration tab of each EMR Hive node, enter a CREATE TABLE statement and select a serverless resource group for the Resource Group parameter in the Resource Group section of the Properties tab. Then, click Save and execute the CREATE TABLE statement for each node.

    • Create the ods_user_info_d_emr table.

      Double-click the ods_user_info_d_emr node. On the node configuration tab that appears, configure the ods_user_info_d_emr node.

      1. Edit the SQL code.

        CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
        (
            `uid` STRING COMMENT 'The user ID',
            `gender` STRING COMMENT 'The gender',
            `age_range` STRING COMMENT 'The age range',
            `zodiac` STRING COMMENT 'The zodiac sign'
        ) PARTITIONED BY (
          dt STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
        LOCATION 'oss://dw-emr-demo/ods_user_info_d/';
        
        ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
        LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
        Note

        In the preceding code, a sample path is used for LOCATION. The path must match the value of the Object Name (Path Included) parameter that you configured for the related batch synchronization node. dw-emr-demo is the name of the OSS bucket that you created when you prepared the environment. A verification sketch is provided after the following configuration steps.

      2. Configure scheduling properties for the node.

        • Scheduling Parameter: Click Add Parameter in the Scheduling Parameter section. In the row that appears, add a parameter named bizdate (to match the ${bizdate} variable in the code) and set Parameter Value to $[yyyymmdd-1].

        • Dependencies: Make sure that the generated table is used as the output table of the current node. The output table is named in the Workspace name.Node name format.

      3. After the configuration is complete, click the Save icon.
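
      After both statements are executed and the ods_user_info_d_2oss_emr synchronization node has run at least once, you can verify the result with Hive queries similar to the following sketch. Replace ${bizdate} with the actual partition value when you run the queries manually.

        -- Verify that the partition was registered and contains synchronized data.
        SHOW PARTITIONS ods_user_info_d_emr;
        SELECT * FROM ods_user_info_d_emr WHERE dt='${bizdate}' LIMIT 10;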

    • Create the ods_raw_log_d_emr table.

      Double-click the ods_raw_log_d_emr node. On the node configuration tab that appears, configure the ods_raw_log_d_emr node.

      1. Edit the SQL code.

        -- Create the table used to store website access logs.
        CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
        (
          `col` STRING
        ) PARTITIONED BY (
          dt STRING
        );
        ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
        LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
        Note

        In the preceding code, a sample path is used for LOCATION. The path must match the value of the Object Name (Path Included) parameter that you configured for the related batch synchronization node. dw-emr-demo is the name of the OSS bucket that you created when you prepared the environment. A quick check is sketched after the following configuration steps.

      2. Configure scheduling properties for the node.

        • Scheduling Parameter: Click Add Parameter in the Scheduling Parameter section. In the row that appears, add a parameter named bizdate (to match the ${bizdate} variable in the code) and set Parameter Value to $[yyyymmdd-1].

        • Dependencies: Make sure that the generated table is used as the output table of the current node. The output table is named in the Workspace name.Node name format.

      3. After the configuration is complete, click the Save icon.
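
      As with the user information table, you can check the synchronized logs with a quick Hive query after the ods_raw_log_d_2oss_emr node has run. The following sketch counts the log rows in each partition:

        -- Count the synchronized log rows per partition.
        SELECT dt, COUNT(*) AS row_cnt
        FROM ods_raw_log_d_emr
        GROUP BY dt;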

Run the nodes in the workflow and view the results

Run the workflow

  1. On the DataStudio page, double-click the WorkShop workflow under Business Flow. On the configuration tab of the workflow, click the Run icon in the toolbar to run the nodes in the workflow based on the scheduling dependencies between the nodes.

  2. Confirm the status.

    • View the status of each node: If a node is in the successful state, the synchronization process is normal.

    • View the node run logs: Right-click the ods_user_info_d_emr or ods_raw_log_d_emr node and select View Logs. If the logs indicate that the node is successfully run, the data is synchronized.

Query synchronization results

  1. Create an ad hoc query.

    In the left-side navigation pane of the DataStudio page, click Ad Hoc Query. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > EMR Hive.

  2. Query synchronization result tables.

    Execute the following SQL statements to confirm the data write results. View the number of records that are imported into the ods_raw_log_d_emr and ods_user_info_d_emr tables.

    -- In the query statements, replace the partition key value with the data timestamp of the node. For example, if the node is run on November 7, 2019, the data timestamp is 20191106, which is one day before the node is run.
    SELECT * FROM ods_user_info_d_emr WHERE dt='<data timestamp>';
    SELECT * FROM ods_raw_log_d_emr WHERE dt='<data timestamp>';
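
    For example, assuming the nodes ran on November 7, 2019, the data timestamp is 20191106, and the following sketch counts the rows that were synchronized into each table for that partition:

    -- Hypothetical run date: November 7, 2019, so the data timestamp is 20191106.
    SELECT COUNT(*) FROM ods_user_info_d_emr WHERE dt='20191106';
    SELECT COUNT(*) FROM ods_raw_log_d_emr WHERE dt='20191106';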

What to do next

After you understand how to synchronize data based on this tutorial, you can proceed to the next tutorial. In the next tutorial, you will learn how to compute and analyze the synchronized data. For more information, see Process data.