DataWorks: Synchronize data

Last Updated: Apr 18, 2025

This topic describes how to add the HttpFile and MySQL data sources that provide the basic user information and user website access logs used in this tutorial. It also describes how to configure data synchronization tasks that write the data to the Object Storage Service (OSS) data source you added when you prepared the environment, and how to use E-MapReduce (EMR) Hive nodes to create tables and query the synchronized data in OSS. Finally, you can query and verify the results to check whether the data synchronization is complete.

Prerequisites

  • An environment is prepared. For more information, see Prepare environments.

  • A security group rule is added in the ECS console to the security group that is associated with the EMR cluster to ensure network connectivity: open port 10000 of the Elastic Compute Service (ECS) instance to DataWorks and set Authorization Object to the CIDR block of the vSwitch with which the resource group is associated. For more information, see Add a security group rule.

Step 1: Add data sources

To ensure smooth data processing in subsequent operations, you must add the following data sources to your DataWorks workspace to obtain and store the test data provided by the system.

  • MySQL data source: a test data source provided by DataWorks that stores the basic user information.

  • HttpFile data source: a test data source provided by DataWorks that stores the user website access logs.

  • OSS data source: the OSS-HDFS data source that you created in Prepare an EMR environment. It is used to store the basic user information and user website access logs that are obtained from the MySQL and HttpFile data sources.

Add a MySQL data source named user_behavior_analysis_mysql

The basic user information provided by this tutorial is stored in a MySQL database. You must add the MySQL database to your DataWorks workspace as a data source. This way, you can obtain the basic user information from the MySQL table ods_user_info_d in the data source.
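If you have network access to the MySQL database from a client, you can optionally preview the test data before you add the data source. The following query is a minimal sketch; the column names are assumptions based on the Hive table ods_user_info_d_emr that is created later in this tutorial:

    -- Preview a few rows of the basic user information table.
    -- Assumes the source table uses the same columns (uid, gender, age_range, zodiac)
    -- as the Hive table that is created in Step 5.
    SELECT uid, gender, age_range, zodiac
    FROM workshop.ods_user_info_d
    LIMIT 10;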

  1. Log on to the DataWorks console and go to the SettingCenter page. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click MySQL.

  3. On the Add MySQL Data Source page, configure the parameters. In this tutorial, the same values are used for the development environment and the production environment.

    The following list describes the key parameters that you must configure in this tutorial. You can retain the default values for the other parameters.

    Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_mysql is used.

    Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.

    Configuration Mode: Select Connection String Mode.

    Connection Address: Set Host IP Address to rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com and Port Number to 3306.

    Database Name: Enter a database name. In this tutorial, workshop is entered.

    Username: Enter the username that is used to connect to the database. In this tutorial, workshop is entered.

    Password: Enter the password that is used to connect to the database. In this tutorial, workshop#2017 is entered.

    Authentication Method: Select No Authentication.

  4. In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.

    Important
    • You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.

    • If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.

  5. Click Complete Creation.

Add an HttpFile data source named user_behavior_analysis_httpfile

The user website access records provided by this tutorial are stored in OSS. You must add an HttpFile data source to your DataWorks workspace. This way, you can obtain the user website access records from the OSS object user_log.txt in the data source.

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.

  3. On the Add HttpFile Data Source page, configure the parameters. In this tutorial, the same values are used for the development environment and the production environment.

    The following list describes the key parameters that you must configure in this tutorial. You can retain the default values for the other parameters.

    Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_httpfile is entered.

    Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.

    URL: Enter a URL. In this tutorial, https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com is entered for both the development environment and the production environment.

  4. In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.

    Important
    • You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.

    • If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.

  5. Click Complete Creation.

Add an OSS data source named test_g

You can perform the following operations to add the OSS data source created in Prepare an EMR environment to DataWorks to store the data that is synchronized from the MySQL data source and the HttpFile data source.

  1. Log on to the DataWorks console and go to the SettingCenter page. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click OSS.

  3. On the Add OSS Data Source page, configure the parameters. In this tutorial, the same values are used for the development environment and the production environment. The following list describes the parameters.

    Data Source Name: The name of the data source. In this example, test_g is used.

    Data Source Description: The description of the data source.

    Access Mode: Select AccessKey Mode.

    AccessKey ID: The AccessKey ID of the account that is used to log on to DataWorks. You can go to the AccessKey page to copy the AccessKey ID.

    AccessKey Secret: The AccessKey secret of the account that is used to log on to DataWorks.

    Important: The AccessKey secret is displayed only when it is created. You cannot view it afterward. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new one.

    Region: In this example, select China (Shanghai).

    Endpoint: The endpoint of OSS. In this example, oss-cn-shanghai-internal.aliyuncs.com is entered.

    Bucket: The name of the OSS bucket that you created when you prepared the environment. In this example, the bucket name is dw-emr-demo.

  4. In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.

    Important
    • You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.

    • If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.

  5. Click Complete Creation.

Step 2: Establish a synchronization link

In this step, you establish a synchronization link that synchronizes the basic user information and user website access records to Hive tables in preparation for subsequent data processing.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon to go to the DATASTUDIO pane. The Workspace Directories section appears.

  3. Create a workspace directory named work and a workflow named workshop_emr. For more information, see Workspace directories and Auto triggered workflow.

  4. In the Create Node dialog box, create a zero load node, two batch synchronization nodes, and two EMR Hive nodes.

    1. On the configuration tab of the workflow, drag Zero Load Node to the canvas. In the Create Node dialog box, set the Node Name parameter to workshop_start_emr and click Confirm.

    2. On the configuration tab of the workflow, drag Batch Synchronization to the canvas twice to create two batch synchronization nodes. In the Create Node dialog box, set the Node Name parameter to ods_user_info_d_2oss_emr for the first node and ods_raw_log_d_2oss_emr for the second node, and click Confirm. The nodes are used to synchronize the basic user information stored in MySQL and the user website access logs stored in HttpFile to the OSS bucket that you created.

    3. On the configuration tab of the workflow, drag EMR Hive to the canvas twice to create two EMR Hive nodes. In the Create Node dialog box, set the Node Name parameter to ods_user_info_d_emr for the first node and ods_raw_log_d_emr for the second node, and click OK. The nodes are used to load the data in the OSS bucket into the Hive tables.

    The following list describes the nodes that are used in this tutorial and their functionalities.

    Zero load node (workshop_start_emr): This node is used to manage the entire user profile analysis workflow and clarify the data forwarding path. This node is a dry-run node. You do not need to edit code for the node.

    Batch synchronization node (ods_user_info_d_2oss_emr): This node is used to synchronize the basic user information stored in MySQL to the OSS data source test_g.

    Batch synchronization node (ods_raw_log_d_2oss_emr): This node is used to synchronize the user website access logs stored in HttpFile to the OSS data source test_g.

    EMR Hive node (ods_user_info_d_emr): This node is used to synchronize the basic user information stored in the OSS data source test_g to the Hive table ods_user_info_d_emr.

    EMR Hive node (ods_raw_log_d_emr): This node is used to synchronize the user website access logs stored in the OSS data source test_g to the Hive table ods_raw_log_d_emr.

  5. On the workflow configuration tab, drag directed lines to configure the workshop_start_emr node as the ancestor node of the two batch synchronization nodes.
  6. Configure scheduling properties for the workflow.

    In the right-side navigation pane of the configuration tab of the workflow, click Properties. On the Properties tab, configure the parameters. The following list describes the key parameters that you must configure in this tutorial. You can retain the default values for the other parameters.

    Scheduling Parameters: You can configure scheduling parameters for the workflow, and the inner nodes of the workflow can directly use them. In this tutorial, bizdate=$[yyyymmdd-1] is configured to obtain the date of the previous day, as illustrated by the query after this list.

    Scheduling Cycle: In this tutorial, this parameter is set to Day.

    Scheduling Time: In this tutorial, this parameter is set to 00:30. This way, the workflow starts to run at 00:30 each day.

    Scheduling Dependencies: The workflow does not need to depend on another node, so you can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node of the workflow. The name of the root node is in the Workspace name_root format.
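    The following Hive query is a quick sanity check (a sketch that you can run on the EMR cluster) that computes the same value that bizdate=$[yyyymmdd-1] resolves to at runtime:

    -- Yesterday's date in yyyyMMdd format, which is the value that
    -- bizdate=$[yyyymmdd-1] resolves to when the workflow runs.
    SELECT date_format(date_sub(current_date, 1), 'yyyyMMdd');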

  7. After you configure the scheduling properties of the workflow, click Save. Then, click Change Review. In the Change Review dialog box, check the workflow configuration and click Save.

Step 3: Configure the inner nodes of the workflow

You can perform the following steps to synchronize the basic user information in the MySQL data source and the user website access logs in the HttpFile data source to the OSS data source that you created, and to create Hive external tables to parse and obtain the synchronized data in OSS.

Configure the zero load node

You can perform the following operations to configure the scheduling properties of the zero load node.

  1. Open the zero load node.

    In the canvas, move the pointer over the zero load node and click Open Node.

  2. Configure the zero load node.

    In the right-side navigation pane of the configuration tab of the zero load node, click Properties. On the Properties tab, configure the parameters. The following list describes the key parameters that you must configure in this tutorial.

    Support For Rerun: Set the value to Allow Regardless of Running Status.

    Resource Group For Scheduling: Select the serverless resource group that you purchased in the environment preparation phase.

    Scheduling Dependencies: The node does not need to depend on another node, so you can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node. The name of the root node is in the Workspace name_root format.

  3. Save the zero load node.

    After the configuration is complete, click the Save icon in the top toolbar to save the node.

Synchronize the basic user information in the MySQL data source to the OSS data source

You can perform the following operations to synchronize the basic user information in the MySQL data source to the OSS data source.

  1. Open the batch synchronization node.

    In the canvas, move the pointer over the synchronization node ods_user_info_d_2oss_emr and click Open Node. The Configure Network Connections and Resource Group page appears.

  2. Configure network connections and a resource group.

    Configure network connections and a resource group as described in the following list. After the network connectivity test is complete, click Next. The Configure tasks page appears.

    Source: Set Source to MySQL and Data Source Name to user_behavior_analysis_mysql.

    Resource Group: Select the serverless resource group that you created when you prepared the environment.

    Destination: Set Destination to OSS and Data Source Name to test_g, which is the name of the OSS data source that you added.

  3. Configure information about the source and destination.

    1. Configure the key parameters that are described in the following list. You can retain the default values for the other parameters.

      Source:

      • Table: Select the ods_user_info_d table in the data source.

      • Split key: The shard key for the data to be read. We recommend that you use the primary key or an indexed column as the shard key. Only fields of the INTEGER type are supported. In this example, uid is used.

      Destination:

      • Text type: Set the value to text.

      • Object Name (Path Included): The path of the OSS object. Configure this parameter based on the folder that you created in the OSS bucket. In this example, ods_user_info_d/user_${bizdate}/user_${bizdate}.txt is entered. ods_user_info_d is the name of the folder that you created in the OSS bucket, and ${bizdate} indicates the date of the previous day.

      • Column Delimiter: Enter |.

    2. Confirm field mappings in the Field Mapping section and configure channel control settings in the Channel Control section.

      DataWorks allows you to establish mappings between source fields and destination fields to synchronize data from specified source fields to specified destination fields. DataWorks also allows you to configure parameters such as Task Expected Maximum Concurrency and Policy for Dirty Data Records. In this tutorial, the Policy for Dirty Data Records parameter is set to Disallow Dirty Data Records, and default values are retained for the remaining parameters. For more information, see Configure a batch synchronization task by using the codeless UI.

  4. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    Resource Group: Select the serverless resource group that you created when you prepared the environment. For more information, see Prepare environments.

    Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable represents the data timestamp. When you debug the workflow in Step 4, you can set the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.

  5. Save the batch synchronization node.

    After the configuration is complete, click the Save icon in the top toolbar to save the node.

Synchronize the user website access logs in the HttpFile data source to the OSS bucket

You can perform the following operations to synchronize the user website access logs in the HttpFile data source to the OSS data source.

  1. Open the batch synchronization node.

    In the canvas, move the pointer over the batch synchronization node ods_raw_log_d_2oss_emr and click Open Node. The Configure Network Connections and Resource Group page appears.

  2. Configure network connections and a resource group.

    Configure network connections and a resource group as described in the following list. After the network connectivity test is complete, click Next. The Configure tasks page appears.

    Source: Set Source to HttpFile and Data Source Name to user_behavior_analysis_httpfile.

    Resource Group: Select the serverless resource group that you created when you prepared the environment.

    Destination: Set Destination to OSS and Data Source Name to test_g, which is the name of the private OSS data source that you added.

  3. Configure information about the source and destination.

    1. Configure the key parameters that are described in the following list. You can retain the default values for the other parameters.

      Source:

      • File Path: Set the value to /user_log.txt.

      • Text type: Set the value to text.

      • Column Delimiter: Enter |.

      • Compression format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. In this example, select None.

      • Skip Header: Set the value to No.

      Destination:

      • Text type: Set the value to text.

      • Object Name (Path Included): The path of the OSS object. Configure this parameter based on the folder that you created in the OSS bucket. In this example, ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt is entered. ods_raw_log_d is the name of the folder that you created in the OSS bucket, and ${bizdate} indicates the date of the previous day.

      • Column Delimiter: Enter |.

    2. Configure debugging parameters.

      In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

      Resource Group: Select the serverless resource group that you created when you prepared the environment. For more information, see Prepare environments.

      Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable represents the data timestamp. When you debug the workflow in Step 4, you can set the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.

  4. Save the batch synchronization node.

    After the configuration is complete, click the Save icon in the top toolbar to save the node.

Step 4: Run the batch synchronization node

In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify the value to use for the scheduling parameters that are defined for the nodes in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a different value based on your business requirements.

Step 5: Parse OSS data

After the batch synchronization nodes finish running, you can perform the following steps to create Hive external tables that parse the test data stored in the test_g OSS data source.

Create a table named ods_user_info_d_emr to parse data

  1. Develop the code.

    In the canvas, move the pointer over the EMR Hive node ods_user_info_d_emr and click Open Node to go to the configuration tab of the node. Enter the following sample SQL statement:

    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr
    (
        `uid` STRING COMMENT 'The user ID',
        `gender` STRING COMMENT 'The gender',
        `age_range` STRING COMMENT 'The age range',
        `zodiac` STRING COMMENT 'The zodiac sign'
    ) PARTITIONED BY (
      dt STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/';
    
    ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
    Note

    In the preceding code, a sample path is used for LOCATION. The path is based on the value of the Object Name (Path Included) parameter that you configured for the related batch synchronization node. You must set LOCATION to the path of the folder that you created. dw-emr-demo is the name of the OSS bucket that you created when you prepared the environment.
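    For example, if you set ${bizdate} to the constant 20250223 when you run the workflow in Step 4, the ALTER TABLE statement expands to the following, which registers a partition that points to the object path written by the ods_user_info_d_2oss_emr node:

    ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='20250223')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_20250223/';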

  2. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the node.

    Resource Group: Select the serverless resource group that you created when you prepared the environment. For more information, see Prepare environments.

    Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable represents the data timestamp. When you debug the node, you can set the variable to a constant, such as 20250223. When the node is run, the variable defined for the node is replaced with the constant.

  3. Save and run the node.

    After the configuration is complete, you can click Save and Run to save and run the node.

    • If the node run succeeds, the synchronization process is normal.

    • If the node run fails and the error message "java.net.ConnectException: Connection timed out (Connection timed out)" appears, you must add a security group rule in the ECS console: open port 10000 of the ECS instance to DataWorks and set Authorization Object to the CIDR block of the vSwitch with which the resource group is associated. To obtain the vSwitch CIDR block, go to the Resource Groups page, find the desired resource group, and click Network Settings in the Actions column. On the VPC Binding tab of the page that appears, obtain the vSwitch CIDR block. For more information about how to add security group rules, see Add a security group rule.

  4. Query data synchronization results.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon to go to the DATASTUDIO pane. The Workspace Directories section appears.

    3. Right-click the work workspace directory that you created and choose Create Node > EMR > EMR Hive. In the Create Node dialog box, enter the name of the EMR Hive node and click OK to create the node.

      Go to the configuration tab of the EMR Hive node, replace the data timestamp in the following script with the actual data timestamp, and confirm the data write results. You can view the number of records that are imported into the ods_raw_log_d_emr and ods_user_info_d_emr tables.

      Note

      In the query statement, change the partition key value to the data timestamp of the node. For example, if a node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.

      SELECT * FROM ods_user_info_d_emr WHERE dt='<data timestamp>';
      • If the returned results show that data exists in the destination, the synchronization is complete.

      • If the returned results show that no data exists in the destination, make sure that the value specified for the scheduling parameters in this run is the same as the value of the dt field in the preceding statement. To check this, click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the record generated for this run. In the run logs, the data timestamp that was used appears in the partition=[pt=xxx] format.
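      For example, if you specified 20250223 as the runtime constant in Step 4, a concrete instance of the check is:

      -- Count the rows that were written into the partition for the debug run.
      SELECT COUNT(*) AS row_cnt FROM ods_user_info_d_emr WHERE dt='20250223';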

Create a table named ods_raw_log_d_emr to parse data

  1. Develop the code.

    In the canvas, move the pointer over the EMR Hive node ods_raw_log_d_emr and click Open Node to go to the configuration tab of the node. Enter the following sample SQL statement:

    -- Create the table used to store user website access logs.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr
    (
      `col` STRING
    ) PARTITIONED BY (
      dt STRING
    );
    ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
    Note

    In the preceding code, a sample path is used for LOCATION. The path is based on the value of the Object Name (Path Included) parameter that you configured for the related batch synchronization node. You must set LOCATION to the path of the folder that you created. dw-emr-demo is the name of the OSS bucket that you created when you prepared the environment.
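    Because each raw log line is loaded as a whole string into the single col column, you can spot-check a few lines once the partition is registered. The following sketch assumes that you used the constant 20250223 in Step 4:

    -- Inspect a few raw log lines in the debug-run partition.
    SELECT col FROM ods_raw_log_d_emr WHERE dt='20250223' LIMIT 5;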

  2. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the node.

    Resource Group: Select the serverless resource group that you created when you prepared the environment. For more information, see Prepare environments.

    Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable represents the data timestamp. When you debug the node, you can set the variable to a constant, such as 20250223. When the node is run, the variable defined for the node is replaced with the constant.

  3. Save and run the node.

    After the configuration is complete, you can click Save and Run to save and run the node.

  4. Query data synchronization results.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon to go to the DATASTUDIO pane. The Workspace Directories section appears.

    3. Right-click the work workspace directory that you created and choose Create Node > EMR > EMR Hive. In the Create Node dialog box, enter the name of the EMR Hive node and click OK to create the node.

      Go to the configuration tab of the EMR Hive node, replace the data timestamp in the following script with the actual data timestamp, and confirm the data write results. You can view the number of records that are imported into the ods_raw_log_d_emr and ods_user_info_d_emr tables.

      Note

      In the query statement, change the partition key value to the data timestamp of the node. For example, if a node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.

      SELECT * FROM ods_raw_log_d_emr WHERE dt='<data timestamp>';
      • If the returned results show that data exists in the destination, the synchronization is complete.

      • If the returned results show that no data exists in the destination, make sure that the value specified for the scheduling parameters in this run is the same as the value of the dt field in the preceding statement. To check this, click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the record generated for this run. In the run logs, the data timestamp that was used appears in the partition=[pt=xxx] format.
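      As a final cross-check, you can compare the row counts of the two destination tables for the same data timestamp in one query. The following sketch assumes that you used the constant 20250223 in Step 4:

      -- Compare row counts of both destination tables for the same partition.
      SELECT 'ods_user_info_d_emr' AS tbl, COUNT(*) AS row_cnt FROM ods_user_info_d_emr WHERE dt='20250223'
      UNION ALL
      SELECT 'ods_raw_log_d_emr' AS tbl, COUNT(*) AS row_cnt FROM ods_raw_log_d_emr WHERE dt='20250223';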

What to do next

After you understand how to synchronize data based on this tutorial, you can proceed to the next tutorial. In the next tutorial, you will learn how to compute and analyze the synchronized data. For more information, see Process data.