DataWorks: Synchronize data

Last Updated: Mar 26, 2026

In this tutorial, you use batch synchronization tasks in Data Integration to move data from two sources — basic user information stored in the ods_user_info_d MySQL table and website access logs stored in the user_log.txt HttpFile — into a private Object Storage Service (OSS) bucket. You then create external tables using Spark SQL so that E-MapReduce (EMR) can query the synchronized data directly from OSS.

Prerequisites

Before you begin, make sure that you have:

  • A serverless resource group purchased in your DataWorks workspace

  • A private OSS bucket created with the following folders: ods_raw_log_d and ods_user_info_d

  • A private OSS data source named test_g added to DataWorks

  • An HttpFile data source named user_behavior_analysis_httpfile added to DataWorks

  • A MySQL data source named user_behavior_analysis_mysql added to DataWorks

What you'll learn

By the end of this tutorial, you will have:

  1. Designed a workflow with scheduling dependencies in DataWorks

  2. Configured batch synchronization tasks to move data from MySQL and HttpFile to a private OSS bucket

  3. Created Spark SQL external tables that point to the synchronized OSS data

  4. Verified the synchronized data using an ad hoc query

How it works

This tutorial covers the data acquisition phase of a user profile analysis pipeline. Two batch synchronization nodes move raw data into OSS — one from an HttpFile source (website access logs) and one from a MySQL source (basic user information). Two EMR Spark SQL nodes then create external tables that point to the OSS data using LOCATION, so Spark can query the data without copying it again.

The bizdate scheduling parameter represents the business date — the date on which the source data was generated, which is always one day before the task runs (T-1). For example, if the task runs on August 8, 2024, bizdate equals 20240807. This parameter is used to organize OSS objects and Spark partitions by date.
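The T-1 rule and the date-based paths it produces can be sketched in a few lines of Python (illustrative only; DataWorks evaluates the $[yyyymmdd-1] expression itself at run time):

```python
from datetime import date, timedelta

def bizdate_for(run_date: date) -> str:
    """Return the business date (T-1) in yyyymmdd form,
    mirroring the $[yyyymmdd-1] scheduling expression."""
    return (run_date - timedelta(days=1)).strftime('%Y%m%d')

# A task that runs on August 8, 2024 processes data generated the day before.
bd = bizdate_for(date(2024, 8, 8))
print(bd)  # 20240807

# The same value is substituted into the OSS object path at run time:
print(f'ods_raw_log_d/log_{bd}/log_{bd}.txt')
```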

Step 1: Design a workflow

Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

Create a workflow

Create a workflow named User profile analysis_Spark. For instructions, see Create a workflow.

Add nodes and configure dependencies

After the workflow is created, the workflow canvas opens automatically. Click Create Node in the upper part of the canvas, drag the nodes onto the canvas, and draw lines to set the dependency order. For details on designing the dependency graph, see Workflow design.

The zero load node has no lineage relationship with the synchronization nodes, so you must configure dependencies by drawing lines manually. For more information, see Scheduling dependency configuration guide.

The workflow uses five nodes:

  • workshop_start_spark (General > Zero load node): Triggers the entire workflow at 00:30 every day. This is a dry-run node; no code is required.

  • ods_raw_log_d_2oss_spark (Data Integration > Batch synchronization): Syncs website access logs from HttpFile to the private OSS bucket.

  • ods_user_info_d_2oss_spark (Data Integration > Batch synchronization): Syncs basic user information from MySQL to the private OSS bucket.

  • ods_raw_log_d_spark (EMR > EMR Spark SQL): Creates the ods_raw_log_d_spark external table to access the synchronized logs in OSS.

  • ods_user_info_d_spark (EMR > EMR Spark SQL): Creates the ods_user_info_d_spark external table to access the synchronized user information in OSS.

Configure the scheduling logic

The workshop_start_spark zero load node triggers the entire workflow at 00:30 every day. All other nodes depend on this node, so you only need to configure the scheduling properties for the zero load node.

  • Scheduling time: Set to 00:30. The node triggers the workflow daily at this time.

  • Scheduling dependencies: Configure the node to depend on the root node of the workspace, which triggers workshop_start_spark to run.

For more information about scheduling, see Configure scheduling time for nodes in a workflow in different scenarios and Overview.

Step 2: Configure data synchronization tasks

Double-click each node on the workflow canvas to open its configuration tab.

Sync website access logs from HttpFile to OSS

Configure the ods_raw_log_d_2oss_spark node to move the user_log.txt file from HttpFile to your private OSS bucket.

The user_log.txt file contains website access logs where each record occupies one row in the following format:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
  1. On the DataStudio page, double-click the ods_raw_log_d_2oss_spark node.

  2. Set up the network connections between the resource group and the data sources, then click Next and complete the connectivity test.

    • Source: HttpFile; data source name: user_behavior_analysis_httpfile
    • Resource group: the serverless resource group you purchased
    • Destination: OSS; data source name: test_g
  3. Configure the synchronization parameters.

    The ${bizdate} variable in the object path is replaced at runtime by the scheduling parameter. This writes each day's logs to a separate date-based directory in OSS.
    • Source file path: /user_log.txt
    • Source text type: text
    • Source column delimiter: |
    • Source compression format: None
    • Source skip header: No
    • Destination text type: text
    • Destination object name (path included): ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt
    • Destination column delimiter: |
  4. On the Properties tab, configure the scheduling parameter so the daily data lands in its own directory.

    • Scheduling parameter: Click Add Parameter. Set Parameter name to bizdate and Parameter value to $[yyyymmdd-1]. See Configure scheduling parameters.
    • Dependencies: Make sure the output name follows the workspacename.nodename format. See Configure scheduling dependencies.
  5. Click the Save icon to save the node.
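As a side note, the access-log format shown above can be sanity-checked with a short parser sketch. This is illustrative only: the sample line and the regex are assumptions based on the format string, and the sync task itself copies the file verbatim without parsing it.

```python
import re

# Regex matching the nginx-style access-log format from this tutorial.
# Group names mirror the $variables in the format string; the trailing
# bracketed segment is left unmatched by match(), which only anchors
# at the start of the line.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)

# A hypothetical log line for illustration.
sample = ('192.168.0.1 - frank [08/Aug/2024:10:00:00 +0800] '
          '"GET /index.html HTTP/1.1" 200 512 '
          '"https://example.com" "Mozilla/5.0" [-];')

m = LOG_PATTERN.match(sample)
if m:
    print(m.group('remote_addr'), m.group('status'))  # 192.168.0.1 200
```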

Sync basic user information from MySQL to OSS

Configure the ods_user_info_d_2oss_spark node to move the ods_user_info_d table from MySQL to the private OSS bucket.

  1. On the DataStudio page, double-click the ods_user_info_d_2oss_spark node.

  2. Set up the network connections between the resource group and the data sources, then click Next and complete the connectivity test.

    • Source: MySQL; data source name: user_behavior_analysis_mysql
    • Resource group: the serverless resource group you purchased
    • Destination: OSS; data source name: test_g
  3. Configure the synchronization parameters.

    • Source table: ods_user_info_d
    • Source split key: uid (the primary key; must be an INTEGER-type column)
    • Destination text type: text
    • Destination object name (path included): ods_user_info_d/user_${bizdate}/user_${bizdate}.txt
    • Destination column delimiter: |
  4. On the Properties tab, configure the scheduling parameter.

    • Scheduling parameter: Click Add Parameter. Set Parameter name to bizdate and Parameter value to $[yyyymmdd-1]. See Configure scheduling parameters.
    • Dependencies: Make sure the output name follows the workspacename.nodename format. See Configure scheduling dependencies.
  5. Click the Save icon to save the node.

Create external tables using Spark SQL to access OSS data

After the batch synchronization nodes write data to OSS, use EMR Spark SQL nodes to create external tables. Each external table uses LOCATION to point to the OSS path, so Spark can query the data directly without importing it.

Configure the ods_raw_log_d_spark node

  1. On the DataStudio page, double-click the ods_raw_log_d_spark node and enter the following code.

    Replace dw-emr-demo with the name of the OSS bucket you created when preparing the environment. The LOCATION path must match the Object name (path included) value set in the ods_raw_log_d_2oss_spark batch synchronization node.
    -- Creates an external table for the website access logs synchronized to OSS.
    -- The LOCATION points to the date-partitioned directory written by the batch sync task.
    -- The ${bizdate} variable is replaced at runtime by the scheduling parameter.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark
    (
      `col` STRING
    )
    PARTITIONED BY (
      dt STRING
    )
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
    
    ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/'
    ;
  2. On the Properties tab, configure the scheduling parameter so the node reads the correct date-based OSS object.

    • Scheduling parameter: Click Add Parameter. Set Parameter name to bizdate and Parameter value to $[yyyymmdd-1]. See Configure scheduling parameters.
    • Dependencies: Make sure the output name follows the workspacename.nodename format. See Configure scheduling dependencies.
  3. Click the Save icon to save the node.

Configure the ods_user_info_d_spark node

  1. On the DataStudio page, double-click the ods_user_info_d_spark node and enter the following code.

    Replace dw-emr-demo with the name of the OSS bucket you created when preparing the environment. The LOCATION path must match the Object name (path included) value set in the ods_user_info_d_2oss_spark batch synchronization node.
    -- Creates an external table for the basic user information synchronized to OSS.
    -- The LOCATION points to the date-partitioned directory written by the batch sync task.
    -- The ${bizdate} variable is replaced at runtime by the scheduling parameter.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark
    (
        `uid`        STRING COMMENT 'The user ID'
        ,`gender`    STRING COMMENT 'The gender'
        ,`age_range` STRING COMMENT 'The age range'
        ,`zodiac`    STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY
    (
        dt           STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/'
    ;
    
    ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/'
    ;
  2. On the Properties tab, configure the scheduling parameter.

    • Scheduling parameter: Click Add Parameter. Set Parameter name to bizdate and Parameter value to $[yyyymmdd-1]. See Configure scheduling parameters.
    • Dependencies: Make sure the output name follows the workspacename.nodename format. See Configure scheduling dependencies.
  3. Click the Save icon to save the node.

Step 3: Verify the synchronized data

After all nodes run successfully, use an ad hoc query to confirm that the external tables were created correctly.

  1. In the left-side navigation pane of the DataStudio page, click Ad Hoc Query.

  2. In the Ad Hoc Query pane, create an ad hoc query task of the EMR Spark SQL type.

  3. Enter the following SQL to query both tables.

    -- Replace 'Data timestamp' with the actual bizdate value for your run.
    -- For example, if the task ran on August 8, 2024, use 20240807.
    SELECT * FROM ods_raw_log_d_spark WHERE dt = 'Data timestamp';
    SELECT * FROM ods_user_info_d_spark WHERE dt = 'Data timestamp';
  4. Click the Run with Parameters icon, assign a value to ${bizdate}, and run the query.
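If you prefer not to type the date by hand, the verification queries can be generated from the same T-1 rule (a sketch using the table names from this tutorial; DataWorks performs this substitution itself when you run with parameters):

```python
from datetime import date, timedelta

def verification_queries(run_date: date) -> list:
    """Render the two ad hoc verification queries for a given run date."""
    # Business date (T-1), matching the $[yyyymmdd-1] scheduling expression.
    dt = (run_date - timedelta(days=1)).strftime('%Y%m%d')
    return [
        f"SELECT * FROM ods_raw_log_d_spark WHERE dt = '{dt}';",
        f"SELECT * FROM ods_user_info_d_spark WHERE dt = '{dt}';",
    ]

for q in verification_queries(date(2024, 8, 8)):
    print(q)
```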

What's next

Data synchronization is complete. Continue to the next tutorial to learn how to process the synchronized data using Spark. For more information, see Process data.