All Products
Search
Document Center

E-MapReduce:Synchronize data

Last Updated:Mar 25, 2026

This tutorial shows how to sync user data from MySQL and HttpFile sources to a private OSS bucket using Data Integration batch synchronization, then query the data through Spark SQL external tables. It is part of a series on building a user profile analysis pipeline.

By the end of this tutorial, you will be able to:

  • Design a DataWorks workflow with batch synchronization and EMR Spark SQL nodes

  • Configure Data Integration to sync MySQL table data and HttpFile log data to OSS

  • Create Spark SQL external tables that read directly from OSS

  • Verify the synchronized data with ad hoc queries

Prerequisites

Before you begin, make sure you have:

  • A DataWorks workspace with Data Development enabled

  • A serverless resource group purchased and available

  • A private OSS bucket (domain: dw-emr-demo in this example)

  • Two data sources added to DataWorks:

    • user_behavior_analysis_httpfile (HttpFile) — provides user_log.txt

    • user_behavior_analysis_mysql (MySQL) — provides the ods_user_info_d table

  • The private OSS data source test_g added to DataWorks

  • Completed the previous tutorial in this series (Add data sources)

Data to synchronize

Source typeDataSchemaDestination
MySQLTable: ods_user_info_d (user profile)uid, gender, age_range, zodiacOSS
HttpFileFile: user_log.txt (access logs)$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content];OSS

Step 1: Design the workflow

The workflow uses five nodes to move data from sources to queryable external tables.

CategoryTypeNode nameFunction
GeneralZero load nodeworkshop_start_sparkEntry point; triggers the workflow at 00:30 daily. No code needed.
Data IntegrationBatch synchronizationods_raw_log_d_2oss_sparkSyncs HttpFile logs to the private OSS bucket
Data IntegrationBatch synchronizationods_user_info_d_2oss_sparkSyncs MySQL user data to the private OSS bucket
EMREMR Spark SQLods_raw_log_d_sparkCreates the ods_raw_log_d_spark external table pointing to the synced logs
EMREMR Spark SQLods_user_info_d_sparkCreates the ods_user_info_d_spark external table pointing to the synced user data

Create and configure the workflow

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O\&M > Data Development, select your workspace, and click Go to Data Development.

  2. Create a workflow named User profile analysis_Spark. For detailed steps, see Create a workflow.

  3. On the workflow canvas, click Create Node, drag the five node types onto the canvas, and draw dependency lines between them. For the complete layout, see Workflow design. The two batch synchronization nodes have no data lineage relationship with the zero load node, so dependencies are set manually by drawing lines. For more information, see Scheduling dependency configuration guide.

    image

Configure the scheduling logic

The workshop_start_spark zero load node triggers the entire workflow at 00:30 every day. You do not need to modify the scheduling configurations of other nodes.

ConfigurationValueDescription
Scheduling time00:30The workflow starts at 00:30 daily
Scheduling dependenciesRoot node of the workspaceThe workspace root node triggers workshop_start_spark, which then triggers the downstream nodes
All nodes in the data synchronization phase depend on workshop_start_spark. When this node runs, it triggers the data synchronization chain.

For other scheduling options, see Configure scheduling time and Scheduling overview.

Step 2: Configure data synchronization tasks

Each batch synchronization node moves data from a source to a dated folder in the OSS bucket. The ${bizdate} scheduling parameter controls which date's data is written — it resolves to the previous day's date ($[yyyymmdd-1]), so data written on August 8, 2024 goes to the 20240807 folder.

After the sync nodes complete, the Spark SQL nodes create external tables that point to those OSS folders using LOCATION.

Sync website access logs (HttpFile to OSS)

  1. On the DataStudio page, double-click ods_raw_log_d_2oss_spark to open its configuration tab.

  2. Configure the network connection between the resource group and the data sources. After completing the network configuration, click Next and finish the connectivity test.

    ParameterValue
    SourceHttpFile — user_behavior_analysis_httpfile
    Resource groupYour serverless resource group
    DestinationOSS — test_g
  3. Configure the synchronization parameters.

    ParameterValue
    Source — File path/user_log.txt
    Source — Text typetext
    Source — Column delimiter|
    Source — Compression formatNone
    Source — Skip headerNo
    Destination — Text typetext
    Destination — Object name (path included)ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt
    Destination — Column delimiter|
  4. On the Properties tab, configure the scheduling parameter so the ${bizdate} variable in the object path resolves to the correct date at runtime.

    SectionConfiguration
    Scheduling parameterParameter name: bizdate / Parameter value: $[yyyymmdd-1]. See Configure scheduling parameters.
    DependenciesSet the output name to workspacename.ods_raw_log_d_2oss_spark. See Configure scheduling dependencies.
  5. Click the save icon icon to save.

Sync user profile data (MySQL to OSS)

  1. On the DataStudio page, double-click ods_user_info_d_2oss_spark to open its configuration tab.

  2. Configure the network connection between the resource group and the data sources. After completing the network configuration, click Next and finish the connectivity test.

    ParameterValue
    SourceMySQL — user_behavior_analysis_mysql
    Resource groupYour serverless resource group
    DestinationOSS — test_g
  3. Configure the synchronization parameters.

    ParameterValue
    Source — Tableods_user_info_d
    Source — Split keyuid (primary key; must be of the INTEGER type)
    Destination — Text typetext
    Destination — Object name (path included)ods_user_info_d/user_${bizdate}/user_${bizdate}.txt
    Destination — Column delimiter|
  4. On the Properties tab, add the bizdate scheduling parameter and dependency output name.

    SectionConfiguration
    Scheduling parameterParameter name: bizdate / Parameter value: $[yyyymmdd-1]. See Configure scheduling parameters.
    DependenciesSet the output name to workspacename.ods_user_info_d_2oss_spark. See Configure scheduling dependencies.
  5. Click the save icon icon to save.

Create external tables with Spark SQL

External tables let Spark SQL read data directly from OSS without importing it. The LOCATION path matches the destination folder set in the batch synchronization node.

Configure the ods_raw_log_d_spark node

  1. Double-click ods_raw_log_d_spark and paste the following SQL.

    Replace dw-emr-demo with your OSS bucket domain name. The LOCATION path must match the Object name (path included) value you set in the ods_raw_log_d_2oss_spark node.
    -- Creates an external table for the raw access log data in OSS.
    -- LOCATION points to the dated folder written by the batch sync node.
    -- The ${bizdate} parameter resolves to the previous day's date at runtime.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark
    (
      `col` STRING
    )
    PARTITIONED BY (
      dt STRING
    )
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
    
    -- Registers the partition so Spark can discover the data.
    ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
  2. On the Properties tab, add the scheduling parameter and output dependency.

    SectionConfiguration
    Scheduling parameterParameter name: bizdate / Parameter value: $[yyyymmdd-1]. See Configure scheduling parameters.
    DependenciesSet the output name to workspacename.ods_raw_log_d_spark. See Configure scheduling dependencies.
  3. Click the save icon icon to save.

Configure the ods_user_info_d_spark node

  1. Double-click ods_user_info_d_spark and paste the following SQL.

    Replace dw-emr-demo with your OSS bucket domain name. The LOCATION path must match the Object name (path included) value you set in the ods_user_info_d_2oss_spark node.
    -- Creates an external table for the user profile data in OSS.
    -- Each field maps to a pipe-delimited column in the synced text file.
    -- The ${bizdate} parameter resolves to the previous day's date at runtime.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark
    (
        `uid`        STRING COMMENT 'The user ID'
        ,`gender`    STRING COMMENT 'The gender'
        ,`age_range` STRING COMMENT 'The age range'
        ,`zodiac`    STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY
    (
        dt           STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
    
    -- Registers the partition so Spark can discover the data.
    ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
  2. On the Properties tab, add the scheduling parameter and output dependency.

    SectionConfiguration
    Scheduling parameterParameter name: bizdate / Parameter value: $[yyyymmdd-1]. See Configure scheduling parameters.
    DependenciesSet the output name to workspacename.ods_user_info_d_spark. See Configure scheduling dependencies.
  3. Click the save icon icon to save.

Step 3: Verify the synchronized data

After all nodes run successfully, run ad hoc queries to confirm the external tables were created and contain data.

  1. In the left-side navigation pane of the DataStudio page, click Ad Hoc Query.

  2. Create an ad hoc query task of the EMR Spark SQL type.

  3. Run the following queries. Click the Run with Parameters icon (Run with Parameters) icon, assign a value to ${bizdate}, and run the query. For example, if the task runs on August 8, 2024, set bizdate to 20240807.

    -- Query the access log external table.
    SELECT * FROM ods_raw_log_d_spark WHERE dt = '${bizdate}';
    
    -- Query the user profile external table.
    SELECT * FROM ods_user_info_d_spark WHERE dt = '${bizdate}';

If both queries return rows, the data synchronization is complete and the external tables are working correctly.

What's next

With the data synchronized and accessible through external tables, you can now process it with Spark. See Process data.