All Products
Search
Document Center

E-MapReduce:Synchronize data

Last Updated:Mar 26, 2026

In this tutorial, you're building the data ingestion layer for a user behavior analysis pipeline. You'll pull raw data from two sources—user profile records stored in MySQL and website access logs stored in an OSS object—into a private OSS bucket, then expose that data to EMR Spark SQL through external tables.

By the end of this tutorial, you will have:

  1. Added three data sources (MySQL, HttpFile, and OSS) to your DataWorks workspace.

  2. Created a User_profile_analysis_spark workflow with batch synchronization and EMR Spark SQL nodes.

  3. Synchronized user data from MySQL and HttpFile into a private OSS bucket.

  4. Created Spark external tables (ods_raw_log_d_spark and ods_user_info_d_spark) that read from the synchronized OSS data.

  5. Verified the results with SQL queries.

Prerequisites

Before you begin, make sure you have:

About the data

This tutorial uses two datasets:

  • User profile data: basic user records stored in a MySQL table named ods_user_info_d. Fields include uid, gender, age_range, and zodiac.

  • Website access logs: user clickstream data stored as user_log.txt in an OSS object, accessible via an HttpFile data source.

Both datasets will be synchronized into the private OSS bucket (dw-spark-demo) you created during environment preparation.

Step 1: Add data sources

Add the following three data sources to your DataWorks workspace. These provide the raw data for the pipeline.

Add a MySQL data source

Add the MySQL database as a data source so you can synchronize the ods_user_info_d table to the OSS bucket.

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

  2. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, search for and select MySQL.

  3. On the Add MySQL Data Source page, configure the parameters. Set Data Source Name to user_behavior_analysis_mysql.

  4. In the Connection Configuration section, find the resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure the status shows Connected.

    Important

    The resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message to resolve this.

  5. Click Complete Creation.

Add an HttpFile data source

Add an HttpFile data source to synchronize the user_log.txt access logs from OSS into your private OSS bucket.

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, search for and click HttpFile.

  3. On the Add HttpFile Data Source page, configure the following parameters. Retain default values for other parameters.

    ParameterValue
    Data Source Nameuser_behavior_analysis_httpfile
    Data Source DescriptionThe data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
    URLhttps://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com (set for both development and production environments)
  4. In the Connection Configuration section, click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns for the target resource group. Confirm that the status shows Connected.

    Important

    The resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message.

  5. Click Complete Creation.

Add an OSS data source

Add a private OSS data source as the destination for synchronized data.

  1. In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  2. In the Add Data Source dialog box, select OSS.

  3. On the Add OSS Data Source page, configure the following parameters. Use the same values for both development and production environments.

    ParameterValue
    Data Source Nametest_g
    Access ModeSelect AccessKey Mode
    AccessKey IDThe AccessKey ID of your account. Go to the AccessKey page to copy it.
    AccessKey SecretThe AccessKey secret of your account. The secret is only displayed at creation time—store it securely. If the AccessKey pair is leaked, delete it and create a new one.
    Endpointhttp://oss-cn-shanghai-internal.aliyuncs.com
    Bucketdw-spark-demo (the private OSS bucket created during environment preparation)
  4. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure at least one resource group shows Connected—otherwise you won't be able to configure synchronization nodes with the codeless UI.

  5. Click Complete Creation.

Step 2: Create a workflow

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the Workspace Directories section, click the image icon and select Create Workflow. In the dialog box, set the workflow name to User_profile_analysis_spark.

  3. On the workflow configuration tab, drag Zero Load Node, Batch Synchronization, and EMR SPARK SQL onto the canvas. Name the nodes as follows:

    Node typeNode namePurpose
    Zero load nodeworkshop_start_sparkManages the workflow and controls startup timing. This is a dry-run node—no code required.
    Batch synchronization nodeods_raw_log_d_2oss_sparkSynchronizes website access logs from HttpFile to the private OSS bucket.
    Batch synchronization nodeods_user_info_d_2oss_sparkSynchronizes user profile data from MySQL to the private OSS bucket.
    EMR SPARK SQLods_raw_log_d_sparkCreates the ods_raw_log_d_spark external table to read access logs from OSS.
    EMR SPARK SQLods_user_info_d_sparkCreates the ods_user_info_d_spark external table to read user profile data from OSS.
  4. Draw dependency lines to make workshop_start_spark the ancestor node of all batch synchronization and EMR Spark SQL nodes.

    image

  5. Configure scheduling properties for the workflow. In the right-side navigation pane, click Properties. Configure the following parameters. Retain default values for others.

    ParameterValue
    Scheduling Parametersbizdate=$[yyyymmdd-1] (substitutes the previous day's date at runtime)
    Scheduling CycleDay
    Scheduling Time00:30
    Scheduling DependenciesClick Use Workspace Root Node to trigger the workflow from the workspace root node (format: Workspace name_root).

Step 3: Configure the workflow nodes

Configure the workshop_start_spark node

  1. In the canvas, hover over the workshop_start_spark node and click Open Node.

  2. In the right-side navigation pane, click Properties and configure the following parameters.

    ParameterValue
    Scheduling TypeDry-run
    Resource Group For SchedulingThe serverless resource group created during environment preparation.
    Scheduling DependenciesClick Use Workspace Root Node to trigger from the workspace root node (format: Workspace name_root).

Configure the ods_raw_log_d_2oss_spark node

This node synchronizes website access logs from the HttpFile data source to the private OSS bucket.

  1. In the canvas, hover over the ods_raw_log_d_2oss_spark node and click Open Node.

  2. Configure the source, destination, and resource group.

    SectionSetting
    SourceSource: HttpFile; Data Source Name: user_behavior_analysis_httpfile
    Resource GroupSelect the serverless resource group created during environment preparation.
    DestinationDestination: OSS; Data Source Name: test_g
  3. Click Next and configure the source and destination settings. In the Field Mapping section, confirm field mappings. In the Channel Control section, set Policy for Dirty Data Records to Disallow Dirty Data Records. Retain defaults for other parameters. For details, see Configure a batch synchronization task by using the codeless UI.

    SectionParameterValue
    SourceFile Path/user_log.txt
    Text typetext
    Column Delimiter|
    Compression formatNone
    Skip HeaderNo
    DestinationText typetext
    Object Name (Path Included)ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt
    Column Delimiter|
  4. Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

    ParameterValue
    Resource GroupSelect the serverless resource group.
    Script ParametersSet bizdate to a date in yyyymmdd format. Example: bizdate=20250223.
  5. (Optional) Click Properties to configure scheduling properties. Scheduling parameters are inherited from the workflow—no further configuration needed for inner nodes. If you want to offset the node's start time relative to the workflow, set Time for Delayed Execution under Scheduling Policies. For other parameters, see Node scheduling.

  6. Click Save.

Configure the ods_user_info_d_2oss_spark node

This node synchronizes user profile data from the MySQL data source to the private OSS bucket.

  1. In the canvas, hover over the ods_user_info_d_2oss_spark node and click Open Node.

  2. Configure the source, destination, and resource group.

    SectionSetting
    SourceSource: MySQL; Data Source Name: user_behavior_analysis_mysql
    Resource GroupSelect the serverless resource group associated with your workspace.
    DestinationDestination: OSS; Data Source Name: test_g
  3. Click Next and configure the source and destination settings. In the Field Mapping section, confirm field mappings. In the Channel Control section, set Policy for Dirty Data Records to Disallow Dirty Data Records.

    SectionParameterValue
    SourceTableods_user_info_d
    Split keyuid (use the primary key or an indexed column of INTEGER type as the shard key)
    DestinationText Typetext
    Object Name (Path Included)ods_user_info_d/user_${bizdate}/user_${bizdate}.txt
    Column Delimiter|
  4. Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

    ParameterValue
    Resource GroupSelect the serverless resource group.
    Script ParametersSet bizdate to a date in yyyymmdd format. Example: bizdate=20250223.
  5. (Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.

  6. Click Save.

Configure the ods_raw_log_d_spark node

This EMR Spark SQL node creates the ods_raw_log_d_spark external table and points it to the access logs synchronized to the OSS bucket.

  1. In the canvas, hover over the ods_raw_log_d_spark node and click Open Node.

  2. Enter the following SQL statements.

    Replace dw-spark-demo in the LOCATION value with your OSS bucket name from environment preparation.
    -- Creates the ods_raw_log_d_spark external table using EMR Spark SQL.
    -- LOCATION points to the access logs synchronized to the private OSS bucket.
    -- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark
    (
      `col` STRING
    )
    PARTITIONED BY (
      dt STRING
    )
    LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/';
    
    ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/'
  3. Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

    ParameterValue
    Computing ResourceSelect the EMR Serverless Spark computing resource associated with your workspace during environment preparation.
    Resource GroupSelect the serverless resource group.
    Script ParametersClick Add Parameter. Enter bizdate as the parameter name and a date in yyyymmdd format as the value. Example: bizdate=20250223.
  4. (Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.

  5. Click Save.

Configure the ods_user_info_d_spark node

This EMR Spark SQL node creates the ods_user_info_d_spark external table and points it to the user profile data synchronized to the OSS bucket.

  1. In the canvas, hover over the ods_user_info_d_spark node and click Open Node.

  2. Enter the following SQL statements.

    Replace dw-spark-demo in the LOCATION value with your OSS bucket name from environment preparation.
    -- Creates the ods_user_info_d_spark external table using EMR Spark SQL.
    -- LOCATION points to the user profile data synchronized to the private OSS bucket.
    -- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow.
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark
    (
        `uid`        STRING COMMENT 'The user ID'
        ,`gender`    STRING COMMENT 'The gender'
        ,`age_range` STRING COMMENT 'The age range'
        ,`zodiac`    STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY
    (
        dt           STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/'
    ;
    
    ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
    LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/'
    ;
  3. Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

    ParameterValue
    Computing ResourceSelect the EMR Serverless Spark computing resource associated with your workspace.
    Resource GroupSelect the serverless resource group.
    Script ParametersClick Add Parameter. Enter bizdate as the parameter name and a date in yyyymmdd format as the value. Example: bizdate=20250223.
  4. (Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.

  5. Click Save.

Step 4: Run the workflow

  1. In the top toolbar of the workflow configuration tab, click Run. In the Enter runtime parameters dialog box, enter the bizdate value for this run—for example, 20250223—and click OK.

  2. After the workflow completes, log on to the OSS console and confirm that files exist in the /ods_user_info_d and /ods_raw_log_d directories of your OSS bucket.

Step 5: Verify the results

Query the external tables in DataAnalysis to confirm that data was synchronized and parsed correctly.

  1. Go to the SQL Query page. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis. Click Go to DataAnalysis, then click SQL Query in the left-side navigation pane.

  2. Create a SQL query file.

    1. In the SQL Query pane, click the image icon next to My Files and select Create File. Set the file name.

    2. Click the file in the navigation tree to open it.

    3. In the upper-right corner, click the image icon and configure the following parameters.

      ParameterValue
      WorkspaceSelect the workspace containing the User_profile_analysis_spark workflow.
      Data Source TypeSelect EMR Spark SQL.
      Data Source NameSelect the EMR Serverless Spark computing resource associated with the workspace.
      SQL ComputeSelect the SQL session created in EMR Serverless Spark.
    4. Click OK.

  3. Run the following queries to check that the external tables contain the expected data.

    -- Replace 'Data timestamp' with the actual partition value.
    -- The data timestamp is one day before the scheduling date.
    -- For example, if the node ran on February 23, 2025, the timestamp is 20250222.
    SELECT * FROM ods_raw_log_d_spark WHERE dt = 'Data timestamp';
    SELECT * FROM ods_user_info_d_spark WHERE dt = 'Data timestamp';

    Both queries should return rows. If either table returns no data, check that the workflow ran successfully and that the OSS bucket directories contain the synchronized files.

What's next

With your data synchronized and accessible via external tables, proceed to the next tutorial to compute and analyze the data. For details, see Process data.