Build a Spark Pipeline to Load MySQL & HttpFile Data into OSS - E-MapReduce

In this tutorial, you're building the data ingestion layer for a user behavior analysis pipeline. You'll pull raw data from two sources—user profile records stored in MySQL and website access logs stored in an OSS object—into a private OSS bucket, then expose that data to EMR Spark SQL through external tables.

By the end of this tutorial, you will have:

Added three data sources (MySQL, HttpFile, and OSS) to your DataWorks workspace.
Created a User_profile_analysis_spark workflow with batch synchronization and EMR Spark SQL nodes.
Synchronized user data from MySQL and HttpFile into a private OSS bucket.
Created Spark external tables (ods_raw_log_d_spark and ods_user_info_d_spark) that read from the synchronized OSS data.
Verified the results with SQL queries.

Prerequisites

Before you begin, make sure you have:

Completed the environment setup. For details, see Prepare an environment.

About the data

This tutorial uses two datasets:

User profile data: basic user records stored in a MySQL table named ods_user_info_d. Fields include uid, gender, age_range, and zodiac.
Website access logs: user clickstream data stored as user_log.txt in an OSS object, accessible via an HttpFile data source.

Both datasets will be synchronized into the private OSS bucket (dw-spark-demo) you created during environment preparation.

Step 1: Add data sources

Add the following three data sources to your DataWorks workspace. These provide the raw data for the pipeline.

Add a MySQL data source

Add the MySQL database as a data source so you can synchronize the ods_user_info_d table to the OSS bucket.

Go to the Data Sources page.
1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, search for and select MySQL.
On the Add MySQL Data Source page, configure the parameters. Set Data Source Name to user_behavior_analysis_mysql.
In the Connection Configuration section, find the resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure the status shows Connected.
Important
The resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message to resolve this.
Click Complete Creation.

Add an HttpFile data source

Add an HttpFile data source to synchronize the user_log.txt access logs from OSS into your private OSS bucket.

In the left-side navigation pane of the SettingCenter page, click Data Sources. On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, search for and click HttpFile.

On the Add HttpFile Data Source page, configure the following parameters. Retain default values for other parameters.

Parameter	Value
Data Source Name	`user_behavior_analysis_httpfile`
Data Source Description	The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
URL	`https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com` (set for both development and production environments)

In the Connection Configuration section, click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns for the target resource group. Confirm that the status shows Connected.
Important
The resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message.
Click Complete Creation.

Add an OSS data source

Add a private OSS data source as the destination for synchronized data.

In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.
In the Add Data Source dialog box, select OSS.

On the Add OSS Data Source page, configure the following parameters. Use the same values for both development and production environments.

Parameter	Value
Data Source Name	`test_g`
Access Mode	Select AccessKey Mode
AccessKey ID	The AccessKey ID of your account. Go to the AccessKey page to copy it.
AccessKey Secret	The AccessKey secret of your account. The secret is only displayed at creation time—store it securely. If the AccessKey pair is leaked, delete it and create a new one.
Endpoint	`http://oss-cn-shanghai-internal.aliyuncs.com`
Bucket	`dw-spark-demo` (the private OSS bucket created during environment preparation)

Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure at least one resource group shows Connected—otherwise you won't be able to configure synchronization nodes with the codeless UI.
Click Complete Creation.

Step 2: Create a workflow

In the upper-left corner of the DataWorks console, click the icon and choose All Products > Data Development And Task Operation > DataStudio.
In the Workspace Directories section, click the icon and select Create Workflow. In the dialog box, set the workflow name to User_profile_analysis_spark.

On the workflow configuration tab, drag Zero Load Node, Batch Synchronization, and EMR SPARK SQL onto the canvas. Name the nodes as follows:

Node type	Node name	Purpose
Zero load node	`workshop_start_spark`	Manages the workflow and controls startup timing. This is a dry-run node—no code required.
Batch synchronization node	`ods_raw_log_d_2oss_spark`	Synchronizes website access logs from HttpFile to the private OSS bucket.
Batch synchronization node	`ods_user_info_d_2oss_spark`	Synchronizes user profile data from MySQL to the private OSS bucket.
EMR SPARK SQL	`ods_raw_log_d_spark`	Creates the `ods_raw_log_d_spark` external table to read access logs from OSS.
EMR SPARK SQL	`ods_user_info_d_spark`	Creates the `ods_user_info_d_spark` external table to read user profile data from OSS.

Draw dependency lines to make workshop_start_spark the ancestor node of all batch synchronization and EMR Spark SQL nodes.

Configure scheduling properties for the workflow. In the right-side navigation pane, click Properties. Configure the following parameters. Retain default values for others.

Parameter	Value
Scheduling Parameters	`bizdate=$[yyyymmdd-1]` (substitutes the previous day's date at runtime)
Scheduling Cycle	`Day`
Scheduling Time	`00:30`
Scheduling Dependencies	Click Use Workspace Root Node to trigger the workflow from the workspace root node (format: `Workspace name_root`).

Step 3: Configure the workflow nodes

Configure the workshop_start_spark node

In the canvas, hover over the workshop_start_spark node and click Open Node.

In the right-side navigation pane, click Properties and configure the following parameters.

Parameter	Value
Scheduling Type	`Dry-run`
Resource Group For Scheduling	The serverless resource group created during environment preparation.
Scheduling Dependencies	Click Use Workspace Root Node to trigger from the workspace root node (format: `Workspace name_root`).

Configure the ods_raw_log_d_2oss_spark node

This node synchronizes website access logs from the HttpFile data source to the private OSS bucket.

In the canvas, hover over the ods_raw_log_d_2oss_spark node and click Open Node.

Configure the source, destination, and resource group.

Section	Setting
Source	Source: `HttpFile`; Data Source Name: `user_behavior_analysis_httpfile`
Resource Group	Select the serverless resource group created during environment preparation.
Destination	Destination: `OSS`; Data Source Name: `test_g`

Click Next and configure the source and destination settings. In the Field Mapping section, confirm field mappings. In the Channel Control section, set Policy for Dirty Data Records to Disallow Dirty Data Records. Retain defaults for other parameters. For details, see Configure a batch synchronization task by using the codeless UI.

Section	Parameter	Value
Source	File Path	`/user_log.txt`
	Text type	`text`
	Column Delimiter	`\|`
	Compression format	`None`
	Skip Header	`No`
Destination	Text type	`text`
	Object Name (Path Included)	`ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt`
	Column Delimiter	`\|`

Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value
Resource Group Select the serverless resource group.
Script Parameters Set bizdate to a date in yyyymmdd format. Example: bizdate=20250223.
(Optional) Click Properties to configure scheduling properties. Scheduling parameters are inherited from the workflow—no further configuration needed for inner nodes. If you want to offset the node's start time relative to the workflow, set Time for Delayed Execution under Scheduling Policies. For other parameters, see Node scheduling.
Click Save.

Parameter	Value
Resource Group	Select the serverless resource group.
Script Parameters	Set bizdate to a date in `yyyymmdd` format. Example: `bizdate=20250223`.

Configure the ods_user_info_d_2oss_spark node

This node synchronizes user profile data from the MySQL data source to the private OSS bucket.

In the canvas, hover over the ods_user_info_d_2oss_spark node and click Open Node.

Configure the source, destination, and resource group.

Section	Setting
Source	Source: `MySQL`; Data Source Name: `user_behavior_analysis_mysql`
Resource Group	Select the serverless resource group associated with your workspace.
Destination	Destination: `OSS`; Data Source Name: `test_g`

Section	Parameter	Value
Source	Table	`ods_user_info_d`
	Split key	`uid` (use the primary key or an indexed column of INTEGER type as the shard key)
Destination	Text Type	`text`
	Object Name (Path Included)	`ods_user_info_d/user_${bizdate}/user_${bizdate}.txt`
	Column Delimiter	`\|`

Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value
Resource Group Select the serverless resource group.
Script Parameters Set bizdate to a date in yyyymmdd format. Example: bizdate=20250223.
(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.

Parameter	Value
Resource Group	Select the serverless resource group.
Script Parameters	Set bizdate to a date in `yyyymmdd` format. Example: `bizdate=20250223`.

Configure the ods_raw_log_d_spark node

This EMR Spark SQL node creates the ods_raw_log_d_spark external table and points it to the access logs synchronized to the OSS bucket.

In the canvas, hover over the ods_raw_log_d_spark node and click Open Node.

Enter the following SQL statements.

Replace dw-spark-demo in the LOCATION value with your OSS bucket name from environment preparation.

-- Creates the ods_raw_log_d_spark external table using EMR Spark SQL.
-- LOCATION points to the access logs synchronized to the private OSS bucket.
-- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark
(
  `col` STRING
)
PARTITIONED BY (
  dt STRING
)
LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/';

ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/'

Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

Parameter	Value
Computing Resource	Select the EMR Serverless Spark computing resource associated with your workspace during environment preparation.
Resource Group	Select the serverless resource group.
Script Parameters	Click Add Parameter. Enter `bizdate` as the parameter name and a date in `yyyymmdd` format as the value. Example: `bizdate=20250223`.

(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.

Configure the ods_user_info_d_spark node

This EMR Spark SQL node creates the ods_user_info_d_spark external table and points it to the user profile data synchronized to the OSS bucket.

In the canvas, hover over the ods_user_info_d_spark node and click Open Node.

Enter the following SQL statements.

Replace dw-spark-demo in the LOCATION value with your OSS bucket name from environment preparation.

-- Creates the ods_user_info_d_spark external table using EMR Spark SQL.
-- LOCATION points to the user profile data synchronized to the private OSS bucket.
-- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark
(
    `uid`        STRING COMMENT 'The user ID'
    ,`gender`    STRING COMMENT 'The gender'
    ,`age_range` STRING COMMENT 'The age range'
    ,`zodiac`    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY
(
    dt           STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/'
;

ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}')
LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/'
;

Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.

Parameter	Value
Computing Resource	Select the EMR Serverless Spark computing resource associated with your workspace.
Resource Group	Select the serverless resource group.
Script Parameters	Click Add Parameter. Enter `bizdate` as the parameter name and a date in `yyyymmdd` format as the value. Example: `bizdate=20250223`.

(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.

Step 4: Run the workflow

In the top toolbar of the workflow configuration tab, click Run. In the Enter runtime parameters dialog box, enter the bizdate value for this run—for example, 20250223—and click OK.
After the workflow completes, log on to the OSS console and confirm that files exist in the /ods_user_info_d and /ods_raw_log_d directories of your OSS bucket.

Step 5: Verify the results

Query the external tables in DataAnalysis to confirm that data was synchronized and parsed correctly.

Go to the SQL Query page. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis. Click Go to DataAnalysis, then click SQL Query in the left-side navigation pane.

Create a SQL query file.

In the SQL Query pane, click the icon next to My Files and select Create File. Set the file name.
Click the file in the navigation tree to open it.

In the upper-right corner, click the icon and configure the following parameters.

Parameter	Value
Workspace	Select the workspace containing the `User_profile_analysis_spark` workflow.
Data Source Type	Select `EMR Spark SQL`.
Data Source Name	Select the EMR Serverless Spark computing resource associated with the workspace.
SQL Compute	Select the SQL session created in EMR Serverless Spark.

Click OK.

Run the following queries to check that the external tables contain the expected data.

-- Replace 'Data timestamp' with the actual partition value.
-- The data timestamp is one day before the scheduling date.
-- For example, if the node ran on February 23, 2025, the timestamp is 20250222.
SELECT * FROM ods_raw_log_d_spark WHERE dt = 'Data timestamp';
SELECT * FROM ods_user_info_d_spark WHERE dt = 'Data timestamp';

Both queries should return rows. If either table returns no data, check that the workflow ran successfully and that the OSS bucket directories contain the synchronized files.

What's next

With your data synchronized and accessible via external tables, proceed to the next tutorial to compute and analyze the data. For details, see Process data.