In this tutorial, you're building the data ingestion layer for a user behavior analysis pipeline. You'll pull raw data from two sources—user profile records stored in MySQL and website access logs stored in an OSS object—into a private OSS bucket, then expose that data to EMR Spark SQL through external tables.
By the end of this tutorial, you will have:
Added three data sources (MySQL, HttpFile, and OSS) to your DataWorks workspace.
Created a
User_profile_analysis_sparkworkflow with batch synchronization and EMR Spark SQL nodes.Synchronized user data from MySQL and HttpFile into a private OSS bucket.
Created Spark external tables (
ods_raw_log_d_sparkandods_user_info_d_spark) that read from the synchronized OSS data.Verified the results with SQL queries.
Prerequisites
Before you begin, make sure you have:
Completed the environment setup. For details, see Prepare an environment.
About the data
This tutorial uses two datasets:
User profile data: basic user records stored in a MySQL table named
ods_user_info_d. Fields includeuid,gender,age_range, andzodiac.Website access logs: user clickstream data stored as
user_log.txtin an OSS object, accessible via an HttpFile data source.
Both datasets will be synchronized into the private OSS bucket (dw-spark-demo) you created during environment preparation.
Step 1: Add data sources
Add the following three data sources to your DataWorks workspace. These provide the raw data for the pipeline.
Add a MySQL data source
Add the MySQL database as a data source so you can synchronize the ods_user_info_d table to the OSS bucket.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, search for and select MySQL.
On the Add MySQL Data Source page, configure the parameters. Set Data Source Name to
user_behavior_analysis_mysql.In the Connection Configuration section, find the resource group and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure the status shows Connected.
ImportantThe resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message to resolve this.
Click Complete Creation.
Add an HttpFile data source
Add an HttpFile data source to synchronize the user_log.txt access logs from OSS into your private OSS bucket.
In the left-side navigation pane of the SettingCenter page, click Data Sources. On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, search for and click HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. Retain default values for other parameters.
Parameter Value Data Source Name user_behavior_analysis_httpfileData Source Description The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services. URL https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com(set for both development and production environments)In the Connection Configuration section, click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns for the target resource group. Confirm that the status shows Connected.
ImportantThe resource group must be associated with your workspace and have internet access. If no resource group is available, click Purchase or Associate Purchased Resource Group in the prompt message.
Click Complete Creation.
Add an OSS data source
Add a private OSS data source as the destination for synchronized data.
In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.
In the Add Data Source dialog box, select OSS.
On the Add OSS Data Source page, configure the following parameters. Use the same values for both development and production environments.
Parameter Value Data Source Name test_gAccess Mode Select AccessKey Mode AccessKey ID The AccessKey ID of your account. Go to the AccessKey page to copy it. AccessKey Secret The AccessKey secret of your account. The secret is only displayed at creation time—store it securely. If the AccessKey pair is leaked, delete it and create a new one. Endpoint http://oss-cn-shanghai-internal.aliyuncs.comBucket dw-spark-demo(the private OSS bucket created during environment preparation)Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure at least one resource group shows Connected—otherwise you won't be able to configure synchronization nodes with the codeless UI.
Click Complete Creation.
Step 2: Create a workflow
In the upper-left corner of the DataWorks console, click the
icon and choose All Products > Data Development And Task Operation > DataStudio.In the Workspace Directories section, click the
icon and select Create Workflow. In the dialog box, set the workflow name to User_profile_analysis_spark.On the workflow configuration tab, drag Zero Load Node, Batch Synchronization, and EMR SPARK SQL onto the canvas. Name the nodes as follows:
Node type Node name Purpose Zero load node workshop_start_sparkManages the workflow and controls startup timing. This is a dry-run node—no code required. Batch synchronization node ods_raw_log_d_2oss_sparkSynchronizes website access logs from HttpFile to the private OSS bucket. Batch synchronization node ods_user_info_d_2oss_sparkSynchronizes user profile data from MySQL to the private OSS bucket. EMR SPARK SQL ods_raw_log_d_sparkCreates the ods_raw_log_d_sparkexternal table to read access logs from OSS.EMR SPARK SQL ods_user_info_d_sparkCreates the ods_user_info_d_sparkexternal table to read user profile data from OSS.Draw dependency lines to make
workshop_start_sparkthe ancestor node of all batch synchronization and EMR Spark SQL nodes.Configure scheduling properties for the workflow. In the right-side navigation pane, click Properties. Configure the following parameters. Retain default values for others.
Parameter Value Scheduling Parameters bizdate=$[yyyymmdd-1](substitutes the previous day's date at runtime)Scheduling Cycle DayScheduling Time 00:30Scheduling Dependencies Click Use Workspace Root Node to trigger the workflow from the workspace root node (format: Workspace name_root).
Step 3: Configure the workflow nodes
Configure the workshop_start_spark node
In the canvas, hover over the
workshop_start_sparknode and click Open Node.In the right-side navigation pane, click Properties and configure the following parameters.
Parameter Value Scheduling Type Dry-runResource Group For Scheduling The serverless resource group created during environment preparation. Scheduling Dependencies Click Use Workspace Root Node to trigger from the workspace root node (format: Workspace name_root).
Configure the ods_raw_log_d_2oss_spark node
This node synchronizes website access logs from the HttpFile data source to the private OSS bucket.
In the canvas, hover over the
ods_raw_log_d_2oss_sparknode and click Open Node.Configure the source, destination, and resource group.
Section Setting Source Source: HttpFile; Data Source Name:user_behavior_analysis_httpfileResource Group Select the serverless resource group created during environment preparation. Destination Destination: OSS; Data Source Name:test_gClick Next and configure the source and destination settings. In the Field Mapping section, confirm field mappings. In the Channel Control section, set Policy for Dirty Data Records to Disallow Dirty Data Records. Retain defaults for other parameters. For details, see Configure a batch synchronization task by using the codeless UI.
Section Parameter Value Source File Path /user_log.txtText type textColumn Delimiter |Compression format NoneSkip Header NoDestination Text type textObject Name (Path Included) ods_raw_log_d/log_${bizdate}/log_${bizdate}.txtColumn Delimiter |Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value Resource Group Select the serverless resource group. Script Parameters Set bizdate to a date in yyyymmddformat. Example:bizdate=20250223.(Optional) Click Properties to configure scheduling properties. Scheduling parameters are inherited from the workflow—no further configuration needed for inner nodes. If you want to offset the node's start time relative to the workflow, set Time for Delayed Execution under Scheduling Policies. For other parameters, see Node scheduling.
Click Save.
Configure the ods_user_info_d_2oss_spark node
This node synchronizes user profile data from the MySQL data source to the private OSS bucket.
In the canvas, hover over the
ods_user_info_d_2oss_sparknode and click Open Node.Configure the source, destination, and resource group.
Section Setting Source Source: MySQL; Data Source Name:user_behavior_analysis_mysqlResource Group Select the serverless resource group associated with your workspace. Destination Destination: OSS; Data Source Name:test_gClick Next and configure the source and destination settings. In the Field Mapping section, confirm field mappings. In the Channel Control section, set Policy for Dirty Data Records to Disallow Dirty Data Records.
Section Parameter Value Source Table ods_user_info_dSplit key uid(use the primary key or an indexed column of INTEGER type as the shard key)Destination Text Type textObject Name (Path Included) ods_user_info_d/user_${bizdate}/user_${bizdate}.txtColumn Delimiter |Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value Resource Group Select the serverless resource group. Script Parameters Set bizdate to a date in yyyymmddformat. Example:bizdate=20250223.(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.
Configure the ods_raw_log_d_spark node
This EMR Spark SQL node creates the ods_raw_log_d_spark external table and points it to the access logs synchronized to the OSS bucket.
In the canvas, hover over the
ods_raw_log_d_sparknode and click Open Node.Enter the following SQL statements.
Replace
dw-spark-demoin the LOCATION value with your OSS bucket name from environment preparation.-- Creates the ods_raw_log_d_spark external table using EMR Spark SQL. -- LOCATION points to the access logs synchronized to the private OSS bucket. -- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow. CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark ( `col` STRING ) PARTITIONED BY ( dt STRING ) LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/'; ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}') LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_raw_log_d/log_${bizdate}/'Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value Computing Resource Select the EMR Serverless Spark computing resource associated with your workspace during environment preparation. Resource Group Select the serverless resource group. Script Parameters Click Add Parameter. Enter bizdateas the parameter name and a date inyyyymmddformat as the value. Example:bizdate=20250223.(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.
Configure the ods_user_info_d_spark node
This EMR Spark SQL node creates the ods_user_info_d_spark external table and points it to the user profile data synchronized to the OSS bucket.
In the canvas, hover over the
ods_user_info_d_sparknode and click Open Node.Enter the following SQL statements.
Replace
dw-spark-demoin the LOCATION value with your OSS bucket name from environment preparation.-- Creates the ods_user_info_d_spark external table using EMR Spark SQL. -- LOCATION points to the user profile data synchronized to the private OSS bucket. -- The $bizdate variable is replaced at runtime by the scheduling parameter configured for the workflow. CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark ( `uid` STRING COMMENT 'The user ID' ,`gender` STRING COMMENT 'The gender' ,`age_range` STRING COMMENT 'The age range' ,`zodiac` STRING COMMENT 'The zodiac sign' ) PARTITIONED BY ( dt STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/' ; ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}') LOCATION 'oss://dw-spark-demo.oss-cn-shanghai-internal.aliyuncs.com/ods_user_info_d/user_${bizdate}/' ;Configure debugging parameters. In the right-side navigation pane, click Debugging Configurations and set the following.
Parameter Value Computing Resource Select the EMR Serverless Spark computing resource associated with your workspace. Resource Group Select the serverless resource group. Script Parameters Click Add Parameter. Enter bizdateas the parameter name and a date inyyyymmddformat as the value. Example:bizdate=20250223.(Optional) Configure scheduling properties as described in the ods_raw_log_d_2oss_spark section.
Click Save.
Step 4: Run the workflow
In the top toolbar of the workflow configuration tab, click Run. In the Enter runtime parameters dialog box, enter the bizdate value for this run—for example,
20250223—and click OK.After the workflow completes, log on to the OSS console and confirm that files exist in the
/ods_user_info_dand/ods_raw_log_ddirectories of your OSS bucket.
Step 5: Verify the results
Query the external tables in DataAnalysis to confirm that data was synchronized and parsed correctly.
Go to the SQL Query page. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis. Click Go to DataAnalysis, then click SQL Query in the left-side navigation pane.
Create a SQL query file.
In the SQL Query pane, click the
icon next to My Files and select Create File. Set the file name.Click the file in the navigation tree to open it.
In the upper-right corner, click the
icon and configure the following parameters.Parameter Value Workspace Select the workspace containing the User_profile_analysis_sparkworkflow.Data Source Type Select EMR Spark SQL.Data Source Name Select the EMR Serverless Spark computing resource associated with the workspace. SQL Compute Select the SQL session created in EMR Serverless Spark. Click OK.
Run the following queries to check that the external tables contain the expected data.
-- Replace 'Data timestamp' with the actual partition value. -- The data timestamp is one day before the scheduling date. -- For example, if the node ran on February 23, 2025, the timestamp is 20250222. SELECT * FROM ods_raw_log_d_spark WHERE dt = 'Data timestamp'; SELECT * FROM ods_user_info_d_spark WHERE dt = 'Data timestamp';Both queries should return rows. If either table returns no data, check that the workflow ran successfully and that the OSS bucket directories contain the synchronized files.
What's next
With your data synchronized and accessible via external tables, proceed to the next tutorial to compute and analyze the data. For details, see Process data.