This tutorial shows how to sync user data from MySQL and HttpFile sources to a private OSS bucket using Data Integration batch synchronization, then query the data through Spark SQL external tables. It is part of a series on building a user profile analysis pipeline.
By the end of this tutorial, you will be able to:
Design a DataWorks workflow with batch synchronization and EMR Spark SQL nodes
Configure Data Integration to sync MySQL table data and HttpFile log data to OSS
Create Spark SQL external tables that read directly from OSS
Verify the synchronized data with ad hoc queries
Prerequisites
Before you begin, make sure you have:
A DataWorks workspace with Data Development enabled
A serverless resource group purchased and available
A private OSS bucket (domain:
dw-emr-demoin this example)Two data sources added to DataWorks:
user_behavior_analysis_httpfile(HttpFile) — providesuser_log.txtuser_behavior_analysis_mysql(MySQL) — provides theods_user_info_dtable
The private OSS data source
test_gadded to DataWorksCompleted the previous tutorial in this series (Add data sources)
Data to synchronize
| Source type | Data | Schema | Destination |
|---|---|---|---|
| MySQL | Table: ods_user_info_d (user profile) | uid, gender, age_range, zodiac | OSS |
| HttpFile | File: user_log.txt (access logs) | $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content]; | OSS |
Step 1: Design the workflow
The workflow uses five nodes to move data from sources to queryable external tables.
| Category | Type | Node name | Function |
|---|---|---|---|
| General | Zero load node | workshop_start_spark | Entry point; triggers the workflow at 00:30 daily. No code needed. |
| Data Integration | Batch synchronization | ods_raw_log_d_2oss_spark | Syncs HttpFile logs to the private OSS bucket |
| Data Integration | Batch synchronization | ods_user_info_d_2oss_spark | Syncs MySQL user data to the private OSS bucket |
| EMR | EMR Spark SQL | ods_raw_log_d_spark | Creates the ods_raw_log_d_spark external table pointing to the synced logs |
| EMR | EMR Spark SQL | ods_user_info_d_spark | Creates the ods_user_info_d_spark external table pointing to the synced user data |
Create and configure the workflow
Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O\&M > Data Development, select your workspace, and click Go to Data Development.
Create a workflow named
User profile analysis_Spark. For detailed steps, see Create a workflow.On the workflow canvas, click Create Node, drag the five node types onto the canvas, and draw dependency lines between them. For the complete layout, see Workflow design. The two batch synchronization nodes have no data lineage relationship with the zero load node, so dependencies are set manually by drawing lines. For more information, see Scheduling dependency configuration guide.

Configure the scheduling logic
The workshop_start_spark zero load node triggers the entire workflow at 00:30 every day. You do not need to modify the scheduling configurations of other nodes.
| Configuration | Value | Description |
|---|---|---|
| Scheduling time | 00:30 | The workflow starts at 00:30 daily |
| Scheduling dependencies | Root node of the workspace | The workspace root node triggers workshop_start_spark, which then triggers the downstream nodes |
All nodes in the data synchronization phase depend on workshop_start_spark. When this node runs, it triggers the data synchronization chain.For other scheduling options, see Configure scheduling time and Scheduling overview.
Step 2: Configure data synchronization tasks
Each batch synchronization node moves data from a source to a dated folder in the OSS bucket. The ${bizdate} scheduling parameter controls which date's data is written — it resolves to the previous day's date ($[yyyymmdd-1]), so data written on August 8, 2024 goes to the 20240807 folder.
After the sync nodes complete, the Spark SQL nodes create external tables that point to those OSS folders using LOCATION.
Sync website access logs (HttpFile to OSS)
On the DataStudio page, double-click
ods_raw_log_d_2oss_sparkto open its configuration tab.Configure the network connection between the resource group and the data sources. After completing the network configuration, click Next and finish the connectivity test.
Parameter Value Source HttpFile — user_behavior_analysis_httpfileResource group Your serverless resource group Destination OSS — test_gConfigure the synchronization parameters.
Parameter Value Source — File path /user_log.txtSource — Text type textSource — Column delimiter |Source — Compression format NoneSource — Skip header NoDestination — Text type textDestination — Object name (path included) ods_raw_log_d/log_${bizdate}/log_${bizdate}.txtDestination — Column delimiter |On the Properties tab, configure the scheduling parameter so the
${bizdate}variable in the object path resolves to the correct date at runtime.Section Configuration Scheduling parameter Parameter name: bizdate/ Parameter value:$[yyyymmdd-1]. See Configure scheduling parameters.Dependencies Set the output name to workspacename.ods_raw_log_d_2oss_spark. See Configure scheduling dependencies.Click the
icon to save.
Sync user profile data (MySQL to OSS)
On the DataStudio page, double-click
ods_user_info_d_2oss_sparkto open its configuration tab.Configure the network connection between the resource group and the data sources. After completing the network configuration, click Next and finish the connectivity test.
Parameter Value Source MySQL — user_behavior_analysis_mysqlResource group Your serverless resource group Destination OSS — test_gConfigure the synchronization parameters.
Parameter Value Source — Table ods_user_info_dSource — Split key uid(primary key; must be of the INTEGER type)Destination — Text type textDestination — Object name (path included) ods_user_info_d/user_${bizdate}/user_${bizdate}.txtDestination — Column delimiter |On the Properties tab, add the
bizdatescheduling parameter and dependency output name.Section Configuration Scheduling parameter Parameter name: bizdate/ Parameter value:$[yyyymmdd-1]. See Configure scheduling parameters.Dependencies Set the output name to workspacename.ods_user_info_d_2oss_spark. See Configure scheduling dependencies.Click the
icon to save.
Create external tables with Spark SQL
External tables let Spark SQL read data directly from OSS without importing it. The LOCATION path matches the destination folder set in the batch synchronization node.
Configure the ods_raw_log_d_spark node
Double-click
ods_raw_log_d_sparkand paste the following SQL.Replace
dw-emr-demowith your OSS bucket domain name. The LOCATION path must match the Object name (path included) value you set in theods_raw_log_d_2oss_sparknode.-- Creates an external table for the raw access log data in OSS. -- LOCATION points to the dated folder written by the batch sync node. -- The ${bizdate} parameter resolves to the previous day's date at runtime. CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_spark ( `col` STRING ) PARTITIONED BY ( dt STRING ) LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/'; -- Registers the partition so Spark can discover the data. ALTER TABLE ods_raw_log_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}') LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';On the Properties tab, add the scheduling parameter and output dependency.
Section Configuration Scheduling parameter Parameter name: bizdate/ Parameter value:$[yyyymmdd-1]. See Configure scheduling parameters.Dependencies Set the output name to workspacename.ods_raw_log_d_spark. See Configure scheduling dependencies.Click the
icon to save.
Configure the ods_user_info_d_spark node
Double-click
ods_user_info_d_sparkand paste the following SQL.Replace
dw-emr-demowith your OSS bucket domain name. The LOCATION path must match the Object name (path included) value you set in theods_user_info_d_2oss_sparknode.-- Creates an external table for the user profile data in OSS. -- Each field maps to a pipe-delimited column in the synced text file. -- The ${bizdate} parameter resolves to the previous day's date at runtime. CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_spark ( `uid` STRING COMMENT 'The user ID' ,`gender` STRING COMMENT 'The gender' ,`age_range` STRING COMMENT 'The age range' ,`zodiac` STRING COMMENT 'The zodiac sign' ) PARTITIONED BY ( dt STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/'; -- Registers the partition so Spark can discover the data. ALTER TABLE ods_user_info_d_spark ADD IF NOT EXISTS PARTITION (dt = '${bizdate}') LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';On the Properties tab, add the scheduling parameter and output dependency.
Section Configuration Scheduling parameter Parameter name: bizdate/ Parameter value:$[yyyymmdd-1]. See Configure scheduling parameters.Dependencies Set the output name to workspacename.ods_user_info_d_spark. See Configure scheduling dependencies.Click the
icon to save.
Step 3: Verify the synchronized data
After all nodes run successfully, run ad hoc queries to confirm the external tables were created and contain data.
In the left-side navigation pane of the DataStudio page, click Ad Hoc Query.
Create an ad hoc query task of the EMR Spark SQL type.
Run the following queries. Click the
(Run with Parameters) icon, assign a value to ${bizdate}, and run the query. For example, if the task runs on August 8, 2024, setbizdateto20240807.-- Query the access log external table. SELECT * FROM ods_raw_log_d_spark WHERE dt = '${bizdate}'; -- Query the user profile external table. SELECT * FROM ods_user_info_d_spark WHERE dt = '${bizdate}';
If both queries return rows, the data synchronization is complete and the external tables are working correctly.
What's next
With the data synchronized and accessible through external tables, you can now process it with Spark. See Process data.