This tutorial walks you through synchronizing data from two heterogeneous sources — a MySQL database and an OSS flat file — into MaxCompute using DataWorks Data Integration. By the end, the ods_user_info_d and ods_raw_log_d tables in MaxCompute will be populated and ready for downstream processing.
In this tutorial, you will:
- Add MySQL and HttpFile data sources to your DataWorks workspace.
- Build a sync pipeline with three nodes: a Zero Load Node and two Batch Synchronization nodes.
- Configure each sync task, including the destination table DDL, partition parameters, and dirty data policy.
- Run the pipeline and verify that data is written to MaxCompute.
Prerequisites
Before you begin, make sure you have:
- Completed Prepare the environment to set up your DataWorks workspace, Serverless resource group, and public network access.
- A Serverless resource group attached to your workspace. If you do not have one, follow the prompts in the connection configuration section: click Purchase and then Associated Purchased Resource Group.
Step 1: Create data sources
Add the following two data sources to your DataWorks workspace so that subsequent sync tasks can reach the raw data.
If you have already created MySQL and HttpFile data sources for other user profile analysis tutorials, skip this step.
Create a MySQL data source
The basic user information for this tutorial is stored in a MySQL database. Create a MySQL data source named user_behavior_analysis_mysql to pull the ods_user_info_d table into MaxCompute.
- Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.
- In the left-side navigation pane, click Data Sources.
- Click Add Data Source, then search for and select MySQL.
- On the Add MySQL Data Source page, configure the parameters below. Use these values for both the development and production environments. Keep the default values for any parameters not listed.

  | Parameter | Value |
  | --- | --- |
  | Data Source Name | user_behavior_analysis_mysql |
  | Data Source Description | This data source is for DataWorks tutorials. Read data from this data source when you configure an offline sync task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it. |
  | Configuration Mode | Connection String Mode |
  | Connection Address | Host IP: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com, Port: 3306 |
  | Database Name | workshop |
  | Username | workshop |
  | Password | workshop#2017 |
  | Authentication Method | No authentication |

- In the Connection Configuration section, click Test Network Connectivity for both environments. Confirm that the status shows Connected.

  Important: The resource group must be attached to the workspace with public network access enabled. Otherwise, data synchronization fails.

- Click Complete Creation.
Create an HttpFile data source
The user website access logs for this tutorial are stored in OSS (Object Storage Service). Create an HttpFile data source named user_behavior_analysis_httpfile to pull user_log.txt into MaxCompute.
- In the left-side navigation pane, click Data Sources.
- Click Add Data Source, then search for and select HttpFile.
- On the Add HttpFile Data Source page, configure the parameters below. Use these values for both the development and production environments. Keep the default values for any parameters not listed.

  | Parameter | Value |
  | --- | --- |
  | Data Source Name | user_behavior_analysis_httpfile |
  | Data Source Description | This data source is for DataWorks tutorials. Read data from this data source when you configure an offline sync task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it. |
  | URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com |

- In the Connection Configuration section, click Test Network Connectivity for both environments. Confirm that the status shows Connected.

  Important: The resource group must be attached to the workspace with public network access enabled. Otherwise, data synchronization fails.

- Click Complete Creation.
Step 2: Build a sync pipeline
Build a workflow that orchestrates the two sync tasks and runs daily on a schedule.
- Click the icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Switch to the workspace created for this tutorial.
- In the left-side navigation pane, click the icon to open the Data Development page. In the Workspace Directories area, click the icon, select Create Workflow, and name it user_profile_analysis.
- On the workflow canvas, drag one Zero Load Node and two Batch Synchronization nodes from the left pane onto the canvas. Name them as follows:

  | Node type | Node name | Function |
  | --- | --- | --- |
  | Zero Load Node | workshop_start | Dry-run ancestor node that manages the entire workflow. No code editing required. |
  | Batch Synchronization | ods_user_info_d | Syncs basic user information from MySQL to the ods_user_info_d table in MaxCompute. |
  | Batch Synchronization | ods_raw_log_d | Syncs user website access logs from OSS to the ods_raw_log_d table in MaxCompute. |

- Drag connections to set workshop_start as the ancestor node of both Batch Synchronization nodes.
- Configure workflow scheduling. On the canvas, click Scheduling in the right-side pane and set the following parameters. Keep the default values for any parameters not listed.

  | Scheduling parameter | Value |
  | --- | --- |
  | Scheduling Parameters | bizdate=$[yyyymmdd-1] (passes the previous day's date to all inner nodes) |
  | Scheduling Cycle | Day |
  | Scheduling Time | 00:30 |
  | Scheduling Dependencies | Click Use Workspace Root Node to attach the workflow to the root node (format: workspace_name_root). |
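The $[yyyymmdd-1] expression resolves to the day before the scheduled run time, formatted as yyyymmdd. A minimal sketch of that resolution in Python (resolve_bizdate is an illustrative helper, not a DataWorks API):

```python
from datetime import datetime, timedelta

def resolve_bizdate(cycle_time: datetime) -> str:
    """Mimic the $[yyyymmdd-1] expression: the day before the cycle time, as yyyymmdd."""
    return (cycle_time - timedelta(days=1)).strftime("%Y%m%d")

# An instance that runs at 00:30 on 2025-02-24 receives bizdate=20250223,
# so it processes the previous day's data.
print(resolve_bizdate(datetime(2025, 2, 24, 0, 30)))  # 20250223
```

Because the value is computed per instance, every daily run automatically targets its own previous-day partition.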
Step 3: Configure sync tasks
Configure the initial node
- On the workflow canvas, hover over workshop_start and click Open Node.
- In the right-side pane, click Scheduling and set the following parameters. Keep the default values for any parameters not listed.

  | Scheduling parameter | Value |
  | --- | --- |
  | Scheduling Type | Dry-run scheduling |
  | Resource Group | Select the Serverless resource group created in Prepare the environment. |
  | Scheduling Dependencies | Click Use Workspace Root Node to trigger execution from the workspace root node (workspace_name_root). |
Configure the user data sync task (ods_user_info_d)
- On the workflow canvas, hover over ods_user_info_d and click Open Node.
- Configure the source and destination:

  | Parameter | Value |
  | --- | --- |
  | Source: Data source type | MySQL |
  | Source: Data source name | user_behavior_analysis_mysql |
  | Resource Group | Select the Serverless resource group from Prepare the environment. |
  | Destination: Data destination type | MaxCompute(ODPS) |
  | Destination: Data source name | Select the MaxCompute resource attached in Prepare the environment, for example MaxCompute_Source. |

- Click Next to configure the sync task details.
- Configure source and destination. Set the following parameters. Keep the default values for any parameters not listed.

  | Module | Configuration item | Value |
  | --- | --- | --- |
  | Data source | Table | ods_user_info_d |
  | Data source | Split key | uid (the primary key of the ods_user_info_d table in MySQL) |
  | Destination | Tunnel Resource Group | Common transmission resources (default). If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information, see Purchase and use an exclusive resource group for Data Transmission Service. |
  | Destination | Table | Click Generate Destination Table Schema. In the Statement for Creating Table box, paste the following DDL and click Create Table. |
  | Destination | Partition information | ${bizdate} |

  ```sql
  CREATE TABLE IF NOT EXISTS ods_user_info_d (
      uid STRING COMMENT 'User ID',
      gender STRING COMMENT 'Gender',
      age_range STRING COMMENT 'Age range',
      zodiac STRING COMMENT 'Zodiac sign'
  )
  PARTITIONED BY (
      dt STRING
  )
  LIFECYCLE 7;
  ```

  Using ${bizdate} lets you pass a constant during testing and a dynamic value during scheduled runs. For supported variable formats, see Scheduling Parameters.
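Before the task writes, Data Integration substitutes the scheduling parameter into the partition expression, so the data lands in a concrete partition such as dt=20250223. A rough sketch of that substitution (render_partition is an illustrative helper, not a DataWorks function):

```python
def render_partition(template: str, params: dict) -> str:
    """Substitute ${name} placeholders with their resolved scheduling values."""
    for name, value in params.items():
        template = template.replace("${" + name + "}", value)
    return template

# With bizdate resolved to 20250223, the task writes into partition dt=20250223.
print(render_partition("dt=${bizdate}", {"bizdate": "20250223"}))  # dt=20250223
```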
- Confirm field mapping and channel control. DataWorks maps source fields to destination fields based on the field mapping configuration. Set Policy for Dirty Data Records to Disallow Dirty Data Records. Keep the default values for all other settings. For details, see Configure a sync task using the codeless UI.
- Configure debug parameters. In the right-side pane, click Debugging Configurations and set the following:

  | Configuration item | Value |
  | --- | --- |
  | Resource Group | Select the Serverless resource group from Prepare the environment. |
  | Script Parameters | Leave blank. When you run the workflow in Step 4, set Value Used in This Run to a constant such as 20250223. The job replaces ${bizdate} with this value. |

- (Optional) Configure scheduling properties. In the right-side pane, click Scheduling. The workflow-level scheduling parameters already apply to this node; no separate configuration is required. For details, see Node scheduling.
- In the node toolbar, click Save.
Configure the user log sync task (ods_raw_log_d)
- On the workflow canvas, hover over ods_raw_log_d and click Open Node.
- Configure the source and destination:

  | Parameter | Value |
  | --- | --- |
  | Source: Data source type | HttpFile |
  | Source: Data source name | user_behavior_analysis_httpfile |
  | Resource Group | Select the Serverless resource group from Prepare the environment. |
  | Destination: Data destination type | MaxCompute(ODPS) |
  | Destination: Data source name | Select the MaxCompute resource from Prepare the environment, for example MaxCompute_Source. |

- Click Next to configure the sync task details.
- Configure source and destination. Set the following parameters. Keep the default values for any parameters not listed.

  Important: After you configure the data source settings, click Confirm Data Structure to verify that DataWorks can read the log file correctly before proceeding.

  | Module | Configuration item | Value |
  | --- | --- | --- |
  | Data source | File path | /user_log.txt |
  | Data source | Text type | text |
  | Data source | Column Delimiter | \| |
  | Data source | Advanced configuration > Skip Header | No |
  | Destination | Tunnel Resource Group | Common transmission resources (default). If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information, see Purchase and use an exclusive resource group for Data Transmission Service. |
  | Destination | Table | Click Generate Destination Table Schema. In the Statement for Creating Table box, paste the following DDL and click Create Table. |
  | Destination | Partition information | ${bizdate} |

  ```sql
  CREATE TABLE IF NOT EXISTS ods_raw_log_d (
      col STRING
  )
  PARTITIONED BY (
      dt STRING
  )
  LIFECYCLE 7;
  ```
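The destination table has a single col column, and the | delimiter is chosen because it does not occur in the log text: no split ever happens, so each raw line lands whole in that one column. A sketch of this behavior with a hypothetical log line:

```python
# Hypothetical access-log line; the '|' delimiter does not occur in it.
line = '192.168.0.1 - - [23/Feb/2025:00:30:00 +0800] "GET /index.html HTTP/1.1"'

fields = line.split("|")
assert len(fields) == 1   # the delimiter never matches, so no split happens
assert fields[0] == line  # the whole line maps to the single col column
```

Downstream processing nodes can then parse the structured fields out of col with SQL or UDFs.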
- Confirm field mapping and channel control. Set Policy for Dirty Data Records to Disallow Dirty Data Records. Keep the default values for all other settings. For details, see Configure a sync task using the codeless UI.
- Configure debug parameters. In the right-side pane, click Debugging Configurations and set the following:

  | Configuration item | Value |
  | --- | --- |
  | Resource Group | Select the Serverless resource group from Prepare the environment. |
  | Script Parameters | Leave blank. When you run the workflow in Step 4, set Value Used in This Run to a constant such as 20250223. |

- (Optional) Configure scheduling properties. In the right-side pane, click Scheduling. The workflow-level parameters apply to this node automatically. For details, see Node scheduling.
- In the node toolbar, click Save.
Step 4: Sync data
- In the workflow toolbar, click Run. Set Value Used in This Run for the bizdate variable (for example, 20250223) and click OK. Wait for the run to complete.
- Verify the results.
  - In the left-side navigation pane of DataStudio, click the icon to open the Data Development page. In the personal folder area, click the icon to create a file with the .sql extension.
  - At the bottom of the page, confirm that the language mode is MaxCompute SQL.
  - Run the following queries to count the records loaded into each table. Replace your_data_timestamp with the bizdate value used in the run (for example, 20250223).

    ```sql
    -- Replace your_data_timestamp with the actual bizdate value (e.g., 20250223)
    SELECT count(*) FROM ods_user_info_d WHERE dt='your_data_timestamp';
    SELECT count(*) FROM ods_raw_log_d WHERE dt='your_data_timestamp';
    ```

  - If both queries return a non-zero count, data synchronization is complete.
  - If a query returns zero, confirm that the Value Used in This Run matches the dt value in the query. To check, click the workflow, open Running History in the right-side pane, and click View in the Actions column to inspect the partition value (partition=[pt=xxx]) in the run log.
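A zero count usually means the partition literal in the query does not match the partition the sync actually wrote. Because dt is a STRING, only an exact string match finds the partition, as this sketch with hypothetical row counts shows:

```python
# Hypothetical partition -> row-count map for ods_user_info_d after a run
# where Value Used in This Run was 20250223.
partitions = {"20250223": 1000}

# dt is a STRING column: only an exact literal match finds the partition.
assert partitions.get("20250223", 0) > 0     # matches the value used in the run
assert partitions.get("2025-02-23", 0) == 0  # same date, different format: zero rows
```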
What you learned
In this tutorial, you:
- Added a MySQL data source and an HttpFile data source to your DataWorks workspace.
- Built a user_profile_analysis workflow with a Zero Load Node and two Batch Synchronization nodes.
- Configured source and destination tables, partition parameters, and dirty data policy for each sync task.
- Ran the pipeline and verified that data was written to the ods_user_info_d and ods_raw_log_d tables in MaxCompute.
What's next
Proceed to Process data to transform and analyze the synchronized data in MaxCompute.