Synchronize data from MySQL and OSS to MaxCompute by using Data Integration - DataWorks

What you'll learn

By the end of this tutorial, you'll know how to:

Create partitioned MaxCompute tables to receive synchronized data
Register an ApsaraDB RDS for MySQL data source and an HttpFile data source in DataWorks
Configure a batch synchronization node for structured relational data (MySQL to MaxCompute)
Configure a batch synchronization node for raw log files (OSS to MaxCompute)
Set up daily scheduling with $bizdate parameterization
Verify synchronization results using an ad hoc query

Prerequisites

Before you begin, make sure you have:

Completed the experiment introduction and have a general understanding of the tutorial goals
Completed environment preparation, including purchasing a serverless resource group and configuring an Internet NAT gateway

Important

The test data and data sources used in this tutorial are pre-provisioned mock data for DataWorks experiments. The data can only be read through Data Integration and is not intended for production use.

Data synchronization overview

This tutorial synchronizes data from two sources into MaxCompute:

Source type	Data	Source table or object	Destination table
MySQL	Basic user information	`ods_user_info_d`	`ods_user_info_d_odps`
HttpFile (OSS)	Website access logs	`user_log.txt`	`ods_raw_log_d_odps`

Why batch synchronization? Batch synchronization is appropriate here because the source data accumulates throughout the day and is processed as a daily snapshot. If you need to process data within minutes of it arriving, use a real-time synchronization approach instead.

Go to DataStudio

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list, then click Go to Data Development.

Step 1: Design the workflow

Create a workflow

Development in DataWorks is organized around workflows. Before creating nodes, create a workflow named User profile analysis_MaxCompute. For instructions, see Create a workflow.

Add nodes to the workflow

After creating the workflow, the workflow canvas opens automatically. Click Create Node in the upper part of the canvas, drag nodes onto the canvas, and draw dependency lines between them. For the full workflow design, see the workflow design reference.

The workflow for this tutorial uses the following three nodes:

Node classification	Node type	Node name	Purpose
General	Zero load node	`workshop_start_odps`	Triggers the workflow at 00:30 daily and manages the overall execution order
Data Integration	Batch synchronization	`ods_user_info_d_odps`	Syncs basic user information from MySQL to MaxCompute
Data Integration	Batch synchronization	`ods_raw_log_d_odps`	Syncs website access logs from OSS to MaxCompute

Name each synchronization node after the destination table it populates.

Because no data lineage exists between the zero load node and the synchronization nodes, configure their dependencies by drawing lines in the workflow canvas. For details, see Scheduling dependency configuration guide.

All nodes in a DataWorks workflow require a parent node. Both synchronization nodes in this tutorial depend on workshop_start_odps, so they start running after that node completes.

Configure scheduling for the zero load node

The workshop_start_odps zero load node triggers the entire workflow. Configure its scheduling properties as follows. Leave the scheduling properties of the two synchronization nodes at their defaults for now — you'll configure them in Step 2.

Configuration item	Setting	Description
Scheduling time	`00:30`	The workflow runs daily at 00:30.
Scheduling dependencies	Root node of the workspace	The zero load node has no parent nodes, so it depends on the workspace root node.

For more information about scheduling options, see Configure scheduling time for nodes in a workflow in different scenarios and the scheduling overview.

Step 2: Configure data synchronization tasks

Create the destination MaxCompute tables

Create the MaxCompute tables that will receive the synchronized data before configuring the synchronization tasks.

Go to the table creation entry point.
Create the ods_raw_log_d_odps table. In the Create Table dialog box, enter ods_raw_log_d_odps in the Name field. Click DDL, paste the following statement, and click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the existing configuration.
```
CREATE TABLE IF NOT EXISTS ods_raw_log_d_odps
(
 col STRING
)
PARTITIONED BY
(
 dt STRING
)
LIFECYCLE 7;
```
Create the ods_user_info_d_odps table. In the Create Table dialog box, enter ods_user_info_d_odps in the Name field. Click DDL, paste the following statement, and click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the existing configuration.
```
CREATE TABLE IF NOT EXISTS ods_user_info_d_odps (
 uid STRING COMMENT 'The user ID',
 gender STRING COMMENT 'The gender',
 age_range STRING COMMENT 'The age range',
 zodiac STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (
 dt STRING
)
LIFECYCLE 7;
```
Commit and deploy both tables. For each table, click Commit to Development Environment and then Commit to Production Environment on the configuration tab. This creates the physical tables in the MaxCompute projects associated with your workspace in each environment.

After committing, you can view the tables in the MaxCompute project for the corresponding environment. Tables committed to the development environment appear in the development MaxCompute project; tables committed to the production environment appear in the production MaxCompute project.

Register data sources

Register the data sources before configuring synchronization tasks, so you can select them by name during task configuration.

Important

The test data for both data sources is hosted on the internet. Make sure an Internet NAT gateway is configured for your DataWorks resource group as described in Step 2 of environment preparation. Without this, connectivity tests will fail with timeout errors.

To navigate to the Data Sources page:

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list, then click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.

Register the MySQL data source

Add the ApsaraDB RDS for MySQL data source (user_behavior_analysis_mysql) to access the basic user information table.

On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, click MySQL.

On the Add MySQL Data Source page, configure the following parameters:

Set Environment to Development and Production. A data source must exist in both environments; otherwise, tasks fail when they run against the production environment.

Parameter	Value
Data Source Name	`user_behavior_analysis_mysql`
Configuration Mode	Connection String Mode
Environment	Development and Production
Host IP Address	`rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com`
Port Number	`3306`
Database Name	`workshop`
Username	`workshop`
Password	`workshop#2017`
Authentication Method	No Authentication

In the Connection Configuration section, find your serverless resource group and click Test Network Connectivity in the Connection Status column. Test connectivity for both the development and production environments. When the test succeeds, the status changes to Connected.

Register the HttpFile data source

Add the HttpFile data source (user_behavior_analysis_httpfile) to access the website access log file stored in OSS.

On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, click HttpFile.

On the Add HttpFile Data Source page, configure the following parameters:

Set Environment to both Development Environment and Production Environment to avoid task failures when running in production.

Parameter	Value
Data Source Name	`user_behavior_analysis_httpfile`
Environment	Development Environment and Production Environment
URL Domain	`https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com`

In the Connection Configuration section, click Test Network Connectivity for your serverless resource group. Test both environments. When the test succeeds, the status changes to Connected.

Configure the MySQL to MaxCompute synchronization task

This task reads the ods_user_info_d table from MySQL and writes the data to the ods_user_info_d_odps MaxCompute table.

Double-click the ods_user_info_d_odps batch synchronization node to open its configuration tab.

Configure the connection. Set the source, resource group, and destination, then click Next and complete the connectivity test.

Parameter	Value
Source	Type: `MySQL`; Data Source Name: `user_behavior_analysis_mysql`
Resource Group	The serverless resource group you purchased during environment preparation
Destination	Type: `MaxCompute`; Data Source Name: `user_behavior_analysis_mysql`

Configure the source and destination settings. Use the following settings. Leave all other parameters at their defaults. For field mapping and Channel Control settings, the defaults work for this tutorial. For other configuration options, see Configure a task in the codeless UI.

Section	Parameter	Value
Source	Table	`ods_user_info_d`
Source	Split key	`uid` (primary key; used to parallelize data reads)
Destination	Tunnel Resource Group	Common transmission resources (default)
Destination	Schema	`default`
Destination	Table	`ods_user_info_d_odps`
Destination	Partition information	`${bizdate}`
Destination	Write Mode	Clean up existing data before writing (Insert Overwrite)
Destination	Write by Converting Empty Strings into Null	No

Configure scheduling properties. Click Properties in the right-side navigation pane. Use the following settings and leave all other parameters at their defaults.

The node runs after workshop_start_odps completes each day. DataWorks uses the node output to resolve scheduling dependencies for downstream SQL nodes through automatic data lineage parsing.

Section	Parameter	Value
Scheduling Parameter	Parameter Name / Value	`bizdate` / `$bizdate` (date of the previous day in `yyyymmdd` format)
Schedule	Scheduling Cycle	`Day`
Schedule	Scheduled time	`00:30`
Schedule	Rerun	Allow Regardless of Running Status
Resource Group	—	The serverless resource group from environment preparation
Dependencies	Parent node	`workshop_start_odps`
Dependencies	Node output	Verify that an output named `<MaxCompute project name in the production environment>.ods_user_info_d_odps` exists. Add it manually if it is missing.

Configure the HttpFile to MaxCompute synchronization task

This task reads the user_log.txt log file from OSS via HttpFile and writes the raw log data to the ods_raw_log_d_odps MaxCompute table.

Double-click the ods_raw_log_d_odps batch synchronization node to open its configuration tab.

Configure the connection. Set the source, resource group, and destination, then click Next and complete the connectivity test.

Parameter	Value
Source	Type: `HttpFile`; Data Source Name: `user_behavior_analysis_HttpFile`
Resource Group	The serverless resource group you purchased during environment preparation
Destination	Type: `MaxCompute`; Data Source Name: `user_behavior_analysis_mysql`

Configure the source and destination settings. Use the following settings. Leave all other parameters at their defaults. After setting the source parameters, click Confirm Data Structure to verify that DataWorks can read the log file. For field mapping and Channel Control settings, the defaults work for this tutorial. For other configuration options, see Configure a task in the codeless UI.

Section	Parameter	Value
Source	File Path	`/user_log.txt`
Source	File Type	`text`
Source	Column Delimiter	`\|`
Source (Advanced)	Coding	`UTF-8`
Source (Advanced)	Compression format	`UTF-8`
Source (Advanced)	Skip Header	No
Destination	Tunnel Resource Group	Common transmission resources (default)
Destination	Schema	`default`
Destination	Table	`ods_raw_log_d_odps`
Destination	Partition information	`${bizdate}`
Destination	Write Mode	Clean up existing data before writing (Insert Overwrite)
Destination	Write by Converting Empty Strings into Null	No

Configure scheduling properties. Click Properties in the right-side navigation pane. Use the following settings and leave all other parameters at their defaults.

Section	Parameter	Value
Scheduling Parameter	Parameter Name / Value	`bizdate` / `$bizdate` (date of the previous day in `yyyymmdd` format)
Schedule	Scheduling Cycle	`Day`
Schedule	Scheduled time	`00:30`
Schedule	Rerun	Allow Regardless of Running Status
Resource Group	—	The serverless resource group from environment preparation
Dependencies	Parent node	`workshop_start_odps`
Dependencies	Node output	Verify that an output named `<MaxCompute project name in the production environment>.ods_raw_log_d_odps` exists. Add it manually if it is missing.

Step 3: Run the workflow and verify results

Run the workflow

In the DataStudio left-side navigation pane, under Business Flow, double-click the User profile analysis_MaxCompute workflow.
On the workflow canvas, click the icon in the top toolbar to trigger all nodes based on their scheduling dependencies.

Check node status

Node state indicator: A node displaying the icon is running normally.
Logs: Right-click the ods_user_info_d_odps or ods_raw_log_d_odps node and select View Logs. A successful run shows a completion message in the log output.

Verify the synchronization results

When the workflow completes successfully:

All records from ods_user_info_d in ApsaraDB RDS for MySQL are written to the previous day's partition in workshop2024_01_dev.ods_user_info_d_odps.
All access log records from user_log.txt in OSS are written to the previous day's partition in workshop2024_01_dev.ods_raw_log_d_odps.

To confirm the row counts, run an ad hoc query:

In the DataStudio left-side navigation pane, click the icon. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > ODPS SQL.
Run the following queries, replacing Data timestamp with the $bizdate value for your run (the date of the previous day in yyyymmdd format — for example, if the node ran on June 21, 2023, the data timestamp is 20230620):
```
select count(*) from ods_user_info_d_odps where dt='Data timestamp';
select count(*) from ods_raw_log_d_odps where dt='Data timestamp';
```

Nodes run in DataStudio execute in the development environment by default. Query results reflect data in the workshop2024_01_dev MaxCompute project.

What's next

Data collection is complete. The next tutorial covers processing the user information and access logs you just synchronized into MaxCompute. See Process data.