DataWorks: Prepare data

Last Updated: Feb 02, 2024

In this tutorial, you prepare data by synchronizing raw data from an ApsaraDB RDS for MySQL database to MaxCompute.

Prepare a data source

  1. Create an ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console and record the instance ID. For more information, see Quickly create an ApsaraDB RDS for MySQL instance.

  2. Configure a whitelist for the ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console. For more information, see Configure an IP address whitelist.

    Note

    If you use a custom resource group to run a data synchronization task, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.

  3. Download the raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.

  4. Upload the raw data to the ApsaraDB RDS for MySQL instance. For more information, see Import data from an Excel file to ApsaraDB RDS for MySQL.
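
    If your raw data is available as CSV files rather than Excel files, you can also load it with standard MySQL statements. The following is a minimal sketch for one source table; the file name trend_data.csv is an assumption, and the column list is based on the table schema used later in this tutorial.

    -- Minimal sketch: load a CSV file into a source table in ApsaraDB RDS for MySQL.
    -- Requires the local_infile option to be enabled on both the client and the server.
    CREATE TABLE IF NOT EXISTS trend_data (
        uid BIGINT,
        trend BIGINT
    );
    LOAD DATA LOCAL INFILE 'trend_data.csv'
    INTO TABLE trend_data
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row if the file contains one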

Add a data source

Note

In this example, you must add an ApsaraDB RDS for MySQL data source.

  1. Go to the Management Center page.

    Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the Management Center page, click Data Source. The Data Source page appears.

  3. On the Data Source page, click Add Data Source.

  4. In the Add data source dialog box, click MySQL.

  5. In the Add MySQL data source dialog box, configure the parameters.

    Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Alibaba Cloud instance mode.

    Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.

    Data Source Description: The description of the data source. The description can be up to 80 characters in length.

    Applicable environment: The environment in which the data source is used. Valid values: Development and Production. Note: This parameter is displayed only if the workspace is in standard mode.

    Region: The region of the ApsaraDB RDS for MySQL instance.

    RDS instance ID: The ID of the ApsaraDB RDS for MySQL instance. You can log on to the ApsaraDB RDS console to view the instance ID.

    Default Database Name: The name of the database.

    Username: The username that is used to connect to the database.

    Password: The password that is used to connect to the database.

    Special Authentication Method: Select None Auth.

  6. Select a resource group and click Test Connectivity in the Connectivity Status (Production Environment) column.

  7. After the data source passes the network connectivity test, click Complete.

Create a workflow

  1. Click the icon in the upper-left corner and choose All Products > Data Development And Task Operation > DataStudio.

  2. On the page that appears, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, configure Workflow Name and Description.

    Note

    The workflow name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).

  4. Click Create.

  5. On the workflow configuration tab that appears, drag one zero load node and three batch synchronization nodes to the canvas. Name the zero load node start, configure the related settings for each node, and click Submit. The three batch synchronization nodes are used to synchronize electricity usage trend data, electricity-stealing flag data, and metric data.

  6. Draw lines between nodes and configure the start node as the ancestor node of the three batch synchronization nodes.

Configure the start node

  1. Double-click the start node. In the right-side navigation pane of the configuration tab of the start node, click the Properties tab.

  2. Configure the root node of the workspace as the ancestor node of the start node.

    In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must specify an ancestor node for the start node. In this example, the root node of the workspace is specified as the ancestor node of the start node. The root node of the workspace is named in the Workspace name_root format.

  3. After the configuration is complete, click the Save icon in the upper-left corner.

Create tables

  1. Right-click the created workflow and choose Create Table > MaxCompute > Table.

  2. In the Create Table dialog box, configure the Engine Instance, Path, and Name parameters and click Create.

    Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store electricity usage trend data, the indicators_data table is used to store metric data, and the steal_flag_data table is used to store electricity-stealing flag data.

    Note

    The table name cannot exceed 64 characters in length. The table name must start with a letter and cannot contain special characters.

  3. On the configuration tab of each table, click DDL and enter the following CREATE TABLE statements:

    -- Create a table to store electricity usage trend data.
    CREATE TABLE trend_data (
        uid bigint,
        trend bigint
    )
    PARTITIONED BY (dt string);
    -- Create a table to store metric data.
    CREATE TABLE indicators_data (
        uid bigint,
        xiansun bigint,
        warnindicator bigint
    )
    COMMENT '*'
    PARTITIONED BY (ds string)
    LIFECYCLE 36000; -- partition data is recycled 36000 days after the last modification
    -- Create a table to store electricity-stealing flag data.
    CREATE TABLE steal_flag_data (
        uid bigint,
        flag bigint
    )
    COMMENT '*'
    PARTITIONED BY (ds string)
    LIFECYCLE 36000;
  4. After you enter the CREATE TABLE statements, click Generate Table Schema. Then, click OK.

  5. On the configuration tab of each table, enter the display name in the General section.

  6. After the configuration is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
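
After both commits succeed, you can optionally confirm the table schemas before moving on. A minimal check, assuming the three table names above, is to run DESC statements in an ad-hoc query (ad-hoc queries are described in the last section of this topic):

    -- Optional check: view the schema of each committed table.
    DESC trend_data;
    DESC indicators_data;
    DESC steal_flag_data;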

Configure the batch synchronization nodes

Configure the node that synchronizes electricity usage trend data. The other two batch synchronization nodes are configured in a similar way.

  1. Double-click the node to go to the node configuration tab.

  2. Configure a source.

    Connection: Select MySQL > workshop in sequence.

    Table: The source table from which data is to be synchronized. Select the trending table.

    Filter: The condition that is used to filter the data that you want to synchronize. Filtering based on the LIMIT keyword is not supported. The SQL syntax is determined by the selected data source. This parameter is optional. A sketch of a filter condition follows this list.

    Shard Key: If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. This parameter is optional.
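
    For illustration, a filter condition might look like the following sketch. The gmt_modified column is a hypothetical example; the raw tables in this tutorial do not necessarily contain it.

    -- Hypothetical filter condition, entered in the Filter field without a WHERE keyword:
    -- synchronize only rows modified on or after the data timestamp.
    gmt_modified >= '${bdp.system.bizdate}'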

  3. Configure a destination.

    Connection: Select ODPS > odps_first in sequence.

    Table: The destination table that stores the source data. Select the trend_data table.

    Partition Key Column: The partition key column to be synchronized. Default value: dt=${bdp.system.bizdate}. A sketch of how this expression resolves follows this list.

    Writing Rule: Select Write with Original Data Deleted (Insert Overwrite).

    Convert Empty Strings to Null: Select No.
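
    To make the Partition Key Column and Writing Rule settings concrete: ${bdp.system.bizdate} resolves to the data timestamp of the node, and Insert Overwrite clears the target partition before writing, so reruns do not duplicate data. The following MaxCompute SQL is a rough illustration of this behavior, not the node's actual code; the date and the values are invented.

    -- Illustration only: approximate write behavior of the synchronization node.
    -- If the node runs on August 9, 2019, ${bdp.system.bizdate} resolves to 20190808.
    INSERT OVERWRITE TABLE trend_data PARTITION (dt='20190808')
    VALUES (1001, 5);  -- stand-in for the rows read from the MySQL source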

  4. Configure the mappings between fields in the source and destination.

  5. Configure parameters in the Channel section.

    Expected Maximum Concurrency: The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the data synchronization node on the codeless UI.

    Bandwidth Throttling: Specifies whether to enable throttling. You can enable throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.

    Dirty Data Records Allowed: The maximum number of dirty data records allowed.

  6. Verify that the preceding configurations are correct and click the Save icon in the upper-left corner.

Commit a workflow

  1. Go to the workflow configuration tab and click the Submit icon in the upper-left corner.

  2. In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.

  3. Click Commit. The Committed successfully message appears.

Check whether data is synchronized to MaxCompute

  1. In the left-side navigation pane, click Ad-Hoc Query. The Ad-Hoc Query pane appears.

  2. In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.

  3. Write and execute SQL statements to query the number of data records synchronized to the trend_data, indicators_data, and steal_flag_data tables.

    Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the scheduling time of the node is August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.

    -- Check whether the data is written to MaxCompute. Replace <data timestamp>
    -- with the data timestamp of the ad-hoc query node, for example, 20190808.
    SELECT COUNT(*) FROM trend_data WHERE dt='<data timestamp>';
    SELECT COUNT(*) FROM indicators_data WHERE ds='<data timestamp>';
    SELECT COUNT(*) FROM steal_flag_data WHERE ds='<data timestamp>';

What to do next

You now understand how to collect and synchronize data. You can proceed to the next tutorial, which describes how to compute and analyze the collected data.