You must synchronize raw data to MaxCompute during data preparation.
Prerequisites
A MaxCompute data source is added. For more information, see Add a MaxCompute data source.
Prepare a data source
Create an ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console and record the instance ID. For more information, see Quickly create an ApsaraDB RDS for MySQL instance.
Configure a whitelist for the ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console. For more information, see Configure an IP address whitelist.
NoteIf you use a custom resource group to run a data synchronization task, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.
Download the raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.
Upload the raw data to the ApsaraDB RDS for MySQL instance. For more information, see Import data from an Excel file to ApsaraDB RDS for MySQL.
Add a data source
In this example, you must add an ApsaraDB RDS for MySQL data source.
Go to the Management Center page.
Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Source. The Data Source page appears.
On the Data Source page, click Add Data Source.
In the Add Data Source dialog box, click MySQL.
In the Add MySQL Data Source dialog box, configure the parameters.
Parameter
Description
Data Source Type
The mode in which you want to add the data source to DataWorks. Set this parameter to Alibaba Cloud Instance Mode.
Data Source Name
The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.
Data Source Description
The description of the data source. The description can be up to 80 characters in length.
Environment
The environment in which the data source is used. Valid values: Development and Production.
NoteThis parameter is displayed only if the workspace is in standard mode.
Region
The region of the ApsaraDB RDS for MySQL instance.
RDS instance ID
The ID of the ApsaraDB RDS for MySQL instance. You can log on to the ApsaraDB RDS console to view the instance ID.
Default Database Name
The name of the database.
Username
The username that is used to connect to the database.
Password
The password that is used to connect to the database.
Special Authentication Method
Select None Auth.
Select a resource group and click Test Connectivity in the Connectivity Status (Production Environment) column.
After the data source passes the network connectivity test, click Complete.
Create a workflow
Click the icon in the upper-left corner and choose .
On the page that appears, right-click Business Flow and select Create Workflow.
In the Create Workflow dialog box, configure Workflow Name and Description.
NoteThe workflow name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
Click Create.
On the workflow configuration tab that appears, drag one zero load node and three batch synchronization nodes to the canvas, name the zero load node start, configure the related settings for the nodes, and click Submit. The batch synchronization nodes are used to synchronize electricity usage trend data, electricity-stealing flag data, and metric data.
Draw lines between nodes and configure the start node as the ancestor node of the three batch synchronization nodes.
Configure the start node
Double-click the start node. In the right-side navigation pane of the configuration tab of the start node, click the Properties tab.
Configure the root node of the workspace as the ancestor node of the start node.
In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must specify an ancestor node for the start node. In this example, the root node of the workspace is specified as the ancestor node of the start node. The root node of the workspace is named in the Workspace name_root format.
After the configuration is complete, click the icon in the upper-left corner.
Create tables
Right-click the created workflow and choose .
In the Create Table dialog box, configure the Engine Instance, Path, and Name parameters and click Create.
Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store electricity usage trend data, the indicators_data table is used to store metric data, and the steal_flag_data table is used to store electricity-stealing flag data.
NoteThe table name cannot exceed 64 characters in length. The table name must start with a letter and cannot contain special characters.
On the configuration tab of each table, click DDL and enter the following CREATE TABLE statements:
-- Create a table to store electricity usage trend data. CREATE TABLE trend_data ( uid bigint, trend bigint ) PARTITIONED BY (dt string);
-- Create a table to store metric data. CREATE TABLE indicators_data ( uid bigint, xiansun bigint, warnindicator bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
-- Create a table to store electricity-stealing flag data. CREATE TABLE steal_flag_data ( uid bigint, flag bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
After you enter the CREATE TABLE statements, click Generate Table Schema. Then, click OK.
On the configuration tab of each table, enter the display name in the General section.
After the configuration is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
Configure the batch synchronization nodes
Configure the node to synchronize electricity usage trend data.
Double-click the node to go to the node configuration tab.
Configure a source.
Parameter
Description
Connection
Select
in sequence.Table
Select the trending table from which data is to be synchronized.
Filter
The condition used to filter the data that you want to synchronize. Filtering based on the LIMIT keyword is not supported. The SQL syntax is determined based on the selected data source. This parameter is optional.
Shard Key
If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. This parameter is optional.
Configure a destination.
Parameter
Description
Connection
Select ODPS and then the MaxCompute data source that you add.
Table
Select the trend_data table to store the source data.
Partition Key Column
Enter the partition key column to be synchronized. Default value:
dt=${bdp.system.bizdate}
.Writing Rule
Select Write with Original Data Deleted (Insert Overwrite).
Convert Empty Strings to Null
Select No.
Configure the mappings between fields in the source and destination.
Configure parameters in the Channel section.
Parameter
Description
Expected Maximum Concurrency
The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the data synchronization node on the codeless UI.
Bandwidth Throttling
Specifies whether to enable throttling. You can enable throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
Dirty Data Records Allowed
The maximum number of dirty data records allowed.
Verify that the preceding configurations are correct and click the icon in the upper-left corner.
Commit a workflow
Go to the workflow configuration tab and click the icon in the upper-left corner.
In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
Click Commit. The Committed successfully message appears.
Check whether data is synchronized to MaxCompute
In the left-side navigation pane, click Ad-Hoc Query. The Ad-Hoc Query pane appears.
In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose .
Write and execute SQL statements to query the number of data records synchronized to the trend_data, indicators_data, and steal_flag_data tables.
Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the scheduling time of the node is August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.
-- Check whether the data is written to MaxCompute. SELECT count(*) from trend_data where dt=Data timestamp of the ad hoc query node; SELECT count(*) from indicators_data where ds=Data timestamp of the ad hoc query node; SELECT count(*) from steal_flag_data where ds=Data timestamp of the ad hoc query node;
What to do next
You understand how to collect and synchronize data. You can now proceed with the next tutorial. The next tutorial describes how to compute and analyze collected data.