You must synchronize raw data to MaxCompute during data preparation.
Prepare the data source
- Create an ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console and record the instance ID. For more information, see Create an ApsaraDB RDS for MySQL instance.
- Configure a whitelist for the ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console. For more information, see Configure a whitelist.
Note If you use a custom resource group to run the synchronization node for the ApsaraDB RDS for MySQL instance, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.
- Download the raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.
- Upload the raw data to the ApsaraDB RDS for MySQL instance. For more information, see Import data from Excel to ApsaraDB RDS for MySQL.
Create a data source
Note In this example, you must create an ApsaraDB RDS for MySQL data source.
- Go to the Data Source page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- After you select the region where the required workspace resides, find the workspace and click Data Integration.
- In the left-side navigation pane, choose .
- On the Data Source page, click Add data source in the upper-right corner.
- In the Add data source dialog box, click MySQL.
- In the Add MySQL data source dialog box, set the parameters based on your business requirements.
Parameter Description Data source type The type of the data source. Set the parameter to Alibaba Cloud instance mode. Data Source Name The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter. Data source description The description of the data source. The description can be a maximum of 80 characters in length. Environment The environment in which the data source is used. Valid values: Development and Production.Note This parameter is displayed only when the workspace is in standard mode. Region The region of the ApsaraDB RDS for MySQL instance. RDS instance ID The ID of your ApsaraDB RDS for MySQL instance. You can view the ID in the ApsaraDB RDS console. RDS instance account ID The ID of the Alibaba Cloud account that is used to purchase the ApsaraDB RDS for MySQL instance. You can view the ID on the security settings page in the ApsaraDB RDS console. Database name The name of the ApsaraDB RDS for MySQL database. User name The username that is used to connect to the ApsaraDB RDS for MySQL database. Password The password that is used to connect to the ApsaraDB RDS for MySQL database.
- Click Test connectivity.
- On the Data Integration tab, click Test connectivity in the Operation column of each resource group.
A sync node uses only one resource group. To ensure that your sync nodes can be properly run, you must test the connectivity of all the resource groups for Data Integration on which your sync nodes will be run. If you need to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
- After the data source passes the connectivity test, click Complete.
Create a workflow
- Click the icon in the upper-left corner and choose .
- Right-click Business Flow and select Create Workflow.
- In the Create Workflow dialog box, set the Workflow Name and Description parameters.
DataStudioNote The workflow name can be a maximum of 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
- Click Create.
- On the workflow configuration tab that appears, drag Zero-Load Node to the canvas, name the zero-load node start, and then click Commit. Create three batch synchronization nodes in the same way for synchronizing power consumption trend data, electricity-stealing flag data, and metric data.
- Draw lines between nodes and set the start node as the ancestor node of the three batch synchronization nodes.
Configure the start node
- Double-click the start node. In the right-side navigation pane, click the Properties tab.
- Set the root node of the workspace as the ancestor node of the start node.
In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must set an ancestor node for the start node. In this example, the root node of the workspace is set as the ancestor node of the start node. The root node of the workspace is named in the Workspace name_root format.
- After the configuration is complete, click the icon in the upper-left corner.
- Click the created workflow. Then, click MaxCompute.
- Right-click Table in the MaxCompute folder and select Create Table.
- In the Create Table dialog box, set the Table Name parameter and click Create.
Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store power consumption trend data, the indicators_data table is used to store metric data, and the steal_flag_data table is used to store electricity-stealing flag data.Note The table name can be a maximum of 64 characters in length. It cannot contain special characters and must start with a letter.
- On the configuration tab of each table, click DDL Statement and enter the following CREATE TABLE statements:
-- Create a table to store power consumption trend data. CREATE TABLE `trend_data` ( `uid` bigint, `trend` bigint ) PARTITIONED BY (dt string);
-- Create a table to store metric data. CREATE TABLE `indicators_data` ( `uid` bigint, `xiansun` bigint, `warnindicator` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
-- Create a table to store electricity-stealing flag data. CREATE TABLE `steal_flag_data` ( `uid` bigint, `flag` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
- After you enter the CREATE TABLE statements, click Generate Table Schema. Then, click OK.
- On the configuration tab of each table, enter the display name in the General section.
- After the configuration is complete, click Commit in Development Environment and Commit to Production Environment in sequence.
Configure the batch synchronization nodes
Configure the node to synchronize power consumption trend data.
- Double-click the node to go to the node configuration tab.
- Configure a source.
Parameter Description Connection Select workshop in sequence.and Table Select the trending table from which data is to be synchronized. Filter The condition used to filter the data that you want to synchronize. Filtering based on the LIMIT keyword is not supported. The SQL syntax is determined based on the selected data source. This parameter is optional. Shard Key If you specify this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. This parameter is optional.
- Configure a destination.
Parameter Description Connection Select odps_first in sequence.and Table Select the trend_data table to store the source data. Partition Key Column Enter the partition key column to be synchronized. Default value:
Writing Rule Select Write with Original Data Deleted (Insert Overwrite). Convert Empty Strings to Null Select No.
- Configure the mappings between fields in the source and destination.
- Set parameters in the Channel section.
Parameter Description Expected Maximum Concurrency The maximum number of parallel threads that the synchronization node uses to read data from the source and write data to the destination. You can configure the parallelism for the synchronization node on the codeless UI. Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source. Dirty Data Records Allowed The maximum number of dirty data records allowed.
- Verify that the preceding configurations are correct and click the icon in the upper-left corner.
Commit a workflow
- Go to the workflow configuration tab and click the icon in the upper-left corner.
- In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
- Click Commit. The Committed successfully message appears.
Verify data synchronization to MaxCompute
- In the left-side navigation pane, click Ad-Hoc Query. The Ad-Hoc Query tab appears.
- On the Ad-Hoc Query tab, right-click Ad-Hoc Query and choose .
- Write and execute SQL statements to query the number of data records synchronized to the trend_data, indicators_data, and steal_flag_data tables.
Use the following SQL statements. In each statement, change the partition key value to the data timestamp. For example, if the node is run on August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.
-- Check whether the data is written to MaxCompute. SELECT count(*) from trend_data where dt=Data timestamp of the ad-hoc query node; SELECT count(*) from indicators_data where ds=Data timestamp of the ad-hoc query node; SELECT count(*) from steal_flag_data where ds=Data timestamp of the ad-hoc query node;