You must synchronize raw data to MaxCompute during data preparation.
Prepare the source data store
- Create an ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console and take note of the instance ID. For more information, see Create an ApsaraDB RDS for MySQL instance.
- Configure a whitelist for the RDS instance in the ApsaraDB RDS console. For more information,
see Configure whitelists.Note If you use a custom resource group to run the sync node, you must add the IP addresses of the servers in the custom resource group to the whitelist of the RDS instance.
- Download the raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.
- Upload the raw data to the RDS instance. For more information, see Import data from Excel to ApsaraDB RDS for MySQL.
Create a connection
Note In this example, you must create a MySQL connection.
- Go to the Data Source page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- After you select the region where the required workspace resides, find the workspace and click Data Integration.
- In the left-side navigation pane, click Connection. The page appears.
- On the Data Source page, click New data source in the upper-right corner.
- In the Add data source dialog box, click MySQL.
- In the Add MySQL data source dialog box, set the parameters as required.
Parameter Description Data source type Set the parameter to. Data Source Name The name of the connection can contain letters, digits, and underscores (_), and must start with a letter. Data source description The description of the connection. The description can be up to 80 characters in length. Environment Valid values: Development and Production.Note This parameter is displayed only when the workspace is in standard mode. Region Select the required region. RDS instance ID You can view the ID of the RDS instance in the ApsaraDB RDS console. RDS instance account ID After you log on to the Alibaba Cloud Management Console with your Alibaba Cloud account, you can view your account ID used to purchase the RDS instance on the Security Settings page. Database name The name of the database. User name The username that you can use to connect to the database. Password The password that you can use to connect to the database.
- Click Test connectivity.
- After the connection passes the connectivity test, click Complete.
Create a workflow
- Click the icon in the upper-left corner and choose .
- Right-click Business Flow and select Create Workflow.
- In the Create Workflow dialog box, set the Workflow Name and Description parameters.DataStudioNote The workflow name can be up to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
- Click Create.
- On the workflow configuration tab that appears, drag Zero-Load Node to the canvas, name the zero load node start, and then click Commit. Create three batch sync nodes in the same way for synchronizing power consumption trend data, electricity-stealing flag data, and metrics data.
- Draw lines between nodes and set the start node as the parent node of the three batch sync nodes.
Configure the start node
- Double-click the zero load node. In the right-side navigation pane, click the Properties tab.
- Set the root node of the workspace as the parent node of the start node. In the latest version of DataWorks, each node must have its parent and child nodes. Therefore, you must set a parent node for the start node. In this example, the root node of the workspace is set as the parent node of the start node. The root node of the workspace is named in the Workspace name_root format.
- After the configuration is completed, click the icon in the upper-left corner.
- Click the created workflow. Then, click MaxCompute.
- Right-click Table under MaxCompute and select Create Table.
- In the Create Table dialog box, set the Table Name parameter and click Create.Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store power consumption trend data, the indicators_data table is used to store metrics data, and the steal_flag_data table is used to store electricity-stealing flag data.Note The table name must be 1 to 64 characters in length. It must start with a letter and cannot contain special characters.
- On the configuration tab of each table, click DDL Statement and enter the following CREATE TABLE statements:
-- Create a table for storing power consumption trend data. CREATE TABLE `trend_data` ( `uid` bigint, `trend` bigint ) PARTITIONED BY (dt string);
-- Create a table for storing metrics data. CREATE TABLE `indicators_data` ( `uid` bigint, `xiansun` bigint, `warnindicator` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
-- Create a table for storing electricity-stealing flag data. CREATE TABLE `steal_flag_data` ( `uid` bigint, `flag` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
- After you enter the CREATE TABLE statements, click Generate Table Schema. Then, click OK.
- On the configuration tab of each table, enter the display name in the General section.
- After the configuration is completed, click Commit in Development Environment and Commit to Production Environment in sequence.
Configure the batch sync nodes
Configure the node for synchronizing power consumption trend data.
- Double-click the node to go to the node configuration tab.
- Configure a connection to the source data store.
Parameter Description Connection Select workshop in sequence.and Table Select the trending table from which data is to be synchronized. Filter The filter condition for the data to be synchronized. Filtering based on the limit keyword is not supported. The SQL syntax is determined by the selected connection. This parameter is optional. Shard Key If data sharding is performed based on the configured shard key, data can be read concurrently to improve data synchronization efficiency. This parameter is optional.
- Configure a connection to the destination data store.
Parameter Description Connection Select odps_first in sequence.and Table Select the trend_data table for storing the source data. Partition Key Column Enter the partition key column to be synchronized. Default value:
Writing Rule Select Write with Original Data Deleted (Insert Overwrite). Convert Empty Strings to Null Select No.
- Configure the mappings between fields in the source and destination.
- Set parameters in the Channel section.
Parameter Description Expected Maximum Concurrency The maximum number of concurrent threads that the sync node uses to read data from the source data store and write data to the destination data store. You can configure the concurrency for the sync node on the codeless user interface (UI). Bandwidth Throttling You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workload of the source data store. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value based on the configurations of the source data store. Dirty Data Records Allowed The maximum number of dirty data records that are allowed.
- Verify that the preceding configuration is correct and click the icon in the upper-left corner.
Commit the workflow
- Go to the workflow configuration tab and click the icon in the upper-left corner.
- In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
- Click Commit. The Committed successfully message appears.
Verify data synchronization to MaxCompute
- On the left-side navigation submenu, click the Ad-Hoc Query icon. The Ad-Hoc Query tab appears.
- Right-click Ad-Hoc Query and choose .
- Write and execute SQL statements to query the number of data records synchronized
to the trend_data, indicators_data, and steal_flag_data tables.Use the following SQL statements. In each statement, change the partition key value to the data timestamp. For example, if the node is run on August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.
-- Check whether the data is written to MaxCompute. SELECT count(*) from trend_data where dt=Data timestamp of the ad-hoc query node; SELECT count(*) from indicators_data where ds=Data timestamp of the ad-hoc query node; SELECT count(*) from steal_flag_data where ds=Data timestamp of the ad-hoc query node;