You need to synchronize raw data to MaxCompute during data preparation.
Prepare the source data store
- Create an ApsaraDB RDS for MySQL instance in the ApsaraDB for RDS console and take note of the instance ID.
- Configure a whitelist for the ApsaraDB RDS for MySQL instance in the ApsaraDB for
RDS console. For more information, see Configure a whitelist.
Note If you use a custom resource group to run sync nodes, you must also add the server IP address of the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.
- Download raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.
- Upload the raw data to the ApsaraDB RDS for MySQL instance.
Create a connection
Note In this example, you must create a MySQL connection.
- Go to the Data Source page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- In the top navigation bar, select the region where the target workspace resides. Find the target workspace and click Data Integration in the Actions column.
- On the page that appears, click Connection in the left-side navigation pane. The page appears.
- On the Data Source page, click Add a Connection in the upper-right corner.
- In the Add Connection dialog box that appears, select MySQL.
- In the Add MySQL Connection dialog box, set parameters for the MySQL connection.
Parameter Description Connect To The type of the connection. Set the value to. Connection Name The name of the connection. The name can contain letters, digits, and underscores (_) and must start with a letter. Description The description of the connection. The description can be up to 80 characters in length. Environment The environment in which the connection is used. Valid values: Development and Production.Note This parameter is available only when the workspace is in the standard mode. Region The region of the ApsaraDB RDS for MySQL instance. RDS Instance ID The ID of the ApsaraDB RDS for MySQL instance. You can view the ID in the ApsaraDB for RDS console. RDS Instance Account ID The ID of the Alibaba Cloud account used to purchase the ApsaraDB RDS for MySQL instance. You can view your account ID on the Security Settings page after you log on to the Alibaba Cloud console with your Alibaba Cloud account. Database Name The name of the ApsaraDB RDS for MySQL database. Username The username for logging on to the database. Password The password for logging on to the database.
- Click Test Connection.
- After the connection passes the connectivity test, click Complete.
Create a workflow
- Click in the upper-left corner and choose .
- On the Data Analytics tab, right-click Business Flow and select Create Workflow.
- In the Create Workflow dialog box that appears, set Workflow Name and Description.
Note The workflow name can be up to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
- Click Create.
- On the workflow dashboard that appears, click Zero-Load Node or drag it to the canvas. In the Create Node dialog box that appears, create a zero-load node named start and click Commit. Create three batch sync nodes in the same way for synchronizing power consumption trend data, electricity-stealing flag data, and metrics data.
- Draw lines between nodes and set the start node as the parent node of the three batch sync nodes.
Configure the start node
- In the directed acyclic graph (DAG) of the workflow, double-click the zero-load node. On the node configuration tab that appears, click the Properties tab.
- On the Properties tab that appears, set the root node of the workspace as the parent
node of the start node.
In the latest version of DataWorks, each node must have its parent and child nodes. Therefore, you must set a parent node for the start node. In this example, the root node of the workspace is set as the parent node of the start node. The root node of the workspace is named in the Workspace name_root format.
- After the configuration is completed, click in the upper-left corner.
- On the Data Analytics tab, click the created workflow and click MaxCompute.
- Right-click Table under MaxCompute and select Create Table.
- In the Create Table dialog box that appears, set Table Name and click Commit.
Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store power consumption trend data, the indicators_data table is used to store metrics data, and the steal_flag_data table is used to store electricity-stealing flag data.Note The table name can be up to 64 characters in length. The table name must start with a letter and cannot contain Chinese or special characters.
- On the editing tab of each table, click DDL Statement. In the DDL Statement dialog box that appears, enter the following CREATE TABLE statements:
-- Create a table for storing power consumption trend data. CREATE TABLE `trend_data` ( `uid` bigint, `trend` bigint ) PARTITIONED BY (dt string);
-- Create a table for storing metrics data. CREATE TABLE `indicators_data` ( `uid` bigint, `xiansun` bigint, `warnindicator` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
-- Create a table for storing electricity-stealing flag data. CREATE TABLE `steal_flag_data` ( `uid` bigint, `flag` bigint ) COMMENT '*' PARTITIONED BY (ds string) LIFECYCLE 36000;
- Click Generate Table Schema. In the Confirm message that appears, click OK.
- On the editing tab of each table, enter the display name in the General section.
- Click Commit in Development Environment and Commit to Production Environment in sequence.
Configure batch sync nodes
Configure the node for synchronizing power consumption trend data.
- In the DAG of the workflow, double-click the target node to go to the node configuration tab.
- Select the source connection.
Parameter Description Connection The connection for accessing the source data. Select workshop in sequence.and Table The table from which data is synchronized. Select trending. Filter The filter condition for the data to be synchronized. Currently, filtering based on the limit keyword is not supported. The SQL syntax is determined by the selected connection. This parameter is optional. Shard Key The shard key in the source data. If data sharding is performed based on the configured shard key, data can be read concurrently to improve data synchronization efficiency. This parameter is optional.
- Select the destination connection.
Parameter Description Connection The connection for storing the source data to MaxCompute. Select odps_first in sequence.and Table The table for storing the source data. Select trend_data. Partition Key Column The partition key column to be synchronized. Default value:
Writing Rule The rule for writing data to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite). Convert Empty Strings to Null Specifies whether to convert empty strings to null. Select No.
- Configure the mapping between fields in the source and destination tables.
- Set parameters in the Channel section.
Parameter Description Expected Maximum Concurrency The maximum number of concurrent threads to read data from or write data to data storage within the sync node. You can configure the concurrency for a node on the codeless user interface (UI). Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workload of the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value. Dirty Data Records Allowed The maximum number of dirty data records allowed.
- Verify that the preceding configuration is correct and click in the upper-left corner.
Commit the workflow
- On the Data Analytics tab, double-click the target workflow. On the workflow dashboard that appears, click in the upper-left corner.
- In the Commit dialog box that appears, select the nodes to be committed, set Description, and then select Ignore I/O Inconsistency Alerts.
- Click Commit. The message Committed successfully appears.
Verify data synchronization to MaxCompute
- On the left-side navigation submenu, click the Ad-Hoc Query icon. The Ad-Hoc Query tab appears.
- On the Ad-Hoc Query tab, right-click Ad-Hoc Query and choose .
- In the Create Node dialog box that appears, enter the node name and click Commit.
On the node configuration tab that appears, write and execute SQL statements to query
the number of data records synchronized to the trend_data, indicators_data, and steal_flag_data
Use the following SQL statements. In each statement, change the partition key value to the data timestamp of the ad-hoc query node. For example, if the node is run on August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.
-- Check whether data is synchronized to MaxCompute. SELECT count(*) from trend_data where dt=Data timestamp of the ad-hoc query node; SELECT count(*) from indicators_data where ds=Data timestamp of the ad-hoc query node; SELECT count(*) from steal_flag_data where ds=Data timestamp of the ad-hoc query node;