DataWorks: Prepare data

Last Updated: Feb 02, 2024

In this tutorial, you prepare data by synchronizing raw data from an ApsaraDB RDS for MySQL database to MaxCompute.

Prepare a data source

  1. Create an ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console and record the instance ID. For more information, see Quickly create an ApsaraDB RDS for MySQL instance.

  2. Configure a whitelist for the ApsaraDB RDS for MySQL instance in the ApsaraDB RDS console. For more information, see Configure an IP address whitelist.

    Note

    If you use a custom resource group to run a data synchronization task, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.

  3. Download the raw data required in this tutorial: indicators_data, steal_flag_data, and trend_data.

  4. Upload the raw data to the ApsaraDB RDS for MySQL instance. For more information, see Import data from an Excel file to ApsaraDB RDS for MySQL.
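
    If your raw data is available as CSV files rather than Excel files, you can also load it with standard MySQL statements. The following is a minimal sketch for one source table; the file name trend_data.csv is an assumption, and the column list is based on the table schema used later in this tutorial.

    -- Minimal sketch: load a CSV file into a source table in ApsaraDB RDS for MySQL.
    -- Requires the local_infile option to be enabled on both the client and the server.
    CREATE TABLE IF NOT EXISTS trend_data (
        uid BIGINT,
        trend BIGINT
    );
    LOAD DATA LOCAL INFILE 'trend_data.csv'
    INTO TABLE trend_data
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row if the file contains one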

Add a data source

Note

In this example, you must add an ApsaraDB RDS for MySQL data source.

  1. Go to the Management Center page.

    Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the Management Center page, click Data Source. The Data Source page appears.

  3. On the Data Source page, click Add Data Source.

  4. In the Add data source dialog box, click MySQL.

  5. In the Add MySQL data source dialog box, configure the parameters.

    Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Alibaba Cloud instance mode.

    Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.

    Data Source Description: The description of the data source. The description can be up to 80 characters in length.

    Applicable environment: The environment in which the data source is used. Valid values: Development and Production. Note: This parameter is displayed only if the workspace is in standard mode.

    Region: The region of the ApsaraDB RDS for MySQL instance.

    RDS instance ID: The ID of the ApsaraDB RDS for MySQL instance. You can log on to the ApsaraDB RDS console to view the instance ID.

    Default Database Name: The name of the database.

    Username: The username that is used to connect to the database.

    Password: The password that is used to connect to the database.

    Special Authentication Method: Select None Auth.

  6. Select a resource group and click Test Connectivity in the Connectivity Status (Production Environment) column.

  7. After the data source passes the network connectivity test, click Complete.

Create a workflow

  1. Click the icon in the upper-left corner and choose All Products > Data Development And Task Operation > DataStudio.

  2. On the page that appears, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, configure Workflow Name and Description.

    Note

    The workflow name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).

  4. Click Create.

  5. On the workflow configuration tab that appears, drag one zero load node and three batch synchronization nodes to the canvas. Name the zero load node start, configure the related settings for each node, and click Submit. The three batch synchronization nodes are used to synchronize electricity usage trend data, electricity-stealing flag data, and metric data.

  6. Draw lines between nodes and configure the start node as the ancestor node of the three batch synchronization nodes.

Configure the start node

  1. Double-click the start node. In the right-side navigation pane of the configuration tab of the start node, click the Properties tab.

  2. Configure the root node of the workspace as the ancestor node of the start node.

    In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must specify an ancestor node for the start node. In this example, the root node of the workspace is specified as the ancestor node of the start node. The root node of the workspace is named in the Workspace name_root format.

  3. After the configuration is complete, click the Save icon in the upper-left corner.

Create tables

  1. Right-click the created workflow and choose Create Table > MaxCompute > Table.

  2. In the Create Table dialog box, configure the Engine Instance, Path, and Name parameters and click Create.

    Create three tables named trend_data, indicators_data, and steal_flag_data. The trend_data table is used to store electricity usage trend data, the indicators_data table is used to store metric data, and the steal_flag_data table is used to store electricity-stealing flag data.

    Note

    The table name cannot exceed 64 characters in length. The table name must start with a letter and cannot contain special characters.

  3. On the configuration tab of each table, click DDL and enter the following CREATE TABLE statements:

    -- Create a table to store electricity usage trend data.
    CREATE TABLE trend_data (
        uid bigint,
        trend bigint
    )
    PARTITIONED BY (dt string);
    -- Create a table to store metric data.
    CREATE TABLE indicators_data (
        uid bigint,
        xiansun bigint,
        warnindicator bigint
    )
    COMMENT '*'
    PARTITIONED BY (ds string)
    LIFECYCLE 36000; -- partition data is recycled 36000 days after the last modification
    -- Create a table to store electricity-stealing flag data.
    CREATE TABLE steal_flag_data (
        uid bigint,
        flag bigint
    )
    COMMENT '*'
    PARTITIONED BY (ds string)
    LIFECYCLE 36000;
  4. After you enter the CREATE TABLE statements, click Generate Table Schema. Then, click OK.

  5. On the configuration tab of each table, enter the display name in the General section.

  6. After the configuration is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
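
After both commits succeed, you can optionally confirm the table schemas before moving on. A minimal check, assuming the three table names above, is to run DESC statements in an ad-hoc query (ad-hoc queries are described in the last section of this topic):

    -- Optional check: view the schema of each committed table.
    DESC trend_data;
    DESC indicators_data;
    DESC steal_flag_data;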

Configure the batch synchronization nodes

Configure the node that synchronizes electricity usage trend data. The other two batch synchronization nodes are configured in a similar way.

  1. Double-click the node to go to the node configuration tab.

  2. Configure a source.

    Connection: Select MySQL > workshop in sequence.

    Table: The source table from which data is to be synchronized. Select the trending table.

    Filter: The condition that is used to filter the data that you want to synchronize. Filtering based on the LIMIT keyword is not supported. The SQL syntax is determined by the selected data source. This parameter is optional. A sketch of a filter condition follows this list.

    Shard Key: If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. This parameter is optional.
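
    For illustration, a filter condition might look like the following sketch. The gmt_modified column is a hypothetical example; the raw tables in this tutorial do not necessarily contain it.

    -- Hypothetical filter condition, entered in the Filter field without a WHERE keyword:
    -- synchronize only rows modified on or after the data timestamp.
    gmt_modified >= '${bdp.system.bizdate}'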

  3. Configure a destination.

    Connection: Select ODPS > odps_first in sequence.

    Table: The destination table that stores the source data. Select the trend_data table.

    Partition Key Column: The partition key column to be synchronized. Default value: dt=${bdp.system.bizdate}. A sketch of how this expression resolves follows this list.

    Writing Rule: Select Write with Original Data Deleted (Insert Overwrite).

    Convert Empty Strings to Null: Select No.
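
    To make the Partition Key Column and Writing Rule settings concrete: ${bdp.system.bizdate} resolves to the data timestamp of the node, and Insert Overwrite clears the target partition before writing, so reruns do not duplicate data. The following MaxCompute SQL is a rough illustration of this behavior, not the node's actual code; the date and the values are invented.

    -- Illustration only: approximate write behavior of the synchronization node.
    -- If the node runs on August 9, 2019, ${bdp.system.bizdate} resolves to 20190808.
    INSERT OVERWRITE TABLE trend_data PARTITION (dt='20190808')
    VALUES (1001, 5);  -- stand-in for the rows read from the MySQL source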

  4. Configure the mappings between fields in the source and destination.

  5. Configure parameters in the Channel section.

    Expected Maximum Concurrency: The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the data synchronization node on the codeless UI.

    Bandwidth Throttling: Specifies whether to enable throttling. You can enable throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.

    Dirty Data Records Allowed: The maximum number of dirty data records allowed.

  6. Verify that the preceding configurations are correct and click the Save icon in the upper-left corner.

Commit a workflow

  1. Go to the workflow configuration tab and click the Submit icon in the upper-left corner.

  2. In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.

  3. Click Commit. The Committed successfully message appears.

Check whether data is synchronized to MaxCompute

  1. In the left-side navigation pane, click Ad-Hoc Query. The Ad-Hoc Query pane appears.

  2. In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.

  3. Write and execute SQL statements to query the number of data records synchronized to the trend_data, indicators_data, and steal_flag_data tables.

    Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the scheduling time of the node is August 9, 2019, the data timestamp is 20190808, which is one day before the node is run.

    -- Check whether the data is written to MaxCompute. Replace <data timestamp>
    -- with the data timestamp of the ad-hoc query node, for example, 20190808.
    SELECT COUNT(*) FROM trend_data WHERE dt='<data timestamp>';
    SELECT COUNT(*) FROM indicators_data WHERE ds='<data timestamp>';
    SELECT COUNT(*) FROM steal_flag_data WHERE ds='<data timestamp>';

What to do next

You now understand how to collect and synchronize data. You can proceed to the next tutorial, which describes how to compute and analyze the collected data.