
DataWorks:Prepare data

Last Updated: Mar 26, 2026

Prepare data for electricity analysis by synchronizing raw CSV files from ApsaraDB RDS for MySQL into MaxCompute. This tutorial walks you through the full data preparation pipeline: registering a MySQL data source in DataWorks, creating a workflow with three offline synchronization nodes, defining the target MaxCompute tables, and verifying that data is written successfully.

By the end of this tutorial, you will have:

  • Registered an ApsaraDB RDS for MySQL data source in DataWorks Management Center

  • Created a workflow with one start node and three offline synchronization nodes

  • Created three MaxCompute tables: trend_data, indicators_data, and steal_flag_data

  • Verified that data is synchronized to MaxCompute

Prerequisites

Before you begin, complete the following preparation steps.

Prepare a data source

  1. In the ApsaraDB RDS console, create an ApsaraDB RDS for MySQL instance and note the instance ID. See Create an ApsaraDB RDS for MySQL instance and configure databases.

  2. Configure an IP address whitelist for the instance.

    If you use a serverless resource group to run synchronization tasks, the whitelist configuration depends on how the resource group accesses the data source:

      • VPC access: Add the CIDR block of the vSwitch associated with the resource group to the IP address whitelist.

      • Internet access: Add the elastic IP address (EIP) of the VPC associated with the serverless resource group to the IP address whitelist. For older resource groups, add the resource group's EIP instead.

    For details, see Network connectivity solutions.
  3. Download the raw data files for this tutorial:

  4. Upload the raw data to the ApsaraDB RDS for MySQL instance. See Import data from an Excel file to ApsaraDB RDS for MySQL.
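
If you load the CSV files with a SQL client instead of the Excel import tool, the statements below sketch one way to do it for the trend data. This is an illustrative sketch: the table name trending matches the source table selected later in this tutorial, but the file path and the assumption that the CSV has a header row are hypothetical, and LOAD DATA LOCAL INFILE requires local_infile to be enabled on both the client and the server.

```sql
-- Hypothetical sketch: create the source table in the RDS for MySQL database.
-- Column names mirror the target MaxCompute table trend_data.
CREATE TABLE trending (
    uid BIGINT,
    trend BIGINT
);

-- Load the CSV; the file path and the header row are assumptions.
LOAD DATA LOCAL INFILE '/path/to/trend_data.csv'
INTO TABLE trending
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

Repeat the same pattern for the metric and electricity-stealing flag files.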

Add a data source

Add the ApsaraDB RDS for MySQL instance as a DataWorks data source so that synchronization tasks can access it.

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select your workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane, choose Data Sources > Data Sources.

  3. Click Add Data Source in the upper-left corner. In the Add Data Source dialog box, click MySQL.

  4. On the Add MySQL Data Source page, set Configuration Mode to Alibaba Cloud Instance Mode and configure the remaining parameters.

    • Applicable environment: The environment where the data source is used. Valid values: Development Environment and Production Environment.
    • Configuration Mode: Select Alibaba Cloud Instance Mode.
    • Belongs to Cloud Account: Select Current Cloud Account.
    • Region: The region where the data source resides.
    • Instance: Select the ApsaraDB RDS for MySQL instance you created. After selecting, click Obtain Latest Address to view the instance details. If no instance appears, create one in the ApsaraDB RDS console.
    • Database Name, Username, and Password: The default database name and credentials used to connect. Do not use at signs (@) in the password. If you configure a batch synchronization task with multiple databases, add a separate data source for each database.
    • Authentication Options: Select No Authentication.
    • Backup Settings: If the instance is a primary instance with one or more secondary read-only instances, enable Backup Settings and select the secondary instance ID. This reduces read pressure on the primary instance. If the primary instance has multiple secondary read-only instances, only one of them is read.
    Note

    Enabling Backup Settings requires an exclusive resource group. The task cannot use a serverless resource group.


  5. Test network connectivity. In the Connection Configuration section, click Data Integration and Data Scheduling separately. For each resource group you plan to use, click Test Network Connectivity in the Connection Status column.

    • If the status shows Connected, the resource group can reach the data source.

    • If the status does not show Connected, check your IP address whitelist and VPC configuration. For troubleshooting guidance, see Network connectivity solutions.

    A synchronization task can use only one resource group of each type. Test connectivity for all resource group types your task will use.
  6. Click Complete Creation.

Create a workflow

  1. In the upper-left corner of the Management Center page, click the icon and choose All Products > Data Development And Task Operation > DataStudio.

  2. In DataStudio, right-click Business Flow and select Create Workflow.

  3. In the Create Workflow dialog box, enter a Workflow Name and Description, then click Create.

    The workflow name can contain letters, digits, underscores (_), and periods (.), up to 128 characters.
  4. On the workflow canvas, drag one zero load node and three offline synchronization nodes onto the canvas. Name the zero load node start. The three offline synchronization nodes will synchronize electricity usage trend data, electricity-stealing flag data, and metric data.

  5. Connect the nodes so that start is the ancestor node of all three offline synchronization nodes.

  6. Click Submit.

Configure the start node

  1. Double-click the start node. In the right-side panel, click the Properties tab.

  2. Set the workspace root node as the ancestor node of the start node. The root node is named in the format {WorkspaceName}_root.

    Every node in DataWorks must have an ancestor node and a descendant node. Linking the start node to the workspace root node satisfies this requirement.
  3. Click the Save icon to save.

Create tables

Create three MaxCompute tables to store the synchronized data.

  1. Right-click the workflow and choose Create Table > MaxCompute > Table.

  2. In the Create Table dialog box, configure Engine Instance, Path, and Name, then click Create. Create three tables:

    Table names can be up to 64 characters, must start with a letter, and cannot contain special characters.
    • trend_data: Electricity usage trend data
    • indicators_data: Metric data
    • steal_flag_data: Electricity-stealing flag data
  3. On the configuration tab for each table, click DDL and enter the corresponding CREATE TABLE statement:

    -- Create a table to store electricity usage trend data.
    CREATE TABLE trend_data (
        uid BIGINT,
        trend BIGINT
    )
    PARTITIONED BY (dt STRING);

    -- Create a table to store metric data.
    CREATE TABLE indicators_data (
        uid BIGINT,
        xiansun BIGINT,
        warnindicator BIGINT
    )
    COMMENT '*'
    PARTITIONED BY (ds STRING)
    LIFECYCLE 36000;

    -- Create a table to store electricity-stealing flag data.
    CREATE TABLE steal_flag_data (
        uid BIGINT,
        flag BIGINT
    )
    COMMENT '*'
    PARTITIONED BY (ds STRING)
    LIFECYCLE 36000;
  4. Click Generate Table Schema, then click OK.

  5. In the General section, enter a display name for each table.

  6. Click Commit to Development Environment, then click Commit to Production Environment.
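
Before moving on, you can sanity-check the committed tables from an ad-hoc query. DESC is standard MaxCompute SQL and shows each table's columns and partition key:

```sql
-- Confirm that each table exists and is partitioned as expected.
DESC trend_data;       -- partitioned by dt
DESC indicators_data;  -- partitioned by ds
DESC steal_flag_data;  -- partitioned by ds
```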

Configure the offline synchronization nodes

Configure each offline synchronization node to move data from MySQL to MaxCompute. The steps below use the trend data node as an example; repeat the same configuration for the other two nodes.

  1. Double-click the offline synchronization node to open its configuration tab.

  2. Configure the source:

    • Connection: Select MySQL > workshop.
    • Table: Select trending.
    • Filter: (Optional) A SQL WHERE condition to filter rows. The LIMIT keyword is not supported.
    • Shard Key: (Optional) When set, data is split across parallel threads, which improves synchronization throughput.
  3. Configure the destination:

    • Connection: Select ODPS, then select your MaxCompute data source.
    • Table: Select trend_data.
    • Partition Key Column: Keep the default, dt=${bdp.system.bizdate}.
    • Writing Rule: Select Write with Original Data Deleted (Insert Overwrite).
    • Convert Empty Strings to Null: Select No.
  4. Configure field mappings between the source and destination.

  5. Configure the Channel settings:

    • Expected Maximum Concurrency: The maximum number of parallel threads for reading from the source or writing to the destination.
    • Bandwidth Throttling: Enable throttling and set a maximum transmission rate to limit read pressure on the source. Set the rate based on your source database capacity.
    • Dirty Data Records Allowed: The maximum number of dirty data records allowed.

    Channel configuration

  6. Click the Save icon to save.
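
Conceptually, one run of the sync node with the Insert Overwrite writing rule behaves like the following MaxCompute SQL: the target partition is replaced rather than appended to, so rerunning a task does not duplicate rows. The partition value is an example of what ${bdp.system.bizdate} resolves to, and the source placeholder stands in for the rows read from MySQL.

```sql
-- Conceptual equivalent of one sync run; dt='20190808' is an example
-- value of ${bdp.system.bizdate}.
INSERT OVERWRITE TABLE trend_data PARTITION (dt='20190808')
SELECT uid, trend
FROM <rows_read_from_mysql>;  -- placeholder, not a real table
```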

Commit the workflow

  1. On the workflow canvas, click the Commit icon.

  2. In the Commit dialog box, select the nodes to commit, enter comments in Change description, and select Ignore I/O Inconsistency Alerts.

  3. Click Commit. The Committed successfully message confirms the workflow is committed.

Verify data synchronization

Run an ad-hoc query to confirm that data has been written to each MaxCompute table.

  1. In the left-side navigation pane, click Ad-Hoc Query.

  2. Right-click Ad-Hoc Query and choose Create Node > ODPS SQL.

  3. Run the following SQL statements. Replace <data_timestamp> with the business date of the node run, which is one day before the scheduling date. For example, if the node runs on August 9, 2019, use 20190808.

    -- Check whether data is written to MaxCompute.
    SELECT COUNT(*) FROM trend_data WHERE dt=<data_timestamp>;
    SELECT COUNT(*) FROM indicators_data WHERE ds=<data_timestamp>;
    SELECT COUNT(*) FROM steal_flag_data WHERE ds=<data_timestamp>;

    Each query should return a non-zero row count, confirming that data was synchronized successfully.
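
If you are unsure which business date to use, you can compute it in the same ad-hoc query. DATEADD, GETDATE, and TO_CHAR are MaxCompute built-in functions; this returns yesterday's date in yyyymmdd format, which is what ${bdp.system.bizdate} resolves to for a node that runs today:

```sql
-- Yesterday's date in yyyymmdd format.
SELECT TO_CHAR(DATEADD(GETDATE(), -1, 'dd'), 'yyyymmdd');
```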

What's next

With data synchronized to MaxCompute, you are ready to compute and analyze the collected data. Proceed to the next tutorial to set up data processing tasks.