DataWorks: Synchronize data

Last Updated: Jul 25, 2025

In this tutorial, batch synchronization nodes are used to synchronize the basic user information stored in the MySQL table ods_user_info_d and the user website access logs stored in the Object Storage Service (OSS) object user_log.txt to the StarRocks tables ods_user_info_d_starrocks and ods_raw_log_d_starrocks, respectively. This topic describes how to use the Data Integration service of DataWorks to synchronize data between heterogeneous data sources and complete the data synchronization phase of building a data warehouse.

Prerequisites

The required environments are prepared for data synchronization. For more information, see Requirement analysis.

Step 1: Add data sources

To ensure smooth data processing in subsequent operations, you must add the following data sources to your DataWorks workspace to obtain the raw data provided by the system.

  • MySQL data source: used to obtain the basic user information stored in a MySQL table named ods_user_info_d. In this tutorial, the MySQL data source is named user_behavior_analysis_mysql.

  • HttpFile data source: used to obtain the user website access logs stored in an OSS object named user_log.txt. In this tutorial, the OSS data source is named user_behavior_analysis_httpfile.

Add a MySQL data source named user_behavior_analysis_mysql

The basic user information provided by this tutorial is stored in a MySQL database. You must add the MySQL database to your DataWorks workspace as a data source. This way, you can synchronize the basic user information from the MySQL table ods_user_info_d in the data source to StarRocks.

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Data Sources.

Add an HttpFile data source named user_behavior_analysis_httpfile

The user website access records provided by this tutorial are stored in OSS. You must add an HttpFile data source to your DataWorks workspace. This way, you can synchronize the user website access records from the OSS object user_log.txt in the data source to StarRocks.

  1. Go to the SettingCenter page of Management Center. In the left-side navigation pane, click Data Sources.

  2. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, search for and click HttpFile.

  3. On the Add HttpFile Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured as described below.

    The following list describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.

    • Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_httpfile is entered.

    • Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.

    • URL: Enter a URL. In this tutorial, the URL for both the development environment and the production environment is set to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.

  4. In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.

    Important
    • You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare an environment.

    • If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.

  5. Click Complete Creation.

Step 2: Establish a synchronization link

  1. In the upper-left corner of the DataWorks console, click the icon and choose All Products > Data Development And Task Operation > DataStudio. In the top navigation bar of the Data Studio page, switch to the workspace created for this tutorial.

  2. In the Workspace Directories section, click the icon and select Create Workflow. In the Create Workflow dialog box, specify a name for the workflow. In this tutorial, user_profile_analysis_starrocks is specified.

  3. On the configuration tab of the workflow, drag Zero Load Node, StarRocks, and Batch Synchronization to the canvas and specify names for the nodes.

    The following list describes the node names that are used in this tutorial and the functionalities of the nodes.

    • Zero load node workshop_start_starrocks: This node is used to manage the entire user profile analysis workflow and clarify the data forwarding path. This node is a dry-run node. You do not need to edit code for the node.

    • StarRocks node ddl_ods_user_info_d_starrocks: This node needs to be created before you create a batch synchronization node. This node is used to create the StarRocks table ods_user_info_d_starrocks for receiving the basic user information in a MySQL table.

    • StarRocks node ddl_ods_raw_log_d_starrocks: This node needs to be created before you create a batch synchronization node. This node is used to create the StarRocks table ods_raw_log_d_starrocks for receiving the user website access logs in an OSS object.

    • Batch synchronization node ods_user_info_d_starrocks: This node is used to synchronize the basic user information stored in MySQL to the StarRocks table ods_user_info_d_starrocks.

    • Batch synchronization node ods_raw_log_d_starrocks: This node is used to synchronize the user website access logs stored in OSS to the StarRocks table ods_raw_log_d_starrocks.

  4. Draw lines to configure the workshop_start_starrocks node as the ancestor node of the StarRocks nodes and batch synchronization nodes.

  5. Configure scheduling properties for the workflow.

    In the right-side navigation pane of the configuration tab of the workflow, click Properties. On the Properties tab, configure the parameters. The following list describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.

    • Scheduling Parameters: You can configure scheduling parameters for the workflow. The inner nodes of the workflow can directly use the configured scheduling parameters. In this tutorial, bizdate=$[yyyymmdd-1] is configured to obtain the date of the previous day. A resolved example follows this list.

    • Scheduling Cycle: In this tutorial, this parameter is set to Day.

    • Scheduling Time: In this tutorial, this parameter is set to 00:30. This way, the workflow starts to run at 00:30 each day.

    • Scheduling Dependencies: The workflow does not need to depend on another node. You can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node of the workflow. The name of the root node is in the Workspace name_root format.
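
    As a concrete illustration (a hypothetical query, not part of the tutorial configuration): if a workflow instance is scheduled to run on February 23, 2025, the expression $[yyyymmdd-1] resolves to 20250222. A node statement that references the parameter is rewritten before it runs, for example against the ods_user_info_d_starrocks table that is created later in this tutorial:

    -- Statement as written in a node, assuming the parameter is named bizdate:
    SELECT * FROM ods_user_info_d_starrocks WHERE dt = '${bizdate}';
    -- Statement that actually runs on February 23, 2025:
    SELECT * FROM ods_user_info_d_starrocks WHERE dt = '20250222';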

Step 3: Configure the inner nodes of the workflow

Configure the workshop_start_starrocks node

  1. In the canvas of the workflow, move the pointer over the workshop_start_starrocks node and click Open Node.

  2. In the right-side navigation pane of the configuration tab of the workshop_start_starrocks node, click Properties. On the Properties tab, configure the parameters. The following list describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.

    • Scheduling Type: In this tutorial, this parameter is set to Dry-run.

    • Resource Group For Scheduling: In this tutorial, this parameter is set to the name of the serverless resource group that you create when you prepare environments. For more information, see Prepare environments.

    • Scheduling Dependencies: workshop_start_starrocks is the start node of the workflow and does not need to depend on another node. You can click Use Workspace Root Node to use the root node of the workspace to trigger the workflow. The name of the root node is in the Workspace name_root format.

Create a table named ods_user_info_d_starrocks

Before you synchronize data to StarRocks, you need to create a table named ods_user_info_d_starrocks to receive the basic user information synchronized from the MySQL data source. You can execute the following statement in the ddl_ods_user_info_d_starrocks node to create the table. You can also manually create the table in a data catalog. For more information about data catalogs, see Data catalogs.

  1. In the canvas of the workflow, move the pointer over the ddl_ods_user_info_d_starrocks node and click Open Node.

  2. In the code editor of the configuration tab of the ddl_ods_user_info_d_starrocks node, write the following table creation statement:

    CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
        uid STRING COMMENT 'The user ID',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        dt STRING NOT NULL COMMENT 'The time'
    )
    DUPLICATE KEY(uid)
    COMMENT 'User behavior analysis case - table that stores basic user information'
    PARTITION BY(dt)
    PROPERTIES("replication_num" = "1");
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Computing Resource: Select the StarRocks computing resource that is associated with the workspace when you prepare environments.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

  4. Optional. Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  5. In the top toolbar of the configuration tab, click Save.
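
After the workflow runs in Step 4, you can optionally verify that the node created the table. The following statements are a minimal check, not a required part of this tutorial; you can run them in an SQL query file that uses the StarRocks computing resource:

    -- Confirm the table structure and the partitions that exist so far.
    DESC ods_user_info_d_starrocks;
    SHOW PARTITIONS FROM ods_user_info_d_starrocks;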

Create a table named ods_raw_log_d_starrocks

Before you synchronize data to StarRocks, you need to create a table named ods_raw_log_d_starrocks to receive the user website access logs synchronized from the HttpFile data source. You can execute the following statement in the ddl_ods_raw_log_d_starrocks node to create the table. You can also manually create the table in a data catalog. For more information about data catalogs, see Data catalogs.

  1. In the canvas of the workflow, move the pointer over the ddl_ods_raw_log_d_starrocks node and click Open Node.

  2. In the code editor of the configuration tab of the ddl_ods_raw_log_d_starrocks node, write the following table creation statement:

    CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
        col STRING COMMENT 'The log',
        dt DATE NOT NULL COMMENT 'The time'
    )
    DUPLICATE KEY(col)
    COMMENT 'User behavior analysis case - table that stores the user website access logs'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Computing Resource: Select the StarRocks computing resource that is associated with the workspace when you prepare environments.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

  4. Optional. Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  5. In the top toolbar of the configuration tab, click Save.
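
Similarly, after the workflow runs in Step 4, an optional check (not required by this tutorial) confirms that the log table was created with the expected definition:

    -- Display the full table definition, including partitioning and properties.
    SHOW CREATE TABLE ods_raw_log_d_starrocks;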

Configure the ods_user_info_d_starrocks node

  1. In the canvas of the workflow, move the pointer over the ods_user_info_d_starrocks node and click Open Node.

  2. Configure the source, the destination, and the resource group that you want to use.

    • Source:

      • Source: Select MySQL.

      • Data Source Name: Select user_behavior_analysis_mysql.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

    • Destination:

      • Destination: Select StarRocks.

      • Data Source Name: Select doc_starrocks_storage_compute_tightly_01.

  3. Configure details for the batch synchronization node.

    • Configure the source and destination.

      Source:

      • Table: Select the MySQL table ods_user_info_d.

      • Split key: The shard key for the data to be read. We recommend that you use the primary key or an indexed column as the shard key. Only fields of an integer type can be used as the shard key. In this example, the uid field is used as the shard key.

      Destination:

      • Table: Select the StarRocks table ods_user_info_d_starrocks.

      • Statement Run Before Writing: In this example, data is dynamically partitioned by the dt field. To prevent the node from repeatedly writing data when it is rerun, the following SQL statement is used to delete the existing destination partition before each synchronization:

        ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE;

        In the SQL statement, ${var} is a parameter. The scheduling parameter that you configure in the scheduling configuration phase is assigned to the ${var} parameter as a value. In this case, the value of the scheduling parameter is dynamically replaced in the code of the node based on the configuration of the scheduling parameter. A resolved example appears after these steps. For more information, see Configure scheduling properties.

      • StreamLoad Request Parameters: The Stream Load request parameters, which must be in the JSON format.

        {
          "row_delimiter": "\\x02",
          "column_separator": "\\x01"
        }
    • Configure field mappings.

      Configure mappings between source fields and destination fields. In the node code, a scheduling parameter is assigned to a constant field as a value. This way, the value of the dt partition field in the StarRocks table can be dynamically replaced, and the data of each day can be written to the data timestamp-based partition in the StarRocks table.

      1. Click Map Fields with Same Name. The fields in the source MySQL table are automatically mapped to fields that have the same names in the destination table. In this case, data in the source fields is automatically written to the destination fields that have the same names.

      2. Click Add, enter the constant '${var}', and then manually map this constant field to the dt field of the StarRocks table.

    • Configure channel control policies.

      In this tutorial, Policy for Dirty Data Records is set to Disallow Dirty Data Records, and default values are retained for other parameters. For more information, see Configure a batch synchronization task by using the codeless UI.

  4. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

    • Script Parameters: In the Script Parameters section, click Add Parameter. In the row that appears, enter var in the Parameter Name field and a constant value in the yyyymmdd format in the Parameter Value field. Example: var=20250223. When you debug the node, Data Studio replaces the variable that you define for the node with the constant value.

  5. Optional. Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the batch synchronization node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  6. In the top toolbar of the configuration tab, click Save.
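
As referenced in the Statement Run Before Writing description, the following sketch shows how the pre-write statement resolves at run time, assuming that the scheduling parameter assigns var the data timestamp 20250222:

    -- Statement as configured in the node, before parameter replacement:
    ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE;
    -- Statement that actually runs when var resolves to 20250222:
    ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p20250222 FORCE;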

Configure the ods_raw_log_d_starrocks node

  1. In the canvas of the workflow, move the pointer over the ods_raw_log_d_starrocks node and click Open Node.

  2. Configure the source, the destination, and the resource group that you want to use.

    After you finish configuring the source, resource group, and destination, complete the network connectivity test and click Next. The following list describes the parameters that you must configure.

    • Source:

      • Source: Select HttpFile.

      • Data Source Name: Select user_behavior_analysis_httpfile.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

    • Destination:

      • Destination: Select StarRocks.

      • Data Source Name: Select doc_starrocks_storage_compute_tightly_01.

  3. Click Next to configure details for the batch synchronization node.

    • Configure the source and destination.

      Source:

      • File path: Enter /user_log.txt.

      • Text type: Select text.

      • Column Delimiter: Enter |.

      • Skip Header (under Advanced configuration): Select No.

      After you finish configuring the source, click Confirm Data Structure.

      Destination:

      • Table: Select the StarRocks table ods_raw_log_d_starrocks.

      • Statement Run Before Writing: In this example, data is dynamically partitioned by the dt field. To prevent the node from repeatedly writing data when it is rerun, the following SQL statement is used to delete the existing destination partition before each synchronization:

        ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE;

        In the SQL statement, ${var} is a parameter. The scheduling parameter that you configure in the scheduling configuration phase is assigned to the ${var} parameter as a value. In this case, the value of the scheduling parameter is dynamically replaced in the code of the node based on the configuration of the scheduling parameter. For more information, see Configure scheduling properties.

      • StreamLoad Request Parameters: The Stream Load request parameters, which must be in the JSON format.

        {
          "row_delimiter": "\\x02",
          "column_separator": "\\x01"
        }
    • Configure channel control policies.

      In this tutorial, Policy for Dirty Data Records is set to Disallow Dirty Data Records, and default values are retained for other parameters. For more information, see Configure a batch synchronization task by using the codeless UI.

    • Configure field mappings.

      In the top toolbar of the configuration tab of the node, click the icon that converts the node configuration mode from the codeless UI to the code editor. Configure mappings between fields in the HttpFile file and fields in the StarRocks table, and enable the value of the dt partition field in the StarRocks table to be dynamically replaced.

      • Additional configuration for columns in an HttpFile file:

        {
            "type": "STRING",
            "value": "${var}"
        }
      • Complete sample script for the ods_raw_log_d_starrocks node:

        {
            "type": "job",
            "version": "2.0",
            "steps": [
                {
                    "stepType": "httpfile",
                    "parameter": {
                        "fileName": "/user_log.txt",
                        "nullFormat": "",
                        "compress": "",
                        "requestMethod": "GET",
                        "connectTimeoutSeconds": 60,
                        "column": [
                            {
                                "index": 0,
                                "type": "STRING"
                            },
                            {
                                "type": "STRING",
                                "value": "${var}"
                            }
                        ],
                        "skipHeader": "false",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "|",
                        "fieldDelimiterOrigin": "|",
                        "socketTimeoutSeconds": 3600,
                        "envType": 0,
                        "datasource": "user_behavior_analysis",
                        "bufferByteSizeInKB": 1024,
                        "fileFormat": "text"
                    },
                    "name": "Reader",
                    "category": "reader"
                },
                {
                    "stepType": "starrocks",
                    "parameter": {
                        "loadProps": {
                            "row_delimiter": "\\x02",
                            "column_separator": "\\x01"
                        },
                        "envType": 0,
                        "datasource": "Doc_StarRocks_Storage_Compute_Tightly_01",
                        "column": [
                            "col",
                            "dt"
                        ],
                        "tableComment": "",
                        "table": "ods_raw_log_d_starrocks",
                        "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS  p${var} FORCE ; "
                    },
                    "name": "Writer",
                    "category": "writer"
                },
                {
                    "copies": 1,
                    "parameter": {
                        "nodes": [],
                        "edges": [],
                        "groups": [],
                        "version": "2.0"
                    },
                    "name": "Processor",
                    "category": "processor"
                }
            ],
            "setting": {
                "errorLimit": {
                    "record": "0"
                },
                "locale": "zh",
                "speed": {
                    "throttle": false,
                    "concurrent": 2
                }
            },
            "order": {
                "hops": [
                    {
                        "from": "Reader",
                        "to": "Writer"
                    }
                ]
            }
        }
  4. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Resource Group: Select the serverless resource group that you create when you prepare environments.

    • Script Parameters: In the Script Parameters section, click Add Parameter. In the row that appears, enter var in the Parameter Name field and a constant value in the yyyymmdd format in the Parameter Value field. Example: var=20250223. When you debug the node, Data Studio replaces the variable that you define for the node with the constant value.

  5. Optional. Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the batch synchronization node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  6. In the top toolbar of the configuration tab, click Save.

Step 4: Run the nodes

  1. Synchronize data.

    In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify a value that is used for scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.

  2. Query results.

    1. Go to the SQL Query page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis. On the page that appears, click Go to DataAnalysis. In the left-side navigation pane of the page that appears, click SQL Query.

    2. Configure an SQL query file.

      1. In the SQL Query pane, click the icon next to My Files and select Create File. In the Create File dialog box, enter a name in the File Name field and click OK.

      2. In the left-side navigation tree, find the created SQL query file and click the file name to go to the configuration tab of the file.

      3. In the upper-right corner of the configuration tab of the SQL query file, click the icon. In the popover that appears, configure the following parameters.

        • Workspace: Select the workspace to which the user_profile_analysis_starrocks workflow belongs.

        • Data Source Type: Select StarRocks from the drop-down list.

        • Data Source Name: Select the StarRocks computing resource that is associated with the workspace when you prepare environments.

      4. Click OK.

    3. In the code editor of the configuration tab, write data query statements.

      After all nodes in this topic are successfully run, you can write the following statements to check whether the external tables created for the StarRocks nodes are generated as expected:

      -- In the following query statements, replace <data timestamp> with the data timestamp of the nodes. For example, if a node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.
      SELECT * FROM ods_raw_log_d_starrocks WHERE dt = '<data timestamp>';
      SELECT * FROM ods_user_info_d_starrocks WHERE dt = '<data timestamp>';
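
      You can also run an optional sanity check that is not part of the tutorial. Assuming the tables contain only the data synchronized in this run, the following queries return the number of rows written to each partition:

      SELECT dt, COUNT(*) FROM ods_raw_log_d_starrocks GROUP BY dt;
      SELECT dt, COUNT(*) FROM ods_user_info_d_starrocks GROUP BY dt;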

What to do next

After you complete data synchronization based on this tutorial, you can proceed to the next tutorial to learn how to compute and analyze the synchronized data.