DataWorks: Synchronize data

Last Updated: Feb 14, 2026

This tutorial shows you how to use a Data Integration Batch Synchronization task to synchronize data from a MySQL table (ods_user_info_d) and an Object Storage Service (OSS) log file (user_log.txt) into StarRocks tables (ods_user_info_d_starrocks and ods_raw_log_d_starrocks).

Prerequisites

Before you begin, ensure you have prepared the required environment. For details, see Prepare the environment.

1. Create data sources

To ensure that data can be processed in subsequent steps, you must add the following data sources to your DataWorks workspace to retrieve the raw data.

  • MySQL data source: This tutorial uses a data source named user_behavior_analysis_mysql to retrieve basic user information (ods_user_info_d) from MySQL.

  • HttpFile data source: In this tutorial, the data source is named user_behavior_analysis_httpfile and is used to retrieve user website access logs (user_log.txt) stored in OSS.

Create a MySQL data source (user_behavior_analysis_mysql)

This tutorial's user information is in a MySQL database. Create a MySQL Data Source to synchronize this data (ods_user_info_d) to StarRocks.

  1. Go to the Data Source page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the navigation pane on the left, click Data Sources to go to the Data Sources page.

  2. Click Add Data Source. Search for and select MySQL as the data source type.

  3. On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes the key parameters. You can keep the default values for the other parameters.

    • Data Source Name: Enter a name for the data source. For this tutorial, enter user_behavior_analysis_mysql.

    • Data Source Description: This data source is used for DataWorks tutorials. Read data from it when you configure a batch synchronization task to access the test data provided by the platform. The data source can be read only in Data Integration scenarios; other modules cannot use it.

    • Configuration Mode: Select Connection String Mode.

    • Connection Address:

      • Host IP address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com

      • Port: 3306

    • Database Name: Enter the database name. For this tutorial, enter workshop.

    • Username: Enter the username. For this tutorial, enter workshop.

    • Password: Enter the password. For this tutorial, enter workshop#2017.

    • Authentication Method: Select No Authentication.

  4. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If no resource group is available, follow the prompts in the connection configuration section to click Purchase or Associate Purchased Resource Group.

  5. Click Complete Creation.

Create an HttpFile data source (user_behavior_analysis_httpfile)

The user website access logs for this tutorial are stored in OSS. You need to create an HttpFile Data Source to synchronize the log data (user_log.txt) to StarRocks.

  1. Click Data Sources in the navigation pane on the left.

  2. Click Add Data Source. In the Add Data Source dialog box, search for and select HttpFile as the data source type.

  3. On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes the key parameters. You can keep the default values for the other parameters.

    • Data Source Name: Enter the data source name. For this tutorial, enter user_behavior_analysis_httpfile.

    • Data Source Description: This data source is used for DataWorks tutorials. Read data from it when you configure a batch synchronization task to access the test data provided by the platform. The data source can be read only in Data Integration scenarios; other modules cannot use it.

    • URL: Set the URL to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com for both the development and production environments.

  4. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If no resource group is available, follow the prompts in the connection configuration section to click Purchase or Associate Purchased Resource Group.

  5. Click Complete Creation.

2. Build the synchronization pipeline

  1. Click the icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Then, at the top of the page, switch to the workspace created for this tutorial.

  2. In the left-side navigation pane, open the Workspace Directories section, click the create icon, select Create Workflow, and name the workflow. For this tutorial, name it user_profile_analysis_starrocks.

  3. On the Workflow canvas, create the nodes listed below. You can find components in the left-side pane and drag them onto the canvas. When you create a Batch Synchronization node, specify its type:

    • Source Type: MySQL (for ods_user_info_d_starrocks) or HttpFile (for ods_raw_log_d_starrocks).

    • Destination Type: StarRocks.

    • Specific Type: Batch Synchronization.

    • Zero load node (workshop_start_starrocks): Manages the entire user persona analysis workflow and clarifies the data forwarding path. This node is a dry-run node and requires no code editing.

    • StarRocks node (ddl_ods_user_info_d_starrocks): Creates the destination StarRocks table ods_user_info_d_starrocks to receive user data from the MySQL data source.

    • StarRocks node (ddl_ods_raw_log_d_starrocks): Creates the destination StarRocks table ods_raw_log_d_starrocks to receive the website access logs from the OSS source.

    • Batch synchronization node (ods_user_info_d_starrocks): Synchronizes user information data from MySQL to the StarRocks table ods_user_info_d_starrocks.

    • Batch synchronization node (ods_raw_log_d_starrocks): Synchronizes user website access logs from OSS to the StarRocks table ods_raw_log_d_starrocks.

  4. Connect the nodes, with workshop_start_starrocks as the Ancestor Node for both Batch Synchronization nodes.

  5. Configure the workflow scheduling properties.

    On the workflow canvas, click Scheduling in the right-side pane and configure the parameters. The following list describes the key parameters. You can keep the default values for the other parameters.

    • Scheduling Parameters: You can configure scheduling parameters for the entire workflow, and the inner nodes of the workflow can use them directly. In this tutorial, the parameter is set to bizdate=$[yyyymmdd-1] to obtain the date of the previous day (see the example after this list).

    • Scheduling Cycle: Set to Day for this tutorial.

    • Scheduling Time: Set to 00:30 for this tutorial, which means the workflow starts daily at 00:30.

    • Scheduling Dependencies: The workflow has no upstream dependencies, so you do not need to configure any. For easier management, click Use Workspace Root Node to attach the workflow to the root node of the workspace. The naming format for the workspace root node is workspace_name_root.
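
    For example, for an instance scheduled on February 23, 2025, bizdate=$[yyyymmdd-1] resolves to 20250222, and inner nodes can reference the value as ${bizdate}. The following query is only an illustrative sketch of how such a parameter could be referenced in node code; it is not a required step in this tutorial.

    -- Illustration only: at run time, DataWorks replaces ${bizdate} with the
    -- resolved scheduling parameter value, for example 20250222.
    SELECT * FROM ods_user_info_d_starrocks WHERE dt = '${bizdate}';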

3. Configure synchronization tasks

Configure the initial node

  1. On the Workflow canvas, hover over the workshop_start_starrocks node and click Open Node.

  2. On the right side of the workshop_start_starrocks node configuration page, click Properties and configure the parameters. The following list describes the key parameters. For parameters not listed, you can retain the default settings.

    • Scheduling Type: For this tutorial, select Dry-run.

    • Resource Group For Scheduling: For this tutorial, select the serverless resource group created in the Prepare the environment step.

    • Scheduling Dependencies: Because workshop_start_starrocks is the initial node and has no upstream dependencies, you can click Use Workspace Root Node. This allows the workspace root node to trigger the workflow. The workspace root node is named in the format WorkspaceName_root.

Create the user table ddl_ods_user_info_d_starrocks

First, create the ods_user_info_d_starrocks table in StarRocks to receive user data from the MySQL data source. You can create the table within this node or manually create it in the Data Catalog. A quick verification query is shown after the steps below.

  1. On the Workflow canvas, hover over the ddl_ods_user_info_d_starrocks node and click Open Node.

  2. Edit the table creation statement.

    CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
        uid STRING COMMENT 'User ID',
        gender STRING COMMENT 'Gender',
        age_range STRING COMMENT 'Age range',
        zodiac STRING COMMENT 'Zodiac sign',
        dt STRING not null COMMENT 'Date'
    ) 
    DUPLICATE KEY(uid)
    COMMENT 'User profile analysis use case - User information table'
    PARTITION BY(dt) 
    PROPERTIES("replication_num" = "1");
  3. Configure debugging parameters.

    On the right side of the StarRocks node configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.

    • Computing Resource: Select the StarRocks compute resource that you bound in the Prepare the environment step.

    • Resource Group: Select the serverless resource group that you purchased in the Prepare the environment step.

  4. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.

    • Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.

    • Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.

  5. In the top toolbar, click Save to save the node.
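
After the node has run (in a test run or in Step 4), you can optionally verify the table in SQL Query. This is only a suggested check, assuming the StarRocks compute resource from the Prepare the environment step is selected.

    -- Optional verification (not a required tutorial step):
    DESC ods_user_info_d_starrocks;
    SHOW PARTITIONS FROM ods_user_info_d_starrocks;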

Create the log table ddl_ods_raw_log_d_starrocks

First, create the ods_raw_log_d_starrocks table in StarRocks to receive the user website access logs from the HttpFile data source. You can create the table within this node or manually create it in the Data Catalog. A quick verification query is shown after the steps below.

  1. On the Workflow canvas, hover over the ddl_ods_raw_log_d_starrocks node and click Open Node.

  2. Edit the table creation statement.

    CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
        col STRING COMMENT 'Log',
        dt DATE  not null COMMENT 'Date'
    ) DUPLICATE KEY(col) 
    COMMENT 'User profile analysis use case - Raw website access log table' 
    PARTITION BY(dt) 
    PROPERTIES ("replication_num" = "1");
  3. Configure debugging parameters.

    On the right side of the StarRocks node configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.

    • Computing Resource: Select the StarRocks compute resource that you bound in the Prepare the environment step.

    • Resource Group: Select the serverless resource group that you purchased in the Prepare the environment step.

  4. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.

    • Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.

    • Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.

  5. In the top toolbar, click Save to save the node.
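
After the node has run, you can optionally verify the log table in SQL Query. This is only a suggested check, assuming the StarRocks compute resource from the Prepare the environment step is selected.

    -- Optional verification (not a required tutorial step):
    DESC ods_raw_log_d_starrocks;
    SHOW PARTITIONS FROM ods_raw_log_d_starrocks;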

Configure the user data synchronization pipeline (ods_user_info_d_starrocks)

  1. On the Workflow canvas, hover over the ods_user_info_d_starrocks node and click Open Node.

  2. Configure the network and resources for the Data Synchronization pipeline.

    • Source:

      • Source: MySQL

      • Data Source name: user_behavior_analysis_mysql

    • Resource Group: Select the serverless resource group that you created in the Prepare the environment step.

    • Destination:

      • Destination: StarRocks

      • Data Source name: doc_starrocks_storage_compute_tightly_01

  3. Configure the task.

    • Configure the source and destination.

      Source:

      • Table: Select the MySQL table ods_user_info_d.

      • Split Key: We recommend using a primary key or an indexed column as the split key. Only integer-type fields are supported. For this tutorial, use the uid field as the Split Key.

      Destination:

      • Table: Select the StarRocks table ods_user_info_d_starrocks.

      • Statement Run Before Writing: This use case uses dynamic partitions based on the dt field. To prevent data duplication when the node is rerun, the following SQL statement deletes the target partition before each synchronization:

        ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE

        ${var} is a variable that is assigned a value from the scheduling parameters during the scheduling configuration phase, which enables dynamic parameter input in scheduled scenarios. For more information, see Configure scheduling. A worked example of the substituted statement appears at the end of this section.

      • StreamLoad Request Parameters: The request parameters for StreamLoad, which must be in JSON format:

        {
          "row_delimiter": "\\x02",
          "column_separator": "\\x01"
        }

    • Field Mapping.

      Configure the Field Mapping to define how source fields are written to destination fields. You will also use a scheduling parameter variable to dynamically populate the StarRocks partition field, which ensures that data for each day is written to the corresponding partition in StarRocks.

      1. Click Map Fields with Same Name to automatically map source fields to destination fields with matching names.

      2. Click Add and enter '${var}'. Manually map this field to the dt field in StarRocks.

    • Channel Control.

      For this tutorial, set Policy for Dirty Data to Disallow Dirty Data and keep the default values for other settings. For more information, see Configure a task in wizard mode.

  4. Configure debugging parameters.

    On the right side of the Batch Synchronization task configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.

    • Resource Group: Select the serverless resource group that you purchased in the Prepare the environment step.

    • Script Parameters: Click Add Parameter and set it to a specific constant in yyyymmdd format, such as var=20250223. During debugging, DataStudio will replace the variable defined in the task with this constant.

  5. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.

    • Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.

    • Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.

  6. In the top toolbar, click Save to save the node.
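
To make the parameter substitution concrete, here is a worked example assuming the debugging constant var=20250223 from the run configuration above; the values are illustrative only.

    -- With var=20250223, the statement run before writing becomes:
    ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p20250223 FORCE;
    -- The constant field '${var}' mapped to dt likewise writes 20250223 into every synchronized row.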

Configure the log data synchronization pipeline (ods_raw_log_d_starrocks)

  1. On the Workflow canvas, hover over the ods_raw_log_d_starrocks node and click Open Node.

  2. Configure the network and resources for the Data Synchronization pipeline.

    After you configure the Source, Resource Group, and Destination, click Next and complete the connectivity test as prompted. The following list provides the configuration details.

    • Source:

      • Source: HttpFile

      • Data Source name: user_behavior_analysis_httpfile

    • Resource Group: Select the serverless resource group that you created in the Prepare the environment step.

    • Destination:

      • Destination: StarRocks

      • Data Source name: doc_starrocks_storage_compute_tightly_01

  3. Click Next to configure the Data Synchronization task.

    • Configure the source and destination.

      Source:

      • File Path: /user_log.txt

      • Text Type: text

      • Column Delimiter: |

      • Advanced Configuration > Skip Header: No

      After configuring the source, click Confirm Data Structure.

      Destination:

      • Table: Select the StarRocks table ods_raw_log_d_starrocks.

      • Statement Run Before Writing: This use case uses dynamic partitions based on the dt field. To prevent data duplication when the node is rerun, the following SQL statement deletes the target partition before each synchronization:

        ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE

        ${var} is a variable that is assigned a value from the scheduling parameters during the scheduling configuration phase, which enables dynamic parameter input in scheduled scenarios. For more information, see Configure scheduling. A worked example of the substituted statement appears at the end of this section.

      • StreamLoad Request Parameters: The request parameters for StreamLoad, which must be in JSON format:

        {
          "row_delimiter": "\\x02",
          "column_separator": "\\x01"
        }

    • Channel Control.

      For this tutorial, set Policy for Dirty Data to Disallow Dirty Data and keep the default values for other settings. For more information, see Configure a task in wizard mode.

    • Field Mapping.

      In the toolbar, switch from Wizard Mode to Script Mode. This lets you configure the Field Mapping for the HttpFile Data Source and dynamically assign a value to the dt partition field in StarRocks.

      • In the source HttpFile configuration, add the following to the Column section:

        {
            "type": "STRING",
            "value": "${var}"
        }
      • The following is the complete script for the ods_raw_log_d_starrocks node:

        {
            "type": "job",
            "version": "2.0",
            "steps": [
                {
                    "stepType": "httpfile",
                    "parameter": {
                        "fileName": "/user_log.txt",
                        "nullFormat": "",
                        "compress": "",
                        "requestMethod": "GET",
                        "connectTimeoutSeconds": 60,
                        "column": [
                            {
                                "index": 0,
                                "type": "STRING"
                            },
                            {
                                "type": "STRING",
                                "value": "${var}"
                            }
                        ],
                        "skipHeader": "false",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "|",
                        "fieldDelimiterOrigin": "|",
                        "socketTimeoutSeconds": 3600,
                        "envType": 0,
                        "datasource": "user_behavior_analysis_httpfile",
                        "bufferByteSizeInKB": 1024,
                        "fileFormat": "text"
                    },
                    "name": "Reader",
                    "category": "reader"
                },
                {
                    "stepType": "starrocks",
                    "parameter": {
                        "loadProps": {
                            "row_delimiter": "\\x02",
                            "column_separator": "\\x01"
                        },
                        "envType": 0,
                        "datasource": "doc_starrocks_storage_compute_tightly_01",
                        "column": [
                            "col",
                            "dt"
                        ],
                        "tableComment": "",
                        "table": "ods_raw_log_d_starrocks",
                        "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS  p${var} FORCE ; "
                    },
                    "name": "Writer",
                    "category": "writer"
                },
                {
                    "copies": 1,
                    "parameter": {
                        "nodes": [],
                        "edges": [],
                        "groups": [],
                        "version": "2.0"
                    },
                    "name": "Processor",
                    "category": "processor"
                }
            ],
            "setting": {
                "errorLimit": {
                    "record": "0"
                },
                "locale": "zh",
                "speed": {
                    "throttle": false,
                    "concurrent": 2
                }
            },
            "order": {
                "hops": [
                    {
                        "from": "Reader",
                        "to": "Writer"
                    }
                ]
            }
        }
  4. Configure debugging parameters.

    On the right side of the Batch Synchronization task configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.

    • Resource Group: Select the serverless resource group that you purchased in the Prepare the environment step.

    • Script Parameters: Click Add Parameter and set it to a specific constant in yyyymmdd format, such as var=20250223. During debugging, DataStudio will replace the variable defined in the task with this constant.

  5. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.

    • Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.

    • Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.

  6. In the top toolbar, click Save to save the node.
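
As in the user data pipeline, here is a worked example of the substituted statement, assuming the debugging constant var=20250223; the values are illustrative only.

    -- With var=20250223, the statement run before writing becomes:
    ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p20250223 FORCE;
    -- The constant column "${var}" added in script mode writes 20250223 into the dt field of every row.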

4. Run tasks

  1. Sync the data.

    In the workflow toolbar, click Run. Set the values for the parameter variables that are defined in each node for this run. This tutorial uses 20250223, but you can modify the value as needed. Then, click OK and wait for the run to complete.

  2. Query the results.

    1. Go to the SQL Query page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis. On the page that appears, click Go to DataAnalysis. In the left-side navigation pane of the page that appears, click SQL Query.

    2. Configure an SQL query file.

      1. Click the icon next to My Files to create a new file, and then enter a name for the SQL query file.

      2. Click the newly created file to open the editor.

      3. In the upper-right corner of the editor, click the icon to configure the workspace and other settings for the query. The following list provides the configuration details.

        • Workspace: Select the workspace where the user_profile_analysis_starrocks workflow is located.

        • Data Source Type: Select StarRocks from the drop-down list.

        • Data Source Name: Select the StarRocks development environment data source that you bound in the Prepare the environment step.

      4. Click OK to finish configuring the query data source.

    3. Edit the query SQL.

      After the nodes run successfully, execute the following queries to verify that the data was written to the StarRocks tables.

      -- Update `your_business_date` with the correct Data Timestamp. For a task run on 20250223, the Data Timestamp is 20250222 (the previous day).
      SELECT * FROM ods_raw_log_d_starrocks WHERE dt=your_business_date; 
      SELECT * FROM ods_user_info_d_starrocks WHERE dt=your_business_date; 

Next steps

Now that you have synchronized the data, proceed to the next tutorial to learn how to process and analyze it: Process data.