All Products
Search
Document Center

DataWorks:Synchronize data

Last Updated:Mar 26, 2026

This tutorial walks you through synchronizing raw data into StarRocks using two Data Integration Batch Synchronization tasks: one pulls user records from a MySQL table (ods_user_info_d) and another pulls website access logs from an Object Storage Service (OSS) file via HttpFile (user_log.txt). The data lands in two StarRocks tables: ods_user_info_d_starrocks and ods_raw_log_d_starrocks.

Prerequisites

Before you begin, ensure that you have:

  • Completed the environment setup described in Prepare the environment

  • A serverless resource group attached to your workspace with public network access enabled

  • A StarRocks compute resource bound to your workspace

Step 1: Create data sources

Create two data sources in your DataWorks workspace so that the Batch Synchronization tasks can reach the raw data:

  • MySQL data source (user_behavior_analysis_mysql): Provides user profile records from MySQL.

  • HttpFile data source (user_behavior_analysis_httpfile): Provides website access logs stored in OSS.

Create the MySQL data source

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center, select your workspace, and click Go to Management Center.

  2. In the left-side navigation pane, click Data Sources.

  3. Click Add Data Source, then search for and select MySQL.

  4. On the Add MySQL Data Source page, configure the following parameters. Use the same values for both the development and production environments.

    ParameterValue
    Data source nameuser_behavior_analysis_mysql
    Data source descriptionThis data source is for DataWorks tutorials. Read data from this data source when you configure a Batch Synchronization task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it.
    Configuration modeConnection String Mode
    Connection address (host)rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com
    Connection address (port)3306
    Database nameworkshop
    Usernameworkshop
    Passwordworkshop#2017
    Authentication methodNo authentication
  5. In the Connection Configuration section, click Test Network Connectivity for both environments and confirm the status shows Connected.

    Important

    Make sure the resource group is attached to your workspace and has public network access enabled. If no resource group is available, click Purchase and then Associated Purchased Resource Group in the connection configuration section. For details, see Prepare the environment.

  6. Click Complete Creation.

Create the HttpFile data source

The website access logs are stored in OSS. Create an HttpFile data source to let the Batch Synchronization task read the log file.

  1. In the left-side navigation pane, click Data Sources.

  2. Click Add Data Source, then search for and select HttpFile.

  3. On the Add HttpFile Data Source page, configure the following parameters. Use the same values for both environments.

    ParameterValue
    Data source nameuser_behavior_analysis_httpfile
    Data source descriptionThis data source is for DataWorks tutorials. Read data from this data source when you configure a Batch Synchronization task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it.
    URLhttps://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com
  4. In the Connection Configuration section, click Test Network Connectivity for both environments and confirm the status shows Connected.

    Important

    Make sure the resource group is attached to your workspace and has public network access enabled. If no resource group is available, click Purchase and then Associated Purchased Resource Group in the connection configuration section. For details, see Prepare the environment.

  5. Click Complete Creation.

Step 2: Build the synchronization pipeline

Create a Workflow

  1. Click the icon icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Switch to the workspace created for this tutorial.

  2. In the left-side navigation pane, click the image icon. In the Workspace Directories section, click the image icon, select Create Workflow, and name it user_profile_analysis_starrocks.

  3. On the Workflow canvas, add the following nodes. Find each component in the left-side pane and drag it onto the canvas. When prompted for source and destination types, set Source Type to MySQL, Destination Type to OSS, and Specific Type to Batch Synchronization.

    Node typeNode nameRole
    image Zero load nodeworkshop_start_starrocksAncestor node that manages the Workflow and clarifies the data forwarding path. Configured as a Dry-run node; no code editing required.
    image StarRocks nodeddl_ods_user_info_d_starrocksCreates the StarRocks table ods_user_info_d_starrocks for user data.
    image StarRocks nodeddl_ods_raw_log_d_starrocksCreates the StarRocks table ods_raw_log_d_starrocks for website access logs.
    image Batch synchronization nodeods_user_info_d_starrocksSyncs user data from MySQL into ods_user_info_d_starrocks.
    image Batch synchronization nodeods_raw_log_d_starrocksSyncs website access logs from OSS into ods_raw_log_d_starrocks.
  4. Connect the nodes so that workshop_start_starrocks is the ancestor node for both Batch Synchronization nodes. The completed Workflow should look like this:

    image

Configure Workflow scheduling

On the Workflow canvas, click Scheduling in the right-side pane and set the following parameters.

ParameterValue
Scheduling parametersbizdate=$[yyyymmdd-1] — passes the previous day's date to all nodes in the Workflow.
Scheduling cycleDay
Scheduling time00:30
Scheduling dependenciesClick Use Workspace Root Node to attach the Workflow to the workspace root node (named workspace_name_root).

Step 3: Configure synchronization tasks

Configure the initial node

  1. On the Workflow canvas, hover over workshop_start_starrocks and click Open Node.

  2. Click Properties in the right-side pane and configure the following parameters.

    ParameterValue
    Scheduling typeDry-run
    Resource group for schedulingSelect the serverless resource group from Prepare the environment.
    Scheduling dependenciesClick Use Workspace Root Node so the workspace root node triggers this Workflow. The root node follows the format WorkspaceName_root.

Create the user table (ddl_ods_user_info_d_starrocks)

Create the ods_user_info_d_starrocks table in StarRocks to receive user data from the MySQL data source. You can also create the table manually in the Data Catalog.

  1. On the Workflow canvas, hover over ddl_ods_user_info_d_starrocks and click Open Node.

  2. Enter the following table creation statement:

    CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
        uid STRING COMMENT 'User ID',
        gender STRING COMMENT 'Gender',
        age_range STRING COMMENT 'Age range',
        zodiac STRING COMMENT 'Zodiac sign',
        dt STRING not null COMMENT 'Date'
    )
    DUPLICATE KEY(uid)
    COMMENT 'User profile analysis use case - User information table'
    PARTITION BY(dt)
    PROPERTIES("replication_num" = "1");
  3. Click Run Configuration in the right-side pane and set the following parameters for the test run.

    ParameterValue
    Computing resourceSelect the StarRocks compute resource bound in Prepare the environment.
    Resource groupSelect the serverless resource group from Prepare the environment.
  4. (Optional) Click Properties to review scheduling settings. Scheduling parameters are already configured at the Workflow level, so no node-level changes are needed. For details, see Configure node scheduling properties.

  5. In the top toolbar, click Save.

Create the log table (ddl_ods_raw_log_d_starrocks)

Create the ods_raw_log_d_starrocks table in StarRocks to receive website access logs from the HttpFile data source. You can also create the table manually in the Data Catalog.

  1. On the Workflow canvas, hover over ddl_ods_raw_log_d_starrocks and click Open Node.

  2. Enter the following table creation statement:

    CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
        col STRING COMMENT 'Log',
        dt DATE  not null COMMENT 'Date'
    ) DUPLICATE KEY(col)
    COMMENT 'User profile analysis use case - Raw website access log table'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");
  3. Click Run Configuration in the right-side pane and set the following parameters for the test run.

    ParameterValue
    Computing resourceSelect the StarRocks compute resource bound in Prepare the environment.
    Resource groupSelect the serverless resource group from Prepare the environment.
  4. (Optional) Click Properties to review scheduling settings. Scheduling parameters are already configured at the Workflow level. For details, see Configure node scheduling properties.

  5. In the top toolbar, click Save.

Configure the user data sync task (ods_user_info_d_starrocks)

  1. On the Workflow canvas, hover over ods_user_info_d_starrocks and click Open Node.

  2. Set the source, resource group, and destination:

    ParameterValue
    SourceMySQL — data source user_behavior_analysis_mysql
    Resource groupSelect the serverless resource group from Prepare the environment.
    DestinationStarRocks — data source doc_starrocks_storage_compute_tightly_01
  3. Configure the source and destination settings:

    ModuleParameterValue
    SourceTableods_user_info_d
    SourceSplit keyuid — use a primary key or indexed integer column as the split key. Only integer-type fields are supported.
    DestinationTableods_user_info_d_starrocks
    DestinationStatement run before writingALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE — drops the target partition before each sync to prevent duplicate data. ${var} is replaced by the scheduling parameter at runtime. For more information, see Configure scheduling.
    DestinationStreamLoad request parameters{"row_delimiter": "\\x02", "column_separator": "\\x01"}
  4. Configure field mapping:

    1. Click Map Fields with Same Name to automatically map matching source and destination fields.

    2. Click Add and enter '${var}'. Map this field to the dt column in StarRocks. The scheduling parameter dynamically assigns the partition date at runtime.

  5. In Channel Control, set Policy for dirty data to Disallow Dirty Data. For more information, see Configure a task in wizard mode.

  6. Click Run Configuration in the right-side pane and set the following parameters for the test run.

    ParameterValue
    Resource groupSelect the serverless resource group from Prepare the environment.
    Script parametersClick Add Parameter and enter var=20250223. DataStudio replaces ${var} with this value during the test run.
  7. (Optional) Click Properties to review scheduling settings. Scheduling parameters are inherited from the Workflow level. For details, see Configure node scheduling properties.

  8. In the top toolbar, click Save.

Configure the log data sync task (ods_raw_log_d_starrocks)

The HttpFile data source requires Script Mode to configure field mapping with a dynamic partition variable. Set up the basic configuration in Wizard Mode first, then switch to Script Mode for the field mapping.

  1. On the Workflow canvas, hover over ods_raw_log_d_starrocks and click Open Node.

  2. Set the source, resource group, and destination, then click Next to complete the connectivity test.

    ParameterValue
    SourceHttpFile — data source user_behavior_analysis_httpfile
    Resource groupSelect the serverless resource group from Prepare the environment.
    DestinationStarRocks — data source doc_starrocks_storage_compute_tightly_01
  3. Click Next and configure the source and destination settings:

    ModuleParameterValue
    SourceFile path/user_log.txt
    SourceText typetext
    SourceColumn delimiter|
    SourceAdvanced configuration > Skip headerNo
    SourceAfter configuring the source, click Confirm Data Structure.
    DestinationTableods_raw_log_d_starrocks
    DestinationStatement run before writingALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE — drops the target partition before each sync to prevent duplicate data. For more information, see Configure scheduling.
    DestinationStreamLoad request parameters{"row_delimiter": "\\x02", "column_separator": "\\x01"}
  4. In Channel Control, set Policy for dirty data to Disallow Dirty Data. For more information, see Configure a task in wizard mode.

  5. Configure field mapping in Script Mode. The HttpFile data source does not support dynamic variables in Wizard Mode. Switch to Script Mode by clicking the image icon in the toolbar. In the source HttpFile configuration, add the following object to the column array to pass the scheduling parameter as the dt partition value:

    {
      "type": "STRING",
      "value": "${var}"
    }

    The complete script for this node is:

    {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": "httpfile",
                "parameter": {
                    "fileName": "/user_log.txt",
                    "nullFormat": "",
                    "compress": "",
                    "requestMethod": "GET",
                    "connectTimeoutSeconds": 60,
                    "column": [
                        {
                            "index": 0,
                            "type": "STRING"
                        },
                        {
                            "type": "STRING",
                            "value": "${var}"
                        }
                    ],
                    "skipHeader": "false",
                    "encoding": "UTF-8",
                    "fieldDelimiter": "|",
                    "fieldDelimiterOrigin": "|",
                    "socketTimeoutSeconds": 3600,
                    "envType": 0,
                    "datasource": "user_behavior_analysis_httpfile",
                    "bufferByteSizeInKB": 1024,
                    "fileFormat": "text"
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "starrocks",
                "parameter": {
                    "loadProps": {
                        "row_delimiter": "\\x02",
                        "column_separator": "\\x01"
                    },
                    "envType": 0,
                    "datasource": "doc_starrocks_storage_compute_tightly_01",
                    "column": [
                        "col",
                        "dt"
                    ],
                    "tableComment": "",
                    "table": "ods_raw_log_d_starrocks",
                    "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS  p${var} FORCE ; "
                },
                "name": "Writer",
                "category": "writer"
            },
            {
                "copies": 1,
                "parameter": {
                    "nodes": [],
                    "edges": [],
                    "groups": [],
                    "version": "2.0"
                },
                "name": "Processor",
                "category": "processor"
            }
        ],
        "setting": {
            "errorLimit": {
                "record": "0"
            },
            "locale": "zh",
            "speed": {
                "throttle": false,
                "concurrent": 2
            }
        },
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        }
    }
  6. Click Run Configuration in the right-side pane and set the following parameters for the test run.

    ParameterValue
    Resource groupSelect the serverless resource group from Prepare the environment.
    Script parametersClick Add Parameter and enter var=20250223. DataStudio replaces ${var} with this value during the test run.
  7. (Optional) Click Properties to review scheduling settings. Scheduling parameters are inherited from the Workflow level. For details, see Configure node scheduling properties.

  8. In the top toolbar, click Save.

Step 4: Run and verify

Run the Workflow

In the Workflow toolbar, click Run. When prompted, set the parameter variable for this run — this tutorial uses 20250223, but adjust the value as needed. Click OK and wait for all nodes to complete.

Verify the results

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis, then click Go to DataAnalysis. In the left navigation pane, click SQL Query.

  2. Click the image icon next to My Files to create a new SQL query file and enter a name.

  3. Open the file, then click the image icon in the upper-right corner to configure the query settings:

    ParameterValue
    WorkspaceSelect the workspace containing the user_profile_analysis_starrocks Workflow.
    Data source typeStarRocks
    Data source nameSelect the StarRocks development environment bound in Prepare the environment.
  4. Click OK to apply the query data source.

  5. Run the following queries to confirm that data was written to both StarRocks tables. Replace your_business_date with the Data Timestamp for your run — for a task run on 20250223, the Data Timestamp is 20250222 (the previous day).

    SELECT * FROM ods_raw_log_d_starrocks WHERE dt=your_business_date;
    SELECT * FROM ods_user_info_d_starrocks WHERE dt=your_business_date;

What's next

With data now loaded into StarRocks, proceed to the next tutorial: Process data.