DataWorks: Synchronize data

Last Updated: Mar 26, 2026

Use DataWorks Data Integration to batch-synchronize data from heterogeneous sources into StarRocks. This tutorial walks you through syncing basic user information from MySQL and website access logs from an HttpFile source into two StarRocks tables — setting up the workflow, configuring the sync tasks, and verifying the results.

In this tutorial, you complete the following steps:

  1. Design a workflow with the required nodes and scheduling logic.

  2. Create the destination StarRocks tables.

  3. Configure two batch synchronization tasks: one for MySQL data, one for HttpFile data.

  4. Run the workflow and verify the synchronized data.

Prerequisites

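Before you begin, ensure that the following resources, which later steps in this tutorial reference, are available:

  • The serverless resource group purchased during environment preparation
  • A MySQL data source named user_behavior_analysis_mysql
  • An HttpFile data source named user_behavior_analysis_HttpFile
  • A StarRocks data source named Doc_StarRocks_Storage_Compute_Tightly_01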

Objective

Sync data from the public sources provided in this example to StarRocks.

Source type | Data | Source schema | Destination type | Destination table
MySQL | Table: ods_user_info_d (basic user information) | uid, gender, age_range, zodiac | StarRocks | ods_user_info_d_starrocks
HttpFile | Object: user_log.txt (website access logs) | One access record per row: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content]; | StarRocks | ods_raw_log_d_starrocks

The destination tables include an extra dt partition field that is not in the source. The batch synchronization tasks dynamically assign this field using a scheduling parameter.
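The access-log layout above uses Nginx-style variable names. This tutorial loads each record unparsed into a single column, but as a rough illustration of how the fields in one record line up, the following Python sketch matches the documented layout with a regular expression. The sample line is made up for illustration and is not taken from user_log.txt; the pattern is an assumption about the layout, not a parser shipped with DataWorks.

```python
import re

# Pattern mirroring the documented layout:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
LOG_PATTERN = re.compile(
    r'^(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
    r'(?: (?P<unknown_content>.*?))?;?\s*$'
)


def parse_access_record(line: str) -> dict:
    """Split one access record into named fields, or raise ValueError."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized access record: {line!r}")
    return match.groupdict()


# Made-up sample record (hypothetical values).
SAMPLE = (
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0800] "GET /index.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0" [unknown_content];'
)

record = parse_access_record(SAMPLE)
print(record["remote_addr"], record["status"], record["request"])
```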

Go to the DataStudio page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

Step 1: Design a workflow

Create and design the workflow

  1. Create a workflow named User profile analysis_StarRocks. For instructions, see Create a workflow.

  2. On the workflow canvas, click Create Node in the top toolbar. Drag the nodes listed in the following table onto the canvas and draw lines between them to configure dependencies. The zero load node and the synchronization nodes have no data lineage between them, so configure their dependencies by drawing lines manually. For details, see Scheduling dependency configuration guide. The following table describes the five nodes in this workflow.

    Node classification | Node type | Node name | Purpose
    General | Zero load node | workshop_start_starrocks | Triggers the entire workflow at a scheduled time. Acts as a dry-run entry point; no code is required.
    Database | StarRocks | ddl_ods_user_info_d_starrocks | Creates the StarRocks table ods_user_info_d_starrocks before data is synchronized.
    Database | StarRocks | ddl_ods_raw_log_d_starrocks | Creates the StarRocks table ods_raw_log_d_starrocks before data is synchronized.
    Data Integration | Offline synchronization | ods_user_info_d_starrocks | Syncs basic user information from MySQL to ods_user_info_d_starrocks.
    Data Integration | Offline synchronization | ods_raw_log_d_starrocks | Syncs website access logs from the HttpFile source to ods_raw_log_d_starrocks.
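To make the dependency idea concrete, the following Python sketch models the five nodes as a small graph and derives one valid run order. The exact edges are an assumption: the text states only that the zero load node starts the workflow and that each table is created before data is synchronized into it, so the sketch wires the start node to both DDL nodes and each DDL node to its synchronization node. It does not call any DataWorks API.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
# Assumed wiring: start node -> DDL nodes -> synchronization nodes.
DEPENDENCIES = {
    "workshop_start_starrocks": set(),
    "ddl_ods_user_info_d_starrocks": {"workshop_start_starrocks"},
    "ddl_ods_raw_log_d_starrocks": {"workshop_start_starrocks"},
    "ods_user_info_d_starrocks": {"ddl_ods_user_info_d_starrocks"},
    "ods_raw_log_d_starrocks": {"ddl_ods_raw_log_d_starrocks"},
}


def run_order(dependencies: dict) -> list:
    """Return one ordering in which every node follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
for position, node in enumerate(order, start=1):
    print(position, node)
```

Under this assumed wiring, the two branches are independent of each other, so only the relative order within each branch is guaranteed.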

Configure scheduling

The zero load node workshop_start_starrocks controls when the entire workflow runs. Configure its scheduling properties as follows. You do not need to change the scheduling settings on any other node.

Configuration | Setting | Notes
Scheduling time | 00:30 | The node triggers the workflow at 00:30 every day.
Scheduling dependencies | Root node of the workspace | Because workshop_start_starrocks has no ancestor nodes, set it to depend on the workspace root node. All data synchronization nodes in the workflow depend on workshop_start_starrocks, so the root node ultimately triggers the entire workflow.

For more information, see Configure scheduling time for nodes in a workflow in different scenarios and Overview.

Step 2: Configure data synchronization tasks

Create the destination StarRocks tables

Create the destination tables before running any synchronization. On the workflow canvas, double-click each database node and run the corresponding DDL statement by clicking the run icon in the toolbar.

  • ddl_ods_user_info_d_starrocks

    CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
        uid STRING COMMENT 'The user ID',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        dt STRING not null COMMENT 'The time'
    )
    DUPLICATE KEY(uid)
    COMMENT 'User behavior analysis case - table that stores basic user information'
    PARTITION BY(dt)
    PROPERTIES("replication_num" = "1");

  • ddl_ods_raw_log_d_starrocks

    CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
        col STRING COMMENT 'Log',
        dt DATE not null COMMENT 'The time'
    ) DUPLICATE KEY(col)
    COMMENT 'User behavior analysis case - table that stores the website access logs of users'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");

Sync basic user information (MySQL to StarRocks)

On the workflow canvas, double-click the batch synchronization node ods_user_info_d_starrocks to open its configuration tab.

1. Configure the connection and resource group

Set the source, resource group, and destination, then click Next and complete the connectivity test.

Parameter | Value
Source | Type: MySQL; Data Source Name: user_behavior_analysis_mysql
Resource group | The serverless resource group purchased during environment preparation
Destination | Type: StarRocks; Data Source Name: Doc_StarRocks_Storage_Compute_Tightly_01

2. Configure the source and destination tables

Item | Parameter | Value
Source | Table | ods_user_info_d
Source | Split key | uid. Use a primary key or an indexed column of the INTEGER type as the split key.
Destination | Table | ods_user_info_d_starrocks
Destination | Statement run before writing | ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE. This statement deletes the target partition before each sync to prevent duplicate writes when the node reruns. ${var} is replaced at runtime by the scheduling parameter configured below.
Destination | StreamLoad request parameters | {"row_delimiter": "\\x02", "column_separator": "\\x01"}
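The StreamLoad request parameters set the row delimiter to byte 0x02 and the column separator to byte 0x01. A plausible reason, stated here as an assumption rather than something the tutorial confirms, is that these control characters are unlikely to occur in the data, whereas commas and line breaks can. The Python sketch below only illustrates that idea with made-up rows; it is not how DataWorks or StarRocks serialize data internally.

```python
COLUMN_SEPARATOR = "\x01"  # corresponds to "column_separator": "\\x01"
ROW_DELIMITER = "\x02"     # corresponds to "row_delimiter": "\\x02"


def encode_rows(rows: list) -> str:
    """Join fields with the column separator and rows with the row delimiter."""
    return ROW_DELIMITER.join(COLUMN_SEPARATOR.join(fields) for fields in rows)


def decode_rows(payload: str) -> list:
    """Reverse encode_rows by splitting on the same two delimiters."""
    return [row.split(COLUMN_SEPARATOR) for row in payload.split(ROW_DELIMITER)]


# Made-up rows; the second one contains a comma and a line break on purpose.
rows = [
    ["0001", "male", "20-30", "Aries"],
    ["0002", "female", "30-40", "Leo,\nrising"],
]

payload = encode_rows(rows)
restored = decode_rows(payload)
print(restored == rows)
```

With a comma separator and a newline row delimiter, the second made-up row would be split into extra fields and rows; with the control characters it survives the round trip.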

3. Configure field mappings

Click Map Fields with Same Name to automatically map source MySQL fields to destination fields with identical names.

Then click Add, enter '${var}', and map this value to the dt partition field in the StarRocks table. This lets the partition value be set dynamically each time the node runs.

4. Configure scheduling properties

On the configuration tab, click Properties in the right-side navigation pane. For more details, see Scheduling properties of a node.

Section | Configuration
Scheduling parameter | Click Add Parameter. Set Parameter Name to var and Parameter Value to $[yyyymmdd-1]. At runtime, ${var} in the node code is replaced with yesterday's date, writing data to the correct partition.
Dependencies | Set the output name to the output table name in workspacename.tablename format.
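As a rough illustration of the substitution described above, the following Python sketch computes the value that $[yyyymmdd-1] would produce for a given run date and renders the pre-write statement with it. The helper names resolve_var and render are hypothetical, and the sketch only mimics the behavior described in this tutorial; the scheduler performs the real replacement.

```python
from datetime import date, timedelta

PRE_SQL_TEMPLATE = (
    "ALTER TABLE ods_user_info_d_starrocks "
    "DROP PARTITION IF EXISTS p${var} FORCE"
)


def resolve_var(run_date: date) -> str:
    """Mimic $[yyyymmdd-1]: the day before the run date, formatted as yyyymmdd."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


def render(template: str, var: str) -> str:
    """Replace the ${var} placeholder with the resolved value."""
    return template.replace("${var}", var)


var = resolve_var(date(2024, 1, 2))
print(var)  # 20240101
print(render(PRE_SQL_TEMPLATE, var))
```

For a run on January 2, 2024, the statement drops partition p20240101 before the sync writes that day's data, which is what makes a rerun safe.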

Sync website access logs (HttpFile to StarRocks)

On the workflow canvas, double-click the batch synchronization node ods_raw_log_d_starrocks to open its configuration tab.

1. Configure the connection and resource group

Set the source, resource group, and destination, then click Next and complete the connectivity test.

Parameter | Value
Source | Type: HttpFile; Data Source Name: user_behavior_analysis_HttpFile
Resource group | The serverless resource group purchased during environment preparation
Destination | Type: StarRocks; Data Source Name: Doc_StarRocks_Storage_Compute_Tightly_01

2. Configure the source and destination tables

Item | Parameter | Value
Source | File path | /user_log.txt
Source | File type | text
Source | Column delimiter | Vertical bar (|)
Source | Advanced settings > Skip header | No
Source | Confirm data structure | Click Confirm data structure after configuring the data sources.
Destination | Table | ods_raw_log_d_starrocks
Destination | Statement run before writing | ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE. This statement deletes the target partition before each sync to prevent duplicate writes when the node reruns.
Destination | StreamLoad request parameters | {"row_delimiter": "\\x02", "column_separator": "\\x01"}

3. Configure field mappings

Because the HttpFile source has no named columns, switch from the codeless UI to script mode by clicking the script mode icon in the top toolbar. Script mode lets you manually define column mappings and inject the dynamic partition value.

Add the following column definition for the dt field:

{
  "type": "STRING",
  "value": "${var}"
}
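One way to picture the reader's column list is as a recipe applied to every line of the file: an entry with an index picks a field from the split line, and an entry with a value contributes a constant. The Python sketch below is a simplified model under that assumption; read_record is a hypothetical helper, not the HttpFile reader's actual implementation, and the sample lines are made up.

```python
def read_record(line: str, columns: list, field_delimiter: str = "|") -> list:
    """Build one output record from a text line using index/value column entries."""
    fields = line.rstrip("\n").split(field_delimiter)
    record = []
    for column in columns:
        if "value" in column:
            # Constant column: the same value is emitted for every line.
            record.append(column["value"])
        else:
            # Indexed column: take the field at that position in the line.
            record.append(fields[column["index"]])
    return record


COLUMNS = [
    {"index": 0, "type": "STRING"},
    # ${var} shown here already replaced with an example date.
    {"type": "STRING", "value": "20240101"},
]

print(read_record("made-up log line without the delimiter", COLUMNS))
print(read_record("first part|second part", COLUMNS))
```

In this model, a line that contains the delimiter contributes only its first field to the first output column, which is worth keeping in mind when choosing a delimiter for raw log data.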

The complete script for the ods_raw_log_d_starrocks node:

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "httpfile",
            "parameter": {
                "fileName": "/user_log.txt",
                "nullFormat": "",
                "compress": "",
                "requestMethod": "GET",
                "connectTimeoutSeconds": 60,
                "column": [
                    {
                        "index": 0,
                        "type": "STRING"
                    },
                    {
                        "type": "STRING",
                        "value": "${var}"
                    }
                ],
                "skipHeader": "false",
                "encoding": "UTF-8",
                "fieldDelimiter": "|",
                "fieldDelimiterOrigin": "|",
                "socketTimeoutSeconds": 3600,
                "envType": 0,
                "datasource": "user_behavior_analysis_HttpFile",
                "bufferByteSizeInKB": 1024,
                "fileFormat": "text"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "starrocks",
            "parameter": {
                "loadProps": {
                    "row_delimiter": "\\x02",
                    "column_separator": "\\x01"
                },
                "envType": 0,
                "datasource": "Doc_StarRocks_Storage_Compute_Tightly_01",
                "column": [
                    "col",
                    "dt"
                ],
                "tableComment": "",
                "table": "ods_raw_log_d_starrocks",
                "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS  p${var} FORCE ; "
            },
            "name": "Writer",
            "category": "writer"
        },
        {
            "copies": 1,
            "parameter": {
                "nodes": [],
                "edges": [],
                "groups": [],
                "version": "2.0"
            },
            "name": "Processor",
            "category": "processor"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "locale": "zh",
        "speed": {
            "throttle": false,
            "concurrent": 2
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
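In the script above, the reader defines two columns and the writer lists two destination columns (col and dt), which are presumably matched by position. As a minimal sketch under that assumption, the following Python check compares the two counts before a job is submitted. column_counts is a hypothetical helper, and JOB is a trimmed copy of the script that keeps only the fields the check reads.

```python
def column_counts(job: dict) -> tuple:
    """Return (reader_count, writer_count); raise ValueError if they differ."""
    steps = {step["category"]: step["parameter"] for step in job["steps"]}
    reader_columns = steps["reader"]["column"]
    writer_columns = steps["writer"]["column"]
    if len(reader_columns) != len(writer_columns):
        raise ValueError(
            f"reader defines {len(reader_columns)} columns, "
            f"writer expects {len(writer_columns)}"
        )
    return len(reader_columns), len(writer_columns)


# Trimmed copy of the job script: only the fields used by the check.
JOB = {
    "steps": [
        {"category": "reader", "parameter": {"column": [
            {"index": 0, "type": "STRING"},
            {"type": "STRING", "value": "${var}"},
        ]}},
        {"category": "writer", "parameter": {"column": ["col", "dt"]}},
        {"category": "processor", "parameter": {"nodes": [], "edges": []}},
    ]
}

print(column_counts(JOB))  # (2, 2)
```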

4. Configure scheduling properties

Section | Configuration
Scheduling parameter | Click Add Parameter. Set Parameter Name to var and Parameter Value to $[yyyymmdd-1].
Dependencies | Set the output name to the output table name in workspacename.tablename format.

Step 3: Verify the synchronized data

Run the workflow

  1. Under Business Flow, double-click the User profile analysis_StarRocks workflow to open the canvas.

  2. Click the run icon in the top toolbar to run the workflow. Nodes run in dependency order.

  3. Watch the node status on the canvas. When all nodes reach the success state, the workflow has completed successfully.

  4. To inspect a node's execution details, right-click ods_user_info_d_starrocks or ods_raw_log_d_starrocks on the canvas and select View log.

Query the synchronized data

  1. In the left-side navigation pane of the DataStudio page, click the ad hoc query icon. In the Ad hoc query pane, right-click Ad hoc query and choose Create node > StarRocks.

  2. Run the following queries to confirm that data was written to the correct partitions. Replace <data_timestamp> with the data date — one day before the node's run date. For example, if the node ran on January 2, 2024, use 20240101.

    SELECT * FROM ods_raw_log_d_starrocks WHERE dt = <data_timestamp>;
    SELECT * FROM ods_user_info_d_starrocks WHERE dt = <data_timestamp>;

What's next

Data synchronization is complete. In the next tutorial, you process the synchronized basic user information and access logs in StarRocks. See Process data.