Use DataWorks Data Integration to batch-synchronize data from heterogeneous sources into StarRocks. This tutorial walks you through syncing basic user information from MySQL and website access logs from an HttpFile source into two StarRocks tables — setting up the workflow, configuring the sync tasks, and verifying the results.
In this tutorial, you complete the following steps:
- Design a workflow with the required nodes and scheduling logic.
- Create the destination StarRocks tables.
- Configure two batch synchronization tasks: one for MySQL data, one for HttpFile data.
- Run the workflow and verify the synchronized data.
Prerequisites
Before you begin, ensure that you have:
- Completed environment preparation. See Prepare environments.
Objective
Sync data from the public sources provided in this example to StarRocks.
| Source type | Data | Source schema | Destination type | Destination table |
|---|---|---|---|---|
| MySQL | Table: ods_user_info_d (basic user information) | uid, gender, age_range, zodiac | StarRocks | ods_user_info_d_starrocks |
| HttpFile | Object: user_log.txt (website access logs) | One access record per row: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content]; | StarRocks | ods_raw_log_d_starrocks |
The destination tables include an extra dt partition field that is not in the source. The batch synchronization tasks dynamically assign this field using a scheduling parameter.
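To make the access-log layout listed in the table above more concrete, the following minimal Python sketch splits one record into its named fields. The regular expression and the sample record are assumptions written for illustration only; the synchronization task in this tutorial does not parse records and writes each raw line into a single column.

```python
import re

# Regex mirroring the documented field order. The pattern is an
# illustrative assumption, not something DataWorks or StarRocks provides.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_access_record(line: str) -> dict:
    """Split one raw access record into named fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match.groupdict()


# Hypothetical sample record, invented for this sketch.
sample = (
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0800] '
    '"GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0" [unknown_content];'
)
record = parse_access_record(sample)
```

All captured values are strings; converting status or body_bytes_sent to numbers is left to downstream processing.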
Go to the DataStudio page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
Step 1: Design a workflow
Create and design the workflow
- Create a workflow named User profile analysis_StarRocks. For instructions, see Create a workflow.
- On the workflow canvas, click Create Node in the top toolbar. Drag nodes onto the canvas and draw lines between them to configure dependencies. Base your node layout on the workflow design. The zero load node and the synchronization nodes have no data lineage between them, so configure their dependencies by drawing lines manually. For details, see Scheduling dependency configuration guide. The following table describes the five nodes in this workflow.

| Node classification | Node type | Node name | Purpose |
|---|---|---|---|
| General | Zero load node | workshop_start_starrocks | Triggers the entire workflow at a scheduled time. Acts as a dry-run entry point; no code required. |
| Database | StarRocks | ddl_ods_user_info_d_starrocks | Creates the StarRocks table ods_user_info_d_starrocks before data is synchronized. |
| Database | StarRocks | ddl_ods_raw_log_d_starrocks | Creates the StarRocks table ods_raw_log_d_starrocks before data is synchronized. |
| Data Integration | Offline synchronization | ods_user_info_d_starrocks | Syncs basic user information from MySQL to ods_user_info_d_starrocks. |
| Data Integration | Offline synchronization | ods_raw_log_d_starrocks | Syncs website access logs from the HttpFile source to ods_raw_log_d_starrocks. |
Configure scheduling
The zero load node workshop_start_starrocks controls when the entire workflow runs. Configure its scheduling properties as follows. You do not need to change the scheduling settings on any other node.
| Configuration | Setting | Notes |
|---|---|---|
| Scheduling time | 00:30 | The node triggers the workflow at 00:30 every day. |
| Scheduling dependencies | Root node of the workspace | Because workshop_start_starrocks has no ancestor nodes, set it to depend on the workspace root node. All data synchronization nodes in the workflow depend on workshop_start_starrocks, so the root node ultimately triggers the entire workflow. |
For more information, see Configure scheduling time for nodes in a workflow in different scenarios and Overview.
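The dependency-driven run order can be sketched in Python with the standard-library graphlib module. The edges below are an assumption (each synchronization node depends on its DDL node, and each DDL node depends on the zero load node); adjust them to match the lines you actually drew on the canvas.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
# Assumption: sync node -> DDL node -> zero load node.
dependencies = {
    "workshop_start_starrocks": set(),
    "ddl_ods_user_info_d_starrocks": {"workshop_start_starrocks"},
    "ddl_ods_raw_log_d_starrocks": {"workshop_start_starrocks"},
    "ods_user_info_d_starrocks": {"ddl_ods_user_info_d_starrocks"},
    "ods_raw_log_d_starrocks": {"ddl_ods_raw_log_d_starrocks"},
}

# static_order() yields every node after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

for position, node in enumerate(run_order, start=1):
    print(position, node)
```

Under these assumed edges, the zero load node always comes first and each table is created before its synchronization node runs. The relative order of the two independent branches is not fixed.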
Step 2: Configure data synchronization tasks
Create the destination StarRocks tables
Create the destination tables before running any synchronization. On the workflow canvas, double-click each database node and run the corresponding DDL statement by clicking the run icon.
- ddl_ods_user_info_d_starrocks

      CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
          uid STRING COMMENT 'The user ID',
          gender STRING COMMENT 'The gender',
          age_range STRING COMMENT 'The age range',
          zodiac STRING COMMENT 'The zodiac sign',
          dt STRING not null COMMENT 'The time'
      )
      DUPLICATE KEY(uid)
      COMMENT 'User behavior analysis case - table that stores basic user information'
      PARTITION BY(dt)
      PROPERTIES("replication_num" = "1");

- ddl_ods_raw_log_d_starrocks

      CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
          col STRING COMMENT 'Log',
          dt DATE not null COMMENT 'The time'
      )
      DUPLICATE KEY(col)
      COMMENT 'User behavior analysis case - table that stores the website access logs of users'
      PARTITION BY(dt)
      PROPERTIES ("replication_num" = "1");
Sync basic user information (MySQL to StarRocks)
On the workflow canvas, double-click the batch synchronization node ods_user_info_d_starrocks to open its configuration tab.
1. Configure the connection and resource group
Set the source, resource group, and destination, then click Next and complete the connectivity test.
| Parameter | Value |
|---|---|
| Source | Type: MySQL; Data Source Name: user_behavior_analysis_mysql |
| Resource group | The serverless resource group purchased during environment preparation |
| Destination | Type: StarRocks; Data Source Name: Doc_StarRocks_Storage_Compute_Tightly_01 |
2. Configure the source and destination tables
| Item | Parameter | Value |
|---|---|---|
| Source | Table | ods_user_info_d |
| Source | Split key | uid. Use a primary key or an indexed column of the INTEGER type as the split key. |
| Destination | Table | ods_user_info_d_starrocks |
| Destination | Statement run before writing | ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE (deletes the target partition before each sync to prevent duplicate writes when the node reruns; ${var} is replaced at runtime by the scheduling parameter configured below) |
| Destination | StreamLoad request parameters | {"row_delimiter": "\\x02", "column_separator": "\\x01"} |
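The following Python sketch illustrates what the destination settings above amount to once the scheduling parameter is filled in. It is only a local approximation: DataWorks performs the real substitution, and the value 20240101 is an example. The sketch also decodes the StreamLoad request parameters to show that the JSON value "\\x02" is the four-character text \x02, which Stream Load is expected to interpret as the non-printing byte 0x02 (an assumption based on StarRocks's hexadecimal delimiter notation).

```python
import json
from string import Template

# Statement run before writing, with ${var} left as a placeholder.
PRE_SQL = Template(
    "ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE"
)


def render_pre_sql(var: str) -> str:
    """Approximate the runtime replacement of ${var} (done by DataWorks)."""
    return PRE_SQL.substitute(var=var)


rendered = render_pre_sql("20240101")  # example value only

# Raw string so the backslashes reach the JSON parser unchanged.
LOAD_PROPS_JSON = r'{"row_delimiter": "\\x02", "column_separator": "\\x01"}'
load_props = json.loads(LOAD_PROPS_JSON)

print(rendered)
print(load_props["row_delimiter"], len(load_props["row_delimiter"]))
```

Using non-printing delimiters avoids clashes with characters such as commas or line breaks that may appear inside the data itself.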
3. Configure field mappings
Click Map Fields with Same Name to automatically map source MySQL fields to destination fields with identical names.
Then click Add, enter '${var}', and map this value to the dt partition field in the StarRocks table. This lets the partition value be set dynamically each time the node runs.
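Conceptually, this mapping copies the four same-name fields and appends one constant value as dt. The Python sketch below mimics that behavior on a hypothetical row; the sample values and the function name are invented for illustration and are not part of DataWorks.

```python
from typing import Iterable, Iterator

# Source fields mapped to same-name destination fields.
SOURCE_COLUMNS = ("uid", "gender", "age_range", "zodiac")


def add_partition_column(rows: Iterable[dict], var: str) -> Iterator[dict]:
    """Copy the mapped fields and append the constant dt value."""
    for row in rows:
        mapped = {name: row[name] for name in SOURCE_COLUMNS}
        mapped["dt"] = var  # same constant for every row in one run
        yield mapped


# Hypothetical sample row, invented for this sketch.
source_rows = [
    {"uid": "0001", "gender": "female", "age_range": "30-40", "zodiac": "Leo"},
]
destination_rows = list(add_partition_column(source_rows, "20240101"))
```

Because every row in a run receives the same dt value, a single run writes to exactly one partition.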
4. Configure scheduling properties
On the configuration tab, click Properties in the right-side navigation pane. For more details, see Scheduling properties of a node.
| Section | Configuration |
|---|---|
| Scheduling parameter | Click Add Parameter. Set Parameter Name to var and Parameter Value to $[yyyymmdd-1]. At runtime, ${var} in the node code is replaced with yesterday's date, writing data to the correct partition. |
| Dependencies | Set the output name to the output table name in workspacename.tablename format. |
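As a rough illustration of what $[yyyymmdd-1] evaluates to, the Python sketch below subtracts one day from a scheduled date and formats the result. The function name is hypothetical and the calculation only approximates the scheduler's behavior; DataWorks resolves the expression itself at runtime.

```python
from datetime import date, timedelta


def resolve_var(scheduled_date: date) -> str:
    """Approximate $[yyyymmdd-1]: the day before the scheduled date."""
    return (scheduled_date - timedelta(days=1)).strftime("%Y%m%d")


# A run scheduled for January 2, 2024 writes to partition p20240101.
example = resolve_var(date(2024, 1, 2))
print(example)
```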
Sync website access logs (HttpFile to StarRocks)
On the workflow canvas, double-click the batch synchronization node ods_raw_log_d_starrocks to open its configuration tab.
1. Configure the connection and resource group
Set the source, resource group, and destination, then click Next and complete the connectivity test.
| Parameter | Value |
|---|---|
| Source | Type: HttpFile; Data Source Name: user_behavior_analysis_HttpFile |
| Resource group | The serverless resource group purchased during environment preparation |
| Destination | Type: StarRocks; Data Source Name: Doc_StarRocks_Storage_Compute_Tightly_01 |
2. Configure the source and destination tables
| Item | Parameter | Value |
|---|---|---|
| Source | File path | /user_log.txt |
| Source | File type | text |
| Source | Column delimiter | \| |
| Source | Advanced settings > Skip header | No |
| After configuring the data sources, click Confirm data structure. | | |
| Destination | Table | ods_raw_log_d_starrocks |
| Destination | Statement run before writing | ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE (deletes the target partition before each sync to prevent duplicate writes when the node reruns) |
| Destination | StreamLoad request parameters | {"row_delimiter": "\\x02", "column_separator": "\\x01"} |
3. Configure field mappings
Because the HttpFile source has no named columns, switch from the codeless UI to script mode by clicking the script mode icon in the top toolbar. Script mode lets you manually define column mappings and inject the dynamic partition value.
Add the following column definition for the dt field:
{
"type": "STRING",
"value": "${var}"
}
The complete script for the ods_raw_log_d_starrocks node:
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "httpfile",
"parameter": {
"fileName": "/user_log.txt",
"nullFormat": "",
"compress": "",
"requestMethod": "GET",
"connectTimeoutSeconds": 60,
"column": [
{
"index": 0,
"type": "STRING"
},
{
"type": "STRING",
"value": "${var}"
}
],
"skipHeader": "false",
"encoding": "UTF-8",
"fieldDelimiter": "|",
"fieldDelimiterOrigin": "|",
"socketTimeoutSeconds": 3600,
"envType": 0,
"datasource": "user_behavior_analysis_HttpFile",
"bufferByteSizeInKB": 1024,
"fileFormat": "text"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "starrocks",
"parameter": {
"loadProps": {
"row_delimiter": "\\x02",
"column_separator": "\\x01"
},
"envType": 0,
"datasource": "Doc_StarRocks_Storage_Compute_Tightly_01",
"column": [
"col",
"dt"
],
"tableComment": "",
"table": "ods_raw_log_d_starrocks",
"preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE ; "
},
"name": "Writer",
"category": "writer"
},
{
"copies": 1,
"parameter": {
"nodes": [],
"edges": [],
"groups": [],
"version": "2.0"
},
"name": "Processor",
"category": "processor"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"locale": "zh",
"speed": {
"throttle": false,
"concurrent": 2
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
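Before saving a script like the one above, it can help to confirm that the reader and writer column lists line up. The Python sketch below assumes that columns are paired by position, which is the usual convention for this script format but is not stated in this tutorial. It works on an abridged copy of the script that keeps only the keys the check reads; the helper names are hypothetical.

```python
import json

# Abridged copy of the script: only the keys this check reads.
SCRIPT_JSON = r'''
{
  "steps": [
    {"category": "reader", "name": "Reader",
     "parameter": {"column": [{"index": 0, "type": "STRING"},
                              {"type": "STRING", "value": "${var}"}]}},
    {"category": "writer", "name": "Writer",
     "parameter": {"column": ["col", "dt"]}}
  ]
}
'''


def describe(reader_column: dict) -> str:
    """Label a reader column as a source field or a constant."""
    if "index" in reader_column:
        return f"source field {reader_column['index']}"
    return f"constant {reader_column['value']}"


def pair_columns(script: dict) -> list:
    """Pair reader and writer columns by position (assumed convention)."""
    steps = {step["category"]: step["parameter"] for step in script["steps"]}
    reader_columns = steps["reader"]["column"]
    writer_columns = steps["writer"]["column"]
    if len(reader_columns) != len(writer_columns):
        raise ValueError(
            f"{len(reader_columns)} reader columns vs "
            f"{len(writer_columns)} writer columns"
        )
    return [(describe(r), w) for r, w in zip(reader_columns, writer_columns)]


pairs = pair_columns(json.loads(SCRIPT_JSON))
for source, target in pairs:
    print(f"{source} -> {target}")
```

In this pairing, the whole log line (source field 0) lands in col and the constant ${var} lands in dt.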
4. Configure scheduling properties
| Section | Configuration |
|---|---|
| Scheduling parameter | Click Add Parameter. Set Parameter Name to var and Parameter Value to $[yyyymmdd-1]. |
| Dependencies | Set the output name to the output table name in workspacename.tablename format. |
Step 3: Verify the synchronized data
Run the workflow
- Under Business Flow, double-click the User profile analysis_StarRocks workflow to open the canvas.
- Click the run icon in the top toolbar to run the workflow. Nodes run in dependency order.
- Watch the node status on the canvas. When all nodes reach the success state, the workflow has completed successfully.
- To inspect a node's execution details, right-click ods_user_info_d_starrocks or ods_raw_log_d_starrocks on the canvas and select View log.
Query the synchronized data
- In the left-side navigation pane of the DataStudio page, click the ad hoc query icon. In the Ad hoc query pane, right-click Ad hoc query and choose Create node > StarRocks.
- Run the following queries to confirm that data was written to the correct partitions. Replace <data_timestamp> with the data date, which is one day before the node's run date. For example, if the node ran on January 2, 2024, use 20240101.

      SELECT * FROM ods_raw_log_d_starrocks WHERE dt = <data_timestamp>;
      SELECT * FROM ods_user_info_d_starrocks WHERE dt = <data_timestamp>;
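If you prefer to generate the two verification queries rather than edit them by hand, a small helper along the following lines could fill in the placeholder. This is a hypothetical convenience script, not a DataWorks feature; it only builds the query text, which you would still paste into the ad hoc query node.

```python
from datetime import datetime

QUERY_TEMPLATES = (
    "SELECT * FROM ods_raw_log_d_starrocks WHERE dt = <data_timestamp>;",
    "SELECT * FROM ods_user_info_d_starrocks WHERE dt = <data_timestamp>;",
)


def build_verification_queries(data_timestamp: str) -> list:
    """Fill <data_timestamp> after checking it is a valid yyyymmdd date."""
    if len(data_timestamp) != 8 or not data_timestamp.isdigit():
        raise ValueError(f"expected yyyymmdd, got {data_timestamp!r}")
    datetime.strptime(data_timestamp, "%Y%m%d")  # rejects impossible dates
    return [
        template.replace("<data_timestamp>", data_timestamp)
        for template in QUERY_TEMPLATES
    ]


queries = build_verification_queries("20240101")
for query in queries:
    print(query)
```

Validating the value before substitution keeps malformed or unexpected text out of the generated SQL.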
What's next
Data synchronization is complete. In the next tutorial, you process the synchronized basic user information and access logs in StarRocks. See Process data.