This tutorial walks you through synchronizing raw data into StarRocks using two Data Integration Batch Synchronization tasks: one pulls user records from a MySQL table (ods_user_info_d) and another pulls website access logs from an Object Storage Service (OSS) file via HttpFile (user_log.txt). The data lands in two StarRocks tables: ods_user_info_d_starrocks and ods_raw_log_d_starrocks.
Prerequisites
Before you begin, ensure that you have:
Completed the environment setup described in Prepare the environment
A serverless resource group attached to your workspace with public network access enabled
A StarRocks compute resource bound to your workspace
Step 1: Create data sources
Create two data sources in your DataWorks workspace so that the Batch Synchronization tasks can reach the raw data:
MySQL data source (
user_behavior_analysis_mysql): Provides user profile records from MySQL.HttpFile data source (
user_behavior_analysis_httpfile): Provides website access logs stored in OSS.
Create the MySQL data source
Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center, select your workspace, and click Go to Management Center.
In the left-side navigation pane, click Data Sources.
Click Add Data Source, then search for and select MySQL.
On the Add MySQL Data Source page, configure the following parameters. Use the same values for both the development and production environments.
Parameter Value Data source name user_behavior_analysis_mysqlData source description This data source is for DataWorks tutorials. Read data from this data source when you configure a Batch Synchronization task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it. Configuration mode Connection String Mode Connection address (host) rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.comConnection address (port) 3306Database name workshopUsername workshopPassword workshop#2017Authentication method No authentication In the Connection Configuration section, click Test Network Connectivity for both environments and confirm the status shows Connected.
ImportantMake sure the resource group is attached to your workspace and has public network access enabled. If no resource group is available, click Purchase and then Associated Purchased Resource Group in the connection configuration section. For details, see Prepare the environment.
Click Complete Creation.
Create the HttpFile data source
The website access logs are stored in OSS. Create an HttpFile data source to let the Batch Synchronization task read the log file.
In the left-side navigation pane, click Data Sources.
Click Add Data Source, then search for and select HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. Use the same values for both environments.
Parameter Value Data source name user_behavior_analysis_httpfileData source description This data source is for DataWorks tutorials. Read data from this data source when you configure a Batch Synchronization task to access the test data provided by the platform. This data source can only be read in Data Integration scenarios. Other modules cannot use it. URL https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.comIn the Connection Configuration section, click Test Network Connectivity for both environments and confirm the status shows Connected.
ImportantMake sure the resource group is attached to your workspace and has public network access enabled. If no resource group is available, click Purchase and then Associated Purchased Resource Group in the connection configuration section. For details, see Prepare the environment.
Click Complete Creation.
Step 2: Build the synchronization pipeline
Create a Workflow
Click the
icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Switch to the workspace created for this tutorial.In the left-side navigation pane, click the
icon. In the Workspace Directories section, click the
icon, select Create Workflow, and name it user_profile_analysis_starrocks.On the Workflow canvas, add the following nodes. Find each component in the left-side pane and drag it onto the canvas. When prompted for source and destination types, set Source Type to
MySQL, Destination Type toOSS, and Specific Type to Batch Synchronization.Node type Node name Role
Zero load nodeworkshop_start_starrocksAncestor node that manages the Workflow and clarifies the data forwarding path. Configured as a Dry-run node; no code editing required.
StarRocks nodeddl_ods_user_info_d_starrocksCreates the StarRocks table ods_user_info_d_starrocksfor user data.
StarRocks nodeddl_ods_raw_log_d_starrocksCreates the StarRocks table ods_raw_log_d_starrocksfor website access logs.
Batch synchronization nodeods_user_info_d_starrocksSyncs user data from MySQL into ods_user_info_d_starrocks.
Batch synchronization nodeods_raw_log_d_starrocksSyncs website access logs from OSS into ods_raw_log_d_starrocks.Connect the nodes so that
workshop_start_starrocksis the ancestor node for both Batch Synchronization nodes. The completed Workflow should look like this:
Configure Workflow scheduling
On the Workflow canvas, click Scheduling in the right-side pane and set the following parameters.
| Parameter | Value |
|---|---|
| Scheduling parameters | bizdate=$[yyyymmdd-1] — passes the previous day's date to all nodes in the Workflow. |
| Scheduling cycle | Day |
| Scheduling time | 00:30 |
| Scheduling dependencies | Click Use Workspace Root Node to attach the Workflow to the workspace root node (named workspace_name_root). |
Step 3: Configure synchronization tasks
Configure the initial node
On the Workflow canvas, hover over
workshop_start_starrocksand click Open Node.Click Properties in the right-side pane and configure the following parameters.
Parameter Value Scheduling type Dry-runResource group for scheduling Select the serverless resource group from Prepare the environment. Scheduling dependencies Click Use Workspace Root Node so the workspace root node triggers this Workflow. The root node follows the format WorkspaceName_root.
Create the user table (ddl_ods_user_info_d_starrocks)
Create the ods_user_info_d_starrocks table in StarRocks to receive user data from the MySQL data source. You can also create the table manually in the Data Catalog.
On the Workflow canvas, hover over
ddl_ods_user_info_d_starrocksand click Open Node.Enter the following table creation statement:
CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks ( uid STRING COMMENT 'User ID', gender STRING COMMENT 'Gender', age_range STRING COMMENT 'Age range', zodiac STRING COMMENT 'Zodiac sign', dt STRING not null COMMENT 'Date' ) DUPLICATE KEY(uid) COMMENT 'User profile analysis use case - User information table' PARTITION BY(dt) PROPERTIES("replication_num" = "1");Click Run Configuration in the right-side pane and set the following parameters for the test run.
Parameter Value Computing resource Select the StarRocks compute resource bound in Prepare the environment. Resource group Select the serverless resource group from Prepare the environment. (Optional) Click Properties to review scheduling settings. Scheduling parameters are already configured at the Workflow level, so no node-level changes are needed. For details, see Configure node scheduling properties.
In the top toolbar, click Save.
Create the log table (ddl_ods_raw_log_d_starrocks)
Create the ods_raw_log_d_starrocks table in StarRocks to receive website access logs from the HttpFile data source. You can also create the table manually in the Data Catalog.
On the Workflow canvas, hover over
ddl_ods_raw_log_d_starrocksand click Open Node.Enter the following table creation statement:
CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks ( col STRING COMMENT 'Log', dt DATE not null COMMENT 'Date' ) DUPLICATE KEY(col) COMMENT 'User profile analysis use case - Raw website access log table' PARTITION BY(dt) PROPERTIES ("replication_num" = "1");Click Run Configuration in the right-side pane and set the following parameters for the test run.
Parameter Value Computing resource Select the StarRocks compute resource bound in Prepare the environment. Resource group Select the serverless resource group from Prepare the environment. (Optional) Click Properties to review scheduling settings. Scheduling parameters are already configured at the Workflow level. For details, see Configure node scheduling properties.
In the top toolbar, click Save.
Configure the user data sync task (ods_user_info_d_starrocks)
On the Workflow canvas, hover over
ods_user_info_d_starrocksand click Open Node.Set the source, resource group, and destination:
Parameter Value Source MySQL — data source user_behavior_analysis_mysqlResource group Select the serverless resource group from Prepare the environment. Destination StarRocks — data source doc_starrocks_storage_compute_tightly_01Configure the source and destination settings:
Module Parameter Value Source Table ods_user_info_dSource Split key uid— use a primary key or indexed integer column as the split key. Only integer-type fields are supported.Destination Table ods_user_info_d_starrocksDestination Statement run before writing ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE— drops the target partition before each sync to prevent duplicate data.${var}is replaced by the scheduling parameter at runtime. For more information, see Configure scheduling.Destination StreamLoad request parameters {"row_delimiter": "\\x02", "column_separator": "\\x01"}Configure field mapping:
Click Map Fields with Same Name to automatically map matching source and destination fields.
Click Add and enter
'${var}'. Map this field to thedtcolumn in StarRocks. The scheduling parameter dynamically assigns the partition date at runtime.
In Channel Control, set Policy for dirty data to Disallow Dirty Data. For more information, see Configure a task in wizard mode.
Click Run Configuration in the right-side pane and set the following parameters for the test run.
Parameter Value Resource group Select the serverless resource group from Prepare the environment. Script parameters Click Add Parameter and enter var=20250223. DataStudio replaces${var}with this value during the test run.(Optional) Click Properties to review scheduling settings. Scheduling parameters are inherited from the Workflow level. For details, see Configure node scheduling properties.
In the top toolbar, click Save.
Configure the log data sync task (ods_raw_log_d_starrocks)
The HttpFile data source requires Script Mode to configure field mapping with a dynamic partition variable. Set up the basic configuration in Wizard Mode first, then switch to Script Mode for the field mapping.
On the Workflow canvas, hover over
ods_raw_log_d_starrocksand click Open Node.Set the source, resource group, and destination, then click Next to complete the connectivity test.
Parameter Value Source HttpFile — data source user_behavior_analysis_httpfileResource group Select the serverless resource group from Prepare the environment. Destination StarRocks — data source doc_starrocks_storage_compute_tightly_01Click Next and configure the source and destination settings:
Module Parameter Value Source File path /user_log.txtSource Text type textSource Column delimiter |Source Advanced configuration > Skip header NoSource — After configuring the source, click Confirm Data Structure. Destination Table ods_raw_log_d_starrocksDestination Statement run before writing ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE— drops the target partition before each sync to prevent duplicate data. For more information, see Configure scheduling.Destination StreamLoad request parameters {"row_delimiter": "\\x02", "column_separator": "\\x01"}In Channel Control, set Policy for dirty data to Disallow Dirty Data. For more information, see Configure a task in wizard mode.
Configure field mapping in Script Mode. The HttpFile data source does not support dynamic variables in Wizard Mode. Switch to Script Mode by clicking the
icon in the toolbar. In the source HttpFile configuration, add the following object to the columnarray to pass the scheduling parameter as thedtpartition value:{ "type": "STRING", "value": "${var}" }The complete script for this node is:
{ "type": "job", "version": "2.0", "steps": [ { "stepType": "httpfile", "parameter": { "fileName": "/user_log.txt", "nullFormat": "", "compress": "", "requestMethod": "GET", "connectTimeoutSeconds": 60, "column": [ { "index": 0, "type": "STRING" }, { "type": "STRING", "value": "${var}" } ], "skipHeader": "false", "encoding": "UTF-8", "fieldDelimiter": "|", "fieldDelimiterOrigin": "|", "socketTimeoutSeconds": 3600, "envType": 0, "datasource": "user_behavior_analysis_httpfile", "bufferByteSizeInKB": 1024, "fileFormat": "text" }, "name": "Reader", "category": "reader" }, { "stepType": "starrocks", "parameter": { "loadProps": { "row_delimiter": "\\x02", "column_separator": "\\x01" }, "envType": 0, "datasource": "doc_starrocks_storage_compute_tightly_01", "column": [ "col", "dt" ], "tableComment": "", "table": "ods_raw_log_d_starrocks", "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE ; " }, "name": "Writer", "category": "writer" }, { "copies": 1, "parameter": { "nodes": [], "edges": [], "groups": [], "version": "2.0" }, "name": "Processor", "category": "processor" } ], "setting": { "errorLimit": { "record": "0" }, "locale": "zh", "speed": { "throttle": false, "concurrent": 2 } }, "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] } }Click Run Configuration in the right-side pane and set the following parameters for the test run.
Parameter Value Resource group Select the serverless resource group from Prepare the environment. Script parameters Click Add Parameter and enter var=20250223. DataStudio replaces${var}with this value during the test run.(Optional) Click Properties to review scheduling settings. Scheduling parameters are inherited from the Workflow level. For details, see Configure node scheduling properties.
In the top toolbar, click Save.
Step 4: Run and verify
Run the Workflow
In the Workflow toolbar, click Run. When prompted, set the parameter variable for this run — this tutorial uses 20250223, but adjust the value as needed. Click OK and wait for all nodes to complete.
Verify the results
Log on to the DataWorks console. In the left-side navigation pane, choose Data Analysis and Service > DataAnalysis, then click Go to DataAnalysis. In the left navigation pane, click SQL Query.
Click the
icon next to My Files to create a new SQL query file and enter a name.Open the file, then click the
icon in the upper-right corner to configure the query settings:Parameter Value Workspace Select the workspace containing the user_profile_analysis_starrocksWorkflow.Data source type StarRocksData source name Select the StarRocks development environment bound in Prepare the environment. Click OK to apply the query data source.
Run the following queries to confirm that data was written to both StarRocks tables. Replace
your_business_datewith the Data Timestamp for your run — for a task run on20250223, the Data Timestamp is20250222(the previous day).SELECT * FROM ods_raw_log_d_starrocks WHERE dt=your_business_date; SELECT * FROM ods_user_info_d_starrocks WHERE dt=your_business_date;
What's next
With data now loaded into StarRocks, proceed to the next tutorial: Process data.