This tutorial shows you how to use Data Integration Batch Synchronization tasks to synchronize data from a MySQL table (ods_user_info_d) and an Object Storage Service (OSS) log file (user_log.txt) to StarRocks tables (ods_user_info_d_starrocks and ods_raw_log_d_starrocks).
Prerequisites
Before you begin, ensure you have prepared the required environment. For details, see Prepare the environment.
1. Create data sources
To ensure that data can be processed in subsequent steps, you must add the following data sources to your DataWorks workspace to retrieve the raw data.
MySQL data source: This tutorial uses a data source named user_behavior_analysis_mysql to retrieve basic user information (ods_user_info_d) from MySQL.
HttpFile data source: This tutorial uses a data source named user_behavior_analysis_httpfile to retrieve user website access logs (user_log.txt) stored in OSS.
Create a MySQL data source (user_behavior_analysis_mysql)
The user information for this tutorial is stored in a MySQL database. Create a MySQL data source so that this data (ods_user_info_d) can be synchronized to StarRocks.
Go to the Data Source page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the navigation pane on the left, click Data Sources to go to the Data Sources page.
Click Add Data Source. Search for and select MySQL as the data source type.
On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.
The following table describes the key parameters. You can keep the default values for the other parameters.
Data Source Name: Enter a name for the data source. For this tutorial, enter user_behavior_analysis_mysql.
Data Source Description: This data source is used for DataWorks tutorials. Data is read from it when you configure a batch synchronization task to access the test data provided by the platform. The data source can be read only in Data Integration scenarios; other modules cannot use it.
Configuration Mode: Select Connection String Mode.
Connection Address: Host address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. Port: 3306.
Database Name: Enter the database name. For this tutorial, enter workshop.
Username: Enter the username. For this tutorial, enter workshop.
Password: Enter the password. For this tutorial, enter workshop#2017.
Authentication Method: No authentication.
In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.
Important: Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.
If no resource group is available, follow the prompts in the Connection Configuration section and click Purchase or Associate Purchased Resource Group.
Click Complete Creation.
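Optionally, before you build the synchronization pipeline, you can spot-check the source data with any MySQL client by connecting with the host, database, and account listed above. This is not a required tutorial step; the query below is only a sketch.

-- Optional sanity check: preview a few rows of the source table in the workshop database
SELECT * FROM ods_user_info_d LIMIT 10;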
Create an HttpFile data source (user_behavior_analysis_httpfile)
The user website access logs for this tutorial are stored in OSS. You need to create an HttpFile Data Source to synchronize the log data (user_log.txt) to StarRocks.
Click Data Sources in the navigation pane on the left.
Click Add Data Source. In the Add Data Source dialog box, search for and select HttpFile as the data source type.
On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.
The following table describes the key parameters. You can keep the default values for the other parameters.
Data Source Name: Enter the data source name. For this tutorial, enter user_behavior_analysis_httpfile.
Data Source Description: This data source is used for DataWorks tutorials. Data is read from it when you configure a batch synchronization task to access the test data provided by the platform. The data source can be read only in Data Integration scenarios; other modules cannot use it.
URL: Set the URL to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com for both the development and production environments.
In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.
Important: Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.
If no resource group is available, follow the prompts in the Connection Configuration section and click Purchase or Associate Purchased Resource Group.
Click Complete Creation.
2. Build the synchronization pipeline
Click the icon in the upper-left corner and select . Then, at the top of the page, switch to the workspace created for this tutorial.
In the left-side navigation pane, click the icon. In the Workspace Directories section, click the icon, select Create Workflow, and name the workflow. For this tutorial, name it user_profile_analysis_starrocks.
On the workflow canvas, create the nodes listed in the table below. You can find the components in the left-side pane and drag them onto the canvas. When you create the Batch Synchronization nodes, set Specific Type to Batch Synchronization and choose the matching Source Type (MySQL for the user data node, HttpFile for the OSS log node) and Destination Type (StarRocks).
Zero load node (workshop_start_starrocks): Manages the entire user profile analysis workflow and clarifies the data forwarding path. This node is a dry-run node and requires no code editing.
StarRocks node (ddl_ods_user_info_d_starrocks): Creates the destination StarRocks table ods_user_info_d_starrocks, which receives user data from the MySQL data source.
StarRocks node (ddl_ods_raw_log_d_starrocks): Creates the destination StarRocks table ods_raw_log_d_starrocks, which receives the website access logs from the OSS source.
Batch synchronization node (ods_user_info_d_starrocks): Synchronizes user information data from MySQL to the StarRocks table ods_user_info_d_starrocks.
Batch synchronization node (ods_raw_log_d_starrocks): Synchronizes user website access logs from OSS to the StarRocks table ods_raw_log_d_starrocks.
Connect the nodes so that workshop_start_starrocks is the ancestor node of both Batch Synchronization nodes.
Configure the workflow scheduling properties.
On the workflow canvas, click Scheduling in the right-side pane and configure the parameters. The following table describes the key parameters. You can keep the default values for the other parameters.
Scheduling Parameters: You can configure scheduling parameters for the entire workflow, and the inner nodes of the workflow can use them directly. In this tutorial, the parameter is set to bizdate=$[yyyymmdd-1] to obtain the date of the previous day.
Scheduling Cycle: Set to Day for this tutorial.
Scheduling Time: Set to 00:30 for this tutorial, which means the workflow starts daily at 00:30.
Scheduling Dependencies: The workflow has no upstream dependencies, so you do not need to configure any. For easier management, click Use Workspace Root Node to attach the workflow to the root node of the workspace. The workspace root node is named in the format workspace_name_root.
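As an illustration of how the scheduling parameter resolves (assuming a workflow instance whose scheduled run date is February 23, 2025), $[yyyymmdd-1] yields the previous day:

bizdate = $[yyyymmdd-1]   -- for an instance that runs at 00:30 on 2025-02-23, bizdate resolves to 20250222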
3. Configure synchronization tasks
Configure the initial node
On the workflow canvas, hover over the workshop_start_starrocks node and click Open Node.
On the right side of the workshop_start_starrocks node configuration page, click Properties and configure the parameters. The following are the key parameters. For parameters not listed, you can retain the default settings.
Scheduling type: For this tutorial, select Dry-run.
Resource group for scheduling: For this tutorial, select the serverless resource group created in the Prepare the environment step.
Scheduling dependencies: Because workshop_start_starrocks is the initial node and has no upstream dependencies, you can click Use Workspace Root Node. This allows the workspace root node to trigger the workflow. The workspace root node is named in the format WorkspaceName_root.
Create the user table ddl_ods_user_info_d_starrocks
First, use the ddl_ods_user_info_d_starrocks node to create the ods_user_info_d_starrocks table in StarRocks, which receives user data from the MySQL data source. You can create the table within this node or manually create it in the Data Catalog.
On the workflow canvas, hover over the ddl_ods_user_info_d_starrocks node and click Open Node.
Edit the table creation statement:

CREATE TABLE IF NOT EXISTS ods_user_info_d_starrocks (
    uid STRING COMMENT 'User ID',
    gender STRING COMMENT 'Gender',
    age_range STRING COMMENT 'Age range',
    zodiac STRING COMMENT 'Zodiac sign',
    dt STRING NOT NULL COMMENT 'Date'
)
DUPLICATE KEY(uid)
COMMENT 'User profile analysis use case - User information table'
PARTITION BY (dt)
PROPERTIES ("replication_num" = "1");

Configure debugging parameters.
On the right side of the StarRocks node configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.
Computing resource: Select the StarRocks compute resource that you bound in the Prepare the environment step.
Resource group: Select the serverless resource group that you purchased in the Prepare the environment step.
(Optional) Configure scheduling properties.
For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.
Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.
Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.
In the top toolbar, click Save to save the node.
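After the DDL node has run (in step 4), you can optionally confirm that the table exists and has the expected columns. The check below is only a sketch; run it from the SQL Query page described later, or from any client connected to your StarRocks instance.

-- Optional: inspect the column definitions of the new table
DESC ods_user_info_d_starrocks;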
Create the log table ddl_ods_raw_log_d_starrocks
First, use the ddl_ods_raw_log_d_starrocks node to create the ods_raw_log_d_starrocks table in StarRocks, which receives the user website access logs from the HttpFile data source. You can create the table within this node or manually create it in the Data Catalog.
On the workflow canvas, hover over the ddl_ods_raw_log_d_starrocks node and click Open Node.
Edit the table creation statement:

CREATE TABLE IF NOT EXISTS ods_raw_log_d_starrocks (
    col STRING COMMENT 'Log',
    dt DATE NOT NULL COMMENT 'Date'
)
DUPLICATE KEY(col)
COMMENT 'User profile analysis use case - Raw website access log table'
PARTITION BY (dt)
PROPERTIES ("replication_num" = "1");

Configure debugging parameters.
On the right side of the StarRocks node configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.
Computing resource: Select the StarRocks compute resource that you bound in the Prepare the environment step.
Resource group: Select the serverless resource group that you purchased in the Prepare the environment step.
(Optional) Configure scheduling properties.
For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.
Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.
Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.
In the top toolbar, click Save to save the node.
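As with the user table, once this DDL node has run you can optionally verify the log table definition, including its partitioning and properties. This is only a sketch, assuming you have query access to the StarRocks instance.

-- Optional: show the full CREATE TABLE statement of the log table
SHOW CREATE TABLE ods_raw_log_d_starrocks;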
Configure the user data synchronization pipeline (ods_user_info_d_starrocks)
On the workflow canvas, hover over the ods_user_info_d_starrocks node and click Open Node.
Configure the network and resources for the data synchronization pipeline.
Source: MySQL. Data Source name: user_behavior_analysis_mysql.
Resource group: Select the serverless resource group that you created in the Prepare the environment step.
Destination: StarRocks. Data Source name: doc_starrocks_storage_compute_tightly_01.
Configure the task.
Configure the source and destination.
Source settings:
Table: Select the MySQL table ods_user_info_d.
Split Key: We recommend using a primary key or an indexed column as the split key. Only integer-type fields are supported. For this tutorial, use the uid field as the split key.
Destination settings:
Table: Select the StarRocks table ods_user_info_d_starrocks.
Statement run before writing: This use case uses dynamic partitions based on the dt field. To prevent data duplication when the node is rerun, the following SQL statement deletes the target partition before each synchronization:

ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE

${var} is a variable that is assigned a value from the scheduling parameters during the scheduling configuration phase. This enables dynamic parameter input in scheduled scenarios. For more information, see Configure scheduling.
StreamLoad request parameters: The request parameters for Stream Load, which must be in JSON format. Non-printable characters are typically used as the row and column delimiters so that they do not collide with characters in the data:

{ "row_delimiter": "\\x02", "column_separator": "\\x01" }

Field Mapping.
Configure the Field Mapping to define how source fields are written to destination fields. You will also use a scheduling parameter variable to dynamically populate the StarRocks partition field, which ensures that data for each day is written to the corresponding partition in StarRocks.
Click Map Fields with Same Name to automatically map source fields to destination fields with matching names.
Click Add and enter '${var}'. Manually map this field to the dt field in StarRocks.
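To see how the ${var} placeholder behaves at run time, consider an illustrative run in which the parameter is assigned 20250223 (the debugging constant used later in this tutorial). The pre-write statement and the mapped partition value would then resolve roughly as follows; this is a sketch, not output copied from the product.

-- Statement run before writing, with ${var} replaced by 20250223
ALTER TABLE ods_user_info_d_starrocks DROP PARTITION IF EXISTS p20250223 FORCE;
-- The added constant '${var}' becomes the string '20250223' and is written to the dt field,
-- so every row synchronized in this run lands in that day's partition.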
Channel Control.
For this tutorial, set Policy for Dirty Data to Disallow Dirty Data and keep the default values for other settings. For more information, see Configure a task in wizard mode.
Configure debugging parameters.
On the right side of the Batch Synchronization task configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.
Resource group: Select the serverless resource group that you purchased in the Prepare the environment step.
Script parameters: Click Add Parameter and set the parameter to a specific constant in yyyymmdd format, such as var=20250223. During debugging, DataStudio replaces the variable defined in the task with this constant.
(Optional) Configure scheduling properties.
For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.
Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.
Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.
In the top toolbar, click Save to save the node.
Configure the log data synchronization pipeline (ods_raw_log_d_starrocks)
On the workflow canvas, hover over the ods_raw_log_d_starrocks node and click Open Node.
Configure the network and resources for the data synchronization pipeline.
After you configure the Source, Resource Group, and Destination, click Next and complete the connectivity test as prompted. The following table provides the configuration details.
Source: HttpFile. Data Source name: user_behavior_analysis_httpfile.
Resource group: Select the serverless resource group that you created in the Prepare the environment step.
Destination: StarRocks. Data Source name: doc_starrocks_storage_compute_tightly_01.
Click Next to configure the Data Synchronization task.
Configure the source and destination.
Source settings:
File path: /user_log.txt
Text type: text
Column delimiter: |
Skip Header: No
After configuring the source, click Confirm Data Structure.
Destination settings:
Table: Select the StarRocks table ods_raw_log_d_starrocks.
Statement run before writing: This use case uses dynamic partitions based on the dt field. To prevent data duplication when the node is rerun, the following SQL statement deletes the target partition before each synchronization:

ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE

${var} is a variable that is assigned a value from the scheduling parameters during the scheduling configuration phase. This enables dynamic parameter input in scheduled scenarios. For more information, see Configure scheduling.
StreamLoad request parameters: The request parameters for Stream Load, which must be in JSON format:

{ "row_delimiter": "\\x02", "column_separator": "\\x01" }

Channel Control.
For this tutorial, set Policy for Dirty Data to Disallow Dirty Data and keep the default values for other settings. For more information, see Configure a task in wizard mode.
Field Mapping.
In the toolbar, switch from Wizard Mode to Script Mode by clicking the icon. This lets you configure the field mapping for the HttpFile data source and dynamically assign a value to the dt partition field in StarRocks.
In the source HttpFile configuration, add the following to the column section:

{ "type": "STRING", "value": "${var}" }

The following is the complete script for the ods_raw_log_d_starrocks node:

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "httpfile",
      "parameter": {
        "fileName": "/user_log.txt",
        "nullFormat": "",
        "compress": "",
        "requestMethod": "GET",
        "connectTimeoutSeconds": 60,
        "column": [
          { "index": 0, "type": "STRING" },
          { "type": "STRING", "value": "${var}" }
        ],
        "skipHeader": "false",
        "encoding": "UTF-8",
        "fieldDelimiter": "|",
        "fieldDelimiterOrigin": "|",
        "socketTimeoutSeconds": 3600,
        "envType": 0,
        "datasource": "user_behavior_analysis_httpfile",
        "bufferByteSizeInKB": 1024,
        "fileFormat": "text"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "starrocks",
      "parameter": {
        "loadProps": {
          "row_delimiter": "\\x02",
          "column_separator": "\\x01"
        },
        "envType": 0,
        "datasource": "doc_starrocks_storage_compute_tightly_01",
        "column": [ "col", "dt" ],
        "tableComment": "",
        "table": "ods_raw_log_d_starrocks",
        "preSql": "ALTER TABLE ods_raw_log_d_starrocks DROP PARTITION IF EXISTS p${var} FORCE ; "
      },
      "name": "Writer",
      "category": "writer"
    },
    {
      "copies": 1,
      "parameter": {
        "nodes": [],
        "edges": [],
        "groups": [],
        "version": "2.0"
      },
      "name": "Processor",
      "category": "processor"
    }
  ],
  "setting": {
    "errorLimit": { "record": "0" },
    "locale": "zh",
    "speed": { "throttle": false, "concurrent": 2 }
  },
  "order": {
    "hops": [ { "from": "Reader", "to": "Writer" } ]
  }
}
Configure debugging parameters.
On the right side of the Batch Synchronization task configuration page, click Run Configuration. Configure the following parameters to use for the test run in Step 4.
Resource group: Select the serverless resource group that you purchased in the Prepare the environment step.
Script parameters: Click Add Parameter and set the parameter to a specific constant in yyyymmdd format, such as var=20250223. During debugging, DataStudio replaces the variable defined in the task with this constant.
(Optional) Configure scheduling properties.
For this tutorial, you can keep the default scheduling parameters. On the right side of the node configuration page, click Properties. For more information about the parameters, see Configure node scheduling properties.
Scheduling parameters: These are already configured at the Workflow level, so no node-level configuration is needed. They can be used directly in the task or code.
Scheduling policies: You can use the Time for Delayed Execution parameter to specify a delay for the child node's execution after the Workflow runs. This is not configured for this tutorial.
In the top toolbar, click Save to save the node.
4. Run tasks
Sync the data.
In the workflow toolbar, click Run. Set the values for the parameter variables that are defined in each node for this run. This tutorial uses 20250223, but you can modify the value as needed. Then click OK and wait for the run to complete.
Query the results.
Go to the SQL Query page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, click Go to DataAnalysis. In the left-side navigation pane of the page that appears, click SQL Query.
Configure an SQL query file.
Click the icon next to My Files to create a new file, and then enter a name for the SQL query file.
Click the newly created file to open the editor.
In the upper-right corner of the editor, click the icon to configure the workspace and other settings for the query. Configure the following settings:
Workspace: Select the workspace where the user_profile_analysis_starrocks workflow is located.
Data Source type: Select StarRocks from the drop-down list.
Data Source name: Select the StarRocks development environment that you bound in the Prepare the environment step.
Click OK to finish configuring the query data source.
Edit the query SQL.
After the nodes run successfully, execute the following queries to verify that the data was written to the StarRocks tables.
-- Update your_business_date with the correct Data Timestamp. For a task run on 20250223, the Data Timestamp is 20250222 (the previous day).
SELECT * FROM ods_raw_log_d_starrocks WHERE dt = your_business_date;
SELECT * FROM ods_user_info_d_starrocks WHERE dt = your_business_date;
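If you are unsure which dt value a given run wrote, a quick way to confirm the load is to group by the partition column and count rows:

-- Confirm that data was loaded and see which partitions were written
SELECT dt, COUNT(*) AS row_count FROM ods_user_info_d_starrocks GROUP BY dt;
SELECT dt, COUNT(*) AS row_count FROM ods_raw_log_d_starrocks GROUP BY dt;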
Next steps
Now that you have synchronized the data, proceed to the next tutorial to learn how to process and analyze it: Process data.