This topic describes how to use StarRocks nodes in DataWorks to process the data that is synchronized to the ods_user_info_d_starrocks and ods_raw_log_d_starrocks tables in StarRocks and obtain user profile data. The ods_user_info_d_starrocks table stores basic user information, and the ods_raw_log_d_starrocks table stores user website access logs. This topic helps you understand how to use DataWorks and StarRocks to compute and analyze the synchronized data and complete simple data processing in a data warehouse.
Prerequisites
Before you start this tutorial, complete the steps in Synchronize data.
Step 1: Design a data processing link
In the data synchronization phase, the required data is synchronized to StarRocks tables. The next objective is to further process the data to generate the basic user profile data.
Log on to the DataWorks console and go to the DATA STUDIO pane of the Data Studio page. In the Workspace Directories section of the DATA STUDIO pane, find the prepared workflow and click the workflow name to go to the configuration tab of the workflow.
Drag StarRocks from the Database section of the configuration tab to the canvas on the right. In the Create Node dialog box, configure the Node Name parameter.
In this tutorial, you need to create three StarRocks nodes. The following table lists node names that are used in this tutorial and the functionalities of the nodes.
Node type: StarRocks
Node name: dwd_log_info_di_starrocks
Node functionality: This node is used to split data in the ods_raw_log_d_starrocks table into multiple fields by using a built-in function or user-defined function (UDF) and synchronize the result to the dwd_log_info_di_starrocks table.

Node type: StarRocks
Node name: dws_user_info_all_di_starrocks
Node functionality: This node is used to aggregate data in the basic user information table ods_user_info_d_starrocks and the log data table dwd_log_info_di_starrocks and synchronize the aggregation result to the dws_user_info_all_di_starrocks table.

Node type: StarRocks
Node name: ads_user_info_1d_starrocks
Node functionality: This node is used to further process data in the dws_user_info_all_di_starrocks table and synchronize the processed data to the ads_user_info_1d_starrocks table to generate a basic user profile.

Draw lines to configure ancestor nodes for the StarRocks nodes, as shown in the following figure.
Note: You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes. In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Method 1: Configure scheduling dependencies based on the lineage in the code of a node.
Step 2: Register a function
You can use methods such as functions to convert the log data used in this experiment into structured table data.
In this example, the required resources are provided for the function that converts IP addresses into regions. You only need to download the resources to your on-premises machine, store them in an Object Storage Service (OSS) bucket, and register them as a function by following the steps below.
The IP address resources for this function are used only in this tutorial. If you need to map IP addresses to geographical locations in formal business scenarios, obtain professional IP address conversion services from specialized IP address service providers.
Upload a resource (ip2region-starrocks.jar)
Download the ip2region-starrocks.jar package.
Note: The ip2region-starrocks.jar package is used only in this tutorial.
Upload the resource to OSS.
Log on to the OSS console and go to the Buckets page. On the Buckets page, find the bucket that is created when you prepare environments and create the dataworks_starrocks directory in the bucket.
Upload the ip2region-starrocks.jar package to the dataworks_starrocks directory.
In this tutorial, the full path in which the package is stored is https://test.oss-cn-shanghai-internal.aliyuncs.com/dataworks_starrocks/ip2region-starrocks.jar. You can determine the storage path of your uploaded OSS resource based on this example.
Note: In this tutorial, a bucket named test is used.
The network address of the bucket to which the UDF belongs is the address that is used for Access from ECS over the Classic Network (internal network).
If you use an internal endpoint, the OSS bucket must reside in the same region as the DataWorks workspace. In this example, the China (Shanghai) region is used.
Register a function (getregion)
Create a StarRocks node to register a function.
Log on to the DataWorks console and go to the DATA STUDIO pane of the Data Studio page. In the Workspace Directories section of the DATA STUDIO pane, click the icon and choose the option to create a StarRocks node.
Write code to register the function.
Register a function.
CREATE FUNCTION getregion(string) RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.Ip2Region",
    "type" = "StarrocksJar",
    "file" = "Enter the full storage path of the OSS bucket. You can obtain the path in the preceding substep."
);
Check whether the function is registered.
SELECT getregion('The IP address of your on-premises machine');
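For reference, with the sample storage path used earlier in this tutorial, the filled-in registration statement would look similar to the following. The path is only an example; replace it with the actual path of your uploaded ip2region-starrocks.jar package.
CREATE FUNCTION getregion(string) RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.Ip2Region",
    "type" = "StarrocksJar",
    "file" = "https://test.oss-cn-shanghai-internal.aliyuncs.com/dataworks_starrocks/ip2region-starrocks.jar"
);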
Important: A function can be registered only once in each of the development environment and the production environment. You must deploy the StarRocks node to the production environment before you can register the function in the production environment.
In the top toolbar of the configuration tab of the StarRocks node, click Save. Then, click Deploy and follow the instructions on the DEPLOY tab to deploy the StarRocks node to the StarRocks computing resources in the development and production environments. Then, backfill data for the StarRocks node to complete the registration of the function in the production environment. After the function is registered, manually freeze the StarRocks node in the production environment in Operation Center to prevent the node from failing due to repeated registration.
Step 3: Configure the StarRocks nodes
To perform data processing, you must schedule the related StarRocks node to implement each layer of processing logic. In this tutorial, complete sample code for data processing is provided. You must configure the code separately for the dwd_log_info_di_starrocks, dws_user_info_all_di_starrocks, and ads_user_info_1d_starrocks nodes.
Configure the dwd_log_info_di_starrocks node
In the sample code for this node, the registered function is used to process fields in the ancestor table ods_raw_log_d_starrocks, and the processed data is synchronized to the dwd_log_info_di_starrocks table.
In the canvas of the workflow, move the pointer over the dwd_log_info_di_starrocks node and click Open Node.
On the configuration tab of the node, select the StarRocks computing resource that is associated with the workspace when you prepare environments from the Select DataSource drop-down list.
Copy the following SQL statements and paste them in the code editor:
Note: In the sample code for the dwd_log_info_di_starrocks node, the registered function is used to process fields in the ancestor table ods_raw_log_d_starrocks, and the processed data is synchronized to the dwd_log_info_di_starrocks table.
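The complete sample code for this node is provided in the tutorial and is not reproduced in this document. The following statement is only a minimal, hypothetical sketch of the processing logic. It assumes that ods_raw_log_d_starrocks stores the user identifier in a uid column and the raw log line in a col column, that both tables are partitioned by a dt column, and that the target columns of dwd_log_info_di_starrocks match the aliases below. All column names, regular expressions, and the ${var} parameter reference are assumptions for illustration only.
-- Hypothetical sketch: parse each raw log line and resolve the region by using the getregion UDF.
INSERT INTO dwd_log_info_di_starrocks
SELECT
    uid,                                                               -- assumed user identifier, used for joins in downstream nodes
    regexp_extract(col, '([^ ]*) - -', 1)                    AS ip,     -- client IP address parsed from the raw log line
    getregion(regexp_extract(col, '([^ ]*) - -', 1))         AS region, -- region resolved by the registered UDF
    regexp_extract(col, '\\[([^\\]]*)\\]', 1)                AS time,   -- access time
    split_part(regexp_extract(col, '"([^"]*)"', 1), ' ', 1)  AS method, -- HTTP request method
    split_part(regexp_extract(col, '"([^"]*)"', 1), ' ', 2)  AS url,    -- requested URL
    regexp_extract(col, '" ([0-9]*) ', 1)                    AS status, -- HTTP status code
    '${var}'                                                 AS dt      -- partition value passed in by the var script parameter
FROM ods_raw_log_d_starrocks
WHERE dt = '${var}';
When you debug the workflow, Data Studio replaces ${var} with the constant that you specify in the debugging configurations, such as 20250223.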
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Parameter
Description
Computing Resource
Select the StarRocks computing resource that is associated with the workspace when you prepare environments.
Resource Group
Select the serverless resource group that you purchase when you prepare environments.
Script Parameters
In the Parameter Value column of the var parameter, enter a constant value in the yyyymmdd format. Example: var=20250223. When you debug the workflow, Data Studio replaces the variables defined for nodes in the workflow with the constant.
Optional. Configure scheduling properties.
You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.
Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.
Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the batch synchronization node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.
In the top toolbar of the configuration tab, click Save to save the node.
Configure the dws_user_info_all_di_starrocks node
This node is used to aggregate data in the basic user information table ods_user_info_d_starrocks and the log data table dwd_log_info_di_starrocks and synchronize the aggregation result to the dws_user_info_all_di_starrocks table.
In the canvas of the workflow, move the pointer over the dws_user_info_all_di_starrocks node and click Open Node.
On the configuration tab of the node, select the StarRocks computing resource that is associated with the workspace when you prepare environments from the Select DataSource drop-down list.
Copy the following SQL statements and paste them in the code editor:
Note: On the configuration tab of the dws_user_info_all_di_starrocks node, write code for aggregating data in the ancestor tables dwd_log_info_di_starrocks and ods_user_info_d_starrocks and synchronizing the aggregation result to the dws_user_info_all_di_starrocks table.
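The complete sample code for this node is provided in the tutorial and is not reproduced in this document. The following statement is only a minimal, hypothetical sketch of the aggregation logic. It assumes that both source tables contain a uid column and a dt partition column and that the listed columns exist. All column names and the ${var} parameter reference are assumptions for illustration only.
-- Hypothetical sketch: join the processed log data with the basic user information.
INSERT INTO dws_user_info_all_di_starrocks
SELECT
    u.uid,            -- user identifier shared by both source tables
    u.gender,
    u.age_range,
    u.zodiac,
    l.region,
    l.method,
    l.url,
    l.status,
    '${var}' AS dt    -- partition value passed in by the var script parameter
FROM dwd_log_info_di_starrocks l
JOIN ods_user_info_d_starrocks u ON l.uid = u.uid
WHERE l.dt = '${var}'
  AND u.dt = '${var}';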
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Parameter
Description
Computing Resource
Select the StarRocks computing resource that is associated with the workspace when you prepare environments.
Resource Group
Select the serverless resource group that you purchase when you prepare environments.
Script Parameters
In the Parameter Value column of the var parameter, enter a constant value in the yyyymmdd format. Example: var=20250223. When you debug the workflow, Data Studio replaces the variables defined for nodes in the workflow with the constant.
Optional. Configure scheduling properties.
You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.
Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.
Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the batch synchronization node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.
In the top toolbar of the configuration tab, click Save to save the node.
Configure the ads_user_info_1d_starrocks node
This node is used to further process data in the dws_user_info_all_di_starrocks table and synchronize the processed data to the ads_user_info_1d_starrocks table to generate a basic user profile.
In the canvas of the workflow, move the pointer over the ads_user_info_1d_starrocks node and click Open Node.
On the configuration tab of the node, select the StarRocks computing resource that is associated with the workspace when you prepare environments from the Select DataSource drop-down list.
Copy the following SQL statements and paste them in the code editor:
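The complete sample code for this node is provided in the tutorial and is not reproduced in this document. The following statement is only a minimal, hypothetical sketch of the final aggregation. It assumes that dws_user_info_all_di_starrocks contains the listed columns and a dt partition column. All column names and the ${var} parameter reference are assumptions for illustration only.
-- Hypothetical sketch: aggregate the joined data into one profile row per user.
INSERT INTO ads_user_info_1d_starrocks
SELECT
    uid,
    MAX(region)    AS region,     -- one representative value per user
    MAX(gender)    AS gender,
    MAX(age_range) AS age_range,
    MAX(zodiac)    AS zodiac,
    COUNT(*)       AS pv,         -- number of page views of the user on the data timestamp
    '${var}'       AS dt          -- partition value passed in by the var script parameter
FROM dws_user_info_all_di_starrocks
WHERE dt = '${var}'
GROUP BY uid;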
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Parameter
Description
Computing Resource
Select the StarRocks computing resource that is associated with the workspace when you prepare environments.
Resource Group
Select the serverless resource group that you purchase when you prepare environments.
Script Parameters
In the Parameter Value column of the var parameter, enter a constant value in the yyyymmdd format. Example: var=20250223. When you debug the workflow, Data Studio replaces the variables defined for nodes in the workflow with the constant.
Optional. Configure scheduling properties.
You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the node configuration tab. For information about parameters on the Properties tab, see Scheduling properties.
Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.
Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the batch synchronization node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.
In the top toolbar of the configuration tab, click Save to save the node.
Step 4: Process data
Run the workflow.
In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify the value that is used for the scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.
Query the result.
Go to the SQL Query page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose the entry for DataAnalysis. On the page that appears, click Go to DataAnalysis. In the left-side navigation pane of the page that appears, click SQL Query.
Configure an SQL query file.
In the SQL Query pane, click the icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter. In the left-side navigation tree, find the created SQL query file and click the file name to go to the configuration tab of the file.
In the upper-right corner of the configuration tab, click the icon. In the popover that appears, configure the following parameters.
Parameter
Description
Workspace
Select the workspace to which the user_profile_analysis_starrocks workflow belongs.
Data Source Type
Select StarRocks from the drop-down list.
Data Source Name
Select the StarRocks computing resource that is associated with the workspace when you prepare environments.
Click OK.
Write an SQL statement for the query.
After all nodes in this topic are successfully run, write and execute the following SQL statement to check whether the tables generated by the StarRocks nodes contain the expected data.
-- In the query statement, change the partition key value to the data timestamp of the ads_user_info_1d_starrocks node. For example, if the node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.
SELECT * FROM ads_user_info_1d_starrocks WHERE dt=The data timestamp;
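For example, if the data timestamp is 20250222, the query would look similar to the following statement. Whether the value must be enclosed in quotation marks depends on the data type of the dt column in your table.
SELECT * FROM ads_user_info_1d_starrocks WHERE dt = '20250222';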
Step 5: Deploy the workflow
An auto triggered node can be automatically scheduled to run only after you deploy the node to the production environment. You can refer to the following steps to deploy the workflow to the production environment:
In this tutorial, scheduling parameters are configured for the workflow when you configure scheduling properties for the workflow. You do not need to separately configure scheduling parameters for each node in the workflow.
In the left-side navigation pane of the Data Studio page, click the icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the configuration tab of the workflow.
In the top toolbar of the configuration tab, click Deploy.
On the DEPLOY tab, click Start Deployment to Production Environment to deploy the workflow by following the on-screen instructions.
Step 6: Run the nodes in the production environment
After you deploy the nodes, the instances generated for the nodes are scheduled to run starting from the next day. You can use the data backfill feature to backfill data for nodes in a deployed workflow, which allows you to check whether the nodes can run in the production environment. For more information, see Backfill data and view data backfill instances (new version).
After all nodes in the workflow are deployed, click Operation Center in the upper-right corner of the configuration tab of the node.
You can also click the icon in the upper-left corner of the DataWorks console and choose Operation Center. In the left-side navigation pane of the Operation Center page, go to the Auto Triggered Nodes page. Find the zero load node workshop_start_starrocks and click the node name.
In the directed acyclic graph (DAG) of the node, right-click the workshop_start_starrocks node and choose the option to backfill data.
In the Backfill Data panel, select the nodes for which you want to backfill data, configure the Data Timestamp parameter, and then click Submit and Redirect.
In the upper part of the Data Backfill page, click Refresh to check whether the workshop_start_starrocks node and its descendant nodes are successfully run.
To prevent excessive fees from being generated after you complete the operations in this tutorial, you can configure the Effective Period parameter for all nodes in the workflow or freeze the zero load node workshop_start_starrocks.
What to do next
Display data in a visualized manner: After you complete user profile analysis, use DataAnalysis to display the processed data in charts. This helps you quickly extract key information to gain insights into the business trends behind the data.
Monitor data quality: Configure monitoring rules for tables that are generated after data processing to help identify and intercept dirty data in advance to prevent the impacts of dirty data from escalating.
Manage data: Data tables are generated in StarRocks after user profile analysis is complete. You can view the generated data tables in Data Map and view the relationships between the tables based on data lineages.
Use DataService Studio APIs to provide services: After you obtain the final processed data, use standardized APIs in DataService Studio to share data and to provide data for other business modules that use APIs to receive data.