Build User Profile Tables with StarRocks DWD-ADS Layers in DataWorks - DataWorks

Use StarRocks nodes in DataWorks to transform synchronized data into a user profile table. By the end of this tutorial, you will have produced three output tables:

dwd_log_info_di_starrocks — a fact table with parsed website access log fields
dws_user_info_all_di_starrocks — a wide aggregate table joining log data with user info
ads_user_info_1d_starrocks — a compact user profile table ready for consumption

How it works

The raw data sits in two StarRocks tables after synchronization: ods_user_info_d_starrocks (basic user info from MySQL) and ods_raw_log_d_starrocks (website access logs from Object Storage Service (OSS)). Separating each transformation into its own layer keeps each step focused, makes failures easier to isolate, and lets you rerun individual nodes without reprocessing the entire pipeline:

Layer	Purpose
ODS (Operational Data Store)	Raw synchronized data — no transformation
DWD (Detail Wide Data)	Parse and clean raw fields into analyzable columns
DWS (Data Warehouse Summary)	Join cleaned log data with user info into a wide table
ADS (Application Data Store)	Aggregate per-user metrics for downstream consumption

Each layer maps to one StarRocks node in the DataWorks workflow. Running them in order through a directed acyclic graph (DAG) enforces correct dependencies.

Prerequisites

Before you begin, ensure that you have:

Synchronized the source data into StarRocks. See Synchronize data. Specifically:
- The basic user information stored in the ods_user_info_d table in MySQL must be synchronized to the ods_user_info_d_starrocks table in StarRocks using a data synchronization task in Data Integration.
- The website access logs stored in the user_log.txt object in Object Storage Service (OSS) must be synchronized to the ods_raw_log_d_starrocks table in StarRocks using a synchronization task in Data Integration.
Set enable_udf to TRUE in the FE section on the Instance Configuration tab of your EMR Serverless StarRocks instance, then restarted the instance Without this setting, Java user-defined function (UDF) registration fails with the following error:
```
FAILED: Getting analyzing error. Detail message: UDF is not enabled in FE, please configure enable_udf=true in fe/conf/fe.conf or .
```

Go to DataStudio

Log in to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace and click Go to Data Development.

Step 1: Design the workflow

In Business Flow, open the User profile analysis_StarRocks workflow. The workflow canvas appears.

In the upper part of the canvas, click Create Node to create the following three StarRocks nodes. Name each node after its output table.

Node name	Code logic
`dwd_log_info_di_starrocks`	Parse raw log fields using built-in functions and a UDF
`dws_user_info_all_di_starrocks`	Join parsed log data with user info to produce a wide table
`ads_user_info_1d_starrocks`	Aggregate the wide table into a compact user profile

Drag the nodes onto the canvas and draw dependency lines to form the DAG. When complete, the canvas shows three nodes connected left to right: dwd_log_info_di_starrocks → dws_user_info_all_di_starrocks → ads_user_info_1d_starrocks.

Step 2: Register the UDF

The dwd_log_info_di_starrocks node uses a getregion UDF to look up a geographic region from an IP address. Register the UDF before configuring the node.

Upload the function resource

Download ip2region-starrocks.jar to your local machine.
Log in to the OSS console and upload the JAR to your OSS bucket. For information about creating a bucket, see Create a bucket. Note the full path of the uploaded file — you will need it when registering the function. The path format is:
The OSS bucket must be in the same region as your DataWorks workspace. This tutorial uses China (Shanghai). Storage costs apply to the OSS bucket.
```
https://<bucket-name>.oss-cn-shanghai-internal.aliyuncs.com/<resource-path>
```
Example: https://test.oss-cn-shanghai-internal.aliyuncs.com/dataworks_starrocks/ip2region-starrocks.jar The -internal endpoint routes traffic through the internal network from Elastic Compute Service (ECS) over the classic network. The bucket name must be globally unique.

Create the function

In the left-side navigation pane of the DataStudio page, click . The Ad Hoc Query pane appears. Right-click Ad Hoc Query and choose Create Node > StarRocks.

Run the following SQL to register the function. Replace the "file" value with the full path of the uploaded JAR.

CREATE FUNCTION getregion(string)
RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.Ip2Region",
    "type" = "StarrocksJar",
    "file" = "<full path of the uploaded JAR>"
);

Verify that the function is registered.
```
SELECT getregion('<your IP address>');
```
A valid region string in the result confirms the UDF is working. For more information about creating resources and functions, see Create resources and functions.

Step 3: Configure the StarRocks nodes

Configure the dwd_log_info_di_starrocks node

This node parses the raw log entries in ods_raw_log_d_starrocks — each stored as a single delimited string — into separate analyzable fields, then writes them to dwd_log_info_di_starrocks.

Configure the node code

Double-click dwd_log_info_di_starrocks on the workflow canvas to open the configuration tab. Enter the following SQL.

Each run deletes the destination partition before inserting data (ALTER TABLE ... DROP PARTITION IF EXISTS p${var} FORCE). This prevents duplicate rows when a node reruns.

CREATE TABLE IF NOT EXISTS dwd_log_info_di_starrocks (
    uid STRING COMMENT 'The user ID',
    ip STRING COMMENT 'The IP address',
    TIME STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
    status STRING COMMENT 'The status code that is returned by the server',
    bytes STRING COMMENT 'The number of bytes that are returned to the client',
    region STRING COMMENT 'The region, which is obtained based on the IP address',
    method STRING COMMENT 'The HTTP request type',
    url STRING COMMENT 'url',
    protocol STRING COMMENT 'The version number of HTTP',
    referer STRING COMMENT 'The source URL',
    device STRING COMMENT 'The terminal type',
    identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
    dt DATE NOT NULL COMMENT 'The time'
) DUPLICATE KEY(uid)
COMMENT 'User behavior analysis case - The fact table for the website access logs of users'
PARTITION BY(dt)
PROPERTIES ("replication_num" = "1");

-- Delete the destination partition before each run to prevent duplicate writes on rerun.
ALTER TABLE dwd_log_info_di_starrocks DROP PARTITION IF EXISTS p${var} FORCE;

-- Parse each raw log entry (delimiter: ##@@) into separate fields.
-- getregion() resolves the IP address to a region using the registered UDF.
-- REGEXP_EXTRACT splits the HTTP request field into method, URL, and protocol.
-- Define ${var} as a scheduling parameter to process incremental data by date partition.
INSERT INTO dwd_log_info_di_starrocks
SELECT
    uid
    , ip
    , time
    , status
    , bytes
    , getregion(ip) AS region
    , REGEXP_EXTRACT(request, '([^ ]+)', 1) AS method
    , REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 1) AS url
    , REGEXP_EXTRACT(request, '([^ ]+)$', 1) AS protocol
    , REGEXP_EXTRACT(referer, '^[^/]+://([^/]+)', 1) AS referer
    , CASE
        WHEN LOWER(agent) REGEXP 'android' THEN 'android'
        WHEN LOWER(agent) REGEXP 'iphone' THEN 'iphone'
        WHEN LOWER(agent) REGEXP 'ipad' THEN 'ipad'
        WHEN LOWER(agent) REGEXP 'macintosh' THEN 'macintosh'
        WHEN LOWER(agent) REGEXP 'windows phone' THEN 'windows_phone'
        WHEN LOWER(agent) REGEXP 'windows' THEN 'windows_pc'
        ELSE 'unknown'
    END AS device
    , CASE
        WHEN LOWER(agent) REGEXP '(bot|spider|crawler|slurp)' THEN 'crawler'
        WHEN LOWER(agent) REGEXP 'feed' OR REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 0) REGEXP 'feed' THEN 'feed'
        WHEN NOT (LOWER(agent) REGEXP '(bot|spider|crawler|feed|slurp)')
             AND agent REGEXP '^(Mozilla|Opera)'
             AND NOT (REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 0) REGEXP 'feed') THEN 'user'
        ELSE 'unknown'
    END AS identity
    , CAST('${var}' AS DATE) AS dt
FROM (
    SELECT
        SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 1) AS ip
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 2) AS uid
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 3) AS time
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 4) AS request
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 5) AS status
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 6) AS bytes
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 7) AS referer
        , SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 8) AS agent
    FROM ods_raw_log_d_starrocks
    WHERE dt = '${var}'
) a;

Configure scheduling properties

In the right-side navigation pane of the node's configuration tab, click General and set the following parameters.

Section	Setting
Scheduling Parameter	Click Add Parameter. Set Parameter Name to `var` and Parameter Value to `$[yyyymmdd-1]`.
Dependencies	Click Parse Input and Output from Code to auto-configure scheduling dependencies from the table lineage in the node code. See Scheduling dependency configuration guide.

In the Schedule section, set Scheduling Cycle to Day. Leave Scheduled time at its default — the workshop_start_starrocks zero load node controls when the workflow runs (after 00:30 every day).

Save the node

Click the icon in the top toolbar to save.

Configure the dws_user_info_all_di_starrocks node

This node joins dwd_log_info_di_starrocks (parsed log data) with ods_user_info_d_starrocks (basic user info) on uid to produce a wide aggregate table.

Configure the node code

Double-click dws_user_info_all_di_starrocks on the canvas to open the configuration tab. Enter the following SQL.

CREATE TABLE IF NOT EXISTS dws_user_info_all_di_starrocks (
    uid STRING COMMENT 'The user ID',
    gender STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac STRING COMMENT 'The zodiac sign',
    region STRING COMMENT 'The region, which is obtained based on the IP address',
    device STRING COMMENT 'The terminal type',
    identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
    method STRING COMMENT 'The HTTP request type',
    url STRING COMMENT 'url',
    referer STRING COMMENT 'The source URL',
    TIME STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
    dt DATE NOT NULL COMMENT 'The time'
) DUPLICATE KEY(uid)
COMMENT 'User behavior analysis case - Wide table for the website access information about users'
PARTITION BY(dt)
PROPERTIES ("replication_num" = "1");

-- Delete the destination partition before each run to prevent duplicate writes on rerun.
ALTER TABLE dws_user_info_all_di_starrocks DROP PARTITION IF EXISTS p${var} FORCE;

-- LEFT JOIN preserves all log records even if a uid has no matching user info row.
-- ${var} is replaced at runtime by the scheduling parameter (yesterday's date).
INSERT INTO dws_user_info_all_di_starrocks
SELECT
    IFNULL(a.uid, b.uid) AS uid,
    b.gender,
    b.age_range,
    b.zodiac,
    a.region,
    a.device,
    a.identity,
    a.method,
    a.url,
    a.referer,
    a.time,
    a.dt
FROM dwd_log_info_di_starrocks a
LEFT JOIN ods_user_info_d_starrocks b
ON a.uid = b.uid
WHERE a.dt = '${var}';

Configure scheduling properties

Section	Setting
Scheduling Parameter	Click Add Parameter. Set Parameter Name to `var` and Parameter Value to `$[yyyymmdd-1]`.
Dependencies	Click Parse Input and Output from Code.

Set Scheduling Cycle to Day. The workshop_start_starrocks zero load node controls run time (after 00:30 every day).

Save the node

Click the icon to save.

Configure the ads_user_info_1d_starrocks node

This node aggregates the wide table in dws_user_info_all_di_starrocks into a compact per-user profile with page view counts.

The wide table contains a large number of fields and rows. Querying it directly for downstream consumption takes longer than querying this aggregated ADS table.

Configure the node code

Double-click ads_user_info_1d_starrocks on the canvas to open the configuration tab. Enter the following SQL.

CREATE TABLE IF NOT EXISTS ads_user_info_1d_starrocks (
    uid STRING COMMENT 'The user ID',
    region STRING COMMENT 'The region, which is obtained based on the IP address',
    device STRING COMMENT 'The terminal type',
    pv BIGINT COMMENT 'pv',
    gender STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac STRING COMMENT 'The zodiac sign',
    dt DATE NOT NULL COMMENT 'The time'
) DUPLICATE KEY(uid)
COMMENT 'User behavior analysis case - User profile data'
PARTITION BY(dt)
PROPERTIES ("replication_num" = "1");

-- Delete the destination partition before each run to prevent duplicate writes on rerun.
ALTER TABLE ads_user_info_1d_starrocks DROP PARTITION IF EXISTS p${var} FORCE;

-- GROUP BY uid and dt to produce one profile row per user per day.
-- COUNT(*) gives the page view count (pv) for the day.
INSERT INTO ads_user_info_1d_starrocks
SELECT
    uid,
    MAX(region) AS region,
    MAX(device) AS device,
    COUNT(*) AS pv,
    MAX(gender) AS gender,
    MAX(age_range) AS age_range,
    MAX(zodiac) AS zodiac,
    dt
FROM dws_user_info_all_di_starrocks
WHERE dt = '${var}'
GROUP BY uid, dt;

-- Verification query: inspect the wide table for the processed date.
SELECT * FROM dws_user_info_all_di_starrocks
WHERE dt = '${var}';

Configure scheduling properties

Section	Setting
Scheduling Parameter	Click Add Parameter. Set Parameter Name to `var` and Parameter Value to `$[yyyymmdd-1]`.
Dependencies	Click Parse Input and Output from Code.

Set Scheduling Cycle to Day. The workshop_start_starrocks zero load node controls run time (after 00:30 every day).

Save the node

Click the icon to save.

Step 4: Run the nodes

Run the workflow

Double-click User profile analysis_StarRocks under Business Flow to open the workflow canvas.
In the top toolbar, click the icon to run all nodes in dependency order.
Monitor node status. Nodes in the state are running normally.
To view execution details, right-click any node on the canvas and select View Log.

Verify the results

In the left-side navigation pane of the DataStudio page, click . In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > StarRocks. For more information, see Create an ad hoc query.

Query the result table to confirm the data was processed correctly.

-- Replace <data timestamp> with the date one day before the node run date.
-- For example, if the node ran on January 2, 2024, use 20240101.
SELECT * FROM ads_user_info_1d_starrocks WHERE dt = <data timestamp>;

Step 5: Deploy the nodes

After confirming the nodes run as expected, commit them to the development environment and deploy them to the production environment. Nodes in the development environment are not automatically scheduled. For more information, see Deploy Center.

Commit to the development environment

In the top toolbar of the workflow canvas, click the icon to commit all nodes, then click Confirm.

Deploy to the production environment

Go to the Create Deploy Task page using one of these methods:

In the top toolbar of the workflow configuration tab, click the icon.
In the upper-left corner of the DataStudio page, click the more icon and choose All Products > Data Development and Task Operation > Deploy.

On the Create Deploy Task page, deploy items individually or in bulk:

Single item: Find the item and click Deploy in the Actions column.
Multiple items: Select the items and click Deploy at the bottom of the page, or click Add to List to stage them in the Nodes to Deploy panel before deploying in batches.

When deploying multiple nodes, also deploy the related resources and functions in the workflow.

Step 6: Backfill data

After deploying the nodes to the production environment, backfill historical data to populate the output tables for past dates. For more information, see Backfill data and view data backfill instances (new version).

Go to Operation Center using one of these methods:
- In the upper-right corner of the DataStudio page, click Operation Center.
- In the upper-left corner of the DataStudio page, click the more icon and choose All Products > Data Development and Task Operation > Operation Center.
In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Find and click the workshop_start_starrocks node. Its DAG appears.
Right-click the workshop_start_starrocks node and choose Run > Current and Descendent Nodes Retroactively.
In the Backfill Data panel, select all descendant nodes, specify the data timestamps to backfill, then click Submit and Redirect. The Data Backfill page appears.

What's next

The workflow is complete. To query the output tables or consume the data through APIs, see Manage data and APIs.