All Products
Search
Document Center

DataWorks:Process data

Last Updated:Mar 26, 2026

This tutorial walks you through processing synchronized data in StarRocks using DataWorks to generate user profile data. You process data from two source tables — ods_user_info_d_starrocks (basic user information) and ods_raw_log_d_starrocks (website access logs) — through three pipeline stages to produce a final user profile table.

By the end of this tutorial, you will know how to:

  • Design a multi-layer data processing pipeline in DataWorks

  • Register a user-defined function (UDF) that converts IP addresses to regions

  • Configure StarRocks nodes with SQL for each processing stage

  • Run and validate the pipeline in the development environment

  • Deploy the workflow to production and backfill data

Prerequisites

Before you begin, ensure that you have:

  • Completed Synchronize data — this step creates the source tables in StarRocks and configures the workflow

  • A DataWorks workspace with a StarRocks computing resource

  • A serverless resource group purchased and associated with the workspace

  • An Object Storage Service (OSS) bucket in the same region as the workspace (required in Step 2 for the UDF JAR file)

Step 1: Design the data processing pipeline

The synchronized data sits in operation data store (ODS) tables. To generate user profile data, you process it through three warehouse layers:

  • DWD (data warehouse detail layer): Cleans and enriches raw log data — splits delimited fields, parses HTTP request components, and maps IP addresses to regions.

  • DWS (data warehouse summary layer): Joins the enriched log data with basic user information to build a wide table.

  • ADS (application data store): Aggregates the wide table by user to produce the final user profile.

Each layer is implemented as a StarRocks node in DataWorks.

Create the three nodes:

  1. Log on to the DataWorks console and go to Data Studio. In Workspace Directories, open your workflow.

  2. From the Database section of the configuration tab, drag StarRocks onto the canvas. In the Create Node dialog, enter the node name. Repeat this for all three nodes:

    Node type Node name What it does
    StarRocks dwd_log_info_di_starrocks Splits raw log fields in ods_raw_log_d_starrocks using a UDF and regular expressions, then writes the parsed data to dwd_log_info_di_starrocks
    StarRocks dws_user_info_all_di_starrocks Joins dwd_log_info_di_starrocks with ods_user_info_d_starrocks and writes the merged result to dws_user_info_all_di_starrocks
    StarRocks ads_user_info_1d_starrocks Aggregates dws_user_info_all_di_starrocks by user to produce the final profile in ads_user_info_1d_starrocks
  3. Draw lines between the nodes to set their upstream dependencies, as shown below.

    You can connect nodes manually to define their upstream and downstream dependencies. DataWorks can also infer dependencies automatically from child node code. This tutorial uses manual connections. For details, see Automatic parsing mechanism.

    image

Step 2: Register a UDF

The DWD node uses a UDF called getregion to convert IP addresses to geographical regions. This requires uploading a JAR file to OSS and registering it as a function in DataWorks.

Note
  • The ip2region-starrocks.jar package is provided for this tutorial only. For production use, use a professional IP-to-location service.

  • A function can be registered in each environment only once. Deploy the node to production before registering there.

Upload ip2region-starrocks.jar to OSS

  1. Download ip2region-starrocks.jar.

  2. Log on to the OSS console. Open the bucket you created when preparing environments and create a directory named dataworks_starrocks.

  3. Upload ip2region-starrocks.jar to the dataworks_starrocks directory. The full internal path follows this pattern: https://<bucket-name>.oss-<region>-internal.aliyuncs.com/dataworks_starrocks/ip2region-starrocks.jar In this tutorial, the bucket is named test and is in the China (Shanghai) region, so the path is: https://test.oss-cn-shanghai-internal.aliyuncs.com/dataworks_starrocks/ip2region-starrocks.jar

    - Use the internal endpoint (access from ECS over the classic network). The OSS bucket must be in the same region as the DataWorks workspace. - Copy this path — you need it in the next step when registering the function.

Register the getregion function

  1. In Data Studio, click image and choose Create Node > Database > StarRocks to create a new StarRocks node for function registration.

  2. Paste the following SQL into the node editor. Replace the file value with the OSS path from the previous step.

    CREATE FUNCTION getregion(string)
    RETURNS string
    PROPERTIES (
        "symbol" = "com.starrocks.udf.sample.Ip2Region",
        "type" = "StarrocksJar",
        "file" = "<full-oss-internal-path>/dataworks_starrocks/ip2region-starrocks.jar"
    );

    To verify the function works, run:

    SELECT getregion('<your-ip-address>');
  3. Click Save, then Deploy to publish the node to the development and production environments. After deployment, backfill data for the node to complete registration in production.

  4. After registration, go to Operation Center and manually freeze this node in the production environment to prevent failures from repeated registration attempts.

Step 3: Configure the StarRocks nodes

Each node runs SQL that creates a destination table, drops the current partition to prevent duplicate writes on rerun, and then inserts processed data. All three nodes use a scheduling variable ${var} (in yyyymmdd format) as the partition key.

Configure dwd_log_info_di_starrocks

This node parses raw log fields from ods_raw_log_d_starrocks. Each raw row is a single delimited string (##@@ separator). The SQL splits it into fields, applies REGEXP_EXTRACT to parse the HTTP request, and calls getregion(ip) to resolve the region.

  1. In the workflow canvas, hover over the dwd_log_info_di_starrocks node and click Open Node.

  2. From the Select DataSource drop-down list, select the StarRocks computing resource associated with your workspace.

  3. Paste the following SQL into the code editor:

    CREATE TABLE IF NOT EXISTS dwd_log_info_di_starrocks (
        uid STRING COMMENT 'The user ID',
        ip STRING COMMENT 'The IP address',
        TIME STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
        status STRING COMMENT 'The status code returned by the server',
        bytes STRING COMMENT 'The number of bytes returned to the client',
        region STRING COMMENT 'The region, obtained based on the IP address',
        method STRING COMMENT 'The HTTP request type',
        url STRING COMMENT 'url',
        protocol STRING COMMENT 'The HTTP version number',
        referer STRING COMMENT 'The source URL',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type: crawler, feed, user, or unknown',
        dt DATE NOT NULL COMMENT 'The time'
    ) DUPLICATE KEY(uid)
    COMMENT 'User behavior analysis case - The fact table for the user website access logs'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");
    
    -- Drop the destination partition before each run to prevent duplicate writes on rerun.
    ALTER TABLE dwd_log_info_di_starrocks DROP PARTITION IF EXISTS p${var} FORCE;
    
    -- Parse raw log data: split delimited fields, extract HTTP request components,
    -- resolve region from IP using the getregion UDF, and classify device and identity.
    -- ${var} is replaced at runtime by the scheduling parameter (e.g., 20250223).
    INSERT INTO dwd_log_info_di_starrocks
    SELECT
        uid,
        ip,
        time,
        status,
        bytes,
        getregion(ip) AS region,
        REGEXP_EXTRACT(request, '([^ ]+)', 1) AS method,
        REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 1) AS url,
        REGEXP_EXTRACT(request, '([^ ]+)$', 1) AS protocol,
        REGEXP_EXTRACT(referer, '^[^/]+://([^/]+)', 1) AS referer,
        CASE
            WHEN LOWER(agent) REGEXP 'android' THEN 'android'
            WHEN LOWER(agent) REGEXP 'iphone' THEN 'iphone'
            WHEN LOWER(agent) REGEXP 'ipad' THEN 'ipad'
            WHEN LOWER(agent) REGEXP 'macintosh' THEN 'macintosh'
            WHEN LOWER(agent) REGEXP 'windows phone' THEN 'windows_phone'
            WHEN LOWER(agent) REGEXP 'windows' THEN 'windows_pc'
            ELSE 'unknown'
        END AS device,
        CASE
            WHEN LOWER(agent) REGEXP '(bot|spider|crawler|slurp)' THEN 'crawler'
            WHEN LOWER(agent) REGEXP 'feed' OR REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 0) REGEXP 'feed' THEN 'feed'
            WHEN NOT (LOWER(agent) REGEXP '(bot|spider|crawler|feed|slurp)')
                 AND agent REGEXP '^(Mozilla|Opera)'
                 AND NOT (REGEXP_EXTRACT(request, '^[^ ]+ (.*) [^ ]+$', 0) REGEXP 'feed') THEN 'user'
            ELSE 'unknown'
        END AS identity,
        CAST('${var}' AS DATE) AS dt
    FROM (
        SELECT
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 1) AS ip,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 2) AS uid,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 3) AS time,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 4) AS request,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 5) AS status,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 6) AS bytes,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 7) AS referer,
            SPLIT_PART(CAST(col AS VARCHAR(65533)), '##@@', 8) AS agent
        FROM ods_raw_log_d_starrocks
        WHERE dt = '${var}'
    ) a;
  4. In the right-side navigation pane, click Run Configuration and set the following parameters. These are used when you run the workflow in Step 4.

    Parameter Description
    Computing Resource Select the StarRocks computing resource associated with your workspace
    Resource Group Select the serverless resource group you purchased
    Script Parameters For var, enter a date in yyyymmdd format — for example: var=20250223
  5. (Optional) Click Scheduling Configuration to review scheduling settings. For this tutorial, leave them at their defaults. Scheduling parameters are inherited from the workflow and apply to all nodes automatically.

  6. Click Save.

Configure dws_user_info_all_di_starrocks

This node joins the parsed log data with basic user information using a LEFT JOIN on uid, then writes the merged result to a wide table.

  1. In the workflow canvas, hover over dws_user_info_all_di_starrocks and click Open Node.

  2. From Select DataSource, select the StarRocks computing resource.

  3. Paste the following SQL into the code editor:

    CREATE TABLE IF NOT EXISTS dws_user_info_all_di_starrocks (
        uid STRING COMMENT 'The user ID',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        region STRING COMMENT 'The region, obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type: crawler, feed, user, or unknown',
        method STRING COMMENT 'The HTTP request type',
        url STRING COMMENT 'url',
        referer STRING COMMENT 'The source URL',
        TIME STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
        dt DATE NOT NULL COMMENT 'The time'
    ) DUPLICATE KEY(uid)
    COMMENT 'User behavior analysis case - Wide table for the website access information about users'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");
    
    -- Drop the destination partition before each run to prevent duplicate writes on rerun.
    ALTER TABLE dws_user_info_all_di_starrocks DROP PARTITION IF EXISTS p${var} FORCE;
    
    -- Merge enriched log data with basic user information (gender, age range, zodiac sign).
    -- Uses LEFT JOIN so log records without a matching user profile are still included.
    INSERT INTO dws_user_info_all_di_starrocks
    SELECT
        IFNULL(a.uid, b.uid) AS uid,
        b.gender,
        b.age_range,
        b.zodiac,
        a.region,
        a.device,
        a.identity,
        a.method,
        a.url,
        a.referer,
        a.time,
        a.dt
    FROM dwd_log_info_di_starrocks a
    LEFT JOIN ods_user_info_d_starrocks b ON a.uid = b.uid
    WHERE a.dt = '${var}';
  4. Configure the same debugging parameters as in the previous node (Computing Resource, Resource Group, Script Parameters).

  5. (Optional) Review scheduling settings under Scheduling Configuration. Leave them at their defaults.

  6. Click Save.

Configure ads_user_info_1d_starrocks

This node aggregates the wide table by uid to produce one row per user per day. It uses MAX() to collapse multiple log entries and COUNT(*) for page view (PV) count.

  1. In the workflow canvas, hover over ads_user_info_1d_starrocks and click Open Node.

  2. From Select DataSource, select the StarRocks computing resource.

  3. Paste the following SQL into the code editor:

    CREATE TABLE IF NOT EXISTS ads_user_info_1d_starrocks (
        uid STRING COMMENT 'The user ID',
        region STRING COMMENT 'The region, obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        pv BIGINT COMMENT 'pv',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        dt DATE NOT NULL COMMENT 'The time'
    ) DUPLICATE KEY(uid)
    COMMENT 'User behavior analysis case - User profile data'
    PARTITION BY(dt)
    PROPERTIES ("replication_num" = "1");
    
    -- Drop the destination partition before each run to prevent duplicate writes on rerun.
    ALTER TABLE ads_user_info_1d_starrocks DROP PARTITION IF EXISTS p${var} FORCE;
    
    -- Aggregate the wide table by user: one row per user per day.
    -- COUNT(*) captures total page views; MAX() collapses multi-value fields into a single representative value.
    INSERT INTO ads_user_info_1d_starrocks
    SELECT
        uid,
        MAX(region) AS region,
        MAX(device) AS device,
        COUNT(*) AS pv,
        MAX(gender) AS gender,
        MAX(age_range) AS age_range,
        MAX(zodiac) AS zodiac,
        dt
    FROM dws_user_info_all_di_starrocks
    WHERE dt = '${var}'
    GROUP BY uid, dt;
    
    SELECT * FROM dws_user_info_all_di_starrocks
    WHERE dt = '${var}';
  4. Configure the same debugging parameters as in the previous nodes (Computing Resource, Resource Group, Script Parameters).

  5. (Optional) Review scheduling settings under Scheduling Configuration. Leave them at their defaults.

  6. Click Save.

Step 4: Run and verify the pipeline

  1. In the workflow toolbar, click Run. Set the variable values for this run — this tutorial uses 20250223, but use any valid date in yyyymmdd format. Click OK and wait for the run to complete.

  2. Query the results to verify the output:

    1. In the DataWorks console, choose Data Analysis and Service > DataAnalysis, then click Go to DataAnalysis. In the left-side navigation pane, click SQL Query.

    2. Next to My Files, click image and select Create File. Configure the file name.

    3. Open the file and click image to configure the data source:

      Parameter Value

      Workspace

      Select the workspace that contains the user_profile_analysis_starrocks workflow

      Data Source Type

      StarRocks

      Data Source Name

      Select the StarRocks computing resource associated with your workspace

    4. Click OK.

    5. Run the following query to check the user profile output. The data timestamp is one day before the scheduling date — if the node ran on February 23, 2025, the data timestamp is 20250222.

      SELECT * FROM ads_user_info_1d_starrocks WHERE dt = <data-timestamp>;
      

Step 5: Deploy the workflow

Auto triggered nodes run on schedule only after deployment to production.

Scheduling parameters were configured for the workflow when you set up the workflow scheduling properties. No separate configuration is needed for individual nodes.
  1. In the left-side navigation pane of Data Studio, click image. In Project Directory, open your workflow.

  2. In the toolbar, click Publish.

  3. Click Start Publishing to Production and select a publishing method:

    • Full Release: Publishes the workflow and all its nodes.

    • Incremental Release: Publishes only nodes that have changed since the last release. Use this for iterative updates.

    image.png

  4. After you confirm the publishing method, the system automatically releases the workflow and selected nodes to the development and production environments. Click Confirm Release to complete the deployment.

    image

Step 6: Run the nodes in production

After deployment, instances are generated and run the next day. Use Backfill data to run the workflow immediately and verify it works in production. For details, see Data backfill instance O&M.

  1. After deploying all nodes, click Operation Center in the upper-right corner of the node configuration tab. Alternatively, click 图标 and choose All Products > Data Development And Task Operation > Operation Center.

  2. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Find the zero load node workshop_start_starrocks and click its name.

  3. In the directed acyclic graph (DAG), right-click workshop_start_starrocks and choose Run > Current and Descendant Nodes Retroactively.

  4. In the Backfill Data panel, select the nodes to backfill, set the Data Timestamp, and click Submit and Redirect.

  5. On the Data Backfill page, click Refresh to check whether workshop_start_starrocks and its descendant nodes completed successfully.

To avoid unexpected charges after completing this tutorial, set the Effective Period for all nodes in the workflow, or freeze the zero load node workshop_start_starrocks.

What to do next

  • Display data in charts: Use DataAnalysis to visualize the user profile data and extract business insights.

  • Monitor data quality: Set up monitoring rules on the generated tables to detect and intercept dirty data before it propagates.

  • Manage data: View the generated StarRocks tables in Data Map and explore their relationships through data lineage.

  • Expose data via APIs: Use DataService Studio to share the processed user profile data with other systems through standardized APIs.