
DataWorks: Process data

Last Updated: Mar 27, 2026

This tutorial walks you through building a three-layer SQL processing pipeline in DataWorks to transform raw website logs and user records into a basic user profile dataset. By the end, you will have:

  1. Set up three MaxCompute SQL nodes and configured their scheduling dependencies

  2. Uploaded the ip2region.jar resource and registered the getregion user-defined function (UDF)

  3. Written and tested SQL code for each data warehouse layer

  4. Deployed the workflow to the production environment and verified it runs correctly

Prerequisites

Before you begin, ensure that you have completed the preceding tutorials in this series: preparing the environments (associating a MaxCompute computing resource with your workspace and purchasing a serverless resource group) and synchronizing data to populate the ODS source tables.

How it works

The pipeline processes data through four layers:

  • ODS (Operational Data Store) — Source tables (ods_raw_log_d_odps for website logs, ods_user_info_d_odps for user records) were populated in the data synchronization step.

  • DWD (Data Warehouse Detail) layer — The dwd_log_info_di_odps node splits each raw log entry into structured fields using the getregion UDF and regular expressions.

  • DWS (Data Warehouse Service) layer — The dws_user_info_all_di_odps node joins the structured log data with the user records table.

  • ADS (Application Data Service) layer — The ads_user_info_1d_odps node aggregates the joined data by user ID to produce the final user profile.

The three MaxCompute SQL nodes and their roles are summarized below.

| Node type | Node name | Purpose |
| --- | --- | --- |
| MaxCompute SQL | dwd_log_info_di_odps | Splits raw log data from ods_raw_log_d_odps into structured fields using the getregion UDF; writes output to dwd_log_info_di_odps |
| MaxCompute SQL | dws_user_info_all_di_odps | Joins ods_user_info_d_odps (user records) with dwd_log_info_di_odps (processed logs); writes output to dws_user_info_all_di_odps |
| MaxCompute SQL | ads_user_info_1d_odps | Aggregates the joined data by user ID; writes the final user profile to ads_user_info_1d_odps |

Step 1: Build the data processing pipeline

  1. Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane, click the DATA STUDIO icon. In the Workspace Directories section, find your workflow and click the workflow name.

  2. In the workflow canvas, create three MaxCompute SQL nodes with the names dwd_log_info_di_odps, dws_user_info_all_di_odps, and ads_user_info_1d_odps.

  3. Draw lines between the nodes to configure their scheduling dependencies, as shown in the following figure.

    You can also use the automatic parsing feature to let DataWorks identify scheduling dependencies from node code. This tutorial uses the manual line-drawing approach. For details on automatic parsing, see Method 1: Configure scheduling dependencies based on the lineage in the code of a node.


Step 2: Upload a resource and register a UDF

The getregion UDF converts IP addresses to geographic regions. It is required by the dwd_log_info_di_odps node to process raw log data.

Important
  • The ip2region.jar resource is provided for this tutorial only. For production use cases that require IP-to-region mapping, use a professional IP address conversion service.

  • Upload the resource file before registering the UDF that references it.

Upload the ip2region.jar resource

  1. Download the ip2region.jar package to your local machine.

  2. On the Data Studio page, open the RESOURCE MANAGEMENT pane from the left-side navigation pane. In the RESOURCE MANAGEMENT pane, click the create icon and choose Create Resource > MaxCompute Jar. In the Create Resource or Function dialog box, set the Name parameter and click OK.

    The resource name does not need to match the JAR file name.
  3. Set File Source to On-premises, click Upload next to File Content, and select the ip2region.jar file you downloaded.

  4. From the Data Source drop-down list, select the MaxCompute computing resource associated with your workspace. This is the resource you configured when preparing environments.

  5. Click Save and then Deploy in the top toolbar to deploy the resource to the MaxCompute project in both development and production environments.

Register the getregion UDF

  1. In the RESOURCE MANAGEMENT pane, click the create icon and choose Create Function > MaxCompute Function. In the Create Resource or Function dialog box, set the Name parameter to getregion.

  2. On the configuration tab, set the following parameters. Keep the defaults for all other parameters.

    | Parameter | Value |
    | --- | --- |
    | Function Type | OTHER |
    | Data Source | The MaxCompute computing resource associated with your workspace |
    | Class Name | org.alidata.odps.udf.Ip2Region |
    | Resource List | ip2region.jar |
    | Description | Conversion of an IP address into a region |
    | Command Syntax | getregion('ip') |
    | Parameter Description | IP address |

  3. Click Save and then Deploy to deploy the UDF to the MaxCompute project in both environments.

Step 3: Configure the MaxCompute SQL nodes

Each node runs SQL that creates an output table and inserts processed data into it. The ${bizdate} variable in each script is replaced by the data timestamp at runtime.
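The substitution is plain text replacement performed by the scheduler before the SQL runs. A minimal Python sketch of the mechanism (illustrative only, not DataWorks internals):

```python
# Illustrative only: DataWorks performs this substitution internally at runtime.
sql_template = (
    "INSERT OVERWRITE TABLE dwd_log_info_di_odps "
    "PARTITION (dt='${bizdate}') SELECT ..."
)

def render(template: str, bizdate: str) -> str:
    """Replace the ${bizdate} scheduling variable with a concrete data timestamp."""
    return template.replace("${bizdate}", bizdate)

print(render(sql_template, "20250223"))
```

Running the sketch with 20250223 yields a statement whose partition clause reads dt='20250223', which is exactly the partition you later query in Step 4.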

Configure the dwd_log_info_di_odps node

This node uses getregion to map each log entry's IP address to a region, and uses regular expressions to parse the HTTP request field into separate method, URL, and protocol fields. The output goes to dwd_log_info_di_odps.
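The request-field parsing can be prototyped locally with Python's re module before committing it to SQL; the patterns below mirror the regexp_substr and regexp_extract calls in the node code. The sample log line is made up for illustration:

```python
import re

# A made-up raw log entry in the ##@@-delimited format used by ods_raw_log_d_odps.
raw = ("42.120.74.88##@@0016359810821##@@2025-02-23 14:05:17"
       "##@@GET /index.html HTTP/1.1##@@200##@@1024"
       "##@@https://example.com/home##@@Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)")

ip, uid, time, request, status, nbytes, referer, agent = raw.split("##@@")

# Mirrors regexp_substr(request, '(^[^ ]+ )'): leading token is the HTTP method.
method = re.search(r"^[^ ]+ ", request).group().strip()
# Mirrors regexp_extract(request, '^[^ ]+ (.*) [^ ]+$'): middle portion is the URL.
url = re.search(r"^[^ ]+ (.*) [^ ]+$", request).group(1)
# Mirrors regexp_substr(request, '([^ ]+$)'): trailing token is the protocol version.
protocol = re.search(r"[^ ]+$", request).group()

print(method, url, protocol)  # → GET /index.html HTTP/1.1
```

This is a convenient way to sanity-check the three patterns against a few real log lines before running the full node.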

  1. In the Workspace Directories section, find your workflow and click the workflow name to open the canvas.

  2. In the canvas, move the pointer over the dwd_log_info_di_odps node and click Open Node.

  3. Copy the following SQL and paste it into the code editor.

    -- Create the dwd_log_info_di_odps table.
    CREATE TABLE IF NOT EXISTS dwd_log_info_di_odps (
     ip STRING COMMENT 'The IP address',
     uid STRING COMMENT 'The user ID',
     time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
     status STRING COMMENT 'The status code that is returned by the server',
     bytes STRING COMMENT 'The number of bytes that are returned to the client',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     method STRING COMMENT 'The HTTP request type',
     url STRING COMMENT 'The URL',
     protocol STRING COMMENT 'The version number of HTTP',
     referer STRING COMMENT 'The source URL',
     device STRING COMMENT 'The terminal type',
     identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;
    
    -- Split raw log entries from ods_raw_log_d_odps into structured fields.
    -- Each raw log entry is a single delimited string (##@@ separator).
    -- getregion(ip): calls the registered UDF to map IP address to geographic region.
    -- regexp_substr / regexp_extract: parses the HTTP request field into method, URL, and protocol.
    -- CASE WHEN on agent: classifies device type (android, iphone, etc.) and access identity
    --   (crawler, feed, user, or unknown) based on the user-agent string.
    -- ${bizdate}: scheduling variable replaced by the data timestamp at runtime (e.g., 20250223).
    INSERT OVERWRITE TABLE dwd_log_info_di_odps PARTITION (dt='${bizdate}')
    SELECT ip
      , uid
      , time
      , status
      , bytes
      , getregion(ip) AS region -- Map IP address to region using the registered UDF.
      , regexp_substr(request, '(^[^ ]+ )') AS method -- Extract HTTP method from request field.
      , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url -- Extract URL from request field.
      , regexp_substr(request, '([^ ]+$)') AS protocol -- Extract HTTP version from request field.
      , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Normalize referer to domain only.
      , CASE
        WHEN TOLOWER(agent) RLIKE 'android' THEN 'android'
        WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
        WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
        WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
        WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
        WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
        ELSE 'unknown'
      END AS device
      , CASE
        WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
        WHEN TOLOWER(agent) RLIKE 'feed'
        OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
        WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
        AND agent RLIKE '^(Mozilla|Opera)' -- Agent string begins with Mozilla or Opera.
        AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
        ELSE 'unknown'
      END AS identity
      FROM (
        SELECT SPLIT(col, '##@@')[0] AS ip
        , SPLIT(col, '##@@')[1] AS uid
        , SPLIT(col, '##@@')[2] AS time
        , SPLIT(col, '##@@')[3] AS request
        , SPLIT(col, '##@@')[4] AS status
        , SPLIT(col, '##@@')[5] AS bytes
        , SPLIT(col, '##@@')[6] AS referer
        , SPLIT(col, '##@@')[7] AS agent
      FROM ods_raw_log_d_odps
      WHERE dt ='${bizdate}'
    ) a;
  4. In the right-side navigation pane, click Debugging Configurations and set the following parameters. These are used when you run the workflow in Step 4.

    | Parameter | Value |
    | --- | --- |
    | Computing Resource | The MaxCompute computing resource associated with your workspace |
    | Resource Group | The serverless resource group you purchased when preparing environments |
    | Script Parameters | Leave blank. The ${bizdate} variable is replaced with a constant you specify when running the workflow. |

  5. (Optional) Click Properties to review scheduling settings. In this tutorial, scheduling parameters are configured at the workflow level, so you do not need to configure them on individual nodes. For details, see Scheduling properties.

  6. Click Save in the top toolbar.
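The device and identity CASE expressions above can be translated into plain Python for local testing; note that ordering matters ("windows phone" must be checked before "windows"). This is an assumption-level sketch of the same rules, not the node code itself:

```python
import re

def classify_device(agent: str) -> str:
    """Mirror the device CASE expression: the first matching keyword wins."""
    a = agent.lower()
    for keyword, label in [("android", "android"), ("iphone", "iphone"),
                           ("ipad", "ipad"), ("macintosh", "macintosh"),
                           ("windows phone", "windows_phone"), ("windows", "windows_pc")]:
        if keyword in a:
            return label
    return "unknown"

def classify_identity(agent: str, url: str) -> str:
    """Mirror the identity CASE expression: crawler, feed, user, or unknown."""
    a = agent.lower()
    if re.search(r"bot|spider|crawler|slurp", a):
        return "crawler"
    if "feed" in a or "feed" in url:
        return "feed"
    if agent.startswith(("Mozilla", "Opera")):
        return "user"
    return "unknown"
```

For example, a "Windows Phone 8.1" user agent classifies as windows_phone rather than windows_pc because of the check order, and a Googlebot agent is tagged as crawler before any other rule applies.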

Configure the dws_user_info_all_di_odps node

This node left-joins the processed log data (dwd_log_info_di_odps) with the user records table (ods_user_info_d_odps) on uid, merging behavioral data with demographic data in dws_user_info_all_di_odps.
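The join semantics can be verified on a toy dataset with SQLite, whose LEFT OUTER JOIN and COALESCE behave the same way for this purpose. Table and column names below are abbreviated for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dwd_log (uid TEXT, region TEXT);
    CREATE TABLE ods_user (uid TEXT, gender TEXT);
    -- u002 appears in the logs but has no user record.
    INSERT INTO dwd_log VALUES ('u001', 'hangzhou'), ('u002', 'beijing');
    INSERT INTO ods_user VALUES ('u001', 'female');
""")

rows = con.execute("""
    SELECT COALESCE(a.uid, b.uid) AS uid, b.gender, a.region
    FROM dwd_log a
    LEFT OUTER JOIN ods_user b ON a.uid = b.uid
    ORDER BY uid
""").fetchall()

# The u002 log row survives the join; its user columns come back as NULL.
print(rows)
```

This demonstrates why the node uses a left join rather than an inner join: log activity from users without a matching record is retained instead of being silently dropped.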

  1. In the canvas, move the pointer over the dws_user_info_all_di_odps node and click Open Node.

  2. Copy the following SQL and paste it into the code editor.

    -- Create the dws_user_info_all_di_odps table.
    CREATE TABLE IF NOT EXISTS dws_user_info_all_di_odps (
     uid STRING COMMENT 'The user ID',
     gender STRING COMMENT 'The gender',
     age_range STRING COMMENT 'The age range',
     zodiac STRING COMMENT 'The zodiac sign',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     device STRING COMMENT 'The terminal type',
     identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
     method STRING COMMENT 'The HTTP request type',
     url STRING COMMENT 'The URL',
     referer STRING COMMENT 'The source URL',
     time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;
    
    -- Join processed log data (dwd_log_info_di_odps) with user records (ods_user_info_d_odps) on uid.
    -- LEFT OUTER JOIN: retains all log rows even if no matching user record exists.
    -- COALESCE(a.uid, b.uid): handles cases where uid appears in one table but not the other.
    -- ${bizdate}: scheduling variable replaced by the data timestamp at runtime.
    INSERT OVERWRITE TABLE dws_user_info_all_di_odps  PARTITION (dt='${bizdate}')
    SELECT COALESCE(a.uid, b.uid) AS uid
      , b.gender
      , b.age_range
      , b.zodiac
      , a.region
      , a.device
      , a.identity
      , a.method
      , a.url
      , a.referer
      , a.time
    FROM (
      SELECT *
      FROM dwd_log_info_di_odps
      WHERE dt = '${bizdate}'
    ) a
    LEFT OUTER JOIN (
      SELECT *
      FROM ods_user_info_d_odps
      WHERE dt = '${bizdate}'
    ) b
    ON a.uid = b.uid;
  3. Configure the same Debugging Configurations parameters as in the previous node.

  4. Click Save.

Configure the ads_user_info_1d_odps node

This node groups the joined data by uid and applies aggregation functions to produce one profile row per user, stored in ads_user_info_1d_odps.
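The aggregation can likewise be checked on a toy dataset with SQLite: COUNT(0) counts rows per uid, while MAX() collapses per-user attributes that are constant within a user. The schema here is an illustrative stand-in, not the production table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dws_user_info (uid TEXT, region TEXT, device TEXT);
    -- Two page views for u001, one for u002.
    INSERT INTO dws_user_info VALUES
        ('u001', 'hangzhou', 'iphone'),
        ('u001', 'hangzhou', 'iphone'),
        ('u002', 'beijing', 'android');
""")

rows = con.execute("""
    SELECT uid, MAX(region), MAX(device), COUNT(0) AS pv
    FROM dws_user_info
    GROUP BY uid
    ORDER BY uid
""").fetchall()

# One profile row per user; pv is the per-user page-view count.
print(rows)
```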

  1. In the canvas, move the pointer over the ads_user_info_1d_odps node and click Open Node.

  2. Copy the following SQL and paste it into the code editor.

    -- Create the ads_user_info_1d_odps table.
    CREATE TABLE IF NOT EXISTS ads_user_info_1d_odps (
     uid STRING COMMENT 'The user ID',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     device STRING COMMENT 'The terminal type',
     pv BIGINT COMMENT 'The number of page views (PV)',
     gender STRING COMMENT 'The gender',
     age_range STRING COMMENT 'The age range',
     zodiac STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;
    
    -- Aggregate the wide table (dws_user_info_all_di_odps) by uid to produce one profile row per user.
    -- COUNT(0) AS pv: counts the total number of page views per user.
    -- MAX(): selects one value per user for region, device, gender, age_range, and zodiac.
    -- ${bizdate}: scheduling variable replaced by the data timestamp at runtime.
    INSERT OVERWRITE TABLE ads_user_info_1d_odps  PARTITION (dt='${bizdate}')
    SELECT uid
      , MAX(region)
      , MAX(device)
      , COUNT(0) AS pv
      , MAX(gender)
      , MAX(age_range)
      , MAX(zodiac)
    FROM dws_user_info_all_di_odps
    WHERE dt = '${bizdate}'
    GROUP BY uid;
  3. Configure the same Debugging Configurations parameters as in the previous nodes.

  4. Click Save.

Step 4: Run and verify

  1. In the top toolbar of the workflow canvas, click Run. In the Enter runtime parameters dialog box, enter the data timestamp value to use for ${bizdate} — for example, 20250223 — and click OK.

  2. After the run completes, verify the result. In the left-side navigation pane, open the Personal Directory section and click the create icon to create a .sql file.

  3. In the lower part of the tab, confirm that the language mode is set to MaxCompute SQL.

  4. Run the following query against the final result table, replacing data timestamp with the ${bizdate} value you used (for example, 20250223):

    SELECT count(*) FROM ads_user_info_1d_odps WHERE dt='data timestamp';

    • Success: The query returns a row count greater than zero.

    • No data returned: The ${bizdate} value you entered when running the workflow does not match the dt partition in the query. To find the exact timestamp used in the run, click Running History in the right-side navigation pane, then click View in the Actions column. Look for partition=[pt=xxx] in the run logs and use that value as dt in your query.

Step 5: Deploy the workflow

Auto triggered nodes run on a schedule only after you deploy them to the production environment.

Scheduling parameters are configured at the workflow level (set up in Synchronize data). You do not need to configure them separately for each node.
  1. In the Workspace Directories section, find your workflow and click the workflow name.

  2. In the top toolbar, click Deploy.

  3. On the DEPLOY tab, click Start Deployment to Production Environment and follow the on-screen instructions.

Step 6: Run nodes in the production environment

After deployment, scheduled instances run starting the following day. Use the data backfill feature to trigger a run immediately and confirm the nodes work correctly in production.

  1. After deployment, click Operation Center in the upper-right corner of the Data Studio page. Alternatively, click the icon in the upper-left corner and choose All Products > Data Development And Task Operation > Operation Center.

  2. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the Auto Triggered Nodes page, find the zero load node workshop_start_odps and click the node name.

  3. In the directed acyclic graph (DAG) of the node, right-click workshop_start_odps and choose Run > Current and Descendant Nodes Retroactively.

  4. In the Backfill Data panel, select the nodes to backfill, set the Data Timestamp parameter, and click Submit and Redirect.

  5. On the Data Backfill page, click Refresh to check whether all nodes completed successfully.

To avoid unexpected charges after completing this tutorial, configure the Effective Period parameter for all nodes in the workflow, or freeze the zero load node workshop_start_odps.

What's next

  • Visualize data on a dashboard: Display the processed user profile data in charts using DataAnalysis to extract key insights from the data.

  • Monitor data quality: Set up monitoring rules on the output tables to detect and intercept dirty data before it affects downstream processes.

  • Manage data: View the output tables in Data Map and explore data lineage relationships between tables.

  • Use an API to provide data services: Publish the final user profile data through standardized APIs in DataService Studio to share data with other systems.