This tutorial walks you through building a three-layer SQL processing pipeline in DataWorks to transform raw website logs and user records into a basic user profile dataset. By the end, you will have:
- Set up three MaxCompute SQL nodes and configured their scheduling dependencies
- Uploaded the `ip2region.jar` resource and registered the `getregion` user-defined function (UDF)
- Written and tested SQL code for each data warehouse layer
- Deployed the workflow to the production environment and verified that it runs correctly
Prerequisites
Before you begin, ensure that you have:
Synchronized data to MaxCompute. For details, see Synchronize data.
How it works
The pipeline processes data through four layers:
- ODS (Operational Data Store) layer — The source tables (`ods_raw_log_d_odps` for website logs, `ods_user_info_d_odps` for user records) were populated in the data synchronization step.
- DWD (Data Warehouse Detail) layer — The `dwd_log_info_di_odps` node splits each raw log entry into structured fields using the `getregion` UDF and regular expressions.
- DWS (Data Warehouse Service) layer — The `dws_user_info_all_di_odps` node joins the structured log data with the user records table.
- ADS (Application Data Service) layer — The `ads_user_info_1d_odps` node aggregates the joined data by user ID to produce the final user profile.
The three MaxCompute SQL nodes and their roles are summarized below.
| Node type | Node name | Purpose |
| --- | --- | --- |
| MaxCompute SQL | `dwd_log_info_di_odps` | Splits raw log data from `ods_raw_log_d_odps` into structured fields |
| MaxCompute SQL | `dws_user_info_all_di_odps` | Joins `dwd_log_info_di_odps` with the `ods_user_info_d_odps` user records table |
| MaxCompute SQL | `ads_user_info_1d_odps` | Aggregates the joined data by user ID; writes the final user profile to `ads_user_info_1d_odps` |
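The first transformation in the pipeline, the DWD split, can be sanity-checked outside MaxCompute. The following Python sketch mirrors the `SPLIT(col, '##@@')` subquery in the DWD node on a synthetic sample line; the field order and the sample values are assumptions for illustration.

```python
# Each raw log row is one string with eight fields joined by "##@@",
# mirroring the SPLIT(col, '##@@') subquery in the DWD node.
FIELDS = ["ip", "uid", "time", "request", "status", "bytes", "referer", "agent"]

def split_raw_log(col: str) -> dict:
    """Split a delimited raw log entry into named fields (hypothetical sample data)."""
    parts = col.split("##@@")
    return dict(zip(FIELDS, parts))

sample = "##@@".join([
    "192.168.1.1", "u0001", "2025022310:15:00", "GET /index.html HTTP/1.1",
    "200", "1024", "https://example.com/home", "Mozilla/5.0 (Windows NT 10.0)",
])
row = split_raw_log(sample)
print(row["request"])  # the field that the regular expressions parse next
```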
Step 1: Build the data processing pipeline
Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane, click the DATA STUDIO icon. In the Workspace Directories section, find your workflow and click the workflow name.
In the workflow canvas, create three MaxCompute SQL nodes named `dwd_log_info_di_odps`, `dws_user_info_all_di_odps`, and `ads_user_info_1d_odps`.
Draw lines between the nodes to configure their scheduling dependencies, as shown in the following figure.
You can also use the automatic parsing feature to let DataWorks identify scheduling dependencies from node code. This tutorial uses the manual line-drawing approach. For details on automatic parsing, see Method 1: Configure scheduling dependencies based on the lineage in the code of a node.
Step 2: Upload a resource and register a UDF
The getregion UDF converts IP addresses to geographic regions. It is required by the dwd_log_info_di_odps node to process raw log data.
The `ip2region.jar` resource is provided for this tutorial only. For production use cases that require IP-to-region mapping, use a professional IP address conversion service.
Upload the resource file before registering the UDF that references it.
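Conceptually, an IP-to-region UDF maps a numeric IP address to the region whose address range contains it. The Python sketch below illustrates the idea with a hypothetical, hard-coded range table and a binary search; the actual data and lookup logic inside `ip2region.jar` are not shown here.

```python
import bisect
import ipaddress

# Hypothetical range table: (range start as an integer, region label).
# A real service such as ip2region ships a much larger, regularly updated database.
RANGES = [
    (int(ipaddress.ip_address("1.0.0.0")), "region_a"),
    (int(ipaddress.ip_address("100.0.0.0")), "region_b"),
    (int(ipaddress.ip_address("200.0.0.0")), "region_c"),
]
STARTS = [start for start, _ in RANGES]

def get_region(ip: str) -> str:
    """Return the region whose range starts at or before the IP (sketch only)."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1
    return RANGES[i][1] if i >= 0 else "unknown"

print(get_region("150.10.20.30"))
```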
Upload the ip2region.jar resource
Download the ip2region.jar package to your local machine.
On the Data Studio page, click the resource management icon in the left-side navigation pane. In the RESOURCE MANAGEMENT pane, click the create icon and choose Create Resource > MaxCompute Jar. In the Create Resource or Function dialog box, set the Name parameter and click OK. The resource name does not need to match the JAR file name.
Set File Source to On-premises, click Upload next to File Content, and select the `ip2region.jar` file that you downloaded.
From the Data Source drop-down list, select the MaxCompute computing resource associated with your workspace. This is the resource that you configured when preparing environments.
Click Save and then Deploy in the top toolbar to deploy the resource to the MaxCompute project in both development and production environments.
Register the getregion UDF
In the RESOURCE MANAGEMENT pane, click the create icon and choose Create Function > MaxCompute Function. In the Create Resource or Function dialog box, set the Name parameter to `getregion`.
On the configuration tab, set the following parameters. Keep the defaults for all other parameters.
| Parameter | Value |
| --- | --- |
| Function Type | OTHER |
| Data Source | The MaxCompute computing resource associated with your workspace |
| Class Name | `org.alidata.odps.udf.Ip2Region` |
| Resource List | `ip2region.jar` |
| Description | Conversion of an IP address into a region |
| Command Syntax | `getregion('ip')` |
| Parameter Description | IP address |
Click Save and then Deploy to deploy the UDF to the MaxCompute project in both environments.
Step 3: Configure the MaxCompute SQL nodes
Each node runs SQL that creates an output table and inserts processed data into it. The ${bizdate} variable in each script is replaced by the data timestamp at runtime.
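The variable replacement can be pictured as simple string templating. The following Python sketch illustrates the effect, assuming a script fragment and a timestamp value like the one you enter at run time; it is not the actual DataWorks substitution mechanism.

```python
def render_script(sql_template: str, bizdate: str) -> str:
    """Replace the ${bizdate} scheduling variable with a concrete data timestamp."""
    return sql_template.replace("${bizdate}", bizdate)

# Hypothetical script fragment containing the scheduling variable.
template = "INSERT OVERWRITE TABLE t PARTITION (dt='${bizdate}') SELECT ..."
print(render_script(template, "20250223"))
```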
Configure the dwd_log_info_di_odps node
This node uses getregion to map each log entry's IP address to a region, and uses regular expressions to parse the HTTP request field into separate method, URL, and protocol fields. The output goes to dwd_log_info_di_odps.
In the Workspace Directories section, find your workflow and click the workflow name to open the canvas.
In the canvas, move the pointer over the `dwd_log_info_di_odps` node and click Open Node.
Copy the following SQL and paste it into the code editor.
```sql
-- Create the dwd_log_info_di_odps table.
CREATE TABLE IF NOT EXISTS dwd_log_info_di_odps (
  ip STRING COMMENT 'The IP address',
  uid STRING COMMENT 'The user ID',
  time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
  status STRING COMMENT 'The status code that is returned by the server',
  bytes STRING COMMENT 'The number of bytes that are returned to the client',
  region STRING COMMENT 'The region, which is obtained based on the IP address',
  method STRING COMMENT 'The HTTP request type',
  url STRING COMMENT 'The URL',
  protocol STRING COMMENT 'The version number of HTTP',
  referer STRING COMMENT 'The source URL',
  device STRING COMMENT 'The terminal type',
  identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
)
PARTITIONED BY (dt STRING)
LIFECYCLE 14;

-- Split raw log entries from ods_raw_log_d_odps into structured fields.
-- Each raw log entry is a single delimited string (##@@ separator).
-- getregion(ip): calls the registered UDF to map the IP address to a geographic region.
-- regexp_substr / regexp_extract: parse the HTTP request field into method, URL, and protocol.
-- CASE WHEN on agent: classifies the device type (android, iphone, and so on) and the access
--   identity (crawler, feed, user, or unknown) based on the user-agent string.
-- ${bizdate}: scheduling variable replaced by the data timestamp at runtime (for example, 20250223).
INSERT OVERWRITE TABLE dwd_log_info_di_odps PARTITION (dt='${bizdate}')
SELECT ip
  , uid
  , time
  , status
  , bytes
  , getregion(ip) AS region                                    -- Map the IP address to a region using the registered UDF.
  , regexp_substr(request, '(^[^ ]+ )') AS method              -- Extract the HTTP method from the request field.
  , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url       -- Extract the URL from the request field.
  , regexp_substr(request, '([^ ]+$)') AS protocol             -- Extract the HTTP version from the request field.
  , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer  -- Normalize the referer to the domain only.
  , CASE
      WHEN TOLOWER(agent) RLIKE 'android' THEN 'android'
      WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
      WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
      WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
      WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
      WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
      ELSE 'unknown'
    END AS device
  , CASE
      WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
      WHEN TOLOWER(agent) RLIKE 'feed'
        OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
      WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
        AND agent RLIKE '^[Mozilla|Opera]'
        AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
      ELSE 'unknown'
    END AS identity
FROM (
  SELECT SPLIT(col, '##@@')[0] AS ip
    , SPLIT(col, '##@@')[1] AS uid
    , SPLIT(col, '##@@')[2] AS time
    , SPLIT(col, '##@@')[3] AS request
    , SPLIT(col, '##@@')[4] AS status
    , SPLIT(col, '##@@')[5] AS bytes
    , SPLIT(col, '##@@')[6] AS referer
    , SPLIT(col, '##@@')[7] AS agent
  FROM ods_raw_log_d_odps
  WHERE dt = '${bizdate}'
) a;
```

In the right-side navigation pane, click Debugging Configurations and set the following parameters. These parameters are used when you run the workflow in Step 4.
| Parameter | Value |
| --- | --- |
| Computing Resource | The MaxCompute computing resource associated with your workspace |
| Resource Group | The serverless resource group that you purchased when preparing environments |
| Script Parameters | Leave blank. The `${bizdate}` variable is replaced with a constant that you specify when you run the workflow. |

(Optional) Click Properties to review scheduling settings. In this tutorial, scheduling parameters are configured at the workflow level, so you do not need to configure them on individual nodes. For details, see Scheduling properties.
Click Save in the top toolbar.
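The three request-field expressions in this node can be mirrored in Python to check them against a sample value before you run the node. This sketch uses `re` equivalents of `regexp_substr` and `regexp_extract` on a hypothetical request string; note that, unlike the SQL, it strips the trailing space from the extracted method.

```python
import re

def parse_request(request: str) -> dict:
    """Split an HTTP request field ('METHOD url PROTOCOL') into three parts."""
    method = re.search(r"^[^ ]+ ", request)          # mirrors regexp_substr(request, '(^[^ ]+ )')
    url = re.search(r"^[^ ]+ (.*) [^ ]+$", request)  # mirrors regexp_extract(..., group 1)
    protocol = re.search(r"[^ ]+$", request)         # mirrors regexp_substr(request, '([^ ]+$)')
    return {
        "method": method.group(0).strip() if method else None,  # SQL keeps the trailing space
        "url": url.group(1) if url else None,
        "protocol": protocol.group(0) if protocol else None,
    }

print(parse_request("GET /product/42?ref=feed HTTP/1.1"))
```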
Configure the dws_user_info_all_di_odps node
This node left-joins the processed log data (dwd_log_info_di_odps) with the user records table (ods_user_info_d_odps) on uid, merging behavioral data with demographic data in dws_user_info_all_di_odps.
In the canvas, move the pointer over the `dws_user_info_all_di_odps` node and click Open Node.
Copy the following SQL and paste it into the code editor.
```sql
-- Create the dws_user_info_all_di_odps table.
CREATE TABLE IF NOT EXISTS dws_user_info_all_di_odps (
  uid STRING COMMENT 'The user ID',
  gender STRING COMMENT 'The gender',
  age_range STRING COMMENT 'The age range',
  zodiac STRING COMMENT 'The zodiac sign',
  region STRING COMMENT 'The region, which is obtained based on the IP address',
  device STRING COMMENT 'The terminal type',
  identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
  method STRING COMMENT 'The HTTP request type',
  url STRING COMMENT 'The URL',
  referer STRING COMMENT 'The source URL',
  time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format'
)
PARTITIONED BY (dt STRING)
LIFECYCLE 14;

-- Join processed log data (dwd_log_info_di_odps) with user records (ods_user_info_d_odps) on uid.
-- LEFT OUTER JOIN: retains all log rows even if no matching user record exists.
-- COALESCE(a.uid, b.uid): handles cases where uid appears in one table but not the other.
-- ${bizdate}: scheduling variable replaced by the data timestamp at runtime.
INSERT OVERWRITE TABLE dws_user_info_all_di_odps PARTITION (dt='${bizdate}')
SELECT COALESCE(a.uid, b.uid) AS uid
  , b.gender
  , b.age_range
  , b.zodiac
  , a.region
  , a.device
  , a.identity
  , a.method
  , a.url
  , a.referer
  , a.time
FROM (
  SELECT * FROM dwd_log_info_di_odps WHERE dt = '${bizdate}'
) a
LEFT OUTER JOIN (
  SELECT * FROM ods_user_info_d_odps WHERE dt = '${bizdate}'
) b
ON a.uid = b.uid;
```

Configure the same Debugging Configurations parameters as in the previous node.
Click Save.
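The join semantics can be illustrated with plain Python dictionaries. This sketch uses hypothetical `uid` values to show why the LEFT OUTER JOIN plus COALESCE keeps log rows that have no matching user record.

```python
# Hypothetical partitions: log rows (side a) and user records (side b), keyed by uid.
log_rows = [{"uid": "u1", "region": "region_a"}, {"uid": "u2", "region": "region_b"}]
user_rows = {"u1": {"gender": "f", "age_range": "20-25"}}  # u2 has no user record

def left_join_on_uid(a_rows, b_by_uid):
    """LEFT OUTER JOIN: every row from a survives; b columns are None when unmatched."""
    out = []
    for a in a_rows:
        b = b_by_uid.get(a["uid"], {})
        out.append({
            "uid": a["uid"] or b.get("uid"),  # mirrors COALESCE(a.uid, b.uid)
            "region": a["region"],
            "gender": b.get("gender"),
            "age_range": b.get("age_range"),
        })
    return out

for row in left_join_on_uid(log_rows, user_rows):
    print(row)
```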
Configure the ads_user_info_1d_odps node
This node groups the joined data by uid and applies aggregation functions to produce one profile row per user, stored in ads_user_info_1d_odps.
In the canvas, move the pointer over the `ads_user_info_1d_odps` node and click Open Node.
Copy the following SQL and paste it into the code editor.
```sql
-- Create the ads_user_info_1d_odps table.
CREATE TABLE IF NOT EXISTS ads_user_info_1d_odps (
  uid STRING COMMENT 'The user ID',
  region STRING COMMENT 'The region, which is obtained based on the IP address',
  device STRING COMMENT 'The terminal type',
  pv BIGINT COMMENT 'The number of page views (pv)',
  gender STRING COMMENT 'The gender',
  age_range STRING COMMENT 'The age range',
  zodiac STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING)
LIFECYCLE 14;

-- Aggregate the wide table (dws_user_info_all_di_odps) by uid to produce one profile row per user.
-- COUNT(0) AS pv: counts the total number of page views per user.
-- MAX(): selects one value per user for region, device, gender, age_range, and zodiac.
-- ${bizdate}: scheduling variable replaced by the data timestamp at runtime.
INSERT OVERWRITE TABLE ads_user_info_1d_odps PARTITION (dt='${bizdate}')
SELECT uid
  , MAX(region)
  , MAX(device)
  , COUNT(0) AS pv
  , MAX(gender)
  , MAX(age_range)
  , MAX(zodiac)
FROM dws_user_info_all_di_odps
WHERE dt = '${bizdate}'
GROUP BY uid;
```

Configure the same Debugging Configurations parameters as in the previous nodes.
Click Save.
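The aggregation can likewise be checked in miniature. This sketch runs over a few hypothetical joined rows, computing one profile row per `uid` with a count for `pv` and a maximum for the descriptive columns, mirroring the GROUP BY in the node.

```python
from collections import defaultdict

# Hypothetical joined rows from the DWS layer.
rows = [
    {"uid": "u1", "region": "region_a", "device": "iphone"},
    {"uid": "u1", "region": "region_a", "device": "iphone"},
    {"uid": "u2", "region": "region_b", "device": "android"},
]

def aggregate_profiles(rows):
    """GROUP BY uid: COUNT(0) AS pv plus MAX() over the descriptive columns."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["uid"]].append(r)
    return {
        uid: {
            "pv": len(g),                           # COUNT(0)
            "region": max(r["region"] for r in g),  # MAX(region)
            "device": max(r["device"] for r in g),  # MAX(device)
        }
        for uid, g in groups.items()
    }

print(aggregate_profiles(rows))
```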
Step 4: Run and verify
In the top toolbar of the workflow canvas, click Run. In the Enter runtime parameters dialog box, enter the data timestamp value to use for `${bizdate}` (for example, 20250223) and click OK.
After the run completes, verify the result. In the left-side navigation pane, click the directory icon. In the Personal Directory section, click the create icon to create a `.sql` file.
In the lower part of the tab, confirm that the language mode is set to MaxCompute SQL.
Run the following query against the final result table. Replace `data timestamp` with the `${bizdate}` value that you used (for example, 20250223):

```sql
SELECT count(*) FROM ads_user_info_1d_odps WHERE dt='data timestamp';
```

Success: The query returns a row count greater than zero.
No data returned: The `${bizdate}` value that you entered when running the workflow does not match the `dt` partition in the query. To find the exact timestamp used in the run, click Running History in the right-side navigation pane, and then click View in the Actions column. Look for `partition=[pt=xxx]` in the run logs and use that value as `dt` in your query.
Step 5: Deploy the workflow
Auto triggered nodes run on a schedule only after you deploy them to the production environment.
Scheduling parameters are configured at the workflow level (set up in Synchronize data). You do not need to configure them separately for each node.
In the Workspace Directories section, find your workflow and click the workflow name.
In the top toolbar, click Deploy.
On the DEPLOY tab, click Start Deployment to Production Environment and follow the on-screen instructions.
Step 6: Run nodes in the production environment
After deployment, scheduled instances run starting the following day. Use the data backfill feature to trigger a run immediately and confirm the nodes work correctly in production.
After deployment, click Operation Center in the upper-right corner of the Data Studio page. Alternatively, click the menu icon in the upper-left corner and choose All Products > Data Development And Task Operation > Operation Center.
In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the Auto Triggered Nodes page, find the zero load node `workshop_start_odps` and click the node name.
In the directed acyclic graph (DAG) of the node, right-click `workshop_start_odps` and choose Run > Current and Descendant Nodes Retroactively.
In the Backfill Data panel, select the nodes to backfill, set the Data Timestamp parameter, and click Submit and Redirect.
On the Data Backfill page, click Refresh to check whether all nodes completed successfully.
To avoid unexpected charges after you complete this tutorial, configure the Effective Period parameter for all nodes in the workflow, or freeze the zero load node `workshop_start_odps`.
What's next
- Visualize data on a dashboard: Display the processed user profile data in charts using DataAnalysis to extract key insights from the data.
- Monitor data quality: Set up monitoring rules on the output tables to detect and intercept dirty data before it affects downstream processes.
- Manage data: View the output tables in Data Map and explore data lineage relationships between tables.
- Use an API to provide data services: Publish the final user profile data through standardized APIs in DataService Studio to share data with other systems.