
DataWorks: Process data

Last Updated: Jul 25, 2025

This topic describes how to use MaxCompute nodes in DataWorks to process the data in the ods_user_info_d_odps and ods_raw_log_d_odps tables that are synchronized to MaxCompute and obtain user profile data. The ods_user_info_d_odps table stores basic user information, and the ods_raw_log_d_odps table stores user website access logs. This topic helps you understand how to use DataWorks and MaxCompute to compute and analyze synchronized data and complete simple data processing in a data warehouse.

Prerequisites

The required data is synchronized. For more information, see Synchronize data.

Step 1: Build a data processing link

In the data synchronization phase, the required data is synchronized to MaxCompute. The next objective is to further process the data to generate the basic user profile data.

  1. Log on to the DataWorks console and go to the DATA STUDIO pane of the Data Studio page. In the Workspace Directories section of the DATA STUDIO pane, find the prepared workflow and click the workflow name to go to the configuration tab of the workflow.

  2. In this tutorial, you need to create three MaxCompute SQL nodes. The following list describes the nodes that are used in this tutorial and their functionalities.

    • MaxCompute SQL node dwd_log_info_di_odps: This node splits the data in the ods_raw_log_d_odps table into multiple fields by using regular expressions and the user-defined function (UDF) getregion, and writes the fields to the dwd_log_info_di_odps table.

    • MaxCompute SQL node dws_user_info_all_di_odps: This node aggregates the basic user information table ods_user_info_d_odps and the processed log data table dwd_log_info_di_odps, and writes the aggregation result to the dws_user_info_all_di_odps table.

    • MaxCompute SQL node ads_user_info_1d_odps: This node further processes the data in the dws_user_info_all_di_odps table and writes the processed data to the ads_user_info_1d_odps table to generate a basic user profile.

  3. Draw lines to configure ancestor nodes for the MaxCompute SQL nodes so that the scheduling dependencies match the data flow: the dwd_log_info_di_odps node depends on the synchronization node that generates the ods_raw_log_d_odps table, the dws_user_info_all_di_odps node depends on the dwd_log_info_di_odps node and the synchronization node that generates the ods_user_info_d_odps table, and the ads_user_info_1d_odps node depends on the dws_user_info_all_di_odps node.

    Note

    You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes. In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Method 1: Configure scheduling dependencies based on the lineage in the code of a node.

Step 2: Upload a resource and register a UDF

To ensure that data can be processed as expected, you must register a MaxCompute UDF named getregion. This function is used to convert the IP addresses in the log data that is synchronized to MaxCompute into regions.

Important
  • In this tutorial, the required resources are provided for the function that is used to convert IP addresses into regions. You only need to download the resources to your on-premises machine and upload them to the desired DataWorks workspace before you register the function in DataWorks.

  • The IP address resources for this function are used only in this tutorial. If you need to map IP addresses to geographical locations in production business scenarios, obtain a professional IP address conversion service from a specialized provider.

Upload a resource (ip2region.jar)

  1. Download the ip2region.jar package.

    Note

    The ip2region.jar package is used only in this tutorial.

  2. Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane of the Data Studio page, click the image icon. In the RESOURCE MANAGEMENT pane, click the image icon and choose Create Resource > MaxCompute Jar. In the Create Resource or Function dialog box, configure the Name parameter and click OK.

    Note

    The resource name can differ from the name of the package that you upload.

  3. Set the File Source parameter to On-premises, click Upload next to the File Content parameter, and then select the ip2region.jar package downloaded to your on-premises machine.

  4. From the Data Source drop-down list, select the MaxCompute computing resource that is associated with the workspace when you prepare environments.

  5. In the top toolbar of the configuration tab, click Save and then click Deploy to deploy the resource to the MaxCompute project in both the development and production environments by following the on-screen instructions.

Register a UDF (getregion)

  1. Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane of the Data Studio page, click the image icon. In the RESOURCE MANAGEMENT pane, click the image icon and choose Create Function > MaxCompute Function. In the Create Resource or Function dialog box, configure the Name parameter. In this tutorial, the Name parameter is set to getregion.

  2. On the configuration tab that appears, configure the parameters. The following list describes the parameters that you must configure in this tutorial. Retain the default values for other parameters.

    • Function Type: Select OTHER.

    • Data Source: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.

    • Class Name: Enter org.alidata.odps.udf.Ip2Region.

    • Resource List: Select ip2region.jar.

    • Description: Enter a description. In this tutorial, the following description is entered: Conversion of an IP address into a region.

    • Command Syntax: Enter getregion('ip').

    • Parameter Description: Enter a parameter description. In this tutorial, the following parameter description is entered: IP address.

  3. In the top toolbar of the configuration tab, click Save and then click Deploy to deploy the function to the MaxCompute project in both the development and production environments by following the instructions that are displayed.
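
After the function is deployed, you can optionally run an ad hoc query to confirm that getregion is available. The following statement is a minimal sketch that reuses the log-splitting logic from the sample code in Step 3 and assumes that the ods_raw_log_d_odps table is already populated in the environment in which you run the query. The partition value 20250223 is the sample data timestamp used in this tutorial; replace it with the data timestamp of your synchronized data.

  -- Parse the IP address field in the raw log data by using the registered getregion UDF.
  SELECT SPLIT(col, '##@@')[0] AS ip
    , getregion(SPLIT(col, '##@@')[0]) AS region
  FROM ods_raw_log_d_odps
  WHERE dt = '20250223'
  LIMIT 10;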

Step 3: Configure the MaxCompute SQL nodes

To process the data, you must configure a MaxCompute SQL node to implement each layer of processing logic. In this tutorial, complete sample code for data processing is provided. You must configure the code separately for the dwd_log_info_di_odps, dws_user_info_all_di_odps, and ads_user_info_1d_odps nodes.

Configure the dwd_log_info_di_odps node

In the sample code for this node, the registered getregion function and regular expressions are used to parse the fields in the ancestor table ods_raw_log_d_odps, and the parsed data is written to the dwd_log_info_di_odps table.

  1. In the left-side navigation pane of the Data Studio page, click the image icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the canvas of the workflow.

  2. In the canvas of the workflow, move the pointer over the dwd_log_info_di_odps node and click Open Node.

  3. Copy the following SQL statements and paste them in the code editor:

    Sample code for the dwd_log_info_di_odps node

    -- Create the dwd_log_info_di_odps table.
    CREATE TABLE IF NOT EXISTS dwd_log_info_di_odps (
     ip STRING COMMENT 'The IP address',
     uid STRING COMMENT 'The user ID',
     time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format',
     status STRING COMMENT 'The status code that is returned by the server',
     bytes STRING COMMENT 'The number of bytes that are returned to the client',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     method STRING COMMENT 'The HTTP request type',
     url STRING COMMENT 'The URL',
     protocol STRING COMMENT 'The version number of HTTP',
     referer STRING COMMENT 'The source URL',
     device STRING COMMENT 'The terminal type',
     identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;
    
    -- Process data.
    -- Scenario: The following SQL statements use the getregion function to parse the IP address in the raw log data, and use regular expressions to split the raw data to analyze fields and write the fields to the dwd_log_info_di_odps table.
    --      This tutorial has prepared the function getregion for you to convert IP addresses into regions.
    -- Note:
    --     1.Before you can use a UDF on a DataWorks node, you must upload the resource required to register a function to DataWorks and use the resource to register a function in a visualized manner.
    --        In this tutorial, the resource ip2region.jar is used to register the function getregion.
    --     2.You can configure scheduling parameters for nodes in DataWorks to synchronize incremental data to the related partition in the desired table every day in the scheduling scenario.
    --        In actual development scenarios, you can define variables in the code of a node in the ${Variable name} format and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters.
    INSERT OVERWRITE TABLE dwd_log_info_di_odps PARTITION (dt='${bizdate}')
    SELECT ip 
      , uid
      , time
      , status
      , bytes 
      , getregion(ip) AS region -- Obtain the region based on the IP address by using the UDF.
      , regexp_substr(request, '(^[^ ]+ )') AS method -- Use a regular expression to split the request into three fields.
      , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url
      , regexp_substr(request, '([^ ]+$)') AS protocol 
      , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Use a regular expression to cleanse the referer to obtain a more accurate URL.
      , CASE
        WHEN TOLOWER(agent) RLIKE 'android' THEN 'android' -- Use an agent to obtain the terminal information and access form.
        WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
        WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
        WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
        WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
        WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
        ELSE 'unknown'
      END AS device
      , CASE
        WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
        WHEN TOLOWER(agent) RLIKE 'feed'
        OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
        WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
        AND agent RLIKE '^[Mozilla|Opera]'
        AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
        ELSE 'unknown'
      END AS identity
      FROM (
        SELECT SPLIT(col, '##@@')[0] AS ip
        , SPLIT(col, '##@@')[1] AS uid
        , SPLIT(col, '##@@')[2] AS time
        , SPLIT(col, '##@@')[3] AS request
        , SPLIT(col, '##@@')[4] AS status
        , SPLIT(col, '##@@')[5] AS bytes
        , SPLIT(col, '##@@')[6] AS referer
        , SPLIT(col, '##@@')[7] AS agent
      FROM ods_raw_log_d_odps  
      WHERE dt ='${bizdate}'
    ) a;
  4. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.

    • Resource Group: Select the serverless resource group that you purchase when you prepare environments.

    • Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.

  5. (Optional) Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the configuration tab of the node to view the values of the parameters on the following subtabs. For information about other parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  6. In the top toolbar of the configuration tab, click Save to save the node.
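
After the workflow run in Step 4 succeeds, you can optionally spot-check the output of this node with an ad hoc query such as the following sketch. The partition value 20250223 is the sample data timestamp used in this tutorial; replace it with the value that you use in your run.

  -- Preview the split log fields and the region that is parsed by getregion.
  SELECT uid, region, device, identity, method, url
  FROM dwd_log_info_di_odps
  WHERE dt = '20250223'
  LIMIT 10;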

Configure the dws_user_info_all_di_odps node

This node is used to aggregate the basic user information table ods_user_info_d_odps and the processed log data table dwd_log_info_di_odps and synchronize the aggregation result to the dws_user_info_all_di_odps table.

  1. In the canvas of the workflow, move the pointer over the dws_user_info_all_di_odps node and click Open Node.

  2. Copy the following SQL statements and paste them in the code editor:

    Sample code for the dws_user_info_all_di_odps node

    -- Create the dws_user_info_all_di_odps table.
    CREATE TABLE IF NOT EXISTS dws_user_info_all_di_odps (
     uid STRING COMMENT 'The user ID',
     gender STRING COMMENT 'The gender',
     age_range STRING COMMENT 'The age range',
     zodiac STRING COMMENT 'The zodiac sign',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     device STRING COMMENT 'The terminal type',
     identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
     method STRING COMMENT 'The HTTP request type',
     url STRING COMMENT 'The URL',
     referer STRING COMMENT 'The source URL',
     time STRING COMMENT 'The time in the yyyymmddhh:mi:ss format'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;
    
    -- Process data.
    -- Scenario: The following SQL statements are used to aggregate the processed log data table dwd_log_info_di_odps and the basic user information table ods_user_info_d_odps and write the aggregation result to the dws_user_info_all_di_odps table.
    -- Note: You can configure scheduling parameters for nodes in DataWorks to synchronize incremental data to the related partition in the desired table every day in the scheduling scenario.
    --      In actual development scenarios, you can define variables in the code of a node in the ${Variable name} format and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters.
    INSERT OVERWRITE TABLE dws_user_info_all_di_odps  PARTITION (dt='${bizdate}')
    SELECT COALESCE(a.uid, b.uid) AS uid
      , b.gender
      , b.age_range
      , b.zodiac
      , a.region
      , a.device
      , a.identity
      , a.method
      , a.url
      , a.referer
      , a.time
    FROM (
      SELECT *
      FROM dwd_log_info_di_odps  
      WHERE dt = '${bizdate}'
    ) a
    LEFT OUTER JOIN (
      SELECT *
      FROM ods_user_info_d_odps 
      WHERE dt = '${bizdate}'
    ) b
    ON a.uid = b.uid;
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.

    • Resource Group: Select the serverless resource group that you purchase when you prepare environments.

    • Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.

  4. (Optional) Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the configuration tab of the node to view the values of the parameters on the following subtabs. For information about other parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  5. In the top toolbar of the configuration tab, click Save to save the node.
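
Because this node uses a LEFT OUTER JOIN, log records whose uid has no match in the ods_user_info_d_odps table are kept with NULL user attributes. After the workflow run in Step 4 succeeds, you can optionally estimate how many such records exist with an ad hoc query similar to the following sketch. The sketch assumes that gender is populated for all matched users, and 20250223 is the sample data timestamp used in this tutorial; replace it with the value that you use in your run.

  -- Count the log records for which no matching basic user information is found.
  SELECT COUNT(*) AS unmatched_logs
  FROM dws_user_info_all_di_odps
  WHERE dt = '20250223'
    AND gender IS NULL;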

Configure the ads_user_info_1d_odps node

This node is used to further process data in the dws_user_info_all_di_odps table and synchronize the processed data to the ads_user_info_1d_odps table to generate a basic user profile.

  1. In the canvas of the workflow, move the pointer over the ads_user_info_1d_odps node and click Open Node.

  2. Copy the following SQL statements and paste them in the code editor:

    Sample code for the ads_user_info_1d_odps node

    -- Create the ads_user_info_1d_odps table.
    CREATE TABLE IF NOT EXISTS ads_user_info_1d_odps (
     uid STRING COMMENT 'The user ID',
     region STRING COMMENT 'The region, which is obtained based on the IP address',
     device STRING COMMENT 'The terminal type',
     pv BIGINT COMMENT 'The number of page views (PV)',
     gender STRING COMMENT 'The gender',
     age_range STRING COMMENT 'The age range',
     zodiac STRING COMMENT 'The zodiac sign'
    )
    PARTITIONED BY (
     dt STRING
    )
    LIFECYCLE 14;    
    
    -- Process data.
    -- Scenario: The following SQL statements are used to further process data in the dws_user_info_all_di_odps wide table that stores the user website access logs to generate basic user profile data, and synchronize the data to the ads_user_info_1d_odps table.
    -- Note: You can configure scheduling parameters for nodes in DataWorks to synchronize incremental data to the related partition in the desired table every day in the scheduling scenario.
    --      In actual development scenarios, you can define variables in the code of a node in the ${Variable name} format and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters.
    INSERT OVERWRITE TABLE ads_user_info_1d_odps  PARTITION (dt='${bizdate}')
    SELECT uid
      , MAX(region)
      , MAX(device)
      , COUNT(0) AS pv
      , MAX(gender)
      , MAX(age_range)
      , MAX(zodiac)
    FROM dws_user_info_all_di_odps 
    WHERE dt = '${bizdate}'
    GROUP BY uid; 
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.

    • Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.

    • Resource Group: Select the serverless resource group that you purchase when you prepare environments.

    • Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.

  4. (Optional) Configure scheduling properties.

    You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the configuration tab of the node to view the values of the parameters on the following subtabs. For information about other parameters on the Properties tab, see Scheduling properties.

    • Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.

    • Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.

  5. In the top toolbar of the configuration tab, click Save to save the node.
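
After the workflow run in Step 4 succeeds, you can optionally preview several user profile records with an ad hoc query such as the following sketch. The partition value 20250223 is the sample data timestamp used in this tutorial; replace it with the value that you use in your run.

  -- Preview user profile records that are generated by this node.
  SELECT uid, region, device, pv, gender, age_range, zodiac
  FROM ads_user_info_1d_odps
  WHERE dt = '20250223'
  LIMIT 10;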

Step 4: Process data

  1. Process data.

    In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify a value that is used for scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.

  2. Query the data processing result.

    1. In the left-side navigation pane of the Data Studio page, click the image icon. In the Personal Directory section of the DATA STUDIO pane, click the image icon to create a file whose name is suffixed with .sql. You can specify a name for the file based on your business requirements.

    2. In the lower part of the configuration tab of the file, check whether the language mode is MaxCompute SQL.

    3. In the code editor, enter the following SQL statement to query the number of data records in the final result table ads_user_info_1d_odps and check whether the data processing result is generated.

      -- You must update the partition filter condition to the data timestamp of your current operation. In this tutorial, the value of the scheduling parameter bizdate configured in the preceding steps is 20250223.
      SELECT count(*) FROM ads_user_info_1d_odps WHERE dt='data timestamp';
      • If the result returned after you execute the preceding statement shows that data exists, data processing is complete.

      • If the returned result shows that no data exists in the destination table, make sure that the value specified for the scheduling parameters that are defined for the inner nodes of the workflow in this run is the same as the value of the dt field in the preceding statement. To view the data timestamp that is used in the run, click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the record that is generated for this run. In the run logs, the data timestamp is displayed in the partition=[pt=xxx] format.

Step 5: Deploy the workflow

An auto triggered node can be automatically scheduled to run only after you deploy the node to the production environment. You can refer to the following steps to deploy the workflow to the production environment:

Note

In this tutorial, scheduling parameters are configured for the workflow when you configure scheduling properties for the workflow. You do not need to separately configure scheduling parameters for each node in the workflow.

  1. In the left-side navigation pane of the Data Studio page, click the image icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the configuration tab of the workflow.

  2. In the top toolbar of the configuration tab, click Deploy.

  3. On the DEPLOY tab, click Start Deployment to Production Environment to deploy the workflow by following the on-screen instructions.

Step 6: Run the nodes in the production environment

After you deploy the nodes, instances are generated for the nodes and are scheduled to run starting from the next day. You can use the data backfill feature to backfill data for nodes in a deployed workflow, which allows you to check whether the nodes can be run in the production environment. For more information, see Backfill data and view data backfill instances (new version).

  1. After the nodes are deployed, click Operation Center in the upper-right corner of the Data Studio page.

    You can also click the icon in the upper-left corner of the Data Studio page and choose All Products > Data Development And Task Operation > Operation Center.

  2. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the Auto Triggered Nodes page, find the zero load node workshop_start_odps and click the node name.

  3. In the directed acyclic graph (DAG) of the node, right-click the workshop_start_odps node and choose Run > Current and Descendant Nodes Retroactively.

  4. In the Backfill Data panel, select the nodes for which you want to backfill data, configure the Data Timestamp parameter, and then click Submit and Redirect.

  5. In the upper part of the Data Backfill page, click Refresh to check whether all nodes are successfully run.

Note

To prevent excessive fees from being generated after the operations in the tutorial are complete, you can configure the Effective Period parameter for all nodes in the workflow or freeze the zero load node workshop_start_odps.

What to do next

  • Visualize data on a dashboard: After you complete user profile analysis, use DataAnalysis to display the processed data in charts. This helps you quickly extract key information to gain insights into the business trends behind the data.

  • Monitor data quality: Configure monitoring rules for tables that are generated after data processing to help identify and intercept dirty data in advance to prevent the impacts of dirty data from escalating.

  • Manage data: After user profile analysis is complete, data tables are generated in MaxCompute. You can view the data tables in Data Map, and determine the relationships between the data tables based on lineages.

  • Use an API to provide data services: After you obtain the final processed data, use standardized APIs in DataService Studio to share data and to provide data for other business modules that use APIs to receive data.