This topic describes how to use MaxCompute nodes in DataWorks to process data in the ods_user_info_d_odps and ods_raw_log_d_odps tables that are synchronized to MaxCompute and generate user profile data. The ods_user_info_d_odps table stores basic user information, and the ods_raw_log_d_odps table stores user website access logs. This topic shows how to use DataWorks and MaxCompute to compute and analyze the synchronized data and complete simple data processing in a data warehouse.
Prerequisites
The required data is synchronized. For more information, see Synchronize data.
Step 1: Build a data processing link
In the data synchronization phase, the required data is synchronized to MaxCompute. The next objective is to further process the data to generate the basic user profile data.
Log on to the DataWorks console and go to the DATA STUDIO pane of the Data Studio page. In the Workspace Directories section of the DATA STUDIO pane, find the prepared workflow and click the workflow name to go to the configuration tab of the workflow.
In this tutorial, you need to create three MaxCompute SQL nodes. The following table lists node names that are used in this tutorial and the functionalities of the nodes.
Node type: MaxCompute SQL
Node name: dwd_log_info_di_odps
Node functionality: This node is used to split data in the ods_raw_log_d_odps table and synchronize the data to multiple fields in the dwd_log_info_di_odps table based on built-in functions and a user-defined function (UDF) named getregion.

Node type: MaxCompute SQL
Node name: dws_user_info_all_di_odps
Node functionality: This node is used to aggregate data in the basic user information table ods_user_info_d_odps and the processed log data table dwd_log_info_di_odps and synchronize the aggregation result to the dws_user_info_all_di_odps table.

Node type: MaxCompute SQL
Node name: ads_user_info_1d_odps
Node functionality: This node is used to further process data in the dws_user_info_all_di_odps table and synchronize the processed data to the ads_user_info_1d_odps table to generate a basic user profile.

Draw lines to configure ancestor nodes for the MaxCompute SQL nodes, as shown in the following figure.
Note: You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes. In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Method 1: Configure scheduling dependencies based on the lineage in the code of a node.
Step 2: Upload a resource and register a UDF
To ensure that data can be processed as expected, you must register a MaxCompute UDF named getregion. This UDF is used when the log data that is synchronized to MaxCompute is split into structured fields.
In this tutorial, the resources required by the function that converts IP addresses into regions are provided. You only need to download the resources to your on-premises machine and upload them to the desired DataWorks workspace before you register the function in DataWorks.
The IP address resources for this function are used only in this tutorial. If you need to map IP addresses to geographical locations in production business scenarios, obtain a professional IP address conversion service from a specialized IP address website.
Upload a resource (ip2region.jar)
Download the ip2region.jar package.
Note: The ip2region.jar package is used only in this tutorial.
Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane of the Data Studio page, click the icon. In the RESOURCE MANAGEMENT pane, click the icon and choose . In the Create Resource or Function dialog box, configure the Name parameter and click OK.
Note: The resource name can differ from the name of the package that you want to upload.
Set the File Source parameter to On-premises, click Upload next to the File Content parameter, and then select the ip2region.jar package that you downloaded to your on-premises machine.
From the Data Source drop-down list, select the MaxCompute computing resource that is associated with the workspace when you prepare environments.
In the top toolbar of the configuration tab, click Save and then click Deploy to deploy the resource to the MaxCompute project in both the development and production environments by following the on-screen instructions.
Register a UDF (getregion)
Log on to the DataWorks console and go to the Data Studio page. In the left-side navigation pane of the Data Studio page, click the icon. In the RESOURCE MANAGEMENT pane, click the icon and choose . In the Create Resource or Function dialog box, configure the Name parameter. In this tutorial, the Name parameter is set to getregion.
On the configuration tab that appears, configure the parameters. The following table describes the parameters that you must configure in this tutorial. Retain the default values for other parameters.
Function Type: Select OTHER.
Data Source: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.
Class Name: Enter org.alidata.odps.udf.Ip2Region.
Resource List: Select ip2region.jar.
Description: Enter a description. In this tutorial, the following description is entered: Conversion of an IP address into a region.
Command Syntax: Enter getregion('ip').
Parameter Description: Enter a parameter description. In this tutorial, the following parameter description is entered: IP address.
In the top toolbar of the configuration tab, click Save and then click Deploy to deploy the function to the MaxCompute project in both the development and production environments by following the instructions that are displayed.
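After the function is deployed, you can optionally run a quick ad hoc query to confirm that the UDF is callable. The following statement is a minimal sketch, assuming that your MaxCompute project allows SELECT statements without a FROM clause; the IP address is only a documentation placeholder, and the returned region depends on the IP address library in ip2region.jar.
-- Optional check that the getregion UDF is registered and callable.
-- 192.0.2.1 is a documentation placeholder; replace it with an IP address from your log data.
SELECT getregion('192.0.2.1');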
Step 3: Configure the MaxCompute SQL nodes
To perform data processing, you must schedule the related MaxCompute SQL node to implement each layer of processing logic. In this tutorial, complete sample code for data processing is provided. You must configure the code separately for the dwd_log_info_di_odps, dws_user_info_all_di_odps, and ads_user_info_1d_odps nodes.
Configure the dwd_log_info_di_odps node
In the sample code for this node, built-in functions and the registered UDF are used to process fields in the ancestor table ods_raw_log_d_odps and write the processed data to the dwd_log_info_di_odps table.
In the left-side navigation pane of the Data Studio page, click the icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the canvas of the workflow.
In the canvas of the workflow, move the pointer over the dwd_log_info_di_odps node and click Open Node.
Copy the sample SQL statements provided for this node and paste them in the code editor.
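The tutorial provides the complete sample code for this node. The statement below is only a minimal, hedged sketch of the processing pattern and is not that sample code: the source column name content, the '##@@' delimiter, and the output field names ip, uid, and region are assumptions for illustration; only the table names, the dt partition column, the ${bizdate} variable, and the getregion UDF come from this tutorial.
-- Illustrative sketch only: split the raw log column into separate fields and
-- convert the client IP address into a region by using the getregion UDF.
-- The column names content, ip, and uid and the '##@@' delimiter are assumptions.
INSERT OVERWRITE TABLE dwd_log_info_di_odps PARTITION (dt='${bizdate}')
SELECT
    t.ip,
    t.uid,
    getregion(t.ip) AS region -- region resolved from the IP address
FROM (
    SELECT
        SPLIT_PART(content, '##@@', 1) AS ip,
        SPLIT_PART(content, '##@@', 2) AS uid
    FROM ods_raw_log_d_odps
    WHERE dt = '${bizdate}'
) t;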
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.
Resource Group: Select the serverless resource group that you purchase when you prepare environments.
Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.
(Optional) Configure scheduling properties.
You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the configuration tab of the node to view the values of the parameters on the following subtabs. For information about other parameters on the Properties tab, see Scheduling properties.
Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.
Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.
In the top toolbar of the configuration tab, click Save to save the node.
Configure the dws_user_info_all_di_odps node
This node is used to aggregate the basic user information table ods_user_info_d_odps and the processed log data table dwd_log_info_di_odps and synchronize the aggregation result to the dws_user_info_all_di_odps table.
In the canvas of the workflow, move the pointer over the dws_user_info_all_di_odps node and click Open Node.
Copy the sample SQL statements provided for this node and paste them in the code editor.
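The tutorial provides the complete sample code for this node. The statement below is only a minimal, hedged sketch of the join pattern and is not that sample code: the uid join key and the gender and age_range columns are assumptions for illustration; the table names, the dt partition column, and the ${bizdate} variable come from this tutorial.
-- Illustrative sketch only: enrich the processed log data with basic user information
-- by joining the two tables on a shared user ID.
-- The uid, region, gender, and age_range columns are assumptions.
INSERT OVERWRITE TABLE dws_user_info_all_di_odps PARTITION (dt='${bizdate}')
SELECT
    a.uid,
    a.region,
    b.gender,
    b.age_range
FROM dwd_log_info_di_odps a
LEFT JOIN ods_user_info_d_odps b
    ON a.uid = b.uid
    AND b.dt = '${bizdate}'
WHERE a.dt = '${bizdate}';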
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.
Resource Group: Select the serverless resource group that you purchase when you prepare environments.
Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.
(Optional) Configure scheduling properties.
You can retain default values for parameters related to scheduling properties in this tutorial. You can click Properties in the right-side navigation pane of the configuration tab of the node to view the values of the parameters on the following subtabs. For information about other parameters on the Properties tab, see Scheduling properties.
Scheduling Parameters: In this tutorial, scheduling parameters are configured for the workflow. You do not need to configure scheduling parameters for the inner nodes of the workflow. The configured scheduling parameters can be directly used for code and tasks developed based on the inner nodes.
Scheduling Policies: You can configure the Time for Delayed Execution parameter to specify the duration by which the running of the node lags behind the running of the workflow. In this tutorial, you do not need to configure this parameter.
In the top toolbar of the configuration tab, click Save to save the node.
Configure the ads_user_info_1d_odps node
This node is used to further process data in the dws_user_info_all_di_odps table and synchronize the processed data to the ads_user_info_1d_odps table to generate a basic user profile.
In the canvas of the workflow, move the pointer over the ads_user_info_1d_odps node and click Open Node.
Copy the sample SQL statements provided for this node and paste them in the code editor.
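The tutorial provides the complete sample code for this node. The statement below is only a minimal, hedged sketch of the aggregation pattern and is not that sample code: the pv metric and the grouped columns are assumptions for illustration; the table names, the dt partition column, and the ${bizdate} variable come from this tutorial.
-- Illustrative sketch only: aggregate the joined data into one row per user,
-- which serves as the basic user profile for the data timestamp.
-- The uid, region, gender, age_range, and pv columns are assumptions.
INSERT OVERWRITE TABLE ads_user_info_1d_odps PARTITION (dt='${bizdate}')
SELECT
    uid,
    MAX(region) AS region,
    MAX(gender) AS gender,
    MAX(age_range) AS age_range,
    COUNT(*) AS pv -- number of access records per user
FROM dws_user_info_all_di_odps
WHERE dt = '${bizdate}'
GROUP BY uid;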
Configure debugging parameters.
In the right-side navigation pane of the configuration tab of the node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to test the workflow in Step 4.
Computing Resource: Select the MaxCompute computing resource that is associated with the workspace when you prepare environments.
Resource Group: Select the serverless resource group that you purchase when you prepare environments.
Script Parameters: You do not need to configure this parameter. In the sample code provided in this tutorial, the ${bizdate} variable is used to represent the data timestamp. When you debug the workflow by following Step 4, you can set the value of the variable to a constant, such as 20250223. When the workflow is run, the variables defined for the inner nodes of the workflow are replaced with the constant.
In the top toolbar of the configuration tab, click Save to save the node.
Step 4: Process data
Process data.
In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify a value that is used for the scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.
Query the data processing result.
In the left-side navigation pane of the Data Studio page, click the icon. In the Personal Directory section of the DATA STUDIO pane, click the icon to create a file whose name is suffixed with .sql. You can specify a name for the file based on your business requirements.
In the lower part of the configuration tab of the file, check whether the language mode is MaxCompute SQL.
In the code editor, enter the following SQL statement to query the number of data records in the final result table ads_user_info_1d_odps and check whether the data processing result is generated.
-- You must update the partition filter condition to the data timestamp of your current operation. In this tutorial, the value of the scheduling parameter bizdate configured in the preceding steps is 20250223.
SELECT count(*) FROM ads_user_info_1d_odps WHERE dt='data timestamp';
If the result returned after you execute the preceding statement shows that data exists, data processing is complete.
If the returned result shows that no data exists in the destination table, make sure that the value specified for the scheduling parameters defined for the inner nodes of the workflow in this run is the same as the value of the dt field in the preceding statement. You can click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the running record generated for this run to view the data timestamp that is used when the workflow is run. In the run logs, the data timestamp is in the partition=[pt=xxx] format.
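If you are unsure which partition the workflow wrote to, you can also list the partitions of the result table in the same ad hoc SQL file. SHOW PARTITIONS is standard MaxCompute SQL; only the table name comes from this tutorial.
-- List the partitions of the result table to confirm which data timestamp was written.
SHOW PARTITIONS ads_user_info_1d_odps;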
Step 5: Deploy the workflow
An auto triggered node can be automatically scheduled to run only after you deploy the node to the production environment. You can refer to the following steps to deploy the workflow to the production environment:
In this tutorial, scheduling parameters are configured for the workflow when you configure scheduling properties for the workflow. You do not need to separately configure scheduling parameters for each node in the workflow.
In the left-side navigation pane of the Data Studio page, click the icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the configuration tab of the workflow.
In the top toolbar of the configuration tab, click Deploy.
On the DEPLOY tab, click Start Deployment to Production Environment to deploy the workflow by following the on-screen instructions.
Step 6: Run the nodes in the production environment
After you deploy the nodes, the instances generated for the nodes are scheduled to run starting from the next day. You can use the data backfill feature to backfill data for nodes in a deployed workflow, which allows you to check whether the nodes can run in the production environment. For more information, see Backfill data and view data backfill instances (new version).
After the nodes are deployed, click Operation Center in the upper-right corner of the Data Studio page.
You can also click the icon in the upper-left corner of the Data Studio page and choose .
In the left-side navigation pane of the Operation Center page, choose . On the Auto Triggered Nodes page, find the zero load node workshop_start_odps and click the node name.
In the directed acyclic graph (DAG) of the node, right-click the workshop_start_odps node and choose .
In the Backfill Data panel, select the nodes for which you want to backfill data, configure the Data Timestamp parameter, and then click Submit and Redirect.
In the upper part of the Data Backfill page, click Refresh to check whether all nodes are successfully run.
To prevent excessive fees from being generated after the operations in the tutorial are complete, you can configure the Effective Period parameter for all nodes in the workflow or freeze the zero load node workshop_start_odps.
What to do next
Visualize data on a dashboard: After you complete user profile analysis, use DataAnalysis to display the processed data in charts. This helps you quickly extract key information to gain insights into the business trends behind the data.
Monitor data quality: Configure monitoring rules for tables that are generated after data processing to help identify and intercept dirty data in advance to prevent the impacts of dirty data from escalating.
Manage data: After user profile analysis is complete, data tables are generated in MaxCompute. You can view the data tables in Data Map, and determine the relationships between the data tables based on lineages.
Use an API to provide data services: After you obtain the final processed data, use standardized APIs in DataService Studio to share data and to provide data for other business modules that use APIs to receive data.