This topic describes how to use E-MapReduce (EMR) Hive nodes in DataWorks to process data in the ods_user_info_d_emr and the ods_raw_log_d_emr tables, which are stored in an Object Storage Service (OSS) bucket after synchronization, to obtain the desired user profile data. The ods_user_info_d_emr table stores basic user information, and the ods_raw_log_d_emr table stores website access logs.
Prerequisites
The required data is synchronized. For more information, see Synchronize data.
Step 1: Design a data processing link
Log on to the DataWorks console and go to the DATA STUDIO pane of the Data Studio page. In the Workspace Directories section, find the workflow workshop_emr that you create when you synchronize data. On the configuration tab of the workflow, configure the workflow.
You must create three EMR Hive nodes named dwd_log_info_di_emr, dws_user_info_all_di_emr, and ads_user_info_1d_emr. Then, configure scheduling dependencies for these nodes, as shown in the following figure.
The following table lists node names that are used in this tutorial and the functionalities of the nodes.
| Node type | Node name | Node functionality |
| --- | --- | --- |
| EMR Hive | dwd_log_info_di_emr | This node is used to cleanse raw OSS log data. This node splits data in the ods_raw_log_d_emr table into analyzable fields and writes the results to the dwd_log_info_di_emr table. |
| EMR Hive | dws_user_info_all_di_emr | This node is used to aggregate the cleansed OSS log data and basic user information data. This node aggregates data in the basic user information table ods_user_info_d_emr and the cleansed log table dwd_log_info_di_emr, and writes the results to the dws_user_info_all_di_emr table. |
| EMR Hive | ads_user_info_1d_emr | This node is used to further process data in the dws_user_info_all_di_emr table to generate the final user profile data in the ads_user_info_1d_emr table. |
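As a preview of the processing logic that you configure in Step 3, the following minimal Hive sketch shows the general shape of the final ADS-layer statement: it reads one partition of the aggregated dws_user_info_all_di_emr table and writes the corresponding partition of ads_user_info_1d_emr. The column names (uid, region, device) and the bizdate scheduling parameter are assumptions used only for illustration; the actual columns and logic are defined in the sample code that Step 3 references.
-- Minimal sketch only: the columns uid, region, and device are hypothetical,
-- and ${bizdate} assumes a scheduling parameter with that name is defined.
INSERT OVERWRITE TABLE ads_user_info_1d_emr PARTITION (dt = '${bizdate}')
SELECT uid, region, device, COUNT(1) AS pv
FROM dws_user_info_all_di_emr
WHERE dt = '${bizdate}'
GROUP BY uid, region, device;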
Step 2: Register a UDF
To ensure that data can be processed as expected, you must register an EMR UDF named getregion. The UDF is used to split the structure of the log data that is synchronized to EMR into table fields.
Upload an EMR JAR resource (ip2region-emr.jar)
Download the required JAR package.
In this example, you need to download the ip2region-emr.jar package.
Create an EMR JAR resource.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and use the options in the Actions column to go to the Data Studio page.
In the left-side navigation pane of the Data Studio page, click the RESOURCE MANAGEMENT icon to go to the RESOURCE MANAGEMENT pane.
In the RESOURCE MANAGEMENT pane, click Create. In the Create Resource dialog box, select EMR Jar from the Type drop-down list and configure the Name parameter based on your business requirements. Then, click OK.
On the configuration tab of the EMR JAR resource, configure the following parameters.
| Parameter | Description |
| --- | --- |
| File Source | Select On-premises. |
| File Content | Click Upload to upload the downloaded ip2region-emr.jar package. |
| Storage Path | Select OSS. Then, select the OSS bucket that you specified for the EMR cluster created during environment preparation. |
| Data Source | Select the computing resource that you associate with the workspace when you synchronize data. |
| Resource Group | Select the serverless resource group that you create when you prepare environments. |
In the top toolbar of the configuration tab, click Save. Then, click Deploy to deploy the EMR JAR resource to the development environment and production environment.
Register an EMR UDF (getregion)
Create an EMR UDF.
In the RESOURCE MANAGEMENT pane, find the created EMR JAR resource, right-click the resource name, and then choose the option to create a function. In the Create Resource or Function dialog box, set the Name parameter to getregion and click OK.
Register the EMR UDF.
On the configuration tab of the EMR UDF, configure the following parameters.
| Parameter | Description |
| --- | --- |
| Function Type | Select OTHER. |
| Data Source | Select the computing resource that you associate with the workspace when you synchronize data. |
| EMR Database | Select default. |
| Resource Group | Select the serverless resource group that you create when you prepare environments. |
| Owner | Select a user who has the required permissions as the owner. |
| Class Name | Enter org.alidata.emr.udf.Ip2Region. |
| Resources | Select the name of the EMR JAR resource that you create. |
Deploy the EMR UDF.
In the top toolbar of the configuration tab, click Save. Then, click Deploy to deploy the EMR UDF to the development environment and production environment.
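For reference, the registration that DataWorks performs is roughly equivalent to the following Hive DDL. This is only a sketch: the OSS path is a placeholder for the bucket that stores ip2region-emr.jar, and the usage example assumes that the UDF accepts an IP address string and returns a region string.
-- Rough Hive equivalent of the UDF registration; replace the placeholder path
-- with the OSS location of the uploaded ip2region-emr.jar resource.
CREATE FUNCTION getregion AS 'org.alidata.emr.udf.Ip2Region'
USING JAR 'oss://<your-bucket>/<path>/ip2region-emr.jar';

-- Example call, assuming the UDF maps an IP address string to a region string.
SELECT getregion('1.2.3.4');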
Step 3: Configure the EMR nodes
To perform data processing, you must schedule the related EMR Hive nodes to implement each layer of processing logic. In this tutorial, complete sample code for data processing is provided. You must configure the code separately for the dwd_log_info_di_emr, dws_user_info_all_di_emr, and ads_user_info_1d_emr nodes.
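To illustrate what the DWD-layer cleansing looks like, the following is a minimal sketch for the dwd_log_info_di_emr node, assuming the raw log table exposes a single string column named content and that the client IP is the first space-separated field. These assumptions, the output column names, and the ${bizdate} parameter are illustrative only; use the sample code provided for this tutorial for the actual logic.
-- Minimal sketch only: the column content, the parsing rules, and the output
-- columns are assumptions for illustration, not the official sample code.
INSERT OVERWRITE TABLE dwd_log_info_di_emr PARTITION (dt = '${bizdate}')
SELECT
  split(content, ' ')[0]                    AS ip,         -- client IP, assumed to be the first field
  getregion(split(content, ' ')[0])         AS region,     -- resolve the IP to a region with the registered UDF
  regexp_extract(content, '\\[(.*?)\\]', 1) AS time_local  -- access time between square brackets (assumed)
FROM ods_raw_log_d_emr
WHERE dt = '${bizdate}';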
Step 4: Process data
Process data.
In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify a value that is used for scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.
Query the data processing result.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and use the options in the Actions column to go to the Data Studio page.
In the left-side navigation pane of the Data Studio page, click the DATA STUDIO icon to go to the DATA STUDIO pane. The Workspace Directories section appears.
In the Workspace Directories section, find the work directory that you create, right-click the directory name, and then choose the option to create an EMR Hive node. In the Create Node dialog box, configure the Name parameter based on your business requirements and click OK.
On the configuration tab of the EMR Hive node, enter the following SQL statement in the code editor, replace the data timestamp with the data timestamp of the ads_user_info_1d_emr node, and then execute the SQL statement to query the data processing result in the ads_user_info_1d_emr table.
Note: You must change the value of the partition key column dt to the data timestamp of the ads_user_info_1d_emr node. For example, if a node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.
SELECT * FROM ads_user_info_1d_emr WHERE dt=The data timestamp;
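For example, if the node ran on February 23, 2025, the data timestamp is 20250222 and the statement looks like the following. The quotation marks assume that dt is a string-typed partition key column.
SELECT * FROM ads_user_info_1d_emr WHERE dt = '20250222';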
If the result returned after you execute the preceding statement shows that data exists, the data processing is complete.
If the result returned after you execute the preceding statement shows that no data exists in the destination table, make sure that the value specified for the scheduling parameters defined for the inner nodes of the workflow in this run is the same as the value of the dt field in the preceding statement. To check the data timestamp that was used, click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the running record generated for this run. The data timestamp appears in the run logs of the workflow in the partition=[pt=xxx] format.
Step 5: Deploy the workflow
An auto triggered node can be automatically scheduled to run only after you deploy the node to the production environment. You can refer to the following steps to deploy the workflow to the production environment:
In this tutorial, scheduling parameters are configured for the workflow when you configure scheduling properties for the workflow. You do not need to separately configure scheduling parameters for each node in the workflow.
In the left-side navigation pane of the Data Studio page, click the DATA STUDIO icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow and click the workflow name to go to the configuration tab of the workflow.
In the top toolbar of the configuration tab, click Deploy.
On the DEPLOY tab, click Start Deployment to Production Environment to deploy the workflow by following the on-screen instructions.
Step 6: Run the nodes in the production environment
The instances generated for the nodes that you deploy on a day can be scheduled to run starting from the next day. You can use the data backfill feature to backfill data for nodes in a deployed workflow, which allows you to check whether the nodes can run as expected in the production environment. For more information, see Backfill data and view data backfill instances (new version).
After the nodes are deployed, click Operation Center in the upper-right corner of the Data Studio page.
You can also click the icon in the upper-left corner of the Data Studio page and go to Operation Center from the menu that appears.
In the left-side navigation pane of the Operation Center page, go to the Auto Triggered Nodes page. On the Auto Triggered Nodes page, find the zero load node workshop_start_emr and click the node name.
In the directed acyclic graph (DAG) of the node, right-click the workshop_start_emr node and choose the data backfill option.
In the Backfill Data panel, select the nodes for which you want to backfill data, configure the Data Timestamp parameter, and then click Submit and Redirect.
In the upper part of the Data Backfill page, click Refresh to check whether all nodes are successfully run.
To prevent excessive fees from being generated after the tutorial is complete, you can configure the Effective Period parameter for all nodes in the workflow or freeze the zero load node workshop_start_emr.
What to do next
Visualize data on a dashboard: After you complete user profile analysis, use DataAnalysis to display the processed data in charts. This helps you quickly extract key information to gain insights into the business trends behind the data.
Monitor data quality: Configure monitoring rules for tables that are generated after data processing to help identify and intercept dirty data in advance to prevent the impacts of dirty data from escalating.
Manage data: Data tables are generated in EMR Hive nodes after user profile analysis is complete. You can view the generated data tables in Data Map and view the relationships between the tables based on data lineages.
Use an API to provide data services: After you obtain the final processed data, use standardized APIs in DataService Studio to share data and to provide data for other business modules that use APIs to receive data.