This tutorial describes how to perform user profile analysis. It uses DataWorks to synchronize data, process data, and monitor data quality. Before you begin, create the E-MapReduce (EMR) cluster and the DataWorks workspace that the tutorial requires, and configure the environment.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on websites. The basic profile data includes the geographical and social attributes of the website users. Analyzing the profile data by time and location enables fine-grained operations on website traffic.
Usage notes
Read Experiment introduction before you start so that you understand the entire process of the user profile analysis experiment and can complete the tutorial as expected.
Precautions
Basic user information and website access logs of users that are required for tests in this tutorial are provided.
The data in this tutorial can be used only for experimental operations in DataWorks. All of the data is manually generated mock data.
This tutorial uses Data Development (Data Studio) (New Version) to perform data transformation.
Prepare an EMR environment
This tutorial requires an EMR cluster, which needs to be registered to DataWorks. This allows you to run data processing tasks based on the EMR cluster in the DataWorks console. The following table describes the key parameters for creating an EMR cluster. For information about how to create an EMR cluster, see Create a cluster.
Parameter | Description |
Region | China (Shanghai). |
Business Scenario | Data Lake. |
Product Version | Select the latest version. |
Optional Services (Select One At Least) | Select components based on your business requirements. This tutorial requires the Hive and OSS-HDFS components. |
Metadata | DLF Unified Metadata. |
Root Storage Directory of Cluster | Select a valid OSS-HDFS bucket. If no OSS-HDFS bucket is available in the drop-down list, click Create OSS-HDFS Bucket. |
DataWorks support varies across EMR cluster configurations. Before you create an EMR cluster and develop EMR tasks in DataWorks based on the EMR cluster, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.
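The parameters in the preceding table can be captured as a small pre-flight checklist. The following Python sketch is illustrative only: the dictionary keys, the `check_cluster_plan` helper, and the sample bucket name are hypothetical and are not part of any EMR or DataWorks API. The sketch only compares a planned configuration with the values that this tutorial requires.

```python
# Hypothetical pre-flight check; not an EMR or DataWorks API.
# The expected values come from the parameter table in this tutorial.
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}
REQUIRED_SETTINGS = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
}


def check_cluster_plan(plan):
    """Return a list of problems; an empty list means the plan matches."""
    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = plan.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    missing = REQUIRED_SERVICES - set(plan.get("services", []))
    if missing:
        problems.append("missing services: " + ", ".join(sorted(missing)))
    if not plan.get("root_storage_bucket"):
        problems.append("root_storage_bucket: select an OSS-HDFS bucket")
    return problems


example_plan = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
    "services": ["Hive", "OSS-HDFS"],
    "root_storage_bucket": "my-example-bucket",  # hypothetical bucket name
}
print(check_cluster_plan(example_plan))  # prints []
```

A non-empty result lists each setting that differs from the tutorial's requirements, which you can correct before you create the cluster.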
Prepare a DataWorks environment
Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Activate DataWorks.
Step 1: Create a workspace
If a workspace in which Participate in Public Preview of DataStudio of New Version is turned on exists in the China (Shanghai) region, skip this step and use the existing workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace to go to the Workspaces page.
On the Workspaces page, click Create Workspace to create a workspace in standard mode. Turn on Participate in Public Preview of DataStudio of New Version when you create the workspace. For a workspace in standard mode, the development environment is isolated from the production environment.
Note: As of February 18, 2025, the first time you activate DataWorks and create a workspace in the China (Shanghai) region by using your Alibaba Cloud account, the new-version Data Studio is activated by default.
For more information about how to create a workspace, see Create a workspace.
Step 2: Create a serverless resource group
Purchase a serverless resource group.
This tutorial requires a serverless resource group for data synchronization and scheduling. Therefore, you need to purchase and configure a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow the on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.
Note: In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Serverless resource groups do not support cross-region operations.
If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one. For more information about VPCs and vSwitches, see What is a VPC?
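Because serverless resource groups do not support cross-region operations, it can help to confirm that every resource planned for this tutorial is in the same region. The following is a minimal sketch that assumes you record the region of each resource yourself. The resource names and the `find_region_mismatches` helper are hypothetical, and `cn-shanghai` is the region ID of China (Shanghai).

```python
# Hypothetical helper; the region values are entered by hand, not read from any API.
TUTORIAL_REGION = "cn-shanghai"  # region ID of China (Shanghai)


def find_region_mismatches(resources, expected=TUTORIAL_REGION):
    """Return the sorted names of resources whose region differs from `expected`."""
    return sorted(name for name, region in resources.items() if region != expected)


planned = {
    "emr_cluster": "cn-shanghai",
    "dataworks_workspace": "cn-shanghai",
    "serverless_resource_group": "cn-shanghai",
    "vpc": "cn-shanghai",
}
mismatches = find_region_mismatches(planned)
print(mismatches or "all resources are in the same region")
```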
Associate the serverless resource group with the DataWorks workspace.
You can use the purchased serverless resource group in subsequent operations only after you associate the serverless resource group with a workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace with which you want to associate the serverless resource group and click Associate in the Actions column.
Enable the serverless resource group to access the Internet.
The test data used in this tutorial must be obtained over the Internet. By default, the serverless resource group cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the serverless resource group is associated and configure an elastic IP address (EIP) for the VPC to establish a network connection between the VPC and the network environment of the test data. This way, you can use the serverless resource group to access the test data.
Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
Parameter | Description |
Region | Select China (Shanghai). |
VPC, Associate vSwitch | Select the VPC and vSwitch with which the resource group is associated. To view the VPC and vSwitch with which the resource group is associated, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
Access Mode | Select SNAT-enabled Mode. |
EIP | Select Purchase EIP. |
Service-linked Role | Click Create Service-linked Role to create a service-linked role if this is the first time you create a NAT gateway. |
Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.
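After the Internet NAT gateway and EIP are in place, you can sanity-check outbound access with a plain TCP connection attempt from an environment that routes traffic through the same VPC. The following is a minimal sketch that uses only the Python standard library. `can_reach` is a hypothetical helper, and the host and port in the usage comment are placeholders, not the actual location of the test data for this tutorial. Whether you can run Python code from inside the VPC depends on your setup and is an assumption here.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage with a placeholder endpoint (replace with the endpoint you need to reach):
# print(can_reach("example.com", 443))
```

A `False` result only indicates that the TCP connection failed. It does not identify whether the cause is the NAT gateway, the EIP, a security rule, or the remote endpoint.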
For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
Step 3: Associate an EMR computing resource with the workspace
You can perform the following operations to associate an EMR computing resource with the workspace that you created to store data.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace. On the Workspaces page, find the desired workspace and click the name of the workspace to go to the Workspace Details page.
In the left-side navigation pane of the Workspace Details page, click Computing Resource.
On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, click EMR to go to the Associate EMR Computing Resource panel.
In the Associate EMR Computing Resource panel, configure the parameters. The following table describes the parameters.
Parameter | Description |
Alibaba Cloud Account to Which Cluster Belongs | Select Current Alibaba Cloud Account. |
Cluster Type | Select Data Lake. |
Cluster | Select the EMR cluster that you created in the Prepare an EMR environment section. |
Default Access Identity | The identity that you want to use to access the EMR cluster in the current workspace. Development Environment: Select Cluster Account: hadoop. Production Environment: Select Cluster Account: hadoop. |
Pass Proxy User Information | Select Do Not Pass. |
Computing Resource Instance Name | The name of the computing resource instance. |
Click OK.
On the Computing Resource page, find the EMR computing resource that you created and click Initialize Resource Group in the upper-right corner to test the network connectivity between the EMR computing resource and the resource group.
For more information about how to associate a computing resource with a workspace, see Associate a computing resource with a workspace.
What to do next
You have prepared your environments and can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize the basic user information and website access logs of users to OSS, and how to create a table in an EMR Hive node to query the synchronized data.