This tutorial describes how to perform user profile analysis. In this tutorial, DataWorks is used to synchronize data, process data, and monitor data quality. All data resources involved in this tutorial reside in the China (Shanghai) region. To ensure that you can complete the tutorial as expected, you must create a MaxCompute project and a DataWorks workspace, and configure data sources, computing resources, and storage resources.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on websites. The basic profile data includes the geographical and social attributes of the website users. You can then analyze the profile data by time and location to refine how you manage website traffic.
Usage notes
Before you begin, read Experiment introduction to understand the entire process of the user profile analysis experiment. This helps ensure that you can complete the tutorial as expected.
Precautions
This tutorial provides the basic user information and website access logs of users that are required for the tests.
The data in this tutorial is manually generated mock data and can be used only for experimental operations in DataWorks.
This tutorial uses Data Development (Data Studio) (New Version) to perform data transformation.
Prepare a MaxCompute environment
Step 1: Activate MaxCompute
In this tutorial, MaxCompute is used. Make sure that MaxCompute is activated. The following information describes the parameters that you must configure when you activate MaxCompute.
Region: China (Shanghai)
Specifications Type: Standard
Step 2: Create MaxCompute projects
A DataWorks workspace in standard mode has separate development and production environments, and you must associate a MaxCompute project with each environment.
Log on to the MaxCompute console. In the left-side navigation pane, go to the Projects page.
On the Projects page, click Create Project to create two MaxCompute projects as data sources in the development and production environments in DataWorks. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
| Parameter | Description |
| --- | --- |
| Project Name (Globally Unique) | Specify a name based on your business requirements. The name must be globally unique. Examples in this tutorial: production environment: workshop2024_01; development environment: workshop2024_01_dev. |
| Billing Method | Select Pay-as-you-go. |
| Default Quota | Select Default Pay-as-you-go Quota. |
| Data Type Edition | Select MaxCompute V2.0 Data Type Edition (Recommended). |
| Encrypt | Select No. |
For more information about how to create a MaxCompute project, see Create a MaxCompute project.
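The parameter choices above can also be written down as data, which makes it easier to keep the two projects consistent. The following is a minimal sketch, not an official API: `MaxComputeProjectSpec` and `build_project_specs` are hypothetical names introduced here for illustration, and the `_dev` suffix simply mirrors the example names in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaxComputeProjectSpec:
    """Settings chosen on the Create Project page (hypothetical container)."""
    name: str
    environment: str  # "production" or "development"
    billing_method: str = "Pay-as-you-go"
    default_quota: str = "Default Pay-as-you-go Quota"
    data_type_edition: str = "MaxCompute V2.0 Data Type Edition"
    encrypt: bool = False


def build_project_specs(base_name: str) -> dict[str, MaxComputeProjectSpec]:
    """Return one spec per environment.

    The development project reuses the production name with a `_dev`
    suffix, as in this tutorial's example names (an assumed convention).
    """
    return {
        "production": MaxComputeProjectSpec(base_name, "production"),
        "development": MaxComputeProjectSpec(f"{base_name}_dev", "development"),
    }


specs = build_project_specs("workshop2024_01")
for env, spec in specs.items():
    print(f"{env}: {spec.name} ({spec.billing_method}, encrypt={spec.encrypt})")
```

The sketch only records the settings. You still create the projects in the MaxCompute console as described above.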
Prepare a DataWorks environment
Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Activate DataWorks.
Step 1: Create a workspace
If a workspace in the China (Shanghai) region already has Participate in Public Preview of DataStudio of New Version turned on, skip this step and use the existing workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace to go to the Workspaces page.
On the Workspaces page, click Create Workspace to create a workspace in standard mode. Turn on Participate in Public Preview of DataStudio of New Version when you create the workspace. For a workspace in standard mode, the development environment is isolated from the production environment.
Note: As of February 18, 2025, the first time you activate DataWorks and create a workspace in the China (Shanghai) region by using your Alibaba Cloud account, the new-version Data Studio is activated by default.
For more information about how to create a workspace, see Create a workspace.
Step 2: Create a serverless resource group
Purchase a serverless resource group.
This tutorial uses a serverless resource group for data synchronization and scheduling, so you must purchase and configure one.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow the on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.
Note: In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Serverless resource groups do not support cross-region operations.
If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one. For more information about VPCs and vSwitches, see What is a VPC?
Associate the serverless resource group with the DataWorks workspace.
You can use the purchased serverless resource group in subsequent operations only after you associate the serverless resource group with a workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace with which you want to associate the serverless resource group and click Associate in the Actions column.
Enable the serverless resource group to access the Internet.
The test data used in this tutorial must be obtained over the Internet, but a serverless resource group cannot access the Internet by default. To enable access, configure an Internet NAT gateway for the VPC with which the serverless resource group is associated, and configure an elastic IP address (EIP) for it. This establishes a network connection between the VPC and the network environment of the test data so that the serverless resource group can access the test data.
Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
| Parameter | Description |
| --- | --- |
| Region | Select China (Shanghai). |
| VPC and Associate vSwitch | Select the VPC and vSwitch with which the resource group is associated. To view them, log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
| Access Mode | Select SNAT-enabled Mode. |
| EIP | Select Purchase EIP. |
| Service-linked Role | If this is the first time you create a NAT gateway, click Create Service-linked Role to create a service-linked role. |
Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.
For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
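After the NAT gateway and EIP are in place, you may want a quick check that outbound connections succeed. The following is a minimal sketch that uses only the Python standard library and is meant to be run from a host in the same VPC. `can_reach` is a hypothetical helper name, and `example.com` is a placeholder rather than the actual address of the test data.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder endpoint (assumption): replace it with the public address
    # that hosts the data you need to reach.
    print(can_reach("example.com", 443))
```

A result of `False` only shows that the TCP connection could not be opened. It does not identify whether the cause is the NAT gateway, a security rule, or the remote endpoint.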
Step 3: Associate a MaxCompute computing resource with the workspace
You must associate the MaxCompute projects with a DataWorks workspace as computing resources before you can process data of MaxCompute in Data Studio.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace. On the Workspaces page, find the desired workspace and click the name of the workspace to go to the Workspace Details page.
In the left-side navigation pane of the Workspace Details page, click Computing Resource.
On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, select computing resource types based on your business requirements and configure parameters.
In this tutorial, MaxCompute is used as the computing and storage resource, so select MaxCompute as the computing resource type and configure the related parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
| Parameter | Description |
| --- | --- |
| MaxCompute Project | Select the MaxCompute project that you want to associate. In this tutorial, associate the MaxCompute projects that you created for the development and production environments when you prepared the MaxCompute environment. |
| Default Access Identity | The default access identity that is used to access the MaxCompute project in the current workspace. Development environment: the value is fixed as Executor. Production environment: select an identity from the Default Access Identity drop-down list based on the current logon account. In this tutorial, Alibaba Cloud Account is selected. Note: If the current logon account is not an Alibaba Cloud account, see Add a MaxCompute data source to configure this parameter. |
| Computing Resource Instance Name | The identifier of the computing resource. You can specify an identifier. When a task is run, the system selects a computing resource for the task based on the specified computing resource instance name. |
| Connection Configuration | The resource group that is used to connect to the MaxCompute computing resource. Select the serverless resource group that you created and associated with the current workspace. You must test the network connectivity between the resource group and the MaxCompute computing resource separately for the development and production environments. |
Click OK.
For more information about how to associate a computing resource with a workspace, see Associate a computing resource with a workspace.
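The requirements in the preceding table can be summarized as a small checklist. The following is a minimal sketch under stated assumptions: `ComputingResourceBinding` and `check_bindings` are hypothetical names, and the checks encode only what this tutorial states, namely one project per environment, Executor as the development identity, and connectivity tested for each environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputingResourceBinding:
    """One environment's MaxCompute binding (hypothetical container)."""
    environment: str  # "development" or "production"
    maxcompute_project: str
    default_access_identity: str
    connectivity_tested: bool


def check_bindings(bindings: list[ComputingResourceBinding]) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    by_env = {b.environment: b for b in bindings}
    for env in ("development", "production"):
        if env not in by_env:
            problems.append(f"{env}: no MaxCompute project associated")
            continue
        b = by_env[env]
        if not b.connectivity_tested:
            problems.append(f"{env}: connectivity to {b.maxcompute_project} not tested")
    dev = by_env.get("development")
    if dev and dev.default_access_identity != "Executor":
        problems.append("development: Default Access Identity must be Executor")
    return problems


bindings = [
    ComputingResourceBinding("development", "workshop2024_01_dev", "Executor", True),
    ComputingResourceBinding("production", "workshop2024_01", "Alibaba Cloud Account", False),
]
print(check_bindings(bindings))
```

In this usage example, the production binding is reported because its connectivity has not been tested yet.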
What to do next
You have prepared your environments and can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize the basic user information and website access logs of users to MaxCompute, and how to create a table in an ODPS SQL node to query the synchronized data.