This tutorial provides a user persona example in the China (Shanghai) region and demonstrates how to use DataWorks for data synchronization, data transformation, and quality monitoring. To complete this tutorial, you must prepare the required MaxCompute project and DataWorks workspace, and configure the necessary data sources, computing resources, and storage.
Business background
Analyzing user behavior on your website is crucial for creating effective business strategies. This analysis yields basic user profile data, including geographic and social attributes. You can then schedule regular persona analyses to perform fine-grained operations on your website traffic.
Prerequisites
Before you begin, read Introduction to the experiment to understand the complete workflow of the user persona analysis case study.
Notes
This case study provides the required user information and website access test data. You can use this data directly.
The data in this case study is mock data provided for hands-on practice with DataWorks applications only.
This tutorial uses DataStudio (old version) for data transformation.
Prepare the MaxCompute environment
1. Activate MaxCompute
This case study requires MaxCompute. If you have not activated it yet, activate the service in the China (Shanghai) region with the following parameters.
Region: China (Shanghai)
Specifications Type: Standard computing resources.
2. Create a MaxCompute project
In a standard DataWorks workspace, you must attach two MaxCompute projects. One project serves as the computing resource for the development environment. The other serves as the computing resource for the production environment.
Go to the MaxCompute console. In the navigation pane on the left, select .
Click Create Project to create two MaxCompute projects. The following table describes the key parameters for this tutorial. You can use the default values for other parameters. For more information, see Create a MaxCompute project.
Configuration Item | Configuration
Project Name | Custom. Must be globally unique. Examples for this tutorial: workshop2024_01 (production environment) and workshop2024_01_dev (development environment).
Billing Method | For this tutorial, select Pay-As-You-Go.
Default Quota | For this tutorial, select Default Pay-as-you-go Quota from the drop-down list.
Data Type Edition | For this tutorial, select Data Type 2.0 (Recommended) from the drop-down list.
Storage Encryption | For this tutorial, select Not Encrypted.
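The project settings above can also be recorded as a small data structure, which makes it easy to confirm that both environments are covered before you continue. The following Python sketch is illustrative only: `ProjectSpec` and `tutorial_projects` are hypothetical names, and the `_dev` suffix simply mirrors the example names in this tutorial rather than a MaxCompute requirement.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProjectSpec:
    """Settings chosen for one MaxCompute project in this tutorial (hypothetical helper)."""
    name: str
    environment: str  # "production" or "development"
    billing_method: str = "Pay-As-You-Go"
    data_type_edition: str = "2.0"
    storage_encrypted: bool = False


def tutorial_projects(base_name: str) -> List[ProjectSpec]:
    """Return the production and development specs for a base project name.

    Assumption: the development project reuses the production name with a
    "_dev" suffix, as in the tutorial's example names.
    """
    return [
        ProjectSpec(name=base_name, environment="production"),
        ProjectSpec(name=f"{base_name}_dev", environment="development"),
    ]


projects = tutorial_projects("workshop2024_01")
for spec in projects:
    print(f"{spec.environment}: {spec.name}")
```

Because project names must be globally unique, replace `workshop2024_01` with your own base name when you adapt the sketch.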
Prepare the DataWorks environment
Before you use DataWorks for development, ensure that you have activated the DataWorks service. For more information, see Purchasing guide.
1. Create a workspace
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the navigation pane on the left, click Workspace to go to the Workspaces page.
Click Create Workspace. Select Isolate Development and Production Environments. Do not select Use Data Studio (New Version).
Starting February 18, 2025, if you activate DataWorks and create a workspace in the China (Shanghai) region for the first time, the new version of DataStudio is enabled by default. The Use Data Studio (New Version) parameter is not displayed. If the new version of DataStudio is already enabled for you by default, see Experience the new version of DataStudio.
For more information about how to create a workspace, see Create a workspace.
2. Create a Serverless resource group
This tutorial synchronizes data from OSS and MySQL to MaxCompute. The synchronization and scheduling tasks run on a DataWorks Serverless resource group, so you must purchase a Serverless resource group and complete the required preparations.
Purchase a Serverless resource group.
Log on to the DataWorks - Resource Groups page. In the top navigation bar, switch the region to China (Shanghai). In the navigation pane on the left, click Resource Group to open the Resource Groups page.
Click Create Resource Group. On the resource group purchase page, set Region and Zone to China (Shanghai) and specify a Resource Group Name. Configure the other parameters and complete the payment as prompted. For more information about the billing of Serverless resource groups, see Billing of Serverless resource groups.
Note: If no virtual private clouds (VPCs) or vSwitches are available in the current region, you can click the console link in the parameter description to create them. For more information about VPCs and vSwitches, see What is a VPC?.
Associate the resource group with the DataWorks workspace.
A newly purchased Serverless resource group must be attached to a workspace before it can be used.
Log on to the DataWorks - Resource Groups page. In the top navigation bar, switch the region to China (Shanghai). Find the Serverless resource group that you purchased. In the Actions column, click Associate Workspace. Then, click Associate next to the DataWorks workspace that you created.
Configure Internet access for the resource group.
The test data for this tutorial is retrieved from the Internet. By default, resource groups cannot access the Internet. You must configure an Internet NAT gateway for the VPC that is attached to the resource group and add elastic IP addresses (EIPs). This connects the VPC to the Internet and allows it to retrieve the data.
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters for this tutorial. For all other parameters, use the default values.
Parameter | Value
Region | China (Shanghai).
Network and Zone | Select the VPC and vSwitch that are attached to the resource group. To find them, go to the DataWorks console, switch the region, and click Resource Group in the navigation pane on the left. Find the resource group that you created and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section, view the associated VPC and vSwitch. For more information about VPCs and vSwitches, see What is a VPC?.
Network Type | Internet NAT Gateway.
EIP | Purchase New EIP.
Create Service-linked Role | When you create a NAT gateway for the first time, you must create a service-linked role. Click Create Service-linked Role.
Click Buy Now, accept the terms of service, and then click Activate Now to complete the purchase.
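After the NAT gateway and EIP are in place, you may want to confirm that outbound connections actually succeed. The following Python sketch uses only the standard library to test whether a TCP connection can be opened. It assumes that you run it from an environment whose traffic leaves through the same VPC, which this tutorial does not set up for you. The function name `can_reach` and the host in the commented example are hypothetical.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (hypothetical host; replace it with the endpoint you need to reach):
# print(can_reach("example.com", 443))
```

A result of `False` only indicates that the connection could not be opened. It does not tell you whether the cause is the NAT gateway, a security rule, or the remote host.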
For more information about how to add and use a Serverless resource group, see Use a Serverless resource group.
3. Associate the MaxCompute project
You must attach the MaxCompute project that you created to the DataWorks workspace as a computing resource. This lets you process data in MaxCompute using the Data Development module.
Go to the DataWorks - Workspaces page. In the top navigation bar, switch the region to China (Shanghai). Find your workspace and click its name to go to the Workspace Details page.
In the navigation pane on the left, click Computing Resource. You are redirected to the page.
Click Create Computing Resource. Select a computing resource Type and configure the parameters to attach the resource.
This tutorial uses MaxCompute as the computing and storage resource. The following table describes the other key parameters. For all other parameters, use the default values.
Parameter | Description
Data Source Name | Custom name that identifies the computing resource. At runtime, the computing resource instance name is used to select the computing resource for the task.
Alibaba Cloud Account | Select Current Alibaba Cloud Account.
Region | Select the same region as the current DataWorks workspace: China (Shanghai).
MaxCompute Project Name | Select the MaxCompute project to attach. For this tutorial, attach the corresponding MaxCompute projects created in Step 2 to the production and development environments.
Default Access Identity | Defines the identity used to access the MaxCompute project in the current workspace. Development environment: only the Executor identity is supported. Production environment: select an identity from the drop-down list based on the current logon account. For this tutorial, select Alibaba Cloud Account. Note: If you are logged on with a different identity, see New version of DataStudio: Attach a MaxCompute compute engine for configuration details.
Connection Configuration | The resource group used to connect to the MaxCompute computing resource. The Serverless resource group that you created and attached to the current workspace is displayed here. You must test the connectivity for both the development and production environments.
Click Create and Associate Computing Resource with DataStudio.
Follow the prompts on the page. After you refresh the computing resources page for Data Development, the attached MaxCompute computing resource is displayed.
Note: If the status of the MaxCompute computing resource is Not Associated, click Associate.
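As an optional check outside the console, you could confirm that both MaxCompute projects exist by using PyODPS, the MaxCompute SDK for Python. The sketch below is not part of the tutorial workflow. It assumes that PyODPS is installed (`pip install pyodps`), that AccessKey credentials are available in the environment variables shown, and that the public endpoint for China (Shanghai) is the one in the code. Verify each of these for your account. The helper names `make_client` and `missing_projects` are hypothetical.

```python
import os


def make_client(project: str):
    """Build a PyODPS client for the given project (assumes pyodps is installed)."""
    from odps import ODPS  # imported lazily so missing_projects has no hard dependency

    return ODPS(
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
        project=project,
        # Assumed public endpoint for China (Shanghai); verify it for your account.
        endpoint="https://service.cn-shanghai.maxcompute.aliyun.com/api",
    )


def missing_projects(client, names):
    """Return the project names that the client reports as not existing."""
    return [name for name in names if not client.exist_project(name)]


# Example (requires network access and valid credentials):
# client = make_client("workshop2024_01_dev")
# print(missing_projects(client, ["workshop2024_01", "workshop2024_01_dev"]))
```

An empty list means that both projects were found. It does not confirm that they are attached to the workspace, which you still check on the computing resources page.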
Next steps
Now that you have prepared the environment, you can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize basic user information and website access logs to MaxCompute. For more information, see Synchronize data.