This tutorial describes how to perform user profile analysis. In this tutorial, DataWorks is used to synchronize data, process data, and monitor data quality. All data resources involved in this tutorial reside in the China (Shanghai) region. To ensure that you can complete the tutorial as expected, you must first create an E-MapReduce (EMR) Serverless StarRocks instance and a DataWorks workspace and configure the required environments.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on websites. The basic profile data includes the geographical and social attributes of the website users. You can analyze profile data by time and location, enabling refined operations on website traffic.
Usage notes
Before you begin, read Experiment introduction to understand the entire process of the user profile analysis experiment. This helps ensure that you can complete the tutorial as expected.
Precautions
Basic user information and website access logs of users that are required for tests in this tutorial are provided.
The data in this tutorial is manually generated mock data and can be used only for experimental operations in DataWorks.
This tutorial utilizes Data Development (Data Studio) (New Version) to perform data transformation.
Prepare an OSS environment
In this tutorial, a user-defined function (UDF) is used. The resource used to register the function is uploaded to Object Storage Service (OSS). Make sure that OSS is activated, an OSS bucket is created, and the Bucket ACL parameter of the OSS bucket is set to Public Read/Write.
Prepare an EMR Serverless StarRocks environment
In this tutorial, EMR Serverless StarRocks is used to process data. Make sure that you have an EMR Serverless StarRocks instance. If you do not have one, create an instance as described in Create an instance and use the following settings:
Instance Type: Compute-storage Integration.
Region: China (Shanghai).
Instance Edition: Basic Edition.
Important: Basic Edition is only for trial use and feature testing. The service level agreement (SLA) for this edition is not guaranteed. You can select Standard Edition for the Instance Edition parameter based on your business requirements.
Version: 3.1.
In this tutorial, data is processed in a database on the EMR Serverless StarRocks instance. After the instance is created, you must create a database. Log on to the EMR Serverless StarRocks instance and execute the following SQL statement in the SQL Editor to create a database:
CREATE DATABASE <database_name>;
Prepare a DataWorks environment
Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Activate DataWorks.
Step 1: Create a workspace
If the China (Shanghai) region already has a workspace for which Participate in Public Preview of DataStudio of New Version is turned on, skip this step and use the existing workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace to go to the Workspaces page.
On the Workspaces page, click Create Workspace to create a workspace in standard mode. Turn on Participate in Public Preview of DataStudio of New Version when you create the workspace. For a workspace in standard mode, the development environment is isolated from the production environment.
Note: As of February 18, 2025, the first time you activate DataWorks and create a workspace in the China (Shanghai) region by using your Alibaba Cloud account, the new-version Data Studio is activated by default.
For more information about how to create a workspace, see Create a workspace.
Step 2: Create a serverless resource group
Purchase a serverless resource group.
This tutorial requires a serverless resource group for data synchronization and scheduling. Therefore, you need to purchase and configure a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow the on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.
Note: If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one. For more information about VPCs and vSwitches, see What is VPC?
Associate the serverless resource group with the DataWorks workspace.
You can use the purchased serverless resource group in subsequent operations only after you associate the serverless resource group with a workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace with which you want to associate the serverless resource group and click Associate in the Actions column.
Enable the serverless resource group to access the Internet.
The test data used in this tutorial must be obtained over the Internet. By default, the serverless resource group cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the serverless resource group is associated and configure an elastic IP address (EIP) for the VPC to establish a network connection between the VPC and the network environment of the test data. This way, you can use the serverless resource group to access the test data.
Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
Parameter: Description
Region: Select China (Shanghai).
VPC and Associate vSwitch: Select the VPC and vSwitch with which the resource group is associated. To view them, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is VPC?
Access Mode: Select SNAT-enabled Mode.
EIP: Select Purchase EIP.
Service-linked Role: If this is the first time you create a NAT gateway, click Create Service-linked Role to create a service-linked role.
Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.
For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
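After the NAT gateway and EIP are in place, it can be useful to confirm that outbound Internet access actually works before you move on. The following is a minimal sketch rather than part of the official procedure. It assumes that you can run Python 3 somewhere that shares the same VPC egress as the serverless resource group, and the function name can_reach is introduced here only for illustration.

```python
import urllib.error
import urllib.request


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if a request to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered (for example with 403 or 404),
        # so the network path itself is working.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return False
```

For example, can_reach("https://example.com") returns True only if the request leaves the VPC and a response comes back. The URL here is a placeholder, not the location of the test data. A False result suggests that the SNAT entry, EIP, or vSwitch association should be rechecked.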
Step 3: Associate a StarRocks computing resource with the workspace
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace. On the Workspaces page, find the desired workspace and click the name of the workspace to go to the Workspace Details page.
In the left-side navigation pane of the Workspace Details page, click Computing Resource.
On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, select computing resource types based on your business requirements and configure parameters.
Serverless StarRocks is used to provide computing and storage resources in this tutorial. In the "Select a computing resource type" step of the Associate Computing Resource panel, click Serverless StarRocks and configure parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
Parameter: Description
Instance: Select a StarRocks instance that you want to associate with the current workspace. You can also click Create in the StarRocks Instance drop-down list and create a StarRocks instance on the EMR Serverless StarRocks page in the EMR console. Then, you can select the created StarRocks instance from the StarRocks Instance drop-down list. For more information about how to create a StarRocks instance, see Create an instance.
Note: If you turned on Isolate Development and Production Environments when you created the current workspace, you must separately select a StarRocks instance for the development and production environments.
Database Name: Select a database in the StarRocks instance. If no database is available, you must create a database in the StarRocks instance first.
Username and Password: The account and password that are specified when you create a StarRocks instance. The default account is admin.
Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on the name of the specified computing resource instance. In this example, set this parameter to doc_starrocks_storage_compute_tightly_01.
Connection Configuration: Select a resource group to connect to the StarRocks instance. You can test the network connectivity between the resource group and the StarRocks instance.
Note: If no resource group is available, you can create a resource group and associate the resource group with the current workspace. Then, go to the Workspace Details page to test the connectivity between the resource group and a computing resource that you want to associate with the workspace. For more information, see Create and use a serverless resource group.
Click OK.
For more information about how to associate a computing resource with a workspace, see Associate a computing resource.
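If the connectivity test in the console fails, a quick TCP check from a machine in the same VPC can help narrow down whether the problem is network reachability or credentials. The following is a rough sketch under stated assumptions, not the check that DataWorks itself performs. The function name port_is_open is hypothetical, and port 9030 is the default MySQL-protocol query port of a StarRocks frontend (FE) node. Verify the actual endpoint and port on the details page of your instance.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, port_is_open("your-fe-endpoint.example.com", 9030) uses a placeholder host name that you would replace with the internal endpoint of your instance. This check reflects only the network path from the machine on which the script runs, so the test result in the DataWorks console remains the authoritative one for the resource group.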
What to do next
You have prepared your environments and can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize the basic user information and website access logs of users to OSS, and how to create a table in a StarRocks node to query the synchronized data. For more information, see Synchronize data.