This tutorial uses a user profile analysis case to demonstrate end-to-end operations in DataWorks, such as data synchronization, data transformation, and quality monitoring, in the China (Shanghai) region. To complete this tutorial, you must prepare an E-MapReduce (EMR) Serverless Spark workspace and a DataWorks workspace.
Business background
To create better business strategies, you need basic profile data about your website's user groups, such as geographical and social attributes, derived from their website behavior. With this data, you can run profile analysis on a schedule and implement fine-grained traffic operations for your website.
Before you begin
To follow this tutorial, read the introduction to understand the user profile analysis experiment.
Notes
This tutorial provides the user information and website access test data that you need. All data is mock data and is intended only for hands-on practice with DataWorks.
This tutorial uses Data Studio (new version) for data transformation.
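For a concrete sense of what such test data can look like, the following Python sketch generates mock website-access records. The field names (uid, ip, time, url) are illustrative assumptions only; the actual schema is provided with the tutorial's test data.

```python
# Purely illustrative: generate mock website-access log records.
# The field names below are assumptions for illustration; the tutorial
# supplies its own test data and schema.
import json
import random
from datetime import datetime, timedelta

PAGES = ["/index.html", "/product/1.html", "/cart", "/checkout"]

def mock_record(uid: int) -> dict:
    """Build one fake access-log record for a given user ID."""
    ts = datetime(2025, 1, 1) + timedelta(seconds=random.randint(0, 86400))
    return {
        "uid": f"user_{uid:04d}",
        "ip": f"10.0.{random.randint(0, 255)}.{random.randint(1, 254)}",
        "time": ts.isoformat(),
        "url": random.choice(PAGES),
    }

if __name__ == "__main__":
    for i in range(5):
        print(json.dumps(mock_record(i)))
```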
Prepare the OSS environment
You need to create an Object Storage Service (OSS) bucket. User information and website access logs will be synchronized to this bucket for data modeling and analysis.
Log on to the OSS console.
In the left navigation pane, click Bucket List. On the Bucket List page, click Create Bucket.
In the Create Bucket dialog box, configure the parameters and click Create. The parameters are as follows:
| Parameter | Value |
| --- | --- |
| Bucket Name | Enter a name for the bucket. In this example, the name is dw-spark-demo. |
| Region | Select China (Shanghai). |
| HDFS Service | Enable the HDFS service as prompted in the console. |
On the Bucket List page, click the name of the bucket to open the File Management page.
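If you prefer to script the bucket setup rather than use the console, the following Python sketch uses the official oss2 SDK. The credential placeholders are assumptions, and the endpoint shown is the China (Shanghai) public endpoint; the HDFS service still needs to be enabled in the console as described above.

```python
# Requires: pip install oss2
import oss2

# Placeholders: supply your own AccessKey pair (assumption: long-term
# credentials; consider STS tokens in production).
auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")

# China (Shanghai) public endpoint and the bucket name used in this tutorial.
endpoint = "https://oss-cn-shanghai.aliyuncs.com"
bucket = oss2.Bucket(auth, endpoint, "dw-spark-demo")

# Create the bucket with a private ACL (the ACL choice is an assumption).
bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE)

# Quick sanity check: list the (currently empty) bucket.
for obj in oss2.ObjectIterator(bucket):
    print(obj.key)
```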
Prepare the EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark. Make sure that you have a Spark workspace. If you do not, create one and set the parameters as follows:
| Parameter | Value |
| --- | --- |
| Region | China (Shanghai) |
| Billing Method | Pay-as-you-go |
| Workspace Name | Enter a custom name. |
| Use DLF as the metadata service | Select the Data Lake Formation (DLF) data catalog that you want to bind. To completely isolate metadata between different EMR clusters, select different catalogs. Important: Both DLF and DLF-Legacy (displayed as DLF 1.0 in the console) are supported. After you select a version, create Paimon or Hive tables accordingly. |
| Workspace Base Path | Select an OSS bucket path to store the log files of jobs. |
EMR Serverless Spark provides two workspace editions:
- Professional Edition: includes all features of the Basic Edition, plus advanced features and performance improvements. It is ideal for large-scale extract, transform, and load (ETL) jobs.
- Basic Edition: includes all basic features and a high-performance compute engine.
Prepare the DataWorks environment
Before using DataWorks for development, ensure that the DataWorks service is activated. For more information, see Purchase.
1. Create a workspace
If you already have a workspace (new version) in the China (Shanghai) region, you can skip this step and use the existing workspace.
Log on to the DataWorks console. In the top navigation bar, set the region to China (Shanghai). In the navigation pane on the left, click Workspace to go to the workspace list page.
Click Create Workspace, select Use Data Studio (New Version), and enable Isolate Development and Production Environments.
Note: Starting February 18, 2025, the new Data Studio is enabled by default the first time an Alibaba Cloud account activates DataWorks and creates a workspace in the China (Shanghai) region.
For more information about how to create a workspace, see Create a workspace.
2. Create a serverless resource group
Purchase a Serverless resource group.
This tutorial requires a DataWorks Serverless resource group for data synchronization and scheduling. You must purchase a Serverless resource group and complete the initial setup first.
Log on to the DataWorks console. In the top navigation bar, set the region to China (Shanghai). In the navigation pane on the left, click Resource Group to go to the Resource Group List page.
Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai) and specify a Resource Group Name. Configure other parameters as prompted and complete the payment. For information about the billing of Serverless resource groups, see Billing of Serverless resource groups.
Note: If no VPC or vSwitch is available in the current region, click the console link in the parameter description to create them. For more information about VPCs and vSwitches, see What is a virtual private cloud (VPC)?.
Bind the resource group to the DataWorks workspace.
A newly purchased Serverless resource group must be bound to a workspace before it can be used.
Log on to the DataWorks - Resource Group List page and set the region to China (Shanghai) in the top navigation bar. Find the serverless resource group that you purchased. In the Actions column, click Associate Workspace and then click Associate next to the DataWorks workspace that you created.
Configure public network access for the resource group.
The test data for this tutorial is retrieved from the internet. By default, resource groups do not have public network access. You must configure an Internet NAT gateway for the VPC that is bound to the resource group and associate an elastic IP address (EIP) with it so that the resource group can retrieve data from the public network. A connectivity-check sketch follows the configuration steps below.
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, set the region to China (Shanghai).
Click Create Internet NAT Gateway and configure the parameters. The following table lists the key parameters for this tutorial. Keep the default values for any parameters not mentioned.
| Parameter | Value |
| --- | --- |
| Region | China (Shanghai) |
| Network And Zone | Select the VPC and vSwitch that are bound to the resource group. To find them, go to the DataWorks console, switch to the China (Shanghai) region, and click Resource Group in the navigation pane on the left. Find the resource group that you created and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section, view the associated VPC and vSwitch. For more information about VPCs and vSwitches, see What is a virtual private cloud (VPC)?. |
| Network Type | Internet NAT Gateway |
| EIP | Create EIP |
| Service-linked Role Creation | When you create an Internet NAT gateway for the first time, you must create a service-linked role. Click Create Service-linked Role. |
Click Buy Now, agree to the terms of service, and then click Activate Now to complete the purchase.
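Once the NAT gateway and EIP are in place, you can verify public network access from the resource group, for example by scheduling a short script on it. The following Python sketch is a minimal check; the target URL is a placeholder assumption, and you would replace it with the host that serves the test data.

```python
# A minimal connectivity check using only the Python standard library.
# Run it in a task that is scheduled on the Serverless resource group.
import urllib.request

# Placeholder URL (assumption): replace it with the host that serves
# the tutorial's test data.
url = "https://www.aliyun.com"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(f"Public network access OK, HTTP status: {resp.status}")
except Exception as exc:
    print(f"Public network access failed: {exc}")
```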
For more information about how to add and use Serverless resource groups, see Use a Serverless resource group.
3. Bind EMR Serverless Spark as a computing resource
Go to the DataWorks - Workspace List page. In the top navigation bar, set the region to China (Shanghai). Find the workspace that you created and click its name to open the Workspace Details page.
In the navigation pane on the left, click Computing Resource.
Click Associate Computing Resource, select a computing resource type, and then configure the parameters.
This tutorial uses EMR Serverless Spark as the computing and storage resource. Set Computing Resource Type to EMR Serverless Spark and configure the key parameters described in the following table. Keep the default values for other parameters.
| Parameter | Description |
| --- | --- |
| Spark Workspace | Select the EMR Serverless Spark workspace to bind from the drop-down list. You can also click Create in the drop-down list to go to the EMR Serverless Spark console, create a workspace, and then return to DataWorks and select the new Spark workspace. Note: If you chose to isolate the development and production environments when you created the DataWorks workspace, you must select a Spark workspace for each environment. For information about how to create a Spark workspace, see Create a Spark workspace. |
| Default Engine Version | The engine version that EMR Spark tasks created in Data Studio use by default. |
| Default Resource Queue | The resource queue that EMR Spark tasks created in Data Studio use by default. |
| Default Access Identity | The identity that is used to access the Spark workspace from the current DataWorks workspace. Development environment: only the Executor identity is supported. Production environment: Alibaba Cloud account, RAM user, and Task Owner are supported. |
| Computing Resource Instance Name | Identifies the computing resource. When a task runs, the instance name determines which computing resource the task uses. |
Click Confirm to complete the Serverless Spark computing resource configuration.
For more information about how to bind a computing resource, see Bind a computing resource.
What to do next
Now that you have prepared the environment, you can proceed to the next tutorial. You will learn how to synchronize user information and website access logs to OSS, and then use a Spark SQL node to create tables and query the synchronized data. For more information, see Synchronize data.