This tutorial walks you through setting up the four services required before you start the user profile analysis experiment in DataWorks using E-MapReduce (EMR) Serverless Spark. All steps take place in the China (Shanghai) region.
By the end of this tutorial, you will have:
An Object Storage Service (OSS) bucket to store user information and website access logs
An EMR Serverless Spark workspace as the compute and storage layer
A DataWorks workspace with dev/prod environment isolation enabled
A serverless resource group with public network access, associated with the DataWorks workspace
EMR Serverless Spark configured as a computing resource in DataWorks
Prerequisites
Before you begin, ensure that you have:
Read the user profile analysis introduction to understand the experiment's scope and goals
Activated the DataWorks service. See Purchase.
All data used in this tutorial is mock data provided for hands-on practice only.
Step 1: Create an OSS bucket
User information and website access logs are synchronized to an OSS bucket for data modeling and analysis.
Log on to the OSS console.
In the left navigation pane, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket dialog box, configure the following parameters, then click Create. Keep the default values for all parameters not listed.
Parameter | Value
Bucket Name | dw-spark-demo (or a custom name)
Region | China (Shanghai)
OSS-HDFS | Enable the HDFS service as prompted
Click the bucket name to open its Object Management page.
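If you choose a custom bucket name, it helps to check it before you open the Create Bucket dialog box. The following is a minimal sketch, not part of any Alibaba Cloud SDK: the helper names is_valid_bucket_name and public_endpoint are hypothetical. It assumes the standard OSS naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; first and last character a lowercase letter or digit) and the region ID cn-shanghai for China (Shanghai).

```python
import re

# Assumed OSS naming rules: 3-63 characters; lowercase letters, digits, and
# hyphens only; the first and last characters are a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name satisfies the naming rules assumed above."""
    return _BUCKET_NAME.fullmatch(name) is not None


def public_endpoint(region_id: str) -> str:
    """Build the public OSS endpoint for a region ID such as cn-shanghai."""
    return f"https://oss-{region_id}.aliyuncs.com"


# Example: check the name used in this tutorial.
print(is_valid_bucket_name("dw-spark-demo"))
print(public_endpoint("cn-shanghai"))
```

The check is purely local. It does not tell you whether the name is already taken, because bucket names must also be unique across OSS.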
Step 2: Create an EMR Serverless Spark workspace
If you do not already have a Spark workspace, create one using the parameters below. Keep the default values for all parameters not listed.
Two workspace editions are available. Professional Edition includes all the features of the Basic Edition, along with advanced features and performance improvements, making it suitable for large-scale extract, transform, and load (ETL) jobs. Basic Edition includes all basic features with a high-performance compute engine.
Parameter | Value
Region | China (Shanghai)
Payment Type | Pay-as-you-go
Workspace Name | Enter a custom name
DLF for Metadata Storage | Select a Data Lake Formation (DLF) data catalog. Both DLF and DLF-Legacy (displayed as DLF 1.0 on the interface) are supported. After selecting a version, create Paimon or Hive tables accordingly. If you need complete metadata isolation between EMR clusters, select separate catalogs.
Workspace Directory | Select an OSS bucket path to store job log files
Step 3: Set up the DataWorks environment
3.1 Create a DataWorks workspace
If you already have a DataWorks workspace (new version) in the China (Shanghai) region, skip to Step 3.2.
Log on to the DataWorks console. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Workspace.
Click Create Workspace. Select Use Data Studio (New Version) and enable Isolate Development and Production Environments.
Starting February 18, 2025, new Data Studio is the default for any Alibaba Cloud account that activates DataWorks and creates its first workspace in the China (Shanghai) region.
For full workspace creation options, see Create a workspace.
3.2 Create a serverless resource group
A serverless resource group is required for data synchronization and scheduling in this tutorial.
Purchase the resource group
Log on to the DataWorks Resource Group List page. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Resource Group.
Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai), specify a Resource Group Name, configure the remaining parameters as prompted, and complete the payment.
If no virtual private cloud (VPC) or vSwitch is available in the region, create them first. See What is a virtual private cloud (VPC)? For billing details, see Billing of serverless resource groups.
Associate the resource group with your workspace
A newly purchased resource group must be associated with a workspace before it can be used.
On the DataWorks Resource Group List page, find the resource group you purchased. In the Actions column, click Associate Workspace, then click Associate next to your DataWorks workspace.
Configure public network access
The test data for this tutorial is retrieved from the internet. By default, resource groups have no public network access. Configure an Internet NAT Gateway for the VPC bound to the resource group to enable outbound internet connectivity.
Log on to the VPC Internet NAT Gateway console. In the top menu bar, set the region to China (Shanghai).
Click Create Internet NAT Gateway and configure the following parameters. Keep the default values for all parameters not listed.
Parameter | Value
Region | China (Shanghai)
Network And Zone | Select the VPC and vSwitch bound to the resource group. To find these, go to the DataWorks console, switch to China (Shanghai), and navigate to Resource Group. Click Network Settings in the Actions column of your resource group, then check the VPC and vSwitch listed under Data Scheduling & Data Integration.
Network Type | Internet NAT Gateway
EIP | Create EIP
Service-linked Role | If this is your first NAT Gateway, click Create Service-linked Role
Click Buy Now, accept the terms of service, and click Activate Now.
After the NAT Gateway instance is created, return to the console and create source NAT (SNAT) entries for it. An SNAT entry is ready when its status changes to Available. At that point, the VPC and any resource groups bound to it have outbound internet access.
Find the new NAT Gateway instance and click Manage in the Actions column. Switch to the SNAT tab.
In the SNAT Entry List section, click Create SNAT Entry and configure the following parameters.
Parameter | Value
SNAT Entry | Select Specify VPC. This allows all resource groups in the VPC to access the internet through the configured EIP.
Select EIP | Select the EIP bound to the current NAT Gateway instance
Click OK.
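If you script the wait for the SNAT entry instead of refreshing the console, the logic is a simple poll until the status reads Available. The sketch below shows only that loop. The get_status callable is a placeholder that you would back with whatever lookup you use. It is not a real NAT Gateway API call, and any status value other than Available (such as Pending in the example) is an assumption.

```python
import time
from typing import Callable


def wait_until_available(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "Available" or the timeout passes.

    get_status is a hypothetical hook: supply a function that returns the
    current SNAT entry status as a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Available":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example with a stubbed status source ("Pending" is an assumed value).
statuses = iter(["Pending", "Pending", "Available"])
print(wait_until_available(lambda: next(statuses), interval_s=0))
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays.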
The resource group cannot access the internet until SNAT entries are configured.
For more information, see Use a serverless resource group.
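Once the SNAT entry is in place, you can confirm outbound connectivity from code that runs on the resource group. The following is a minimal sketch using only the Python standard library. The function name has_outbound_access and the target host in the usage comment are illustrative assumptions, not part of DataWorks.

```python
import socket
from typing import Callable


def has_outbound_access(
    host: str,
    port: int = 443,
    timeout_s: float = 5.0,
    connect: Callable = socket.create_connection,
) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    DNS failures, refused connections, and timeouts all raise OSError
    subclasses, so each of them is reported as "no access".
    """
    try:
        conn = connect((host, port), timeout=timeout_s)
    except OSError:
        return False
    conn.close()
    return True


# Usage (any public HTTPS host works; this one is only an example):
#   has_outbound_access("www.aliyun.com")
```

A result of False right after setup usually means the SNAT entry is not yet Available or the NAT Gateway is attached to a different VPC than the one bound to the resource group.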
3.3 Associate EMR Serverless Spark as a computing resource
Go to the DataWorks Workspace List page. In the top navigation bar, set the region to China (Shanghai). Find your workspace and click its name to open the Workspace Details page.
In the left navigation pane, click Computing Resources.
Click Associate Computing Resources. Set Computing Resource Type to EMR Serverless Spark and configure the following parameters. Keep the default values for all parameters not listed.
Parameter | Description
EMR Serverless Spark Workspace | Select the Spark workspace you created in Step 2. If you enabled dev/prod environment isolation, select a Spark workspace for both the development and production environments. To create a new Spark workspace inline, click New in the dropdown.
Default Engine Version | The engine version used by default when you create an EMR Spark node in Data Studio
Default Resource Queue | The resource queue used by default when you create an EMR Spark node in Data Studio
Default Access Identity | The identity used to access the Spark workspace from DataWorks. In the development environment, only Executor is supported. In the production environment, Alibaba Cloud account, RAM user, and Node Owner are supported.
Computing Resource Instance Name | A name that identifies this computing resource. When a node runs, DataWorks uses this name to route the task to the correct resource.
Click Confirm.
For full configuration options, see Associate a computing resource.
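The Default Access Identity rule depends on the environment, which is easy to get wrong when you record your settings for later reuse. The sketch below encodes the rule from the table as a lookup. It is an illustration only: the environment keys and the function name check_access_identity are hypothetical and do not correspond to a DataWorks API.

```python
# Supported identities per environment, as listed for Default Access Identity.
ALLOWED_IDENTITIES = {
    "development": {"Executor"},
    "production": {"Alibaba Cloud account", "RAM user", "Node Owner"},
}


def check_access_identity(environment: str, identity: str) -> bool:
    """Return True if the identity is supported in the given environment."""
    allowed = ALLOWED_IDENTITIES.get(environment)
    if allowed is None:
        raise ValueError(f"unknown environment: {environment!r}")
    return identity in allowed


# Example: Executor is valid in development but not in production.
print(check_access_identity("development", "Executor"))
print(check_access_identity("production", "Executor"))
```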
What's next
With the environment ready, proceed to Synchronize data to learn how to sync user information and website access logs to OSS, and use a Spark SQL node to create tables and query the synchronized data.