Build an EMR Serverless Spark Dev Environment - DataWorks

What you'll set up

By the end of this page, you will have:

Activated DataWorks and created a DataWorks workspace in Standard Mode.
Created an EMR Serverless Spark workspace (Professional Edition) as the compute resource.
Created an Object Storage Service (OSS) bucket to receive and store synchronized data.
Created a serverless resource group and configured internet access for it.
Registered the EMR Serverless Spark cluster in DataWorks.
Added three data sources: a platform-provided MySQL source, a platform-provided HttpFile source, and your own private OSS destination.

Prerequisites

Before you begin, make sure you have:

An Alibaba Cloud account with billing enabled
Permissions to create resources in the China (Shanghai) region
Access to the EMR, OSS, VPC, and DataWorks consoles

DataWorks product preparation

Activate DataWorks on the DataWorks purchase page if you haven't already. For details, see Purchase.

Prepare an EMR Serverless Spark workspace

This tutorial uses EMR Serverless Spark as the compute resource. If you don't have a Spark workspace, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:

Parameter	Value
Region	China (Shanghai)
Payment type	Pay-as-you-go
Workspace name	Enter a custom name
DLF for metadata storage	Select a Data Lake Formation (DLF) data catalog. To completely isolate metadata between different EMR clusters, select different catalogs.
Workspace directory	Select an OSS bucket path to store job log files
Workspace type	Professional Edition

Select Professional Edition for this tutorial. Basic Edition provides powerful compute engines. Professional Edition includes all Basic Edition features plus advanced performance improvements suited to large-scale extract, transform, and load (ETL) jobs.

Prepare a private OSS bucket

Create an OSS bucket to serve as the destination for synchronized data. In the next tutorial, user information from the MySQL source and log data from the HttpFile source will be written to this bucket for data modeling and analysis.

Log on to the OSS console.
In the left navigation pane, click Buckets. On the Buckets page, click Create Bucket.

In the Create Bucket dialog box, configure the following parameters and click Create. For details on other parameters, see Create a bucket in the console.

Parameter	Value
Bucket name	Enter a custom name
Region	China (Shanghai)
OSS-HDFS	Enable the HDFS service as prompted

On the Buckets page, click the bucket name to go to its Object Management page.

Prepare the DataWorks environment

With DataWorks, your EMR Serverless Spark workspace, and your OSS bucket ready, complete the following steps to configure the DataWorks workspace for data synchronization and processing.

Create a DataWorks workspace

Log on to the DataWorks console.
In the left navigation pane, click Workspace Management to open the workspace list.
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode and enable Isolate Development and Production Environments.

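Create the workspace in China (Shanghai) to avoid network connectivity issues when connecting to data sources in that region. This tutorial assumes that Isolate Development and Production Environments is enabled. If you select No for a simpler setup, the workspace has a single environment, and the separate development and production settings in later steps do not apply.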

Create a resource group

A resource group provides the compute resources for data synchronization and scheduling. Create a serverless resource group and configure internet access so it can reach external data sources.

Step 1: Purchase a serverless resource group

Log on to the DataWorks console. Switch to the China (Shanghai) region. In the left navigation pane, click Resource Group to open the resource group list.
Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai) and specify a Resource Group Name. Complete the remaining configuration and payment as prompted. For billing details, see Serverless resource groups.

Serverless resource groups do not support cross-region operations. Use the same region as your data sources.
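
Because a region mismatch only shows up later as a failed connectivity test, it can help to note the region of each resource as you create it and confirm that they agree. The following is a minimal Python sketch of that check. The resource labels in the dictionary and the helper name `find_region_mismatches` are hypothetical, and the entries are filled in by hand; nothing here calls an Alibaba Cloud API. `cn-shanghai` is the region ID for China (Shanghai).

```python
from collections import defaultdict


def find_region_mismatches(resources: dict, expected: str) -> dict:
    """Group resource names by region for every region other than `expected`."""
    by_region = defaultdict(list)
    for name, region in resources.items():
        if region != expected:
            by_region[region].append(name)
    return dict(by_region)


# Hypothetical inventory, filled in by hand from each console.
resources = {
    "emr_serverless_spark_workspace": "cn-shanghai",
    "oss_bucket": "cn-shanghai",
    "dataworks_workspace": "cn-shanghai",
    "serverless_resource_group": "cn-shanghai",
}

mismatches = find_region_mismatches(resources, expected="cn-shanghai")
if mismatches:
    for region, names in mismatches.items():
        print(f"Unexpected region {region}: {', '.join(names)}")
else:
    print("All resources are in cn-shanghai")
```

An empty result means every listed resource sits in the expected region.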

Step 2: Associate the resource group with your workspace

In the resource group list, find the resource group you purchased.
In the Actions column, click Associate Workspace and select the DataWorks workspace you created.

Step 3: Configure internet access

Log on to the VPC - Internet NAT gateway console. In the top menu bar, switch to the China (Shanghai) region.

Click Create NAT Gateway and configure the following parameters. Keep the default values for all other parameters.

Parameter	Value
Region	China (Shanghai)
Network and zone	Select the VPC and vSwitch of the resource group. To find them, go to the DataWorks console, click Resource Group List, find your resource group, and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section, note the Attached VPC and vSwitch. For details, see What is a VPC?
Network type	Internet NAT gateway
EIP	Purchase New EIP
Service-linked role	If this is your first NAT gateway, click Create Service-linked Role

Click Buy Now. Accept the Terms of Service and click Confirm Order to complete the purchase.

Register an EMR Serverless Spark cluster

Register the Spark workspace in DataWorks so it can be used as the compute engine for data processing tasks.

Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.
In the left navigation pane, click Clusters. On the Cluster Management page, click Register Cluster. In the dialog box, select E-MapReduce.

Configure the following parameters:

This tutorial uses the configuration shown in the following table. If your setup differs, see Data Studio (legacy version): Associate an EMR computing resource.

Parameter	Value
Display name of cluster	Enter a custom name
Clusters	Select the current Alibaba Cloud account
Cluster type	EMR Serverless Spark
Workspace created in EMR Serverless Spark	Select the Spark workspace you created in Prepare an EMR Serverless Spark workspace
Default engine version	Used by default when creating EMR Spark nodes in DataStudio. To use different versions per node, set them in the node's Advanced Settings.
Default resource queue	Used by default when creating EMR Spark nodes in DataStudio. To use different queues per node, set them in the node's Advanced Settings.
Default SQL compute	Used by default when creating EMR Spark SQL nodes in DataStudio. To use different compute settings per node, set them in the node's Advanced Settings.
Default access identity	Development environment: Executor. Production environment: Alibaba Cloud Account, RAM User, or Node Owner.
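
The three "Default" rows follow the same pattern: the cluster-level value applies unless a node sets its own value in Advanced Settings. The sketch below models that fallback in Python. The setting keys, the placeholder values in angle brackets, and the helper name `effective_settings` are all hypothetical; real engine versions and queue names come from your Spark workspace.

```python
from typing import Optional


def effective_settings(cluster_defaults: dict, node_overrides: Optional[dict] = None) -> dict:
    """Node-level values win; anything not overridden falls back to the cluster default."""
    merged = dict(cluster_defaults)
    for key, value in (node_overrides or {}).items():
        if key not in cluster_defaults:
            raise KeyError(f"unknown setting: {key}")
        merged[key] = value
    return merged


# Placeholder values standing in for what you select when registering the cluster.
cluster_defaults = {
    "engine_version": "<default-engine-version>",
    "resource_queue": "<default-queue>",
    "sql_compute": "<default-sql-compute>",
}

node_a = effective_settings(cluster_defaults)
node_b = effective_settings(cluster_defaults, {"resource_queue": "<etl-queue>"})
print(node_a["resource_queue"])
print(node_b["resource_queue"])
```

Here `node_a` inherits every default, while `node_b` changes only its resource queue and keeps the default engine version and SQL compute.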

Create data sources

This tutorial uses three data sources:

MySQL — provided by the platform; serves as the source for batch synchronization, supplying user information
HttpFile — provided by the platform; serves as the source for batch synchronization, supplying log data
Private OSS — your own OSS bucket; serves as the destination that receives user information and log data

The MySQL and HttpFile data sources and their test data are provided by the DataWorks documentation team. All data is mock data and is read-only within the Data Integration module. The private OSS data source is the bucket you created in Prepare a private OSS bucket.
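
The relationship between the three data sources can be sketched as a small Python model: two read-only sources feeding one destination. The data source names are the ones entered in the tables later on this page. The `DataSource` class, its field names, and the `sync_pairs` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str     # "MySQL", "HttpFile", or "OSS"
    role: str     # "source" or "destination"
    payload: str  # what the tutorial moves through it


SOURCES = [
    DataSource("user_behavior_analysis_mysql", "MySQL", "source", "user information"),
    DataSource("user_behavior_analysis_httpfile", "HttpFile", "source", "log data"),
    DataSource("test_g", "OSS", "destination", "user information and log data"),
]


def sync_pairs(sources):
    """Pair every source with the single destination."""
    destinations = [s for s in sources if s.role == "destination"]
    if len(destinations) != 1:
        raise ValueError("expected exactly one destination")
    dest = destinations[0]
    return [(s.name, dest.name) for s in sources if s.role == "source"]


for src, dst in sync_pairs(SOURCES):
    print(f"{src} -> {dst}")
```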

Add the MySQL data source

Log on to the DataWorks console. In the top navigation bar, select the target region. In the left navigation pane, choose More > Management Center. Select the target workspace and click Go to Management Center.
In the left navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.
In the Add Data Source dialog box, select MySQL.

On the Add MySQL Data Source page, configure the following parameters:

Parameter	Value
Data source name	`user_behavior_analysis_mysql`
Data source description	Provided for DataWorks use cases; read-only source for batch synchronization
Configuration mode	Connection String Mode
Host IP address	`rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com`
Port number	`3306`
Database name	`workshop`
Username	`workshop`
Password	`workshop#2017`
Authentication method	No Authentication

Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. A Connected status confirms successful connectivity.
Click Complete Creation.
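
In Connection String Mode, the database is identified by the host, port, and database name from the table. As a rough illustration of how those three values fit together, the sketch below assembles them into the standard MySQL JDBC URL format, `jdbc:mysql://host:port/database`. Whether DataWorks builds exactly this string internally is an assumption, and the function name `mysql_jdbc_url` is hypothetical.

```python
def mysql_jdbc_url(host: str, port: int, database: str) -> str:
    """Assemble a MySQL JDBC URL from host, port, and database name."""
    if not host or not database:
        raise ValueError("host and database are required")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:mysql://{host}:{port}/{database}"


# Values copied from the parameter table for the platform-provided MySQL source.
url = mysql_jdbc_url("rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com", 3306, "workshop")
print(url)
```

The username and password stay out of the URL on purpose; they are entered in their own fields.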

Add the HttpFile data source

The HttpFile data source is an OSS bucket provided by the platform that supplies log data for the tutorial.

On the Data Sources page, click Add Data Source and select HttpFile.

On the Add HttpFile Data Source page, configure the following parameters:

Parameter	Value
Data source name	`user_behavior_analysis_httpfile`
Data source description	Provided for DataWorks use cases; read-only source for batch synchronization
URL	`https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com` (for both development and production environments)

Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.

Important

At least one resource group must show Connected status. Without a connectable resource group, you cannot configure data synchronization tasks using the codeless UI.

Click Complete Creation.
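
The HttpFile URL follows the OSS bucket-style address form, in which the bucket name is the first label of the hostname and the rest is the OSS endpoint. The Python sketch below splits the URL accordingly, which is a quick way to confirm that the platform bucket sits behind the China (Shanghai) endpoint. The helper name `split_oss_url` is hypothetical, and the check for an `oss-` prefix is an assumption based on the endpoints shown on this page.

```python
from urllib.parse import urlparse


def split_oss_url(url: str) -> tuple:
    """Split a bucket-style OSS URL into (bucket, endpoint host)."""
    host = urlparse(url).hostname
    if host is None or "." not in host:
        raise ValueError(f"not a bucket-style OSS URL: {url!r}")
    bucket, endpoint = host.split(".", 1)
    if not endpoint.startswith("oss-"):
        raise ValueError(f"unexpected endpoint: {endpoint!r}")
    return bucket, endpoint


bucket, endpoint = split_oss_url("https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com")
print(bucket)
print(endpoint)
```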

Add the private OSS data source

Add your OSS bucket as a private data source to serve as the destination for synchronized data.

On the Management Center page, choose Data Source > Data Source List and click Add Data Source.
In the Add Data Source dialog box, search for and select OSS.

In the Add OSS Data Source dialog box, configure the following parameters:

Parameter	Value
Data source name	`test_g`
Description	A brief description of the data source
Endpoint	`http://oss-cn-shanghai-internal.aliyuncs.com`
Bucket	The name of the OSS bucket you created (for example, `dw-emr-demo`)
Access mode	RAM role authorization mode or Access Key mode (select one)

The access mode determines how DataWorks authenticates to the bucket:

RAM role authorization mode: DataWorks assumes a role to access the data source using Security Token Service (STS), which provides higher security. For setup details, see Configure a data source in RAM role authorization mode.
Access Key mode: Enter the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. Go to the Security Information Management page to copy your AccessKey ID.

Important

The AccessKey Secret is only displayed when you first create it. Store it securely. If your AccessKey is lost or compromised, delete it and create a new one.
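
The Endpoint value in the parameter table is the internal endpoint for China (Shanghai), whereas the HttpFile URL earlier on this page uses the public endpoint for the same region. The Python sketch below shows how the two hostnames differ. The helper name `oss_endpoint` is hypothetical, and the naming pattern is inferred from those two values rather than taken from an official specification.

```python
def oss_endpoint(region_id: str, internal: bool = True, scheme: str = "http") -> str:
    """Build an OSS endpoint URL; internal endpoints carry an '-internal' suffix."""
    if not region_id:
        raise ValueError("region_id is required")
    suffix = "-internal" if internal else ""
    return f"{scheme}://oss-{region_id}{suffix}.aliyuncs.com"


print(oss_endpoint("cn-shanghai"))
print(oss_endpoint("cn-shanghai", internal=False, scheme="https"))
```

The internal endpoint is meant for traffic that originates inside Alibaba Cloud in the same region, which is why the bucket and the resource group both need to be in China (Shanghai).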

Click Connected state in the Test Connectivity column for your resource group and wait for the status to show Connectable.

Important

At least one resource group must be Connectable. Without this, you cannot use the codeless UI to create sync tasks for this data source.

Click Complete.

What's next

With the environment set up, proceed to the next tutorial to synchronize user information and website access logs to OSS, then use Spark SQL to create an external table that accesses data stored in your private OSS bucket. For details, see Synchronize data.