Before you use EMR Serverless Spark with DataWorks to synchronize, process, and monitor data quality, set up the required resources. This tutorial uses a user profile example in the China (Shanghai) region. Complete the following tasks to prepare the environment:
Prerequisites
Before you begin, make sure that you have activated DataWorks. If not yet activated, go to the DataWorks activation page. For details, see Purchase guide.
Create an EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark as the compute engine. If you already have a Spark workspace, skip this step. Otherwise, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:
| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Billing Method | Pay-as-you-go |
| Workspace Name | Enter a custom name |
| DLF as a Metadata Service | Select a Data Lake Formation (DLF) data catalog. To fully isolate metadata between EMR clusters, select different catalogs. |
| Workspace Base Path | Select an OSS bucket path to store job log files |
| Workspace Type | Professional Edition |
Professional Edition includes all Basic Edition features plus advanced capabilities and performance improvements, and is suited for large-scale extract, transform, and load (ETL) jobs. Basic Edition provides the core features and standard compute engines.
Create an OSS bucket
User information and website access logs are synchronized to an OSS bucket for data modeling and analysis. Create a dedicated bucket for this purpose.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket dialog box, configure the following parameters, then click Create. For more information about the parameters, see Create a bucket in the console.
| Parameter | Value |
|---|---|
| Bucket Name | Enter a custom name |
| Region | China (Shanghai) |
| HDFS Service | Enable the HDFS service as prompted on the UI |

On the Buckets page, click the bucket name to go to the Files page.
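If you prefer to script this step, the bucket can also be created with the official Alibaba Cloud OSS Python SDK (oss2). The sketch below is a hedged example: the bucket name and credentials are placeholders, and the HDFS service must still be enabled in the console as described above.

```python
# Sketch: create the tutorial bucket with the OSS Python SDK (oss2)
# instead of the console. Bucket name and credentials are placeholders.

def oss_endpoint(region_id: str) -> str:
    """Build the public OSS endpoint for a region ID such as cn-shanghai."""
    return f"https://oss-{region_id}.aliyuncs.com"

def create_bucket(access_key_id: str, access_key_secret: str,
                  bucket_name: str, region_id: str = "cn-shanghai") -> None:
    import oss2  # pip install oss2
    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, oss_endpoint(region_id), bucket_name)
    bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE)  # private ACL by default

if __name__ == "__main__":
    # create_bucket("<AccessKey ID>", "<AccessKey secret>", "my-dataworks-demo")
    print(oss_endpoint("cn-shanghai"))
```

Note that the SDK path does not enable the HDFS service; do that in the console after the bucket exists.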
Set up the DataWorks environment
After the EMR Serverless Spark workspace and OSS bucket are ready, create a DataWorks workspace, register the Spark cluster, and configure data sources.
In DataWorks, an EMR Serverless Spark workspace is registered as a cluster.
Create a DataWorks workspace
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace Management to open the workspace list page.
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode by enabling the Isolate Development And Production Environments parameter.
The data resources in this tutorial are in the China (Shanghai) region. Create the workspace in China (Shanghai) to avoid network connectivity issues when adding data sources from other regions. For a simpler setup, you can instead set Isolate Development And Production Environments to No, which creates a workspace in basic mode.
Create a resource group
A resource group provides the compute resources for data synchronization and scheduling. Make sure the network connection between the resource group and the EMR Serverless Spark workspace is stable.
Purchase a serverless resource group
Log on to the DataWorks console. Switch to the target region. In the left-side navigation pane, click Resource Group to open the resource group list page.
Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai) and specify a Resource Group Name. Configure the remaining parameters and complete the payment as prompted. For billing details, see Billing of serverless resource groups.
This tutorial uses a serverless resource group in the China (Shanghai) region. Serverless resource groups do not support cross-region operations.
Bind and configure the resource group
Log on to the DataWorks console. Switch to the target region. In the left-side navigation pane, click Resource Group to open the resource group list page.
Find the serverless resource group. In the Actions column, click Bind Workspace. Bind the resource group to the DataWorks workspace created earlier.
Configure internet access for the resource group:
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region.
Click Create NAT Gateway and configure the following parameters:

| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Network And Zone | Select the VPC and vSwitch to which the resource group is attached. To find these values, go to the DataWorks console, switch to the region, and click Resource Group List in the left-side navigation pane. Find the resource group, click Network Settings in the Actions column, and check Attached VPC and VSwitch in the Data Scheduling & Data Integration section. For more information, see What is a VPC? |
| Network Type | Internet NAT gateway |
| Elastic IP Address | Purchase New EIP |
| Service-linked Role Creation | If you are creating a NAT gateway for the first time, click Create Service-linked Role |

Note: Keep the default values for parameters not listed in this table.
Click Buy Now. Select the Terms of Service and click Confirm Order to complete the purchase.
Register the EMR Serverless Spark cluster
Data storage and processing for the user persona analysis run on the EMR Serverless Spark cluster. Register the cluster in DataWorks before use.
Go to the SettingCenter page. To do so, log on to the DataWorks console, select the desired region in the top navigation bar, and choose More > Management Center in the left-side navigation pane. Then select the workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box, select E-MapReduce.
Configure the EMR Serverless Spark cluster with the following parameters:
| Parameter | Description |
|---|---|
| Display Name | Enter a custom name |
| Alibaba Cloud Account for the Cluster | Select the current Alibaba Cloud account |
| Cluster Type | EMR Serverless Spark |
| E-MapReduce Workspace | Select the workspace prepared in the Create an EMR Serverless Spark workspace section |
| Default Engine Version | The default engine version for EMR Spark nodes in DataStudio. To set a different version for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default Resource Queue | The default resource queue for EMR Spark nodes in DataStudio. To set a different queue for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default SQL Compute | The default SQL Compute for EMR Spark SQL nodes in DataStudio. To set a different SQL Compute for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default Access Identity | Development environment: Executor. Production environment: Alibaba Cloud Account, RAM User, or Node Owner. |
This tutorial uses the configuration described above. For different scenarios, see DataStudio (old version): Bind an EMR compute engine.
Create data sources
This tutorial uses a MySQL database for user information and an OSS bucket for user log data. Both are provided by the platform. Create data sources in DataWorks to enable data synchronization.
The platform provides the test data and data sources for this tutorial. Add them to your workspace to access the test data.
All data is mock data for hands-on practice in DataWorks only. It is read-only within the Data Integration module.
The OSS bucket created in the Create an OSS bucket step receives user information from the MySQL data source and log data from the HttpFile data source.
Create a MySQL data source
The MySQL database is provided by the platform. It serves as the source for a data integration task and provides user information.
In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner of the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, select MySQL.
On the Add MySQL Data Source page, configure the following parameters. Use the sample values for both the development and production environments.
| Parameter | Value |
|---|---|
| Data Source Name | user_behavior_analysis_mysql |
| Data Source Description | Data source provided for DataWorks use cases. Used as the source of a batch synchronization task to access test data. Read-only in data synchronization scenarios. |
| Configuration Mode | Connection String Mode |
| Connection Address | Host IP Address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com, Port Number: 3306 |
| Database Name | workshop |
| Username | workshop |
| Password | workshop#2017 |
| Authentication Method | No Authentication |

Find the resource group. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Verify that Connected appears.
Click Complete Creation.
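Before adding the source in DataWorks, you can optionally confirm that the sample database is reachable from your own network. The sketch below is a hedged example using PyMySQL with the sample values from the table above; the connectivity check requires network access to the host, so it is left commented out.

```python
# Sketch: verify the tutorial MySQL source with PyMySQL.
# Connection values are the sample values from the data source table.

TUTORIAL_MYSQL = {
    "host": "rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
    "port": 3306,
    "database": "workshop",
    "user": "workshop",
    "password": "workshop#2017",
}

def jdbc_url(cfg: dict) -> str:
    """Connection string as it would look in connection string mode."""
    return f"jdbc:mysql://{cfg['host']}:{cfg['port']}/{cfg['database']}"

def check_connection(cfg: dict) -> bool:
    import pymysql  # pip install pymysql
    conn = pymysql.connect(host=cfg["host"], port=cfg["port"],
                           user=cfg["user"], password=cfg["password"],
                           database=cfg["database"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone()[0] == 1
    finally:
        conn.close()

if __name__ == "__main__":
    print(jdbc_url(TUTORIAL_MYSQL))
    # print(check_connection(TUTORIAL_MYSQL))  # requires network access to the host
```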
Create an HttpFile data source
The HttpFile data source is an OSS bucket provided by the platform. It serves as the source for a data integration task and provides log data.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. Select the workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. Use the sample values for both the development and production environments.
| Parameter | Value |
|---|---|
| Data Source Name | user_behavior_analysis_httpfile |
| Data Source Description | Data source provided for DataWorks use cases. Used as the source of a batch synchronization task to access test data. Read-only in data synchronization scenarios. |
| URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com |

Find the resource group. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Verify that Connected appears.
At least one resource group must be connectable. Otherwise, the codeless UI for configuring data synchronization tasks is not available.
Click Complete Creation.
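Because the HttpFile source is plain HTTPS, you can preview its contents with the standard library alone. The sketch below is a hedged example: the object path `user_log.txt` is a placeholder, since the actual file names are specified later in the synchronization task configuration.

```python
# Sketch: build and preview URLs served by the HttpFile data source.
# The object path used below is a placeholder, not a confirmed file name.
from urllib.parse import urljoin
from urllib.request import urlopen

HTTPFILE_BASE = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"

def file_url(path: str) -> str:
    """Absolute URL for an object behind the HttpFile data source."""
    return urljoin(HTTPFILE_BASE, path)

def preview(path: str, num_bytes: int = 256) -> bytes:
    """Fetch the first bytes of a file; requires internet access."""
    with urlopen(file_url(path)) as resp:
        return resp.read(num_bytes)

if __name__ == "__main__":
    print(file_url("user_log.txt"))   # placeholder object name
    # print(preview("user_log.txt"))  # uncomment when online
```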
Add a private OSS data source
Create a private OSS data source from the OSS bucket prepared earlier. This data source serves as the destination for data integration: it stores the user information imported from the MySQL data source and the log data imported from the HttpFile data source, both of which are provided by the platform.
On the Management Center page, choose Data Source > Data Source List, then click Add Data Source.
In the Add Data Source dialog box, search for and select OSS.
In the Add OSS Data Source dialog box, configure the parameters. Select one of the following authentication modes:

| Mode | Description |
|---|---|
| RAM Role Authorization Mode | DataWorks assumes roles to access the data source through Security Token Service (STS). This provides higher security. For more information, see Configure a data source in RAM role authorization mode. |
| AccessKey Mode | Important: The AccessKey secret is displayed only at creation time and cannot be retrieved later. Keep it confidential. If the AccessKey is leaked or lost, delete it and create a new AccessKey. |
If you use AccessKey Mode, configure the following parameters:

| Parameter | Value |
|---|---|
| AccessKey ID | The AccessKey ID of the current account. Copy it from the Security Information Management page. |
| AccessKey Secret | The AccessKey secret of the current account. |

Click Test Connectivity in the Connection Status column for the resource group. Wait until the status shows Connectable.
At least one resource group must be in the Connectable state. Otherwise, the codeless UI for creating sync tasks is not available.
Click Complete.
Next steps
The environment is ready. Proceed to the data synchronization tutorial to sync basic user information and website access logs to OSS. Then create a foreign table with Spark SQL to access the data in the private OSS bucket. For details, see Synchronize data.
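As a preview of that next step, the sketch below generates the kind of Spark SQL statement used to map a foreign (external) table onto the private OSS bucket. This is a hedged illustration: the table name, columns, and `oss://` path are all placeholders, and the real definitions come from the Synchronize data tutorial.

```python
# Sketch: build a Spark SQL DDL statement for an external table over OSS.
# Table name, columns, and the oss:// location below are placeholders.

def external_table_ddl(table: str, columns: dict, location: str) -> str:
    """Render a CREATE TABLE ... USING CSV LOCATION statement."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
            f"USING CSV LOCATION '{location}'")

ddl = external_table_ddl(
    "ods_user_info",                           # placeholder table name
    {"uid": "STRING", "gender": "STRING"},     # placeholder columns
    "oss://my-dataworks-demo/ods_user_info/",  # placeholder bucket path
)

if __name__ == "__main__":
    print(ddl)  # in the Spark workspace you would run spark.sql(ddl)
```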