This tutorial uses a user persona example to demonstrate how to use DataWorks to synchronize, process, and monitor the quality of data in the China (Shanghai) region. To complete this tutorial, you must prepare the required EMR Serverless Spark and DataWorks workspaces and complete the environment configuration.
DataWorks product preparation
Ensure that you have activated DataWorks. If you have not activated DataWorks, you can activate it on the DataWorks page. For more information, see Purchase guide.
Prepare an EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark as the computing resource. Ensure that you have a Spark workspace. If you do not have a Spark workspace, go to the E-MapReduce console, select Spark, and create a workspace.
Region: China (Shanghai).
Billing Method: Pay-as-you-go.
Workspace Name: Enter a custom name.
DLF as a Metadata Service: Select a DLF data catalog. To completely isolate metadata between different EMR clusters, select different catalogs.
Workspace Base Path: Select an OSS bucket path to store job log files.
Workspace Type: Select Professional Edition for this tutorial.
Note
Professional Edition: This workspace includes all features of the Basic Edition, in addition to advanced features and performance improvements. It is suitable for large-scale extract, transform, and load (ETL) jobs.
Basic Edition: This workspace includes all basic features and provides powerful compute engines.
Prepare a private OSS environment
For this tutorial, you need to create an OSS bucket. User information and website access logs will be synchronized to this bucket for data modeling and analysis.
Log on to the OSS console.
In the navigation pane on the left, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket dialog box, configure the parameters and click Create.
Bucket Name: Enter a custom name.
Region: Select China (Shanghai).
HDFS Service: Enable the HDFS service as prompted on the UI.
For more information about the parameters, see Create a bucket in the console.
On the Buckets page, click the name of the Bucket to go to the Files page of the bucket.
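If you want to confirm the new bucket from code before you move on, the following minimal Python sketch uses the oss2 SDK to check that the bucket exists and is writable. This is an optional check, not a required tutorial step: the bucket name dw-emr-demo matches the example used later in this tutorial, and the credential placeholders must be replaced with your own AccessKey pair.

```python
# Minimal sketch: verify that the tutorial bucket exists and is writable.
# Assumptions: bucket name "dw-emr-demo" and the China (Shanghai) public
# endpoint; replace the credential placeholders with your own AccessKey pair.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "dw-emr-demo")

# get_bucket_info() raises an oss2 exception if the bucket does not exist.
info = bucket.get_bucket_info()
print("bucket:", info.name, "location:", info.location)

# Upload and delete a small marker object to confirm write access.
bucket.put_object("connectivity_check.txt", b"ok")
bucket.delete_object("connectivity_check.txt")
```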
Prepare the DataWorks environment
After you prepare DataWorks, an EMR Serverless Spark workspace, and an OSS bucket, you must create a DataWorks workspace, register a Spark cluster, and create data sources. These steps prepare the environment for data synchronization and processing.
Create a DataWorks workspace
Log on to the DataWorks console.
In the navigation pane on the left, click Workspace Management to go to the workspace list page.
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode by setting Isolate Development And Production Environments to Yes.
The data resources in this tutorial are in the China (Shanghai) region. We recommend that you create the workspace in China (Shanghai) to avoid network connectivity issues when you add data sources from other regions. For a simpler setup, you can instead set Isolate Development And Production Environments to No.
Create a resource group
Before you use DataWorks, you must create a resource group to provide resources for data synchronization and scheduling. Ensure that the network connection between the resource group and the Serverless Spark workspace is stable.
Purchase a serverless resource group.
Log on to the DataWorks console. Switch to the target region. In the navigation pane on the left, click Resource Group to go to the resource group list page.
Click Create Resource Group. On the resource group purchase page, set Region And Zone to China (Shanghai) and specify a Resource Group Name. Configure the other parameters and complete the payment as prompted. For more information about the billing of serverless resource groups, see Billing of serverless resource groups.
Note
This tutorial uses a serverless resource group in the China (Shanghai) region as an example. Serverless resource groups do not support cross-region operations.
Configure the serverless resource group.
Log on to the DataWorks console. Switch to the target region. In the navigation pane on the left, click Resource Group to go to the resource group list page.
Find the serverless resource group that you purchased. In the Actions column, click Bind Workspace. Bind the resource group to the DataWorks workspace that you created.
Configure Internet access for the resource group.
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region.
Click Create NAT Gateway. Configure the parameters.
Region: China (Shanghai).
Network And Zone: Select the VPC and vSwitch to which the resource group is attached. To view them, go to the DataWorks console, switch to the China (Shanghai) region, and click Resource Group List in the navigation pane on the left. Find the resource group that you created and click Network Settings in the Actions column. The attached VPC and vSwitch appear in the Data Scheduling & Data Integration section. For more information about VPCs and vSwitches, see What is a VPC?
Network Type: Internet NAT Gateway.
Elastic IP Address: Purchase New EIP.
Service-linked Role Creation: If you are creating a NAT gateway for the first time, you must create a service-linked role. Click Create Service-linked Role.
Note
Keep the default values for the parameters that are not mentioned above.
Click Buy Now. Select the Terms of Service and click Confirm Order to complete the purchase.
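After the NAT gateway is created, tasks that run on the serverless resource group should have internet egress. As a quick sanity check, you can run a small probe from a task on the resource group; the sketch below is one hedged way to do this with only the Python standard library. The target URL is arbitrary and serves only as a reachability probe.

```python
# Quick egress check to run from a task on the resource group after the
# NAT gateway is in place. Any public HTTPS endpoint works as the probe target.
import urllib.request

try:
    with urllib.request.urlopen("https://www.aliyun.com", timeout=10) as resp:
        print("Internet egress OK, HTTP status:", resp.status)
except Exception as exc:
    print("No internet egress yet:", exc)
```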
Register an EMR Serverless Spark cluster
Data storage and data processing for the user persona analysis are performed in an EMR Serverless Spark cluster. You must register the Spark cluster before you can use it.
Go to the SettingCenter page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box that opens, select E-MapReduce to configure the EMR Serverless Spark cluster.
Register the E-MapReduce cluster.
Display Name: Enter a custom name.
Alibaba Cloud Account for the Cluster: Select the current Alibaba Cloud account.
Cluster Type: EMR Serverless Spark.
E-MapReduce Workspace: Select the workspace that you prepared in the Prepare an EMR Serverless Spark workspace section.
Default Engine Version: This engine version is used by default when you create an EMR Spark node in DataStudio. To set different engine versions for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default Resource Queue: This resource queue is used by default when you create an EMR Spark node in DataStudio. To set different resource queues for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default SQL Compute: This SQL Compute is used by default when you create an EMR Spark SQL node in DataStudio. To set different SQL Computes for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default Access Identity: The default value for the development environment is Executor. For the production environment, you can select Alibaba Cloud Account, RAM User, or Node Owner.
Note
This tutorial uses the configuration described above. If your scenario is different, see DataStudio (old version): Bind an EMR compute engine.
Create data sources
This tutorial provides a MySQL database that stores user information and an OSS bucket that stores user log data. You must create data sources for them in DataWorks to use them for data synchronization.
The platform provides the test data and data sources required for this tutorial. Add the data sources to your workspace to access the test data.
The data provided in this tutorial is for hands-on practice in DataWorks only. All data is mock data and can be read only from the Data Integration module.
The OSS Bucket that you created in the Prepare a private OSS environment step is used to receive user information from the MySQL data source and log data from the HttpFile data source.
Create a MySQL data source
In this tutorial, the database for the MySQL data source is provided by the platform. It serves as the data source for a data integration task and provides user information.
In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner of the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, select MySQL.
On the Add MySQL Data Source page, configure the following parameters. In this example, the same sample values are used in the development and production environments.
Data Source Name: The name of the data source. In this example, user_behavior_analysis_mysql is used.
Data Source Description: The description of the data source. This data source is provided exclusively for DataWorks use cases and serves as the source of a batch synchronization task to access the provided test data. It supports data reading only in data synchronization scenarios.
Configuration Mode: Select Connection String Mode.
Connection Address: For Host IP Address, enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. For Port Number, enter 3306.
Database Name: The name of the database. In this example, workshop is used.
Username: The username. In this example, workshop is used.
Password: The password. In this example, workshop#2017 is used.
Authentication Method: Select No Authentication.
Find a desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connected appears in the corresponding column.
Click Complete Creation.
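To double-check the connection values outside DataWorks, the following minimal Python sketch connects to the tutorial's read-only MySQL database with PyMySQL and lists its tables. It assumes the pymysql package is installed and that the machine you run it on can reach the public RDS endpoint.

```python
# Minimal sketch: verify the tutorial's MySQL test database is reachable
# using the same connection values configured in the data source above.
import pymysql

conn = pymysql.connect(
    host="rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
    port=3306,
    user="workshop",
    password="workshop#2017",
    database="workshop",
    connect_timeout=10,
)
try:
    with conn.cursor() as cur:
        # The test data is read-only, so limit the check to listing tables.
        cur.execute("SHOW TABLES")
        for (table_name,) in cur.fetchall():
            print(table_name)
finally:
    conn.close()
```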
Create an HttpFile data source
In this tutorial, the HttpFile data source is an OSS bucket provided by the platform. It serves as the source for a data integration task and provides log data.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. In this example, the same sample values are used in the development and production environments.
Data Source Name: The name of the data source. In this example, user_behavior_analysis_httpfile is used.
Data Source Description: The description of the data source. This data source is provided exclusively for DataWorks use cases and serves as the source of a batch synchronization task to access the provided test data. It supports data reading only in data synchronization scenarios.
URL: Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com in the URL field for both the development and production environments.
Find the desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connected appears in the corresponding column.
Important
Make sure that at least one resource group is connectable. Otherwise, you cannot use the codeless user interface (UI) to configure a data synchronization task for the data source.
Click Complete Creation.
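Because the HttpFile data source is plain HTTPS, you can also probe it with a few lines of Python. In the sketch below, the object key user_log.txt is a hypothetical placeholder; the actual log file paths are configured later in the synchronization task.

```python
# Minimal reachability check for the HttpFile endpoint configured above.
# The object key "user_log.txt" is a hypothetical placeholder; substitute
# the log file path used in your synchronization task.
import urllib.request

BASE_URL = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"

try:
    with urllib.request.urlopen(BASE_URL + "/user_log.txt", timeout=10) as resp:
        print("HTTP status:", resp.status)
        print(resp.read(200))  # print the first 200 bytes of the response
except Exception as exc:
    print("Request failed:", exc)
```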
Add a private OSS data source
For this tutorial, you must prepare your own OSS bucket and create a private OSS data source. This data source serves as the destination for data integration to receive user information and log data.
The private OSS data source is an OSS data source created from your own OSS bucket. It is used to store the user information imported from the MySQL data source and the log data imported from the HttpFile data source. Both the MySQL and HttpFile data sources are provided by the DataWorks documentation.
On the Management Center page, click Data Sources in the left-side navigation pane, and then click Add Data Source.
In the Add Data Source dialog box, search for and select OSS.
In the Add OSS Data Source dialog box, configure the parameters.
Data Source Name: The name of the data source. In this example, test_g is used.
Data Source Description: A brief description of the data source.
Endpoint: Enter http://oss-cn-shanghai-internal.aliyuncs.com.
Bucket: The name of the OSS bucket that you created when you prepared the environment. For example, dw-emr-demo.
Access Mode: Select either RAM Role Authorization Mode or AccessKey Mode.
RAM Role Authorization Mode: DataWorks assumes a RAM role to access the data source by using Security Token Service (STS), which provides higher security. For more information, see Configure a data source in RAM role authorization mode.
AccessKey Mode: Enter the AccessKey ID and AccessKey secret of the current account. You can copy the AccessKey ID from the Security Information Management page.
Important
The AccessKey secret is displayed only when you create it and cannot be viewed later. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new one.
Click Test Connectivity in the Connection Status column for the specified resource group. Wait until the test is complete and the status is Connectable.
Important
Ensure that at least one resource group is in the Connectable state. Otherwise, you cannot use the codeless UI to create a synchronization task for this data source.
Click Complete.
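As an optional check that the AccessKey pair in the data source really grants access to your bucket, the following Python sketch lists a few objects through the same endpoint. The internal endpoint in the data source configuration is reachable only from Alibaba Cloud's internal network; if you run the sketch from your own machine, switch to the public endpoint, as noted in the comments.

```python
# Minimal sketch mirroring the private OSS data source settings above.
# Replace the placeholder credentials and bucket name with your own values.
# Assumption: this runs inside the VPC; from the public internet, use
# "https://oss-cn-shanghai.aliyuncs.com" instead of the internal endpoint.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "http://oss-cn-shanghai-internal.aliyuncs.com", "dw-emr-demo")

# List up to 10 objects to confirm that the AccessKey pair grants read access.
for obj in oss2.ObjectIterator(bucket, max_keys=10):
    print(obj.key, obj.size)
```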
More operations
Now that you have prepared the environment, you can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize basic user information and user website access logs to OSS. Then, you will create a foreign table using Spark SQL to access the data stored in the private OSS bucket. For more information, see Synchronize data.