Before you use EMR Serverless Spark with DataWorks to synchronize, process, and monitor data quality, set up the required resources. This tutorial uses a user profile example in the China (Shanghai) region. Complete the following tasks to prepare the environment:
Prerequisites
Before you begin, make sure that you have activated DataWorks. If not yet activated, go to the DataWorks activation page. For details, see Purchase guide.
Create an EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark as the compute engine. If you already have a Spark workspace, skip this step. Otherwise, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:
| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Billing Method | Pay-as-you-go |
| Workspace Name | Enter a custom name |
| DLF as a Metadata Service | Select a Data Lake Formation (DLF) data catalog. To fully isolate metadata between EMR clusters, select different catalogs. |
| Workspace Base Path | Select an OSS bucket path to store job log files |
| Workspace Type | Professional Edition |
Professional Edition includes all Basic Edition features plus advanced capabilities and performance improvements, and is suited for large-scale extract, transform, and load (ETL) jobs. Basic Edition provides the core features and standard compute engines.
Create an OSS bucket
User information and website access logs are synchronized to an OSS bucket for data modeling and analysis. Create a dedicated bucket for this purpose.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket dialog box, configure the following parameters, then click Create. For more information about the parameters, see Create a bucket in the console.
| Parameter | Value |
|---|---|
| Bucket Name | Enter a custom name |
| Region | China (Shanghai) |
| HDFS Service | Enable the HDFS service as prompted on the UI |

On the Buckets page, click the bucket name to go to the Files page.
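If you prefer to script this step, the bucket can also be created with the official Alibaba Cloud OSS Python SDK (oss2). The sketch below is a hedged example: the bucket name and credentials are placeholders, and the HDFS service must still be enabled in the console as described above.

```python
# Sketch: create the tutorial bucket with the OSS Python SDK (oss2)
# instead of the console. Bucket name and credentials are placeholders.

def oss_endpoint(region_id: str) -> str:
    """Build the public OSS endpoint for a region ID such as cn-shanghai."""
    return f"https://oss-{region_id}.aliyuncs.com"

def create_bucket(access_key_id: str, access_key_secret: str,
                  bucket_name: str, region_id: str = "cn-shanghai") -> None:
    import oss2  # pip install oss2
    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, oss_endpoint(region_id), bucket_name)
    bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE)  # private ACL by default

if __name__ == "__main__":
    # create_bucket("<AccessKey ID>", "<AccessKey secret>", "my-dataworks-demo")
    print(oss_endpoint("cn-shanghai"))
```

Note that the SDK path does not enable the HDFS service; do that in the console after the bucket exists.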
Set up the DataWorks environment
After the EMR Serverless Spark workspace and OSS bucket are ready, create a DataWorks workspace, register the Spark cluster, and configure data sources.
In DataWorks, an EMR Serverless Spark workspace is registered as a cluster.
Create a DataWorks workspace
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace Management to open the workspace list page.
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode by enabling the Isolate Development And Production Environments parameter.
The data resources in this tutorial are in the China (Shanghai) region. Create the workspace in China (Shanghai) to avoid network connectivity issues when adding data sources from other regions. For a simpler setup, you can instead set Isolate Development And Production Environments to No, which creates a workspace in basic mode.
Create a resource group
A resource group provides the compute resources for data synchronization and scheduling. Make sure the network connection between the resource group and the EMR Serverless Spark workspace is stable.
Purchase a serverless resource group
Log on to the DataWorks console. Switch to the target region. In the left-side navigation pane, click Resource Group to open the resource group list page.
Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai) and specify a Resource Group Name. Configure the remaining parameters and complete the payment as prompted. For billing details, see Billing of serverless resource groups.
This tutorial uses a serverless resource group in the China (Shanghai) region. Serverless resource groups do not support cross-region operations.
Bind and configure the resource group
Log on to the DataWorks console. Switch to the target region. In the left-side navigation pane, click Resource Group to open the resource group list page.
Find the serverless resource group. In the Actions column, click Bind Workspace. Bind the resource group to the DataWorks workspace created earlier.
Configure internet access for the resource group:
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region.
Click Create NAT Gateway and configure the following parameters:

| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Network And Zone | Select the VPC and vSwitch to which the resource group is attached. To find these values, go to the DataWorks console, switch to the region, and click Resource Group List in the left-side navigation pane. Find the resource group, click Network Settings in the Actions column, and check Attached VPC and VSwitch in the Data Scheduling & Data Integration section. For more information, see What is a VPC? |
| Network Type | Internet NAT gateway |
| Elastic IP Address | Purchase New EIP |
| Service-linked Role Creation | If you are creating a NAT gateway for the first time, click Create Service-linked Role |

Note: Keep the default values for parameters not listed in this table.
Click Buy Now. Select the Terms of Service and click Confirm Order to complete the purchase.
Register the EMR Serverless Spark cluster
Data storage and processing for the user persona analysis run on the EMR Serverless Spark cluster. Register the cluster in DataWorks before use.
Go to the SettingCenter page. To do so, log on to the DataWorks console, select the desired region in the top navigation bar, and choose More > Management Center in the left-side navigation pane. Then select the workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box, select E-MapReduce.
Configure the EMR Serverless Spark cluster with the following parameters:
| Parameter | Description |
|---|---|
| Display Name | Enter a custom name |
| Alibaba Cloud Account for the Cluster | Select the current Alibaba Cloud account |
| Cluster Type | EMR Serverless Spark |
| E-MapReduce Workspace | Select the workspace prepared in the Create an EMR Serverless Spark workspace section |
| Default Engine Version | The default engine version for EMR Spark nodes in DataStudio. To set a different version for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default Resource Queue | The default resource queue for EMR Spark nodes in DataStudio. To set a different queue for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default SQL Compute | The default SQL Compute for EMR Spark SQL nodes in DataStudio. To set a different SQL Compute for an individual node, update the Advanced Settings in the Spark node editing window. |
| Default Access Identity | Development environment: Executor. Production environment: Alibaba Cloud Account, RAM User, or Node Owner. |
This tutorial uses the configuration described above. For different scenarios, see DataStudio (old version): Bind an EMR compute engine.
Create data sources
This tutorial uses a MySQL database for user information and an OSS bucket for user log data. Both are provided by the platform. Create data sources in DataWorks to enable data synchronization.
The platform provides the test data and data sources for this tutorial. Add them to your workspace to access the test data.
All data is mock data for hands-on practice in DataWorks only. It is read-only within the Data Integration module.
The OSS bucket created in the Create an OSS bucket step receives user information from the MySQL data source and log data from the HttpFile data source.
Create a MySQL data source
The MySQL database is provided by the platform. It serves as the source for a data integration task and provides user information.
In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner of the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, select MySQL.
On the Add MySQL Data Source page, configure the following parameters. Use the sample values for both the development and production environments.
| Parameter | Value |
|---|---|
| Data Source Name | user_behavior_analysis_mysql |
| Data Source Description | Data source provided for DataWorks use cases. Used as the source of a batch synchronization task to access test data. Read-only in data synchronization scenarios. |
| Configuration Mode | Connection String Mode |
| Connection Address | Host IP Address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com, Port Number: 3306 |
| Database Name | workshop |
| Username | workshop |
| Password | workshop#2017 |
| Authentication Method | No Authentication |

Find the resource group. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Verify that Connected appears.
Click Complete Creation.
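Before adding the source in DataWorks, you can optionally confirm that the sample database is reachable from your own network. The sketch below is a hedged example using PyMySQL with the sample values from the table above; the connectivity check requires network access to the host, so it is left commented out.

```python
# Sketch: verify the tutorial MySQL source with PyMySQL.
# Connection values are the sample values from the data source table.

TUTORIAL_MYSQL = {
    "host": "rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
    "port": 3306,
    "database": "workshop",
    "user": "workshop",
    "password": "workshop#2017",
}

def jdbc_url(cfg: dict) -> str:
    """Connection string as it would look in connection string mode."""
    return f"jdbc:mysql://{cfg['host']}:{cfg['port']}/{cfg['database']}"

def check_connection(cfg: dict) -> bool:
    import pymysql  # pip install pymysql
    conn = pymysql.connect(host=cfg["host"], port=cfg["port"],
                           user=cfg["user"], password=cfg["password"],
                           database=cfg["database"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone()[0] == 1
    finally:
        conn.close()

if __name__ == "__main__":
    print(jdbc_url(TUTORIAL_MYSQL))
    # print(check_connection(TUTORIAL_MYSQL))  # requires network access to the host
```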
Create an HttpFile data source
The HttpFile data source is an OSS bucket provided by the platform. It serves as the source for a data integration task and provides log data.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. Select the workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. Use the sample values for both the development and production environments.
| Parameter | Value |
|---|---|
| Data Source Name | user_behavior_analysis_httpfile |
| Data Source Description | Data source provided for DataWorks use cases. Used as the source of a batch synchronization task to access test data. Read-only in data synchronization scenarios. |
| URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com |

Find the resource group. Click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Verify that Connected appears.
At least one resource group must be connectable. Otherwise, the codeless UI for configuring data synchronization tasks is not available.
Click Complete Creation.
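Because the HttpFile source is plain HTTPS, you can preview its contents with the standard library alone. The sketch below is a hedged example: the object path `user_log.txt` is a placeholder, since the actual file names are specified later in the synchronization task configuration.

```python
# Sketch: build and preview URLs served by the HttpFile data source.
# The object path used below is a placeholder, not a confirmed file name.
from urllib.parse import urljoin
from urllib.request import urlopen

HTTPFILE_BASE = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"

def file_url(path: str) -> str:
    """Absolute URL for an object behind the HttpFile data source."""
    return urljoin(HTTPFILE_BASE, path)

def preview(path: str, num_bytes: int = 256) -> bytes:
    """Fetch the first bytes of a file; requires internet access."""
    with urlopen(file_url(path)) as resp:
        return resp.read(num_bytes)

if __name__ == "__main__":
    print(file_url("user_log.txt"))   # placeholder object name
    # print(preview("user_log.txt"))  # uncomment when online
```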
Add a private OSS data source
Create a private OSS data source from the OSS bucket prepared earlier. This data source serves as the destination for data integration: it stores the user information imported from the MySQL data source and the log data imported from the HttpFile data source, both of which are provided by the platform.
On the Management Center page, choose Data Source > Data Source List, then click Add Data Source.
In the Add Data Source dialog box, search for and select OSS.
In the Add OSS Data Source dialog box, configure the parameters. Select one of the following authentication modes:

| Mode | Description |
|---|---|
| RAM Role Authorization Mode | DataWorks assumes roles to access the data source through Security Token Service (STS). This provides higher security. For more information, see Configure a data source in RAM role authorization mode. |
| AccessKey Mode | Important: The AccessKey secret is displayed only at creation time and cannot be retrieved later. Keep it confidential. If the AccessKey is leaked or lost, delete it and create a new AccessKey. |
If you use AccessKey Mode, configure the following parameters:

| Parameter | Value |
|---|---|
| AccessKey ID | The AccessKey ID of the current account. Copy it from the Security Information Management page. |
| AccessKey Secret | The AccessKey secret of the current account. |

Click Test Connectivity in the Connection Status column for the resource group. Wait until the status shows Connectable.
At least one resource group must be in the Connectable state. Otherwise, the codeless UI for creating sync tasks is not available.
Click Complete.
Next steps
The environment is ready. Proceed to the data synchronization tutorial to sync basic user information and website access logs to OSS. Then create a foreign table with Spark SQL to access the data in the private OSS bucket. For details, see Synchronize data.
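As a preview of that next step, the sketch below generates the kind of Spark SQL statement used to map a foreign (external) table onto the private OSS bucket. This is a hedged illustration: the table name, columns, and `oss://` path are all placeholders, and the real definitions come from the Synchronize data tutorial.

```python
# Sketch: build a Spark SQL DDL statement for an external table over OSS.
# Table name, columns, and the oss:// location below are placeholders.

def external_table_ddl(table: str, columns: dict, location: str) -> str:
    """Render a CREATE TABLE ... USING CSV LOCATION statement."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
            f"USING CSV LOCATION '{location}'")

ddl = external_table_ddl(
    "ods_user_info",                           # placeholder table name
    {"uid": "STRING", "gender": "STRING"},     # placeholder columns
    "oss://my-dataworks-demo/ods_user_info/",  # placeholder bucket path
)

if __name__ == "__main__":
    print(ddl)  # in the Spark workspace you would run spark.sql(ddl)
```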