This tutorial uses a user persona example to demonstrate how to use DataWorks to synchronize, process, and monitor the quality of data in the China (Shanghai) region. To complete this tutorial, you must prepare the required EMR Serverless Spark and DataWorks workspaces and complete the environment configuration.
DataWorks product preparation
Ensure that you have activated DataWorks. If you have not activated DataWorks, you can activate it on the DataWorks page. For more information, see Purchase guide.
Prepare an EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark as the computing resource. Ensure that you have a Spark workspace. If you do not have a Spark workspace, go to the E-MapReduce console, select Spark, and create a workspace.
Region: China (Shanghai).
Billing Method: Pay-as-you-go.
Workspace Name: Enter a custom name.
DLF as a Metadata Service: Select a DLF data catalog. To completely isolate metadata between different EMR clusters, select different catalogs.
Workspace Base Path: Select an OSS bucket path to store job log files.
Workspace Type: Select Professional Edition for this tutorial.
Note
Professional Edition: This workspace includes all features of the Basic Edition, in addition to advanced features and performance improvements. It is suitable for large-scale extract, transform, and load (ETL) jobs.
Basic Edition: This workspace includes all basic features and provides powerful compute engines.
Prepare a private OSS environment
For this tutorial, you need to create an OSS bucket. User information and website access logs will be synchronized to this bucket for data modeling and analysis.
Log on to the OSS console.
In the navigation pane on the left, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket dialog box, configure the parameters and click Create.
Bucket Name: Enter a custom name.
Region: Select China (Shanghai).
HDFS Service: Enable the HDFS service as prompted on the UI.
For more information about the parameters, see Create a bucket in the console.
On the Buckets page, click the name of the Bucket to go to the Files page of the bucket.
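If you want to confirm the new bucket from code before you move on, the following minimal Python sketch uses the oss2 SDK to check that the bucket exists and is writable. This is an optional check, not a required tutorial step: the bucket name dw-emr-demo matches the example used later in this tutorial, and the credential placeholders must be replaced with your own AccessKey pair.

```python
# Minimal sketch: verify that the tutorial bucket exists and is writable.
# Assumptions: bucket name "dw-emr-demo" and the China (Shanghai) public
# endpoint; replace the credential placeholders with your own AccessKey pair.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "dw-emr-demo")

# get_bucket_info() raises an oss2 exception if the bucket does not exist.
info = bucket.get_bucket_info()
print("bucket:", info.name, "location:", info.location)

# Upload and delete a small marker object to confirm write access.
bucket.put_object("connectivity_check.txt", b"ok")
bucket.delete_object("connectivity_check.txt")
```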
Prepare the DataWorks environment
After you prepare DataWorks, an EMR Serverless Spark workspace, and an OSS bucket, you must create a DataWorks workspace, register a Spark cluster, and create data sources. These steps prepare the environment for data synchronization and processing.
Create a DataWorks workspace
Log on to the DataWorks console.
In the navigation pane on the left, click Workspace Management to go to the workspace list page.
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode by setting Isolate Development And Production Environments to Yes.
The data resources in this tutorial are in the China (Shanghai) region. We recommend that you create the workspace in China (Shanghai) to avoid network connectivity issues when you add data sources from other regions. For a simpler setup, you can instead set Isolate Development And Production Environments to No.
Create a resource group
Before you use DataWorks, you must create a resource group to provide resources for data synchronization and scheduling. Ensure that the network connection between the resource group and the Serverless Spark workspace is stable.
Purchase a serverless resource group.
Log on to the DataWorks console. Switch to the target region. In the navigation pane on the left, click Resource Group to go to the resource group list page.
Click Create Resource Group. On the resource group purchase page, set Region And Zone to China (Shanghai) and specify a Resource Group Name. Configure the other parameters and complete the payment as prompted. For more information about the billing of serverless resource groups, see Billing of serverless resource groups.
Note
This tutorial uses a serverless resource group in the China (Shanghai) region as an example. Serverless resource groups do not support cross-region operations.
Configure the serverless resource group.
Log on to the DataWorks console. Switch to the target region. In the navigation pane on the left, click Resource Group to go to the resource group list page.
Find the serverless resource group that you purchased. In the Actions column, click Bind Workspace. Bind the resource group to the DataWorks workspace that you created.
Configure Internet access for the resource group.
Log on to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region.
Click Create NAT Gateway. Configure the parameters.
Region: China (Shanghai).
Network And Zone: Select the VPC and vSwitch to which the resource group is attached. To view them, go to the DataWorks console, switch to the China (Shanghai) region, and click Resource Group List in the navigation pane on the left. Find the resource group that you created and click Network Settings in the Actions column. The attached VPC and vSwitch appear in the Data Scheduling & Data Integration section. For more information about VPCs and vSwitches, see What is a VPC?
Network Type: Internet NAT Gateway.
Elastic IP Address: Purchase New EIP.
Service-linked Role Creation: If you are creating a NAT gateway for the first time, you must create a service-linked role. Click Create Service-linked Role.
Note
Keep the default values for the parameters that are not mentioned above.
Click Buy Now. Select the Terms of Service and click Confirm Order to complete the purchase.
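After the NAT gateway is created, tasks that run on the serverless resource group should have internet egress. As a quick sanity check, you can run a small probe from a task on the resource group; the sketch below is one hedged way to do this with only the Python standard library. The target URL is arbitrary and serves only as a reachability probe.

```python
# Quick egress check to run from a task on the resource group after the
# NAT gateway is in place. Any public HTTPS endpoint works as the probe target.
import urllib.request

try:
    with urllib.request.urlopen("https://www.aliyun.com", timeout=10) as resp:
        print("Internet egress OK, HTTP status:", resp.status)
except Exception as exc:
    print("No internet egress yet:", exc)
```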
Register an EMR Serverless Spark cluster
Data storage and data processing for the user persona analysis are performed in an EMR Serverless Spark cluster. You must register the Spark cluster before you can use it.
Go to the SettingCenter page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box that opens, select E-MapReduce to configure the EMR Serverless Spark cluster.
Register the E-MapReduce cluster.
Display Name: Enter a custom name.
Alibaba Cloud Account for the Cluster: Select the current Alibaba Cloud account.
Cluster Type: EMR Serverless Spark.
E-MapReduce Workspace: Select the workspace that you prepared in the Prepare an EMR Serverless Spark workspace section.
Default Engine Version: This engine version is used by default when you create an EMR Spark node in DataStudio. To set different engine versions for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default Resource Queue: This resource queue is used by default when you create an EMR Spark node in DataStudio. To set different resource queues for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default SQL Compute: This SQL Compute is used by default when you create an EMR Spark SQL node in DataStudio. To set different SQL Computes for different nodes, you can define them in the Advanced Settings of the Spark node editing window.
Default Access Identity: The default value for the development environment is Executor. For the production environment, you can select Alibaba Cloud Account, RAM User, or Node Owner.
Note
This tutorial uses the configuration described above. If your scenario is different, see DataStudio (old version): Bind an EMR compute engine.
Create data sources
This tutorial provides a MySQL database that stores user information and an OSS bucket that stores user log data. You must create data sources for them in DataWorks to use them for data synchronization.
The platform provides the test data and data sources required for this tutorial. Add the data sources to your workspace to access the test data.
The data provided in this tutorial is for hands-on practice in DataWorks only. All data is mock data and can be read only from the Data Integration module.
The OSS Bucket that you created in the Prepare a private OSS environment step is used to receive user information from the MySQL data source and log data from the HttpFile data source.
Create a MySQL data source
In this tutorial, the database for the MySQL data source is provided by the platform. It serves as the data source for a data integration task and provides user information.
In the left-side navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner of the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, select MySQL.
On the Add MySQL Data Source page, configure the following parameters. In this example, the same sample values are used in the development and production environments.
Data Source Name: The name of the data source. In this example, user_behavior_analysis_mysql is used.
Data Source Description: The description of the data source. This data source is provided exclusively for DataWorks use cases and serves as the source of a batch synchronization task to access the provided test data. It supports data reading only in data synchronization scenarios.
Configuration Mode: Select Connection String Mode.
Connection Address: For Host IP Address, enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. For Port Number, enter 3306.
Database Name: The name of the database. In this example, workshop is used.
Username: The username. In this example, workshop is used.
Password: The password. In this example, workshop#2017 is used.
Authentication Method: Select No Authentication.
Find a desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connected appears in the corresponding column.
Click Complete Creation.
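To double-check the connection values outside DataWorks, the following minimal Python sketch connects to the tutorial's read-only MySQL database with PyMySQL and lists its tables. It assumes the pymysql package is installed and that the machine you run it on can reach the public RDS endpoint.

```python
# Minimal sketch: verify the tutorial's MySQL test database is reachable
# using the same connection values configured in the data source above.
import pymysql

conn = pymysql.connect(
    host="rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
    port=3306,
    user="workshop",
    password="workshop#2017",
    database="workshop",
    connect_timeout=10,
)
try:
    with conn.cursor() as cur:
        # The test data is read-only, so limit the check to listing tables.
        cur.execute("SHOW TABLES")
        for (table_name,) in cur.fetchall():
            print(table_name)
finally:
    conn.close()
```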
Create an HttpFile data source
In this tutorial, the HttpFile data source is an OSS bucket provided by the platform. It serves as the source for a data integration task and provides log data.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the following parameters. In this example, the same sample values are used in the development and production environments.
Data Source Name: The name of the data source. In this example, user_behavior_analysis_httpfile is used.
Data Source Description: The description of the data source. This data source is provided exclusively for DataWorks use cases and serves as the source of a batch synchronization task to access the provided test data. It supports data reading only in data synchronization scenarios.
URL: Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com in the URL field for both the development and production environments.
Find the desired resource group and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. If the network connectivity test is successful, Connected appears in the corresponding column.
Important
Make sure that at least one resource group is connectable. Otherwise, you cannot use the codeless user interface (UI) to configure a data synchronization task for the data source.
Click Complete Creation.
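Because the HttpFile data source is plain HTTPS, you can also probe it with a few lines of Python. In the sketch below, the object key user_log.txt is a hypothetical placeholder; the actual log file paths are configured later in the synchronization task.

```python
# Minimal reachability check for the HttpFile endpoint configured above.
# The object key "user_log.txt" is a hypothetical placeholder; substitute
# the log file path used in your synchronization task.
import urllib.request

BASE_URL = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"

try:
    with urllib.request.urlopen(BASE_URL + "/user_log.txt", timeout=10) as resp:
        print("HTTP status:", resp.status)
        print(resp.read(200))  # print the first 200 bytes of the response
except Exception as exc:
    print("Request failed:", exc)
```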
Add a private OSS data source
For this tutorial, you must prepare your own OSS bucket and create a private OSS data source. This data source serves as the destination for data integration to receive user information and log data.
The private OSS data source is an OSS data source created from your own OSS bucket. It is used to store the user information imported from the MySQL data source and the log data imported from the HttpFile data source. Both the MySQL and HttpFile data sources are provided by the DataWorks documentation.
On the Management Center page, click Data Sources in the left-side navigation pane, and then click Add Data Source.
In the Add Data Source dialog box, search for and select OSS.
In the Add OSS Data Source dialog box, configure the parameters.
Data Source Name: The name of the data source. In this example, test_g is used.
Data Source Description: A brief description of the data source.
Endpoint: Enter http://oss-cn-shanghai-internal.aliyuncs.com.
Bucket: The name of the OSS bucket that you created when you prepared the environment. For example, dw-emr-demo.
Access Mode: Select either RAM Role Authorization Mode or AccessKey Mode.
RAM Role Authorization Mode: DataWorks assumes a RAM role to access the data source by using Security Token Service (STS), which provides higher security. For more information, see Configure a data source in RAM role authorization mode.
AccessKey Mode: Enter the AccessKey ID and AccessKey secret of the current account. You can copy the AccessKey ID from the Security Information Management page.
Important
The AccessKey secret is displayed only when you create it and cannot be viewed later. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new one.
Click Test Connectivity in the Connection Status column for the specified resource group. Wait until the test is complete and the status is Connectable.
Important
Ensure that at least one resource group is in the Connectable state. Otherwise, you cannot use the codeless UI to create a synchronization task for this data source.
Click Complete.
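As an optional check that the AccessKey pair in the data source really grants access to your bucket, the following Python sketch lists a few objects through the same endpoint. The internal endpoint in the data source configuration is reachable only from Alibaba Cloud's internal network; if you run the sketch from your own machine, switch to the public endpoint, as noted in the comments.

```python
# Minimal sketch mirroring the private OSS data source settings above.
# Replace the placeholder credentials and bucket name with your own values.
# Assumption: this runs inside the VPC; from the public internet, use
# "https://oss-cn-shanghai.aliyuncs.com" instead of the internal endpoint.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "http://oss-cn-shanghai-internal.aliyuncs.com", "dw-emr-demo")

# List up to 10 objects to confirm that the AccessKey pair grants read access.
for obj in oss2.ObjectIterator(bucket, max_keys=10):
    print(obj.key, obj.size)
```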
More operations
Now that you have prepared the environment, you can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize basic user information and user website access logs to OSS. Then, you will create a foreign table using Spark SQL to access the data stored in the private OSS bucket. For more information, see Synchronize data.