This tutorial walks you through a user profile analysis use case that demonstrates how to synchronize, process, and monitor data quality using DataWorks in the China (Shanghai) region. Before starting, complete the environment setup steps below to create and connect the required services.
What you'll set up
By the end of this page, you will have:
-
Activated DataWorks and created a DataWorks workspace in Standard Mode.
-
Created an EMR Serverless Spark workspace (Professional Edition) as the compute resource.
-
Created an Object Storage Service (OSS) bucket to receive and store synchronized data.
-
Created a serverless resource group and configured internet access for it.
-
Registered the EMR Serverless Spark cluster in DataWorks.
-
Added three data sources: a platform-provided MySQL source, a platform-provided HttpFile source, and your own private OSS destination.
Prerequisites
Before you begin, make sure you have:
-
An Alibaba Cloud account with billing enabled
-
Permissions to create resources in the China (Shanghai) region
-
Access to the EMR, OSS, VPC, and DataWorks consoles
DataWorks product preparation
Activate DataWorks on the DataWorks purchase page if you haven't already. For details, see Purchase.
Prepare an EMR Serverless Spark workspace
This tutorial uses EMR Serverless Spark as the compute resource. If you don't have a Spark workspace, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:
| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Payment type | Pay-as-you-go |
| Workspace name | Enter a custom name |
| DLF for metadata storage | Select a Data Lake Formation (DLF) data catalog. To completely isolate metadata between different EMR clusters, select different catalogs. |
| Workspace directory | Select an OSS bucket path to store job log files |
| Workspace type | Professional Edition |
Select Professional Edition for this tutorial. Basic Edition provides the core compute engines. Professional Edition includes all Basic Edition features plus advanced performance improvements suited for large-scale extract, transform, and load (ETL) jobs.
Prepare a private OSS bucket
Create an OSS bucket to serve as the destination for synchronized data. In the next tutorial, user information from the MySQL source and log data from the HttpFile source will be written to this bucket for data modeling and analysis.
-
Log on to the OSS console.
-
In the left navigation pane, click Buckets. On the Buckets page, click Create Bucket.
-
In the Create Bucket dialog box, configure the following parameters and click Create. For details on other parameters, see Create a bucket in the console.
| Parameter | Value |
|---|---|
| Bucket name | Enter a custom name |
| Region | China (Shanghai) |
| OSS-HDFS | Enable the HDFS service as prompted |
-
On the Buckets page, click the bucket name to go to its Object Management page.
Prepare the DataWorks environment
With DataWorks, your EMR Serverless Spark workspace, and your OSS bucket ready, complete the following steps to configure the DataWorks workspace for data synchronization and processing.
Create a DataWorks workspace
-
Log on to the DataWorks console.
-
In the left navigation pane, click Workspace Management to open the workspace list.
-
Click Create Workspace. In the panel that appears, create a workspace in Standard Mode and enable Isolate Development and Production Environments.
Create the workspace in China (Shanghai) to avoid network connectivity issues when connecting to data sources in that region. For a simpler setup, select No for Isolate Development and Production Environments.
Create a resource group
A resource group provides the compute resources for data synchronization and scheduling. Create a serverless resource group and configure internet access so it can reach external data sources.
Step 1: Purchase a serverless resource group
-
Log on to the DataWorks console. Switch to the China (Shanghai) region. In the left navigation pane, click Resource Group to open the resource group list.
-
Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai) and specify a Resource Group Name. Complete the remaining configuration and payment as prompted. For billing details, see Serverless resource groups.
Serverless resource groups do not support cross-region operations. Use the same region as your data sources.
Step 2: Associate the resource group with your workspace
-
In the resource group list, find the resource group you purchased.
-
In the Actions column, click Associate Workspace and select the DataWorks workspace you created.
Step 3: Configure internet access
-
Log on to the VPC - Internet NAT gateway console. In the top menu bar, switch to the China (Shanghai) region.
-
Click Create NAT Gateway and configure the following parameters. Keep the default values for all other parameters.
| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Network and zone | Select the VPC and vSwitch of the resource group. To find them, go to the DataWorks console, click Resource Group List, find your resource group, and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section, note the Attached VPC and vSwitch. For details, see What is a VPC? |
| Network type | Internet NAT gateway |
| EIP | Purchase New EIP |
| Service-linked role | If this is your first NAT gateway, click Create Service-linked Role |
-
Click Buy Now. Accept the Terms of Service and click Confirm Order to complete the purchase.
Register an EMR Serverless Spark cluster
Register the Spark workspace in DataWorks so it can be used as the compute engine for data processing tasks.
-
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.
-
In the left navigation pane, click Clusters. On the Cluster Management page, click Register Cluster. In the dialog box, select E-MapReduce.
-
Configure the following parameters:
| Parameter | Value |
|---|---|
| Display name of cluster | Enter a custom name |
| Clusters | Select the current Alibaba Cloud account |
| Cluster type | EMR Serverless Spark |
| Workspace created in EMR Serverless Spark | Select the Spark workspace you created in Prepare an EMR Serverless Spark workspace |
| Default engine version | Used by default when creating EMR Spark nodes in DataStudio. To use different versions per node, set them in the node's Advanced Settings. |
| Default resource queue | Used by default when creating EMR Spark nodes in DataStudio. To use different queues per node, set them in the node's Advanced Settings. |
| Default SQL compute | Used by default when creating EMR Spark SQL nodes in DataStudio. To use different compute settings per node, set them in the node's Advanced Settings. |
| Default access identity | Development environment: Executor. Production environment: Alibaba Cloud Account, RAM User, or Node Owner. |
This tutorial uses the configuration described above. If your setup differs, see Data Studio (legacy version): Associate an EMR computing resource.
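The three "Default ..." parameters act as cluster-level defaults that an individual node can override in its Advanced Settings. The following is only a conceptual sketch of that default-plus-override behavior in plain Python. It does not call any DataWorks API, and the key names (`engine_version`, `resource_queue`, `sql_compute`) and the placeholder values are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical keys standing in for the three "Default ..." parameters.
OVERRIDABLE = ("engine_version", "resource_queue", "sql_compute")


def effective_settings(
    cluster_defaults: Dict[str, str],
    node_overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return the settings a node would use: cluster defaults, with any
    per-node overrides applied on top. The inputs are not modified."""
    overrides = node_overrides or {}
    unknown = set(overrides) - set(OVERRIDABLE)
    if unknown:
        raise KeyError(f"not an overridable setting: {sorted(unknown)}")
    merged = dict(cluster_defaults)  # copy so the defaults stay untouched
    merged.update(overrides)
    return merged


# Placeholder values only; real values come from your Spark workspace.
defaults = {
    "engine_version": "version-A",
    "resource_queue": "queue-A",
    "sql_compute": "compute-A",
}
node = effective_settings(defaults, {"resource_queue": "queue-B"})
```

In this model, a node that sets nothing in Advanced Settings inherits all three cluster defaults, and a node that sets only a queue keeps the default engine version and SQL compute.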
Create data sources
This tutorial uses three data sources:
-
MySQL — provided by the platform; serves as the source for batch synchronization, supplying user information
-
HttpFile — provided by the platform; serves as the source for batch synchronization, supplying log data
-
Private OSS — your own OSS bucket; serves as the destination that receives user information and log data
The MySQL and HttpFile data sources and their test data are provided by the DataWorks documentation team. All data is mock data and is read-only within the Data Integration module. The private OSS data source is the bucket you created in Prepare a private OSS bucket.
Add the MySQL data source
-
Log on to the DataWorks console. In the top navigation bar, select the target region. In the left navigation pane, choose More > Management Center. Select the target workspace and click Go to Management Center.
-
In the left navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.
-
In the Add Data Source dialog box, select MySQL.
-
On the Add MySQL Data Source page, configure the following parameters:
| Parameter | Value |
|---|---|
| Data source name | user_behavior_analysis_mysql |
| Data source description | Provided for DataWorks use cases; read-only source for batch synchronization |
| Configuration mode | Connection String Mode |
| Host IP address | rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com |
| Port number | 3306 |
| Database name | workshop |
| Username | workshop |
| Password | workshop#2017 |
| Authentication method | No Authentication |
-
Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. A Connected status confirms successful connectivity.
-
Click Complete Creation.
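If the connectivity test fails, it can help to check separately whether the MySQL host and port accept TCP connections at all. The sketch below uses only Python's standard `socket` module. The helper name `can_connect` is made up for this example. It checks reachability from the machine that runs the script, not from the serverless resource group, so its result may differ from the status shown in DataWorks.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS resolution failures, and timeouts.
        return False


# Example call, using the host and port from the MySQL data source table:
# can_connect("rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com", 3306)
```

A `False` result from your own machine does not by itself mean the data source is misconfigured. The database may be reachable only through the resource group's network path.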
Add the HttpFile data source
The HttpFile data source is an OSS bucket provided by the platform that supplies log data for the tutorial.
-
On the Data Sources page, click Add Data Source and select HttpFile.
-
On the Add HttpFile Data Source page, configure the following parameters:
| Parameter | Value |
|---|---|
| Data source name | user_behavior_analysis_httpfile |
| Data source description | Provided for DataWorks use cases; read-only source for batch synchronization |
| URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com (for both development and production environments) |
-
Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.
Important: At least one resource group must show Connected status. Without a connectable resource group, you cannot configure data synchronization tasks using the codeless UI.
-
Click Complete Creation.
Add the private OSS data source
Add your OSS bucket as a private data source to serve as the destination for synchronized data.
-
On the Management Center page, choose Data Source > Data Source List and click Add Data Source.
-
In the Add Data Source dialog box, search for and select OSS.
-
In the Add OSS Data Source dialog box, configure the following parameters:
| Parameter | Value |
|---|---|
| Data source name | test_g |
| Description | A brief description of the data source |
| Endpoint | http://oss-cn-shanghai-internal.aliyuncs.com |
| Bucket | The name of the OSS bucket you created (for example, dw-emr-demo) |
| Access mode | RAM role authorization mode or Access Key mode (select one) |
For RAM role authorization mode, DataWorks assumes a role to access the data source using Security Token Service (STS), which provides higher security. For setup details, see Configure a data source in RAM role authorization mode.
For Access Key mode, enter the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. Go to the Security Information Management page to copy your AccessKey ID.
Important: The AccessKey Secret is only displayed when you first create it. Store it securely. If your AccessKey is lost or compromised, delete it and create a new one.
-
Click Connected state in the Test Connectivity column for your resource group and wait for the status to show Connectable.
Important: At least one resource group must be Connectable. Without this, you cannot use the codeless UI to create sync tasks for this data source.
-
Click Complete.
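Because serverless resource groups do not support cross-region operations, it is worth confirming that every OSS address you entered points to the same region. The sketch below parses the two OSS hostnames used in this tutorial. It assumes the hostname pattern `[bucket.]oss-<region>[-internal].aliyuncs.com`, which is inferred only from those two URLs, so other endpoint forms may not parse. `OssEndpoint` and `parse_oss_endpoint` are names invented for this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class OssEndpoint(NamedTuple):
    bucket: Optional[str]  # None when the hostname has no bucket prefix
    region: str            # region part of the hostname, e.g. "cn-shanghai"
    internal: bool         # True when the hostname carries "-internal"


def parse_oss_endpoint(url: str) -> OssEndpoint:
    """Split an OSS URL into bucket, region, and internal flag.

    Assumed pattern: [bucket.]oss-<region>[-internal].aliyuncs.com
    """
    host = urlparse(url).hostname or ""
    suffix = ".aliyuncs.com"
    if not host.endswith(suffix):
        raise ValueError(f"not an aliyuncs.com host: {host!r}")
    labels = host[: -len(suffix)].split(".")
    oss_label = labels[-1]
    if not oss_label.startswith("oss-"):
        raise ValueError(f"not an OSS endpoint: {host!r}")
    region = oss_label[len("oss-"):]
    internal = region.endswith("-internal")
    if internal:
        region = region[: -len("-internal")]
    bucket = ".".join(labels[:-1]) or None
    return OssEndpoint(bucket, region, internal)


# The two OSS addresses that appear in this tutorial.
httpfile = parse_oss_endpoint(
    "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"
)
private = parse_oss_endpoint("http://oss-cn-shanghai-internal.aliyuncs.com")
same_region = httpfile.region == private.region
```

For the two addresses above, both parse to the region `cn-shanghai`, and only the private OSS endpoint is flagged as internal.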
What's next
With the environment set up, proceed to the next tutorial to synchronize user information and website access logs to OSS, then use Spark SQL to create an external table that accesses data stored in your private OSS bucket. For details, see Synchronize data.