This tutorial walks through data synchronization, data processing, and quality monitoring using DataWorks in the China (Shanghai) region, based on a user profile analysis scenario. Complete the steps in this document to set up the required EMR Serverless StarRocks cluster and DataWorks workspace before starting data development.
By the end of this document, you will have:
An OSS bucket for storing User-Defined Function (UDF) resources, with OSS activated
An EMR Serverless StarRocks instance with a database created
A DataWorks workspace with isolated development and production environments
A Serverless resource group with internet access configured
StarRocks associated as a compute resource in your DataWorks workspace
Prerequisites
Before you begin, read the introduction to understand the user profile analysis experiment.
The tutorial uses mock data intended for hands-on practice only. All user information and website access records are test data supplied with the tutorial. Data transformation is performed using Data Studio (new version).
Prepare the Object Storage Service (OSS) environment
This tutorial uses a User-Defined Function (UDF). The resources required for the registered function are uploaded to OSS.
Activate OSS and create an OSS bucket. Make sure the resources used by the function have public read/write permissions.
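To illustrate why the uploaded resource must be readable from outside OSS: open-source StarRocks registers a Java UDF by pointing at a JAR file through an HTTP URL. The statement below is only a sketch under assumptions. The function name, class name, and URL placeholders are hypothetical and are not values from this tutorial, and the later parts of the tutorial may register the function through a different path.

```sql
-- Hypothetical example: every name and the URL below are placeholders.
CREATE FUNCTION example_udf(string)
RETURNS string
PROPERTIES (
    -- Fully qualified name of the UDF class inside the JAR (assumed).
    "symbol" = "com.example.udf.ExampleUdf",
    -- Marks the function as a Java UDF packaged as a JAR.
    "type" = "StarRocksJar",
    -- HTTP URL from which StarRocks downloads the JAR, for example an object in your OSS bucket.
    "file" = "https://<your-bucket>.<oss-endpoint>/<path>/<your-udf>.jar"
);
```

If StarRocks cannot download the file at the URL given in "file", function creation fails, which is why the object's read permission matters.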
Prepare the EMR Serverless StarRocks environment
This tutorial requires an EMR Serverless StarRocks instance. If you don't have one, create one with the following settings:
| Parameter | Value |
|---|---|
| Instance type | Storage-computing integrated |
| Region | China (Shanghai) |
| Instance edition | Basic Edition |
| Version | 3.1 |
Basic Edition is for trial use and functional testing only and does not guarantee an SLA. Use Standard Edition for production workloads.
After the instance is created, log on to the instance Manager and run the following SQL statement to create a database:
CREATE DATABASE database_name;
Prepare the DataWorks environment
Activate DataWorks before proceeding.
1. Create a workspace
If you already have a new-version workspace in the China (Shanghai) region, skip this step.
Log on to the DataWorks console. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Workspace to open the workspace list.
Click Create Workspace. Select Use Data Studio (New Version) and enable Isolate Development and Production Environments.
Starting February 18, 2025, new Data Studio is enabled by default when an Alibaba Cloud account activates DataWorks and creates a workspace in the China (Shanghai) region.
For detailed instructions, see Create a workspace.
2. Create a Serverless resource group
A Serverless resource group is required for data synchronization and scheduling. Complete all three sub-steps: purchase the resource group, associate it with your workspace, and configure public network access.
Purchase a Serverless resource group
Log on to the DataWorks - Resource Group List page. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Resource Group.
Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai), specify a Resource Group Name, configure other parameters as prompted, and complete the payment. For billing details, see Billing of Serverless resource groups.
If no Virtual Private Cloud (VPC) or vSwitch is available in the current region, click the console link in the parameter description to create them. For more information, see What is a virtual private cloud (VPC)?.
Associate the resource group with your workspace
A newly purchased Serverless resource group must be associated with a workspace before it can be used.
Log on to the DataWorks - Resource Group List page and set the region to China (Shanghai). Find the resource group you purchased. In the Actions column, click Associate Workspace, then click Associate next to your DataWorks workspace.
Configure public network access
The test data in this tutorial is retrieved from the internet. By default, resource groups do not have public network access. Configure an Internet NAT Gateway with an Elastic IP Address (EIP) to enable outbound internet connectivity for the resource group.
Log on to the VPC - Internet NAT Gateway console. In the top navigation bar, set the region to China (Shanghai).
Click Create Internet NAT Gateway and configure the parameters. Keep the default values for parameters not listed below.
| Parameter | Value |
|---|---|
| Region | China (Shanghai) |
| Network and zone | The VPC and vSwitch bound to your resource group. To find these, go to the DataWorks console, switch to China (Shanghai), click Resource Group in the left navigation pane, find your resource group, and click Network Settings in the Actions column. Under Data Scheduling & Data Integration, view the associated VPC and vSwitch. |
| Network type | Internet NAT Gateway |
| EIP | Create EIP |
| Service-linked role | On first-time NAT Gateway creation, click Create Service-linked Role. |

Click Buy Now, accept the terms of service, and click Activate Now.
After the NAT Gateway instance is created, configure SNAT entries to enable internet access for the resource group's VPC:
Find the new NAT Gateway instance, click Manage in the Actions column, and switch to the SNAT tab.
In the SNAT Entry List section, click Create SNAT Entry and configure the following:
| Parameter | Value |
|---|---|
| SNAT entry | Select Specify VPC to cover all resource groups in the VPC. |
| Select EIP | Select the EIP bound to the current NAT Gateway instance. |

Click OK.
When the SNAT entry status changes to Available, the VPC has internet access.
For more information, see Use a Serverless resource group.
3. Associate the StarRocks compute resource
Go to the DataWorks - Workspace List page. In the top navigation bar, set the region to China (Shanghai). Find the workspace you created and click its name to open the Workspace Details page.
In the left navigation pane, click Computing Resources.
Click Associate Computing Resources, set the Compute Resource Type to Serverless StarRocks, and configure the following parameters. Keep the default values for parameters not listed below.
| Parameter | Description |
|---|---|
| StarRocks instance | Select the StarRocks instance. To create a new one, click Create in the drop-down list, create the instance in the EMR StarRocks console, then return here to select it. If you enabled environment isolation when creating the workspace, select StarRocks instances for both the production and development environments. |
| Database name | Select the database you created earlier. |
| Username and Password | The credentials configured when creating the StarRocks instance. The default username is admin. |
| Computing resource instance name | Set to doc_starrocks_storage_compute_tightly_01. Tasks use this name to identify the compute resource. |
| Connection configuration | Select the resource group for connecting to the StarRocks instance. Test connectivity here before saving. |

Click Confirm.
For more information, see Manage computing resources.
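As an optional sanity check after the association, a few statements run against the StarRocks instance can confirm that the database exists and that the session can execute queries. This is a minimal sketch, assuming the database is still named database_name, the placeholder used in the CREATE DATABASE statement earlier. Substitute your actual database name.

```sql
-- List the databases visible to the connected user; database_name should appear.
SHOW DATABASES;

-- Switch to the tutorial database (placeholder name from the earlier step).
USE database_name;

-- Trivial query that succeeds only if the session can execute statements.
SELECT 1;
```

If USE fails with an unknown-database error, the database was most likely created on a different instance or under a different name than the one selected in the compute resource settings.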
What's next
You have completed the environment setup:
An OSS bucket for UDF resources, with OSS activated
An EMR Serverless StarRocks instance and database
A DataWorks workspace with isolated development and production environments
A Serverless resource group with internet access configured
StarRocks associated as a compute resource in your DataWorks workspace
Next, synchronize user profile data and website access logs to OSS, create a table, and query the data using a StarRocks node. See Data synchronization and processing.