DataWorks: Prepare the environment

Last Updated: Mar 26, 2026

This tutorial walks you through setting up the four services required before you start the user profile analysis experiment in DataWorks using E-MapReduce (EMR) Serverless Spark. All steps take place in the China (Shanghai) region.

By the end of this tutorial, you will have:

  • An Object Storage Service (OSS) bucket to store user information and website access logs

  • An EMR Serverless Spark workspace as the compute and storage layer

  • A DataWorks workspace with dev/prod environment isolation enabled

  • A serverless resource group with public network access, associated with the DataWorks workspace

  • EMR Serverless Spark configured as a computing resource in DataWorks
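Because every resource in this tutorial must be in China (Shanghai), it can help to keep a short checklist of what you created and where. The following is a minimal sketch, not part of any Alibaba Cloud SDK: the resource labels and the `find_region_mismatches` helper are hypothetical, and `cn-shanghai` is the region ID for China (Shanghai).

```python
from __future__ import annotations

REQUIRED_REGION = "cn-shanghai"  # region ID for China (Shanghai)


def find_region_mismatches(
    resources: dict[str, str], required: str = REQUIRED_REGION
) -> dict[str, str]:
    """Return the resources whose region differs from the required one."""
    return {name: region for name, region in resources.items() if region != required}


if __name__ == "__main__":
    # Hypothetical checklist: fill in the region you actually selected for each item.
    created = {
        "oss_bucket": "cn-shanghai",
        "spark_workspace": "cn-shanghai",
        "dataworks_workspace": "cn-shanghai",
        "resource_group": "cn-shanghai",
        "nat_gateway": "cn-shanghai",
    }
    print(find_region_mismatches(created))
```

An empty result means every item on the checklist is in the required region.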

Prerequisites

Before you begin, ensure that you have:

All data used in this tutorial is mock data provided for hands-on practice only.

Step 1: Create an OSS bucket

User information and website access logs are synchronized to an OSS bucket for data modeling and analysis.

  1. Log on to the OSS console.

  2. In the left navigation pane, click Buckets. On the Buckets page, click Create Bucket.

  3. In the Create Bucket dialog box, configure the following parameters, then click Create. Keep the default values for all parameters not listed.

    Parameter | Value
    Bucket Name | dw-spark-demo (or a custom name)
    Region | China (Shanghai)
    OSS-HDFS | Enable the HDFS service as prompted

  4. Click the bucket name to open its Object Management page.
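If you choose a custom bucket name instead of dw-spark-demo, you can check it locally before you open the console. The sketch below assumes the standard OSS naming rules (3 to 63 characters; lowercase letters, digits, and hyphens only; the first and last characters must be a lowercase letter or digit). The `is_valid_bucket_name` helper is hypothetical and does not call OSS, so it cannot tell you whether the name is already taken.

```python
import re

# Assumed OSS naming rules: 3-63 characters, lowercase letters, digits, and
# hyphens only, starting and ending with a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed OSS bucket naming rules."""
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("dw-spark-demo", "DW-Spark-Demo", "-demo", "ab"):
        print(candidate, is_valid_bucket_name(candidate))
```

Bucket names must also be unique across OSS, which only the console or the API can confirm when you create the bucket.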

Step 2: Create an EMR Serverless Spark workspace

  1. If you do not already have a Spark workspace, create one using the parameters below. Keep the default values for all parameters not listed.

    Two workspace editions are available. Basic Edition includes all basic features and a high-performance compute engine. Professional Edition includes everything in Basic Edition plus advanced features and performance improvements, which makes it suitable for large-scale extract, transform, and load (ETL) jobs.
    Parameter | Value
    Region | China (Shanghai)
    Payment Type | Pay-as-you-go
    Workspace Name | Enter a custom name
    DLF for Metadata Storage | Select a Data Lake Formation (DLF) data catalog. Both DLF and DLF-Legacy (displayed as DLF 1.0 on the interface) are supported. After selecting a version, create Paimon or Hive tables accordingly. If you need complete metadata isolation between EMR clusters, select separate catalogs.
    Workspace Directory | Select an OSS bucket path to store job log files

Step 3: Set up the DataWorks environment

3.1 Create a DataWorks workspace

If you already have a DataWorks workspace (new version) in the China (Shanghai) region, skip to Step 3.2.

  1. Log on to the DataWorks console. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Workspace.

  2. Click Create Workspace. Select Use Data Studio (New Version) and enable Isolate Development and Production Environments.

    Starting February 18, 2025, the new version of Data Studio is the default for any Alibaba Cloud account that activates DataWorks and creates its first workspace in the China (Shanghai) region.

For full workspace creation options, see Create a workspace.

3.2 Create a serverless resource group

A serverless resource group is required for data synchronization and scheduling in this tutorial.

Purchase the resource group

  1. Log on to the DataWorks Resource Group List page. In the top navigation bar, set the region to China (Shanghai). In the left navigation pane, click Resource Group.

  2. Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai), specify a Resource Group Name, configure the remaining parameters as prompted, and complete the payment.

    If no virtual private cloud (VPC) or vSwitch is available in the region, create them first. See What is a virtual private cloud (VPC)?. For billing details, see Billing of serverless resource groups.

Associate the resource group with your workspace

A newly purchased resource group must be associated with a workspace before it can be used.

On the DataWorks Resource Group List page, find the resource group you purchased. In the Actions column, click Associate Workspace, then click Associate next to your DataWorks workspace.

Configure public network access

The test data for this tutorial is retrieved from the internet. By default, resource groups have no public network access. Configure an Internet NAT Gateway for the VPC bound to the resource group to enable outbound internet connectivity.

  1. Log on to the VPC Internet NAT Gateway console. In the top menu bar, set the region to China (Shanghai).

  2. Click Create Internet NAT Gateway and configure the following parameters. Keep the default values for all parameters not listed.

    Parameter | Value
    Region | China (Shanghai)
    Network And Zone | Select the VPC and vSwitch bound to the resource group. To find these, go to the DataWorks console, switch to China (Shanghai), and navigate to Resource Group. Click Network Settings in the Actions column of your resource group, then check the VPC and vSwitch listed under Data Scheduling & Data Integration.
    Network Type | Internet NAT Gateway
    EIP | Create EIP
    Service-linked Role | If this is your first NAT Gateway, click Create Service-linked Role

  3. Click Buy Now, accept the terms of service, and click Activate Now.

  4. After the NAT Gateway instance is created, return to the console and create a source NAT (SNAT) entry for it. The SNAT entry is ready when its status changes to Available. At that point, the VPC and any resource groups bound to it have outbound internet access.

    1. Find the new NAT Gateway instance and click Manage in the Actions column. Switch to the SNAT tab.

    2. In the SNAT Entry List section, click Create SNAT Entry and configure the following parameters.

      Parameter | Value
      SNAT Entry | Select Specify VPC. This allows all resource groups in the VPC to access the internet through the configured EIP.
      Select EIP | Select the EIP bound to the current NAT Gateway instance

    3. Click OK.

    The resource group cannot access the internet until SNAT entries are configured.
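Once the SNAT entry is Available, one way to confirm outbound access is a plain TCP connection test. The sketch below uses only the Python standard library. The `can_reach` helper is hypothetical, and it assumes you can run Python on the resource group, for example from a Python node in Data Studio. Running it on your own machine tests your local network, not the resource group.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Call `can_reach` with the host that serves the test data. A result of `False` usually means the SNAT entry is missing, is not yet Available, or is attached to a different VPC than the one bound to the resource group.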

For more information, see Use a serverless resource group.

3.3 Associate EMR Serverless Spark as a computing resource

  1. Go to the DataWorks Workspace List page. In the top navigation bar, set the region to China (Shanghai). Find your workspace and click its name to open the Workspace Details page.

  2. In the left navigation pane, click Computing Resources.

  3. Click Associate Computing Resources. Set Computing Resource Type to EMR Serverless Spark and configure the following parameters. Keep the default values for all parameters not listed.

    Parameter | Description
    EMR Serverless Spark Workspace | Select the Spark workspace you created in Step 2. If you enabled dev/prod environment isolation, select a Spark workspace for both the development and production environments. To create a new Spark workspace inline, click New in the dropdown.
    Default Engine Version | The engine version used by default when you create an EMR Spark node in Data Studio
    Default Resource Queue | The resource queue used by default when you create an EMR Spark node in Data Studio
    Default Access Identity | The identity used to access the Spark workspace from DataWorks. In the development environment, only Executor is supported. In the production environment, Alibaba Cloud account, RAM user, and Node Owner are supported.
    Computing Resource Instance Name | A name that identifies this computing resource. When a node runs, DataWorks uses this name to route the task to the correct resource.

  4. Click Confirm.

For full configuration options, see Associate a computing resource.
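The Default Access Identity rule differs by environment, so it is easy to get wrong when you note down your settings. The following is a minimal sketch of that rule as a local check. The `ComputingResourceBinding` record and the `validate` function are hypothetical and are not a DataWorks API. Only the allowed identity values come from the parameter table above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Identities allowed per environment, as listed for Default Access Identity.
ALLOWED_IDENTITIES = {
    "dev": {"Executor"},
    "prod": {"Alibaba Cloud account", "RAM user", "Node Owner"},
}


@dataclass(frozen=True)
class ComputingResourceBinding:
    """Hypothetical record of one environment's computing resource settings."""

    environment: str  # "dev" or "prod"
    spark_workspace: str
    access_identity: str
    instance_name: str


def validate(binding: ComputingResourceBinding) -> list[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    allowed = ALLOWED_IDENTITIES.get(binding.environment)
    if allowed is None:
        problems.append(f"unknown environment: {binding.environment!r}")
    elif binding.access_identity not in allowed:
        problems.append(
            f"{binding.access_identity!r} is not supported in {binding.environment}; "
            f"expected one of {sorted(allowed)}"
        )
    if not binding.instance_name.strip():
        problems.append("instance name is empty")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    dev = ComputingResourceBinding("dev", "my-spark-ws", "Executor", "spark_demo")
    prod = ComputingResourceBinding("prod", "my-spark-ws", "Executor", "spark_demo")
    print(validate(dev))
    print(validate(prod))
```

With environment isolation enabled, you would fill in one record per environment, because the development and production environments each get their own Spark workspace and access identity.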

What's next

With the environment ready, proceed to Synchronize data to learn how to sync user information and website access logs to OSS, and use a Spark SQL node to create tables and query the synchronized data.