
DataWorks: Prepare the environment

Last Updated: Mar 26, 2026

This tutorial walks you through a user profile analysis use case that demonstrates how to synchronize, process, and monitor data quality using DataWorks in the China (Shanghai) region. Before starting, complete the environment setup steps below to create and connect the required services.

What you'll set up

By the end of this page, you will have:

  1. Activated DataWorks and created a DataWorks workspace in Standard Mode.

  2. Created an EMR Serverless Spark workspace (Professional Edition) as the compute resource.

  3. Created an Object Storage Service (OSS) bucket to receive and store synchronized data.

  4. Created a serverless resource group and configured internet access for it.

  5. Registered the EMR Serverless Spark cluster in DataWorks.

  6. Added three data sources: a platform-provided MySQL source, a platform-provided HttpFile source, and your own private OSS destination.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud account with billing enabled

  • Permissions to create resources in the China (Shanghai) region

  • Access to the EMR, OSS, VPC, and DataWorks consoles

DataWorks product preparation

Activate DataWorks on the DataWorks purchase page if you haven't already. For details, see Purchase.

Prepare an EMR Serverless Spark workspace

This tutorial uses EMR Serverless Spark as the compute resource. If you don't have a Spark workspace, go to the E-MapReduce console, select Spark, and create a workspace with the following settings:

Parameter | Value
Region | China (Shanghai)
Payment type | Pay-as-you-go
Workspace name | Enter a custom name
DLF for metadata storage | Select a Data Lake Formation (DLF) data catalog. To completely isolate metadata between different EMR clusters, select different catalogs.
Workspace directory | Select an OSS bucket path to store job log files
Workspace type | Professional Edition. Select Professional Edition for this tutorial. Basic Edition provides powerful compute engines; Professional Edition includes all Basic Edition features plus advanced performance improvements suited for large-scale extract, transform, and load (ETL) jobs.

Prepare a private OSS bucket

Create an OSS bucket to serve as the destination for synchronized data. In the next tutorial, user information from the MySQL source and log data from the HttpFile source will be written to this bucket for data modeling and analysis.

  1. Log on to the OSS console.

  2. In the left navigation pane, click Buckets. On the Buckets page, click Create Bucket.

  3. In the Create Bucket dialog box, configure the following parameters and click Create. For details on other parameters, see Create a bucket in the console.

    Parameter | Value
    Bucket name | Enter a custom name
    Region | China (Shanghai)
    OSS-HDFS | Enable the HDFS service as prompted
  4. On the Buckets page, click the bucket name to go to its Object Management page.
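
Before you choose a custom bucket name, it can help to check it locally. The sketch below assumes the commonly documented OSS naming rules (3 to 63 characters; lowercase letters, digits, and hyphens only; starting and ending with a letter or digit). These rules are not stated on this page, so confirm them against the OSS documentation. The function name is hypothetical.

```python
import re

# Assumed OSS naming rules (not stated on this page): 3-63 characters,
# lowercase letters, digits, and hyphens, starting and ending with a
# lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed OSS naming rules."""
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # "dw-emr-demo" is the example bucket name used later on this page.
    for candidate in ("dw-emr-demo", "DW_EMR_Demo", "-bad-start", "ab"):
        print(candidate, is_valid_bucket_name(candidate))
```

A local check like this only catches formatting problems. Bucket names must also be globally unique, which only the console can confirm.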

Prepare the DataWorks environment

With DataWorks, your EMR Serverless Spark workspace, and your OSS bucket ready, complete the following steps to configure the DataWorks workspace for data synchronization and processing.

Create a DataWorks workspace

  1. Log on to the DataWorks console.

  2. In the left navigation pane, click Workspace Management to open the workspace list.

  3. Click Create Workspace. In the panel that appears, create a workspace in Standard Mode and enable Isolate Development and Production Environments.

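Create the workspace in China (Shanghai) to avoid network connectivity issues when connecting to data sources in that region. This tutorial assumes that Isolate Development and Production Environments is enabled, which is why later steps refer to both environments. If you select No for a simpler setup, apply the steps that mention separate development and production environments to the single environment you have.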

Create a resource group

A resource group provides the compute resources for data synchronization and scheduling. Create a serverless resource group and configure internet access so it can reach external data sources.

Step 1: Purchase a serverless resource group

  1. Log on to the DataWorks console. Switch to the China (Shanghai) region. In the left navigation pane, click Resource Group to open the resource group list.

  2. Click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai) and specify a Resource Group Name. Complete the remaining configuration and payment as prompted. For billing details, see Serverless resource groups.

    Serverless resource groups do not support cross-region operations. Use the same region as your data sources.
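
Because every resource in this tutorial must be in the same region, a small pre-flight check over your planned resources can catch a mismatch early. This is a minimal sketch: the resource names are hypothetical, and the region ID `cn-shanghai` is assumed from the OSS endpoints used on this page.

```python
# Region ID assumed for China (Shanghai); it matches the "cn-shanghai"
# fragment in the OSS endpoints used on this page.
EXPECTED_REGION = "cn-shanghai"


def find_region_mismatches(resources, expected=EXPECTED_REGION):
    """Return the sorted names of resources whose region differs from `expected`."""
    return sorted(name for name, region in resources.items() if region != expected)


if __name__ == "__main__":
    # Hypothetical inventory: names and regions are illustrative only.
    planned = {
        "dataworks_workspace": "cn-shanghai",
        "serverless_resource_group": "cn-shanghai",
        "emr_spark_workspace": "cn-shanghai",
        "oss_bucket": "cn-hangzhou",
    }
    print(find_region_mismatches(planned))
```

An empty list means that every listed resource is in the expected region.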

Step 2: Associate the resource group with your workspace

  1. In the resource group list, find the resource group you purchased.

  2. In the Actions column, click Associate Workspace and select the DataWorks workspace you created.

Step 3: Configure internet access

  1. Log on to the VPC - Internet NAT gateway console. In the top menu bar, switch to the China (Shanghai) region.

  2. Click Create NAT Gateway and configure the following parameters. Keep the default values for all other parameters.

    Parameter | Value
    Region | China (Shanghai)
    Network and zone | Select the VPC and vSwitch of the resource group. To find them, go to the DataWorks console, click Resource Group List, find your resource group, and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section, note the Attached VPC and vSwitch. For details, see What is a VPC?
    Network type | Internet NAT gateway
    EIP | Purchase New EIP
    Service-linked role | If this is your first NAT gateway, click Create Service-linked Role
  3. Click Buy Now. Accept the Terms of Service and click Confirm Order to complete the purchase.

Register an EMR Serverless Spark cluster

Register the Spark workspace in DataWorks so it can be used as the compute engine for data processing tasks.

  1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.

  2. In the left navigation pane, click Clusters. On the Cluster Management page, click Register Cluster. In the dialog box, select E-MapReduce.

  3. Configure the following parameters:

    This tutorial uses the configuration described below. If your setup differs, see Data Studio (legacy version): Associate an EMR computing resource.

    Parameter | Value
    Display name of cluster | Enter a custom name
    Clusters | Select the current Alibaba Cloud account
    Cluster type | EMR Serverless Spark
    Workspace created in EMR Serverless Spark | Select the Spark workspace you created in Prepare an EMR Serverless Spark workspace
    Default engine version | Used by default when creating EMR Spark nodes in DataStudio. To use a different version for a node, set it in the node's Advanced Settings.
    Default resource queue | Used by default when creating EMR Spark nodes in DataStudio. To use a different queue for a node, set it in the node's Advanced Settings.
    Default SQL compute | Used by default when creating EMR Spark SQL nodes in DataStudio. To use different compute settings for a node, set them in the node's Advanced Settings.
    Default access identity | Development environment: Executor. Production environment: Alibaba Cloud Account, RAM User, or Node Owner.

Create data sources

This tutorial uses three data sources:

  • MySQL — provided by the platform; serves as the source for batch synchronization, supplying user information

  • HttpFile — provided by the platform; serves as the source for batch synchronization, supplying log data

  • Private OSS — your own OSS bucket; serves as the destination that receives user information and log data

The MySQL and HttpFile data sources and their test data are provided by the DataWorks documentation team. All data is mock data and is read-only within the Data Integration module. The private OSS data source is the bucket you created in Prepare a private OSS bucket.

Add the MySQL data source

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left navigation pane, choose More > Management Center. Select the target workspace and click Go to Management Center.

  2. In the left navigation pane of the SettingCenter page, click Data Sources. In the upper-left corner, click Add Data Source.

  3. In the Add Data Source dialog box, select MySQL.

  4. On the Add MySQL Data Source page, configure the following parameters:

    Parameter | Value
    Data source name | user_behavior_analysis_mysql
    Data source description | Provided for DataWorks use cases; read-only source for batch synchronization
    Configuration mode | Connection String Mode
    Host IP address | rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com
    Port number | 3306
    Database name | workshop
    Username | workshop
    Password | workshop#2017
    Authentication method | No Authentication
  5. Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns. A Connected status confirms successful connectivity.

  6. Click Complete Creation.
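
The host, port, and database values in the table above identify a single MySQL endpoint. As an illustration only, the sketch below combines them into a standard MySQL JDBC URL. Whether DataWorks builds exactly this string internally is an assumption, and the function name is hypothetical.

```python
def build_mysql_jdbc_url(host: str, port: int, database: str) -> str:
    """Combine host, port, and database into a standard MySQL JDBC URL."""
    if not host or not database:
        raise ValueError("host and database must be non-empty")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:mysql://{host}:{port}/{database}"


if __name__ == "__main__":
    # Values copied from the parameter table for the platform-provided source.
    print(
        build_mysql_jdbc_url(
            host="rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com",
            port=3306,
            database="workshop",
        )
    )
```

The username and password are kept out of the URL on purpose, so that credentials do not end up in logs or shell history.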

Add the HttpFile data source

The HttpFile data source is an OSS bucket provided by the platform that supplies log data for the tutorial.

  1. On the Data Sources page, click Add Data Source and select HttpFile.

  2. On the Add HttpFile Data Source page, configure the following parameters:

    Parameter | Value
    Data source name | user_behavior_analysis_httpfile
    Data source description | Provided for DataWorks use cases; read-only source for batch synchronization
    URL | https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com (for both development and production environments)
  3. Find the resource group you created and click Test Network Connectivity in both the Connection Status (Development Environment) and Connection Status (Production Environment) columns.

    Important

    At least one resource group must show Connected status. Without a connectable resource group, you cannot configure data synchronization tasks using the codeless UI.

  4. Click Complete Creation.
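
The HttpFile URL follows the OSS virtual-hosted pattern, `<bucket>.oss-<region>.aliyuncs.com`. The sketch below splits such a URL into its bucket name and region ID, which is a quick way to confirm that the source is in the same region as your resource group. The function name is hypothetical, and the parsing assumes this host pattern only.

```python
from urllib.parse import urlparse

_DOMAIN_SUFFIX = ".aliyuncs.com"
_INTERNAL_SUFFIX = "-internal"


def split_oss_url(url: str):
    """Return (bucket, region_id) for a URL shaped like
    https://<bucket>.oss-<region>.aliyuncs.com (assumed pattern)."""
    host = urlparse(url).hostname or ""
    bucket, separator, rest = host.partition(".oss-")
    if not bucket or not separator or not rest.endswith(_DOMAIN_SUFFIX):
        raise ValueError(f"not a recognized OSS bucket URL: {url!r}")
    region = rest[: -len(_DOMAIN_SUFFIX)]
    # Internal endpoints carry an "-internal" marker after the region ID.
    if region.endswith(_INTERNAL_SUFFIX):
        region = region[: -len(_INTERNAL_SUFFIX)]
    return bucket, region


if __name__ == "__main__":
    print(split_oss_url("https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"))
```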

Add the private OSS data source

Add your OSS bucket as a private data source to serve as the destination for synchronized data.

  1. On the Management Center page, choose Data Source > Data Source List and click Add Data Source.

  2. In the Add Data Source dialog box, search for and select OSS.

  3. In the Add OSS Data Source dialog box, configure the following parameters. For Access mode, choose one of the following:

    • RAM role authorization mode: DataWorks assumes a role to access the data source using Security Token Service (STS), which provides higher security. For setup details, see Configure a data source in RAM role authorization mode.

    • Access Key mode: Enter the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. Go to the Security Information Management page to copy your AccessKey ID.

    Important

    The AccessKey Secret is only displayed when you first create it. Store it securely. If your AccessKey is lost or compromised, delete it and create a new one.

    Parameter | Value
    Data source name | test_g
    Description | A brief description of the data source
    Endpoint | http://oss-cn-shanghai-internal.aliyuncs.com
    Bucket | The name of the OSS bucket you created (for example, dw-emr-demo)
    Access mode | RAM role authorization mode or Access Key mode (select one)
  4. Click Connected state in the Test Connectivity column for your resource group and wait for the status to show Connectable.

    Important

    At least one resource group must be Connectable. Without this, you cannot use the codeless UI to create sync tasks for this data source.

  5. Click Complete.
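
The Endpoint value used for the private OSS data source is the internal endpoint for China (Shanghai). As a rough sketch, the helper below derives the internal or public OSS endpoint from a region ID. It assumes the usual `oss-<region>[-internal].aliyuncs.com` pattern, and the function name is hypothetical.

```python
def oss_endpoint(region_id: str, internal: bool = True) -> str:
    """Build an OSS endpoint for `region_id`, following the assumed
    oss-<region>[-internal].aliyuncs.com pattern."""
    if not region_id:
        raise ValueError("region_id must be non-empty")
    if internal:
        # Internal endpoints are reachable only from Alibaba Cloud
        # resources in the same region.
        return f"http://oss-{region_id}-internal.aliyuncs.com"
    return f"https://oss-{region_id}.aliyuncs.com"


if __name__ == "__main__":
    print(oss_endpoint("cn-shanghai"))                  # internal endpoint
    print(oss_endpoint("cn-shanghai", internal=False))  # public endpoint
```

The internal form matches the Endpoint value in the table above. It works here because the resource group and the bucket are both in China (Shanghai).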

What's next

With the environment set up, proceed to the next tutorial to synchronize user information and website access logs to OSS, then use Spark SQL to create an external table that accesses data stored in your private OSS bucket. For details, see Synchronize data.