DataWorks: Prepare environments

Last Updated: Mar 25, 2025

This tutorial describes how to perform user profile analysis. In this tutorial, DataWorks is used to synchronize data, process data, and monitor data quality. To ensure that you can complete the tutorial as expected, you must first create an E-MapReduce (EMR) cluster and a DataWorks workspace and configure the required environments.

Business background

To develop effective business management strategies, you need basic profile data about your website users, derived from their activities on the website. This data includes the geographical and social attributes of the users. With it, you can analyze user profiles by time and location and fine-tune how you manage website traffic.

Usage notes

Before you begin, read Experiment introduction to understand the entire process of the user profile analysis experiment. This helps you complete the tutorial as expected.

Precautions

  • The basic user information and website access logs that are required for this experiment are provided.

  • The data in this experiment is manually generated mock data and can be used only for experimental operations in DataWorks.

  • In this experiment, DataStudio is used.

Prepare an EMR environment

Create an EMR cluster

This tutorial requires an EMR cluster that is registered to DataWorks, so that you can run data processing tasks on the cluster from the DataWorks console. When you create the EMR cluster, configure the following parameters in the Software Configuration step.

| Parameter | Description |
| --- | --- |
| Region | Select China (Shanghai). |
| Business Scenario | Select Data Lake. |
| Product Version | Select the latest version. |
| Optional Services (Select One At Least) | Select components based on your business requirements. This tutorial requires the Hive and OSS-HDFS components. |
| Metadata | Select DLF Unified Metadata. |
| Root Storage Directory of Cluster | Select an OSS-HDFS bucket. If no option is available in the drop-down list, click Create OSS-HDFS Bucket. |

For more information about how to create an EMR cluster, see Step 1: Create a cluster.
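
The settings in the preceding table can also be written down as a small checklist and compared against a planned cluster configuration before you start. The following Python sketch is illustrative only: the dictionary keys (`region`, `business_scenario`, `metadata`, `services`, `root_storage_bucket`) and the function `check_cluster_plan` are hypothetical names chosen for this example and do not correspond to any EMR or DataWorks API.

```python
from typing import Dict, List

# Values taken from the parameter table in this tutorial.
# The key names are hypothetical and exist only for this sketch.
REQUIRED_SETTINGS = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
}
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}


def check_cluster_plan(plan: Dict) -> List[str]:
    """Return a list of mismatches between a planned cluster and the tutorial."""
    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = plan.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    # The tutorial needs at least the Hive and OSS-HDFS components.
    missing = REQUIRED_SERVICES - set(plan.get("services", []))
    if missing:
        problems.append("missing services: " + ", ".join(sorted(missing)))
    # The root storage directory must point to an OSS-HDFS bucket.
    if not plan.get("root_storage_bucket"):
        problems.append("root_storage_bucket: select or create an OSS-HDFS bucket")
    return problems
```

An empty list means that the plan matches every requirement listed in the table. The Product Version parameter is not checked because "the latest version" changes over time.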

Note

DataWorks support varies based on the configuration of the EMR cluster. Before you create an EMR cluster and develop EMR tasks in DataWorks based on the cluster, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.

Prepare a DataWorks environment

Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.

Step 1: Create a workspace

If a workspace exists in the China (Shanghai) region, skip this step and use the existing workspace.

  1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.

  2. In the left-side navigation pane, click Workspace. On the Workspaces page, click Create Workspace to create a workspace in standard mode. For more information, see Create a workspace. For a workspace in standard mode, the development environment is isolated from the production environment.

Step 2: Create a serverless resource group

This tutorial requires a serverless resource group for data synchronization and scheduling. Therefore, you need to purchase and configure a serverless resource group.

  1. Purchase a serverless resource group.

    1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.

    2. On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow the on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.

      Note
      • In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Note that serverless resource groups do not support cross-region operations.

      • If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one. For more information about VPCs and vSwitches, see What is a VPC?

  2. Associate the serverless resource group with the DataWorks workspace.

    You can use the serverless resource group that you purchased in subsequent operations only after you associate the serverless resource group with a workspace.

    Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace that you want to associate with the resource group and click Associate in the Actions column.

  3. Enable the serverless resource group to access the Internet.

    The test data used in this tutorial must be obtained over the Internet, but a serverless resource group cannot access the Internet by default. To enable access, configure an Internet NAT gateway with an elastic IP address (EIP) for the VPC with which the serverless resource group is associated. This connects the VPC to the network environment of the test data so that the serverless resource group can access the data.

    1. Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

    2. Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.

      | Parameter | Description |
      | --- | --- |
      | Region | Select China (Shanghai). |
      | VPC and Associate vSwitch | Select the VPC and vSwitch with which the resource group is associated. To view the VPC and vSwitch, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the region in which you activated DataWorks. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
      | Access Mode | Select SNAT-enabled Mode. |
      | EIP | Select Purchase EIP. |
      | Service-linked Role | If this is the first time that you create a NAT gateway, click Create Service-linked Role to create a service-linked role. |

    3. Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.

For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
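
After the NAT gateway and EIP are in place, one way to confirm outbound connectivity is to open a TCP connection from a machine in the same VPC, such as an ECS instance that you control. The following Python sketch uses only the standard library. It is not run inside the serverless resource group itself, and the function name `can_reach` is a hypothetical name chosen for this example. Replace the host with the endpoint that you actually need to reach.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within the timeout.

    The connect parameter defaults to socket.create_connection and can be
    replaced with a stub so that the function can be tested offline.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    conn.close()
    return True
```

For example, `can_reach("example.com")` returns False when no route to the Internet exists from the host on which it runs. A successful result on such a host only suggests that the NAT gateway works for the VPC. It does not replace the network connectivity diagnosis that DataWorks provides for the resource group.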

Step 3: Register the EMR cluster to DataWorks and initialize the resource group

You can use the EMR cluster in DataWorks only after you register the cluster to DataWorks.

  1. Go to the Register EMR Cluster page.

    1. Go to the SettingCenter page.

      Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.

  2. Register the EMR cluster to DataWorks.

    On the Register EMR Cluster page, configure cluster information. The following table describes the key parameters.

    | Parameter | Description |
    | --- | --- |
    | Alibaba Cloud Account to Which Cluster Belongs | Set it to Current Alibaba Cloud Account. |
    | Cluster Type | Select Data Lake. |
    | Default Access Identity | Set it to Cluster Account: hadoop. |
    | Pass Proxy User Information | Set it to Pass. |

  3. Initialize the resource group.

    1. Go to the Cluster Management page in SettingCenter. Find the EMR cluster that is registered to DataWorks and click Initialize Resource Group in the section that displays the information of the EMR cluster.

    2. In the Initialize Resource Group dialog box, find the desired resource group and click Initialize.

    3. After the initialization is complete, click OK.

    Important

    Make sure that the resource group is initialized successfully. Otherwise, tasks that use the resource group may fail. If the initialization fails, view the failure cause and perform a network connectivity diagnosis as prompted.

For more information about how to register an EMR cluster, see Register an EMR cluster to DataWorks.
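
Because the preparation spans several consoles, it can help to track which steps are complete before you move on. The following Python sketch models the steps in this topic as a simple record and reports what is still pending. The class `EnvironmentStatus`, its field names, and the function `pending_steps` are hypothetical and exist only for this illustration. The values are entered by hand and are not read from any DataWorks or EMR API.

```python
from dataclasses import dataclass
from typing import List

# Region used throughout this tutorial.
TUTORIAL_REGION = "China (Shanghai)"


@dataclass
class EnvironmentStatus:
    """Manually recorded state of the environment (hypothetical fields)."""
    workspace_region: str
    resource_group_region: str
    resource_group_associated: bool
    internet_access_enabled: bool
    cluster_registered: bool
    resource_group_initialized: bool


def pending_steps(status: EnvironmentStatus) -> List[str]:
    """Return the steps in this topic that still need attention."""
    steps = []
    if status.workspace_region != TUTORIAL_REGION:
        steps.append("Step 1: create a workspace in China (Shanghai)")
    # Serverless resource groups do not support cross-region operations.
    if status.resource_group_region != status.workspace_region:
        steps.append("Step 2: use a resource group in the workspace region")
    if not status.resource_group_associated:
        steps.append("Step 2: associate the resource group with the workspace")
    if not status.internet_access_enabled:
        steps.append("Step 2: configure an Internet NAT gateway and an EIP")
    if not status.cluster_registered:
        steps.append("Step 3: register the EMR cluster to DataWorks")
    if not status.resource_group_initialized:
        steps.append("Step 3: initialize the resource group")
    return steps
```

An empty result indicates that every step recorded in the structure is complete.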

What to do next

You have prepared your environments and can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize the basic user information and website access logs of users to OSS, and how to create a table in an EMR Hive node to query the synchronized data. For more information, see Synchronize data.