To perform this tutorial, you must first create an E-MapReduce (EMR) cluster and a DataWorks workspace and configure the development environment. This topic describes how to make the preparations.

Prerequisites

  • An Alibaba Cloud account is created.
  • Real-name verification is complete.
  • The following services are activated:
    • EMR: an open source big data platform. For more information, see the billing details of the EMR service.
    • DataWorks: a big data development and governance platform. For more information about the billing details of the DataWorks service, see Purchase guide.
    • Object Storage Service (OSS): an object storage service. For more information about how to activate OSS, see Activate OSS.

Procedure

Before you start this tutorial, perform the following operations to prepare the environment:
  1. Create and configure an EMR cluster

    An EMR data lake cluster is a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner. You must create and configure an EMR data lake cluster that is used to run EMR nodes in DataWorks. For more information, see Create an EMR cluster.

  2. Create a DataWorks workspace

    You must create a DataWorks workspace. Workspaces are basic units for managing permissions in DataWorks. You can create workspaces based on the organizational structure of your company. For more information, see Create a DataWorks workspace.

  3. Configure the development environment that is required to develop EMR nodes in DataWorks
    Before you can develop EMR nodes in DataWorks to run EMR jobs, you must purchase a resource group in DataWorks, add members to a DataWorks workspace, and associate an EMR compute engine instance with the workspace. This way, the EMR jobs can run as expected. For more information, see Configure the development environment that is required to develop EMR nodes in DataWorks.
    Note After a workspace is created, you must associate an EMR compute engine instance with the workspace before you can run EMR nodes.
  4. Create an OSS bucket

    You must create an OSS bucket to store EMR metadata and the JAR resources that are required to run EMR nodes. For more information about how to create an OSS bucket, see Create a bucket.
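If you prefer to script the bucket creation that is described above instead of using the OSS console, you can use the oss2 Python SDK. The following sketch is illustrative: the bucket name `dw-emr-tutorial` and the credential placeholders are assumptions, and the endpoint shown is for the China (Shanghai) region used in this tutorial.

```python
import re

# OSS bucket naming rules: 3 to 63 characters, lowercase letters,
# digits, and hyphens only; must start and end with a letter or digit.
def is_valid_bucket_name(name: str) -> bool:
    return (3 <= len(name) <= 63 and
            re.fullmatch(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", name) is not None)

def create_tutorial_bucket(access_key_id: str, access_key_secret: str,
                           bucket_name: str = "dw-emr-tutorial"):
    """Create the bucket in the China (Shanghai) region.
    Requires `pip install oss2` and valid credentials; not executed here."""
    import oss2  # imported lazily so the validator works without oss2 installed
    assert is_valid_bucket_name(bucket_name)
    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", bucket_name)
    bucket.create_bucket()
```

The validator enforces the documented OSS naming rules locally, so an invalid name fails fast before any API call is made.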

Create an EMR cluster

Note The optimal configurations of an EMR component for you to run EMR nodes in DataWorks vary based on the type of the component you use. Before you create an EMR cluster to run EMR nodes in DataWorks, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.
Perform the following steps to create an EMR cluster:
  1. Log on to the new EMR console.
  2. In the top navigation bar, select the China (Shanghai) region. On the Overview page, click Cluster Wizard.
    Note
    • The source data is stored in the China (Shanghai) region. Therefore, we recommend that you create an EMR cluster in the same region as the source data.
    • You can select Quick Purchase or Cluster Wizard to create an EMR cluster. In this topic, Cluster Wizard is selected.
  3. Create an EMR data lake cluster.
    1. On the Cluster Wizard page, set the Cluster Type parameter to DataLake and use the default values for other parameters in the Software Settings step.
    2. Click Next: Hardware Settings.
  4. Configure hardware settings for the cluster.
    1. In the Hardware Settings step, set the Billing Method parameter to Pay-As-You-Go and configure the parameters in the Network Settings and Instance sections.
    2. Click Next: Basic Settings.
  5. Configure basic settings for the cluster.
    1. In the Basic Settings step, configure the Cluster Name parameter and select a key pair from the Key Pair drop-down list.
    2. Click Next: Confirm.
    By default, Assign Public IP Address is turned off. If you do not turn on this switch, you cannot access the cluster over the Internet after the cluster is created. In this tutorial, you are not required to assign a public IP address. Therefore, click Next in the Assign Public IP Address message. To access the cluster over the Internet, log on to the Elastic Compute Service (ECS) console and assign an elastic IP address (EIP) to the ECS instance that corresponds to the cluster.
  6. In the Confirm step, verify your configuration, read the terms of service and select E-MapReduce Service Terms, and then click Create.

Create a DataWorks workspace

Note All data resources in this tutorial reside in the China (Shanghai) region. We recommend that you create a workspace in the China (Shanghai) region. If you create a workspace in a different region and you add a data source in the workspace, the data source may fail the network connectivity test.
  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. In the top navigation bar, select the region where you want to create a workspace and click Create Workspace.
  4. Configure the information about the workspace.
    1. Configure the basic information.
      Basic Information section:
      • Workspace Name: The name of the workspace.
      • Display Name: The display name of the workspace in the DataWorks console.
      • Mode: The mode of the workspace. Valid values: Basic Mode (Production Environment Only) and Standard Mode (Development and Production Environments).
        • Basic Mode (Production Environment Only): A workspace in basic mode is associated with only one compute engine project. Workspaces in basic mode do not isolate the development environment from the production environment. In these workspaces, you can perform only basic data development and cannot strictly control the data development process or the permissions on tables.
        • Standard Mode (Development and Production Environments): A workspace in standard mode is associated with two compute engine projects. One project serves as the development environment, and the other serves as the production environment. Workspaces in standard mode allow you to develop code in a standard way and strictly control the permissions on tables. These workspaces impose limits on table operations in the production environment for data security.
        For more information, see Basic mode and standard mode.
      • Description: The description of the workspace.
      Advanced Settings section:
      • Download SELECT Query Result: Specifies whether the query results that are returned by SELECT statements in DataStudio can be downloaded. If you turn off this switch, the query results cannot be downloaded.
        Note You can change the setting of this parameter for the workspace in the Workspace Settings panel after the workspace is created. For more information, see Configure security settings.
    2. Click Next in the subsequent steps until the Engine Details step appears.
    3. In the Engine Details step, click Create Workspace.

Configure the development environment that is required to develop EMR nodes in DataWorks

Before you develop EMR nodes in DataWorks to run EMR jobs, you must configure the following settings for the development environment:
  • EMR: To prevent an error from being reported when an EMR node is run in DataWorks, make sure that the key configurations of the related EMR data lake cluster meet the requirements. The key configurations are used to authenticate, in the EMR data lake cluster, the identity of the account that you use to run the EMR node in DataWorks. For more information, see Configure an EMR data lake cluster.
  • DataWorks:
    • Purchase a resource group: You must purchase an exclusive resource group for scheduling and connect the resource group to the virtual private cloud (VPC) in which the EMR data lake cluster resides.
    • Add members to a workspace and assign roles to the members: You must add the users who develop data to a workspace as workspace members before they can develop EMR nodes on the DataStudio page in DataWorks.
    • Associate an EMR cluster with a DataWorks workspace: Associate the EMR data lake cluster with the DataWorks workspace as an EMR compute engine instance.
    For more information, see Configure DataWorks.

Create a bucket

  1. Log on to the OSS console.
  2. In the left-side navigation pane, click Buckets.
  3. On the Buckets page, click Create Bucket.
  4. In the Create Bucket panel, configure the parameters and click OK.
    Note Select China (Shanghai) from the Region drop-down list. For more information about the parameters, see Create buckets.
  5. Click the name of the created bucket in the Bucket Name column to go to the Files page.
  6. On the Files page, click Create Folder. In the Create Folder panel, configure the Folder Name parameter and click OK.
    Note Create three folders: one for the OSS source data, one for the ApsaraDB RDS source data, and one for the JAR resources.
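The three folders can also be created with the oss2 Python SDK. OSS has no real directories: a "folder" in the console is simply a zero-byte object whose key ends with a slash. The folder names below are hypothetical placeholders for the three data sources in this tutorial; substitute your own names.

```python
def folder_keys(names):
    """Normalize folder names into OSS 'folder' object keys.
    A folder in OSS is a zero-byte object whose key ends with '/'."""
    return [n.rstrip("/") + "/" for n in names]

def create_folders(bucket, names):
    """Create each folder by uploading an empty object.
    `bucket` is an oss2.Bucket instance with valid credentials;
    this function is not executed here."""
    for key in folder_keys(names):
        bucket.put_object(key, b"")

# Hypothetical folder names for the OSS source data, the ApsaraDB RDS
# source data, and the JAR resources used by the EMR nodes.
TUTORIAL_FOLDERS = ["oss-source", "rds-source", "emr-jars"]
```

Because the folders are plain objects, you can verify them afterward by listing the bucket with the `/` delimiter, the same way the console renders the folder view.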