
E-MapReduce: Getting started with DataWorks on EMR

Last Updated: Dec 07, 2023

In DataWorks, you can create Hive, Spark SQL, Presto, MapReduce, and other types of nodes based on an E-MapReduce (EMR) compute engine. You can also configure workflows, schedule the nodes in a workflow on a regular basis, manage metadata, and configure monitoring rules to monitor data quality. This way, you can develop and govern data lakes in a centralized manner. This topic describes how to use an EMR cluster in DataWorks.

Procedure

  1. Step 1: Create a cluster

    Create a DataLake cluster in the EMR console. For more information, see Create a cluster.

  2. Step 2: Create a workspace

    Create a workspace in the DataWorks console. For more information, see Create a workspace.

  3. Step 3: Associate the EMR cluster with the workspace

    Associate the EMR cluster with the workspace in the DataWorks console. For more information, see Register an EMR cluster to DataWorks.

  4. Step 4: Develop and govern data

    After you associate the EMR cluster with the workspace, you can create and develop an EMR node. This way, you can manage EMR metadata, perform O&M and monitoring operations on the node, and monitor the data quality of the node in DataWorks. This ensures that EMR data can be generated as expected. For more information, see Usage notes for development of EMR nodes in DataWorks.

Step 1: Create a cluster

  1. Go to the cluster creation page.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.

      • You cannot change the region of a cluster after the cluster is created.

      • By default, all resource groups in your account are displayed.

    3. On the EMR on ECS page, click Create Cluster.

  2. On the page that appears, configure the parameters. The following parameters are grouped by configuration step.

    Software Configuration

    • Region
      Example: China (Hangzhou)
      The geographic location where the ECS instances of the cluster reside.
      Important: You cannot change the region after the cluster is created. Select a region based on your business requirements.

    • Business Scenario
      Example: New Data Lake
      The business scenario of the cluster. Default value: New Data Lake.

    • Product Version
      Example: EMR-5.14.0
      The version of EMR. Select the latest version.

    • High Service Availability
      Example: Off
      Specifies whether to enable high availability for the EMR cluster. If you turn on the High Service Availability switch, EMR distributes master nodes across different underlying hardware devices to reduce the risk of failures. By default, this switch is turned off.

    • Optional Services (Select One At Least)
      Example: Hadoop-Common, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP
      The services to deploy in the cluster. Select services based on your business requirements. The processes related to the selected services are automatically started.
      Note: In addition to the default services of the cluster, you must also select the Knox and OpenLDAP services.

    • Collect Service Operational Logs
      Example: On
      Specifies whether to enable log collection for all services. By default, this switch is turned on. If you turn on this switch, the service operational logs of your cluster are collected. The logs are used only for cluster diagnostics.
      Important: If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impact of disabling it, see How do I stop collection of service operational logs?

    • Metadata
      Example: DLF Unified Metadata
      If you select DLF Unified Metadata, metadata is stored in Data Lake Formation (DLF). The system selects the default DLF catalog to store the metadata. If you want different clusters to be associated with different DLF catalogs, click Create Catalog to create DLF catalogs based on your business requirements.
      Note: To configure this parameter, make sure that Alibaba Cloud DLF is activated.

    • Root Storage Directory of Cluster
      Example: 1366993922******
      The root storage directory of cluster data. Select a bucket for which the OSS-HDFS service is enabled.
      Note:
      • Before you use the OSS-HDFS service, make sure that it is available in the region in which you want to create the cluster. If OSS-HDFS is unavailable in the region, change the region or use HDFS instead. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions.
      • You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a later minor version. If you select HDFS instead of OSS-HDFS, you do not need to configure this parameter.

    Hardware Configuration

    • Billing Method
      Example: Pay-as-you-go
      The billing method of the cluster. If you want to perform a test, we recommend the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment.

    • Zone
      Example: Zone I
      The zone where the cluster resides. You cannot change the zone after the cluster is created. Select a zone based on your business requirements.

    • VPC
      Example: vpc_Hangzhou/vpc-bp1f4epmkvncimpgs****
      The virtual private cloud (VPC) in which the cluster is deployed. Select a VPC in the current region. If no VPC is available, click create a VPC to create one. After the VPC is created, click the Refresh icon and select the created VPC.

    • vSwitch
      Example: vsw_i/vsw-bp1e2f5fhaplp0g6p****
      The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create one.

    • Default Security Group
      Example: sg_seurity/sg-bp1ddw7sm2risw****
      The security group to which you want to add the cluster. If you have created security groups in EMR, select one based on your business requirements. You can also create a security group.
      Important: Do not use an advanced security group that is created in the ECS console.

    • Node Group
      Example: Turn on the Assign Public Network IP switch for the master node and use the default settings for the other parameters.
      The instances in the cluster. Configure the master node, core nodes, and task nodes based on your business requirements. For more information, see Select configurations.

    Basic Configuration

    • Cluster Name
      Example: Emr-DataLake
      The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).

    • Identity Credentials
      Example: Password
      The identity credentials that you want to use to remotely access the master node of the cluster.

    • Password and Confirm Password
      Example: Custom password
      The password that you want to use to access the cluster. Record this password for subsequent operations.

  3. Click Next: Confirm. In the Confirm step, read the terms of service, select the check box, and then click Confirm.

    The cluster is successfully created if the cluster is in the Running state. For more information about cluster parameters, see Create a cluster.
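The Cluster Name rule above (1 to 64 characters; letters, digits, hyphens, and underscores only) can be checked locally before you submit the form. The following is a minimal sketch; the function name is our own illustration, not part of any EMR SDK:

```python
import re

# Pattern for the cluster name rule described above:
# 1 to 64 characters, letters, digits, hyphens (-), and underscores (_).
_CLUSTER_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the cluster name rule."""
    return bool(_CLUSTER_NAME_RE.fullmatch(name))

print(is_valid_cluster_name("Emr-DataLake"))  # True
print(is_valid_cluster_name("emr datalake"))  # False: contains a space
```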

Step 2: Create a workspace

  1. Log on to the DataWorks console.

  2. On the Overview page, click Create Workspace.

  3. In the Create Workspace panel, configure the following parameters.

    • Workspace Name
      Example: emr_dataworks
      The name of the workspace. The name must be 3 to 23 characters in length and can contain only letters, underscores (_), and digits. The name must start with a letter.

    • Isolate Development and Production Environments
      Example: No
      Specifies the mode of the workspace.
      • If you want to isolate the production and development environments, select Yes. In this case, the workspace that you create is in standard mode.
      • If you do not want to isolate the production and development environments, select No. In this case, the workspace that you create is in basic mode.
  4. Click Commit.
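The Workspace Name rule above (3 to 23 characters; letters, underscores, and digits; starting with a letter) can also be validated locally. A minimal sketch; the function name is our own, not a DataWorks API:

```python
import re

# Workspace name rule from the parameters above: 3 to 23 characters,
# letters, underscores (_), and digits only, starting with a letter.
_WORKSPACE_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{2,22}$")

def is_valid_workspace_name(name: str) -> bool:
    """Return True if `name` satisfies the workspace name rule."""
    return bool(_WORKSPACE_NAME_RE.fullmatch(name))

print(is_valid_workspace_name("emr_dataworks"))  # True
print(is_valid_workspace_name("1emr"))           # False: must start with a letter
```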

Step 3: Associate the EMR cluster with the workspace

For information about the development of EMR nodes in DataWorks, see Usage notes for development of EMR nodes in DataWorks.

  1. After you create a workspace, click Associate Now to the right of E-MapReduce in the Recommended Big Data Compute Engines section in the Create Workspace panel.

  2. On the Associate EMR Compute Engine page, click Associate and Continue.

  3. On the Open Source Clusters page, click Registering a cluster.

  4. In the Select Cluster Type dialog box, click E-MapReduce. On the Register E-MapReduce cluster page, configure the following parameters and click Complete Registration.

    • Cluster Display Name
      Example: dataworks_test
      The display name of the EMR cluster in DataWorks. The value must be unique.

    • Cloud Account To Which The Cluster Belongs
      Example: Current Alibaba Cloud primary account
      The type of the Alibaba Cloud account to which the EMR cluster that you want to register in the current workspace belongs.

    • Cluster Type
      Example: Data Lake
      The type of the EMR cluster that you want to register.

    • Cluster
      Example: Emr-DataLake
      The EMR cluster that you want to associate with the current workspace.

    • Default Access Identity
      Example: Cluster account: hadoop
      The identity that you want to use to access the EMR cluster in the current workspace.
  5. In the E-MapReduce section, click Initialize Resource Group.

    Initialize the exclusive resource group for scheduling that you want to use. This ensures that a network connection is established between the resource group and the EMR cluster.

    Note

    DataWorks allows you to use only exclusive resource groups for scheduling to run EMR tasks. Therefore, you can select only an exclusive resource group for scheduling when you initialize a resource group.
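Conceptually, initialization verifies that the resource group can reach the cluster over the network. A reachability probe of this kind can be sketched as follows; the host and port in the example are placeholders, and this is not how DataWorks performs the check internally:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe a hypothetical EMR master endpoint (placeholder host/port).
# can_reach("emr-master.example.internal", 8443)
```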

Step 4: Develop and govern data

After the EMR cluster is associated with the workspace, you can use the following capabilities:

• Data development
  You can create and develop an EMR node in DataStudio based on your business requirements.

• Metadata management
  You can create a collector to collect EMR metadata and manage the EMR metadata in Data Map. On the DataMap page, you can view the metadata, output information, and lineages of EMR tables. For more information, see Data Map.

• Data quality monitoring
  Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for tables to monitor the quality of data in the tables. For more information, see Data Quality overview.
  Note: To configure monitoring rules for tables that are generated by an EMR node in an EMR DataLake cluster or a custom cluster, you must select the dqc_emr_plugin_datalake plug-in.

• Node O&M and monitoring
  The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes.
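The kind of monitoring rule that Data Quality applies to a table can be illustrated with a small, self-contained sketch. This is pure Python for illustration only, not the DataWorks rule engine; the thresholds, column names, and rows are made up:

```python
from typing import Any

def check_table(rows: list[dict[str, Any]], column: str,
                min_rows: int = 1, max_null_rate: float = 0.1) -> dict[str, bool]:
    """Evaluate two common table-quality rules against in-memory rows:
    a row-count floor and a null-rate ceiling on one column."""
    n = len(rows)
    nulls = sum(1 for r in rows if r.get(column) is None)
    null_rate = nulls / n if n else 1.0  # an empty table fails the null-rate rule
    return {
        "row_count_ok": n >= min_rows,
        "null_rate_ok": null_rate <= max_null_rate,
    }

rows = [{"id": 1}, {"id": 2}, {"id": None}]
print(check_table(rows, "id", min_rows=1, max_null_rate=0.5))
# {'row_count_ok': True, 'null_rate_ok': True}
```

In a real setup, the equivalent rules would run against the output table of a scheduled EMR node and trigger alerts on failure.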