In DataWorks, you can create and develop E-MapReduce (EMR) nodes based on an EMR cluster, such as EMR Hive, EMR Spark, EMR Spark SQL, EMR Presto, EMR Impala, and EMR MapReduce nodes. You can also configure workflows, schedule the nodes in a workflow, manage metadata, and configure monitoring rules to monitor data quality. This lets you develop and manage data lakes in a centralized manner.

Procedure

  1. Step 1: Create a DataLake cluster or a custom cluster
  2. Step 2: Create a DataWorks workspace
  3. Step 3: Associate the EMR cluster with the workspace
  4. Step 4: Develop and govern data

Step 1: Create a DataLake cluster or a custom cluster

Create a DataLake cluster (recommended) or a custom cluster in the EMR console. For more information, see Create a cluster. In this example, a DataLake cluster is used.

  1. Go to the cluster creation page.
    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
    2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
      • Region: You cannot change the region of a cluster after the cluster is created.
      • Resource group: By default, all resource groups in your account are displayed.
    3. On the EMR on ECS page, click Create Cluster.
  2. On the page that appears, configure the parameters. The following table describes the parameters.
    | Step | Parameter | Example | Description |
    | --- | --- | --- | --- |
    | Software Configuration | Region | China (Hangzhou) | The geographic location where the Elastic Compute Service (ECS) instances of the cluster reside. Important: You cannot change the region after the cluster is created. Select a region based on your business requirements. |
    | Software Configuration | Business Scenario | New Data Lake | The business scenario of the cluster. Default value: New Data Lake. |
    | Software Configuration | Product Version | EMR-5.8.0 | The version of EMR. Select the latest version. |
    | Software Configuration | High Service Availability | Off | Specifies whether to enable high availability for the EMR cluster. If you turn on the High Service Availability switch, EMR distributes master nodes across different underlying hardware devices to reduce the risk of failures. By default, the High Service Availability switch is turned off. |
    | Software Configuration | Optional Services (Select One At Least) | HDFS, YARN, Hive, Spark3, and Tez | The services that can be selected for the cluster. You can select services based on your business requirements. The processes that are related to the selected services are automatically started. |
    | Hardware Configuration | Billing Method | Pay-as-you-go | The billing method of the cluster. If you want to perform a test, we recommend that you use the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment. |
    | Hardware Configuration | Zone | Zone I | The zone where the cluster resides. You cannot change the zone after the cluster is created. Select a zone based on your business requirements. |
    | Hardware Configuration | VPC | vpc_Hangzhou/vpc-bp1f4epmkvncimpgs**** | The virtual private cloud (VPC) in which the cluster is deployed. Select a VPC in the current region. If no VPC is available, click create a VPC to create a VPC. After the VPC is created, click the Refresh icon and select the created VPC. |
    | Hardware Configuration | vSwitch | vsw_i/vsw-bp1e2f5fhaplp0g6p**** | The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch. |
    | Hardware Configuration | Default Security Group | sg_seurity/sg-bp1ddw7sm2risw**** | The security group to which you want to add the cluster. If you have created security groups in EMR, you can select a security group based on your business requirements. You can also create a security group. Important: Do not use an advanced security group that is created in the ECS console. |
    | Hardware Configuration | Node Group | Default configurations | The instances in the cluster. Configure the master node, core nodes, and task nodes based on your business requirements. |
    | Basic Configuration | Cluster Name | Emr-DataLake | The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_). |
    | Basic Configuration | Identity Credentials | Password | The identity credentials that you want to use to remotely access the master node of the cluster. |
    | Basic Configuration | Password and Confirm Password | Custom password | The password that you want to use to access the cluster. Record this password for subsequent operations. |
  3. Click Next: Confirm. In the Confirm step, read the terms of service, select the check box, and then click Confirm.
    The cluster is created when its status changes to Running.
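
The Cluster Name rule in the table above can be checked before you fill in the form. The following is a minimal Python sketch of that rule only. The function name is illustrative, and the sketch assumes that "letters" means ASCII letters; the console may accept a wider character set.

```python
import re

# Rule from the Cluster Name description: 1 to 64 characters,
# only letters, digits, hyphens (-), and underscores (_).
# Assumption: "letters" is read as ASCII letters A-Z and a-z.
CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # "Emr-DataLake" is the example value used in this topic.
    for candidate in ["Emr-DataLake", "emr data lake", ""]:
        print(repr(candidate), is_valid_cluster_name(candidate))
```

The sketch uses `fullmatch` rather than `match` so that a name with a valid prefix followed by a disallowed character is rejected.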

Step 2: Create a DataWorks workspace

Create a workspace in the DataWorks console. For more information, see Create a workspace.

  1. Log on to the DataWorks console.
  2. On the Overview page, click Create Workspace.
  3. In the Create Workspace panel, configure the parameters.
    | Parameter | Example | Description |
    | --- | --- | --- |
    | Workspace Name | emr_dataworks | The name of the workspace. The name must be 3 to 23 characters in length and can contain only letters, underscores (_), and digits. The name must start with a letter. |
    | Isolate Development and Production Environments | No | Specifies the mode of the workspace. If you want to isolate production and development environments, select Yes. In this case, the workspace is in standard mode. If you do not want to isolate production and development environments, select No. In this case, the workspace is in basic mode. |
  4. Click Commit.
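
The two workspace parameters above follow simple rules that can be expressed in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, "letters" is assumed to mean ASCII letters, and the mode strings simply mirror the "standard mode" and "basic mode" wording in the table.

```python
import re

# Rule from the Workspace Name description: 3 to 23 characters,
# only letters, underscores (_), and digits, starting with a letter.
# Assumption: "letters" is read as ASCII letters A-Z and a-z.
WORKSPACE_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,22}")


def is_valid_workspace_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return WORKSPACE_NAME_PATTERN.fullmatch(name) is not None


def workspace_mode(isolate_environments: bool) -> str:
    """Map the Isolate Development and Production Environments choice
    to the resulting workspace mode (Yes -> standard, No -> basic)."""
    return "standard" if isolate_environments else "basic"


if __name__ == "__main__":
    # Example values used in this topic: emr_dataworks, No.
    print(is_valid_workspace_name("emr_dataworks"))
    print(workspace_mode(isolate_environments=False))
```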

Step 3: Associate the EMR cluster with the workspace

Associate the EMR cluster with the workspace in the DataWorks console. For more information, see Configure DataWorks.

  1. On the Workspaces page, find the created workspace and click DataStudio in the Actions column.
  2. In the upper-right corner of the DataStudio page, click the Workspace Manage icon.
  3. In the Compute Engine Information section of the Workspace Management page, click E-MapReduce.
  4. Click Add Instance.
  5. In the Add EMR Cluster dialog box, configure the parameters and click Confirm.
    | Parameter | Example | Description |
    | --- | --- | --- |
    | Instance Display Name | | The display name of the EMR compute engine instance. |
    | Access Mode | Shortcut mode | The mode in which you want to associate the EMR cluster with the workspace. If you want to perform a test, we recommend that you select Shortcut mode for Access Mode. This way, you can quickly associate the EMR cluster with the workspace. You can change the access mode based on your business requirements after the EMR cluster is associated with the workspace. |
    | Cluster ID | Emr-DataLake | The ID of the EMR cluster that you want to associate with the workspace. Only DataLake clusters and custom clusters in the current region are displayed. |
    | YARN resource queue | default | The YARN queue that is used by default when you commit nodes in DataWorks by using the EMR cluster. Default value: default. |
    | Override DataStudio YARN resource queue | Not selected | The queue rule that is used to run nodes. |
    | Initialize Resource Group | Configured based on your business requirements | Initialize the resource group by performing the steps that follow this table. |

    To initialize the resource group:

    1. Select an exclusive resource group for scheduling that connects to the current workspace.

      If no exclusive resource groups for scheduling are available, create an exclusive resource group for scheduling and configure network connectivity for the resource group. For more information, see Create and use an exclusive resource group for scheduling.

    2. Click Initialize to initialize the resource group and test the network connectivity between the exclusive resource group for scheduling and the EMR compute engine.
      You can also select multiple resource groups to initialize the resource groups at a time.
      Note: If the configurations of the EMR cluster that you want to use as an EMR compute engine instance are modified or the versions of the components in the EMR cluster change, you must initialize the resource group again in this dialog box.
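
The re-initialization condition in the preceding note can be read as a comparison between the cluster state at the time of the last initialization and the current state. The following Python sketch illustrates that reading only. `ClusterSnapshot`, `needs_reinitialization`, and the sample keys and version strings are hypothetical and do not correspond to any DataWorks or EMR API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ClusterSnapshot:
    """Hypothetical record of an EMR cluster's state at a point in time."""
    config: Dict[str, str] = field(default_factory=dict)
    component_versions: Dict[str, str] = field(default_factory=dict)


def needs_reinitialization(at_init: ClusterSnapshot,
                           current: ClusterSnapshot) -> bool:
    """Return True if the cluster configurations or component versions
    differ from those recorded when the resource group was initialized."""
    return (at_init.config != current.config
            or at_init.component_versions != current.component_versions)


if __name__ == "__main__":
    # Sample values are made up for illustration.
    at_init = ClusterSnapshot(config={"queue": "default"},
                              component_versions={"Hive": "x.y.1"})
    current = ClusterSnapshot(config={"queue": "default"},
                              component_versions={"Hive": "x.y.2"})
    print(needs_reinitialization(at_init, current))
```

In practice, the check is manual: after any such change, open the Add EMR Cluster dialog box again and click Initialize.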

Step 4: Develop and govern data

After you associate the EMR cluster with the workspace, you can create and develop EMR nodes in DataWorks. You can also manage EMR metadata, perform O&M and monitoring operations on the nodes, and monitor the quality of the data that the nodes generate. This helps ensure that EMR data is generated as expected.

| Item | Description | References |
| --- | --- | --- |
| Data development | You can create and develop an EMR node in DataStudio based on your business requirements. Note: You can create the following types of EMR nodes in DataWorks: EMR Hive, EMR MR, EMR Spark SQL, EMR Spark, EMR Shell, EMR Presto, EMR Impala, and EMR Spark Streaming. | |
| Metadata management | You can create a collector to collect EMR metadata and manage the EMR metadata in DataMap. On the DataMap page, you can view metadata of EMR tables, generated information, and lineages. | |
| Data quality monitoring | Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for tables to monitor the quality of data in the tables. Note: To configure monitoring rules for tables that are generated by an EMR node in an EMR DataLake cluster or a custom cluster, you must select the dqc_emr_plugin_datalake plug-in. | Overview |
| Node O&M and monitoring | The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes. | |

For more information about DataWorks on EMR, see Development process of an EMR node in DataWorks.