In DataWorks, you can create and develop E-MapReduce (EMR) nodes, such as EMR Hive nodes, EMR Spark nodes, EMR Spark SQL nodes, EMR Presto nodes, EMR Impala nodes, and EMR MapReduce nodes, based on an EMR cluster. You can also configure a workflow, schedule nodes in the workflow, manage metadata, and configure monitoring rules to monitor the data quality. This way, you can develop and manage data lakes in a centralized manner.
Procedure
Step 1: Create a DataLake cluster or a custom cluster
Create a DataLake cluster (recommended) or a custom cluster in the EMR console. For more information, see Create a cluster. In this example, a DataLake cluster is used.
- Go to the cluster creation page.
- On the page that appears, configure the parameters. The following table describes the parameters.
Step | Parameter | Example | Description |
---|---|---|---|
Software Configuration | Region | China (Hangzhou) | The geographic location where the Elastic Compute Service (ECS) instances of the cluster reside. Important: You cannot change the region after the cluster is created. Select a region based on your business requirements. |
Software Configuration | Business Scenario | New Data Lake | The business scenario of the cluster. Default value: New Data Lake. |
Software Configuration | Product Version | EMR-5.8.0 | The version of EMR. Select the latest version. |
Software Configuration | High Service Availability | Off | Specifies whether to enable high availability for the EMR cluster. If you turn on the High Service Availability switch, EMR distributes master nodes across different underlying hardware devices to reduce the risk of failures. By default, the switch is turned off. |
Software Configuration | Optional Services (Select One At Least) | HDFS, YARN, Hive, Spark3, and Tez | The services that can be selected for the cluster. Select services based on your business requirements. The processes that are related to the selected services are automatically started. |
Hardware Configuration | Billing Method | Pay-as-you-go | The billing method of the cluster. For testing, we recommend the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment. |
Hardware Configuration | Zone | Zone I | The zone where the cluster resides. You cannot change the zone after the cluster is created. Select a zone based on your business requirements. |
Hardware Configuration | VPC | vpc_Hangzhou/vpc-bp1f4epmkvncimpgs**** | The virtual private cloud (VPC) in which the cluster is deployed. Select a VPC in the current region. If no VPC is available, click create a VPC to create one. After the VPC is created, click the Refresh icon and select the created VPC. |
Hardware Configuration | vSwitch | vsw_i/vsw-bp1e2f5fhaplp0g6p**** | The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch. |
Hardware Configuration | Default Security Group | sg_seurity/sg-bp1ddw7sm2risw**** | The security group to which you want to add the cluster. If you have created security groups in EMR, you can select one based on your business requirements. You can also create a security group. Important: Do not use an advanced security group that is created in the ECS console. |
Hardware Configuration | Node Group | Default configurations | The instances in the cluster. Configure the master node, core nodes, and task nodes based on your business requirements. |
Basic Configuration | Cluster Name | Emr-DataLake | The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_). |
Basic Configuration | Identity Credentials | Password | The identity credentials that you want to use to remotely access the master node of the cluster. |
Basic Configuration | Password and Confirm Password | Custom password | The password that you want to use to access the cluster. Record this password for subsequent operations. |
- Click Next: Confirm. In the Confirm step, read the terms of service, select the check box, and then click Confirm. The cluster is created successfully when it enters the Running state.
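The Cluster Name rule in the preceding table can be checked before you fill in the form. The following is a minimal sketch in Python, not part of EMR or DataWorks: the function name `is_valid_cluster_name` is hypothetical, and treating "letters" as ASCII letters only is an assumption, because the rule does not say whether other alphabets are accepted.

```python
import re

# Rule from the Cluster Name row: 1 to 64 characters; only letters, digits,
# hyphens (-), and underscores (_).
# Assumption: "letters" means ASCII letters A-Z and a-z.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the whole string satisfies the documented naming rule."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # The example value from the table passes; a name with a space does not.
    print(is_valid_cluster_name("Emr-DataLake"))
    print(is_valid_cluster_name("Emr DataLake"))
```

`fullmatch` is used so that the length limit applies to the entire name and not to a matching prefix.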
Step 2: Create a DataWorks workspace
Create a workspace in the DataWorks console. For more information, see Create a workspace.
- Log on to the DataWorks console.
- On the Overview page, click Create Workspace.
- In the Create Workspace panel, configure the parameters.
Parameter | Example | Description |
---|---|---|
Workspace Name | emr_dataworks | The name of the workspace. The name must be 3 to 23 characters in length, must start with a letter, and can contain only letters, underscores (_), and digits. |
Isolate Development and Production Environments | No | Specifies the mode of the workspace. To isolate the production and development environments, select Yes; the workspace is then in standard mode. To keep the environments together, select No; the workspace is then in basic mode. |
- Click Commit.
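The workspace naming rule and the effect of the Isolate Development and Production Environments setting can be sketched as follows. This is an illustrative Python snippet under stated assumptions, not a DataWorks API: the names `is_valid_workspace_name` and `workspace_mode` are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# Rule from the Workspace Name parameter: 3 to 23 characters, starts with a
# letter, and contains only letters, underscores (_), and digits.
# Assumption: "letters" means ASCII letters A-Z and a-z.
WORKSPACE_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,22}")


def is_valid_workspace_name(name: str) -> bool:
    """Return True if the whole string satisfies the documented naming rule."""
    return WORKSPACE_NAME_RE.fullmatch(name) is not None


def workspace_mode(isolate_environments: bool) -> str:
    """Map the isolation choice (Yes/No) to the resulting workspace mode."""
    return "standard" if isolate_environments else "basic"


if __name__ == "__main__":
    # The example workspace in this topic: emr_dataworks with isolation set to No.
    print(is_valid_workspace_name("emr_dataworks"))
    print(workspace_mode(False))
```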
Step 3: Associate the EMR cluster with the workspace
Associate the EMR cluster with the workspace in the DataWorks console. For more information, see Configure DataWorks.
- On the Workspaces page, find the created workspace and click DataStudio in the Actions column.
- In the upper-right corner of the DataStudio page, click the icon.
- In the Compute Engine Information section of the Workspace Management page, click E-MapReduce.
- Click Add Instance.
- In the Add EMR Cluster dialog box, configure the parameters and click Confirm.
Parameter | Example | Description |
---|---|---|
Instance Display Name | The display name of the EMR compute engine instance | The display name of the EMR compute engine instance. |
Access Mode | Shortcut mode | The mode in which you want to associate the EMR cluster with the workspace. For testing, we recommend that you select Shortcut mode so that you can quickly associate the EMR cluster with the workspace. You can change the access mode based on your business requirements after the EMR cluster is associated with the workspace. |
Cluster ID | Emr-DataLake | The ID of the EMR cluster that you want to associate with the workspace. Only DataLake clusters and custom clusters in the current region are displayed. |
YARN resource queue | default | The YARN queue that is used by default when you commit nodes in DataWorks by using the EMR cluster. Default value: default. |
Override DataStudio YARN resource queue | Not selected | The queue rule that is used to run nodes. |
Initialize Resource Group | Configured based on your business requirements | 1. Select an exclusive resource group for scheduling that connects to the current workspace. If no exclusive resource groups for scheduling are available, create one and configure network connectivity for it. For more information, see Create and use an exclusive resource group for scheduling. 2. Click Initialize to initialize the resource group and test the network connectivity between the exclusive resource group for scheduling and the EMR compute engine. You can also select multiple resource groups and initialize them at a time. Note: If the configurations of the EMR cluster that you want to use as an EMR compute engine instance are modified, or the versions of the components in the EMR cluster change, you must initialize the resource group again in this dialog box. |
Step 4: Develop and govern data
After you associate the EMR cluster with the workspace, you can create and develop an EMR node. This way, you can manage EMR metadata, perform O&M and monitoring operations on the node, and monitor the data quality of the node in DataWorks. This ensures that EMR data can be generated as expected.
Item | Description | References |
---|---|---|
Data development | You can create and develop an EMR node in DataStudio based on your business requirements. Note You can create the following types of EMR nodes in DataWorks: EMR Hive, EMR MR, EMR Spark SQL, EMR Spark, EMR Shell, EMR Presto, EMR Impala, and EMR Spark Streaming. | |
Metadata management | You can create a collector to collect EMR metadata and manage the EMR metadata in DataMap. On the DataMap page, you can view metadata of EMR tables, generated information, and lineages. | |
Data quality monitoring | Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for tables to monitor the quality of data in the tables. Note To configure monitoring rules for tables that are generated by an EMR node in an EMR DataLake cluster or a custom cluster, you must select the dqc_emr_plugin_datalake plug-in. | Overview |
Node O&M and monitoring | The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes. | |
For more information about DataWorks on EMR, see Development process of an EMR node in DataWorks.