Getting started with DataWorks on EMR

In DataWorks, you can create nodes such as Hive, Spark SQL, Presto, and MapReduce nodes based on an E-MapReduce (EMR) compute engine. You can also configure a workflow, schedule nodes in the workflow on a regular basis, manage metadata, and configure monitoring rules to monitor data quality. This way, you can develop and govern data lakes in a centralized manner. This topic describes how to use an EMR cluster in DataWorks.

Procedure

Step 1: Create a cluster
Create a DataLake cluster in the EMR console. For more information, see Create a cluster.
Step 2: Create a workspace
Create a workspace in the DataWorks console. For more information, see Create a workspace.
Step 3: Associate the EMR cluster with the workspace
Associate the EMR cluster with the workspace in the DataWorks console. For more information, see Register an EMR cluster to DataWorks.
Step 4: Develop and govern data
After you associate the EMR cluster with the workspace, you can create and develop an EMR node. This way, you can manage EMR metadata, perform O&M and monitoring operations on the node, and monitor the data quality of the node in DataWorks. This ensures that EMR data can be generated as expected. For more information, see Usage notes for development of EMR nodes in DataWorks.

Step 1: Create a cluster

Go to the cluster creation page.
1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
  - You cannot change the region of a cluster after the cluster is created.
  - By default, all resource groups in your account are displayed.
3. On the EMR on ECS page, click Create Cluster.

On the page that appears, configure the parameters. The following table describes the parameters.

Step	Parameter	Example	Description
Software Configuration	Region	China (Hangzhou)	The geographic location where the ECS instances of the cluster reside. Important: You cannot change the region after the cluster is created. Select a region based on your business requirements.
	Business Scenario	New Data Lake	The business scenario of the cluster. Default value: New Data Lake.
	Product Version	EMR-5.14.0	The version of EMR. Select the latest version.
	High Service Availability	Off	Specifies whether to enable high availability for the EMR cluster. If you turn on the High Service Availability switch, EMR distributes master nodes across different underlying hardware devices to reduce the risk of failures. By default, the High Service Availability switch is turned off.
	Optional Services (Select One At Least)	Hadoop-Common, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP	The services to install in the cluster. Select services based on your business requirements. The processes related to the selected services are started automatically. Note: In addition to the default services of the cluster, you must also select the Knox and OpenLDAP services.
	Collect Service Operational Logs	On	Specifies whether to collect the operational logs of all services in the cluster. By default, this switch is turned on. The collected logs are used only for cluster diagnostics. Important: If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impact of doing so, see How do I stop collection of service operational logs?
	Metadata	DLF Unified Metadata	If you select DLF Unified Metadata, metadata is stored in Data Lake Formation (DLF), and the system selects the default DLF catalog for you. To associate different clusters with different DLF catalogs, click Create Catalog to create DLF catalogs based on your business requirements. Note: To configure this parameter, make sure that Alibaba Cloud DLF is activated.
	Root Storage Directory of Cluster	1366993922******	The root storage directory of cluster data. Select a bucket for which the OSS-HDFS service is enabled. Note: Before you use the OSS-HDFS service, make sure that it is available in the region in which you want to create the cluster. If it is unavailable in that region, change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions. You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a minor version later than EMR V5.12.1 or EMR V3.46.1. If you select HDFS instead of OSS-HDFS, you do not need to configure this parameter.
Hardware Configuration	Billing Method	Pay-as-you-go	The billing method of the cluster. If you want to perform a test, we recommend that you use the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment.
	Zone	Zone I	The zone where the cluster resides. You cannot change the zone after the cluster is created. Select a zone based on your business requirements.
	VPC	vpc_Hangzhou/vpc-bp1f4epmkvncimpgs****	The virtual private cloud (VPC) in which the cluster is deployed. Select a VPC in the current region. If no VPC is available, click create a VPC to create a VPC. After the VPC is created, click the Refresh icon and select the created VPC.
	vSwitch	vsw_i/vsw-bp1e2f5fhaplp0g6p****	The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch.
	Default Security Group	sg_seurity/sg-bp1ddw7sm2risw****	The security group to which you want to add the cluster. If you have created security groups in EMR, you can select one based on your business requirements. You can also create a security group. Important: Do not use an advanced security group that is created in the ECS console.
	Node Group	Turn on the Assign Public Network IP switch for the master node and use default settings of other parameters	The instances in the cluster. Configure the master node, core nodes, and task nodes based on your business requirements. For more information, see Select configurations.
Basic Configuration	Cluster Name	Emr-DataLake	The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
	Identity Credentials	Password	The identity credentials that you want to use to remotely access the master node of the cluster.
	Password and Confirm Password	Custom password	The password that you want to use to access the cluster. Record this password for subsequent operations.

Click Next: Confirm. In the Confirm step, read the terms of service, select the check box, and then click Confirm.
The cluster is successfully created if the cluster is in the Running state. For more information about cluster parameters, see Create a cluster.
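
Two of the constraints in the preceding table are easy to check before you open the console: the Cluster Name rule and the requirement to select Knox and OpenLDAP. The following Python snippet is a minimal sketch of such a local check. It does not call any EMR API, the function names are hypothetical, and it assumes that "letters" in the naming rule means ASCII letters.

```python
import re

# Cluster Name rule from the table: 1 to 64 characters; letters, digits,
# hyphens (-), and underscores (_) only.
# Assumption: "letters" means ASCII letters.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

# Services that must be selected in addition to the cluster defaults,
# according to the Optional Services row.
REQUIRED_EXTRA_SERVICES = {"Knox", "OpenLDAP"}


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the whole name satisfies the Cluster Name rule."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def missing_services(selected) -> list:
    """Return the required extra services that are not in the selection."""
    return sorted(REQUIRED_EXTRA_SERVICES - set(selected))


if __name__ == "__main__":
    # Example values taken from the table above.
    print(is_valid_cluster_name("Emr-DataLake"))
    print(missing_services(["Hadoop-Common", "YARN", "Hive", "Knox"]))
```

A check like this only catches input mistakes early. The console remains the authority on which names and service combinations are accepted.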

Step 2: Create a workspace

Log on to the DataWorks console.
On the Overview page, click Create Workspace.

In the Create Workspace panel, configure the parameters. The following table describes the parameters.

Parameter	Example	Description
Workspace Name	emr_dataworks	The name of the workspace. The name must be 3 to 23 characters in length and can contain only letters, underscores (_), and digits. The name must start with a letter.
Isolate Development and Production Environments		Specifies the mode of the workspace. If you want to isolate production and development environments, select Yes. In this case, the workspace that you create is in standard mode. If you do not want to isolate production and development environments, select No. In this case, the workspace that you create is in basic mode.

Click Commit.
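
The Workspace Name rule and the Yes/No mapping for environment isolation can also be written down as a short local check. The following Python snippet is a minimal sketch under the same assumptions as before: it does not call any DataWorks API, the function names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# Workspace Name rule: 3 to 23 characters; letters, underscores (_), and
# digits only; must start with a letter.
# Assumption: "letters" means ASCII letters.
WORKSPACE_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,22}")


def is_valid_workspace_name(name: str) -> bool:
    """Return True if the whole name satisfies the Workspace Name rule."""
    return WORKSPACE_NAME_RE.fullmatch(name) is not None


def workspace_mode(isolate_environments: bool) -> str:
    """Map the isolation choice to the workspace mode:
    Yes -> standard mode, No -> basic mode."""
    return "standard" if isolate_environments else "basic"


if __name__ == "__main__":
    # Example value taken from the parameter description above.
    print(is_valid_workspace_name("emr_dataworks"))
    print(workspace_mode(True))
```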

Step 3: Associate the EMR cluster with the workspace

For information about the development of EMR nodes in DataWorks, see Usage notes for development of EMR nodes in DataWorks.

After you create a workspace, click Associate Now to the right of E-MapReduce in the Recommended Big Data Compute Engines section in the Create Workspace panel.
On the Associate EMR Compute Engine page, click Associate and Continue.
On the Open Source Clusters page, click Registering a cluster.

In the Select Cluster Type dialog box, click E-MapReduce. On the Register E-MapReduce cluster page, configure parameters and click Complete Registration. The following table describes the parameters.

Parameter	Example	Description
Cluster Display Name	dataworks_test	The display name of the EMR cluster in DataWorks. The value must be unique.
Cloud Account To Which The Cluster Belongs	Current Alibaba Cloud primary account	Specifies the type of the Alibaba Cloud account to which the EMR cluster that you want to register in the current workspace belongs.
Cluster Type	Data Lake	The type of the EMR cluster that you want to register.
Cluster	Emr-DataLake	The EMR cluster that you want to associate with the current workspace.
Default Access Identity	Cluster account: hadoop	The identity that you want to use to access the EMR cluster in the current workspace.

In the E-MapReduce section, click Initialize Resource Group.
Initialize the exclusive resource group for scheduling that you want to use. Initialization ensures that a network connection is established between the resource group and the EMR cluster.
Note: DataWorks allows you to use only exclusive resource groups for scheduling to run EMR tasks. Therefore, you can select only an exclusive resource group for scheduling when you initialize a resource group.

Step 4: Develop and govern data

Item	Description	References
Data development	You can create and develop an EMR node in DataStudio based on your business requirements.	DataStudio overview, Develop EMR tasks
Metadata management	You can create a collector to collect EMR metadata and manage the EMR metadata in Data Map. On the DataMap page, you can view the metadata, output information, and lineages of EMR tables.	Data Map
Data quality monitoring	Data Quality monitors the quality of data in tables that are generated by scheduling nodes, based on the monitoring rules that you configure for those tables. Note: To configure monitoring rules for tables that are generated by an EMR node in an EMR DataLake cluster or a custom cluster, you must select the dqc_emr_plugin_datalake plug-in.	Data Quality overview
Node O&M and monitoring	The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes.	Overview, Create a custom alert rule