DataWorks allows you to create E-MapReduce (EMR) nodes such as Hive nodes, MR nodes, Presto nodes, and Spark SQL nodes based on an EMR compute engine instance. You can use the different types of EMR nodes for different features. For example, you can configure an EMR workflow, schedule nodes in a workflow, or manage metadata in a workflow. These features help you generate data in an efficient manner by using EMR. This topic describes the precautions and development process of an EMR node in DataWorks based on an EMR DataLake cluster. We recommend that you read this topic before you develop an EMR node to run EMR jobs.
Background information
You can create an EMR DataLake cluster only in the new EMR console. A DataLake cluster is a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner. You can use a DataLake cluster to build a scalable data pipeline. For more information about an EMR DataLake cluster, see Configure a DataLake cluster.
Limits
The following table describes the limits of EMR DataLake clusters, DataWorks features, and EMR nodes that are used to run EMR jobs or perform other operations in DataWorks.
Item | Description |
---|---|
DataLake cluster and custom cluster | The EMR DataLake cluster must run V3.41.0 or a later minor version, or V5.7.0 or a later minor version. If the cluster runs a minor version earlier than V3.41.0 or V5.7.0, specific DataWorks features cannot be used. |
DataWorks features |
|
EMR nodes in DataWorks | EMR nodes can be run only on an exclusive resource group for scheduling. |
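The version requirement in the table above can be checked with a few lines of code before you associate a cluster. The following is a minimal sketch, not an official tool: it assumes version strings look like V3.41.0 or EMR-5.7.0, the function names are hypothetical, and it treats versions outside the 3.x and 5.x lines as not covered, which is an interpretation of the table rather than documented behavior.

```python
import re

# Minimum versions per major release line, taken from the limits table.
MIN_VERSIONS = {3: (3, 41, 0), 5: (5, 7, 0)}


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'V3.41.0' or 'EMR-5.7.0'.

    The accepted string formats are an assumption made for this sketch.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_datalake_requirement(text):
    """Return True if the version satisfies the minimum for its major line."""
    version = parse_version(text)
    minimum = MIN_VERSIONS.get(version[0])
    # Major lines other than 3 and 5 are treated as not covered (assumption).
    return minimum is not None and version >= minimum
```

Comparing versions as integer tuples rather than strings matters here: a string comparison would rank 5.10.1 below 5.7.0, whereas the tuple comparison ranks it above.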
Procedure
- Make preparations
Before you develop an EMR node in DataWorks, you must complete the required preparations on the EMR and DataWorks sides.
EMR: To prevent errors when an EMR node runs in DataWorks, make sure that the key configurations of the related EMR DataLake cluster meet the requirements. For example, in the EMR DataLake cluster, you must configure Lightweight Directory Access Protocol (LDAP) settings, a Ranger whitelist, and a security policy to authenticate the identity of the account that you use to run the EMR node in DataWorks.
- Create a DataLake cluster.
- For more information about how to configure a DataLake cluster, see Configure an EMR data lake cluster.
- Before you purchase an EMR cluster and configure environment settings for the EMR cluster, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.
DataWorks:
- You must associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace in the DataWorks console. The EMR cluster is used to run EMR nodes in DataWorks. The workspace to which the EMR cluster is associated must be in basic mode.
Note: You can associate an EMR cluster with a DataWorks workspace in shortcut or security mode. The shortcut mode supports rapid data processing, and the security mode ensures higher security based on data permission management. You can select a mode based on the condition of the EMR cluster and your business requirements.
- To ensure that an EMR node can run as expected, you must purchase and configure an exclusive resource group for scheduling, add members to a workspace, assign roles to the members, and prepare the resources and permissions that are required to run the EMR node.
- Develop an EMR node
- Create a workflow.
In DataStudio, data development is performed by using components such as a node in a workflow. Before you can create a node, you must create a workflow. For more information about how to create a workflow, see Create a workflow.
- Create an EMR node.
DataWorks uses nodes to develop data. DataWorks encapsulates the different types of jobs in an EMR cluster into different types of EMR nodes in DataWorks. You can select a type of EMR node and develop the node based on your business requirements in DataWorks. For more information about how to create an EMR node in DataWorks, see Create an EMR node.
Note: You can create the following types of EMR nodes in DataWorks: EMR Hive, EMR MR, EMR Spark SQL, EMR Spark, EMR Shell, EMR Presto, EMR Impala, and EMR Spark Streaming.
- Develop and debug the code for an EMR node.
To ensure that the EMR node that you develop can be run in an efficient manner and fully utilize computing resources, we recommend that you debug the code for the node before you commit and deploy the node. For more information, see Debug and view a node.
- Commit and deploy the EMR node.
After the debugging is complete, you can commit and deploy the EMR node to the production environment for scheduling. For more information, see Deploy nodes.
Note: The EMR node can be scheduled to run after the node is deployed to the production environment.
Additional information
Operation | Description | References |
---|---|---|
Manage EMR metadata | After you associate an EMR cluster with a DataWorks workspace, Data Map automatically collects the metadata of the EMR cluster. On the DataMap page, you can view information, including the details of EMR tables and the source data for the EMR tables. | Overview |
Monitor data quality of EMR nodes in DataWorks | Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for tables to monitor the quality of the data in those tables. Note: To configure monitoring rules for table data generated by an EMR node that is run based on an EMR DataLake cluster, you must select the dqc_emr_plugin_datalake plug-in. | Overview |
Perform O&M and monitoring operations on EMR nodes | The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes. | Overview |
Permissions
- EMR cluster authentication
If you use a non-system account to run EMR nodes in DataWorks, you must enable the authentication service in the EMR cluster and add the non-system account to the authentication service.
- Permissions to associate an EMR cluster with a DataWorks workspace
Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace in the DataWorks console. The EMR cluster is used to run EMR nodes in DataWorks.
- Data permission management
- DataWorks
In DataWorks, you are granted permissions on the EMR compute engine based on the mapping between DataWorks workspace members and EMR cluster accounts. A member in a DataWorks workspace is authenticated to perform operations in an EMR cluster. Different data operation permissions are granted to different node owners, Alibaba Cloud accounts, and RAM users to isolate permissions when they run EMR nodes in DataWorks.
- EMR
The permissions that a workspace member has when the workspace member runs EMR nodes are determined by the mapped EMR cluster account. You can use the related components in the EMR cluster to manage the permissions of the EMR cluster account. For example, you can use Ranger to manage the permissions of an EMR cluster account that maps to an Alibaba Cloud account.
If DLF is specified as the metadata storage service for an EMR cluster, and the data permission management feature of DLF is enabled by using the DLF-Auth component, you can request permissions in Security Center of DataWorks. For more information, see Manage permissions on DLF.
- Permission management for DataWorks service modules
If you want to run EMR nodes in DataWorks, you must be granted the permissions on DataWorks service modules such as DataStudio, Data Map, Data Quality, and intelligent monitoring. After you obtain the required permissions on the service modules, you can develop EMR nodes, perform O&M operations on the nodes, and monitor data quality of the nodes.
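The data permission model described under Data permission management, in which a workspace member's data access is determined by the mapped EMR cluster account, can be illustrated with a toy model. The following is a sketch for intuition only: the member names, account names, table names, and the can_read helper are all hypothetical, and in a real cluster the permissions are managed by components such as Ranger or DLF-Auth, not by application code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmrAccount:
    """Hypothetical stand-in for an EMR cluster account and the tables it may read."""
    name: str
    readable_tables: frozenset


# Hypothetical mapping between DataWorks workspace members and EMR cluster accounts.
MEMBER_TO_ACCOUNT = {
    "ram_user_alice": EmrAccount("emr_alice", frozenset({"sales.orders"})),
    "ram_user_bob": EmrAccount("emr_bob", frozenset({"sales.orders", "hr.salaries"})),
}


def can_read(member, table):
    """Decide access solely from the EMR account that the member maps to."""
    account = MEMBER_TO_ACCOUNT.get(member)
    if account is None:
        # Unmapped members have no EMR identity in this toy model.
        return False
    return table in account.readable_tables
```

In this model, two members who run the same node code can get different results because each runs under a different mapped account, which is the isolation effect the mapping is meant to provide.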