Use EMR in DataWorks

DataWorks allows you to create nodes such as Hive, MR, Presto, and Spark SQL nodes based on an E-MapReduce (EMR) compute engine. In the DataWorks console, you can configure EMR nodes, enable periodic scheduling of tasks on the nodes, and manage the metadata of the nodes to ensure that data is generated and managed in an efficient and stable manner. This topic describes the usage notes for the development of EMR tasks in DataWorks. The usage notes cover the basic development process, fee description, environment preparation, and permission management.

Background information

EMR is a big data processing solution provided by Alibaba Cloud.

EMR is developed based on open source Apache Hadoop and Apache Spark. EMR allows you to use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data with ease. Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks to meet the business requirements of different users. For more information, see the topics in the Product Overview directory.

Supported EMR cluster types

Before you can run tasks on an EMR cluster from the DataWorks console, you must create the cluster and register it to DataWorks. You can register the following types of EMR clusters to DataWorks:

DataLake cluster (new data lake): created on the EMR on ECS page
Custom cluster: created on the EMR on ECS page
Hadoop cluster (old data lake): created on the EMR on ECS page
Important
- You can use EMR Hadoop clusters of the following versions in DataWorks:
  EMR V3.38.2, EMR V3.38.3, EMR V4.9.0, EMR V5.6.0, EMR V3.26.3, EMR V3.27.2, EMR V3.29.0, EMR V3.32.0, EMR V3.35.0, EMR V4.3.0, EMR V4.4.1, EMR V4.5.0, EMR V4.5.1, EMR V4.6.0, EMR V4.8.0, EMR V5.2.1, and EMR V5.4.3.
- We recommend that you do not use Hadoop clusters. We recommend that you migrate data from Hadoop clusters to DataLake clusters at the earliest opportunity. For more information, see Migrate data from a Hadoop cluster to a DataLake cluster.
Spark cluster: created on the EMR on ACK page
EMR Serverless StarRocks instance

Note

If your cluster cannot be registered to DataWorks, submit a ticket to contact technical support.
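As a quick check before registration, the Hadoop cluster versions listed in this section can be encoded in a small lookup. The following is a minimal sketch. The helper name is hypothetical and is not part of any DataWorks or EMR API, and only the version numbers are taken from this topic.

```python
import re

# Versions copied from the supported Hadoop cluster list in this topic.
SUPPORTED_HADOOP_VERSIONS = frozenset({
    "3.26.3", "3.27.2", "3.29.0", "3.32.0", "3.35.0", "3.38.2", "3.38.3",
    "4.3.0", "4.4.1", "4.5.0", "4.5.1", "4.6.0", "4.8.0", "4.9.0",
    "5.2.1", "5.4.3", "5.6.0",
})

_VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)")


def is_supported_hadoop_version(version: str) -> bool:
    """Return True if the x.y.z part of `version` is in the documented list.

    Hypothetical helper: accepts strings such as "EMR V3.38.2" or "3.38.2".
    """
    match = _VERSION_RE.search(version)
    return bool(match) and match.group(1) in SUPPORTED_HADOOP_VERSIONS


if __name__ == "__main__":
    for candidate in ("EMR V3.38.2", "EMR V5.9.1"):
        print(candidate, is_supported_hadoop_version(candidate))
```

The check applies only to Hadoop clusters (old data lake). This topic does not list a comparable allowlist for DataLake or custom clusters.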

Limits

Task type: You cannot run EMR Flink tasks in the DataWorks console.

Task running: You can use a serverless resource group (recommended) or an old-version exclusive resource group for scheduling to run an EMR task.
Task governance:
- Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster runs V3.43.1, V5.9.1, or a later minor version of either release, you can view the table-level and field-level lineages of these nodes when they are created based on the cluster.
  Note
  For Spark-based EMR nodes, if the EMR cluster runs V5.8.0, V3.42.0, or a later minor version of either release, you can view table-level and field-level lineages. If the EMR cluster runs a minor version earlier than V5.8.0 or V3.42.0, you can view only table-level lineages, and only for Spark-based EMR nodes that use Spark 2.x.
- If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. EMR-HOOK can be configured for EMR Hive and EMR Spark SQL services. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.
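The version rules for Spark-based EMR nodes in the preceding note can be sketched as a small decision function. This is an illustration under stated assumptions, not DataWorks behavior that has been verified: "a later minor version" is read as "same major version, equal or higher x.y.z", major versions that the note does not mention return None, and an empty result for nodes that do not use Spark 2.x on earlier versions is an inference from the wording of the note. All names are hypothetical.

```python
import re
from typing import FrozenSet, Optional, Tuple

# Thresholds quoted from the note, keyed by EMR major version.
SPARK_LINEAGE_THRESHOLDS = {3: (3, 42, 0), 5: (5, 8, 0)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from strings such as 'EMR V5.8.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def spark_node_lineage(emr_version: str, spark_major: int) -> Optional[FrozenSet[str]]:
    """Return the lineage levels that a Spark-based EMR node can display.

    Returns None when the note does not cover the EMR major version.
    """
    version = parse_version(emr_version)
    threshold = SPARK_LINEAGE_THRESHOLDS.get(version[0])
    if threshold is None:
        return None  # Assumption: the note is silent, so no claim is made.
    if version >= threshold:
        return frozenset({"table", "field"})
    if spark_major == 2:
        return frozenset({"table"})
    return frozenset()  # Inference: only Spark 2.x is named for earlier versions.


if __name__ == "__main__":
    print(sorted(spark_node_lineage("EMR V5.8.0", 3)))
    print(sorted(spark_node_lineage("EMR V3.41.0", 2)))
```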
Supported regions: EMR Serverless Spark is available only in the China (Zhangjiakou) region.
For an EMR cluster for which Kerberos authentication is enabled, you must add inbound rules for UDP ports to the security group of the EMR cluster. The rules must allow access from the CIDR block of the vSwitch with which a resource group is associated.
Note
To add an inbound rule, perform the following operations:
1. Log on to the EMR console and go to the Basic Information tab of your EMR cluster.
2. In the Security section of the Basic Information tab, click the icon to the right of the Cluster Security Group parameter.
3. On the Security Group Details tab of the Security Groups page, click the Inbound tab in the Access Rule section.
4. On the Inbound tab, click Add Rule.
5. Set the Protocol Type parameter to Custom UDP, the Port Range parameter to the ports specified in the /etc/krb5.conf file of your EMR cluster, and the Authorization Object parameter to the CIDR block of the vSwitch with which a resource group is associated.
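To find the values for the Port Range and Authorization Object parameters, you can read the KDC ports from the krb5.conf content and validate the vSwitch CIDR block. The following is a minimal sketch. It assumes that the relevant ports are those of the kdc entries in the file and that an entry without an explicit port uses the standard Kerberos port 88. The function names and the dictionary keys are hypothetical and only mirror the console parameter names. They are not an API payload.

```python
import ipaddress
import re
from typing import Dict, List

DEFAULT_KDC_PORT = 88  # Standard Kerberos KDC port, used when no port is given.


def kdc_udp_ports(krb5_conf_text: str) -> List[int]:
    """Collect the ports of all 'kdc = host[:port]' entries in krb5.conf text."""
    ports = set()
    for line in krb5_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # Skip blank lines and comment lines.
        match = re.match(r"kdc\s*=\s*(\S+)$", stripped)
        if match is None:
            continue
        _host, sep, port = match.group(1).rpartition(":")
        ports.add(int(port) if sep and port.isdigit() else DEFAULT_KDC_PORT)
    return sorted(ports)


def build_inbound_rules(krb5_conf_text: str, vswitch_cidr: str) -> List[Dict[str, object]]:
    """Describe one Custom UDP inbound rule per KDC port (illustrative only)."""
    network = ipaddress.ip_network(vswitch_cidr)  # Raises ValueError if invalid.
    return [
        {"protocol_type": "Custom UDP", "port": port, "authorization_object": str(network)}
        for port in kdc_udp_ports(krb5_conf_text)
    ]


if __name__ == "__main__":
    sample = """
[realms]
 EXAMPLE.COM = {
  kdc = kdc-host.example.com:88
  admin_server = kdc-host.example.com:749
 }
"""
    for rule in build_inbound_rules(sample, "192.168.0.0/24"):
        print(rule)
```

The sample realm, host name, and CIDR block are placeholders. Read the actual values from the /etc/krb5.conf file of your cluster and from the vSwitch of your resource group.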

Prerequisites

DataWorks is activated and a workspace is created. For more information, see Activate DataWorks and Manage workspaces.
An EMR cluster is created. For more information, see Create a cluster.
Note
You can use different EMR services to run EMR tasks in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can refer to the Appendix: Suggestions for EMR cluster configuration section in this topic to select EMR services based on your business requirements.
A DataWorks serverless resource group is purchased.
By default, DataWorks resource groups are not connected to the networks of other cloud services after the resource groups are purchased. An EMR cluster must be connected to a specific resource group before you can use the EMR cluster.
Note
- DataWorks provides general-purpose serverless resource groups, and we recommend that you purchase this type of resource group. Serverless resource groups are suitable for scenarios that involve different task types, such as data synchronization and task scheduling. For information about how to purchase a serverless resource group, see Create and use a serverless resource group. New users can purchase only serverless resource groups.
- If you have purchased an old-version exclusive resource group, you can also use the resource group to run EMR tasks. An old-version exclusive resource group that you can select varies based on the type of the task that you want to run. For example, to run a data synchronization task, you must use an exclusive resource group for Data Integration. To run a data scheduling task, you must use an exclusive resource group for scheduling. For more information, see the topics in the Use old-version resource groups directory.
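The selection rule in the preceding notes can be summarized as a lookup. The following is a minimal sketch with hypothetical names. It covers only the two task types that the note gives as examples.

```python
from enum import Enum


class TaskType(Enum):
    DATA_SYNCHRONIZATION = "data synchronization"
    SCHEDULING = "scheduling"


# Old-version exclusive resource groups are specific to a task type.
OLD_VERSION_GROUPS = {
    TaskType.DATA_SYNCHRONIZATION: "exclusive resource group for Data Integration",
    TaskType.SCHEDULING: "exclusive resource group for scheduling",
}


def pick_resource_group(task_type: TaskType, has_serverless: bool) -> str:
    """Prefer a serverless resource group, which serves both task types."""
    if has_serverless:
        return "serverless resource group"
    return OLD_VERSION_GROUPS[task_type]


if __name__ == "__main__":
    print(pick_resource_group(TaskType.DATA_SYNCHRONIZATION, has_serverless=False))
```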

Usage notes

The following table describes the usage notes for the development of EMR tasks in DataWorks.

Item	Description
Billing	When you develop EMR tasks in DataWorks, you are charged for not only DataWorks resources but also the resources of other Alibaba Cloud services.
Environment preparation	Before you develop EMR tasks in DataWorks, you must purchase DataWorks of a specific edition and a resource group based on your business requirements, register an EMR cluster, and prepare the development environment.
Permission management	DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements to implement fine-grained permission management.
Data integration	DataWorks Data Integration allows you to read data from and write data to EMR Hive. DataWorks provides a variety of data synchronization scenarios, such as batch synchronization and full and incremental synchronization.
Data modeling and development	DataWorks provides the Data Modeling service that is used to structure and manage large volumes of unordered and complex data. DataWorks also provides the DataStudio service for development of tasks that are scheduled to run. After the tasks are developed, you can go to Operation Center to monitor and perform O&M operations on the tasks.
Data governance	DataWorks allows you to manage EMR metadata and govern EMR data.
Data analysis and services	DataWorks DataAnalysis provides the EMR data analysis and service sharing capabilities.
Open Platform	DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.

Billing

1. Fees for DataWorks resources

This section describes the fees that are included in your DataWorks bill. For information about the billable items of DataWorks, see Billing overview.

Fee	Description
Fees for the DataWorks edition that you use	You must activate DataWorks before you can develop tasks in DataWorks. If you activate DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition, you are charged the fees for the edition when you purchase the edition.
Fees for the scheduling resources that you use to schedule tasks	After tasks are developed, scheduling resources are required to schedule the tasks. You can purchase a serverless resource group or an old-version exclusive resource group for scheduling, and pay for the resource group. We recommend that you purchase a serverless resource group. Note: A purchased serverless resource group can be used for both task scheduling and data synchronization.
Fees for the resources that you use to synchronize data	A data synchronization task consumes scheduling resources and synchronization resources. You can purchase a serverless resource group or an old-version exclusive resource group for Data Integration, and pay for the resource group. We recommend that you purchase a serverless resource group.

2. Fees for the resources of other Alibaba Cloud services

This section describes the fees that are not included in your DataWorks bill.

Important

You are charged for the resources of other Alibaba Cloud services based on the billing logic of the Alibaba Cloud services. For more information, see the billing documentation of the Alibaba Cloud services. For information about the billing details of an EMR compute engine, see the topics in the Billing directory.

Fee	Description
Database fees	When you run data synchronization tasks to read data from and write data to databases, database fees may be generated.
Computing and storage fees	When you run tasks of a specific type of compute engine, computing and storage fees of this type of compute engine may be generated.
Network service fees	When you establish network connections between DataWorks and other related services, network service fees may be generated. For example, if you use services, such as Express Connect, Elastic IP Address (EIP), and Internet Shared Bandwidth, to establish network connections between DataWorks and other related services, you may be charged network service fees.

Environment preparation

1. Resource preparation

Item	Description	References
Select a DataWorks edition	DataWorks Basic Edition allows you to perform the following basic operations during the development of EMR data: migrate data to the cloud, develop data, schedule EMR tasks, and govern data. If you want to use more advanced data governance and data security solutions, you can purchase DataWorks of an advanced edition, such as DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition.	Differences among DataWorks editions
Select a resource group	You can use only serverless resource groups or old-version exclusive resource groups to run EMR tasks. We recommend that you use serverless resource groups.	Create and use a serverless resource group; Use old-version resource groups

2. Development environment preparation

You must register an EMR cluster with a DataWorks workspace before you can develop EMR tasks in DataStudio. You must add users to the workspace as members. This facilitates collaborative data development.

Item	Description	References
Prepare a data synchronization environment	Before you run a data synchronization task based on an EMR service, you must add the EMR service to DataWorks as a data source.	Supported data source types and synchronization operations
Prepare an environment for data development and analysis	Before you enable DataWorks to periodically schedule EMR tasks, you must add an EMR cluster to DataWorks as a data source. Then, you can use the data source to perform operations, such as data development, data analysis, and periodic task scheduling.	Register an EMR cluster to DataWorks
Prepare a collaborative development environment	To ensure that RAM users can collaborate with each other to develop data in a workspace, you must perform the following operations: (1) Add the RAM users to the current workspace as members and assign the Development role to the RAM users in the workspace. (2) Add the workspace members to the desired EMR cluster.	Manage permissions on workspace-level services; Manage OpenLDAP users

Permission management

DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements. Details of permission management:

1. Management of data access permissions

RAM users that are added to a DataWorks workspace as members can develop EMR tasks. You can map these RAM users to EMR cluster accounts so that each RAM user has the permissions of the mapped EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts.

DataWorks allows you to manage permissions on Data Lake Formation (DLF) in a visualized manner. For example, you can request permissions, process permission requests, and audit permissions. This helps you manage permissions on fully managed data lakes in a centralized manner. If DLF is specified as the metadata storage service for an EMR data source that is added to your workspace, you can apply for and manage permissions in DataWorks Security Center. For more information, see Manage permissions on DLF.

2. Management of permissions on services and features

Before you develop data in DataWorks as a RAM user, you must assign a workspace-level role to the RAM user to grant the RAM user specific permissions. For more information, see Best practices for managing permissions of RAM users.

You can refer to Manage permissions on global-level services to manage permissions on DataWorks service modules, such as prohibiting users from accessing Data Map, and to manage permissions to perform operations in the DataWorks console, such as allowing users to delete a workspace.
You can refer to Manage permissions on workspace-level services to manage permissions on DataWorks workspace-level service modules, such as allowing users to access DataStudio to perform development operations, and to manage permissions on DataWorks global-level service modules, such as prohibiting users from accessing Data Security Guard.

Getting started

DataWorks provides multiple services. You can develop tasks that are scheduled to run in DataStudio. After the tasks are developed, you can go to Operation Center in the production environment to monitor and perform O&M operations on the tasks. DataWorks also provides process control for task development and deployment to standardize data development operations and ensure security of data development.

1. Data integration

DataWorks Data Integration allows you to read data from and write data to EMR Hive. You must add the Hive service to DataWorks as a data source before you can synchronize data from another type of data source to a Hive data source or synchronize data from a Hive data source to another type of data source. In addition, DataWorks provides a variety of data synchronization scenarios, such as batch synchronization, full synchronization, and incremental synchronization. You can select a scenario based on your business requirements. For more information, see Data Integration.

2. Data modeling and development

Module	Description	References
Data Modeling	Data Modeling is the first step of end-to-end data governance. It follows the modeling methodology of the Alibaba data mid-end and interprets the business data of an enterprise from a business perspective by using the data warehouse planning, data standard, dimensional modeling, and data metric modules. This allows personnel inside the enterprise to quickly understand and share a common way of measuring and interpreting business data in compliance with data warehousing specifications.	Data Modeling overview
DataStudio	DataWorks encapsulates the capabilities of an EMR compute engine. This way, you can use the EMR compute engine to run EMR data synchronization and development tasks. Data synchronization: DataStudio supports only specific batch and real-time synchronization scenarios. For more information about data synchronization scenarios, see Data Integration overview. Data development: You can develop different types of tasks in DataWorks and have the system schedule them periodically without the need to use complex command lines.	Create an EMR Hive node; Create an EMR MR node; Create an EMR Spark SQL node; Create an EMR Spark node; Create an EMR Shell node; Create an EMR Presto node; Create an EMR Spark Streaming node; Create an EMR Kyuubi node; Create an EMR Trino node; Create an EMR table; Create and use an EMR resource; Create an EMR function
	You can use general nodes and nodes of a specific type of compute engine in DataWorks to process complex logic. DataWorks supports the following types of general nodes: zero load nodes, which are used to manage workflows; HTTP Trigger nodes, which are used in scenarios in which external scheduling systems trigger the scheduling of nodes in DataWorks, OSS object inspection nodes, and FTP Check nodes; assignment nodes, which are used to pass input parameters and output parameters for nodes, and parameter nodes; do-while nodes, which are used to execute node code in loops, for-each nodes, which are used to traverse the outputs of assignment nodes in loops and judge the outputs, and branch nodes; other nodes, such as common Shell nodes and MySQL database nodes.	Create and use a zero load node; Create an HTTP Trigger node; OSS object inspection node; Create an FTP Check node; Configure an assignment node; Create a parameter node; Logic of do-while nodes; Logic of for-each nodes; Configure a branch node
	After tasks on nodes are developed, you can perform the following operations based on your business requirements. Configure scheduling properties for nodes: If you want to enable DataWorks to periodically run your tasks on nodes, you must configure scheduling properties for the nodes, such as scheduling dependencies and scheduling parameters. Debug nodes: To ensure that tasks on nodes in the production environment are run in an efficient manner and prevent a waste of computing resources, we recommend that you debug and run the tasks before you deploy the tasks. Deploy nodes: The tasks on nodes can be scheduled to run only after they are deployed to the production environment. Therefore, after the tasks are developed, you must deploy the tasks to the production environment. After the tasks are deployed, you can view and manage the tasks on the Auto Triggered Nodes page in Operation Center. Manage nodes: You can perform various operations on the tasks on nodes, such as deploying and undeploying the tasks, and modifying scheduling properties for multiple tasks at the same time. Perform process management: DataWorks provides process control for task development and deployment to ensure the accuracy and security of the operations that are performed on tasks. For example, DataWorks provides the code review, forceful smoke testing, and code review logic customization features.	Overview; Debugging procedure; Deploy nodes; Perform batch operations; Process management
Operation Center	Operation Center is an end-to-end big data O&M and monitoring platform. Operation Center allows you to view the status of tasks and perform O&M operations on tasks on which exceptions occur. For example, you can perform intelligent diagnostics and rerun tasks in Operation Center. Operation Center provides the intelligent baseline feature that you can use to resolve issues such as uncontrollable output time of important tasks and difficulties in monitoring of massive tasks. This feature helps you ensure the timeliness of task output.	Perform basic O&M operations on auto triggered nodes
Data Quality	Data Quality ensures data availability for the end-to-end data R&D process and provides reliable data for your business in an efficient manner. Data Quality helps you identify data quality issues at the earliest opportunity and prevents the issues from escalating by running quality checks based on monitoring rules and by associating monitoring rules with task scheduling processes.	Data Quality overview

3. Data governance

After you register an EMR cluster to DataWorks, DataWorks automatically collects metadata from your EMR compute engine. You can refer to Data Map overview to view metadata. In addition, you can refer to Data Governance Center overview to view the issues that are detected by DataWorks and perform related data governance operations.

Module	Description	References
Data Map	Data Map is an enterprise-grade data management platform that provides management, sorting, quick search, and in-depth understanding capabilities for data objects based on the underlying unified metadata services.	Data Map overview
Security Center, Data Security Guard, and Approval Center	Security Center is an end-to-end data security governance platform that covers classification of data assets, sensitive data identification, management of data-related authorization, masking of sensitive data, audit of access to sensitive data, and risk identification and response. Security Center helps you identify data security governance issues.	Security Center overview; Data Security Guard overview; Approval Center overview
Data Governance Center	Data Governance Center automatically identifies items to be governed for multiple governance fields based on rules that come from experience in data-related fields, and provides governance and optimization solutions covering pre-event issue prevention and post-event issue resolution. Data Governance Center can help you actively and systematically complete data governance.	Data Governance Center overview

4. Data analysis and services

DataAnalysis and DataService Studio are designed to provide data processing and analysis capabilities for enterprises and help enterprises use the APIs that are managed in a unified manner to access and share data.

Module	Description	References
DataAnalysis	The DataAnalysis module of DataWorks helps you perform SQL-based analysis online, gain insight into business requirements, and edit and share data. You can also save query results as chart cards and quickly generate visualized data reports based on the chart cards for daily reporting.	DataAnalysis overview
DataService Studio	DataService Studio is designed to provide comprehensive data service and sharing capabilities for enterprises and helps enterprises manage API services for internal and external systems in a centralized manner.	DataService Studio overview

5. Open Platform

DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.

Item	Description	References
OpenAPI	The OpenAPI module allows you to call DataWorks API operations so that you can integrate your applications with DataWorks. This can help facilitate big data processing, decrease manual operations and O&M operations, minimize data risks, and reduce costs for enterprises.	OpenAPI
OpenEvent	The OpenEvent module allows you to subscribe to DataWorks change events related to your applications so that you can detect and respond to the changes at the earliest opportunity.	OpenEvent overview
Extensions	You can use the OpenEvent module to subscribe to event messages that are generated in your DataWorks workspace. You can use the Extensions module to register your local program as an extension to manage extension point events and processes.	Extensions overview

Appendix: Suggestions for EMR cluster configuration

You can use different EMR services to run EMR tasks in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can select EMR services based on your business requirements.

Kyuubi
When you configure Kyuubi in the production environment, we recommend that you set the kyuubi_java_opts parameter to 10g or a larger value, and set the kyuubi_beeline_opts parameter to 2g or a larger value.
Spark
- The default memory size of Spark is small. To change the default memory size, add memory options to the spark-submit command.
- You can modify the following parameters that are configured for Spark based on the scale of the EMR cluster that you use: spark.driver.memory, spark.driver.memoryOverhead, and spark.executor.memory.
Important
Only EMR Hive nodes, EMR Spark nodes, and EMR Spark SQL nodes in DataWorks can be used to generate lineages. EMR Hive nodes can be used to generate table-level and column-level lineages. Spark-based EMR nodes can be used to generate only table-level lineages.
For more information about how to configure Spark, see Spark memory management.
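The three Spark parameters named in this section can be passed on the spark-submit command line. The following is a minimal sketch that assembles such a command. The memory values and the application file name are placeholders chosen for illustration, not recommendations from this topic, and the function name is hypothetical. The --driver-memory and --executor-memory options set spark.driver.memory and spark.executor.memory, and spark.driver.memoryOverhead is passed by using --conf.

```python
import shlex
from typing import List


def build_spark_submit(
    application: str,
    driver_memory: str = "4g",           # Placeholder value.
    driver_memory_overhead: str = "1g",  # Placeholder value.
    executor_memory: str = "8g",         # Placeholder value.
) -> List[str]:
    """Assemble a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--conf", f"spark.driver.memoryOverhead={driver_memory_overhead}",
        application,
    ]


if __name__ == "__main__":
    args = build_spark_submit("my_job.py")
    # Print a shell-safe command line for review before running it.
    print(" ".join(shlex.quote(arg) for arg in args))
```

Size the values based on the scale of your EMR cluster. Keeping the command as an argument list avoids quoting issues if you later pass it to a process launcher.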
HDFS
You can modify the following parameters that are configured for HDFS based on the scale of the EMR cluster that you use: hadoop_namenode_heapsize, hadoop_datanode_heapsize, hadoop_secondary_namenode_heapsize, and hadoop_namenode_opts.