Usage notes for development of EMR nodes in DataWorks

DataWorks allows you to create nodes such as Hive, MR, Presto, and Spark SQL nodes based on an E-MapReduce (EMR) compute engine. In the DataWorks console, you can configure EMR nodes, enable periodic scheduling of the nodes, and manage the metadata of the nodes to ensure that data is generated and managed in an efficient and stable manner. This topic describes the usage notes for the development of EMR nodes in DataWorks.

Background information

EMR is a big data processing solution provided by Alibaba Cloud.

EMR is developed based on open source Apache Hadoop and Apache Spark. EMR allows you to use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data with ease. Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks to meet the business requirements of different users. For more information, see Product Overview.

Supported EMR cluster types

You can register the following types of EMR clusters to DataWorks. Before you can perform operations related to EMR in the DataWorks console, you must create one of the following types of EMR clusters.

Product series	EMR cluster type and description	Usage notes for using the EMR cluster in DataWorks
EMR on ECS	EMR DataLake cluster: If you want to register an EMR DataLake cluster to DataWorks, you must make sure that the cluster is of V3.41.0, V5.7.0, or a minor version later than V3.41.0 or V5.7.0.	You must register the cluster to DataWorks before you can use the cluster in the DataWorks console.
EMR on ECS	Custom EMR cluster: If you want to register a custom EMR cluster to DataWorks, you must make sure that the cluster is of V3.41.0, V5.7.0, or a minor version later than V3.41.0 or V5.7.0.	You must register the cluster to DataWorks before you can use the cluster in the DataWorks console.
EMR on ECS	EMR Hadoop cluster: If you want to register an EMR Hadoop cluster that belongs to the current logon account to DataWorks, you must make sure that the cluster is of V3.38.2 or V3.38.3. If you want to register an EMR Hadoop cluster that belongs to an Alibaba Cloud account other than the current logon account to DataWorks, you must make sure that the cluster is of V3.38.2 or V3.38.3 and Data Lake Formation (DLF) is not used to manage the metadata of the cluster.	You must register the cluster to DataWorks before you can use the cluster in the DataWorks console. For more information, see Scenario: Register a cross-account EMR cluster.
EMR on ACK	A cluster that you create in EMR on ACK supports only Spark and Spark SQL nodes.	You must register the cluster to DataWorks before you can use the cluster in the DataWorks console.
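The version rules in the preceding table can be sketched as a small check. This is an illustrative sketch only, not a DataWorks API: the function names, the cluster type labels ("datalake", "custom", "hadoop"), and the `cross_account` and `uses_dlf` flags are hypothetical. It also assumes that "a minor version later than V3.41.0 or V5.7.0" means a later version within the same major line (3.x or 5.x). EMR on ACK clusters are left out because the table states no version requirement for them.

```python
import re

# Minimum registrable versions per major line, taken from the table above.
MIN_VERSIONS = {3: (3, 41, 0), 5: (5, 7, 0)}
# EMR Hadoop clusters are accepted only at these exact versions.
HADOOP_VERSIONS = {(3, 38, 2), (3, 38, 3)}


def parse_version(text):
    """Turn a string such as 'V3.41.0' into the tuple (3, 41, 0)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {text!r}")
    return tuple(int(part) for part in match.groups())


def can_register(cluster_type, version, cross_account=False, uses_dlf=False):
    """Return True if the table above allows registering this cluster."""
    parsed = parse_version(version)
    if cluster_type in ("datalake", "custom"):
        # Assumption: later versions count only within the same major line.
        minimum = MIN_VERSIONS.get(parsed[0])
        return minimum is not None and parsed >= minimum
    if cluster_type == "hadoop":
        if parsed not in HADOOP_VERSIONS:
            return False
        # A cross-account Hadoop cluster must not use DLF for metadata.
        return not (cross_account and uses_dlf)
    raise ValueError(f"unknown cluster type: {cluster_type!r}")
```

For example, `can_register("datalake", "V5.9.1")` is accepted under these assumptions, while `can_register("hadoop", "V3.38.3", cross_account=True, uses_dlf=True)` is rejected.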

Limits

Task type: You cannot run EMR Flink tasks in the DataWorks console.

Task running: You can use only an exclusive resource group for scheduling to run an EMR task.
Data lineages: Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.

Note

For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, the Spark-based EMR nodes can be used to view table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can be used to view table-level lineages.
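The rule in the preceding note for Spark-based EMR nodes can be expressed as a lookup. This is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, version comparison is assumed to apply within the same major line (3.x or 5.x), and Spark-based nodes that do not use Spark 2.x on older clusters are assumed to have no viewable lineage because the note does not list them.

```python
import re

# Thresholds stated in the note above, keyed by EMR major version line.
FIELD_LEVEL_MIN = {3: (3, 42, 0), 5: (5, 8, 0)}


def spark_lineage_levels(cluster_version, spark_major):
    """Return the lineage levels viewable for a Spark-based EMR node."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", cluster_version)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {cluster_version!r}")
    parsed = tuple(int(part) for part in match.groups())
    minimum = FIELD_LEVEL_MIN.get(parsed[0])
    if minimum is None:
        # The note covers only the 3.x and 5.x lines.
        raise ValueError(f"no lineage rule for major version {parsed[0]}")
    if parsed >= minimum:
        return {"table", "field"}
    # Older clusters: only nodes that use Spark 2.x show table-level lineage.
    if spark_major == 2:
        return {"table"}
    return set()
```

For example, a V3.41.0 cluster running Spark 2.x yields `{"table"}`, and a V5.8.0 cluster yields `{"table", "field"}` regardless of the Spark version.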

Prerequisites

DataWorks is activated. For more information, see Activate DataWorks.
An EMR cluster is created. For more information, see Create a cluster.
Note
You can use different EMR services to run EMR nodes in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can refer to Appendix: Suggestions for EMR cluster configuration to select EMR services based on your business requirements.
A DataWorks workspace is created. For more information, see Create and manage workspaces.

Usage notes

The following table describes the usage notes for the development of EMR nodes in DataWorks.

No.	Description	References
1	When you develop EMR nodes in DataWorks, you are charged for not only DataWorks resources but also the resources of other Alibaba Cloud services.	Billing
2	Before you develop EMR nodes in DataWorks, you must purchase DataWorks of a specific edition and a resource group based on your business requirements, register an EMR cluster, and prepare the development environment.	Environment preparation
3	DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements to implement fine-grained permission management.	Permission management
4	DataWorks provides the Data Modeling service that is used to structure and manage large volumes of unordered and complex data. DataWorks also provides the DataStudio service for development of nodes that are scheduled to run. After the nodes are developed, you can go to Operation Center to monitor and perform O&M operations on the nodes.	Data modeling and development
5	DataWorks DataAnalysis provides the EMR data analysis and service sharing capabilities.	Data analysis
6	DataWorks allows you to manage EMR metadata and govern EMR data.	Data governance
7	DataWorks provides the DataService Studio service to help you manage API services for internal and external systems in a centralized manner.	Data service
8	DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.	Open Platform

Billing

1. Fees for DataWorks resources

This section describes the fees that are included in your DataWorks bill. For information about the billable items of DataWorks, see Billing overview.

Fee	Description
Fees for the DataWorks edition that you use	You must activate DataWorks before you can develop nodes in DataWorks. If you activate an advanced edition such as DataWorks Enterprise Edition, you are charged the related fees when you purchase the edition.
Fees for the scheduling resources that you use to schedule nodes	After nodes are developed, scheduling resources are required to schedule the nodes. You must purchase subscription exclusive resource groups for scheduling based on your business requirements and pay for the resource groups.
Fees for the resources that you use to synchronize data	A data synchronization node consumes scheduling resources and synchronization resources. You must purchase subscription exclusive resource groups for Data Integration based on your business requirements and pay for the resource groups.

2. Fees for the resources of other Alibaba Cloud services

This section describes the fees that are not included in your DataWorks bill.

Important

You are charged for the resources of other Alibaba Cloud services based on the billing logic of the Alibaba Cloud services. For more information, see the billing documentation of the Alibaba Cloud services. For information about the billing details of an EMR compute engine, see Billing overview.

Fee	Description
Database fees	When you run data synchronization nodes to read data from and write data to databases, database fees may be generated.
Computing and storage fees	When you run nodes of a specific compute engine type, computing and storage fees of this type of compute engine may be generated.
Network service fees	When you establish network connections between DataWorks and other related services, network service fees may be generated. For example, if you use services, such as Express Connect, Elastic IP Address (EIP), and Internet Shared Bandwidth, to establish network connections between DataWorks and other related services, you may be charged network service fees.

Environment preparation

1. Resource preparation

Item	Description	References
Select a DataWorks edition	DataWorks Basic Edition allows you to perform the following basic operations during the development of EMR data: migrate data to the cloud, develop data, schedule EMR nodes, and govern data. If you want to use more advanced data governance and data security solutions, you can purchase DataWorks of an advanced edition.	Differences among DataWorks editions
Select a resource group	You can use only exclusive resource groups for scheduling to run EMR nodes.	Create and use an exclusive resource group for scheduling

2. Development environment preparation

You must register an EMR cluster with a DataWorks workspace before you can develop EMR nodes in DataStudio. To enable collaborative data development, you must also add users to the workspace as members.

Item	Description	References
Register an EMR cluster	Before you enable DataWorks to periodically schedule EMR nodes, you must add an EMR cluster to DataWorks as a data source.	
Prepare a collaborative development environment	To ensure that RAM users can collaborate with each other to develop data in a workspace, you must perform the following operations: Add the RAM users to the current workspace as members and assign the Development role to the RAM users in the workspace. Add workspace members to the desired EMR cluster.	

Permission management

DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements. Details of permission management:

1. Management of data access permissions

For the RAM users that are added to a DataWorks workspace as members to develop EMR nodes, you can configure mappings to EMR cluster accounts. A mapped RAM user has the permissions of the corresponding EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts.

DataWorks allows you to manage permissions on Data Lake Formation (DLF) in a visualized manner. For example, you can request permissions, process permission requests, and audit permissions. This helps you manage permissions on fully managed data lakes in a centralized manner. If DLF is specified as the metadata storage service for an EMR compute engine that you associate with your workspace, you can apply for and manage permissions in DataWorks Security Center. For more information, see Manage permissions on DLF.

2. Management of permissions on services and features

Before you develop data in DataWorks as a RAM user, you must assign a workspace-level role to the RAM user to grant the RAM user specific permissions. For more information, see Best practices for managing permissions of RAM users.

You can refer to Manage permissions on global-level services to manage permissions on DataWorks service modules, such as prohibiting users from accessing Data Map, and to manage permissions of performing operations in the DataWorks console, such as allowing users to delete a workspace.
You can refer to Manage permissions on workspace-level services to manage permissions on DataWorks workspace-level service modules, such as allowing users to access DataStudio to perform development operations, and to manage permissions on DataWorks global-level service modules, such as prohibiting users from accessing Data Security Guard.

Getting started

1. Data modeling and development

Module	Description	References
Data Modeling	Data Modeling is the first step in end-to-end data governance. It applies the modeling methodology of the Alibaba data mid-end and interprets the business data of an enterprise from a business perspective through four modules: data warehouse planning, data standard, dimensional modeling, and data metric. This allows personnel inside the enterprise to quickly understand and share a common way of measuring and interpreting business data in compliance with data warehousing specifications.	Data Modeling overview
DataStudio	DataStudio is an end-to-end big data development platform that you can use to develop data processing tasks of an EMR compute engine online. DataStudio provides powerful node scheduling capabilities and can support centralized orchestration and scheduling for tens of millions of instances that are generated for nodes. DataStudio also provides a control process for node deployment, which can ensure the stability of node output.	Development of EMR nodes: Create an EMR Presto node, Create an EMR Hive node, Create an EMR MR node, Create an EMR Spark SQL node, Create an EMR Spark node, Create an EMR Shell node, Create an EMR Spark Streaming node, Create an EMR table, Create and use an EMR JAR resource, Create an EMR function. Development of other types of nodes: Create and use a zero load node, Create an HTTP Trigger node, OSS object inspection node, Create an FTP Check node, Configure an assignment node, Create a parameter node, Logic of do-while nodes, Logic of for-each nodes, Configure a branch node. Scheduling property configuration of nodes: Overview. Node debugging: Debugging procedure. Node deployment: Deploy nodes. Node management: Perform operations on multiple DataWorks objects at a time. Process control: Process management.
Operation Center	Operation Center is an end-to-end big data O&M and monitoring platform. Operation Center allows you to view the status of nodes and perform O&M operations on nodes on which exceptions occur. For example, you can perform intelligent diagnostics and rerun nodes in Operation Center. Operation Center provides the intelligent baseline feature that you can use to resolve issues such as uncontrollable output time of important nodes and difficulties in monitoring of massive nodes. This feature helps you ensure the timeliness of node output.	Perform basic O&M operations on auto triggered nodes
Data Quality	Data Quality ensures data availability throughout the end-to-end data R&D process and provides reliable data for your business in an efficient manner. It runs quality checks based on monitoring rules and can combine those rules with node scheduling processes, which helps you identify data quality issues at the earliest opportunity and prevent them from escalating.	Data Quality overview

2. Data analysis

The DataAnalysis service module of DataWorks helps you perform SQL-based analysis online, gain insight into business requirements, and edit and share data. You can also save query results as chart cards and quickly generate visualized data reports based on the chart cards for daily reporting. For more information, see DataAnalysis overview.

3. Data governance

After you register an EMR cluster to DataWorks, DataWorks automatically collects metadata from your EMR compute engine. You can refer to Data Map overview to view metadata. In addition, you can refer to Data Governance Center overview to view the issues that are detected by DataWorks and perform related data governance operations.

Module	Description	References
Data Map	Data Map is an enterprise-grade data management platform that provides management, sorting, quick search, and in-depth understanding capabilities for data objects based on the underlying unified metadata services.	Data Map overview
Security Center, Data Security Guard, and Approval Center	Security Center is an end-to-end data security governance platform that covers classification of data assets, sensitive data identification, management of data-related authorization, masking of sensitive data, audit of access to sensitive data, and risk identification and response. Security Center helps you determine data security governance issues.	Security Center overview, Data Security Guard overview, Approval Center overview
Data Governance Center	Data Governance Center automatically identifies items to be governed for multiple governance fields based on rules that come from experience in data-related fields, and provides governance and optimization solutions covering pre-event issue prevention and post-event issue resolution. Data Governance Center can help you actively and systematically complete data governance.	Data Governance Center overview

4. Data service

DataService Studio is designed to provide comprehensive data service and sharing capabilities for enterprises and helps enterprises manage API services for internal and external systems in a centralized manner. For more information, see DataService Studio overview.

5. Open Platform

DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.

Item	Description	References
OpenAPI	The OpenAPI module allows you to call DataWorks API operations so that you can integrate your applications with DataWorks. This can help facilitate big data processing, decrease manual operations and O&M operations, minimize data risks, and reduce costs for enterprises.	OpenAPI
OpenEvent	The OpenEvent module allows you to subscribe to DataWorks change events related to your applications so that you can detect and respond to the changes at the earliest opportunity.	OpenEvent overview
Extensions	You can use the OpenEvent module to subscribe to event messages that are generated in your DataWorks workspace. You can use the Extensions module to register your local program as an extension to manage extension point events and processes.	Extensions overview

Appendix: Suggestions for EMR cluster configuration

You can use different EMR services to run EMR nodes in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can select EMR services based on your business requirements.

Kyuubi
When you configure Kyuubi in the production environment, we recommend that you set the kyuubi_java_opts parameter to 10g or a larger value, and set the kyuubi_beeline_opts parameter to 2g or a larger value.
Spark
- The default memory size of Spark is small. To change the default memory size, add memory options to the spark-submit command.
- You can modify the following parameters that are configured for Spark based on the scale of the EMR cluster that you use: spark.driver.memory, spark.driver.memoryOverhead, and spark.executor.memory.
Important
Only EMR Hive nodes, EMR Spark nodes, and EMR Spark SQL nodes in DataWorks can be used to generate lineages. EMR Hive nodes can be used to generate table-level and column-level lineages. For Spark-based EMR nodes, whether only table-level lineages or also field-level lineages are available depends on the version of the EMR cluster, as described in the Limits section.
For more information about how to configure Spark, see Spark memory management.
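The following sketch shows one way to pass the three memory parameters listed above to spark-submit. The helper name `build_spark_submit`, the application name, and the memory values are hypothetical placeholders, not recommendations; choose values based on the scale of your EMR cluster. The sketch uses the standard spark-submit options `--driver-memory` and `--executor-memory`, and sets `spark.driver.memoryOverhead` through `--conf`.

```python
import shlex


def build_spark_submit(app, driver_memory="4g", driver_overhead="1g",
                       executor_memory="8g"):
    """Build a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        # Corresponds to spark.driver.memory.
        "--driver-memory", driver_memory,
        # Corresponds to spark.executor.memory.
        "--executor-memory", executor_memory,
        # No dedicated flag is used here; the property is set with --conf.
        "--conf", f"spark.driver.memoryOverhead={driver_overhead}",
        app,
    ]


# Print the command as a single shell-safe string for review.
print(shlex.join(build_spark_submit("my_job.jar")))
```

Returning an argument list rather than a single string keeps each option separate, so the command can be passed to a process launcher without additional quoting.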
HDFS
You can modify the following parameters that are configured for HDFS based on the scale of the EMR cluster that you use: hadoop_namenode_heapsize, hadoop_datanode_heapsize, hadoop_secondary_namenode_heapsize, and hadoop_namenode_opts.