When you register an E-MapReduce (EMR) DataLake cluster to a DataWorks workspace and run EMR nodes (EMR Hive, EMR MR, EMR Presto, EMR Spark SQL), the default component settings are often insufficient for production workloads. This topic covers recommended memory configurations for Kyuubi, Spark, and Hadoop Distributed File System (HDFS), and explains how to isolate metadata between the development and production environments.
## Configure EMR components
### Kyuubi
In the production environment, set the following JVM memory parameters:
| Parameter | Recommended value | Description |
|---|---|---|
| kyuubi_java_opts | 10g or larger | Heap size for the Kyuubi server JVM. A larger heap reduces garbage collection (GC) pressure under concurrent query load. |
| kyuubi_beeline_opts | 2g or larger | Heap size for the Beeline client. Increase this value if your queries return large result sets. |
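The names above are the parameter names shown in the EMR console. For orientation, in a stock (self-managed) Kyuubi deployment the same knobs correspond to JVM options exported from kyuubi-env.sh; the flags below are illustrative values, not prescribed settings:

```shell
# kyuubi-env.sh (illustrative): heap and GC flags for the Kyuubi server
export KYUUBI_JAVA_OPTS="-Xmx10g -XX:+UseG1GC"
# Heap for the Beeline client; raise it for large result sets
export KYUUBI_BEELINE_OPTS="-Xmx2g"
```

On an EMR cluster, prefer changing these values through the console so they survive node scaling and restarts.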
### Spark
Spark's default memory allocation is conservative. Tune the following parameters based on your cluster size and workload:
| Parameter | Scope | Description |
|---|---|---|
| spark.driver.memory | Driver | Heap size for the Spark driver. Increase this value if the driver handles large broadcast variables or collects significant data. |
| spark.driver.memoryOverhead | Driver | Off-heap memory reserved for the driver JVM. Adjust based on your cluster scale and workload. |
| spark.executor.memory | Executor | Heap size for each executor. This is the primary parameter for tuning executor performance. |
Pass these parameters via spark-submit to apply them at the job level without affecting other workloads on the cluster:
```shell
spark-submit \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=512m \
  --conf spark.executor.memory=8g \
  ...
```
For the full list of Spark memory configuration options, see Spark memory management.
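As a sanity check when sizing these values: if a memoryOverhead setting is left unset, Spark defaults it to 10% of the corresponding heap, with a floor of 384 MiB. A small shell sketch of that rule (the helper function name is hypothetical, not a Spark command):

```shell
# Compute Spark's default memoryOverhead for a given heap size in MiB:
# max(heap * 0.10, 384), per the documented default.
overhead_mib() {
  local heap_mib=$1
  local pct=$(( heap_mib / 10 ))
  if [ "$pct" -gt 384 ]; then
    echo "$pct"
  else
    echo 384
  fi
}

overhead_mib 8192   # prints 819 (10% of heap wins)
overhead_mib 2048   # prints 384 (floor wins)
```

This is why the 512m value in the example above is an explicit override: for small heaps the default floor of 384 MiB may be too tight for off-heap allocations such as Python worker memory or native buffers.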
## Data lineage support
Not all EMR node types in DataWorks generate lineage data. Review the following constraints before designing pipelines that depend on lineage.
| Node type | Table-level lineage | Column-level lineage |
|---|---|---|
| EMR Hive | Supported | Supported |
| EMR Spark | Supported (Spark 2.x only) | Not supported |
| EMR Spark SQL | Supported (Spark 2.x only) | Not supported |
| EMR MR | Not supported | Not supported |
| EMR Presto | Not supported | Not supported |
## HDFS
HDFS daemon memory is controlled through the following parameters. Adjust these based on your cluster size — larger clusters with more data nodes and higher namespace load require more heap:
| Parameter | Component | Description |
|---|---|---|
| hadoop_namenode_heapsize | NameNode | Heap size for the NameNode JVM. Increase for clusters with a large number of files and blocks. |
| hadoop_datanode_heapsize | DataNode | Heap size for each DataNode JVM. |
| hadoop_secondary_namenode_heapsize | Secondary NameNode | Heap size for the Secondary NameNode, which handles periodic checkpointing. |
| hadoop_namenode_opts | NameNode | Additional JVM options for the NameNode, such as GC tuning flags. |
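These names are the EMR console parameter names. For reference, in a self-managed Hadoop 3.x deployment the equivalent per-daemon settings live in hadoop-env.sh; the values below are illustrative, not recommendations:

```shell
# hadoop-env.sh (illustrative): per-daemon JVM options in Hadoop 3.x
export HDFS_NAMENODE_OPTS="-Xmx4g -XX:+UseG1GC"
export HDFS_DATANODE_OPTS="-Xmx2g"
export HDFS_SECONDARYNAMENODE_OPTS="-Xmx4g"
```

As with the Kyuubi settings, change these through the EMR console on managed clusters so the values are applied consistently across nodes.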
## Isolate metadata between environments
When a DataWorks workspace runs in standard mode, you must register two separate EMR clusters — one for the development environment and one for the production environment. Register both on the Data Sources page in SettingCenter.
To meet the data isolation requirement, back each cluster's metadata with a separate ApsaraDB RDS database. Using a single database for both environments allows development changes to affect production metadata, which defeats the purpose of environment isolation.
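If your clusters use a self-managed Hive metastore, the isolation amounts to each cluster's hive-site.xml pointing at its own RDS database. A sketch for the production cluster, with placeholder endpoint and database names:

```xml
<!-- hive-site.xml on the PRODUCTION cluster (hostname and database
     name are placeholders). The development cluster uses the same
     property but points at a different RDS instance. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://rds-prod-instance.example.com:3306/hivemeta_prod</value>
</property>
```

On EMR, the metadata storage backend is normally selected when you create the cluster, so verify the two clusters reference different RDS instances there rather than editing configuration files by hand.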
For setup steps and additional usage notes, see Usage notes for development of EMR tasks in DataWorks.