
DataWorks: Best practices for configuring EMR clusters used in DataWorks

Last Updated: Mar 27, 2026

When you register an E-MapReduce (EMR) DataLake cluster to a DataWorks workspace and run EMR nodes (EMR Hive, EMR MR, EMR Presto, EMR Spark SQL), the default component settings are often insufficient for production workloads. This topic covers recommended memory configurations for Kyuubi, Spark, and Hadoop Distributed File System (HDFS), and explains how to isolate metadata between the development and production environments.

Configure EMR components

Kyuubi

In the production environment, set the following JVM memory parameters:

- kyuubi_java_opts: 10g or larger. Heap size for the Kyuubi server JVM. A larger heap reduces garbage collection (GC) pressure under concurrent query load.
- kyuubi_beeline_opts: 2g or larger. Heap size for the Beeline client. Increase this value if your queries return large result sets.
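In open-source Kyuubi, these settings correspond to environment variables in kyuubi-env.sh. The following fragment is an illustrative sketch of the recommended values; the GC flag is an example, not part of the recommendation, and on EMR you would normally set these through the console parameters above rather than editing the file directly:

```shell
# kyuubi-env.sh (illustrative; corresponds to kyuubi_java_opts / kyuubi_beeline_opts)
export KYUUBI_JAVA_OPTS="-Xmx10g -XX:+UseG1GC"   # server heap; G1GC flag is an example
export KYUUBI_BEELINE_OPTS="-Xmx2g"              # Beeline client heap
```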

Spark

Spark's default memory allocation is conservative. Tune the following parameters based on your cluster size and workload:

- spark.driver.memory (scope: driver): Heap size for the Spark driver. Increase it if the driver handles large broadcast variables or collects significant data.
- spark.driver.memoryOverhead (scope: driver): Off-heap memory reserved for the driver JVM. Adjust based on your cluster scale and workload.
- spark.executor.memory (scope: executor): Heap size for each executor. This is the primary parameter for tuning executor performance.

Pass these parameters via spark-submit to apply them at the job level without affecting other workloads on the cluster:

spark-submit \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=512m \
  --conf spark.executor.memory=8g \
  ...

For the full list of Spark memory configuration options, see Spark memory management.

Data lineage support

Important

Not all EMR node types in DataWorks generate lineage data. Review the following constraints before designing pipelines that depend on lineage.

- EMR Hive: table-level lineage supported; column-level lineage supported.
- EMR Spark: table-level lineage supported (Spark 2.x only); column-level lineage not supported.
- EMR Spark SQL: table-level lineage supported (Spark 2.x only); column-level lineage not supported.
- EMR MR: table-level and column-level lineage not supported.
- EMR Presto: table-level and column-level lineage not supported.

HDFS

HDFS daemon memory is controlled through the following parameters. Adjust them based on your cluster size: larger clusters with more DataNodes and higher namespace load require more heap.

- hadoop_namenode_heapsize (NameNode): Heap size for the NameNode JVM. Increase it for clusters with a large number of files and blocks.
- hadoop_datanode_heapsize (DataNode): Heap size for each DataNode JVM.
- hadoop_secondary_namenode_heapsize (Secondary NameNode): Heap size for the Secondary NameNode, which handles periodic checkpointing.
- hadoop_namenode_opts (NameNode): Additional JVM options for the NameNode, such as GC tuning flags.
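As a rough sketch, these console parameters map to Hadoop environment variables in hadoop-env.sh. The values below are placeholders for illustration only; size them from your own file, block, and DataNode counts:

```shell
# hadoop-env.sh (illustrative; values are placeholders, sizes in MB)
export HADOOP_NAMENODE_HEAPSIZE=4096            # grows with file and block count
export HADOOP_DATANODE_HEAPSIZE=2048
export HADOOP_SECONDARY_NAMENODE_HEAPSIZE=4096  # size close to the NameNode heap
export HADOOP_NAMENODE_OPTS="-XX:+UseG1GC ${HADOOP_NAMENODE_OPTS}"  # example GC flag
```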

Isolate metadata between environments

When a DataWorks workspace runs in standard mode, you must register two separate EMR clusters — one for the development environment and one for the production environment. Register both on the Data Sources page in SettingCenter.

To meet the data isolation requirement, back each cluster's metadata with a separate ApsaraDB RDS database. Using a single database for both environments allows development changes to affect production metadata, which defeats the purpose of environment isolation.
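For example, if the clusters use a Hive metastore backed by RDS, each cluster's hive-site.xml would point at its own database through the standard javax.jdo.option.ConnectionURL property. The host and database names below are placeholders:

```xml
<!-- hive-site.xml on the development cluster (host/db names are placeholders) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://rds-dev-instance:3306/hivemeta_dev</value>
</property>

<!-- hive-site.xml on the production cluster points at a different RDS database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://rds-prod-instance:3306/hivemeta_prod</value>
</property>
```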

For setup steps and additional usage notes, see Usage notes for development of EMR tasks in DataWorks.