When you register an E-MapReduce (EMR) DataLake cluster to a DataWorks workspace and run EMR nodes (EMR Hive, EMR MR, EMR Presto, EMR Spark SQL), the default component settings are often insufficient for production workloads. This topic covers recommended memory configurations for Kyuubi, Spark, and Hadoop Distributed File System (HDFS), and explains how to isolate metadata between the development and production environments.
## Configure EMR components
### Kyuubi
In the production environment, set the following JVM memory parameters:
| Parameter | Recommended value | Description |
|---|---|---|
| kyuubi_java_opts | 10g or larger | Heap size for the Kyuubi server JVM. A larger heap reduces garbage collection (GC) pressure under concurrent query load. |
| kyuubi_beeline_opts | 2g or larger | Heap size for the Beeline client. Increase this value if your queries return large result sets. |
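The names above are the parameter names shown in the EMR console. For orientation, in a stock (self-managed) Kyuubi deployment the same knobs correspond to JVM options exported from kyuubi-env.sh; the flags below are illustrative values, not prescribed settings:

```shell
# kyuubi-env.sh (illustrative): heap and GC flags for the Kyuubi server
export KYUUBI_JAVA_OPTS="-Xmx10g -XX:+UseG1GC"
# Heap for the Beeline client; raise it for large result sets
export KYUUBI_BEELINE_OPTS="-Xmx2g"
```

On an EMR cluster, prefer changing these values through the console so they survive node scaling and restarts.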
### Spark
Spark's default memory allocation is conservative. Tune the following parameters based on your cluster size and workload:
| Parameter | Scope | Description |
|---|---|---|
| spark.driver.memory | Driver | Heap size for the Spark driver. Increase this value if the driver handles large broadcast variables or collects significant data. |
| spark.driver.memoryOverhead | Driver | Off-heap memory reserved for the driver JVM. Adjust based on your cluster scale and workload. |
| spark.executor.memory | Executor | Heap size for each executor. This is the primary parameter for tuning executor performance. |
Pass these parameters via spark-submit to apply them at the job level without affecting other workloads on the cluster:
```shell
spark-submit \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=512m \
  --conf spark.executor.memory=8g \
  ...
```
For the full list of Spark memory configuration options, see Spark memory management.
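As a sanity check when sizing these values: if a memoryOverhead setting is left unset, Spark defaults it to 10% of the corresponding heap, with a floor of 384 MiB. A small shell sketch of that rule (the helper function name is hypothetical, not a Spark command):

```shell
# Compute Spark's default memoryOverhead for a given heap size in MiB:
# max(heap * 0.10, 384), per the documented default.
overhead_mib() {
  local heap_mib=$1
  local pct=$(( heap_mib / 10 ))
  if [ "$pct" -gt 384 ]; then
    echo "$pct"
  else
    echo 384
  fi
}

overhead_mib 8192   # prints 819 (10% of heap wins)
overhead_mib 2048   # prints 384 (floor wins)
```

This is why the 512m value in the example above is an explicit override: for small heaps the default floor of 384 MiB may be too tight for off-heap allocations such as Python worker memory or native buffers.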
## Data lineage support
Not all EMR node types in DataWorks generate lineage data. Review the following constraints before designing pipelines that depend on lineage.
| Node type | Table-level lineage | Column-level lineage |
|---|---|---|
| EMR Hive | Supported | Supported |
| EMR Spark | Supported (Spark 2.x only) | Not supported |
| EMR Spark SQL | Supported (Spark 2.x only) | Not supported |
| EMR MR | Not supported | Not supported |
| EMR Presto | Not supported | Not supported |
## HDFS
HDFS daemon memory is controlled through the following parameters. Adjust these based on your cluster size — larger clusters with more data nodes and higher namespace load require more heap:
| Parameter | Component | Description |
|---|---|---|
| hadoop_namenode_heapsize | NameNode | Heap size for the NameNode JVM. Increase for clusters with a large number of files and blocks. |
| hadoop_datanode_heapsize | DataNode | Heap size for each DataNode JVM. |
| hadoop_secondary_namenode_heapsize | Secondary NameNode | Heap size for the Secondary NameNode, which handles periodic checkpointing. |
| hadoop_namenode_opts | NameNode | Additional JVM options for the NameNode, such as GC tuning flags. |
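These names are the EMR console parameter names. For reference, in a self-managed Hadoop 3.x deployment the equivalent per-daemon settings live in hadoop-env.sh; the values below are illustrative, not recommendations:

```shell
# hadoop-env.sh (illustrative): per-daemon JVM options in Hadoop 3.x
export HDFS_NAMENODE_OPTS="-Xmx4g -XX:+UseG1GC"
export HDFS_DATANODE_OPTS="-Xmx2g"
export HDFS_SECONDARYNAMENODE_OPTS="-Xmx4g"
```

As with the Kyuubi settings, change these through the EMR console on managed clusters so the values are applied consistently across nodes.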
## Isolate metadata between environments
When a DataWorks workspace runs in standard mode, you must register two separate EMR clusters — one for the development environment and one for the production environment. Register both on the Data Sources page in SettingCenter.
To meet the data isolation requirement, back each cluster's metadata with a separate ApsaraDB RDS database. Using a single database for both environments allows development changes to affect production metadata, which defeats the purpose of environment isolation.
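If your clusters use a self-managed Hive metastore, the isolation amounts to each cluster's hive-site.xml pointing at its own RDS database. A sketch for the production cluster, with placeholder endpoint and database names:

```xml
<!-- hive-site.xml on the PRODUCTION cluster (hostname and database
     name are placeholders). The development cluster uses the same
     property but points at a different RDS instance. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://rds-prod-instance.example.com:3306/hivemeta_prod</value>
</property>
```

On EMR, the metadata storage backend is normally selected when you create the cluster, so verify the two clusters reference different RDS instances there rather than editing configuration files by hand.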
For setup steps and additional usage notes, see Usage notes for development of EMR tasks in DataWorks.