
DataWorks:Best practices for configuring EMR clusters used in DataWorks

Last Updated: Apr 18, 2025

DataWorks allows you to register an E-MapReduce (EMR) DataLake cluster to a DataWorks workspace and create EMR nodes, such as EMR Hive, EMR MR, EMR Presto, and EMR Spark SQL nodes, based on the cluster. You can then configure EMR workflows, schedule the nodes in a workflow on a regular basis, and manage metadata in a workflow, which helps you generate data efficiently. This topic describes the optimal configurations of an EMR DataLake cluster that is used to run EMR nodes in DataWorks.

Background information

  • You can select different EMR components when you run EMR nodes in DataWorks. The components have different optimal configurations for you to run EMR nodes in DataWorks. You can select EMR components based on your business requirements. For more information, see the Configure EMR components section in this topic.

  • When you run EMR nodes in DataWorks, you can select a metadata storage method based on the mode in which your workspace runs. For more information, see the Select a metadata storage method section in this topic.

For more information about the precautions and development process of an EMR node in DataWorks based on an EMR DataLake cluster, see Usage notes for development of EMR tasks in DataWorks.

Configure EMR components

  • Kyuubi

    When you configure Kyuubi in the production environment, we recommend that you set the memory size of kyuubi_java_opts to 10g or greater, and the memory size of kyuubi_beeline_opts to 2g or greater.
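    As a sketch, the recommended sizes above correspond to JVM heap options that can be set for the Kyuubi service. The variable names and values below are illustrative; on an EMR cluster, adjust them on the Kyuubi service configuration page in the EMR console rather than editing files by hand.

    ```shell
    # Illustrative kyuubi-env.sh fragment (hypothetical values;
    # apply through the EMR console's Kyuubi service configuration)
    export KYUUBI_JAVA_OPTS="-Xmx10g"     # server heap: 10g or greater for production
    export KYUUBI_BEELINE_OPTS="-Xmx2g"   # beeline client heap: 2g or greater
    ```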

  • Spark

    • The default memory settings of Spark are small. You can override them by specifying parameters in a spark-submit command based on your business requirements.

    • You can configure the following configuration items of Spark based on the scale of your EMR cluster: spark.driver.memory, spark.driver.memoryOverhead, and spark.executor.memory.

    Important
    • Only EMR Hive nodes, EMR Spark nodes, and EMR Spark SQL nodes in DataWorks can be used to generate lineages. EMR Hive nodes can be used to generate table-level and column-level lineages. Spark-based EMR nodes can be used to generate only table-level lineages.

    • For Spark-based EMR nodes, only the nodes that use Spark 2.x can be used to generate lineages.

    For more information about how to configure Spark, see Spark Memory Management.
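    For example, the memory-related configuration items above can be overridden per job on the spark-submit command line, or set cluster-wide in spark-defaults.conf. The values, main class, and JAR name below are hypothetical; size the settings to your cluster.

    ```shell
    # Illustrative only: override Spark memory settings for a single job.
    # com.example.MyJob and my-job.jar are placeholders for your own application.
    spark-submit \
      --conf spark.driver.memory=4g \
      --conf spark.driver.memoryOverhead=1g \
      --conf spark.executor.memory=8g \
      --class com.example.MyJob \
      my-job.jar
    ```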

  • HDFS

    You can configure the following configuration items of HDFS based on the scale of your EMR cluster: hadoop_namenode_heapsize, hadoop_datanode_heapsize, hadoop_secondary_namenode_heapsize, and hadoop_namenode_opts.
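    As an illustration, these configuration items control the JVM heap sizes of the HDFS daemons. The values below are hypothetical; size them to your NameNode metadata volume and DataNode storage, and apply them on the HDFS configuration page in the EMR console.

    ```shell
    # Illustrative HDFS heap settings (hypothetical values, in MB where numeric;
    # adjust on the HDFS service configuration page in the EMR console)
    hadoop_namenode_heapsize=4096
    hadoop_datanode_heapsize=2048
    hadoop_secondary_namenode_heapsize=4096
    hadoop_namenode_opts=-XX:+UseG1GC
    ```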

Select a metadata storage method

To implement the isolation mechanism between the development and production environments of a DataWorks workspace in standard mode, you must register one EMR cluster for the development environment and another EMR cluster for the production environment on the Data Sources page in SettingCenter. To meet the data isolation requirements, the two EMR clusters must store their metadata in two different ApsaraDB RDS databases.
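As a minimal sketch of what "two different ApsaraDB RDS databases" means in practice, each cluster's Hive metastore points its JDBC connection at its own database. The RDS endpoint, database name, and driver shown below are hypothetical; the production cluster would use an analogous fragment pointing at a separate RDS database.

```xml
<!-- Illustrative hive-site.xml fragment for the development cluster's metastore.
     Endpoint and database name are placeholders, not real values. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://rm-dev-example.mysql.rds.aliyuncs.com:3306/hivemeta_dev</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```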