To prevent errors when you run an E-MapReduce (EMR) node in DataWorks, make sure that the key configurations of the associated EMR data lake cluster meet the requirements. For example, you must configure Lightweight Directory Access Protocol (LDAP), a Ranger whitelist, and a security policy so that the EMR data lake cluster can authenticate the account that you use to run the EMR node in DataWorks. This topic describes how to configure these key items for an EMR data lake cluster.

Limits

The EMR data lake cluster must run V3.41.0 or a later minor version, or V5.7.0 or a later minor version. If the cluster runs a minor version earlier than V3.41.0 or V5.7.0, specific DataWorks features cannot be used.
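The version limit above can be checked with a short script before you associate a cluster. The following is a minimal sketch, not an official tool: the function names and the accepted version string formats (such as EMR-3.41.0 or V5.7.0) are assumptions for illustration, and major versions other than 3 and 5 are treated as unsupported only because this topic does not mention them.

```python
import re

# Minimum supported minor versions per major line, taken from the Limits section.
MIN_VERSIONS = {3: (3, 41, 0), 5: (5, 7, 0)}


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'EMR-3.41.0' or 'V5.7.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text):
    """Return True if the version is at or above the minimum for its major line."""
    version = parse_version(text)
    # Assumption: a major line that is not listed above is reported as unsupported.
    minimum = MIN_VERSIONS.get(version[0])
    return minimum is not None and version >= minimum
```

For example, meets_minimum("EMR-3.40.0") returns False, and meets_minimum("EMR-5.10.1") returns True because the comparison is numeric per component rather than a string comparison.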

Configure an EMR data lake cluster

  1. Optional: Enable LDAP.
    If you want to associate the EMR data lake cluster as an EMR compute engine instance with a DataWorks workspace in security mode and enable user authentication, you must enable LDAP for the EMR data lake cluster.
  2. Add the required properties of DataWorks to the Hive property whitelist on the Ranger service page of the EMR data lake cluster.
    If you integrate Hive with Ranger in EMR, you must add the required properties of DataWorks to the Hive property whitelist on the Ranger service page of the EMR data lake cluster and restart Hive before you develop EMR Hive nodes in DataWorks. Otherwise, the error message "Cannot modify spark.yarn.queue at runtime" or "Cannot modify SKYNET_BIZDATE at runtime" is returned when you run EMR Hive nodes.
    1. Add the required properties to the Hive property whitelist.
      Add a custom parameter that consists of a key and a value. The following example shows a custom parameter that is configured for the Hive component in an EMR data lake cluster:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note: In the preceding parameter, ALISA.* and SKYNET.* are specific to DataWorks.
    2. Restart the Hive service.
      After you configure the whitelist, you must restart the Hive service for the configuration to take effect.

  3. In the yarn-site.xml file, change the default priority at which EMR nodes are run.
    If you want to change the priority of an EMR node that is run in DataWorks, you must add the yarn.cluster.max-application-priority configuration item to the yarn-site.xml file of the EMR cluster in the EMR console and set it to a value greater than the default value 0. Otherwise, the priority that you specify for the EMR node in DataWorks does not take effect.
    Note: After you make the change, you must restart the YARN service for the configuration to take effect.
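The effect of steps 2 and 3 can be illustrated with a small sketch. It is an approximation under stated assumptions, not the Hive or YARN implementation: is_whitelisted and effective_priority are hypothetical helper names, the whitelist value is assumed to be applied as a regular expression that must match the whole property name, and only the appended value from step 2 is checked (any default Hive whitelist is ignored). The priority helper assumes that YARN caps a requested priority at yarn.cluster.max-application-priority.

```python
import re

# Whitelist value from step 2. It is treated here as a regular expression
# whose alternatives are separated by "|".
CONF_WHITELIST_APPEND = r"tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*"


def is_whitelisted(property_name, whitelist=CONF_WHITELIST_APPEND):
    """Return True if the whole property name matches one whitelist alternative."""
    # Assumption: the match must cover the full name, not just a prefix.
    return re.fullmatch(whitelist, property_name) is not None


def effective_priority(requested, cluster_max=0):
    """Cap the requested priority at yarn.cluster.max-application-priority.

    Assumption: a priority above the cluster maximum is lowered to the maximum.
    """
    return min(requested, cluster_max)


if __name__ == "__main__":
    for name in ("spark.yarn.queue", "SKYNET_BIZDATE", "hive.exec.parallel"):
        print(name, is_whitelisted(name))
    # With the default maximum of 0, a requested priority of 5 has no effect.
    print(effective_priority(5))
    print(effective_priority(5, cluster_max=10))
```

Under these assumptions, spark.yarn.queue and SKYNET_BIZDATE pass the whitelist check, which is why the two error messages in step 2 no longer appear after the parameter is added. A requested priority of 5 is reduced to 0 until the cluster maximum is raised.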