Configure Custom Spark Job Files in E-MapReduce

Custom configuration files let you control the runtime environment for EMR Serverless Spark jobs and sessions. Use them when a framework requires an XML file at a specific path at runtime, or when you are migrating non-SQL jobs from EMR on ECS and the job code does not initialize the file system through SparkContext#hadoopConfiguration.

For key-value Spark properties that do not require a dedicated file, add them directly in the Spark Configuration field instead.

Prerequisites

Before you begin, make sure you have:

A workspace. For more information, see Workspace management.

Create a custom configuration file

Log on to the E-MapReduce console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the target workspace name.
On the EMR Serverless Spark page, click Configuration Management in the left-side navigation pane.
On the Configuration page, click the Custom Configuration Files tab, and then click Create Custom Configuration File.

Configure the parameters and click Create.

Note

The system predefines a set of key configuration files whose names and content are managed internally. You cannot rename or overwrite them. The locked file names are: spark-defaults.conf, kyuubi-defaults.conf, executorPodTemplate.yaml, spark-pod-template.yaml, driver_log4j.xml, executor_log4j.xml, session_log4j.xml, spark.properties, and syncer_log4j.xml.

Parameter	Description
Path	The storage path for the file.
File name	The file name and extension. Select `.txt`, `.xml`, or `.json` based on the file type.
File content	The configuration content. Make sure the content complies with the format requirements of the selected file type.
Description	A description of the file's purpose, to help with ongoing management.

After the file is created, click Edit or Delete in the Actions column to modify or remove it.
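The naming rules above can be checked locally before you open the console. The following Python sketch is only an illustration: the check_file_name helper is hypothetical and not part of any EMR tooling, and it assumes that names are compared exactly as written (case-sensitive).

```python
import os

# Locked file names from the note above; they cannot be renamed or overwritten.
LOCKED_NAMES = {
    "spark-defaults.conf", "kyuubi-defaults.conf", "executorPodTemplate.yaml",
    "spark-pod-template.yaml", "driver_log4j.xml", "executor_log4j.xml",
    "session_log4j.xml", "spark.properties", "syncer_log4j.xml",
}
# Extensions offered by the File name parameter.
ALLOWED_EXTENSIONS = {".txt", ".xml", ".json"}


def check_file_name(file_name):
    """Return a list of problems with a proposed file name (empty if none)."""
    problems = []
    if file_name in LOCKED_NAMES:
        problems.append("name is reserved by the system")
    extension = os.path.splitext(file_name)[1]
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("extension must be .txt, .xml, or .json")
    return problems


# Hypothetical candidates: one acceptable name and one locked name.
for candidate in ("core-site.xml", "driver_log4j.xml"):
    print(candidate, check_file_name(candidate))
```

A check like this only mirrors the rules stated on this page; the console remains the authority on which names it accepts.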

Examples

The following examples show two common scenarios for custom configuration files. Both create an XML file in /etc/spark/conf, where Serverless Spark picks it up at job startup.

Example 1: Enable Ranger authentication for Spark Thrift Server

This example configures Ranger authentication for a Spark Thrift Server session.

Step 1: Create the Ranger security configuration file

Create a configuration file named ranger-spark-security.xml and save it to /etc/spark/conf. Use the following content:

<configuration>
  <property>
    <name>ranger.plugin.spark.policy.cache.dir</name>
    <value>/opt/emr-hive/policycache</value>
  </property>
  <property>
    <name>ranger.plugin.spark.ambari.cluster.name</name>
    <value>serverless-spark</value>
  </property>
  <property>
    <name>ranger.plugin.spark.service.name</name>
    <value>emr-hive</value>
  </property>
  <property>
    <name>ranger.plugin.spark.policy.rest.url</name>
    <value>http://<ranger_admin_ip>:<ranger_admin_port></value>
  </property>
  <property>
    <name>ranger.plugin.spark.policy.source.impl</name>
    <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
  </property>
  <property>
    <name>ranger.plugin.spark.super.users</name>
    <value>root</value>
  </property>
</configuration>

Replace the placeholders with values from your environment:

Placeholder	Description
`<ranger_admin_ip>`	Internal IP address of Ranger Admin. If Ranger is deployed in an EMR on ECS cluster, use the internal IP address of the master node.
`<ranger_admin_port>`	Port number of Ranger Admin. For EMR on ECS deployments, use `6080`.
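Until both placeholders are replaced, the file content is not well-formed XML, because the angle brackets around the placeholder names read as tags. The following Python sketch shows one way to substitute the values and confirm that the result parses. The render and read_properties helpers are hypothetical, the template is shortened to the one property that carries placeholders, and 192.0.2.10 is a documentation-only stand-in address.

```python
import xml.etree.ElementTree as ET

# Only the placeholder-bearing property is shown; the other properties
# from the file above would sit alongside it unchanged.
TEMPLATE = """<configuration>
  <property>
    <name>ranger.plugin.spark.policy.rest.url</name>
    <value>http://<ranger_admin_ip>:<ranger_admin_port></value>
  </property>
</configuration>
"""


def render(template, ranger_admin_ip, ranger_admin_port):
    """Substitute both placeholders and confirm the result is well-formed XML."""
    content = (template
               .replace("<ranger_admin_ip>", ranger_admin_ip)
               .replace("<ranger_admin_port>", str(ranger_admin_port)))
    ET.fromstring(content)  # raises ET.ParseError if a placeholder remains
    return content


def read_properties(content):
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(content)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}


# 192.0.2.10 is a stand-in; use the internal IP address of your Ranger Admin.
rendered = render(TEMPLATE, "192.0.2.10", 6080)
print(read_properties(rendered))
```

Paste the rendered content into the File content field only after the parse step succeeds.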

Step 2: Configure the Spark Thrift Server session

Stop the Spark Thrift Server session before making changes. Select the connection name from the Network Connection drop-down list, and add the following entries in Spark Configuration:

spark.emr.serverless.user.defined.jars     /opt/ranger/ranger-spark.jar
spark.sql.extensions                       org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension

Restart the Spark Thrift Server for the changes to take effect.

Step 3: Verify the configuration

Use Spark Beeline to connect and run a query. For connection instructions, see Connect to Spark Thrift Server.

If a user accesses a resource without sufficient privileges, Ranger returns a permission error similar to the following:

Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test] does not have [update] privilege on [database=default/table=students/column=name]
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
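When you verify several users or tables, it can help to pull the user, privilege, and resource out of the first line of this error instead of reading the full stack trace. The following Python sketch assumes the message keeps the wording shown above. The parse_denial helper is hypothetical, and the pattern is derived only from this sample, not from a documented message format.

```python
import re

# Pattern derived from the AccessControlException message shown above.
DENIED = re.compile(
    r"Permission denied: user \[(?P<user>[^\]]+)\] does not have "
    r"\[(?P<privilege>[^\]]+)\] privilege on \[(?P<resource>[^\]]+)\]"
)


def parse_denial(output):
    """Return user, privilege, and resource parts, or None if no denial is found."""
    match = DENIED.search(output)
    if match is None:
        return None
    # The resource looks like database=default/table=students/column=name.
    resource = {}
    for part in match.group("resource").split("/"):
        key, _, value = part.partition("=")
        resource[key] = value
    return {
        "user": match.group("user"),
        "privilege": match.group("privilege"),
        "resource": resource,
    }


sample = (
    "Error: org.apache.hive.service.cli.HiveSQLException: Error running query: "
    "org.apache.kyuubi.plugin.spark.authz.AccessControlException: "
    "Permission denied: user [test] does not have [update] privilege on "
    "[database=default/table=students/column=name]"
)
print(parse_denial(sample))
```

A result of None means only that the text did not match this pattern. It does not prove that the query was authorized.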

Example 2: Access OSS or OSS-HDFS after migrating from EMR on ECS

Problem: When you migrate non-SQL jobs from EMR on ECS to Serverless Spark, jobs that access OSS or OSS-HDFS fail with UnsupportedFileSystemException unless the job code initializes the file system through SparkContext#hadoopConfiguration. Serverless Spark does not inject core-site.xml by default, so the OSS and OSS-HDFS file system implementations are not registered.

Solution: Create a configuration file named core-site.xml and save it to /etc/spark/conf:

<?xml version="1.0" ?>
<configuration>
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.jindodata.oss.OSS</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>oss-cn-<region>-internal.aliyuncs.com</value>
    </property>
    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
    </property>
    <property>
        <name>fs.oss.credentials.provider</name>
        <value>com.aliyun.jindodata.oss.auth.SimpleCredentialsProvider</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>The AccessKey ID used to access OSS or OSS-HDFS.</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>The AccessKey secret used to access OSS or OSS-HDFS.</value>
    </property>
</configuration>

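Replace <region> with your OSS region, for example, hangzhou. Replace the values of fs.oss.accessKeyId and fs.oss.accessKeySecret with the AccessKey ID and AccessKey secret used to access OSS or OSS-HDFS.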
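If you maintain this file for more than one region, generating the content from a few parameters avoids hand-editing mistakes. The following Python sketch builds the same six properties as the example and escapes the values so that characters such as & stay valid in XML. The build_core_site helper and the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variable names are assumptions made for illustration, not names defined by EMR.

```python
import os
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_core_site(region, access_key_id, access_key_secret):
    """Return core-site.xml content with the same properties as the example above."""
    properties = {
        "fs.AbstractFileSystem.oss.impl": "com.aliyun.jindodata.oss.OSS",
        "fs.oss.endpoint": f"oss-cn-{region}-internal.aliyuncs.com",
        "fs.oss.impl": "com.aliyun.jindodata.oss.JindoOssFileSystem",
        "fs.oss.credentials.provider":
            "com.aliyun.jindodata.oss.auth.SimpleCredentialsProvider",
        "fs.oss.accessKeyId": access_key_id,
        "fs.oss.accessKeySecret": access_key_secret,
    }
    lines = ['<?xml version="1.0" ?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "    <property>",
            f"        <name>{escape(name)}</name>",
            f"        <value>{escape(value)}</value>",
            "    </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# The environment variable names are hypothetical; the fallbacks are dummy
# values so that the sketch runs without real credentials.
content = build_core_site(
    region="hangzhou",
    access_key_id=os.environ.get("OSS_ACCESS_KEY_ID", "example-id"),
    access_key_secret=os.environ.get("OSS_ACCESS_KEY_SECRET", "example-secret"),
)
ET.fromstring(content)  # raises ET.ParseError if the content is not well-formed
print(content)
```

The generated text still has to be pasted into the File content field of a custom configuration file named core-site.xml. Avoid storing real AccessKey values in scripts or version control.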