Spark job failures on E-MapReduce (EMR) typically fall into six categories: memory exhaustion, file format incompatibilities, shuffle errors, external data source connection failures, job hangs, and code or runtime errors. Use the error message or symptom to locate the relevant section, then apply the fix.
Memory-related issues
Memory errors are the most common cause of Spark job failures on EMR. Before adjusting parameters, identify which component (driver or executor) is out of memory and whether the cause is heap, off-heap, or OSS I/O buffering.
Is the container being killed by YARN for exceeding memory limits?
Cause: The memory allocated at job submission is too low. During JVM startup, Spark may consume significantly more memory than the allocated amount—especially off-heap memory—causing YARN NodeManager to terminate the container.
Fix: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.driver.memoryOverhead or spark.executor.memoryOverhead.
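To judge how much overhead to add, it can help to estimate the container size that YARN enforces. The sketch below assumes Spark's documented default for memoryOverhead when the parameter is unset (10% of the heap, with a 384 MiB floor). The function name is illustrative, and YARN may round the request up to its allocation increment, which is not modeled here.

```python
def container_request_mib(heap_mib, overhead_mib=None):
    """Estimate the memory (MiB) Spark requests from YARN for one driver or executor."""
    if overhead_mib is None:
        # Assumed default: 10% of the heap, but never less than 384 MiB.
        overhead_mib = max(int(heap_mib * 0.10), 384)
    return heap_mib + overhead_mib


# A 4 GiB executor with the default overhead, then with an explicit 1 GiB overhead.
print(container_request_mib(4096))        # 4505
print(container_request_mib(4096, 1024))  # 5120
```

If the NodeManager log reports usage only slightly above the first figure, raising the overhead is usually a smaller change than raising the heap.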
Is the container killed by YARN specifically when reading from or writing to OSS?
Cause: OSS I/O operations require additional memory for read-ahead and write buffers. If executor memory is already near its limit, these buffers push the container over the YARN memory ceiling.
Fix: First, try increasing Spark executor memory. If that is not possible, reduce OSS buffer sizes by modifying the following parameters on the core-site.xml tab of the Hadoop-Common service page in the EMR console:
| Parameter | Value |
|---|---|
| fs.oss.read.readahead.buffer.count | 0 |
| fs.oss.read.buffer.size | 16384 |
| fs.oss.write.buffer.size | 16384 |
| fs.oss.memory.buffer.size.max.mb | 512 |
Is the error "Java heap space" appearing?
Cause: The job processes more data than fits in the JVM heap, resulting in an out-of-memory (OOM) error.
Fix: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.executor.memory or spark.driver.memory based on the volume of data your job processes.
Does an OOM error occur when reading Snappy-compressed files?
In the EMR console, go to the Spark service page and open the spark-defaults.conf tab. Add the following parameter:
spark.hadoop.io.compression.codec.snappy.native = true
Is the Spark driver running out of memory?
The driver OOM can have multiple causes. Try the following fixes in order:
1. Increase driver memory: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.driver.memory.
2. Avoid large collect operations: Check whether your code uses `collect` to pull large datasets to the driver. If so, switch to `foreachPartition` to process data in executors instead, and remove the `collect` calls.
3. Disable broadcast joins: Set spark.sql.autoBroadcastJoinThreshold to -1 so that the driver no longer has to collect and hold the table being broadcast.
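The switch from `collect` to `foreachPartition` described above might look like the following PySpark sketch. The `sink` callable and both function names are hypothetical stand-ins for whatever your job does with each row. The sketch also assumes that `sink` can be serialized and shipped to the executors.

```python
from itertools import islice


def write_partition(rows, sink, batch_size=1000):
    """Consume one partition's rows in fixed-size batches and pass each batch to `sink`."""
    rows = iter(rows)
    written = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        sink(batch)  # hypothetical writer, for example a bulk insert into an external store
        written += len(batch)
    return written


def export_without_collect(df, sink):
    """Process `df` on the executors instead of pulling every row to the driver."""
    # Before: sink(df.collect())  # materializes the whole dataset in driver memory

    def handle(rows):
        # Runs on an executor; `rows` is an iterator over one partition.
        write_partition(rows, sink)

    df.foreachPartition(handle)
```

Batching keeps only `batch_size` rows in memory per task, so executor memory use stays bounded even for large partitions.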
Is the Spark executor running out of memory?
Try the following fixes:
- Increase executor memory: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.executor.memory.
- Reduce executor cores: Decrease spark.executor.cores so each executor handles fewer concurrent tasks, reducing per-task memory pressure.
- Increase parallelism: Increase spark.default.parallelism and spark.sql.shuffle.partitions to spread data across more partitions.
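The effect of lowering spark.executor.cores can be estimated with simple arithmetic. The sketch below assumes Spark's unified memory model with its documented defaults (300 MiB reserved, spark.memory.fraction of 0.6) and an even split among concurrently running tasks. Real usage varies with the workload, so treat the result as a rough guide. The function name is illustrative.

```python
def per_task_pool_mib(executor_memory_mib, executor_cores, memory_fraction=0.6):
    """Rough share (MiB) of the unified execution/storage pool per concurrently running task."""
    reserved_mib = 300  # assumed: heap that Spark sets aside for its own internals
    pool_mib = (executor_memory_mib - reserved_mib) * memory_fraction
    return int(pool_mib / executor_cores)


# Compare an 8 GiB executor running 4 tasks at once with the same executor running 2.
for cores in (4, 2):
    print(cores, per_task_pool_mib(8192, cores))
```

Halving the cores roughly doubles each task's share without changing the container size, at the cost of lower task concurrency per executor.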
File format issues
Does Hive or Impala fail to read a Parquet table that was imported by Spark?
Error message:
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file xxx
Cause: Hive and Spark use different Parquet conventions for the DECIMAL data type. Parquet files written by Spark are not compatible with the Hive or Impala reader by default.
Fix:
1. In the EMR console, go to the Spark service page and open the spark-defaults.conf tab. Add the following parameter:
   spark.sql.parquet.writeLegacyFormat = true
2. Re-import the Parquet data using Spark.
3. Run your Hive or Impala job to read the data.
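As an alternative to the cluster-wide setting, the re-import step can be sketched in PySpark with the parameter set on the session that performs the write. The function name, table name, and target path are hypothetical, and the sketch assumes that a per-session setting is acceptable for your job.

```python
def rewrite_for_hive(spark, source_table, target_path):
    """Rewrite `source_table` as Parquet in the legacy layout that Hive and Impala can read."""
    # Applies to writes issued by this session after the call.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    # Write to a new location rather than over the files the table is read from.
    spark.table(source_table).write.mode("overwrite").parquet(target_path)
```

Files written before the parameter was enabled are not converted, which is why the data has to be written again.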
Shuffle-related issues
Does the error "java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE" occur?
Cause: The number of shuffle partitions is too small. Individual shuffle blocks grow beyond Integer.MAX_VALUE bytes (about 2 GB), which Spark cannot handle.
Fix: Use either of the following approaches:
- Increase spark.default.parallelism and spark.sql.shuffle.partitions to create more, smaller partitions.
- Call `repartition` before the shuffle operation to pre-split the data.
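A starting value for the partition count can be derived from the shuffle volume. The sketch below assumes a target of 128 MiB per partition, which is an illustrative choice and not a value from this topic. It also assumes data is spread evenly, so skewed keys can still produce oversized blocks.

```python
import math

INT_MAX_BYTES = 2**31 - 1  # Integer.MAX_VALUE, the size limit named in the error


def partitions_for(total_shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Smallest partition count that keeps the average partition at or below the target size."""
    return max(1, math.ceil(total_shuffle_bytes / target_partition_bytes))


one_tib = 1024**4
# With 200 partitions (the Spark default for spark.sql.shuffle.partitions),
# the average partition of a 1 TiB shuffle is already over the limit.
print(one_tib // 200 > INT_MAX_BYTES)  # True
print(partitions_for(one_tib))         # 8192
```

The result can be used for spark.sql.shuffle.partitions or passed to `repartition`.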
External data source issues
Does the error "java.sql.SQLException: No suitable driver found for jdbc:mysql:xxx" occur?
The mysql-connector-java version in use is outdated. Update it to a version later than 5.1.48.
Does the error "Invalid authorization specification, message from server: ip not in whitelist" occur when connecting to ApsaraDB RDS?
Add the internal IP addresses of all EMR cluster nodes to the whitelist of ApsaraDB RDS.
Job hang issues
Are no jobs appearing on the Spark UI, or do all jobs finish but the task keeps running?
1. Open the Spark web UI and locate the Spark executor process.
2. Analyze the thread dump of the Spark driver.
3. If you see a large number of ORC-related threads, set the following parameter and restart the task:
   --conf spark.hadoop.hive.exec.orc.split.strategy=BI
4. For Spark 2.X only: if spark.sql.adaptive.enabled is set to true, change it to false.
Is the Spark task not ending and the Spark web UI inaccessible?
Cause: The Spark driver has insufficient memory. Full garbage collection (GC) is blocking all progress.
Fix: Increase spark.driver.memory.
Code and runtime errors
Does the error "NoSuchDatabaseException: Database 'xxx' not found" occur when reading Hive data?
Check both of the following:
- Missing `.enableHiveSupport()`: Verify that `.enableHiveSupport()` is called when you initialize the SparkSession. If not, add it.
- Direct `new SparkContext()` call: If your code calls `new SparkContext()`, remove it and obtain the SparkContext from the SparkSession instead.
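Both checks can be satisfied with a single initialization routine. The following PySpark sketch shows one way to do it. The function name is hypothetical, and in PySpark the equivalent of `new SparkContext()` is a direct `SparkContext()` call.

```python
def create_hive_session(app_name):
    """Build a Hive-enabled SparkSession and return it together with its SparkContext."""
    # Imported here so the module can be loaded on machines without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # lets the session resolve databases from the Hive metastore
        .getOrCreate()
    )
    # Reuse the session's context instead of constructing a second SparkContext.
    return spark, spark.sparkContext
```

Creating the context only through the session avoids ending up with a SparkContext that was started before Hive support was configured.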
Does the error "java.lang.ClassNotFoundException" occur in the Spark job?
Identify the specific JAR package from the class name in the error, then use one of the following methods to make it available:
- Method 1: Submit the JAR using the `--jars` flag when you submit the Spark job.
- Method 2: Specify the JAR path in spark.driver.extraClassPath and spark.executor.extraClassPath. The JAR must be present on every node in the EMR cluster.
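The two methods translate into spark-submit arguments as in the sketch below. The helper name and the file names used in the example are hypothetical. The function only builds the argument list and does not launch the job.

```python
def spark_submit_args(app, jars=(), node_local_classpath=None):
    """Build a spark-submit argument list that makes extra JARs visible to the job."""
    args = ["spark-submit"]
    if jars:
        # Method 1: --jars takes a comma-separated list and ships the files with the job.
        args += ["--jars", ",".join(jars)]
    if node_local_classpath:
        # Method 2: the path must already exist on every node of the cluster.
        args += [
            "--conf", f"spark.driver.extraClassPath={node_local_classpath}",
            "--conf", f"spark.executor.extraClassPath={node_local_classpath}",
        ]
    return args + [app]


print(spark_submit_args("job.py", jars=["a.jar", "b.jar"]))
```

Method 1 is usually easier to maintain because the JARs travel with the job instead of being installed on each node.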
Parameter reference
The following table lists the parameters mentioned in this topic, where to set each one, and the value or adjustment to apply.
| Parameter | Location | Value or adjustment |
|---|---|---|
| spark.driver.memoryOverhead | Spark service page > Configure tab | Increase |
| spark.executor.memoryOverhead | Spark service page > Configure tab | Increase |
| spark.driver.memory | Spark service page > Configure tab | Increase |
| spark.executor.memory | Spark service page > Configure tab | Increase |
| spark.executor.cores | Spark service page > Configure tab | Decrease |
| spark.default.parallelism | Spark service page > Configure tab | Increase |
| spark.sql.shuffle.partitions | Spark service page > Configure tab | Increase |
| spark.sql.autoBroadcastJoinThreshold | Spark service page > Configure tab | -1 |
| spark.hadoop.io.compression.codec.snappy.native | Spark service page > spark-defaults.conf tab | true |
| spark.sql.parquet.writeLegacyFormat | Spark service page > spark-defaults.conf tab | true |
| spark.hadoop.hive.exec.orc.split.strategy | Job submission flag --conf | BI |
| spark.sql.adaptive.enabled (Spark 2.X) | Spark service page > Configure tab | false |
| fs.oss.read.readahead.buffer.count | Hadoop-Common service page > core-site.xml tab | 0 |
| fs.oss.read.buffer.size | Hadoop-Common service page > core-site.xml tab | 16384 |
| fs.oss.write.buffer.size | Hadoop-Common service page > core-site.xml tab | 16384 |
| fs.oss.memory.buffer.size.max.mb | Hadoop-Common service page > core-site.xml tab | 512 |