E-MapReduce: Troubleshoot issues related to Spark jobs

Last Updated: Mar 26, 2026

Spark job failures on E-MapReduce (EMR) typically fall into six categories: memory exhaustion, file format incompatibilities, shuffle errors, external data source connection failures, job hangs, and code or runtime errors. Use the error message or symptom to locate the relevant section, then apply the fix.

Memory-related issues

Memory errors are the most common cause of Spark job failures on EMR. Before adjusting parameters, identify which component (driver or executor) is out of memory and whether the cause is heap, off-heap, or OSS I/O buffering.

Is the container being killed by YARN for exceeding memory limits?

Cause: The memory requested at job submission is too low. During JVM startup, Spark may use significantly more memory than was allocated, especially off-heap memory, which causes the YARN NodeManager to kill the container.

Fix: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.driver.memoryOverhead or spark.executor.memoryOverhead, depending on whether the driver or an executor container was killed.
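As a rough illustration of why the overhead setting matters, the following sketch estimates how much memory is requested from YARN for one container: the JVM heap plus the overhead. It assumes the open-source Spark default (10% of the heap, with a 384 MiB floor) when no explicit overhead is configured. The function name is hypothetical, and YARN's rounding to its allocation increment is not modeled.

```python
from typing import Optional

MIN_OVERHEAD_MIB = 384   # assumed: open-source Spark default floor for memoryOverhead
OVERHEAD_FACTOR = 0.10   # assumed: open-source Spark default fraction of the heap


def container_request_mib(heap_mib: int, overhead_mib: Optional[int] = None) -> int:
    """Estimate the memory (MiB) requested from YARN for one driver or executor.

    heap_mib     -- value of spark.driver.memory or spark.executor.memory
    overhead_mib -- value of the matching memoryOverhead setting, if configured
    """
    if overhead_mib is None:
        # No explicit overhead: fall back to the assumed default formula.
        overhead_mib = max(int(heap_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)
    return heap_mib + overhead_mib


if __name__ == "__main__":
    print(container_request_mib(4096))        # heap plus default overhead
    print(container_request_mib(4096, 2048))  # heap plus an explicit, larger overhead
```

If the process uses more than this total, for example because of off-heap allocations, the NodeManager kills the container. Raising the overhead enlarges the container without changing the heap.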

Is the container killed by YARN specifically when reading from or writing to OSS?

Cause: OSS I/O operations require additional memory for read-ahead and write buffers. If executor memory is already near its limit, these buffers push the container over the YARN memory ceiling.

Fix: First, try increasing Spark executor memory. If that is not possible, reduce OSS buffer sizes by modifying the following parameters on the core-site.xml tab of the Hadoop-Common service page in the EMR console:

| Parameter | Value |
| --- | --- |
| fs.oss.read.readahead.buffer.count | 0 |
| fs.oss.read.buffer.size | 16384 |
| fs.oss.write.buffer.size | 16384 |
| fs.oss.memory.buffer.size.max.mb | 512 |

Is the error "Java heap space" appearing?

Cause: The job processes more data than fits in the JVM heap, resulting in an out-of-memory (OOM) error.

Fix: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.executor.memory or spark.driver.memory based on the volume of data your job processes.

Does an OOM error occur when reading Snappy-compressed files?

In the EMR console, go to the Spark service page and open the spark-defaults.conf tab. Add the following parameter:

spark.hadoop.io.compression.codec.snappy.native = true

Is the Spark driver running out of memory?

The driver OOM can have multiple causes. Try the following fixes in order:

  1. Increase driver memory: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.driver.memory.

  2. Avoid large collect operations: Check whether your code uses collect to pull large datasets to the driver. If so, switch to foreachPartition to process data in executors instead, and remove the collect calls.

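  3. Disable broadcast joins: Set spark.sql.autoBroadcastJoinThreshold to -1. Spark builds each broadcast table on the driver before sending it to the executors, so large broadcasts add to driver memory pressure.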

Is the Spark executor running out of memory?

Try the following fixes:

  • Increase executor memory: In the EMR console, go to the Spark service page and open the Configure tab. Increase spark.executor.memory.

  • Reduce executor cores: Decrease spark.executor.cores so each executor handles fewer concurrent tasks, reducing per-task memory pressure.

  • Increase parallelism: Increase spark.default.parallelism and spark.sql.shuffle.partitions to spread data across more partitions.
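To see why fewer cores per executor relieve memory pressure, the following sketch estimates the share of execution and storage memory available to each concurrently running task. It assumes the open-source Spark unified memory model with its defaults (300 MiB reserved, spark.memory.fraction of 0.6) and an even split between tasks. The function name is hypothetical, and the result is only an approximation.

```python
RESERVED_MIB = 300      # assumed: fixed reservation in Spark's unified memory manager
MEMORY_FRACTION = 0.6   # assumed: default value of spark.memory.fraction


def per_task_memory_mib(executor_memory_mib: int, executor_cores: int) -> float:
    """Approximate the unified memory (MiB) available to each concurrent task."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    # Memory shared by all tasks running in the executor at the same time.
    unified = (executor_memory_mib - RESERVED_MIB) * MEMORY_FRACTION
    # One task slot per core, so the share shrinks as cores increase.
    return unified / executor_cores


if __name__ == "__main__":
    for cores in (4, 2, 1):
        print(cores, round(per_task_memory_mib(4096, cores), 1))
```

Halving spark.executor.cores roughly doubles the memory available to each task, at the cost of fewer tasks running in parallel per executor.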

File format issues

Does Hive or Impala fail to read a Parquet table that was imported by Spark?

Error message:

Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file xxx

Cause: Hive and Spark use different Parquet conventions for the DECIMAL data type. Parquet files written by Spark are not compatible with the Hive or Impala reader by default.

Fix:

  1. In the EMR console, go to the Spark service page and open the spark-defaults.conf tab. Add the following parameter:

    spark.sql.parquet.writeLegacyFormat = true

  2. Re-import the Parquet data using Spark.

  3. Run your Hive or Impala job to read the data.

Shuffle-related issues

Does the error "java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE" occur?

Cause: The number of shuffle partitions is too small, so individual shuffle blocks grow beyond Integer.MAX_VALUE bytes (about 2 GB), which Spark cannot handle.

Fix: Use either of the following approaches:

  • Increase spark.default.parallelism and spark.sql.shuffle.partitions to create more, smaller partitions.

  • Call repartition before the shuffle operation to pre-split the data.
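One way to choose a partition count is to divide the total shuffle size by a target size per partition. The following sketch does that calculation. The 128 MiB target and the function name are assumptions for illustration, and the estimate presumes that data is spread evenly, so a heavily skewed key can still produce an oversized block.

```python
MAX_BLOCK_BYTES = 2**31 - 1            # Integer.MAX_VALUE, the limit named in the error
DEFAULT_TARGET_BYTES = 128 * 1024**2   # assumed target size per partition (128 MiB)


def suggest_shuffle_partitions(total_shuffle_bytes: int,
                               target_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Suggest a partition count that keeps each partition near target_bytes."""
    if not 0 < target_bytes <= MAX_BLOCK_BYTES:
        raise ValueError("target_bytes must be between 1 and Integer.MAX_VALUE")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    partitions = (total_shuffle_bytes + target_bytes - 1) // target_bytes
    return max(1, partitions)


if __name__ == "__main__":
    one_tib = 1024**4
    print(suggest_shuffle_partitions(one_tib))
```

The result can be used as a starting value for spark.sql.shuffle.partitions and spark.default.parallelism, or as the argument to repartition.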

External data source issues

Does the error "java.sql.SQLException: No suitable driver found for jdbc:mysql:xxx" occur?

The mysql-connector-java version in use is outdated. Update it to a version later than 5.1.48.

Does the error "Invalid authorization specification, message from server: ip not in whitelist" occur when connecting to ApsaraDB RDS?

Add the internal IP addresses of all EMR cluster nodes to the whitelist of ApsaraDB RDS.

Job hang issues

Are no jobs appearing on the Spark UI, or do all jobs finish but the task keeps running?

  1. Open the Spark web UI and go to the Executors page, which lists the driver alongside the executors.

  2. Open the thread dump of the Spark driver and analyze it.

  3. If you see a large number of ORC-related threads, set the following parameter and restart the task:

    --conf spark.hadoop.hive.exec.orc.split.strategy=BI

  4. For Spark 2.X only: if spark.sql.adaptive.enabled is set to true, change it to false.
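The settings from steps 3 and 4 can be passed as --conf flags when the job is resubmitted. The following sketch assembles such a spark-submit command line. The helper name and the application file my_job.py are hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex
from typing import Dict, List


def spark_submit_command(app: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per setting."""
    cmd = ["spark-submit"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # the application file goes after all options
    return cmd


if __name__ == "__main__":
    cmd = spark_submit_command(
        "my_job.py",  # hypothetical application file
        {
            "spark.hadoop.hive.exec.orc.split.strategy": "BI",
            "spark.sql.adaptive.enabled": "false",  # Spark 2.X only, per step 4
        },
    )
    print(shlex.join(cmd))
```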

Is the Spark task not ending and the Spark web UI inaccessible?

Cause: The Spark driver has insufficient memory. Full garbage collection (GC) is blocking all progress.

Fix: Increase spark.driver.memory.

Code and runtime errors

Does the error "NoSuchDatabaseException: Database 'xxx' not found" occur when reading Hive data?

Check both of the following:

  • Missing `.enableHiveSupport()`: Verify that .enableHiveSupport() is called when you initialize the SparkSession. If not, add it.

  • Direct `new SparkContext()` call: If your code calls new SparkContext(), remove it and obtain the SparkContext from the SparkSession instead.
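Both checks can be illustrated with a short PySpark sketch: the session is created with Hive support, and the SparkContext is taken from the session instead of being constructed directly. The function name and application name are hypothetical, and the sketch assumes PySpark is installed and the cluster is configured with a Hive metastore.

```python
def build_hive_session(app_name: str = "hive-reader"):
    """Create (or reuse) a SparkSession that can see Hive databases."""
    # Imported inside the function so the module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()   # without this call, Hive databases are not visible
        .getOrCreate()
    )


if __name__ == "__main__":
    spark = build_hive_session()
    sc = spark.sparkContext    # reuse the session's context; do not create a new one
    spark.sql("SHOW DATABASES").show()
    spark.stop()
```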

Does the error "java.lang.ClassNotFoundException" occur in the Spark job?

Identify the specific JAR package from the class name in the error, then use one of the following methods to make it available:

  • Method 1: Submit the JAR using the --jars flag when you submit the Spark job.

  • Method 2: Specify the JAR path in spark.driver.extraClassPath and spark.executor.extraClassPath. The JAR must be present at that path on every node in the EMR cluster.
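If it is not obvious which JAR provides the missing class, one option is to search a directory of JAR files for the corresponding .class entry. The following sketch does this with the standard library, relying on the fact that a JAR is a ZIP archive. The function name is hypothetical, and the directory to search depends on your environment.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_containing(class_name: str, jar_dir: str) -> List[str]:
    """Return the JAR files in jar_dir that contain the given class.

    class_name -- fully qualified name from the error, such as com.example.Foo
    """
    # A class com.example.Foo is stored in the archive as com/example/Foo.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    matches.append(str(jar))
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
    return matches


if __name__ == "__main__":
    import sys

    # Usage: python find_jar.py <fully.qualified.ClassName> <directory>
    for path in jars_containing(sys.argv[1], sys.argv[2]):
        print(path)
```

A match can then be passed to --jars or added to the extraClassPath settings.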

Parameter reference

The following table lists the tuning parameters mentioned in this topic, where to set each one, and the recommended value or action.

| Parameter | Location | Value or action |
| --- | --- | --- |
| spark.driver.memoryOverhead | Spark service page > Configure tab | Increase |
| spark.executor.memoryOverhead | Spark service page > Configure tab | Increase |
| spark.driver.memory | Spark service page > Configure tab | Increase |
| spark.executor.memory | Spark service page > Configure tab | Increase |
| spark.executor.cores | Spark service page > Configure tab | Decrease |
| spark.default.parallelism | Spark service page > Configure tab | Increase |
| spark.sql.shuffle.partitions | Spark service page > Configure tab | Increase |
| spark.sql.autoBroadcastJoinThreshold | Spark service page > Configure tab | -1 |
| spark.hadoop.io.compression.codec.snappy.native | Spark service page > spark-defaults.conf tab | true |
| spark.sql.parquet.writeLegacyFormat | Spark service page > spark-defaults.conf tab | true |
| spark.hadoop.hive.exec.orc.split.strategy | Job submission flag --conf | BI |
| spark.sql.adaptive.enabled (Spark 2.X) | Spark service page > Configure tab | false |
| fs.oss.read.readahead.buffer.count | Hadoop-Common service page > core-site.xml tab | 0 |
| fs.oss.read.buffer.size | Hadoop-Common service page > core-site.xml tab | 16384 |
| fs.oss.write.buffer.size | Hadoop-Common service page > core-site.xml tab | 16384 |
| fs.oss.memory.buffer.size.max.mb | Hadoop-Common service page > core-site.xml tab | 512 |