Troubleshoot issues related to Spark jobs - E-MapReduce - Alibaba Cloud Documentation Center

This topic describes how to troubleshoot issues related to Spark jobs.

Memory-related issues

The error "Container killed by YARN for exceeding memory limits" occurs.

Cause: The amount of memory that is requested when you submit an application is low. The Java Virtual Machine (JVM) occupies more memory than the allocated amount during startup. As a result, the job is abnormally terminated by YARN NodeManager. In particular, Spark jobs may consume a large amount of off-heap memory and are more likely to be abnormally terminated.
Solution: On the Configure tab of the Spark service page in the E-MapReduce (EMR) console, increase the value of the spark.driver.memoryOverhead or spark.executor.memoryOverhead parameter.

The error "Container killed by YARN for exceeding memory limits" occurs when a Spark job is run to read data from or write data to Object Storage Service (OSS) objects.

Cause: The amount of memory that is required when you read data from or write data to OSS exceeds the upper limit.
Solution: Increase the memory of the Spark executor. If the memory of the Spark executor cannot be increased, you can modify the following parameters on the core-site.xml tab of the Hadoop-Common service page in the EMR console:
- fs.oss.read.readahead.buffer.count: 0
- fs.oss.read.buffer.size: 16384
- fs.oss.write.buffer.size: 16384
- fs.oss.memory.buffer.size.max.mb: 512

The error "Error: Java heap space" occurs.

Cause: The Spark job has large amounts of data to process, but the JVM has insufficient memory. As a result, an out-of-memory (OOM) error is returned.
Solution: On the Configure tab of the Spark service page in the EMR console, increase the value of the spark.executor.memory or spark.driver.memory parameter based on your business requirements.

An OOM error occurs when a Spark job is run to read Snappy files.

On the spark-defaults.conf tab of the Spark service page in the EMR console, add the spark.hadoop.io.compression.codec.snappy.native parameter and set the value to true.

An OOM error occurs due to insufficient memory of the Spark driver.

You can use one of the following solutions to troubleshoot the issue:

On the Configure tab of the Spark service page in the EMR console, increase the value of the spark.driver.memory parameter.
Check whether operations such as collect operations are performed to pull data to the Spark driver. If the amount of data to be collected is large, we recommend that you use foreachPartitions to perform operations in the executor and remove code that is related to collect operations.
Set the spark.sql.autoBroadcastJoinThreshold parameter to -1.

An OOM error occurs due to insufficient memory of the Spark executor.

You can use one of the following solutions to troubleshoot the issue:

On the Configure tab of the Spark service page in the EMR console, increase the value of the spark.executor.memory parameter.
Decrease the value of the spark.executor.cores parameter.
Increase the values of the spark.default.parallelism and spark.sql.shuffle.partitions parameters.

Issues related to file formats

An error occurs when a Hive or Impala job is run to read a Parquet table that is imported by using Spark.

Error message: Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file xxx.
Cause: The DECIMAL data type has different representations in the different Parquet conventions that are used in Hive and Spark. Parquet data that is imported by using Spark cannot be properly read by using Hive or Impala.
Solution: Add the spark.sql.parquet.writeLegacyFormat parameter and set the value to true on the spark-defaults.conf tab of the Spark service page in the EMR console. Use Spark to import the Parquet data again. Then, you can run a Hive or Impala job to read the Parquet data.

Shuffle-related issues

The error "java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE" occurs.

Cause: During data shuffling, the number of partitions is excessively small. As a result, the block size exceeds the value of the Integer.MAX_VALUE parameter.
Solution: Increase the number of partitions by setting the spark.default.parallelism and spark.sql.shuffle.partitions parameters to larger values. You can also perform the repartition operation before you perform data shuffling.

Issues related to external data sources

The error "java.sql.SQLException: No suitable driver found for jdbc:mysql:xxx" occurs.

An early version of mysql-connector-java is used. Update it to a version later than 5.1.48.

The error "Invalid authorization specification, message from server: ip not in whitelist" occurs when Spark is connected to ApsaraDB RDS.

Check the whitelist settings of ApsaraDB RDS and add the internal IP addresses of all EMR cluster nodes into the whitelist of ApsaraDB RDS.

Other issues

No jobs are allocated on the Spark UI, or all jobs displayed on the Spark UI finish running, but your Spark task does not end for a long period of time.

Go to the web UI of Spark. Find the Spark executor process and analyze the thread dump of the driver. If a large number of ORC-related threads exist, set the --conf spark.hadoop.hive.exec.orc.split.strategy parameter to BI and restart the Spark task. For Spark 2.X, you must check whether the value of the spark.sql.adaptive.enabled parameter is true. If the value is true, change the value to false.

A Spark task does not end for a long period of time, and the web UI of Spark cannot be accessed.

Cause: The memory of the Spark driver is insufficient, and full garbage collection (GC) occurs.
Solution: Increase the value of the spark.driver.memory parameter.

The error "NoSuchDatabaseException: Database 'xxx' not found" occurs when Spark runs code to read Hive data.

Check whether the .enableHiveSupport() method is called when you initialize a Spark session. If the .enableHiveSupport() method is not called, manually call the method.
Check whether the new SparkContext() method is called in the code. If the new SparkContext() method is called, remove the code related to the method and obtain the Spark context from the Spark session.

The error "java.lang.ClassNotFoundException" occurs in the Spark job.

Find the specific JAR package based on the class information and process the JAR package by using one of the following methods:

Method 1: When you submit a Spark job, use the --jars command to submit the JAR package.
Method 2: Specify the path of the JAR package in the spark.driver.extraclasspath and spark.executor.extraclasspath parameters. This method requires that the JAR package exists on each node of the EMR cluster.