This topic describes how to troubleshoot and resolve Hive job failures.
Failure troubleshooting
If you encounter job failures or performance issues on the client, follow these steps:
-
Check Hive client logs.
-
For jobs submitted through the Hive CLI, client logs are located at /tmp/hive/$USER/hive.log or /tmp/$USER/hive.log on the cluster or Gateway node.
-
For jobs submitted through Hive Beeline or JDBC, logs are in the HiveServer service logs (typically in /var/log/emr/hive or /mnt/disk1/log/hive).
-
Check YARN Application logs for the Hive job using the yarn command.
yarn logs -applicationId application_xxx_xxx -appOwner userName
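As a rough aid for the log checks above, the following Python sketch matches log text against error strings quoted elsewhere in this topic and returns the title of the matching section. The `SIGNATURES` table and the `classify` function are hypothetical helpers written for illustration; they are not part of EMR, Hive, or YARN.

```python
# Substrings are copied from the error logs quoted in this topic.
# SIGNATURES and classify() are hypothetical helpers for illustration only.
SIGNATURES = [
    ("Container killed by YARN for exceeding memory limits",
     "Container killed by YARN due to excessive memory usage"),
    ("MapTask$MapOutputBuffer.init",
     "OOM caused by SortBuffer set too large"),
    ("GroupByOperator.processHashAggr",
     "OOM caused by certain GroupBy statements"),
    # Generic OOM check comes after the more specific OOM signatures.
    ("java.lang.OutOfMemoryError",
     "Out-of-memory (OOM) errors due to insufficient container memory"),
    ("SocketTimeoutException: Read timeout",
     "Timeout when dropping large partitioned tables"),
    ("could not be cleaned up",
     "insert overwrite with dynamic partitions causes job failure"),
    ("Too many counters",
     'How to handle "Too many counters: 121 max=120"?'),
]


def classify(log_text):
    """Return the first matching section title, or None if nothing matches."""
    for needle, section in SIGNATURES:
        if needle in log_text:
            return section
    return None


print(classify("java.lang.OutOfMemoryError: Java heap space"))
```

One way to use it is to redirect the output of the yarn logs command to a file and pass the file contents to `classify`.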
Memory-related errors
Out-of-memory (OOM) errors due to insufficient container memory
Error logs: java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space.
Solution: Increase container memory. For Hive on MapReduce (MR) jobs, also increase the JVM heap size.
-
Hive on MR: On the YARN service configuration page, click the mapred-site.xml tab and increase mapper and reducer memory.
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=4096
Also update the JVM parameter -Xmx in mapreduce.map.java.opts and mapreduce.reduce.java.opts to 80% of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
mapreduce.map.java.opts=-Xmx3276m (keep other parameters unchanged)
mapreduce.reduce.java.opts=-Xmx3276m (keep other parameters unchanged)
-
Hive on Tez
-
If the Tez container runs out of memory, on the Hive service configuration page, click the hive-site.xml tab and increase Tez container memory.
hive.tez.container.size=4096
-
If the Tez AM runs out of memory, on the Tez service configuration page, click the tez-site.xml tab and increase Tez AM memory.
tez.am.resource.memory.mb=4096
-
Hive on Spark: Increase Spark executor memory in spark-defaults.conf.
spark.executor.memory=4g
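For the Hive on MR settings, the 80% rule can be worked out with a few lines of Python. This is a minimal sketch: `heap_opts_for` is a hypothetical helper name, and integer percentages are used to avoid floating-point rounding.

```python
def heap_opts_for(container_mb, heap_percent=80):
    """Return an -Xmx value sized to heap_percent of the container memory."""
    heap_mb = container_mb * heap_percent // 100  # round down to whole MB
    return f"-Xmx{heap_mb}m"


print(heap_opts_for(4096))  # -Xmx3276m
```

With a 4096 MB container this reproduces the -Xmx3276m value shown in the Hive on MR example.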
Container killed by YARN due to excessive memory usage
Error log: Container killed by YARN for exceeding memory limits.
Root cause: The Hive task uses more memory (including JVM heap, off-heap memory, and child processes) than requested from YARN. For example, in Hive on MR, if the Map Task JVM heap size (mapreduce.map.java.opts=-Xmx4g) exceeds the YARN memory allocation (mapreduce.map.memory.mb=3072, or 3 GB), YARN NodeManager kills the container.
Solution:
-
For Hive on MR jobs, increase mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, ensuring they are at least 1.25 times the -Xmx values in mapreduce.map.java.opts and mapreduce.reduce.java.opts.
-
For Hive on Spark jobs, you can increase the value of the spark.executor.memoryOverhead parameter and ensure it is at least 25% of the value of the spark.executor.memory parameter.
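The two ratios above can be checked with a short Python sketch. The helper names (`xmx_to_mb`, `min_container_mb`, `min_spark_overhead_mb`) are hypothetical, and the parser assumes the -Xmx value carries a `g` or `m` suffix as in the examples in this topic.

```python
import math
import re


def xmx_to_mb(java_opts):
    """Extract the -Xmx value from a java.opts string and convert it to MB."""
    match = re.search(r"-Xmx(\d+)([gGmM])", java_opts)
    if match is None:
        raise ValueError("no -Xmx<N>g or -Xmx<N>m setting found")
    value, unit = int(match.group(1)), match.group(2).lower()
    return value * 1024 if unit == "g" else value


def min_container_mb(java_opts, ratio=1.25):
    """Smallest *.memory.mb value that is at least ratio times the heap."""
    return math.ceil(xmx_to_mb(java_opts) * ratio)


def min_spark_overhead_mb(executor_memory_mb, fraction=0.25):
    """Smallest memoryOverhead that is at least fraction of executor memory."""
    return math.ceil(executor_memory_mb * fraction)


print(min_container_mb("-Xmx4g"))   # 5120
print(min_spark_overhead_mb(4096))  # 1024
```

For the root-cause example, a 4 GB heap needs a container of at least 5120 MB, so the 3072 MB allocation is too small.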
OOM caused by SortBuffer set too large
-
Error log:
Error running child: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:986)
-
Root cause: The sort buffer is too large for the Hive task container. For example, container memory is set to 1300 MB, but the sort buffer is set to 1024 MB, which leaves too little memory for the rest of the task.
-
Solution: Increase container memory or reduce SortBuffer size.
tez.runtime.io.sort.mb (Hive on Tez)
mapreduce.task.io.sort.mb (Hive on MR)
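To see why the example values fail, the sketch below estimates how much heap remains once the sort buffer is allocated. It assumes the JVM heap is set to 80% of the container, following the sizing guidance earlier in this topic; your actual -Xmx setting may differ. `heap_headroom_mb` is a hypothetical helper name.

```python
def heap_headroom_mb(container_mb, sort_buffer_mb, heap_percent=80):
    """Heap left after the sort buffer, assuming heap = heap_percent of container."""
    heap_mb = container_mb * heap_percent // 100
    return heap_mb - sort_buffer_mb


print(heap_headroom_mb(1300, 1024))  # 16
```

Under that assumption, a 1300 MB container with a 1024 MB sort buffer leaves only about 16 MB of heap for everything else.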
OOM caused by certain GroupBy statements
-
Error log:
22/11/28 08:24:43 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:611)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:813)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
-
Root cause: The GroupBy HashTable consumes too much memory, causing OOM.
-
Solution:
-
Reduce split size to 128 MB, 64 MB, or smaller to increase job concurrency:
mapreduce.input.fileinputformat.split.maxsize=134217728
or
mapreduce.input.fileinputformat.split.maxsize=67108864
-
Increase mapper and reducer concurrency.
-
Increase container memory. For details, see Out-of-memory (OOM) errors due to insufficient container memory.
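The split sizes above are byte values. The Python sketch below shows the conversion and a rough estimate of how the split count grows as the split size shrinks. The estimate simply divides total input by split size and ignores file boundaries, so treat it as an approximation; both helper names are hypothetical.

```python
MB = 1024 * 1024


def split_maxsize_bytes(split_mb):
    """Convert a split size in MB to the byte value used by split.maxsize."""
    return split_mb * MB


def estimated_splits(input_bytes, split_mb):
    """Rough split count: total input divided by split size, rounded up."""
    size = split_maxsize_bytes(split_mb)
    return -(-input_bytes // size)  # ceiling division on integers


print(split_maxsize_bytes(128))  # 134217728
print(split_maxsize_bytes(64))   # 67108864
print(estimated_splits(10 * 1024 * MB, 128))  # 80
```

Halving the split size roughly doubles the number of map tasks, so each task builds a smaller GroupBy hash table.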
OOM when reading Snappy files
-
Root cause: Standard Snappy files written by services such as LogService use a different format than Hadoop ecosystem Snappy files. EMR defaults to the Hadoop-modified Snappy format and throws an OutOfMemoryError when processing standard Snappy files.
-
Solution: Configure the following parameter for Hive jobs.
set io.compression.codec.snappy.native=true;
Metadata-related errors
Timeout when dropping large partitioned tables
-
Error log:
FAILED: Execution ERROR, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timeout
-
Root cause: The table has too many partitions. Dropping it takes a long time, causing a Hive Metastore client network timeout.
-
Solution:
-
On the EMR console Hive service configuration page, click the hive-site.xml tab and increase the metastore client socket timeout.
hive.metastore.client.socket.timeout=1200s
-
Delete partitions in batches, for example, by running conditional drop commands multiple times.
alter table [TableName] DROP IF EXISTS PARTITION (ds<='20220720')
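One way to prepare the batched deletes is to generate one statement per cutoff value and run them in order. The Python sketch below only builds the statement strings; the table name `sales_log` and the cutoff values are made up for illustration, and `batched_drop_statements` is a hypothetical helper.

```python
def batched_drop_statements(table, cutoffs, column="ds"):
    """Build one conditional DROP PARTITION statement per cutoff, oldest first."""
    return [
        f"alter table {table} DROP IF EXISTS PARTITION ({column}<='{cutoff}');"
        for cutoff in sorted(cutoffs)
    ]


# Hypothetical table and cutoffs: each statement drops one more month.
for stmt in batched_drop_statements("sales_log", ["20220720", "20220520", "20220620"]):
    print(stmt)
```

Running the statements from the oldest cutoff to the newest keeps each drop small enough to finish within the metastore client timeout.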
insert overwrite with dynamic partitions causes job failure
-
Error message: When using insert overwrite operations on dynamic partitions or running similar jobs that include insert overwrite operations, the error "Exception when loading xxx in table" occurs, and the following error message appears in the HiveServer logs.
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Directory oss://xxxx could not be cleaned up.;
-
Root cause: Metadata and data are inconsistent. Metadata contains a partition record, but the data storage system lacks the corresponding path, causing a "path not found" error during cleanup.
-
Solution: Fix the metadata issue before re-running the job.
Hive throws java.lang.IllegalArgumentException: java.net.UnknownHostException: emr-header-1.xxx when reading or dropping tables
-
Root cause: When an EMR cluster uses DLF unified metadata or a unified meta database (legacy feature), the initial path of the created database is the HDFS path of the current EMR cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db or hdfs://emr-header-1.cluster-xxx:9000/user/hive/warehouse/test.db). Hive table paths inherit the database path and also use the HDFS path of the current cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db/test_tbl). If you use Hive in a new EMR cluster to read data from or write data to a Hive table or database that is created by an old EMR cluster, the new cluster may fail to connect to the old cluster. In addition, if the old cluster is released, the error "java.net.UnknownHostException" is returned.
-
Solution:
-
Method 1: If the Hive table data is temporary or test data, change the Hive table location to an OSS path and run drop table or drop database.
-- Hive SQL
alter table test_tbl set location 'oss://bucket/not/exists';
drop table test_tbl;
alter table test_pt_tbl partition (pt=xxx) set location 'oss://bucket/not/exists';
alter table test_pt_tbl drop partition (pt=xxx);
alter database test_db set location 'oss://bucket/not/exists';
drop database test_db;
-
Method 2: If the Hive table data is valid but inaccessible from the new cluster, transfer the HDFS data from the old EMR cluster to OSS and create a new table.
hadoop fs -cp hdfs://emr-header-1.xxx/old/path oss://bucket/new/path
hive -e "create table new_tbl like old_tbl location 'oss://bucket/new/path'"
Hive UDFs and third-party packages
Conflicts caused by placing third-party packages in the Hive lib directory
-
Root cause: Placing third-party packages or replacing Hive JARs in the Hive lib directory ($HIVE_HOME/lib) often causes conflicts. Avoid this practice.
-
Solution: Remove third-party packages from $HIVE_HOME/lib and restore the original Hive JARs.
Hive cannot use the reflect function
-
Root cause: The reflect function may be unavailable when Ranger authentication is enabled.
-
Solution: Remove reflect from the blacklist by configuring hive-site.xml.
hive.server2.builtin.udf.blacklist=empty_blacklist
Custom UDFs slow down job execution
-
Root cause: Hive jobs run slowly without clear error logs, possibly due to poor performance in custom UDFs.
-
Solution: Perform a thread dump on the Hive task to identify performance hotspots and optimize the custom UDF accordingly.
grouping() function fails
-
Symptom: Using the grouping() function produces this error:
grouping() requires at least 2 argument, got 1
This error indicates a parsing error in the grouping() function call.
-
Root cause: This is a known bug in open-source Hive. Hive’s parser is case-sensitive for the grouping() function. Using lowercase grouping() causes Hive to misidentify the function, leading to incorrect argument parsing.
-
Solution: Change the grouping() function in your SQL to uppercase GROUPING().
Engine compatibility issues
Inconsistent results due to Hive and Spark timezone differences
-
Symptom: Hive’s from_unixtime uses UTC, while Spark uses the local timezone. The same timestamp is therefore rendered differently by the two engines.
-
Solution: Set Spark’s timezone to UTC by adding this code in Spark SQL:
set spark.sql.session.timeZone=UTC;
Or add this setting to the Spark configuration file:
spark.sql.session.timeZone=UTC
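The effect of a timezone mismatch can be illustrated outside Hive and Spark. The Python sketch below formats the same epoch value once in UTC and once in a fixed UTC+8 offset, which is chosen here only as an example of a local timezone; `render` is a hypothetical helper and does not call either engine.

```python
from datetime import datetime, timedelta, timezone


def render(epoch_seconds, tz):
    """Format an epoch timestamp as a wall-clock string in the given timezone."""
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


utc_plus_8 = timezone(timedelta(hours=8))  # example local timezone

print(render(0, timezone.utc))  # 1970-01-01 00:00:00
print(render(0, utc_plus_8))    # 1970-01-01 08:00:00
```

The two strings differ by the timezone offset, which is the kind of discrepancy that appears when the engines disagree on the session timezone.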
Known bugs in older Hive versions
Hive on Spark with dynamic partitioning runs slowly (known bug)
-
Root cause: A bug in open-source Hive causes Beeline to enable spark.dynamicAllocation.enabled, which forces Hive to calculate shuffle partitions as 1.
-
Solution: Disable dynamic resource allocation for Hive on Spark jobs or use Hive on Tez instead.
spark.dynamicAllocation.enabled=false
Tez fails when hive.optimize.dynamic.partition.hashjoin is enabled (known bug)
-
Error log:
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1536275581088_0001_5_02, diagnostics=[Task failed, taskId=task_1536275581088_0001_5_02_000009, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1536275581088_0001_5_02_000009_0:java.lang.RuntimeException: java.lang.RuntimeException: cannot find field _col1 from [0:key, 1:value]
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
]]]
-
Root cause: A bug in open-source Hive.
-
Solution: As a workaround, disable the setting.
hive.optimize.dynamic.partition.hashjoin=false
MapJoinOperator throws NullPointerException (known bug)
-
Error log:
2022-05-06 18:37:26,664 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:313)
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:345)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1124)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1128)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1128)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:148)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
-
Root cause: Enabling hive.auto.convert.join.noconditionaltask triggers this error.
-
Solution: Disable the related setting.
hive.auto.convert.join.noconditionaltask=false
Hive on Tez throws IllegalStateException (known bug)
-
Error log:
java.lang.RuntimeException: java.lang.IllegalStateException: Was expecting dummy store operator but found: FS[17]
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
-
Root cause: A bug in open-source Hive that occurs when Tez AM reuse is enabled. EMR Hive has not yet fixed this issue.
-
Solution: Disable Tez ApplicationMaster reuse for individual jobs.
set tez.am.container.reuse.enabled=false;
Other errors
select count(1) returns 0
-
Root cause: select count(1) uses Hive table statistics, but the statistics are inaccurate.
-
Solution: Disable statistics usage.
hive.compute.query.using.stats=false
Or recompute table statistics using the analyze command.
analyze table <table_name> compute statistics;
Hive job submission fails on self-managed ECS instances
Submitting Hive jobs from self-managed ECS instances (outside EMR) causes unpredictable errors. Use an EMR Gateway cluster or deploy a Gateway environment using EMR-CLI. For more information, see Deploy a Gateway environment using EMR-CLI.
Job failures due to data skew
-
Abnormal behavior:
-
Shuffle data fills up disk space.
-
Certain tasks take much longer to run.
-
Certain tasks or containers experience OOM.
-
Solution:
-
Enable Hive skew join optimization.
set hive.optimize.skewjoin=true;
-
Increase mapper and reducer concurrency.
-
Increase container memory. For details, see Out-of-memory (OOM) errors due to insufficient container memory.
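To confirm that a few tasks run much longer than the rest, you can compare task durations against the median. The Python sketch below is one possible check; the 3x factor is an arbitrary assumption, the sample durations are made up, and `slow_tasks` is a hypothetical helper. In practice the durations would come from the task list of your job.

```python
from statistics import median


def slow_tasks(durations_s, factor=3.0):
    """Return indexes of tasks whose duration exceeds factor times the median."""
    threshold = factor * median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > threshold]


# Hypothetical task durations in seconds; the last task is an outlier.
print(slow_tasks([30, 32, 29, 31, 300]))  # [4]
```

If the flagged tasks all process the same join or grouping key, data skew is a likely cause.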
How to handle “Too many counters: 121 max=120”?
-
Description: When running Hive SQL jobs with Tez or MR engines, you encounter this error.
-
Analysis: The job exceeds the default counter limit.
-
Solution: On the EMR console YARN service Configure tab, search for the mapreduce.job.counters.max parameter and increase its value. After updating, resubmit the Hive job. If you submit jobs via Beeline or JDBC, restart the HiveServer service.