E-MapReduce: Troubleshoot and resolve Hive job failures

Last Updated: Jun 20, 2026

This topic describes how to troubleshoot and resolve Hive job failures.

Failure troubleshooting

If you encounter job failures or performance issues on the client, follow these steps:

  1. Check Hive client logs.

    • For jobs submitted through the Hive CLI, client logs are located at /tmp/hive/$USER/hive.log or /tmp/$USER/hive.log on the cluster or Gateway node.

    • For jobs submitted through Hive Beeline or JDBC, logs are in the HiveServer service logs (typically in /var/log/emr/hive or /mnt/disk1/log/hive).

  2. Check YARN Application logs for the Hive job using the yarn command.

    yarn logs -applicationId application_xxx_xxx -appOwner userName
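
Once the client or YARN logs are saved to a file, a quick scan for the error strings quoted in this topic can point to the relevant section. The following Python sketch is illustrative only: the function name and the signature table are assumptions assembled from the error messages listed below, not an official EMR tool.

```python
# Hypothetical helper: map error strings quoted in this topic to the
# section that discusses them. The table is illustrative, not exhaustive.
KNOWN_SIGNATURES = [
    ("java.lang.OutOfMemoryError", "Memory-related errors"),
    ("Container killed by YARN for exceeding memory limits", "Memory-related errors"),
    ("java.net.SocketTimeoutException: Read timeout", "Metadata-related errors"),
    ("could not be cleaned up", "Metadata-related errors"),
    ("java.net.UnknownHostException", "Metadata-related errors"),
    ("Was expecting dummy store operator", "Known bugs in older Hive versions"),
    ("Too many counters", "Other errors"),
]


def match_sections(log_text):
    """Return the sections whose signatures appear in log_text.

    Sections are listed in table order and never repeated.
    """
    sections = []
    for signature, section in KNOWN_SIGNATURES:
        if signature in log_text and section not in sections:
            sections.append(section)
    return sections
```

For example, redirect the output of the yarn logs command to a file, read the file into a string, and pass it to match_sections.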

Memory-related errors

Out-of-memory (OOM) errors due to insufficient container memory

Error logs: java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space.

Solution: Increase container memory. For Hive on MapReduce (MR) jobs, also increase the JVM heap size.

  • Hive on MR: On the YARN service configuration page, click the mapred-site.xml tab and increase mapper and reducer memory.

    mapreduce.map.memory.mb=4096
    mapreduce.reduce.memory.mb=4096

    Also set the -Xmx JVM option in mapreduce.map.java.opts and mapreduce.reduce.java.opts to 80% of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, respectively.

    mapreduce.map.java.opts=-Xmx3276m (keep other parameters unchanged)
    mapreduce.reduce.java.opts=-Xmx3276m (keep other parameters unchanged)
  • Hive on Tez

    • If the Tez container runs out of memory, on the Hive service configuration page, click the hive-site.xml tab and increase Tez container memory.

      hive.tez.container.size=4096
    • If the Tez AM runs out of memory, on the Tez service configuration page, click the tez-site.xml tab and increase Tez AM memory.

      tez.am.resource.memory.mb=4096
  • Hive on Spark: Increase Spark Executor memory in spark-defaults.conf.

    spark.executor.memory=4g
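
The 80% rule for Hive on MR can be checked with simple arithmetic. The following Python sketch derives the -Xmx value from a container size; the function name is hypothetical, and rounding down to whole megabytes is an assumption that matches the 4096 MB example above.

```python
def xmx_for_container(container_mb, heap_percent=80):
    """Return a -Xmx option sized as a percentage of the container memory.

    The result is rounded down to whole megabytes.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    heap_mb = container_mb * heap_percent // 100
    return f"-Xmx{heap_mb}m"


print(xmx_for_container(4096))  # -Xmx3276m
```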

Container killed by YARN due to excessive memory usage

Error log: Container killed by YARN for exceeding memory limits.

Root cause: The Hive task uses more memory (including JVM heap, off-heap memory, and child processes) than requested from YARN. For example, in Hive on MR, if the Map Task JVM heap size (mapreduce.map.java.opts=-Xmx4g) exceeds the YARN memory allocation (mapreduce.map.memory.mb=3072, or 3 GB), YARN NodeManager kills the container.

Solution:

  1. For Hive on MR jobs, increase mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, ensuring they are at least 1.25 times the -Xmx values in mapreduce.map.java.opts and mapreduce.reduce.java.opts.

  2. For Hive on Spark jobs, increase spark.executor.memoryOverhead, ensuring it is at least 25% of spark.executor.memory.
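
The two ratios above can be verified before a job is submitted. The following Python sketch is a hedged illustration: the function names are hypothetical, and the parser only understands -Xmx values with a k, m, or g suffix.

```python
import re


def xmx_to_mb(java_opts):
    """Extract the -Xmx value from a java.opts string and convert it to MB.

    Only values with a k, m, or g suffix are handled in this sketch.
    """
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx setting with a k/m/g suffix found")
    value, unit = int(match.group(1)), match.group(2).lower()
    if unit == "g":
        return value * 1024
    if unit == "m":
        return value
    return value // 1024  # kilobytes, rounded down


def container_fits_heap(memory_mb, java_opts, ratio=1.25):
    """True if the container memory is at least `ratio` times the JVM heap."""
    return memory_mb >= xmx_to_mb(java_opts) * ratio


def overhead_is_sufficient(executor_mb, overhead_mb, fraction=0.25):
    """True if the Spark memory overhead is at least `fraction` of executor memory."""
    return overhead_mb >= executor_mb * fraction


# The failing example from the root cause: a 4 GB heap in a 3072 MB container.
print(container_fits_heap(3072, "-Xmx4g"))  # False
```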

OOM caused by SortBuffer set too large

  • Error log:

    Error running child: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:986)
  • Root cause: The sort buffer is too large for the Hive task container. For example, container memory is set to 1300 MB, but the sort buffer is set to 1024 MB, which leaves almost no heap for the rest of the task.

  • Solution: Increase container memory or reduce SortBuffer size.

    tez.runtime.io.sort.mb (Hive on Tez)
    mapreduce.task.io.sort.mb (Hive on MR)
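
To see why the example above fails, it helps to estimate how much heap remains once the sort buffer is allocated. The following Python sketch assumes the JVM heap is 80% of the container, as in the -Xmx guidance earlier in this topic; the function name is hypothetical and the estimate is rough.

```python
def heap_left_after_sort_buffer(container_mb, sort_mb, heap_percent=80):
    """Estimate the heap (in MB) left once the sort buffer is allocated.

    Assumes the JVM heap is heap_percent of the container memory.
    A small or negative result suggests the sort buffer is too large.
    """
    heap_mb = container_mb * heap_percent // 100
    return heap_mb - sort_mb


print(heap_left_after_sort_buffer(1300, 1024))  # 16
```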

OOM caused by certain GroupBy statements

  • Error log:

    22/11/28 08:24:43 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 0)
    java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:611)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:813)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
  • Root cause: The GroupBy HashTable consumes too much memory, causing OOM.

  • Solution:

    1. Reduce split size to 128 MB, 64 MB, or smaller to increase job concurrency: mapreduce.input.fileinputformat.split.maxsize=134217728 or mapreduce.input.fileinputformat.split.maxsize=67108864.

    2. Increase mapper and reducer concurrency.

    3. Increase container memory. For details, see Out-of-memory (OOM) errors due to insufficient container memory.
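
The split sizes in step 1 are plain byte counts. The following Python sketch converts megabytes to the value expected by mapreduce.input.fileinputformat.split.maxsize and gives a rough split count; the function names are hypothetical, and the estimate assumes a single large splittable input, so real counts depend on file layout and input format.

```python
MB = 1024 * 1024


def split_maxsize_bytes(split_mb):
    """Convert a split size in MB to bytes."""
    return split_mb * MB


def estimated_splits(input_bytes, split_mb):
    """Rough number of splits for one large splittable input (ceiling division)."""
    split_bytes = split_maxsize_bytes(split_mb)
    return -(-input_bytes // split_bytes)


print(split_maxsize_bytes(128))  # 134217728
print(split_maxsize_bytes(64))   # 67108864
```

Under these assumptions, halving the split size from 128 MB to 64 MB doubles the number of map tasks for a 10 GB input, from 80 to 160, so each task builds a smaller hash table.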

OOM when reading Snappy files

  • Root cause: Standard Snappy files written by services such as LogService use a different format than Hadoop ecosystem Snappy files. EMR defaults to the Hadoop-modified Snappy format and throws an OutOfMemoryError when processing standard Snappy files.

  • Solution: Configure the following parameter for Hive jobs.

    set io.compression.codec.snappy.native=true;

Metadata-related errors

Timeout when dropping large partitioned tables

  • Error log:

    FAILED: Execution ERROR, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timeout
  • Root cause: The table has too many partitions. Dropping it takes a long time, causing a Hive Metastore client network timeout.

  • Solution:

    1. On the EMR console Hive service configuration page, click the hive-site.xml tab and increase the metastore client socket timeout.

      hive.metastore.client.socket.timeout=1200s
    2. Delete partitions in batches, for example, by running conditional drop commands multiple times.

      alter table [TableName] DROP IF EXISTS PARTITION (ds<='20220720')
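
The batched deletion in step 2 can be scripted by generating one statement per boundary value. The following Python sketch only builds the SQL text; the function name, the ds column default, and the table name in the usage are assumptions for illustration, and the inputs are not validated or escaped.

```python
def batched_drop_statements(table, boundaries, column="ds"):
    """Build one DROP PARTITION statement per boundary, in ascending order.

    Each statement drops partitions up to its boundary, so running them in
    order removes one slice of partitions at a time. Boundaries are sorted
    as strings, which works for fixed-width values such as yyyymmdd.
    """
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({column}<='{boundary}');"
        for boundary in sorted(boundaries)
    ]


for statement in batched_drop_statements("my_table", ["20220720", "20220531"]):
    print(statement)
```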

insert overwrite with dynamic partitions causes job failure

  • Error message: When a job runs insert overwrite on dynamic partitions, or otherwise includes an insert overwrite operation, it fails with the error "Exception when loading xxx in table", and the following message appears in the HiveServer logs.

    Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Directory oss://xxxx could not be cleaned up.;
  • Root cause: Metadata and data are inconsistent. Metadata contains a partition record, but the data storage system lacks the corresponding path, causing a "path not found" error during cleanup.

  • Solution: Fix the metadata issue before re-running the job.

Hive throws java.lang.IllegalArgumentException: java.net.UnknownHostException: emr-header-1.xxx when reading or dropping tables

  • Root cause: When an EMR cluster uses DLF unified metadata or a unified meta database (legacy feature), the initial path of the created database is the HDFS path of the current EMR cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db or hdfs://emr-header-1.cluster-xxx:9000/user/hive/warehouse/test.db). Hive table paths inherit the database path and also use the HDFS path of the current cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db/test_tbl). If you use Hive in a new EMR cluster to read data from or write data to a Hive table or database that is created by an old EMR cluster, the new cluster may fail to connect to the old cluster. In addition, if the old cluster is released, the error "java.net.UnknownHostException" is returned.

  • Solution:

    • Method 1: If the Hive table data is temporary or test data, change the Hive table location to an OSS path and run drop table or drop database.

      -- Hive SQL
      alter table test_tbl set location 'oss://bucket/not/exists';
      drop table test_tbl;
      alter table test_pt_tbl partition (pt=xxx) set location 'oss://bucket/not/exists';
      alter table test_pt_tbl drop partition (pt=xxx);
      alter database test_db set location 'oss://bucket/not/exists';
      drop database test_db;
    • Method 2: If the Hive table data is valid but inaccessible from the new cluster, transfer the HDFS data from the old EMR cluster to OSS and create a new table.

      hadoop fs -cp hdfs://emr-header-1.xxx/old/path oss://bucket/new/path
      hive -e "create table new_tbl like old_tbl location 'oss://bucket/new/path'"
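
Before choosing a method, it can help to list which table or database locations still point at a cluster that no longer exists. The following Python sketch checks a location string against a set of reachable NameNode hosts; the function name and the host names in the usage are hypothetical, and the locations themselves would have to be collected separately, for example from describe formatted output.

```python
from urllib.parse import urlparse


def is_stale_hdfs_location(location, live_hosts):
    """True if location is an HDFS URI whose host is not in live_hosts.

    urlparse lowercases the host name, so live_hosts should be lowercase.
    Non-HDFS locations such as oss:// paths are never reported as stale.
    """
    parsed = urlparse(location)
    return parsed.scheme == "hdfs" and parsed.hostname not in live_hosts


# Hypothetical host names for illustration.
live = {"master-1-1.c-abc"}
print(is_stale_hdfs_location(
    "hdfs://emr-header-1.cluster-123:9000/user/hive/warehouse/test.db", live))  # True
```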

Hive UDFs and third-party packages

Conflicts caused by placing third-party packages in the Hive lib directory

  • Root cause: Placing third-party packages or replacing Hive JARs in the Hive lib directory ($HIVE_HOME/lib) often causes conflicts. Avoid this practice.

  • Solution: Remove third-party packages from $HIVE_HOME/lib and restore the original Hive JARs.

Hive cannot use the reflect function

  • Root cause: The reflect function may be unavailable when Ranger authentication is enabled.

  • Solution: Remove reflect from the blacklist by configuring hive-site.xml.

    hive.server2.builtin.udf.blacklist=empty_blacklist

Custom UDFs slow down job execution

  • Root cause: Hive jobs run slowly without clear error logs, possibly due to poor performance in custom UDFs.

  • Solution: Perform a thread dump on the Hive task to identify performance hotspots and optimize the custom UDF accordingly.
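
One way to read several thread dumps is to count which method sits at the top of each thread's stack; a frame that keeps reappearing is a likely hotspot. The following Python sketch assumes jstack-style text in which threads are separated by blank lines and frames start with "at "; the function name is hypothetical.

```python
from collections import Counter


def top_frames(thread_dumps):
    """Count the top stack frame of every thread across several dumps.

    Each dump is a string of jstack-style text. Threads are assumed to be
    separated by blank lines, and frames are lines starting with "at ".
    """
    counts = Counter()
    for dump in thread_dumps:
        for block in dump.split("\n\n"):
            for line in block.splitlines():
                line = line.strip()
                if line.startswith("at "):
                    counts[line[3:]] += 1
                    break  # only the top frame of each thread
    return counts
```

Taking a few dumps some seconds apart and calling top_frames(dumps).most_common(5) gives a rough ranking. If a frame inside the custom UDF dominates, that method is a reasonable place to start optimizing.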

grouping() function fails

  • Symptom: Using the grouping() function produces this error:

    grouping() requires at least 2 argument, got 1

    This error indicates a parsing error in the grouping() function call.

  • Root cause: This is a known bug in open-source Hive. Hive’s parser is case-sensitive for the grouping() function. Using lowercase grouping() causes Hive to misidentify the function, leading to incorrect argument parsing.

  • Solution: Change the grouping() function in your SQL to uppercase GROUPING().
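
For scripts that contain many queries, the rename can be automated. The following Python sketch uppercases grouping( calls with a regular expression; the function name is hypothetical, and the approach is deliberately naive: it leaves GROUPING SETS and grouping__id alone, but it would also rewrite a matching call inside a string literal or comment.

```python
import re

# Match the word "grouping" only when it is directly followed by "(".
_GROUPING_CALL = re.compile(r"\bgrouping(\s*)\(", re.IGNORECASE)


def uppercase_grouping(sql):
    """Rewrite grouping(...) calls as GROUPING(...), keeping any whitespace."""
    return _GROUPING_CALL.sub(lambda m: "GROUPING" + m.group(1) + "(", sql)


print(uppercase_grouping("select a, grouping(a) from t group by a with rollup"))
# select a, GROUPING(a) from t group by a with rollup
```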

Engine compatibility issues

Inconsistent results due to Hive and Spark timezone differences

  • Symptom: Hive’s from_unixtime uses UTC, while Spark uses the local timezone. Because the timezones differ, the same query returns different results in the two engines.

  • Solution: Set Spark’s timezone to UTC by adding this code in Spark SQL:

    set spark.sql.session.timeZone=UTC;

    Or add this setting to the Spark configuration file:

    spark.sql.session.timeZone=UTC
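
The effect is easy to reproduce outside either engine: the same epoch value renders as a different wall-clock time depending on the timezone used for formatting. The following Python sketch is only an illustration of that mechanism; the function name is hypothetical, and the fixed UTC+8 offset stands in for whatever local timezone a cluster uses.

```python
from datetime import datetime, timedelta, timezone


def format_epoch(epoch_seconds, tz):
    """Format an epoch timestamp as a wall-clock string in the given timezone."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


utc = timezone.utc
utc_plus_8 = timezone(timedelta(hours=8))  # stand-in for a local timezone

print(format_epoch(0, utc))         # 1970-01-01 00:00:00
print(format_epoch(0, utc_plus_8))  # 1970-01-01 08:00:00
```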

Known bugs in older Hive versions

Hive on Spark with dynamic partitioning runs slowly (known bug)

  • Root cause: A bug in open-source Hive causes Beeline to enable spark.dynamicAllocation.enabled, which forces Hive to calculate shuffle partitions as 1.

  • Solution: Disable dynamic resource allocation for Hive on Spark jobs or use Hive on Tez instead.

    spark.dynamicAllocation.enabled=false

Tez fails when hive.optimize.dynamic.partition.hashjoin is enabled (known bug)

  • Error log:

    Vertex failed, vertexName=Reducer 2, vertexId=vertex_1536275581088_0001_5_02, diagnostics=[Task failed, taskId=task_1536275581088_0001_5_02_000009, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1536275581088_0001_5_02_000009_0:java.lang.RuntimeException: java.lang.RuntimeException: cannot find field _col1 from [0:key, 1:value]
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    ]]]
  • Root cause: A bug in open-source Hive.

  • Solution: As a workaround, disable the setting.

    hive.optimize.dynamic.partition.hashjoin=false

MapJoinOperator throws NullPointerException (known bug)

  • Error log:

    2022-05-06 18:37:26,664 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: java.lang.NullPointerException
            at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:313)
            at org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:345)
            at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1124)
            at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1128)
            at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1128)
            at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
            at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:148)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
            at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
            at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
            at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
  • Root cause: Enabling hive.auto.convert.join.noconditionaltask triggers this error.

  • Solution: Disable the related setting.

    hive.auto.convert.join.noconditionaltask=false

Hive on Tez throws IllegalStateException (known bug)

  • Error log:

    java.lang.RuntimeException: java.lang.IllegalStateException: Was expecting dummy store operator but found: FS[17]
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
            at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
            at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
            at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
            at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
            at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
            at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
  • Root cause: A bug in open-source Hive that occurs when Tez container reuse is enabled. The issue is not yet fixed in EMR Hive.

  • Solution: Disable Tez container reuse for individual jobs.

    set tez.am.container.reuse.enabled=false;

Other errors

select count(1) returns 0

  • Root cause: select count(1) uses Hive table statistics, but the statistics are inaccurate.

  • Solution: Disable statistics usage.

    hive.compute.query.using.stats=false

    Or recompute table statistics using the analyze command.

    analyze table <table_name> compute statistics;

Hive job submission fails on self-managed ECS instances

Submitting Hive jobs from self-managed ECS instances (outside EMR) causes unpredictable errors. Use an EMR Gateway cluster or deploy a Gateway environment using EMR-CLI. For more information, see Deploy a Gateway environment using EMR-CLI.

Job failures due to data skew

  • Abnormal behavior:

    • Shuffle data fills up disk space.

    • Certain tasks take much longer to run.

    • Certain tasks or containers experience OOM.

  • Solution:

    1. Enable Hive skew join optimization.

      set hive.optimize.skewjoin=true;
    2. Increase mapper and reducer concurrency.

    3. Increase container memory. For details, see Out-of-memory (OOM) errors due to insufficient container memory.
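
Skew usually shows up as a few tasks that run far longer than the rest. The following Python sketch flags such tasks from a mapping of task ID to duration; the function name, the input format, and the threshold of three times the median are all assumptions for illustration, not values used by Hive or YARN.

```python
from statistics import median


def slow_tasks(durations, factor=3.0):
    """Return the IDs of tasks whose duration exceeds factor times the median.

    durations maps a task ID to its run time in seconds.
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(task for task, secs in durations.items() if secs > factor * typical)


print(slow_tasks({"t0": 10, "t1": 12, "t2": 11, "t3": 95}))  # ['t3']
```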

How to handle “Too many counters: 121 max=120”?

  • Description: When running Hive SQL jobs with Tez or MR engines, you encounter this error.

  • Analysis: The job exceeds the default counter limit.

  • Solution: On the EMR console YARN service Configure tab, search for the mapreduce.job.counters.max parameter and increase its value. After updating, resubmit the Hive job. If you submit jobs via Beeline or JDBC, restart the HiveServer service.