Hive job execution and troubleshooting FAQ - E-MapReduce

Note: For step-by-step troubleshooting guides, see Troubleshoot issues related to Hive jobs and Troubleshoot issues related to the Hive service.

Why are my Hive jobs stuck in the waiting state?
Why does the map stage read small files?
Why do some reduce tasks take much longer than others?
How do I estimate the maximum number of concurrent Hive jobs?
Why does my Hive external table return no data after creation?

Why are my Hive jobs stuck in the waiting state?

Symptom: One or more Hive jobs remain in the waiting state and do not progress.

Cause: Either queue resources are fully occupied, or a single job is consuming a disproportionate share of resources.

Fix: Use YARN UI to identify which condition applies, then take the corresponding action.

In the E-MapReduce (EMR) console, go to the Access Links and Ports tab and click the link in the Access URL column for YARN UI.
Click the application ID.
Click the link next to Tracking URL to see which jobs are waiting.
In the left-side navigation pane, click Scheduler to inspect queue resource usage.

Based on what you see:

Observation	Action
Queue resources are fully occupied	Move the waiting jobs to an idle queue
A single job holds a disproportionate share of resources or runs for a long time	Optimize that job's code
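
One way to send Hive jobs to a different queue is to set the target queue in the Hive session before submitting the query. The following is a minimal sketch, not a definitive procedure. The queue name idle_queue is hypothetical, and which property applies depends on the execution engine your cluster uses.

```sql
-- Assumption: idle_queue is a hypothetical YARN queue with free capacity.
-- Set only the property that matches your execution engine.

-- Hive on MapReduce
SET mapreduce.job.queuename=idle_queue;

-- Hive on Tez
SET tez.queue.name=idle_queue;
```

These settings affect only queries submitted afterwards in the same session. They do not move a job that is already waiting, and with Tez a new session may be required before the change takes effect.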

Why does the map stage read small files?

Symptom: Individual map tasks are reading only a few bytes of data.

Cause: The input contains a large number of small files. Each small file becomes a separate map task, regardless of its size, which wastes resources and slows down the job.

Fix: Confirm the issue via YARN UI, then merge the input files.

In the EMR console, go to the Access Links and Ports tab and click the link in the Access URL column for YARN UI.
Click the application ID to open the job detail page, then go to the Map tasks page. Check the data size read per map task. A task that reads only a few bytes, for example two bytes, is a clear sign of small files. For more detail, check the log of each map task.

If most map tasks read very small amounts of data, merge the input files before running the job.
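
Two common ways to handle small files in Hive are to combine them into larger input splits at read time and to merge output files at write time, so that downstream jobs read fewer files. The following is a sketch under stated assumptions: the size values are illustrative, and source_tbl and compacted_tbl are hypothetical tables with the same schema.

```sql
-- Read side: combine many small files into fewer input splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound for a combined split (256 MB, illustrative value).
SET mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Write side: merge small output files at the end of a job.
-- Merge the output of map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge the output of map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Start a merge when the average output file is below 128 MB (illustrative).
SET hive.merge.smallfiles.avgsize=134217728;
-- Target size of each merged file (256 MB, illustrative).
SET hive.merge.size.per.task=268435456;

-- Rewrite the data into fewer, larger files.
-- Assumption: source_tbl and compacted_tbl are hypothetical names.
INSERT OVERWRITE TABLE compacted_tbl
SELECT * FROM source_tbl;
```

If the cluster runs Hive on Tez, the corresponding merge switch is hive.merge.tezfiles. Test the values against your own data volumes before applying them widely.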

Why do some reduce tasks take much longer than others?

Symptom: One or more reduce tasks run significantly longer than the others.

Cause: Data skew. One reduce task receives far more records or shuffle bytes than the rest, so it takes disproportionately long to finish.

Fix: Use YARN UI to identify the slow tasks and confirm data skew.

In the EMR console, go to the Access Links and Ports tab and click the link in the Access URL column for YARN UI.
Click the application ID.
On the Reduce tasks page, sort tasks by completion time in descending order to find the slowest ones.
Click the name of a slow reduce task.
In the left-side navigation pane of the task details page, click Counters. Check the values of Reduce input records and Reduce shuffle bytes. If these values are significantly higher than in other tasks, data skew is occurring.
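
Once the counters point to data skew, a key-frequency query can show which keys dominate the input. The following is a minimal sketch. The table skewed_tbl and the column user_id are hypothetical stand-ins for your own table and its GROUP BY or JOIN key. The two settings at the end are optional Hive mitigations, not a guaranteed fix.

```sql
-- Assumption: skewed_tbl and user_id are hypothetical names.
-- List the most frequent keys. A few keys with very high counts indicate skew.
SELECT user_id, COUNT(*) AS cnt
FROM skewed_tbl
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;

-- Optional mitigations built into Hive. Test them before relying on them.
-- Use a two-stage aggregation for skewed GROUP BY keys.
SET hive.groupby.skewindata=true;
-- Process heavily skewed join keys in a follow-up job.
SET hive.optimize.skewjoin=true;
```

If a single key such as NULL or a default value accounts for most rows, filtering or handling that key separately is often simpler than tuning settings.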

How do I estimate the maximum number of concurrent Hive jobs?

Maximum concurrency depends on the HiveServer2 heap size and the number of master nodes. Use this formula:

max_num = master_num × max(5, hive_server2_heapsize/512)

Parameter	Description
`master_num`	Number of master nodes in the cluster
`hive_server2_heapsize`	HiveServer2 heap size in MB, configured in `hive-env.sh`. Default: 512 MB

Example: A cluster with 3 master nodes and a 4 GB (4,096 MB) HiveServer2 heap size supports up to 24 concurrent jobs: 3 × max(5, 4096/512) = 3 × 8 = 24.

If your workload regularly hits the concurrency limit, increase the HiveServer2 heap size in hive-env.sh and recalculate.
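
To recalculate quickly, the formula can be evaluated in a Hive session. The sketch below makes one assumption that the formula does not state: the quotient is truncated to an integer. The first query uses the values from the example above (3 master nodes, 4,096 MB heap) and evaluates to 24. The second uses a single master node with the default 512 MB heap and evaluates to 5, because the floor of 5 applies.

```sql
-- Assumption: the quotient is truncated to an integer.
-- The formula itself does not specify rounding.

-- 3 master nodes, 4,096 MB HiveServer2 heap
SELECT 3 * GREATEST(5, CAST(4096 / 512 AS INT)) AS max_num;

-- 1 master node, default 512 MB heap: the minimum of 5 applies
SELECT 1 * GREATEST(5, CAST(512 / 512 AS INT)) AS max_num;
```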

Why does my Hive external table return no data after creation?

Symptom: Running SELECT * FROM storage_log; returns no rows after creating the following table:

CREATE EXTERNAL TABLE storage_log(content STRING) PARTITIONED BY (ds STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 'oss://log-12453****/your-logs/airtake/pro/storage';

Cause: Hive does not automatically associate a partitioned external table with existing data directories. After creating a partitioned external table, you must register the partitions manually before any data is visible.

Fix: Add the partitions manually, then query the table.

ALTER TABLE storage_log ADD PARTITION (ds='123');

Query the table:

SELECT * FROM storage_log;

Expected output:

OK
abcd    123
efgh    123
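
When a table has many partitions, registering them one at a time becomes tedious. The following sketch shows a rerunnable single-partition statement, a bulk alternative, and a verification step. It assumes the directories under the table location follow the default ds=<value> naming. If they do not, MSCK REPAIR TABLE will not find them.

```sql
-- Register a single partition. IF NOT EXISTS makes the statement safe to rerun.
ALTER TABLE storage_log ADD IF NOT EXISTS PARTITION (ds='123');

-- Alternatively, scan the table location and register every ds=<value> directory.
MSCK REPAIR TABLE storage_log;

-- Verify which partitions the metastore now knows about.
SHOW PARTITIONS storage_log;
```

Because ds is declared as STRING, the partition value is quoted here.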