This document covers the health check items and key monitoring metrics for HiveServer and HiveMetaStore in E-MapReduce (EMR) clusters.
Severity levels
| Level | Meaning | Action required |
|---|---|---|
| P0 (Critical) | Hive service is unavailable | Troubleshoot immediately |
| P1 (High) | Hive service is available but performance is degraded or workload is elevated | Troubleshoot immediately |
HiveServer check items
inspection_hive_server_availability
Checks whether HiveServer is available by executing a set of validation statements.
| Error message | Cause | Action |
|---|---|---|
| hive server availability permission check is failed | The user lacks permissions to run HiveServer check statements (for example, permissions were accidentally revoked) | Restore the required permissions |
| Hive server availability is failed | HiveServer is abnormal | Check the HiveServer process and logs |
inspection_hive_server_ha
Checks the high availability (HA) status of all HiveServer components.
| Result message | Status | Severity | Action |
|---|---|---|---|
| Hive server HA status is OK | All HiveServer components are normal | Normal | No action needed |
| One or more Hive server failed | One or more HiveServer components are abnormal | P1 | Check the HiveServer process and logs |
| All Hive server are failed | All HiveServer components are abnormal | P0 | Check the HiveServer process and logs |
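The mapping in the table above can be sketched as a small function. This is an illustration only, not the actual inspection implementation: the function name `classify_ha_status` and the list-of-booleans input are assumptions, and only the result messages and severities are taken from the table.

```python
def classify_ha_status(component_ok):
    """Map per-component health flags to a (result message, severity) pair.

    component_ok: list of booleans, one per HiveServer component
    (hypothetical input shape, chosen for illustration).
    """
    if not component_ok:
        raise ValueError("at least one component is required")

    failed = sum(1 for ok in component_ok if not ok)
    if failed == 0:
        # Healthy: no alert severity is assigned.
        return "Hive server HA status is OK", None
    if failed == len(component_ok):
        # Every component is down, so the service is unavailable.
        return "All Hive server are failed", "P0"
    # Some, but not all, components are down.
    return "One or more Hive server failed", "P1"
```

For example, `classify_ha_status([True, False])` yields the P1 result, because one component is still serving requests.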
inspection_hive_server_port
Checks whether HiveServer is listening on port 10000 on the host. If nothing is listening on port 10000, the HiveServer process is abnormal. Check the HiveServer process and logs.
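To reproduce a comparable check by hand, you can attempt a TCP connection to the port. The following is a minimal sketch using only the Python standard library; the function name `port_is_listening` and the timeout value are assumptions, and the built-in inspection may test the port differently.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_listening("localhost", 10000)` on a HiveServer node returns `False` when no process accepts connections on that port. A successful connection only shows that some process is listening, so still confirm in the logs that the process is HiveServer.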
inspection_hive_server_gc
Checks the memory.heap.usage metric of the Java virtual machine (JVM) for HiveServer.
| JVM heap memory usage | Severity | Action |
|---|---|---|
| ≥ 95% | P0 | Increase HiveServer memory immediately. Otherwise, HiveServer may restart and running jobs may fail. |
| ≥ 90% and < 95% | P1 | Increase HiveServer memory immediately. Otherwise, HiveServer may restart and running jobs may fail. |
| < 90% | Normal | Adjust memory based on business requirements if needed. |
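The thresholds above can be expressed as a short function. This is a sketch under the assumption that heap usage arrives as a percentage between 0 and 100; the function name `classify_heap_usage` is hypothetical.

```python
def classify_heap_usage(usage_percent):
    """Return the severity for a JVM heap usage value given in percent."""
    if not 0 <= usage_percent <= 100:
        raise ValueError("usage_percent must be between 0 and 100")

    # Check the higher threshold first so that 95% and above maps to P0.
    if usage_percent >= 95:
        return "P0"
    if usage_percent >= 90:
        return "P1"
    return "Normal"
```

Because the 95% threshold is tested first, a value such as 96 is reported as P0 rather than P1, even though it also satisfies the 90% condition.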
inspection_hive_server_restart
Monitors HiveServer restart frequency within any five-minute window.
| Restart behavior | Severity | Action |
|---|---|---|
| Repeated restarts within five minutes | P0 | Check the HiveServer process and logs immediately |
| One restart within five minutes | P1 | Check the HiveServer process and logs immediately |
| No restarts | Normal | No action needed |
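One way to apply these rules to a list of restart timestamps is shown below. It is a sketch with assumptions: it evaluates only the five-minute window that ends at `now`, it treats two or more restarts as "repeated", and the name `classify_restarts` is hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def classify_restarts(restart_times, now):
    """Return the severity based on restarts in the five minutes before now."""
    # Keep only restarts that fall inside the trailing five-minute window.
    recent = [t for t in restart_times if timedelta(0) <= now - t <= WINDOW]

    if len(recent) >= 2:
        return "P0"  # repeated restarts
    if len(recent) == 1:
        return "P1"  # a single restart
    return "Normal"
```

Restarts older than five minutes are ignored, so a process that restarted once ten minutes ago and has been stable since is reported as Normal.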
HiveMetaStore check items
inspection_hive_metastore_ha
Checks the high availability (HA) status of all HiveMetaStore components.
| Result message | Status | Severity | Action |
|---|---|---|---|
| Hive metastore HA status is OK | All HiveMetaStore components are normal | Normal | No action needed |
| One or more metastore failed | One or more HiveMetaStore components are abnormal | P1 | Check the HiveMetaStore process and logs immediately |
| All Hive metastore are failed | All HiveMetaStore components are abnormal | P0 | Check the HiveMetaStore process and logs immediately |
inspection_hive_metastore_port
Checks whether HiveMetaStore is listening on port 9083 on the host. If nothing is listening on port 9083, the HiveMetaStore process is abnormal. Check the HiveMetaStore process and logs immediately.
inspection_hive_metastore_gc
Checks the memory.heap.usage metric of the JVM for HiveMetaStore.
| JVM heap memory usage | Severity | Action |
|---|---|---|
| ≥ 95% | P0 | Increase HiveMetaStore memory immediately |
| ≥ 90% and < 95% | P1 | Increase HiveMetaStore memory immediately |
| < 90% | Normal | Adjust memory based on business requirements if needed. |
inspection_hive_metastore_restart
Monitors HiveMetaStore restart frequency within any five-minute window.
| Restart behavior | Severity | Action |
|---|---|---|
| Repeated restarts within five minutes | P0 | Check the HiveMetaStore process and logs immediately |
| One restart within five minutes | P1 | Check the HiveMetaStore process and logs immediately |
| No restarts | Normal | No action needed |
Key metrics of HiveServer2
View these metrics on the Monitoring tab of your cluster in the EMR console.
| Category | Metrics | What to look for |
|---|---|---|
| Session | OpenSessions, ActiveSessions | A spike in open or active sessions at the time of an error can indicate memory pressure. Adjust memory based on your business requirements. |
| JVM | JVM MemHeapMax, garbage collection (GC) metrics | Review JVM metrics in the time window when errors occurred to determine whether to increase memory. |
| Task | PENDING tasks, ActiveRunTasksCalls, TasksCount metrics | A large number of pending tasks may indicate insufficient HiveServer memory, YARN resource queue contention, or large jobs consuming most available resources. Pending tasks include tasks whose progress has stalled. |
Key metrics of HiveMetaStore
View these metrics on the Monitoring tab of your cluster in the EMR console.
| Category | Metrics | What to look for |
|---|---|---|
| JVM | JVM MemHeapMax, GC metrics | Review JVM metrics in the time window when errors occurred to determine whether to increase memory. |
| Metadata operations | GetTable-related metrics, CreateTable Time | A steady increase in these metrics, or exceptions in these operations, indicates a memory bottleneck in HiveMetaStore or a performance issue in the backend database. Increase HiveMetaStore memory relative to its current size, or upgrade the backend database specifications if queries against the database are slow. |
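If you export samples of a metadata operation metric (for example, CreateTable Time), a simple heuristic can flag a steady increase. The sketch below fits a least-squares line to equally spaced samples and compares its slope to the mean value. The function names and the 5% per-sample threshold are assumptions for illustration, not values defined by EMR.

```python
def relative_slope(samples):
    """Least-squares slope of equally spaced samples, divided by their mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")

    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Covariance of (index, value) and variance of the index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    # Normalize so the result is a fractional growth per sample interval.
    return slope / mean_y if mean_y else 0.0


def is_steadily_increasing(samples, min_relative_slope=0.05):
    """Flag a series whose fitted growth per sample meets the threshold."""
    return relative_slope(samples) >= min_relative_slope
```

A series such as `[100, 110, 120, 130, 140]` is flagged, while a series that fluctuates around a constant value is not. Tune the threshold to the sampling interval and normal variance of your cluster.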