This topic describes the check items and key metrics of Hive.
Severity levels
- P0: Critical. If a P0-level error occurs, the Hive service is unavailable. You must immediately troubleshoot the issue.
- P1: High. If a P1-level error occurs, the Hive service is available, but the performance may be low or the workload may be high. You must immediately troubleshoot the issue.
HiveServer-related check items
Availability: inspection_hive_server_availability
- The check fails and the error `hive server availability permission check is failed` is reported. This indicates that the user does not have permissions to execute the statements that are used to check HiveServer. For example, the permissions that were granted to the user may have been accidentally revoked.
- The check fails and the error `Hive server availability is failed` is reported. This indicates that HiveServer is abnormal. In this case, you must check the HiveServer process and the logs of the process to troubleshoot the issue.
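The availability check executes statements against HiveServer as the inspection user. The following is a minimal sketch of a comparable probe, assuming the PyHive client library; the host name, port, and user name are placeholders and are not part of the inspection tool itself:

```python
# Minimal HiveServer2 availability probe (sketch). Requires the PyHive library;
# host, port, and user below are placeholder values.
from pyhive import hive

def check_hiveserver_availability(host="emr-header-1", port=10000, user="hadoop"):
    try:
        conn = hive.Connection(host=host, port=port, username=user)
        cursor = conn.cursor()
        # If the inspection user lacks the required permissions, the failure
        # surfaces here and corresponds to the permission error described above.
        cursor.execute("SHOW DATABASES")
        cursor.fetchall()
        return "OK"
    except Exception as exc:
        # A connection or execution failure corresponds to the second error above.
        return f"FAILED: {exc}"
```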
High availability: inspection_hive_server_ha
- If the message `Hive server HA status is OK` is returned, all HiveServer components are in a normal state.
- If the message `One or more Hive server failed` is returned, one or more HiveServer components are abnormal. This is a P1-level error. In this case, you must check the HiveServer process and the logs of the process to troubleshoot the issue.
- If the message `All Hive server are failed` is returned, all HiveServer components are abnormal. This is a P0-level error. In this case, you must check the HiveServer process and the logs of the process to troubleshoot the issue.
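The severity mapping above (all healthy, some failed, all failed) can be expressed as a small helper. This is only an illustrative sketch; the per-instance health results would come from a probe such as the one shown earlier:

```python
# Sketch: classify the HA check result from per-instance health results.
def classify_hiveserver_ha(instance_healthy):
    """instance_healthy: list of booleans, one per HiveServer instance."""
    if all(instance_healthy):
        return "Hive server HA status is OK"        # normal state
    if any(instance_healthy):
        return "One or more Hive server failed"     # P1-level error
    return "All Hive server are failed"             # P0-level error
```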
Port existence: inspection_hive_server_port
You can use this check item to check whether port 10000 of the HiveServer component exists on the machine. If port 10000 does not exist, the HiveServer process is in an abnormal state. In this case, you must check the HiveServer process and the logs of the process to troubleshoot the issue.
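As an illustration, a port-existence check of this kind can be approximated with a plain TCP connection attempt. The sketch below is an assumption about how such a probe could be written, not the inspection tool's actual implementation; the same approach applies to the HiveMetaStore port 9083 described later:

```python
import socket

# Sketch: check whether anything is listening on the HiveServer port (10000).
def port_is_open(host="localhost", port=10000, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```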
Garbage collection: inspection_hive_server_gc
- If the usage of JVM heap memory is greater than or equal to 95%, a P0-level error occurs. In this case, you must immediately increase the memory size of the HiveServer component. Otherwise, the HiveServer component may be restarted, and the jobs may fail.
- If the usage of JVM heap memory is greater than or equal to 90% but less than 95%, a P1-level error occurs. In this case, you must immediately increase the memory size of the HiveServer component. Otherwise, the HiveServer component may be restarted, and the jobs may fail.
- If the usage of JVM heap memory is lower than 90%, you can determine whether to adjust the memory size of the HiveServer component based on your business requirements.
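The thresholds above can be summarized in a small helper. This is only a sketch of the severity mapping; how the heap-usage value itself is collected (for example, from the GC metrics) is outside the scope of this example:

```python
# Sketch: map JVM heap usage of the HiveServer component to a severity level,
# mirroring the P0/P1 thresholds described above.
def heap_usage_severity(heap_used_bytes, heap_max_bytes):
    usage = heap_used_bytes / heap_max_bytes
    if usage >= 0.95:
        return "P0"   # increase the memory size immediately
    if usage >= 0.90:
        return "P1"   # increase the memory size immediately
    return "OK"       # adjust based on your business requirements
```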
Number of restarts: inspection_hive_server_restart
- If the HiveServer component is repeatedly restarted within five minutes, a P0-level error occurs. In this case, you must immediately check the HiveServer process and the logs of the process to troubleshoot the issue.
- If the HiveServer component is restarted once within five minutes, a P1-level error occurs. In this case, you must immediately check the HiveServer process and the logs of the process to troubleshoot the issue.
- In other scenarios, the HiveServer component remains in a normal state.
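The restart rules above translate into a simple count within a five-minute window. The sketch below assumes that the restart times of the HiveServer process are available as epoch timestamps; how they are collected is not specified here:

```python
import time

# Sketch: classify restart behavior of the HiveServer component within a
# five-minute window, based on a list of restart times (epoch seconds).
def restart_severity(restart_timestamps, window_seconds=300):
    now = time.time()
    recent = [t for t in restart_timestamps if now - t <= window_seconds]
    if len(recent) >= 2:
        return "P0"   # repeatedly restarted within five minutes
    if len(recent) == 1:
        return "P1"   # restarted once within five minutes
    return "OK"
```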
HiveMetaStore-related check items
High availability: inspection_hive_metastore_ha
- If the message `Hive metastore HA status is OK` is returned, all HiveMetaStore components are in a normal state.
- If the message `One or more metastore failed` is returned, one or more HiveMetaStore components are abnormal. This is a P1-level error. In this case, you must immediately check the HiveMetaStore process and the logs of the process to troubleshoot the issue.
- If the message `All Hive metastore are failed` is returned, all HiveMetaStore components are abnormal. This is a P0-level error. In this case, you must immediately check the HiveMetaStore process and the logs of the process to troubleshoot the issue.
Port existence: inspection_hive_metastore_port
You can use this check item to check whether port 9083 of the HiveMetaStore component exists on the machine. If port 9083 does not exist, the HiveMetaStore process is in an abnormal state. In this case, you must immediately check the HiveMetaStore process and the logs of the process to troubleshoot the issue.
Garbage collection: inspection_hive_metastore_gc
- If the usage of JVM heap memory is greater than or equal to 95%, a P0-level error occurs. In this case, you must immediately increase the memory size of the HiveMetaStore component.
- If the usage of JVM heap memory is greater than or equal to 90% but less than 95%, a P1-level error occurs. In this case, you must immediately increase the memory size of the HiveMetaStore component.
- If the usage of JVM heap memory is lower than 90%, you can determine whether to adjust the memory size of the HiveMetaStore component based on your business requirements.
Number of restarts: inspection_hive_metastore_restart
- If the HiveMetaStore component is repeatedly restarted within five minutes, a P0-level error occurs. In this case, you must immediately check the HiveMetaStore process and the logs of the process to troubleshoot the issue.
- If the HiveMetaStore component is restarted once within five minutes, a P1-level error occurs. In this case, you must immediately check the HiveMetaStore process and the logs of the process to troubleshoot the issue.
- In other scenarios, the HiveMetaStore component remains in a normal state.
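The HiveMetaStore check items follow the same pattern as the HiveServer check items, with port 9083 instead of port 10000. As an illustration only, the helpers sketched in the HiveServer section could be combined into one consolidated HiveMetaStore inspection:

```python
# Sketch: consolidated HiveMetaStore inspection that reuses the port_is_open,
# heap_usage_severity, and restart_severity helpers sketched earlier.
def inspect_hive_metastore(host, heap_used, heap_max, restart_timestamps):
    severities = [
        "P0" if not port_is_open(host, 9083) else "OK",  # port existence
        heap_usage_severity(heap_used, heap_max),        # garbage collection
        restart_severity(restart_timestamps),            # number of restarts
    ]
    for level in ("P0", "P1"):
        if level in severities:
            return level, severities
    return "OK", severities
```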
Key metrics of HiveServer
You can view the key metrics of the HiveServer2 component on the Monitoring tab of your cluster in the E-MapReduce (EMR) console.
- Session-related metrics
OpenSessions and ActiveSessions: You can view the number of open sessions or active sessions. This way, you can check whether a large number of tasks were running when errors occurred and adjust the memory size based on your business requirements.
- JVM-related metrics
JVM MemHeapMax and GC-related metrics: You can view the JVM metrics in the periods of time when errors occur to determine whether to adjust the memory size.
- Task-related metrics
PENDING tasks, ActiveRunTasksCalls, and metrics related to TasksCount: Pending tasks are tasks whose progress is suspended. If a large number of pending tasks exist, you must check whether the memory of the HiveServer process and the scheduling resources of the YARN resource queue are sufficient, or whether large jobs occupy an excessive amount of resources.
Key metrics of HiveMetaStore
You can view the key metrics of the HiveMetaStore component on the Monitoring tab of your cluster in the EMR console.
- JVM-related metrics
JVM MemHeapMax and GC-related metrics: You can view the JVM metrics in the periods of time when errors occur to determine whether to adjust the memory size.
- Metrics related to metadata operations
The GetTable-related metrics and the CreateTable Time metric measure the time that is required to perform metadata operations. If these metrics tend to increase or a related exception occurs, you must check whether the memory of the HiveMetaStore component or the performance of the backend database has become a bottleneck. You can increase the memory size of the HiveMetaStore component based on its current memory size, or upgrade the specifications of the backend database based on the time that is required to run a query on the database.
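As a rough illustration of how to observe metadata latency from the client side, the sketch below times a statement that requires table metadata and is issued through HiveServer2. It is only a proxy for HiveMetaStore and backend-database responsiveness, and it again assumes the PyHive library; the host, port, user, and table names are placeholders:

```python
import time
from pyhive import hive

# Sketch: time a metadata-backed statement as a rough proxy for HiveMetaStore
# and backend-database responsiveness. Connection details are placeholders.
def metadata_latency_seconds(host="emr-header-1", table="default.sample_table"):
    conn = hive.Connection(host=host, port=10000, username="hadoop")
    cursor = conn.cursor()
    start = time.monotonic()
    cursor.execute(f"DESCRIBE {table}")  # requires table metadata from the metastore
    cursor.fetchall()
    return time.monotonic() - start
```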