Hive is a data warehouse framework based on Hadoop that supports extract, transform, and load (ETL) operations and metadata management in big data scenarios.
Hive components
| Name | Description |
| --- | --- |
| HiveServer2 | A HiveQL query server that receives SQL requests from a Java Database Connectivity (JDBC) client over the Thrift or HTTP protocol. It supports concurrent access from multiple clients and identity authentication. A connection sketch is provided after this table. |
| Hive MetaStore | The metadata management component. It stores metadata, such as databases and tables, for other engines. For example, both Spark and Presto use this component for metadata management. |
| Hive Client | The Hive client. It submits SQL jobs and converts them into MapReduce, Tez, or Spark jobs based on the configured execution engine. This component is installed on all nodes of an EMR cluster. |
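Because HiveServer2 exposes a standard JDBC endpoint, any JDBC client can submit HiveQL to it. The following minimal sketch connects with the open source Hive JDBC driver and runs a query. The host name emr-header-1, the default port 10000, and the hive user are placeholder assumptions; replace them with the endpoint and authentication settings of your own cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Example {
    public static void main(String[] args) throws Exception {
        // Load the open source Hive JDBC driver (org.apache.hive:hive-jdbc).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder endpoint: 10000 is the default port for Thrift binary mode.
        // For HTTP mode, a URL such as
        // jdbc:hive2://<host>:10001/default;transportMode=http;httpPath=cliservice
        // is typically used instead.
        String url = "jdbc:hive2://emr-header-1:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Statements submitted through HiveServer2 are compiled into MapReduce,
            // Tez, or Spark jobs according to the configured execution engine.
            try (ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```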
Feature enhancements
For more information about the compatibility between EMR, Hadoop, and Hive versions, see Release versions. The following tables describe the Hive feature enhancements in each EMR version.
EMR 5.x series
| EMR version | Component version | Feature enhancement |
| --- | --- | --- |
| EMR-5.20.0 | Hive 3.1.3 | Optimized the performance of adding fields to partitioned tables. |
| EMR-5.17.4 | Hive 3.1.3 | Supports the deployment of Master-Extend node groups. |
| EMR-5.12.1 | Hive 3.1.3 | By default, Hive warehouse data is stored in OSS-HDFS. |
| EMR-5.9.0 | Hive 3.1.3 | Supports Kerberos authentication. |
| EMR-5.8.0 | Hive 3.1.2 | LDAP authentication can be enabled with one click. |
| EMR-5.6.0 | Hive 3.1.2 | Fixed an issue where both the original task and the speculative task were committed when speculative execution was enabled for Hive on Tez. |
| EMR-5.5.0 | Hive 3.1.2 |  |
| EMR-5.4.0 | Hive 3.1.2 | Supports optimizing the metadata of multiple Hive tables at the same time in JindoFS block storage mode. This feature is disabled by default. |
| EMR-5.3.0 | Hive 3.1.2 | Supports optimizing the metadata of multiple Hive tables at the same time in JindoFS block storage mode. |
| EMR-5.2.1 | Hive 3.1.2 |  |
EMR 3.x series
| EMR version | Component version | Feature enhancement |
| --- | --- | --- |
| EMR-3.51.4 | Hive 2.3.9 | Supports the deployment of Master-Extend node groups. |
| EMR-3.46.1 | Hive 2.3.9 | By default, Hive warehouse data is stored in OSS-HDFS. |
| EMR-3.40.0 | Hive 2.3.8 |  |
| EMR-3.39.1 | Hive 2.3.8 | Hive is adapted to JindoSDK. |
| EMR-3.36.1 | Hive 2.3.8 |  |
| EMR-3.35.0 | Hive 2.3.7 | Fixed community-reported issues related to fetch tasks. |
| EMR-3.34.0 | Hive 2.3.7 |  |
| EMR-3.33.0 | Hive 2.3.7 |  |
| EMR-3.32.0 | Hive 2.3.5 |  |
| EMR-3.30.0 | Hive 2.3.5 |  |
| EMR-3.29.0 | Hive 2.3.5 |  |
| EMR-3.28.0 | Hive 2.3.5 | Supports Delta Lake 0.6.0. |
| EMR-3.27.2 | Hive 2.3.5 |  |
| EMR-3.26.3 | Hive 2.3.5 | HCatalog tables support the direct committer. |
| EMR-3.25.0 | Hive 2.3.5 | Fixed an issue where MapReduce jobs failed in automatic LOCAL mode. |
| EMR-3.24.0 | Hive 2.3.5 |  |
| EMR-3.23.0 | Hive 2.3.5 |  |
| Versions earlier than EMR-3.23.0 | Hive 2.x | Metadata can be stored in a unified external database. All clusters that use the external Hive metastore share the same metadata. |
EMR 4.x series
| EMR version | Component version | Feature enhancement |
| --- | --- | --- |
| EMR-4.10.0 | Hive 3.1.2 |  |
| EMR-4.8.0 | Hive 3.1.2 |  |
| EMR-4.6.0 | Hive 3.1.2 |  |
| EMR-4.5.0 | Hive 3.1.2 |  |
| EMR-4.4.1 | Hive 3.1.2 | Optimized the default parameter configurations. |
| EMR-4.4.0 | Hive 3.1.2 |  |
| EMR-4.3.0 | Hive 3.1.1 | Supports custom deployments. |
Hive syntax
To ensure a consistent user experience, EMR retains the syntax of open source components as much as possible. EMR Hive is fully compatible with the syntax of Apache Hive.
For more information about Apache Hive, visit the Apache Hive official website.
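As a concrete illustration of this compatibility, the sketch below runs unmodified Apache Hive DDL and DML through the same JDBC connection pattern shown earlier. The endpoint, credentials, table name demo_logs, and partition value are placeholder assumptions for demonstration only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSyntaxExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 endpoint and credentials; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://emr-header-1:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Standard Apache Hive DDL: create a partitioned table stored as ORC.
            stmt.execute("CREATE TABLE IF NOT EXISTS demo_logs (id BIGINT, msg STRING) "
                    + "PARTITIONED BY (ds STRING) STORED AS ORC");
            // Standard Apache Hive DML: insert into a static partition.
            stmt.execute("INSERT INTO demo_logs PARTITION (ds = '2024-01-01') "
                    + "VALUES (1, 'hello')");
        }
    }
}
```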
References
For more information about connecting to Hive with a Hive client, see Hive connection methods.
For more information about identity authentication for the Hive service, see Use Kerberos authentication and Use LDAP authentication.
For information about accessing data lake data using Hive, see Use Hive to access Delta Lake and Hudi data.
For more information about common optimization methods for Hive jobs, see Hive job optimization.
For information about how to troubleshoot common issues with Hive jobs, see Troubleshoot exceptions for Hive jobs.