You can use a Hive connector to query and analyze data in a Hive data warehouse. This topic describes how to use and configure Hive connectors.
Background information
A Hive data warehouse system consists of the following parts:
Data files in various formats. In most cases, the data files are stored in Hadoop Distributed File System (HDFS) or an object storage system, such as Alibaba Cloud Object Storage Service (OSS).
Metadata about how the data files are mapped to schemas and tables. The metadata is stored in a database such as a MySQL database. You can access the metadata by using a Hive metastore.
A query language called HiveQL. This query language is executed on a distributed computing framework, such as MapReduce or Tez.
Prerequisites
An E-MapReduce (EMR) cluster with the Presto service deployed is created. The cluster must be of V3.45.0, V5.11.0, or a minor version later than V3.45.0 or V5.11.0. For more information, see Create a cluster.
Usage notes
Hive connectors support various distributed storage systems, such as HDFS, Alibaba Cloud OSS, and Amazon S3-compatible systems. You can use a Hive connector to query data from these storage systems.
Before you access Hive Metastore, make sure that the coordinator node and all worker nodes can connect to the Hive Metastore that you configured and to the distributed storage system whose data you want to query. By default, Hive Metastore is accessed by using the Thrift protocol over port 9083.
You can use a Hive connector to access Data Lake Formation (DLF). To do so, set the Metadata parameter to DLF Unified Metadata when you create the cluster.
Example: Query data
Open the Presto CLI. For more information, see Access Presto by running commands.
Execute the following statement to create a test table:
create table hive.default.doc(id int);
Execute the following statement to write data to the test table:
insert into hive.default.doc values(1),(2);
Execute the following statement to query data in the test table:
select * from hive.default.doc;
The following information is returned:
 id
----
  1
  2
(2 rows)
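Optionally, execute the following statement to drop the test table when you no longer need it:
drop table hive.default.doc;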
Default configurations of a Hive connector
Parameter | Description |
--- | --- |
hive.metastore.uri | The Uniform Resource Identifier (URI) that is used to access Hive Metastore by using the Thrift protocol. The default value of this parameter is in the format of thrift://&lt;Hive Metastore host&gt;:9083. You can specify any valid value when you use a Hive connector to access DLF. |
hive.config.resources | HDFS configuration files. Separate the names of the configuration files with commas (,). Example: core-site.xml,hdfs-site.xml. Make sure that the configuration files exist on all hosts where Presto runs. In an EMR cluster, the core-site.xml configuration file contains the configurations that are required to access OSS. We recommend that you do not modify this file. |
hive.recursive-directories | Specifies whether data can be read from subdirectories of a table or partition directory. Default value: true. This property is similar to the hive.mapred.supports.subdirectories property of Hive. |
hive.non-managed-table-writes-enabled | Specifies whether to enable data writes to non-managed (external) Hive tables. Default value: true. |
hive.copy-on-first-write-configuration-enabled | Specifies whether to reduce the number of copies of hdfsConfiguration. Default value: false. We recommend that you retain the default value false. If you set this parameter to true, password-free access to OSS, LDAP authentication, and Kerberos authentication become invalid. |
hive.hdfs.impersonation.enabled | Specifies whether to enable user impersonation. Default value: false. If you access HDFS as a proxy user, you must enable user impersonation. Otherwise, you do not need to enable it. |
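The following snippet is a minimal sketch of a Hive catalog file that uses these parameters. The file path and the metastore host are hypothetical examples; replace them with the values of your own cluster.
# etc/catalog/hive.properties: a minimal Hive catalog sketch.
connector.name=hive-hadoop2
# Replace the hypothetical host with the address of your Hive Metastore.
hive.metastore.uri=thrift://<Hive Metastore host>:9083
# These configuration files must exist on all hosts where Presto runs.
hive.config.resources=core-site.xml,hdfs-site.xml
# Allow reads from subdirectories of table or partition directories.
hive.recursive-directories=true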
Configuration of multiple connectors
If you have multiple Hive clusters or you want to access DLF and Hive Metastore at the same time, you can use one of the following methods to configure multiple connectors:
Create an appropriate number of catalog files in the etc/catalog directory. Make sure that the file name extension of each file is .properties.
For example, if you create a property file named sales.properties, Presto uses the connector that is configured in the file to create a catalog named sales. Presto identifies a connector whose connector.name parameter is set to hive-hadoop2 as a Hive connector, as shown in the following sketch.
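The following sketch shows what sales.properties might contain. The metastore address is a hypothetical example for a second Hive cluster; replace it with your own value.
# etc/catalog/sales.properties: creates a catalog named sales.
connector.name=hive-hadoop2
# Hypothetical address of the Hive Metastore of the second cluster.
hive.metastore.uri=thrift://<second Hive Metastore host>:9083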
Use placeholder connectors provided by EMR.
By default, EMR provides five placeholder connectors named connector1, connector2, connector3, connector4, and connector5. You can select a placeholder connector and modify its configurations to use it as a Hive connector. However, the catalog that is created from a placeholder connector can be named only connector1, connector2, connector3, connector4, or connector5, as shown in the following example.
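For example, to repurpose connector1 as a Hive connector for another Hive cluster, you might set configurations similar to the following sketch. The metastore address is hypothetical; replace it with your own value.
# Configurations for the placeholder connector connector1.
connector.name=hive-hadoop2
# Hypothetical address of the Hive Metastore of the other cluster.
hive.metastore.uri=thrift://<another Hive Metastore host>:9083
Queries against this connector must then use the catalog name connector1. Example:
select * from connector1.default.doc;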