E-MapReduce (EMR) Serverless Spark allows you to connect to an external Hive Metastore. This way, you can access the data that is stored in the Hive Metastore with ease. This topic describes how to configure settings in EMR Serverless Spark to connect to an external Hive Metastore to implement efficient management and utilization of data resources in a work environment.
Prerequisites
A workspace and an SQL session are created. For more information, see Create a workspace and Manage SQL sessions.
Limits
To use the Hive Metastore service, you must restart an existing session in your workspace.
After you specify a Hive Metastore as a default catalog, your workflow tasks automatically depend on the Hive Metastore.
Procedure
Step 1: Prepare the Hive Metastore service
In this example, the Hive Metastore that is deployed in EMR on ECS is used as an external service. If a Hive Metastore has been deployed in your virtual private cloud (VPC), skip this step.
On the EMR on ECS page, create a DataLake cluster that contains the Hive service and for which the Metadata parameter is set to Built-in MySQL. For more information, see Create a cluster.
Log on to the master node of the DataLake cluster in SSH mode. For more information, see Log on to a cluster.
Run the following command to open the Hive CLI:
hive
Run the following commands to create a table named dw_users that points to Object Storage Service (OSS) and write data to the table:
CREATE TABLE `dw_users`( `name` string) LOCATION 'oss://<yourBucket>/path/to/file'; INSERT INTO dw_users select 'Bob';
Step 2: Create a network connection
Go to the Network Connections page.
Log on to the EMR console.
In the left-side navigation pane, choose
.On the Spark page, click the name of the desired workspace.
In the left-side navigation pane of the EMR Serverless Spark page, click Network Connections.
On the Network Connections page, click Create Network Connection.
In the Create Network Connection dialog box, configure parameters and click OK. The following table describes the parameters.
Parameter
Description
Name
The name of the network connection.
VPC
The VPC in which your EMR cluster resides.
vSwitch
The vSwitch that is deployed in the VPC in which the EMR cluster is deployed.
If Succeeded is displayed in the Status column of the connection, the network connection is created.
Step 3: Enable the port of the Hive Metastore
Obtain the CIDR block of the vSwitch that you specified when you created the network connection.
You can log on to the VPC console and obtain the CIDR block of a desired vSwitch on the vSwitch page.
Configure security group rules.
Log on to the EMR console.
On the EMR on ECS page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.
In the Security section of the Basic Information tab, click the link to the right of Cluster Security Group.
On the Security Group Details tab, click Add Rule, configure the Port Range and Authorization Object parameters, and then click Save.
Parameter
Description
Port Range
The port number. Enter 9083.
Authorization Object
The CIDR block of the vSwitch obtained in the previous step.
ImportantTo prevent attacks from external users, we recommend that you do not set the Authorization Object parameter to 0.0.0.0/0.
Step 4: Connect to the Hive Metastore
In the left-side navigation pane of the EMR Serverless Spark page, click Catalogs.
On the Catalogs page, click Add Catalog.
In the Add Catalog dialog box, click External Hive Metastore, configure the following parameters, and then click OK. The following table describes the parameters.
Parameter
Description
Network Connections
The network connection that you created in Step 2.
Metastore Service Address
The Uniform Resource Identifier (URI) of the Hive metastore. Configure this parameter in the
thrift://<IP address of a Hive Metastore>:9083
format.<IP address of a Hive Metastore>
specifies the internal IP address of the Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view the internal IP address on the Nodes tab of the EMR cluster.
Step 5: Use Hive Metastore
On the Catalogs page, find the hive_metastore catalog and click Set as default in the Actions column to configure the hive_metastore catalog as the default data catalog in the current workspace.
Restart the SQL session.
If an SQL session is running in your workspace, stop the SQL session, and then restart the SQL session to ensure that the setting of the hive_metastore catalog takes effect.
Query data from tables in the hive_metastore catalog in an SQL job.
Create an SQL job. For more information, see Develop an SQL.
Run the command that is shown in the following figure to query data from the dw_users table in the hive_metastore catalog.
SELECT * FROM dw_users;