EMR Serverless Spark connects to an external Hive Metastore over the Thrift protocol (port 9083). Once connected, EMR Serverless Spark uses the Metastore as its data catalog, giving your Spark jobs access to all tables and schemas already registered there.
Prerequisites
Before you begin, make sure you have:
A workspace in EMR Serverless Spark. See Create a workspace.
An SQL session in that workspace. See Manage SQL sessions.
A running Hive Metastore service reachable within your virtual private cloud (VPC). If you don't have one, complete Step 1 first.
Limitations
Restart any existing SQL sessions after connecting to Hive Metastore. Sessions created before the connection is set up cannot use it.
After you set Hive Metastore as the default data catalog, all flow tasks in the workspace use it by default.
Step 1: Set up a Hive Metastore service (optional)
Skip this step if a Hive Metastore service already exists in your VPC. This step uses an EMR on ECS cluster as an example.
Create a DataLake cluster that includes the Hive service on EMR on ECS, with Metadata set to Built-in MySQL. See Create a cluster.
Log on to the master node of the cluster using Secure Shell (SSH). See Log on to a cluster.
Open the Hive CLI:
hiveCreate a table named
dw_usersbacked by Object Storage Service (OSS) and insert a row:CREATE TABLE `dw_users`( `name` string) LOCATION 'oss://<yourBucket>/path/to/file'; INSERT INTO dw_users select 'Bob';
Step 2: Add a network connection
EMR Serverless Spark uses a network connection to reach your VPC, where the Hive Metastore runs.
In the EMR console, go to EMR Serverless > Spark and click your workspace.
In the left navigation pane, click Network Connectivity.
Click Add Network Connection, configure the following parameters, then click OK. The connection is ready when Status shows Successful.
Parameter Description Connection Name A name for the connection VPC The VPC where your Hive Metastore runs vSwitch Select the same vSwitch that is in the same VPC as the EMR cluster 
Step 3: Open the Hive Metastore port
Allow EMR Serverless Spark to reach the Hive Metastore Thrift service on port 9083.
Get the CIDR block of the vSwitch you selected in Step 2. Log in to the VPC console and navigate to the VSwitches page to find the CIDR block.

Add an inbound security group rule on the EMR on ECS cluster.
In the EMR on ECS console, click the target cluster ID.
On the Basic Information page, click the link next to Cluster Security Group.
Click Add Rule, set the following fields, then click OK.
Parameter
Value
Port
9083
Source
The CIDR block of the vSwitch from step 1
ImportantDo not set Source to
0.0.0.0/0. Restricting the source to the vSwitch CIDR block prevents external access to the Metastore port.
Step 4: Connect to Hive Metastore
On the EMR Serverless Spark page, click Data Catalog in the left navigation pane.
Click Add Data Catalog.
Select External Hive Metastore, configure the following parameters, then click OK.
Parameter Description Network Connectivity The network connection you added in Step 2 Metastore Endpoint The Thrift URI of the Hive Metastore, in the format thrift://<IP>:9083. Use the internal IP address of the master node, which you can find on the Node Management page of the EMR on ECS cluster. For high availability (HA) deployments, enter multiple endpoints separated by commas:thrift://<IP1>:9083,thrift://<IP2>:9083Kerberos Keytab File Path The path to the Kerberos keytab file Kerberos Principal The principal name in the keytab file. Run klist -kt <keytab file>to look up the name
Step 5: Query data from Hive Metastore
On the Data Catalog page, find
hive_metastoreand click Set As Default in the Actions column.
Restart any existing SQL sessions. Stop each session and start it again so the new data catalog takes effect.
Run a query to verify the connection. Create a new SparkSQL job (see Develop a SparkSQL job) and run:
SELECT * FROM dw_users;A result set confirms that EMR Serverless Spark can read from your Hive Metastore.
