E-MapReduce: Connect to an external Hive Metastore service

Last Updated: Mar 25, 2026

EMR Serverless Spark connects to an external Hive Metastore over the Thrift protocol (port 9083). Once connected, EMR Serverless Spark uses the Metastore as its data catalog, giving your Spark jobs access to all tables and schemas already registered there.

Prerequisites

Before you begin, make sure you have:

  • A workspace in EMR Serverless Spark. See Create a workspace.

  • An SQL session in that workspace. See Manage SQL sessions.

  • A running Hive Metastore service reachable within your virtual private cloud (VPC). If you don't have one, complete Step 1 first.

Limitations

  • Restart any existing SQL sessions after connecting to Hive Metastore. Sessions created before the connection is set up cannot use it.

  • After you set Hive Metastore as the default data catalog, all flow tasks in the workspace use it by default.

Step 1: Set up a Hive Metastore service (optional)

Note

Skip this step if a Hive Metastore service already exists in your VPC. This step uses an EMR on ECS cluster as an example.

  1. Create a DataLake cluster that includes the Hive service on EMR on ECS, with Metadata set to Built-in MySQL. See Create a cluster.

  2. Log on to the master node of the cluster using Secure Shell (SSH). See Log on to a cluster.

  3. Open the Hive CLI:

    hive

  4. Create a table named dw_users backed by Object Storage Service (OSS) and insert a row:

    CREATE TABLE `dw_users`(
      `name` string)
    LOCATION
      'oss://<yourBucket>/path/to/file';
    
    INSERT INTO dw_users SELECT 'Bob';

Step 2: Add a network connection

EMR Serverless Spark uses a network connection to reach your VPC, where the Hive Metastore runs.

  1. In the EMR console, go to EMR Serverless > Spark and click your workspace.

  2. In the left navigation pane, click Network Connectivity.

  3. Click Add Network Connection, configure the following parameters, then click OK. The connection is ready when Status shows Successful.

    Parameter        Description
    Connection Name  A name for the connection.
    VPC              The VPC where your Hive Metastore runs.
    vSwitch          A vSwitch in the same VPC as the EMR cluster.

Step 3: Open the Hive Metastore port

Allow EMR Serverless Spark to reach the Hive Metastore Thrift service on port 9083.

  1. Get the CIDR block of the vSwitch you selected in Step 2. Log on to the VPC console and go to the VSwitches page to find the CIDR block.

  2. Add an inbound security group rule on the EMR on ECS cluster.

    1. In the EMR on ECS console, click the target cluster ID.

    2. On the Basic Information page, click the link next to Cluster Security Group.

    3. Click Add Rule, set the following fields, then click OK.

      Parameter  Value
      Port       9083
      Source     The CIDR block of the vSwitch that you obtained in step 1 of this section

      Important

      Do not set Source to 0.0.0.0/0. Restricting the source to the vSwitch CIDR block prevents external access to the Metastore port.
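
Before you save the rule, you can sanity-check the Source value offline. The following is a minimal Python sketch, not part of the EMR tooling: the function name is_safe_source and the sample CIDR blocks are hypothetical, and it assumes IPv4 CIDR notation. It rejects 0.0.0.0/0 and any range that extends beyond the vSwitch CIDR block.

```python
import ipaddress


def is_safe_source(source_cidr: str, vswitch_cidr: str) -> bool:
    """Return True if source_cidr lies entirely inside vswitch_cidr.

    Hypothetical helper for illustration; it only inspects the strings
    and does not call any cloud API.
    """
    source = ipaddress.ip_network(source_cidr, strict=True)
    vswitch = ipaddress.ip_network(vswitch_cidr, strict=True)
    # A prefix length of 0 (for example 0.0.0.0/0) matches every address.
    if source.prefixlen == 0:
        return False
    # subnet_of is True when the two networks are equal or source is smaller.
    return source.subnet_of(vswitch)
```

For example, with a placeholder vSwitch CIDR block of 192.168.0.0/24, is_safe_source("192.168.0.0/24", "192.168.0.0/24") returns True, and is_safe_source("0.0.0.0/0", "192.168.0.0/24") returns False.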

Step 4: Connect to Hive Metastore

  1. On the EMR Serverless Spark page, click Data Catalog in the left navigation pane.

  2. Click Add Data Catalog.

  3. Select External Hive Metastore, configure the following parameters, then click OK.

    Parameter                  Description
    Network Connectivity       The network connection you added in Step 2.
    Metastore Endpoint         The Thrift URI of the Hive Metastore, in the format
                               thrift://<IP>:9083. Use the internal IP address of the
                               master node, which you can find on the Node Management
                               page of the EMR on ECS cluster. For high availability
                               (HA) deployments, enter multiple endpoints separated by
                               commas: thrift://<IP1>:9083,thrift://<IP2>:9083
    Kerberos Keytab File Path  The path to the Kerberos keytab file.
    Kerberos Principal         The principal name in the keytab file. Run
                               klist -kt <keytab file> to look up the name.
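
The Metastore Endpoint value follows a fixed pattern, so it can be assembled from the master node IP addresses. The following is a minimal Python sketch under stated assumptions: the function name metastore_endpoint is hypothetical, the IP addresses are placeholders, and only IPv4 addresses are handled.

```python
import ipaddress

DEFAULT_METASTORE_PORT = 9083  # Thrift port used throughout this topic


def metastore_endpoint(ips, port=DEFAULT_METASTORE_PORT):
    """Build a comma-separated list of thrift://<IP>:<port> URIs.

    Hypothetical helper for illustration. Raises ValueError if the list
    is empty or an entry is not a valid IPv4 address.
    """
    if not ips:
        raise ValueError("at least one master node IP address is required")
    uris = []
    for ip in ips:
        ipaddress.IPv4Address(ip)  # validation only; raises on bad input
        uris.append(f"thrift://{ip}:{port}")
    return ",".join(uris)
```

For an HA deployment with two placeholder addresses, metastore_endpoint(["192.168.0.10", "192.168.0.11"]) returns thrift://192.168.0.10:9083,thrift://192.168.0.11:9083.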

Step 5: Query data from Hive Metastore

  1. On the Data Catalog page, find hive_metastore and click Set As Default in the Actions column.

  2. Restart any existing SQL sessions. Stop each session and start it again so the new data catalog takes effect.

  3. Run a query to verify the connection. Create a new SparkSQL job (see Develop a SparkSQL job) and run:

    SELECT * FROM dw_users;

    A result set confirms that EMR Serverless Spark can read from your Hive Metastore.
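
    If the query fails, one way to narrow down the cause is to test whether port 9083 accepts TCP connections from a host in the vSwitch you selected in Step 2. The following Python sketch is an assumption-laden aid rather than an official diagnostic: the function name can_reach is hypothetical, and a successful TCP connection shows only that the port is open, not that the Thrift service or Kerberos authentication works.

```python
import socket


def can_reach(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. It opens and immediately closes
    a plain TCP connection; it does not speak the Thrift protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

    Call can_reach("<IP>") with the internal IP address of the master node. A False result suggests checking the security group rule from Step 3 and the network connection from Step 2.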

FAQ

How do I access HDFS data?

The approach depends on whether the Hadoop Distributed File System (HDFS) cluster has high availability (HA) enabled and whether it uses Kerberos authentication.

  • Without HA — The default domain name master-1-1.<cluster-id>.<region-id>.emr.aliyuncs.com is accessible directly. For other domain names, add mappings. See Manage domain names.

  • With HA — Configure domain name mappings first, then create an hdfs-site.xml file in Manage Custom Configuration Files and save it to /etc/spark/conf. This lets both Java Runtime and Fusion Runtime access the data. Use the hdfs-site.xml from your EMR on ECS cluster as the source of truth; the following snippet shows the required properties:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>hdfs-cluster</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.hdfs-cluster</name>
        <value>nn1,nn2,nn3</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn1</name>
        <value>master-1-1.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn2</name>
        <value>master-1-2.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn3</name>
        <value>master-1-3.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.hdfs-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    </configuration>

  • With Kerberos-enabled HDFS — Add the spark.kerberos.access.hadoopFileSystems parameter to your Spark configuration, setting its value to the fs.defaultFS of your HDFS cluster. For an HA EMR on ECS cluster, the value is typically hdfs://hdfs-cluster.
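
The HA properties listed above follow a regular naming scheme, so the file can be generated instead of edited by hand. The following is a minimal Python sketch, assuming you already know the nameservice name and the RPC address of each NameNode; the function name build_hdfs_site is hypothetical. The hdfs-site.xml from your EMR on ECS cluster remains the source of truth for the actual values.

```python
import xml.etree.ElementTree as ET

# Failover proxy provider class used in the snippet above.
PROXY_PROVIDER = (
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)


def build_hdfs_site(nameservice, namenodes):
    """Return hdfs-site.xml content for an HA nameservice.

    Hypothetical helper for illustration.
    namenodes maps a NameNode ID (for example "nn1") to its "host:port"
    RPC address, in the order the IDs should be listed.
    """
    props = [
        ("dfs.nameservices", nameservice),
        (f"dfs.ha.namenodes.{nameservice}", ",".join(namenodes)),
    ]
    for nn_id, address in namenodes.items():
        props.append((f"dfs.namenode.rpc-address.{nameservice}.{nn_id}", address))
    props.append(
        (f"dfs.client.failover.proxy.provider.{nameservice}", PROXY_PROVIDER)
    )

    root = ET.Element("configuration")
    for name, value in props:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="  ")
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"
```

Pass the real host names and ports from your cluster, for example build_hdfs_site("hdfs-cluster", {"nn1": "<host1>:<port>", "nn2": "<host2>:<port>", "nn3": "<host3>:<port>"}) with the placeholders replaced, and upload the returned text as the custom hdfs-site.xml file.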