E-MapReduce:Connect to an external Hive Metastore service

Last Updated: Dec 04, 2025

E-MapReduce (EMR) Serverless Spark supports connecting to an external Hive Metastore service to access data stored in the Hive Metastore. This topic describes how to configure the connection in EMR Serverless Spark so that you can manage and query existing Hive data from your workspace.

Prerequisites

A workspace and an SQL session have been created. For more information, see Create a workspace and Manage SQL sessions.

Limitations

  • You must restart existing sessions in the workspace before they can use the Hive Metastore.

  • After you set Hive Metastore as the default data catalog, your flow tasks use it by default.

Procedure

Step 1: Prepare the Hive Metastore service

Note

This topic uses the Hive Metastore of EMR on ECS as an example of an external service. If a Hive Metastore service already exists in your virtual private cloud (VPC), you can skip this step.

  1. On the EMR on ECS page, create a DataLake cluster that includes the Hive service and uses Built-in MySQL for Metadata. For more information, see Create a cluster.

  2. Log on to the master node of the cluster using Secure Shell (SSH). For more information, see Log on to a cluster.

  3. Run the following command to enter the Hive command line.

    hive
  4. Run the following command to create a table named `dw_users` that points to Object Storage Service (OSS) and then write data to the table.

    CREATE TABLE `dw_users`(
      `name` string)
    LOCATION
      'oss://<yourBucket>/path/to/file';
    
    INSERT INTO dw_users select 'Bob';
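
    To confirm that the table was created and points to the OSS location that you specified, you can optionally run the following check in the same Hive CLI. This is only a quick verification sketch; it is not required by the procedure.

    -- Show the table definition, including the LOCATION clause with the OSS path.
    SHOW CREATE TABLE dw_users;
    -- Query the row that was just inserted. Expected output: Bob
    SELECT * FROM dw_users;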

Step 2: Add a network connection

  1. Go to the Network Connectivity page.

    1. Log on to the EMR console.

    2. In the navigation pane on the left, choose EMR Serverless > Spark.

    3. On the Spark page, click the target workspace.

    4. On the EMR Serverless Spark page, click Network Connectivity in the navigation pane on the left.

  2. On the Network Connectivity page, click Add Network Connection.

  3. In the Add Network Connection dialog box, configure the following parameters and click OK.

    Connection Name: Enter a name for the new connection.

    VPC: Select the same VPC as the EMR cluster.

    VSwitch: Select a vSwitch in the same VPC as the EMR cluster.

    The network connection is established when the Status is Successful.


Step 3: Open the Hive Metastore service port

  1. Obtain the CIDR block of the vSwitch specified in the network connection.

    Log on to the VPC console and navigate to the VSwitches page to find the CIDR block of the vSwitch.


  2. Add a security group rule.

    1. Log on to the EMR on ECS console.

    2. On the EMR on ECS page, click the target cluster ID.

    3. On the Basic Information page, click the link for Cluster Security Group.

    4. On the Security Groups page, click Add Rule. Set Source and Port, and then click OK.

      Port: Enter 9083.

      Source: Enter the CIDR block of the vSwitch that you obtained in the previous step.

      Important: To prevent security risks from external attacks, do not set Source to 0.0.0.0/0.
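
    Obtaining the vSwitch CIDR block and adding the security group rule can also be done from the command line. The following is a minimal sketch, assuming the Alibaba Cloud CLI (aliyun) is installed and configured; the region ID, vSwitch ID, security group ID, and CIDR block are placeholders that you must replace with your own values.

    # Look up the CIDR block of the vSwitch used by the network connection.
    aliyun vpc DescribeVSwitchAttributes --RegionId cn-hangzhou --VSwitchId vsw-xxxxxxxx

    # Open TCP port 9083 in the cluster security group for that CIDR block only.
    aliyun ecs AuthorizeSecurityGroup \
      --RegionId cn-hangzhou \
      --SecurityGroupId sg-xxxxxxxx \
      --IpProtocol tcp \
      --PortRange 9083/9083 \
      --SourceCidrIp 192.168.0.0/24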

Step 4: Connect to Hive Metastore

  1. On the EMR Serverless Spark page, in the navigation pane on the left, click Data Catalog.

  2. On the Data Catalog page, click Add Data Catalog.

  3. In the dialog box, click External Hive Metastore. Configure the following parameters and click OK.

    Network Connectivity: Select the network connection that you added in Step 2.

    Metastore Endpoint: The URI of the Hive Metastore service, in the format thrift://<IP address of Hive metastore>:9083. <IP address of Hive metastore> is the internal IP address of the HMS service. In this example, it is the internal IP address of the master node of the EMR cluster, which you can find on the Node Management page of the EMR cluster.

    Note: If high availability (HA) is enabled for HMS, you can enter the endpoints of multiple nodes. Separate the URIs with commas (,), for example: thrift://<IP address of Hive metastore 1>:9083,thrift://<IP address of Hive metastore 2>:9083.

    Kerberos Keytab File Path: The path to the Kerberos keytab file.

    Kerberos Principal: The name of the principal contained in the keytab file. It is used for identity verification with the Kerberos service. You can run the klist -kt <keytab file> command to view the principal name in the keytab file.
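
    Before you click OK, you can optionally verify that the Metastore port is reachable and that the keytab contains the expected principal. The following shell sketch assumes you have access to an ECS instance in the vSwitch that you selected for the network connection; the IP address and keytab path are placeholders that you must replace with your own values.

    # Check that the Hive Metastore port is reachable from the vSwitch.
    nc -vz <IP address of Hive metastore> 9083

    # List the principals contained in the keytab file.
    klist -kt /path/to/hive.keytab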

Step 5: Use Hive Metastore

  1. On the Data Catalog page, find hive_metastore and click Set As Default in the Actions column to make it the default data catalog for the workspace.


  2. Restart the SQL session.

    If an SQL session already exists in your workspace, you must stop and then restart the session for the hive_metastore settings to take effect.

  3. Query data from a table in the Hive Metastore using SQL Developer.

    1. Create a new SQL job. For more information, see Develop a SparkSQL job.

    2. Run the following command to query the `dw_users` table.

      SELECT * FROM dw_users;


FAQ

How do I access HDFS data?

The configuration method depends on whether the Hadoop Distributed File System (HDFS) cluster is in high availability (HA) mode.

  • Accessing an HDFS path without HA

    If the table location points to an HDFS path in a cluster where HA is not enabled, you must ensure that the domain name in the location is accessible. By default, master-1-1.<cluster-id>.<region>.emr.aliyuncs.com is directly accessible. For other domain names, you must add mappings. For more information, see Manage domain names.
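
    For example, a table whose data resides on such an HDFS path would have a location similar to the following sketch. The table name is only an illustration, and the cluster ID, region, port, and path are placeholders that you must replace with the values of your cluster.

    CREATE TABLE `dw_users_hdfs`(
      `name` string)
    LOCATION
      'hdfs://master-1-1.<cluster-id>.<region>.emr.aliyuncs.com:<port>/path/to/dir';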

  • Accessing an HDFS path with HA

    If the table `location` points to an HDFS path in a cluster where HA is enabled, you must first configure the domain name mappings. Then, in Manage Custom Configuration Files, create a configuration file named hdfs-site.xml and save it to the `/etc/spark/conf` path so that the Java Runtime or Fusion Runtime can access the data. The following code provides a sample file; the complete content must be based on the hdfs-site.xml file in your EMR on ECS cluster.

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>hdfs-cluster</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.hdfs-cluster</name>
        <value>nn1,nn2,nn3</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn1</name>
        <value>master-1-1.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn2</name>
        <value>master-1-2.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn3</name>
        <value>master-1-3.<cluster-id>.<region-id>.emr.aliyuncs.com:<port></value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.hdfs-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    </configuration>
  • (Optional) Configure access to a Kerberos-enabled HDFS path

    If a job needs to access a Kerberos-enabled HDFS path, you must add the spark.kerberos.access.hadoopFileSystems parameter to the Spark configuration. The value of this parameter must be based on the `fs.defaultFS` configuration in the HDFS cluster. For example, the default value for an HA EMR on ECS cluster is hdfs://hdfs-cluster.
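
    For example, in the Spark configuration of the job or session, the parameter would look similar to the following line. The value hdfs://hdfs-cluster matches the fs.defaultFS of the HA example cluster above; replace it with the value from your own cluster.

    spark.kerberos.access.hadoopFileSystems    hdfs://hdfs-cluster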