
E-MapReduce:Connect to an external Hive Metastore

Last Updated: Apr 17, 2025

E-MapReduce (EMR) Serverless Spark can connect to an external Hive Metastore, which lets you directly access the data that the metastore manages. This topic describes how to configure EMR Serverless Spark to connect to an external Hive Metastore so that you can efficiently manage and use data resources in your work environment.

Prerequisites

A workspace and an SQL session are created. For more information, see Create a workspace and Manage SQL sessions.

Limits

  • To use the Hive Metastore service in an existing session, you must restart that session in your workspace.

  • After you specify a Hive Metastore as the default catalog, your workflow tasks automatically depend on the Hive Metastore.

Procedure

Step 1: Prepare the Hive Metastore service

Note

In this example, the Hive Metastore that is deployed in EMR on ECS is used as an external service. If a Hive Metastore has been deployed in your virtual private cloud (VPC), skip this step.

  1. On the EMR on ECS page, create a DataLake cluster that contains the Hive service and for which the Metadata parameter is set to Built-in MySQL. For more information, see Create a cluster.

  2. Log on to the master node of the DataLake cluster in SSH mode. For more information, see Log on to a cluster.

  3. Run the following command to open the Hive CLI:

    hive
  4. Run the following commands to create a table named dw_users that points to Object Storage Service (OSS) and write data to the table:

    CREATE TABLE `dw_users`(
      `name` STRING)
    LOCATION
      'oss://<yourBucket>/path/to/file';
    
    INSERT INTO dw_users SELECT 'Bob';
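The LOCATION clause stores the table data in OSS rather than on the cluster's local disks. As a quick illustration of how such a location breaks down (the bucket name below is hypothetical), the following sketch splits an OSS table location into its bucket and object path with Python's standard urllib.parse:

```python
from urllib.parse import urlparse

def parse_oss_location(location: str) -> tuple:
    """Split an OSS table location such as 'oss://my-bucket/path/to/file'
    into (bucket, object_path)."""
    parsed = urlparse(location)
    if parsed.scheme != "oss":
        raise ValueError(f"not an OSS URI: {location}")
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical bucket and path for illustration only:
bucket, path = parse_oss_location("oss://my-bucket/warehouse/dw_users")
print(bucket, path)  # my-bucket warehouse/dw_users
```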

Step 2: Create a network connection

  1. Go to the Network Connections page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the desired workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Network Connections.

  2. On the Network Connections page, click Create Network Connection.

  3. In the Create Network Connection dialog box, configure the parameters and click OK.

    • Name: The name of the network connection.

    • VPC: The VPC in which your EMR cluster resides.

    • vSwitch: A vSwitch that is deployed in the same VPC as the EMR cluster.

    If Succeeded is displayed in the Status column of the connection, the network connection is created.


Step 3: Enable the port of the Hive Metastore

  1. Obtain the CIDR block of the vSwitch that you specified when you created the network connection.

    You can log on to the VPC console and obtain the CIDR block of a desired vSwitch on the vSwitch page.


  2. Configure security group rules.

    1. Log on to the EMR console.

    2. On the EMR on ECS page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.

    3. In the Security section of the Basic Information tab, click the link to the right of Cluster Security Group.

    4. On the Security Group Details tab, click Add Rule, configure the Port Range and Authorization Object parameters, and then click Save.

      • Port Range: Enter 9083, the default port on which the Hive Metastore listens.

      • Authorization Object: The CIDR block of the vSwitch that you obtained in the previous step.

      Important

      To prevent attacks from external users, we recommend that you do not set the Authorization Object parameter to 0.0.0.0/0.
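To see why the vSwitch CIDR block is the right authorization object, you can check whether a given client IP address falls within a candidate CIDR block. A minimal sketch with Python's standard ipaddress module (the addresses below are made up for illustration):

```python
import ipaddress

def is_authorized(client_ip: str, authorized_cidr: str) -> bool:
    """Return True if client_ip falls within the authorized CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(authorized_cidr)

# A vSwitch CIDR such as 192.168.0.0/24 admits only its own addresses,
# whereas 0.0.0.0/0 admits every IPv4 address, hence the warning above.
print(is_authorized("192.168.0.10", "192.168.0.0/24"))  # True
print(is_authorized("10.0.0.5", "192.168.0.0/24"))      # False
print(is_authorized("10.0.0.5", "0.0.0.0/0"))           # True
```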

Step 4: Connect to the Hive Metastore

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Catalogs.

  2. On the Catalogs page, click Add Catalog.

  3. In the Add Catalog dialog box, click External Hive Metastore, configure the parameters, and then click OK.


    • Network Connections: The network connection that you created in Step 2.

    • Metastore Service Address: The Uniform Resource Identifier (URI) of the Hive Metastore, in the thrift://<IP address of a Hive Metastore>:9083 format. <IP address of a Hive Metastore> is the internal IP address of the machine that hosts the Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view this IP address on the Nodes tab of the EMR cluster.
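The expected address format can be checked programmatically before you save the catalog. A small sketch that builds and validates a metastore URI (the IP address below is a placeholder standing in for your master node's internal IP address):

```python
import re

# thrift://<IPv4 address>:<port>
METASTORE_URI_PATTERN = re.compile(r"^thrift://(\d{1,3}(?:\.\d{1,3}){3}):(\d+)$")

def build_metastore_uri(host_ip: str, port: int = 9083) -> str:
    """Build a Hive Metastore URI in the thrift://<ip>:<port> format."""
    uri = f"thrift://{host_ip}:{port}"
    if not METASTORE_URI_PATTERN.match(uri):
        raise ValueError(f"malformed metastore URI: {uri}")
    return uri

# Placeholder internal IP address:
print(build_metastore_uri("192.168.0.100"))  # thrift://192.168.0.100:9083
```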

Step 5: Use Hive Metastore

  1. On the Catalogs page, find the hive_metastore catalog and click Set as default in the Actions column to make it the default data catalog in the current workspace.


  2. Restart the SQL session.

    If an SQL session is running in your workspace, stop it and then restart it so that the hive_metastore catalog setting takes effect.

  3. Query data from tables in the hive_metastore catalog in an SQL job.

    1. Create an SQL job. For more information, see Develop an SQL job.

    2. Run the following command to query data from the dw_users table in the hive_metastore catalog.

      SELECT * FROM dw_users;


FAQ

How do I access HDFS data?

The method that you can use to access HDFS data varies based on whether your HDFS cluster is a high-availability (HA) cluster.

  • Non-HA HDFS cluster

    The location of the table points to an HDFS path of a non-HA cluster, in which a single NameNode manages the HDFS namespace. In this case, you can access the HDFS data directly.

  • HA HDFS cluster

    The location of the table points to an HDFS path of an HA cluster, which has multiple NameNodes. In this case, you must add the required configurations for the desired session in the Spark Configuration section before you can use Java Runtime or the Fusion engine to access the HDFS data.

    Use Java Runtime to access HDFS data

    Add the following configuration items to the Spark Configuration section of the desired session. To obtain the actual values, find the desired cluster in the EMR console and click Services in the Actions column. On the Services tab, find HDFS and click Configure. Then, view the configuration items on the hdfs-site.xml subtab of the Configure tab.

    Note
    • When you search for a configuration item by name in the EMR console, remove the spark.hadoop. prefix from the name.

    • The values of the configuration items in the following code are for reference only. You can view the actual values in the EMR console.

    spark.hadoop.dfs.namenode.rpc-address.hdfs-cluster.nn1  master-1-1.<cluster-id>.<region>.emr.aliyuncs.com:***
    spark.hadoop.dfs.nameservices                       hdfs-cluster
    spark.hadoop.dfs.ha.namenodes.hdfs-cluster          nn1,nn2,nn3
    spark.hadoop.dfs.client.failover.proxy.provider.hdfs-cluster  org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

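As the note above states, each Spark property maps back to an hdfs-site.xml key once the spark.hadoop. prefix is dropped. A trivial sketch of that mapping, using a key from the sample configuration above:

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."

def to_hdfs_site_key(spark_key: str) -> str:
    """Strip the spark.hadoop. prefix to get the hdfs-site.xml property name."""
    if not spark_key.startswith(SPARK_HADOOP_PREFIX):
        raise ValueError(f"not a spark.hadoop.* property: {spark_key}")
    return spark_key[len(SPARK_HADOOP_PREFIX):]

# The name to search for on the hdfs-site.xml subtab:
print(to_hdfs_site_key("spark.hadoop.dfs.ha.namenodes.hdfs-cluster"))
# dfs.ha.namenodes.hdfs-cluster
```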

    Use the Fusion engine to access HDFS data

    1. Add the following configuration items to the Spark Configuration section of the desired session. To obtain the actual values, find the desired cluster in the EMR console and click Services in the Actions column. On the Services tab, find HDFS and click Configure. Then, view the configuration items on the hdfs-site.xml subtab of the Configure tab.

      Note
      • When you search for a configuration item by name in the EMR console, remove the spark.hadoop. prefix from the name.

      • The values of the configuration items in the following code are for reference only. You can view the actual values in the EMR console.

      spark.hadoop.dfs.namenode.rpc-address.hdfs-cluster.nn1  master-1-1.<cluster-id>.<region>.emr.aliyuncs.com:****
      spark.hadoop.dfs.nameservices                       hdfs-cluster
      spark.hadoop.dfs.ha.namenodes.hdfs-cluster          nn1,nn2,nn3
      spark.hadoop.dfs.client.failover.proxy.provider.hdfs-cluster  org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider


    2. Copy the hdfs-site.xml file of the HDFS cluster to OSS and rename it hdfs-client.xml. For example, the renamed file can be stored at oss://<bucket>/tmp/hdfs-client.xml.

      For information about the paths where the configuration files are stored, see Paths of frequently used files.

    3. Add the following configurations in the Spark Configuration section of the desired session:

      spark.files                                         oss://<bucket>/tmp/hdfs-client.xml
      spark.executorEnv.LIBHDFS3_CONF                     /opt/spark/work-dir/hdfs-client.xml
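If you prefer to assemble hdfs-client.xml by hand rather than copy it from the cluster, a minimal file only needs the HA nameservice properties shown earlier, in the standard Hadoop configuration XML layout. A hedged sketch that writes such a file with Python's standard xml.etree (the property values are placeholders mirroring the sample configuration above; use the values from your own cluster):

```python
import xml.etree.ElementTree as ET

def write_hdfs_client_xml(path: str, properties: dict) -> None:
    """Write a Hadoop-style configuration file:
    <configuration><property><name/><value/></property>...</configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Placeholder values taken from the sample configuration above:
write_hdfs_client_xml("hdfs-client.xml", {
    "dfs.nameservices": "hdfs-cluster",
    "dfs.ha.namenodes.hdfs-cluster": "nn1,nn2,nn3",
    "dfs.client.failover.proxy.provider.hdfs-cluster":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
})
```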