E-MapReduce: Establish network connectivity between EMR Serverless Spark and other VPCs

Last Updated: Dec 04, 2025

The network connectivity feature lets you establish a connection between Serverless Spark and your virtual private cloud (VPC). This connection enables you to access data sources and servers or call other services within the VPC. This topic provides an example of how to connect Spark SQL and Application JAR jobs to a Hive Metastore (HMS) in your VPC by configuring network connectivity.

Prerequisites

A data source must be prepared. This topic uses a DataLake cluster as an example. The cluster must be created on the EMR on ECS page, include the Hive service, and use the Built-in MySQL database for Metadata. For more information, see Create a cluster.

Limitations

Currently, you can only use vSwitches in the following zones.

  • China regions

    Region Name          Region ID      Zones
    China (Hangzhou)     cn-hangzhou    Zone H, Zone I, Zone J
    China (Shanghai)     cn-shanghai    Zone B, Zone L, Zone F, Zone G
    China (Beijing)      cn-beijing     Zone F, Zone G, Zone H, Zone K
    China (Shenzhen)     cn-shenzhen    Zone E, Zone F
    China (Hong Kong)    cn-hongkong    Zone B, Zone C

  • Other countries and regions

    Region Name            Region ID         Zones
    Germany (Frankfurt)    eu-central-1      Zone A, Zone B
    Indonesia (Jakarta)    ap-southeast-5    Zone A, Zone B
    Singapore              ap-southeast-1    Zone B, Zone C
    US (Virginia)          us-east-1         Zone A, Zone B
    US (Silicon Valley)    us-west-1         Zone A, Zone B
    Japan (Tokyo)          ap-northeast-1    Zone B, Zone C

Step 1: Add a network connection

  1. Go to the Network Connection page.

    1. Log on to the EMR console.

    2. In the navigation pane on the left, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the target workspace.

    4. On the EMR Serverless Spark page, click Network Connection in the navigation pane on the left.

  2. On the Network Connection page, click Create Network Connection.

  3. In the Create Network Connection dialog box, configure the parameters and click OK.

    • Name: Enter a name for the new connection.

    • VPC: Select the same VPC as your EMR cluster. If no VPC is available, click Create VPC to go to the VPC console and create one. For more information, see VPCs and vSwitches.

      Note: If Serverless Spark needs to access the Internet, make sure the network connection has public network access. For example, you can deploy an Internet NAT gateway in the VPC so that the Serverless Spark instance can access the Internet through the gateway. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.

    • vSwitch: Select a vSwitch in the same VPC as the EMR cluster. If no vSwitch is available in the current zone, go to the VPC console and create one. For more information, see Create and manage vSwitches.

      Important: You can only select vSwitches in specific zones. For more information, see Limitations.

    The network connection is added when its Status changes to Succeeded.

Step 2: Add a security group rule for the EMR cluster

  1. Obtain the CIDR block of the vSwitch specified in the network connection.

    You can log on to the VPC console and go to the vSwitches page to obtain the CIDR block of the vSwitch.

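    Alternatively, the following is a minimal sketch of retrieving the CIDR block with the Alibaba Cloud CLI, assuming the CLI is installed and configured. The region ID and vSwitch ID are placeholders that you must replace with your own values:

    # Query the vSwitch attributes. The CidrBlock field in the
    # response is the CIDR block of the vSwitch.
    aliyun vpc DescribeVSwitchAttributes --RegionId cn-hangzhou --VSwitchId vsw-xxxxxxxx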

  2. Add a security group rule.

    1. Log on to the EMR on ECS console.

    2. On the EMR on ECS page, click the ID of the target cluster.

    3. On the Basic Information tab, in the Security section, click the link next to Cluster Security Group.

    4. On the Security Group Details page, in the Rules section, click Add Rule. Configure the following parameters and click OK.

      • Protocol: Specify the allowed network communication protocol. The default is TCP.

        Note: If your network connection is used for Kerberos authentication, select the UDP protocol and open port 88. For more information about Kerberos authentication, see Enable Kerberos authentication.

      • Source: Enter the CIDR block of the vSwitch that you obtained in the previous step.

        Important: To prevent security risks from external attacks, do not set the Source to 0.0.0.0/0.

      • Destination (Current Instance): Specify the destination port to open. For example, 9083, the default port of the Hive Metastore service.
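
      Alternatively, the following is a minimal sketch of adding the same rule with the Alibaba Cloud CLI, assuming the CLI is installed and configured. The region ID, security group ID, and source CIDR block are placeholders that you must replace with your own values:

      # Allow TCP access to port 9083 (the default HMS port) from the vSwitch CIDR block.
      aliyun ecs AuthorizeSecurityGroup --RegionId cn-hangzhou --SecurityGroupId sg-xxxxxxxx --IpProtocol tcp --PortRange 9083/9083 --SourceCidrIp 192.168.0.0/24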

(Optional) Step 3: Connect to the Hive service and query table data

You can skip this step if you have already created and configured a Hive table.

  1. Use Secure Shell (SSH) to log on to the master node of the cluster. For more information, see Log on to a cluster.

  2. Run the following command to enter the Hive command line:

    hive
  3. Run the following command to create a table:

    CREATE TABLE my_table (id INT, name STRING);
  4. Run the following commands to insert data into the table:

    INSERT INTO my_table VALUES (1, 'John'); 
    INSERT INTO my_table VALUES (2, 'Jane');
  5. Run the following command to query the data:

    SELECT * FROM my_table;
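
    Based on the rows inserted above, the query should return output similar to the following:

    OK
    1   John
    2   Jane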

(Optional) Step 4: Prepare and upload the resource file

If you plan to use a JAR job, you must prepare a resource file. You can skip this step if you plan to use the SparkSQL job type.

  1. Create a new Maven project on your local machine.

    The DataFrameExample.java file contains the following content:

    package com.example;
    
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    public class DataFrameExample {
        public static void main(String[] args) {
            // Create a SparkSession.
            SparkSession spark = SparkSession.builder()
                    .appName("HMSQueryExample")
                    .enableHiveSupport()
                    .getOrCreate();
    
            // Execute the query.
            Dataset<Row> result = spark.sql("SELECT * FROM default.my_table");
    
            // Print the query results.
            result.show();
    
            // Close the SparkSession.
            spark.stop();
        }
    }

    The pom.xml file contains the following content:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>sparkDataFrame</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <maven.compiler.source>8</maven.compiler.source>
            <maven.compiler.target>8</maven.compiler.target>
            <spark.version>3.3.1</spark.version>
            <scala.binary.version>2.12</scala.binary.version>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
        </dependencies>
    </project>
  2. Run the mvn package command. After the project is compiled and packaged, the sparkDataFrame-1.0-SNAPSHOT.jar file is generated.
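
    Optionally, before you upload the file, you can confirm that the main class was packaged. The following check assumes Maven's default target/ output directory:

    jar tf target/sparkDataFrame-1.0-SNAPSHOT.jar | grep DataFrameExample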

  3. On the EMR Serverless Spark page for the target workspace, click Artifacts in the navigation pane on the left.

  4. On the Artifacts page, click Upload File.

  5. Upload the sparkDataFrame-1.0-SNAPSHOT.jar file.

Step 5: Create and run a job

JAR job

  1. On the EMR Serverless Spark page, click Development in the navigation pane on the left.

  2. Click New.

  3. Enter a name, select Application(Batch) > JAR as the job type, and then click OK.

  4. In the new job development tab, configure the following parameters, leave the other parameters at their default settings, and then click Run.

    • Main JAR Resource: Select the resource file you uploaded in the previous step. For example, sparkDataFrame-1.0-SNAPSHOT.jar.

    • Main Class: The main class specified when you submit the Spark job. This example uses com.example.DataFrameExample.

    • Network Connection: Select the name of the network connection you added in Step 1.

    • Spark Configuration: Configure the following parameters:

      spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
      spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory

      In these parameters, *.*.*.* is the private IP address of the HMS service. Replace it with the actual IP address. This example uses the private IP address of the master node of the EMR cluster. You can find this IP address on the Nodes page of the EMR cluster by expanding the emr-master node group.

  5. After the job runs, go to the Execution Records section at the bottom of the page and click Logs in the Actions column.

  6. On the Log Exploration tab, you can view the log.
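
    If the job connects to the HMS service successfully, the log should contain the output of result.show(), similar to the following:

    +---+----+
    | id|name|
    +---+----+
    |  1|John|
    |  2|Jane|
    +---+----+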

SparkSQL job

  1. Create and start an SQL session. For more information, see Manage SQL sessions.

    • Network Connection: Select the network connection that you added in Step 1.

    • Spark Configuration: Configure the following parameters.

      spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
      spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory

      In this configuration, *.*.*.* represents the private IP address of the HMS service. Replace it with the actual IP address. This example uses the private IP address of the master node of the EMR cluster. You can obtain this IP address on the Nodes page of the EMR cluster by expanding the emr-master node group.

  2. On the EMR Serverless Spark page, click Development in the navigation pane on the left.

  3. On the Development tab, click New.

  4. In the New dialog box, enter a name, such as users_task, leave the type as the default SparkSQL, and click OK.

  5. In the new job development tab, select the catalog, database, and the running SQL session instance. Then, enter the following command and click Run.

    SELECT * FROM default.my_table;
    Note

    When you deploy SQL code based on an external metastore to a workflow, ensure that your SQL statement specifies the table name in the db.table_name format. You must also select a default database from the Catalog option in the upper-right corner. The format must be catalog_id.default.

    The returned information is displayed in the Execution Results section at the bottom of the page.

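    Based on the data inserted in Step 3, the Execution Results section should contain rows similar to the following:

    id    name
    1     John
    2     Jane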