
E-MapReduce:Manage Spark Thrift Server sessions

Last Updated: Mar 26, 2026

Spark Thrift Server is an Apache Spark service that supports SQL query execution over JDBC or ODBC connections, making it straightforward to integrate EMR Serverless Spark with Business Intelligence (BI) tools, data visualization tools, and custom applications. This topic describes how to create a Spark Thrift Server session, generate an access token, and connect from Python, Java, Beeline, Apache Superset, Hue, DataGrip, and Redash.

Prerequisites

Before you begin, ensure that you have:

Create a Spark Thrift Server session

After you create a Spark Thrift Server session, you can select it when creating a Spark SQL task.

  1. Log on to the EMR console.

  2. In the left-side navigation pane, choose EMR Serverless > Spark.

  3. On the Spark page, click the workspace name.

  4. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.

  5. On the Session Manager page, click the Spark Thrift Server Sessions tab.

  6. Click Create Spark Thrift Server Session.

  7. Configure the parameters and click Create.

    Parameter descriptions:

    • Name: The session name. Must be 1–64 characters and can contain letters, digits, hyphens (-), underscores (_), and spaces.
    • Deployment Queue: The queue used to deploy the session. Select a development queue or a queue shared by development and production environments. For more information, see Resource Queue Management.
    • Engine version: The Spark engine version for this session. For more information, see Engine version introduction.
    • Use Fusion Acceleration: Enables the Fusion engine to accelerate Spark workloads and reduce task costs. For billing details, see Billing and Fusion engine.
    • Auto Stop: Enabled by default. The session stops automatically after 45 minutes of inactivity.
    • Network Connectivity: The network connection used to access data sources or external services in a virtual private cloud (VPC). For setup instructions, see Configure network connectivity between EMR Serverless Spark and a data source across VPCs.
    • Spark Thrift Server Port: Port 443 for public endpoint access; port 80 for internal same-region endpoint access.
    • Access Credential: Only the Token method is supported.
    • spark.driver.cores: Number of CPU cores for the driver process. Default: 1.
    • spark.driver.memory: Memory available to the driver process. Default: 3.5 GB.
    • spark.executor.cores: Number of CPU cores per executor. Default: 1.
    • spark.executor.memory: Memory available to each executor. Default: 3.5 GB.
    • spark.executor.instances: Number of executors. Default: 2.
    • Dynamic Resource Allocation: Disabled by default. When enabled, configure Minimum Number of Executors (default: 2) and Maximum Number of Executors (default: 10 if spark.executor.instances is not set).
    • More Memory Configurations: Additional memory settings:
      • spark.driver.memoryOverhead: Non-heap memory per driver. If blank, Spark uses max(384 MB, 10% of spark.driver.memory).
      • spark.executor.memoryOverhead: Non-heap memory per executor. If blank, Spark uses max(384 MB, 10% of spark.executor.memory).
      • spark.memory.offHeap.size: Off-heap memory for the application (default: 1 GB). Effective only when spark.memory.offHeap.enabled is true. When the Fusion engine is used, spark.memory.offHeap.enabled defaults to true and spark.memory.offHeap.size defaults to 1 GB.
    • Spark Configuration: Additional Spark configuration key-value pairs, separated by spaces. For example: spark.sql.catalog.paimon.metastore dlf.
  8. On the Spark Thrift Server Sessions tab, click the session name, then open the Overview tab and copy the endpoint. Choose the endpoint that matches your network environment:

    • Public Endpoint — use when connecting from a local machine, external network, or cross-cloud environment. Internet traffic charges may apply. Apply appropriate security measures.

    • Internal Same-region Endpoint — use when connecting from an Alibaba Cloud ECS instance in the same region. Internal access is free and more secure, but is limited to the same-region Alibaba Cloud network.
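The memoryOverhead rule quoted under More Memory Configurations (the larger of 384 MB and 10% of the process memory) can be sketched as a quick calculation. The function name below is illustrative only, not part of any product API:

```python
# Default memoryOverhead when the field is left blank, per the rule above:
# the larger of 384 MB and 10% of the corresponding process memory.
def memory_overhead_mb(process_memory_mb: float) -> float:
    return max(384.0, 0.10 * process_memory_mb)

# With the default 3.5 GB (3584 MB): 10% is 358.4 MB, below the 384 MB floor.
print(memory_overhead_mb(3584))   # 384.0
# With a 10 GB (10240 MB) process, the 10% share wins.
print(memory_overhead_mb(10240))  # 1024.0
```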

Create a token

  1. On the Spark Thrift Server Sessions tab, click the session name.

  2. Click the Token Management tab.

  3. Click Create Token.

  4. Configure the parameters and click OK.

    Parameter descriptions:

    • Name: A name for the token.
    • Expiration Time: Number of days until the token expires. Must be 1 or more. Expiration is enabled by default, and tokens expire after 365 days.
  5. Copy the token immediately after creation.

    Important

    Token information cannot be retrieved after you leave this page. If a token expires or is lost, create a new one or reset the existing token.
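The expiration setting is plain day arithmetic; a minimal sketch (the helper name is mine, not part of any SDK):

```python
from datetime import datetime, timedelta

# Compute the expiry date for a token created at `created`, given the
# configured Expiration Time in days (minimum 1, default 365).
def token_expiry(created: datetime, days: int = 365) -> datetime:
    if days < 1:
        raise ValueError("Expiration Time must be at least 1 day")
    return created + timedelta(days=days)

print(token_expiry(datetime(2026, 3, 26)))  # 2027-03-26 00:00:00
```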

Connect to Spark Thrift Server

All connection methods use the same four values. Retrieve these from the console before connecting:

Placeholder values:

• <endpoint>: On the Overview tab of the session, Endpoint(Public) or Endpoint(Internal).
• <port>: 443 for the public endpoint; 80 for the internal same-region endpoint.
• <username>: The token name from the Token Management tab.
• <token>: The token value copied from the Token Management tab.

Internal same-region endpoint access is limited to resources within the same VPC.
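All of the JDBC-based clients below share one URL shape. The substitution can be sketched in Python; the helper below is hypothetical, not part of any SDK:

```python
# Assemble the Hive JDBC URL used throughout this topic from the
# placeholder values. `public=True` selects port 443, otherwise port 80.
def jdbc_url(endpoint: str, token: str, public: bool = True) -> str:
    port = 443 if public else 80
    return (f"jdbc:hive2://{endpoint}:{port}/;"
            f"transportMode=http;httpPath=cliservice/token/{token}")

print(jdbc_url("example.aliyuncs.com", "my-token"))
# jdbc:hive2://example.aliyuncs.com:443/;transportMode=http;httpPath=cliservice/token/my-token
```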

Connect using Python

  1. Install the required packages:

    pip install pyhive thrift
  2. Connect to Spark Thrift Server and run a query. All examples call hive.connect() with the endpoint, port, scheme, and token credentials. Use scheme='https' and port 443 for public access, or scheme='http' and port 80 for internal access.

    Connect by using a public endpoint

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> with your actual values.
        cursor = hive.connect(
            '<endpoint>',
            port=443,
            scheme='https',
            username='<username>',
            password='<token>'
        ).cursor()
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()

    Connect by using an internal same-region endpoint

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> with your actual values.
        cursor = hive.connect(
            '<endpoint>',
            port=80,
            scheme='http',
            username='<username>',
            password='<token>'
        ).cursor()
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()
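The two examples above differ only in port and scheme. A small wrapper (my own sketch, assuming the pyhive API shown above) keeps the pair consistent:

```python
# Build the keyword arguments for hive.connect() so port and scheme always
# match the access type: public HTTPS on 443, internal HTTP on 80.
def connect_kwargs(username: str, token: str, public: bool = True) -> dict:
    return {
        'port': 443 if public else 80,
        'scheme': 'https' if public else 'http',
        'username': username,
        'password': token,
    }

# Usage (requires pyhive and network access):
#   from pyhive import hive
#   cursor = hive.connect('<endpoint>', **connect_kwargs('<username>', '<token>')).cursor()
```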

Connect using Java

  1. Add the following dependencies to your pom.xml file:

    The built-in Hive version in EMR Serverless Spark is 2.x. Use hive-jdbc 2.x to ensure compatibility.
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
  2. Connect and run a query. Both examples use the Hive JDBC URL format with transportMode=http and the token embedded in the HTTP path. The only difference is the port and SSL setting.

    Connect by using a public endpoint

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import org.apache.hive.jdbc.HiveStatement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            HiveStatement stmt = (HiveStatement) conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println(")]");
            }
    
            conn.close();
        }
    }

    Connect by using an internal same-region endpoint

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import org.apache.hive.jdbc.HiveStatement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            HiveStatement stmt = (HiveStatement) conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println(")]");
            }
    
            conn.close();
        }
    }

Connect using Spark Beeline

  • Navigate to the Spark bin directory first, then connect. The path below is an example from an EMR on ECS cluster — adjust it to your actual Spark installation path. To find the path, run env | grep SPARK_HOME.

    Connect by using a public endpoint

    cd /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3/bin/
    
    ./beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>"

    Connect by using an internal same-region endpoint

    cd /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3/bin/
    
    ./beeline -u "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>"

  • The spark-beeline client is available directly on EMR on ECS clusters:

    Connect by using a public endpoint

    spark-beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>"

    Connect by using an internal same-region endpoint

    spark-beeline -u "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>"

If Hive Beeline returns the following error, the Hive Beeline version is incompatible with Spark Thrift Server. Use Hive Beeline 2.x.

24/08/22 15:09:11 [main]: ERROR jdbc.HiveConnection: Error opening session
org.apache.thrift.transport.TTransportException: HTTP Response code: 404

Connect using Apache Superset

Apache Superset is a data exploration and visualization platform that supports a wide range of chart types. For setup instructions, see the Superset documentation.

  1. Install the required package:

    pip install thrift==0.20.0
  2. Start Superset and open the Superset interface.

  3. Click DATABASE in the upper-right corner to open the Connect a database page.

  4. Select Apache Spark SQL.


  5. Enter the connection string for your network environment:

    Connect by using a public endpoint

    hive+https://<username>:<token>@<endpoint>:443/<db_name>

    Connect by using an internal same-region endpoint

    hive+http://<username>:<token>@<endpoint>:80/<db_name>
  6. Click FINISH to confirm the connection.
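Tokens can contain characters that are not URI-safe. When assembling the connection string above by hand, percent-encode the credential parts; here is a sketch using only the Python standard library (the helper name is mine):

```python
from urllib.parse import quote

# Build the Superset connection string, encoding username and token so
# characters such as '/' or '@' do not break URI parsing.
def superset_uri(username: str, token: str, endpoint: str,
                 db_name: str, public: bool = True) -> str:
    scheme, port = ('hive+https', 443) if public else ('hive+http', 80)
    return (f"{scheme}://{quote(username, safe='')}:{quote(token, safe='')}"
            f"@{endpoint}:{port}/{db_name}")

print(superset_uri("user", "tok/en", "example.aliyuncs.com", "default"))
# hive+https://user:tok%2Fen@example.aliyuncs.com:443/default
```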

Connect using Hue

Hue is an open-source web interface for the Hadoop ecosystem.

  1. Install the required package:

    pip install thrift==0.20.0
  2. Add a Spark SQL connection to the Hue configuration file (typically /etc/hue/hue.conf):

    Connect by using a public endpoint

    [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'

    Connect by using an internal same-region endpoint

    [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+http://<username>:<token>@<endpoint>:80/"}'
  3. Restart Hue to apply the changes:

    sudo service hue restart
  4. Open the Hue interface and select the Spark SQL option. A successful configuration lets you connect to Spark Thrift Server and run SQL queries.


Connect using DataGrip

DataGrip is a database IDE for querying, creating, and managing databases. The steps below use DataGrip 2025.1.2.

  1. Install DataGrip.

  2. Open DataGrip.

  3. Create a project.

    1. Select File > New > Project.

    2. Enter a project name (for example, Spark) and click OK.

  4. In the Database Explorer menu bar, click the create connection icon and select Data Source > Other > Apache Spark.


  5. In the Data Sources and Drivers dialog, configure the following:

    • General > Name: A custom connection name, for example, spark_thrift_server.
    • Authentication > Authentication method: Select No auth for development. In production, select User & Password to restrict SQL task submission to authorized users.
    • Driver > Driver version: Click Apache Spark, then click Go to Driver and confirm that the driver version is ver. 1.2.2. This version is required for compatibility with the Spark 3.x engine.
    • URL > Connection URL: Enter the JDBC URL for your network environment. Public endpoint: jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>. Internal same-region endpoint: jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>.
    • Options > Run keep-alive query: (Optional) Prevents automatic disconnection due to inactivity.


  6. Click Test Connection to confirm that the data source is configured successfully.


  7. Click OK to complete the setup.

  8. After connecting, right-click a table in the Database Explorer, select New > Query Console, and write SQL in the editor to query table data. For more information, see DataGrip Help.


Connect using Redash

Redash is an open-source BI tool with web-based query and data visualization capabilities. For installation instructions, see Setting up a Redash Instance.

  1. Install the required package:

    pip install thrift==0.20.0
  2. Log on to Redash.

  3. In the left navigation pane, click Settings. On the Data Sources tab, click +New Data Source.

  4. Configure the following parameters and click Create.

    • Type Selection: Select Hive (HTTP).
    • Name: A custom name for the data source.
    • Host: The Spark Thrift Server endpoint from the Overview tab (public or internal).
    • Port: 443 for public endpoint access; 80 for internal same-region endpoint access.
    • HTTP Path: /cliservice
    • Username: Any username, for example, root.
    • Password: The token value.
    • HTTP Scheme: https for public endpoint access; http for internal same-region endpoint access.


  5. Click Create > New Query to open the query editor and write SQL statements.


View jobs by session

On the Sessions page, click the session name, then open the Execution Records tab.

The tab shows each job's run ID, start time, and a link to the Spark UI.
