
E-MapReduce: Manage Spark Thrift Server sessions

Last Updated: Aug 02, 2025

Spark Thrift Server is a service provided by Apache Spark that supports SQL query execution through JDBC or ODBC connections, making it convenient to integrate Spark environments with existing Business Intelligence (BI) tools, data visualization tools, and other data analysis tools. This topic describes how to create and connect to a Spark Thrift Server.

Prerequisites

A workspace is created. For more information, see Manage workspaces.

Create a Spark Thrift Server session

After you create a Spark Thrift Server session, you can select it when you create a Spark SQL task.

  1. Go to the Sessions page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the workspace that you want to manage.

    4. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.

  2. On the Sessions page, click the Spark Thrift Server Sessions tab.

  3. Click Create Spark Thrift Server Session.

  4. On the Create Spark Thrift Server Session page, configure the parameters and click Create.

    Parameter

    Description

    Name

    The name of the new Spark Thrift Server.

    The name must be 1 to 64 characters in length and can contain letters, digits, hyphens (-), underscores (_), and spaces.

    Deployment Queue

    Select an appropriate development queue to deploy the session. You can select only a development queue or a queue that is shared by development and production environments.

    For more information about queues, see Resource Queue Management.

    Engine Version

    The engine version used by the current session. For more information about engine versions, see Engine version introduction.

    Use Fusion Acceleration

    Fusion can accelerate the running of Spark workloads and reduce the total cost of tasks. For billing information, see Billing. For more information about the Fusion engine, see Fusion engine.

    Auto Stop

    This feature is enabled by default. The Spark Thrift Server session automatically stops after 45 minutes of inactivity.

    Network Connectivity

    The network connection that is used to access the data sources or external services in a virtual private cloud (VPC). For information about how to create a network connection, see Configure network connectivity between EMR Serverless Spark and a data source across VPCs.

    Spark Thrift Server Port

    The port number is 443 when you access the server by using a public endpoint, and 80 when you access the server by using an internal same-region endpoint.

    Access Credential

    Only the Token method is supported.

    spark.driver.cores

    The number of CPU cores used by the Driver process in a Spark application. The default value is 1 CPU.

    spark.driver.memory

    The amount of memory that the Driver process in a Spark application can use. The default value is 3.5 GB.

    spark.executor.cores

    The number of CPU cores that each Executor process can use. The default value is 1 CPU.

    spark.executor.memory

    The amount of memory that each Executor process can use. The default value is 3.5 GB.

    spark.executor.instances

    The number of executors allocated by Spark. The default value is 2.

    Dynamic Resource Allocation

    By default, this feature is disabled. After you enable this feature, you must configure the following parameters:

    • Minimum Number of Executors: Default value: 2.

    • Maximum Number of Executors: Default value: 10 if you do not configure the spark.executor.instances parameter.

    More Memory Configurations

    • spark.driver.memoryOverhead: the size of non-heap memory that is available to each driver. If you leave this parameter empty, Spark automatically assigns a value to this parameter based on the following formula: max(384 MB, 10% × spark.driver.memory). For example, with the default driver memory of 3.5 GB, 10% is approximately 358 MB, so the overhead is 384 MB.

    • spark.executor.memoryOverhead: the size of non-heap memory that is available to each executor. If you leave this parameter empty, Spark automatically assigns a value to this parameter based on the following formula: max(384 MB, 10% × spark.executor.memory).

    • spark.memory.offHeap.size: the size of off-heap memory that is available to the Spark application. Default value: 1 GB.

      This parameter is valid only if you set the spark.memory.offHeap.enabled parameter to true. By default, if you use the Fusion engine, the spark.memory.offHeap.enabled parameter is set to true and the spark.memory.offHeap.size parameter is set to 1 GB.

    Spark Configuration

    Enter Spark configuration information. By default, the key and value of a parameter are separated by a space, for example, spark.sql.catalog.paimon.metastore dlf. A short illustration of this format follows this procedure.

  5. Obtain the Endpoint information.

    1. On the Spark Thrift Server Sessions tab, click the name of the new Spark Thrift Server.

    2. On the Overview tab, copy the Endpoint information.

      Based on your network environment, you can choose one of the following Endpoints:

      • Public Endpoint: This is suitable for scenarios where you need to access EMR Serverless Spark over the Internet, such as from a local development machine, external network, or cross-cloud environment. This method may incur traffic fees. Make sure that you take necessary security measures.

      • Internal Same-region Endpoint: This is suitable for scenarios where an Alibaba Cloud ECS instance in the same region accesses EMR Serverless Spark over an internal network. Internal network access is free and more secure, but is limited to Alibaba Cloud internal network environments in the same region.
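
The Spark Configuration field accepts standard Spark properties. The following lines are only an illustration of the format described above; the property names are taken from this topic, and the values are placeholders that you should adjust for your workloads:

    spark.sql.catalog.paimon.metastore dlf
    spark.memory.offHeap.enabled true
    spark.memory.offHeap.size 1g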

Create a Token

  1. On the Spark Thrift Server Sessions tab, click the name of the new Spark Thrift Server session.

  2. Click the Token Management tab.

  3. Click Create Token.

  4. In the Create Token dialog box, configure the parameters and click OK.

    Parameter

    Description

    Name

    The name of the new token.

    Expiration Time

    The expiration time of the token. The value must be at least 1 day. Expiration is enabled by default, and the token expires after 365 days.

  5. Copy the token information.

    Important

    After a token is created, copy the token information immediately. You cannot view it again later. If your token expires or is lost, you can create a new token or reset the existing one.

Connect to Spark Thrift Server

When you connect to Spark Thrift Server, replace the following information based on your actual situation:

  • <endpoint>: The Endpoint(Public) or Endpoint(Internal) information that you obtained on the Overview tab.

    If you use an internal same-region endpoint, access to Spark Thrift Server is limited to resources within the same VPC.

  • <port>: The port number. The port number is 443 when you access the server by using a public endpoint, and 80 when you access the server by using an internal same-region endpoint.

  • <username>: The name of the token that you created on the Token Management tab.

  • <token>: The token information that you copied on the Token Management tab.

Connect to Spark Thrift Server by using Python

  1. Run the following command to install the PyHive and Thrift packages:

    pip install pyhive thrift
  2. Write a Python script to connect to Spark Thrift Server.

    The following is an example Python script that shows how to connect to Spark Thrift Server and display a list of databases. Based on your network environment (public or internal), choose the appropriate connection method. A small reusable helper that combines both variants follows these examples.

    Connect by using a public endpoint

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> with your actual information.
        cursor = hive.connect('<endpoint>', port=443, scheme='https', username='<username>', password='<token>').cursor()
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()

    Connect by using an internal same-region endpoint

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> with your actual information.
        cursor = hive.connect('<endpoint>', port=80, scheme='http', username='<username>', password='<token>').cursor()
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()
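
The connection object follows the Python DB-API, so the two variants above can be wrapped in a small helper. The following sketch is only an illustration and is not part of the product documentation; it assumes the same PyHive setup as the preceding examples, and the run_query helper and its flag are hypothetical names. Replace the placeholders with your actual information:

    from contextlib import closing

    from pyhive import hive

    def run_query(sql, use_public_endpoint=True):
        """Run a single statement against Spark Thrift Server and return all rows."""
        # Public endpoints use HTTPS on port 443; internal same-region endpoints use HTTP on port 80.
        port, scheme = (443, 'https') if use_public_endpoint else (80, 'http')
        # Replace <endpoint>, <username>, and <token> with your actual information.
        with closing(hive.connect('<endpoint>', port=port, scheme=scheme,
                                  username='<username>', password='<token>')) as conn:
            with closing(conn.cursor()) as cursor:
                cursor.execute(sql)
                return cursor.fetchall()

    if __name__ == '__main__':
        print(run_query('show databases'))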

Connect to Spark Thrift Server by using Java

  1. Add the following Maven dependencies to your pom.xml file:

    <dependencies>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>3.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-jdbc</artifactId>
                <version>2.1.0</version>
            </dependency>
        </dependencies>
    
    Note

    The built-in Hive version in Serverless Spark is 2.x. Therefore, only hive-jdbc 2.x versions are supported.

  2. Write Java code to connect to Spark Thrift Server.

    The following is a sample Java code that connects to Spark Thrift Server and queries the database list.

    Connect by using a public endpoint

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import org.apache.hive.jdbc.HiveStatement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            HiveStatement stmt = (HiveStatement) conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println(")]");
            }
            
            conn.close();
        }
    }

    Connect by using an internal same-region endpoint

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import org.apache.hive.jdbc.HiveStatement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            HiveStatement stmt = (HiveStatement) conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println(")]");
            }
            
            conn.close();
        }
    }

Connect to Spark Thrift Server by using Spark Beeline

  • If you are using a self-managed cluster, go to the bin directory of your Spark installation first, and then use Beeline to connect to Spark Thrift Server.

    Connect by using a public endpoint

    cd /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3/bin/
    
    ./beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>"

    Connect by using an internal same-region endpoint

    cd /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3/bin/
    
    ./beeline -u "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>"
    Note

    The code uses /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3 as an example of the Spark installation path in an EMR on ECS cluster. You need to adjust the path based on the actual Spark installation path on your client. If you are not sure about the Spark installation path, you can use the env | grep SPARK_HOME command to find it.

  • If you are using an EMR on ECS cluster, you can directly use the Spark Beeline client to connect to Spark Thrift Server.

    Connect by using a public endpoint

    spark-beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>"

    Connect by using an internal same-region endpoint

    spark-beeline -u "jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>"

If the following error occurs when you use Hive Beeline to connect to the Serverless Spark Thrift Server, the Hive Beeline version is most likely incompatible with Spark Thrift Server. We recommend that you use Hive Beeline 2.x.

24/08/22 15:09:11 [main]: ERROR jdbc.HiveConnection: Error opening session
org.apache.thrift.transport.TTransportException: HTTP Response code: 404

Connect to Spark Thrift Server by configuring Apache Superset

Apache Superset is a modern data exploration and visualization platform with a rich set of chart forms, from simple line charts to highly detailed geospatial charts. For more information about Superset, see Superset.

  1. Install dependencies.

    Make sure that you have installed the thrift package version 0.20.0. If not, you can install it by using the following command:

    pip install thrift==0.20.0
  2. Start Superset and go to the Superset interface.

    For more information about how to start Superset, see Superset documentation.

  3. Click DATABASE in the upper-right corner of the page to go to the Connect a database page.

  4. On the Connect a database page, select Apache Spark SQL.


  5. Enter the connection string and configure the related data source parameters. To check the connection string outside Superset, see the sketch that follows this procedure.

    Connect by using a public endpoint

    hive+https://<username>:<token>@<endpoint>:443/<db_name>

    Connect by using an internal same-region endpoint

    hive+http://<username>:<token>@<endpoint>:80/<db_name>
  6. Click FINISH to complete the configuration and verify that the connection is successful.
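
Superset connects to Spark Thrift Server through SQLAlchemy, so you can check the same connection string from Python before you enter it in the Superset UI. The following is a minimal sketch rather than part of the Superset configuration; it assumes that SQLAlchemy, pandas, and PyHive's SQLAlchemy dialect (which registers the hive+https and hive+http URL schemes) are installed. Replace the placeholders with your actual information:

    import pandas as pd
    from sqlalchemy import create_engine

    # The same URL format that you enter in Superset. Use hive+http and port 80
    # if you connect over an internal same-region endpoint.
    engine = create_engine('hive+https://<username>:<token>@<endpoint>:443/<db_name>')

    # Run a simple statement to confirm that the endpoint, port, and token are valid.
    with engine.connect() as conn:
        print(pd.read_sql('show databases', conn))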

Connect to Spark Thrift Server by configuring Hue

Hue is a popular open-source web interface for interacting with the Hadoop ecosystem. For more information about Hue, see Hue official documentation.

  1. Install dependencies.

    Make sure that you have installed the thrift package version 0.20.0. If not, you can install it by using the following command:

    pip install thrift==0.20.0
  2. Add a Spark SQL connection string to the Hue configuration file.

    Find the Hue configuration file (usually located at /etc/hue/hue.conf) and add the following content to the file:

    Connect by using a public endpoint

    [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'

    Connect by using an internal same-region endpoint

    [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+http://<username>:<token>@<endpoint>:80/"}'
  3. Restart Hue.

    After you modify the configuration, you need to run the following command to restart the Hue service for the changes to take effect:

    sudo service hue restart
  4. Verify the connection.

    After Hue is successfully restarted, access the Hue interface and find the Spark SQL option. If the configuration is correct, you should be able to successfully connect to Spark Thrift Server and execute SQL queries.


Connect to Spark Thrift Server by using DataGrip

DataGrip is a database management environment for developers, designed to facilitate the querying, creation, and management of databases. Databases can run locally, on a server, or in the cloud. For more information about DataGrip, see DataGrip.

  1. Install DataGrip. For more information, see Install DataGrip.

    The DataGrip version used in this example is 2025.1.2.

  2. Open the DataGrip client and go to the DataGrip interface.

  3. Create a project.

    1. Click the main menu icon and choose File > New > Project.


    2. In the New Project dialog box, enter a project name, such as Spark, and click OK.

  4. In the Database Explorer menu bar, click the icon for creating a connection. Select Data Source > Other > Apache Spark.


  5. In the Data Sources and Drivers dialog box, configure the following parameters:


    Tab

    Parameter

    Description

    General

    Name

    The custom connection name. For example, spark_thrift_server.

    Authentication

    The authentication method. In this example, No auth is selected.

    In a production environment, select User & Password to ensure that only authorized users can submit SQL tasks, which improves system security.

    Driver

    Click Apache Spark, and then click Go to Driver to confirm that the Driver version is ver. 1.2.2.

    Note

    Because the current Serverless Spark engine version is 3.x, the Driver version must be 1.2.2 to ensure system stability and feature compatibility.


    URL

    The URL for connecting to Spark Thrift Server. Based on your network environment (public or internal), you can choose the appropriate connection method.

    • Connect by using a public endpoint

      jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>
    • Connect by using an internal same-region endpoint

      jdbc:hive2://<endpoint>:80/;transportMode=http;httpPath=cliservice/token/<token>

    Options

    Run keep-alive query

    Optional. Select this option to prevent the connection from being automatically closed due to timeout.

  6. Click Test Connection to confirm that the data source is configured successfully.


  7. Click OK to complete the configuration.

  8. Use DataGrip to manage Spark Thrift Server.

    After DataGrip successfully connects to Spark Thrift Server, you can perform data development. For more information, see DataGrip Help.

    For example, under the created connection, you can right-click a target table and select New > Query Console. Then, you can write and run SQL scripts in the SQL editor on the right to view table data.


Connect Redash to Spark Thrift Server

Redash is an open-source BI tool that provides web-based database querying and data visualization capabilities.

  1. Install Redash. For installation instructions, see Setting up a Redash Instance.

  2. Install dependencies. Make sure the thrift package of version 0.20.0 is installed. If it is not installed, run the following command:

    pip install thrift==0.20.0
  3. Log on to Redash.

  4. In the left navigation pane, click Settings. On the Data Sources tab, click +New Data Source.

  5. In the dialog box, configure the following parameters, and then click Create.


Parameter

Description

Type Selection

Data source type. Select Hive (HTTP).

Configuration

Name

Enter a custom name for the data source.

Host

Enter the endpoint of your Spark Thrift Server. You can find the public or internal endpoint on the Overview tab.

Port

Use 443 when you access the server by using a public endpoint, or 80 when you use an internal same-region endpoint.

HTTP Path

Set to /cliservice.

Username

Enter any username such as root.

Password

Enter the Token value you created.

HTTP Scheme

Enter https if you use a public endpoint, or http if you use an internal same-region endpoint.

  6. At the top of the page, choose Create > New Query. You can then write SQL statements in the editor.


View the jobs run by using a specific session

You can view the jobs that are run by using a specific session on the Sessions page. Procedure:

  1. On the Sessions page, click the name of the desired session.

  2. On the page that appears, click the Execution Records tab.

    On the Execution Records tab, you can view the details of a job, such as the run ID and start time, and click the link in the Spark UI column to access the Spark UI.
