Set Up Kyuubi Gateway for JDBC/ODBC Access on EMR Serverless Spark - E-MapReduce

Kyuubi Gateway provides Java Database Connectivity (JDBC) and ODBC interfaces to connect EMR Serverless Spark with SQL query and business intelligence (BI) tools such as Tableau and Power BI. It supports multi-tenancy and resource isolation for enterprise workloads.

Create a Kyuubi Gateway

Prerequisites

Before you begin, make sure you have:

An EMR Serverless Spark workspace
A queue configured in the workspace
(Optional) A virtual private cloud (VPC) network connection, if the gateway needs to reach data sources or external services in another VPC. See Network connectivity between EMR Serverless Spark and other VPCs.

Steps

Go to the Gateway page.
1. Log on to the EMR console.
2. In the left navigation pane, choose EMR Serverless > Spark.
3. Click the name of the workspace you want to manage.
4. In the left navigation pane of the workspace page, click O&M Center > Gateway.
On the Kyuubi Gateway page, click Create Kyuubi Gateway.

Configure the following parameters, then click Create

A large number of Spark configuration items reduces instantaneous concurrency for Spark job submission.

Parameter	Description
Name	The gateway name. Use only lowercase letters, digits, and hyphens (-). The name must start and end with a letter or digit.
Kyuubi Gateway Resource	The compute resource allocated to the gateway. Default: `2 CPU, 8 GB`. See the table below for specs and recommended maximum concurrent connections.
Kyuubi Version	The Kyuubi version for the gateway. If you use DLF (formerly DLF 2.5) as the Data Catalog, set this to 1.9.2-0.0.1 or later.
Engine Version	The Spark engine version. For details, see Engine versions.
Associated Queue	The queue where the gateway is deployed. Spark jobs submitted through the gateway run under the gateway creator's identity.
Authentication Method	Token-based authentication only. After creating the gateway, generate a token for it. See Manage tokens.
Service High Availability	When enabled, deploys three or more Kyuubi servers for high availability. Configure Number Of Kyuubi Servers and Zookeeper cluster address (comma-separated endpoints, for example, `zk1:2181,zk2:2181,zk3:2181`). Make sure the network is reachable.
Network Connection	The VPC network connection for accessing data sources or external services. See Network connectivity between EMR Serverless Spark and other VPCs.
Endpoint(Public)	Disabled by default. When enabled, the gateway is accessible via a public endpoint. Otherwise, access is through an internal endpoint.
Kyuubi Configuration	Kyuubi configuration key-value pairs, space-separated (for example, `kyuubi.engine.pool.size 1`). See the supported configurations below.
Spark Configuration	Spark configuration key-value pairs, space-separated (for example, `spark.sql.catalog.paimon.metastore dlf`). All parameters are supported except `spark.kubernetes.*`.

Resource specs and recommended concurrent connections

Spec	Recommended max concurrent connections
`1 CPU, 4 GB`	10
`2 CPU, 8 GB` (default)	20
`4 CPU, 16 GB`	30
`8 CPU, 32 GB`	45
`16 CPU, 64 GB`	85
`32 CPU, 128 GB`	135

Supported Kyuubi configurations

kyuubi.engine.pool.size
kyuubi.engine.pool.size.threshold
kyuubi.engine.share.level
kyuubi.engine.single.spark.session
kyuubi.session.engine.idle.timeout
kyuubi.session.engine.initialize.timeout
kyuubi.engine.security.token.max.lifetime
kyuubi.session.engine.check.interval
kyuubi.session.idle.timeout
kyuubi.session.engine.request.timeout
kyuubi.session.engine.login.timeout
kyuubi.backend.engine.exec.pool.shutdown.timeout
kyuubi.backend.server.exec.pool.shutdown.timeout
kyuubi.backend.server.exec.pool.keepalive.time
kyuubi.frontend.thrift.login.timeout
kyuubi.operation.status.polling.timeout
kyuubi.engine.pool.selectPolicy
kyuubi.authentication
kyuubi.kinit.principal
kyuubi.kinit.keytab
kyuubi.authentication.ldap.*
kyuubi.hadoop.proxyuser.hive.hosts
kyuubi.hadoop.proxyuser.hive.groups
kyuubi.hadoop.proxyuser.kyuubi.hosts
kyuubi.hadoop.proxyuser.kyuubi.groups
kyuubi.ha.*

On the Kyuubi Gateway page, find the gateway and click Start in the Actions column.

Manage tokens

Tokens are required for authentication when connecting to a Kyuubi Gateway.

On the Kyuubi Gateway page, find the target gateway and click Tokens in the Actions column.
Click Create Token.

Configure the following parameters, then click OK.

Parameter	Description
Name	The token name.
Expired At	The token expiration time in days (minimum: 1). Enabled by default; expires after 365 days.
Assigned To	The Resource Access Management (RAM) user or RAM role to which the token is assigned. Required when using DLF (formerly DLF 2.5) as the default on the Catalog tab. Select from the drop-down list of users and roles added in Access control. The assigned identity is used to access DLF when submitting Spark jobs through the gateway. Make sure the RAM user or RAM role has DLF access permissions. See Add an authorization.

Copy the token immediately after creation.
Important
The token value is only shown once and cannot be retrieved later. If the token expires or is lost, create a new token or reset the existing one.

Connect to a Kyuubi Gateway

Gather the following values before connecting. All connection methods use the same placeholders.

Placeholder	Where to find it
`<endpoint>`	Overview tab of the gateway
`<port>`	`443` for public endpoints; `80` for internal same-region endpoints
`<token>`	Token Management page — copy immediately after creation
`<tokenname>`	Token Management page
`<UserName/RoleName>`	Access control — the RAM user or RAM role added to the gateway

If your gateway uses DLF (formerly DLF 2.5) as the default Data Catalog, you must include the user=<UserName/RoleName> parameter in connection strings. Each method below notes where to add it.

Connect using Beeline

Before connecting, make sure your Beeline version is compatible with the Kyuubi server version. If Beeline is not installed, see Getting Started - Apache Kyuubi.

The JDBC URL format is:

jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>

If you use DLF, add user=<UserName/RoleName> after transportMode=http;:

beeline -u "jdbc:hive2://<endpoint>:<port>/;transportMode=http;user=<UserName/RoleName>;httpPath=cliservice/token/<token>"

For other catalogs:

beeline -u "jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>"

To set session parameters, append them after #:

beeline -u "jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>;#spark.sql.shuffle.partitions=100;spark.executor.instances=2;"

Connect using Java

Add the following dependencies to your pom.xml file. Replace version numbers as needed.

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>2.3.9</version>
    </dependency>
</dependencies>

Connect to the gateway with the following code. The JDBC URL format is: jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token> If you use DLF, append ;user=<UserName/RoleName> to the URL.

import org.apache.hive.jdbc.HiveStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class Main {
    public static void main(String[] args) throws Exception {
        // For DLF: append ;user=<UserName/RoleName> to the URL
        String url = "jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(url);
        HiveStatement stmt = (HiveStatement) conn.createStatement();

        String sql = "select * from students;";
        System.out.println("Running " + sql);
        ResultSet res = stmt.executeQuery(sql);

        ResultSetMetaData md = res.getMetaData();
        String[] columns = new String[md.getColumnCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = md.getColumnName(i + 1);
        }
        while (res.next()) {
            System.out.print("Row " + res.getRow() + "=[");
            for (int i = 0; i < columns.length; i++) {
                if (i != 0) {
                    System.out.print(", ");
                }
                System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
            }
            System.out.println(")]");
        }

        conn.close();
    }
}

Connect using Python

Install the required packages.
```
pip3 install pyhive thrift
```
Connect to the gateway. Store sensitive values in environment variables rather than hardcoding them in your script. Set the following environment variables before running:
- KYUUBI_ENDPOINT — the gateway endpoint
- KYUUBI_PORT — the port number
- KYUUBI_TOKEN — the token value
- KYUUBI_USERNAME — the RAM user or role name (required when using DLF; for other catalogs, set this to the token name)
```
import os
from pyhive import hive

if __name__ == '__main__':
    cursor = hive.connect(
        os.environ['KYUUBI_ENDPOINT'],
        port=os.environ['KYUUBI_PORT'],
        scheme='http',
        username=os.environ['KYUUBI_USERNAME'],
        password=os.environ['KYUUBI_TOKEN']
    ).cursor()
    cursor.execute('show databases')
    print(cursor.fetchall())
    cursor.close()
```
When using DLF, set KYUUBI_USERNAME to the RAM user or role name (<UserName/RoleName>). For other catalogs, set KYUUBI_USERNAME to the token name (<tokenname>).

Connect using the REST API

Kyuubi Gateway supports open-source-compatible Representational State Transfer (REST) APIs over HTTP. The following API paths are supported:

/api/v1/sessions/*
/api/v1/operations/*
/api/v1/batches/*

Example 1: Run an SQL query

All four steps use token-based authentication in the URL path (/token/<token>). If you use DLF, add -u '<UserName/RoleName>:<password>' to the curl command in each step, where <password> is a placeholder and can be any value.

Create a session and specify Spark configurations. Replace <dev_queue> with the actual queue name. If you use DLF:

curl -X 'POST' \
  'http://<endpoint>:<port>/api/v1/sessions/token/<token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -u '<UserName/RoleName>:<password>' \
  -d '{
  "configs": {
    "set:hivevar:spark.emr.serverless.kyuubi.engine.queue": "<dev_queue>"
  }
}'

If you use another catalog:

curl -X 'POST' \
  'http://<endpoint>:<port>/api/v1/sessions/token/<token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "configs": {
    "set:hivevar:spark.emr.serverless.kyuubi.engine.queue": "<dev_queue>"
  }
}'

The response returns a session handle (identifier) that identifies this session:

{"identifier":"619e6ded-xxxx-xxxx-xxxx-c2a43f6fac46","kyuubiInstance":"0.0.0.0:10099"}

Use the identifier value as <sessionHandle> in subsequent requests.

Submit a SQL statement. If you use DLF:

curl -X 'POST' \
  'http://<endpoint>:<port>/api/v1/sessions/<sessionHandle>/operations/statement/token/<token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -u '<UserName/RoleName>:<password>' \
  -d '{
  "statement": "select * from test;",
  "runAsync": true,
  "queryTimeout": 0,
  "confOverlay": {
    "additionalProp1": "string",
    "additionalProp2": "string"
  }
}'

If you use another catalog:

curl -X 'POST' \
  'http://<endpoint>:<port>/api/v1/sessions/<sessionHandle>/operations/statement/token/<token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "statement": "select * from test;",
  "runAsync": true,
  "queryTimeout": 0,
  "confOverlay": {
    "additionalProp1": "string",
    "additionalProp2": "string"
  }
}'

The response returns an operation handle:

{"identifier":"a743e8ff-xxxx-xxxx-xxxx-a66fec66cfa4"}

Use the identifier value as <operationHandle> in subsequent requests.

Check the statement status. If you use DLF:

curl --location -X 'GET' \
  'http://<endpoint>:<port>/api/v1/operations/<operationHandle>/event/token/<token>' \
  -H 'accept: application/json' \
  -u '<UserName/RoleName>:<password>'

If you use another catalog:

curl --location -X 'GET' \
  'http://<endpoint>:<port>/api/v1/operations/<operationHandle>/event/token/<token>' \
  -H 'accept: application/json'

Retrieve the results. If you use DLF:

curl --location -X 'GET' \
  'http://<endpoint>:<port>/api/v1/operations/<operationHandle>/rowset/token/<token>/?maxrows=100&fetchorientation=FETCH_NEXT' \
  -H 'accept: application/json' \
  -u '<UserName/RoleName>:<password>'

If you use another catalog:

curl --location -X 'GET' \
  'http://<endpoint>:<port>/api/v1/operations/<operationHandle>/rowset/token/<token>/?maxrows=100&fetchorientation=FETCH_NEXT' \
  -H 'accept: application/json'

Example 2: Submit a batch job

Use the batches API to submit a Spark batch job. The gateway starts a Spark application and runs the specified task.

Download the test JAR package: spark-examples_2.12-3.3.1.jar. This is the SparkPi example that comes with Spark and calculates the value of Pi (π).

If you use DLF:

curl --location \
  --request POST 'http://<endpoint>:<port>/api/v1/batches/token/<token>' \
  --user '<UserName/RoleName>:<password>' \
  --form 'batchRequest="{
    \"batchType\": \"SPARK\",
    \"className\": \"org.apache.spark.examples.SparkPi\",
    \"name\": \"kyuubi-spark-pi\",
    \"resource\": \"oss://bucket/path/to/spark-examples_2.12-3.3.1.jar\"
  }";type=application/json'

If you use another catalog:

curl --location \
  --request POST 'http://<endpoint>:<port>/api/v1/batches/token/<token>' \
  --form 'batchRequest="{
    \"batchType\": \"SPARK\",
    \"className\": \"org.apache.spark.examples.SparkPi\",
    \"name\": \"kyuubi-spark-pi\",
    \"resource\": \"oss://bucket/path/to/spark-examples_2.12-3.3.1.jar\"
  }";type=application/json'

Configure a high-availability Kyuubi Gateway

A high-availability (HA) Kyuubi Gateway uses ZooKeeper for coordination and requires a custom JDBC driver for HA-aware connections.

Prerequisites

A ZooKeeper cluster accessible from your client (for example, Alibaba Cloud MSE or the ZooKeeper component on EMR on ECS)
Network connectivity between your client and the ZooKeeper cluster's VPC. See Network connectivity between EMR Serverless Spark and other VPCs.

Enable high availability

When creating or editing a Kyuubi Gateway, enable Service High Availability, set the number of Kyuubi servers and the ZooKeeper cluster address, and select a network connection for Network Connection.

Connect to the HA gateway

After enabling HA, the gateway coordinates through ZooKeeper. Use the same placeholders described in Connect to a Kyuubi Gateway.

Connect using Beeline

The HA Beeline connection requires a custom JDBC Driver JAR.

Download kyuubi-hive-jdbc-1.9.2.jar.
Replace the default JDBC Driver JAR file.
1. Back up the existing driver:
```
mv /your_path/apache-kyuubi-1.9.2-bin/beeline-jars /bak_path
```
  Note
  On EMR on ECS, the default Kyuubi path is /opt/apps/KYUUBI/kyuubi-1.9.2-1.0.0/beeline-jars. To find the installation path, run env | grep KYUUBI_HOME.
2. Copy the downloaded driver to the Beeline directory:
```
cp /download/serverless-spark-kyuubi-hive-jdbc-1.9.2.jar /your_path/apache-kyuubi-1.9.2-bin/beeline-jars
```

Connect using Beeline:

/your_path/apache-kyuubi-1.9.2-bin/bin/beeline -u 'jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>'

Connect using Java

The HA Java connection uses a shaded JDBC driver from Serverless Spark instead of the standard Hive JDBC driver.

Download serverless-spark-kyuubi-hive-jdbc-shaded-1.9.2.jar.

Install the driver to your local Maven repository:

mvn install:install-file \
  -Dfile=/download/serverless-spark-kyuubi-hive-jdbc-shaded-1.9.2.jar \
  -DgroupId=org.apache.kyuubi \
  -DartifactId=kyuubi-hive-jdbc-shaded \
  -Dversion=1.9.2-ss \
  -Dpackaging=jar

Add the following dependencies to your pom.xml file:

<dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kyuubi</groupId>
      <artifactId>kyuubi-hive-jdbc-shaded</artifactId>
      <version>1.9.2-ss</version>
    </dependency>
</dependencies>

Connect with the following code. Note that the driver class and import differ from the standard (non-HA) Java example.

import org.apache.kyuubi.jdbc.hive.KyuubiStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class Main {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://<endpoint>:<port>/;transportMode=http;httpPath=cliservice/token/<token>";
        Class.forName("org.apache.kyuubi.jdbc.KyuubiHiveDriver");
        Connection conn = DriverManager.getConnection(url);
        KyuubiStatement stmt = (KyuubiStatement) conn.createStatement();

        String sql = "select * from test;";
        ResultSet res = stmt.executeQuery(sql);

        ResultSetMetaData md = res.getMetaData();
        String[] columns = new String[md.getColumnCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = md.getColumnName(i + 1);
        }
        while (res.next()) {
            System.out.print("Row " + res.getRow() + "=[");
            for (int i = 0; i < columns.length; i++) {
                if (i != 0) {
                    System.out.print(", ");
                }
                System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
            }
            System.out.println(")]");
        }

        conn.close();
    }
}

View Spark jobs submitted through Kyuubi

Monitor Spark jobs submitted through a Kyuubi Gateway on the Kyuubi Application tab of the Job History page. The tab shows Application ID, Application Name, Application Status, and Start At for each job.

The Application ID (in spark-xxxx format) is generated by the Spark engine and matches the ID returned by the Kyuubi client, uniquely identifying each task instance.

To view jobs for a specific gateway:

On the Kyuubi Gateway page, click the target gateway name.
In the upper-right corner, click Applications. The page lists all Spark jobs submitted through this gateway.