Use Java Database Connectivity (JDBC) to connect your application to Lindorm Distributed Processing System (LDPS) and run Spark SQL queries, analytics, and data generation workloads.
Prerequisites
Before you begin, make sure you have:
A Lindorm instance with LindormTable activated. See Create an instance.
LDPS activated for the instance. See Activate LDPS and modify the configurations.
JDK 1.8 or later installed in a Java IDE.
Get the JDBC endpoint
The JDBC endpoint follows this format:
jdbc:hive2://<host>:10009/;?token=<your-token>

To look up the endpoint for your instance, see View endpoints.
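For example, the code samples later in this topic use a placeholder endpoint of this shape; replace the host and token with the values for your own instance:

jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****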
Connect with Beeline
Use Beeline, the interactive CLI client bundled in the Spark release package, to run SQL statements directly against LDPS without writing any application code.
Download the Spark release package and decompress it.
Set the SPARK_HOME environment variable to the decompressed directory:

export SPARK_HOME=/path/to/spark/

Configure $SPARK_HOME/conf/beeline.conf with the following parameters (a sample file follows these steps):

Parameter      Description
endpoint       JDBC endpoint of LDPS.
user           Username for Lindorm wide tables.
password       Password for Lindorm wide tables.
shareResource  Whether multiple interactive sessions share Spark resources. Default: true.

Start Beeline:

$SPARK_HOME/bin/beeline

In the interactive session, execute SQL statements against your LDPS data sources.
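For reference, a minimal beeline.conf sketch, assuming a key=value line format and the placeholder endpoint used elsewhere in this topic (replace every value with your own):

endpoint=jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****
user=****
password=****
shareResource=true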
LDPS supports multiple data source types. For details, see Precautions.
Example: Create and query a table with Hive Metastore
After you activate Hive Metastore, run the following statements to create a table, insert data, and query it. For setup instructions, see Use Hive Metastore to manage metadata in Lindorm.
CREATE TABLE test (id INT, name STRING);
INSERT INTO test VALUES (0, 'Jay'), (1, 'Edison');
SELECT id, name FROM test;
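If the statements succeed, the SELECT returns both inserted rows. In Beeline the result is rendered as a table similar to:

+-----+----------+
| id  |   name   |
+-----+----------+
| 0   | Jay      |
| 1   | Edison   |
+-----+----------+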
Connect with Java
All Java examples use the org.apache.hive.jdbc.HiveDriver driver and the DriverManager.getConnection() API.
Add the JDBC driver dependency to your project. For Maven, add the following to pom.xml:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.3.8</version>
</dependency>

Connect to LDPS and run a query:
import java.sql.*;

public class App {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Replace with your LDPS JDBC endpoint
        String endpoint = "jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****";
        String user = "";
        String password = "";
        Connection con = DriverManager.getConnection(endpoint, user, password);
        Statement stmt = con.createStatement();
        // Execute a query and print results
        ResultSet res = stmt.executeQuery("SELECT * FROM test");
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

(Optional) Pass Spark job parameters by appending them to the endpoint URL with semicolons:
String endpoint = "jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****"
    + ";spark.dynamicAllocation.minExecutors=3"
    + ";spark.sql.adaptive.enabled=false";
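The basic example above does not close its JDBC resources. A slightly more robust sketch using standard JDBC try-with-resources (plain java.sql usage, nothing Lindorm-specific beyond the endpoint; the class name AppWithCleanup is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AppWithCleanup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Replace with your LDPS JDBC endpoint, username, and password.
        String endpoint = "jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****";
        // try-with-resources closes the connection, statement, and result set
        // even if the query throws.
        try (Connection con = DriverManager.getConnection(endpoint, "", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery("SELECT id, name FROM test")) {
            while (res.next()) {
                System.out.println(res.getInt(1) + "\t" + res.getString(2));
            }
        }
    }
}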
Connect with Python
All Python examples use the JayDeBeApi library, which bridges the Python DB-API 2.0 interface to the Hive JDBC driver.
Download the Spark release package and decompress it.
Set the environment variables:
export SPARK_HOME=/path/to/dir/
export CLASSPATH=$CLASSPATH:$SPARK_HOME/jars/*

Install JayDeBeApi:

pip install JayDeBeApi

Connect to LDPS and run a query:
import jaydebeapi

driver = 'org.apache.hive.jdbc.HiveDriver'
endpoint = 'jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****'
# Path to the Hive JDBC driver JAR in the decompressed Spark package
jar_path = '/path/to/sparkhome/jars/hive-jdbc-****.jar'
user = '****'
password = '****'

conn = jaydebeapi.connect(driver, endpoint, [user, password], [jar_path])
cursor = conn.cursor()
cursor.execute("select 1")
results = cursor.fetchall()
cursor.close()
conn.close()

(Optional) Pass Spark job parameters by appending them to the endpoint string:
endpoint = (
    "jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****"
    ";spark.dynamicAllocation.minExecutors=3"
    ";spark.sql.adaptive.enabled=false"
)
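Beyond the "select 1" connectivity check, the same DB-API 2.0 cursor can run real queries. A minimal sketch, assuming the test table from the Hive Metastore example exists, using try/finally so the cursor and connection are always closed:

import jaydebeapi

driver = 'org.apache.hive.jdbc.HiveDriver'
endpoint = 'jdbc:hive2://123.234.XX.XX:10009/;?token=bisdfjis-f7dc-fdsa-9qwe-dasdfhhv8****'
jar_path = '/path/to/sparkhome/jars/hive-jdbc-****.jar'

conn = jaydebeapi.connect(driver, endpoint, ['****', '****'], [jar_path])
try:
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT id, name FROM test")
        # cursor.description is standard DB-API 2.0: one entry per result column.
        print([col[0] for col in cursor.description])
        for row in cursor.fetchall():
            print(row)
    finally:
        cursor.close()
finally:
    conn.close()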
What's next
Precautions — supported data sources and known limitations for LDPS
Use Hive Metastore to manage metadata in Lindorm — manage table metadata for Beeline and JDBC queries
Activate LDPS and modify the configurations — tune LDPS cluster settings