This topic describes how to use Ganos Spark to manage and analyze large-scale geographic data based on the Apache Spark distributed computing system. Ganos Spark provides a variety of Spark-based API operations for data loading, analytics, and storage, and offers data analytics models at different levels. The most basic model is GeometryRDD, which converts between SimpleFeatures in Ganos data and resilient distributed dataset (RDD) models in Spark. Based on Spark SQL, Ganos Spark also provides a set of user-defined types (UDTs), user-defined functions (UDFs), and user-defined aggregate functions (UDAFs) to manage spatial data. These functions allow you to query and analyze data by using an SQL-like structured query language. The following figure shows the architecture of Ganos Spark:

1. Download the Ganos Spark toolkit

Click Ganos Spark driver to download the Ganos Spark toolkit.

Add the following dependencies to the pom.xml file in the project directory:
<!-- Ganos Spark -->
<dependency>
    <groupId>com.aliyun.ganos</groupId>
    <artifactId>ganos-spark-runtime</artifactId>
    <version>1.0-SNAPSHOT</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/../ganos-spark-runtime-1.0-SNAPSHOT.jar</systemPath>
</dependency>

<!-- Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>4.1.18.Final</version>
</dependency>
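
The Spark dependencies above reference the scala.binary.version and spark.version Maven properties. If your pom does not define them yet, add a <properties> block. The versions below are placeholders; align them with the Spark and Scala versions used in your environment:

<properties>
    <!-- Placeholder versions; replace with the versions that match your cluster. -->
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.4.3</spark.version>
</properties>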

2. Query data in HBase Ganos by using Ganos Spark

After you configure the runtime environment, you can connect to HBase Ganos to query data by using the Ganos Spark service, as shown in the following example:

package com.aliyun.ganos

import com.aliyun.ganos.spark.GanosSparkKryoRegistrator
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object GanosSparkDemo {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com").setLevel(Level.ERROR)

    // Specify connection parameters of ApsaraDB for HBase. The catalog name is set to POINT.
    val params = Map(
      "hbase.catalog" -> "POINT",
      "hbase.zookeepers" -> "The ZooKeeper address used to connect to ApsaraDB for HBase",
      "geotools" -> "true")

    // Initialize a SparkSession object.
    val sparkSession = SparkSession.builder
      .appName("Simple Application")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.crossJoin.enabled", "true")
      .config("spark.kryo.registrator", classOf[GanosSparkKryoRegistrator].getName)
      .master("local[*]")
      .getOrCreate()

    // Load the automatic identification system (AIS) data.
    val dataFrame = sparkSession.read
      .format("ganos")
      .options(params)
      .option("ganos.feature", "AIS")
      .load()

    // Query all data.
    dataFrame.createOrReplaceTempView("ais")
    val r = sparkSession.sql("SELECT * FROM ais")
    r.show()

    // Query spatio-temporal data.
    val r1 = sparkSession.sql("SELECT * FROM ais WHERE st_contains(st_makeBBOX(70.00000,11.00000,75.00000,14.00000), geom)")
    r1.show()

    // Write query results to HBase Ganos.
    r1.write.format("ganos").options(params).option("ganos.feature", "result").save()
  }
}
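
The example runs in local mode because the master is set to local[*]. To run it on a YARN cluster instead, remove the .master("local[*]") setting and submit the application with spark-submit. The application jar name below is a placeholder:

$ spark-submit --class com.aliyun.ganos.GanosSparkDemo --master yarn --jars ganos-spark-runtime-1.0-SNAPSHOT.jar your-application.jar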

The following figure shows the output:

For more information about spatial functions supported by Ganos Spark, see Ganos Spark functions.

3. Use Ganos Spark in Jupyter

Ganos Spark provides a toolkit that allows you to query data and display the results in Jupyter.

Click Leaflet tool to download the Ganos Spark Leaflet toolkit.

Log on to the console and perform the following tasks:

3.1. Install Jupyter.
$ pip install --upgrade jupyter
or
$ pip3 install --upgrade jupyter
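The kernel in the next step is registered through Apache Toree, so Toree itself must also be installed. Assuming the PyPI package name toree (add the --pre flag if only pre-release builds are available):
$ pip install toree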
3.2. Configure the SPARK_HOME environment variable, add a kernel named Ganos Spark Test by using toree, and then launch Jupyter:
$ jars="ganos-spark-runtime-1.0-SNAPSHOT.jar,ganos-spark-jupyter-leaflet-1.0-SNAPSHOT.jar"
$ jupyter toree install --replace --user --kernel_name "Ganos Spark Test" --spark_home=${SPARK_HOME} --spark_opts="--master local[*] --jars $jars"
$ jupyter notebook

After the server is started, you can visit http://localhost:8888 and create a Ganos Spark Test session in the Jupyter console.

3.3. Load HBase Ganos data.

3.3.1. Create a Spark session.
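
A minimal sketch of this cell, reusing the configuration from step 2 (the Kryo serializer settings are the same; the master is supplied by the kernel's --spark_opts, so it is not set here):

import com.aliyun.ganos.spark.GanosSparkKryoRegistrator
import org.apache.spark.sql.SparkSession

// Create a SparkSession with the same Kryo configuration as in step 2.
val sparkSession = SparkSession.builder
  .appName("Ganos Spark Test")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.crossJoin.enabled", "true")
  .config("spark.kryo.registrator", classOf[GanosSparkKryoRegistrator].getName)
  .getOrCreate()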

3.3.2. Query data in HBase Ganos by using Spark SQL.
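
A sketch of the query cell, assuming the same connection parameters and AIS feature as in step 2 (replace the ZooKeeper address with your own):

// Connection parameters for HBase Ganos, as in step 2.
val params = Map(
  "hbase.catalog" -> "POINT",
  "hbase.zookeepers" -> "The ZooKeeper address used to connect to ApsaraDB for HBase",
  "geotools" -> "true")

// Load the AIS feature as a DataFrame and run a spatial query.
val dataFrame = sparkSession.read
  .format("ganos")
  .options(params)
  .option("ganos.feature", "AIS")
  .load()

dataFrame.createOrReplaceTempView("ais")
val result = sparkSession.sql(
  "SELECT * FROM ais WHERE st_contains(st_makeBBOX(70.00000,11.00000,75.00000,14.00000), geom)")
result.show()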

3.3.3. Display the data on a Leaflet map:

You can click here to download the complete notebook document for testing.