This topic describes how to load Hive data into DLA Ganos for analysis.

Hive is a widely used data warehousing tool in the Hadoop ecosystem. It is primarily used for extract, transform, load (ETL) operations, data queries, and data analysis on large amounts of data in Hadoop. Hive maps structured data files that are stored in Hadoop Distributed File System (HDFS) to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks.
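
For example, the following HiveQL statement maps Parquet files in an HDFS directory to a table that can be queried with SQL. The table name, columns, and path are hypothetical; the procedure below assumes a table with numeric x and y columns like this one.

    CREATE EXTERNAL TABLE sample_points (
      id BIGINT,
      x DOUBLE,
      y DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/sample_points';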

Procedure

  1. Initialize a Spark session.
     import org.apache.spark.sql.SparkSession
     // GanosSparkKryoRegistrator and the withJTS method are provided by the DLA Ganos Spark dependency.
     val spark: SparkSession = SparkSession.builder()
          // Connect to the Hive metastore and warehouse of your cluster.
          .config("hive.metastore.uris", hiveMetastoreUris)
          .config("spark.sql.warehouse.dir", hiveWarehouseDir)
          // Use Kryo serialization with the Ganos registrator for spatio-temporal types.
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .config("spark.sql.crossJoin.enabled", "true")
          .config("spark.kryo.registrator", classOf[GanosSparkKryoRegistrator].getName)
          .enableHiveSupport()
          .getOrCreate()
    
    // Load the JTS package for the Spark session to process spatio-temporal data.
    spark.withJTS
    import spark.implicits._
    val sc = spark.sparkContext
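
     In the preceding code, hiveMetastoreUris and hiveWarehouseDir are placeholders for your cluster's connection settings. For example, they might be defined as follows; the host, port, and path are examples only:

     // Example values only. Replace them with your Hive metastore URI and warehouse directory.
     val hiveMetastoreUris = "thrift://your-metastore-host:9083"
     val hiveWarehouseDir = "/user/hive/warehouse"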
  2. Load data from Hive.
    val tableName = args(0)
    // Read the specified table from Hive into a DataFrame.
    val dfFromHive = spark.sql(
      s"""
         |select * from $tableName
         |""".stripMargin)
  3. Create a spatio-temporal geometry object.
    import org.apache.spark.sql.functions.col
    // Build a point geometry for each record from its x and y columns.
    val ganosDF = dfFromHive.withColumn("geom", st_makePoint(col("x"), col("y")))
    ganosDF.show

    In the preceding statement, x and y represent the horizontal and vertical coordinate values, respectively, of the spatial object in each record.
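
    To spot-check the generated geometries, you can print the first few values of the new column (the row count of 5 is arbitrary):

    // Show a few geometry values without truncating their text representation.
    ganosDF.select("geom").show(5, false)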

  4. Query spatio-temporal data.
    ganosDF.createOrReplaceTempView("testpoints")
    // Run a SQL query that returns the points contained in the bounding box (38,48)-(52,62).
    val points = spark.sql("select * from testpoints where st_contains(st_makeBox2d(st_point(38,48), st_point(52,62)), geom)")
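
    You can then inspect the query result like any other DataFrame, for example:

    // Display the points that fall inside the bounding box and count them.
    points.show()
    println(s"Matched ${points.count()} points")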