This topic introduces Spark SQL, Datasets, and DataFrames, and describes how to use Spark SQL to perform basic operations.

Introduction

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark Resilient Distributed Dataset (RDD) API, the Spark SQL interfaces provide Spark with more information about the structure of both the data and the computation being performed. You can use Spark SQL to run SQL queries and to read data from Hive tables.
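
The code samples later in this topic use a SparkSession named spark without creating it, as is the case in spark-shell. The following is only a minimal sketch of how such a session might be created in an application; the application name and the table name are placeholders, and enableHiveSupport() works only if Hive is configured for the cluster:
    import org.apache.spark.sql.SparkSession

    // Build the entry point for Spark SQL. enableHiveSupport() allows Spark SQL to read
    // existing Hive tables, provided that Hive is configured for the cluster.
    val spark = SparkSession.builder()
      .appName("SparkSQLBasics")   // placeholder application name
      .enableHiveSupport()
      .getOrCreate()

    // Run a SQL query. The table name some_hive_table is only a placeholder.
    spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()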

A Dataset is a distributed collection of data. The Dataset interface, introduced in Spark 1.6, combines the advantages of RDDs with those of Spark SQL. A Dataset can be constructed from Java virtual machine (JVM) objects and manipulated by using functional transformations such as map, flatMap, and filter. The Dataset API is available in Scala and Java, but not in Python or R. However, because of the dynamic nature of Python and R, many of the benefits of the Dataset API are already available in those languages.
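
As a rough illustration, the following sketch constructs a Dataset from JVM objects and applies filter and map to it. The Person case class and its sample data are made up for this example, and the spark session from the sketch above is reused:
    // The Person case class and the sample data are hypothetical, used only for illustration.
    case class Person(name: String, age: Long)

    import spark.implicits._  // provides the encoders required by toDS, map, and filter

    // Construct a Dataset from JVM objects and manipulate it with functional transformations.
    val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
    people.filter(p => p.age > 20)
          .map(p => p.name.toUpperCase)
          .show()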

A DataFrame is a Dataset that is organized into named columns. It is conceptually equivalent to a table in a relational database. A DataFrame can be constructed from many sources, such as a structured data file, a Hive table, an external database, or an existing RDD. The DataFrame API is available in Scala, Java, Python, and R. A DataFrame in Scala or Java is represented by a Dataset of rows. In the Scala API, DataFrame is a type alias of Dataset[Row]. In the Java API, a DataFrame is represented by Dataset<Row>.
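
For example, the following sketch (with made-up column names and data) constructs a DataFrame from a local collection and shows that, in Scala, it can be assigned to a Dataset[Row] without any conversion; the spark session from the sketches above is reused:
    import org.apache.spark.sql.{DataFrame, Dataset, Row}
    import spark.implicits._  // provides toDF for local collections

    // Construct a DataFrame from a local collection; the column names are arbitrary examples.
    val df: DataFrame = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")

    // In Scala, DataFrame is a type alias of Dataset[Row], so this assignment compiles as is.
    val rows: Dataset[Row] = df
    rows.printSchema()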

Use Spark SQL to perform basic operations

Spark SQL allows you to manage data by using SQL statements. Spark parses, optimizes, and then executes the SQL statements that you submit.
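
As a rough illustration of that pipeline, reusing the spark session and the df DataFrame from the sketches above, explain(true) prints the plans that Spark produces for a statement before it is executed:
    // Register the example DataFrame as a temporary view so that it can be queried with SQL.
    df.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, and optimized logical plans and the physical plan.
    spark.sql("SELECT name FROM people WHERE age > 20").explain(true)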

The following examples describe how to use Spark SQL to read files:
  • Example 1: Spark supports data in various formats. In this example, Spark reads data in the JSON format and writes it out in the Parquet format. Sample code:
    val peopleDF = spark.read.json("examples/src/main/resources/people.json")
    peopleDF.write.parquet("people.parquet")
  • Example 2: Spark queries a temporary view named parquetFile for the names of people aged 13 to 19, returns the result as a DataFrame, and then uses the map transformation to format each name as a readable string before displaying it. The sketch after these examples shows how the parquetFile view might be registered. Sample code:
    val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
    namesDF.map(attributes => "Name: " + attributes(0)).show()
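
Example 2 assumes that a temporary view named parquetFile already exists and that spark.implicits._ has been imported, because the map call needs the implicit String encoder. The following sketch shows how such a view might be registered from the Parquet files written in Example 1:
    import spark.implicits._  // required for the String encoder used by map in Example 2

    // Load the Parquet output of Example 1 and expose it to SQL as a temporary view.
    val parquetFileDF = spark.read.parquet("people.parquet")
    parquetFileDF.createOrReplaceTempView("parquetFile")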