This topic describes the CREATE SCAN statement of Spark SQL. This statement is supported in EMR V3.23.0 and later.

Background information

The CREATE SCAN statement supports the data model of Spark Data Source API V2. You can use the CREATE SCAN statement to implement batch and streaming queries in Spark SQL in a unified manner. For example, you can define a Kafka data source table and define a CREATE SCAN statement for batch read and another for streaming read.

When you define a table, specify only the basic information of the data source. You do not need to specify the method to read the parameters of the table. Take note of the following limits on the CREATE SCAN statement:
  • You can use a view defined by the CREATE SCAN statement only as a data source table, not a data output table.
  • You can use the CREATE SCAN statement to process raw tables, but raw tables support only batch read.

Syntax

CREATE SCAN tbName_alias
ON tbName
USING queryType
OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
You can set queryType to one of the following values:
  • BATCH: implements batch read on source table tbName and defines temporary view tbName_alias.
  • STREAM: implements streaming read on source table tbName and defines temporary view tbName_alias.

You can define the runtime parameters for reading the data source in the OPTIONS clause. The parameters vary depending on data sources. Most of the parameters are used to define streaming read.

Example

  1. Create a Log Service data source table.
    spark-sql> CREATE TABLE loghub_table_intput_test(content string)
             > USING loghub
             > OPTIONS
             > (...)
  2. Process the Log Service table data offline by using the BATCH method, and count the number of data records generated till the current time.
    spark-sql> CREATE SCAN loghub_table_intput_test_batch
             > ON loghub_table_intput_test
             > USING BATCH;
    spark-sql> SELECT COUNT(*) FROM loghub_table_intput_test_batch;
  3. Process the Log Service table data by using the STREAM method.
    spark-sql> CREATE TABLE loghub_table_output_test(content string)
             > USING loghub
             > OPTIONS
             > (...);
    
    spark-sql> CREATE SCAN loghub_table_intput_test_stream
             > ON loghub_table_intput_test
             > USING STREAM;
    
    -- The following code shows an invalid SELECT operation on a streaming table. An error is reported.
    spark-sql> SELECT COUNT(*) FROM loghub_table_test_stream;
    Error in query: Queries with streaming sources must be executed with writeStream.start();; 
    spark-sql> INSERT INTO loghub_table_output_test SELECT content FROM loghub_table_intput_test_stream;