This topic introduces the configuration concepts of streaming queries and describes the configuration parameters.

Configuration concepts

Note: We recommend that you do not use streaming query configurations in EMR V3.23.0 or later. For information about the latest query configurations, see CREATE SCAN and STREAM statement.
Before you use Spark SQL to implement a streaming query, get familiar with the following concepts:
  • Data source configuration: the definition of a table.

    When you define a table, configure only the data source. For example, you can specify the endpoint of the Kafka data source and the topic name. We recommend that you do not include the configurations of specific query instances in the table definition. Otherwise, unrelated queries cannot run against the table concurrently.

  • Query instance configuration: the parameter settings for running a streaming query.

    Each query instance must be configured separately. Because parameters are bound to a query by its queryName, setting queryName lets you adjust them without modifying the SQL statement itself. Use the SET statement to configure the parameters. For more information about the parameters, see Configuration parameters.

The statements for query instances include:
  • INSERT INTO ...
  • CREATE TABLE ... AS SELECT ...
Each query instance uses the queryName from the nearest SET statement that precedes it. Examples:
  • Example 1:
    SET streaming.query.name=one_test_job
    
    -- query 1
    INSERT INTO tb_test_1 SELECT ...
    
    -- query 2
    INSERT INTO tb_test_2 SELECT ...
    
    -- The names of queries 1 and 2 are both one_test_job. This example is invalid because the name of each query instance must be unique.
  • Example 2:
    SET streaming.query.name=one_test_job_1
    SET streaming.query.name=one_test_job_2
    
    -- query 1
    CREATE TABLE tb_test_1 AS SELECT ...
    
    -- The name of query 1 is one_test_job_2.
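
The nearest-SET rule illustrated by the two examples can be sketched as a small check. This is a minimal illustration, not part of EMR or Spark: the helper names resolve_query_names and duplicate_names are hypothetical, and the sketch assumes one statement per line and that "nearest" means the closest preceding SET statement.

```python
import re

# Matches "SET streaming.query.name=<name>", with an optional trailing semicolon.
SET_NAME = re.compile(
    r"^SET\s+streaming\.query\.name\s*=\s*([^\s;]+)\s*;?$", re.IGNORECASE
)
# Statement forms that start a query instance: INSERT INTO ... and
# CREATE TABLE ... AS SELECT ... (assumed to fit on a single line here).
QUERY_START = re.compile(
    r"^(INSERT\s+INTO\b|CREATE\s+TABLE\b.*\bAS\s+SELECT\b)", re.IGNORECASE
)


def resolve_query_names(script):
    """Return (statement, query_name) pairs using the nearest preceding SET."""
    current = None
    resolved = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue  # skip blank lines and SQL comments
        match = SET_NAME.match(line)
        if match:
            current = match.group(1)  # a later SET overrides an earlier one
        elif QUERY_START.match(line):
            resolved.append((line, current))
    return resolved


def duplicate_names(resolved):
    """Return the names shared by more than one query instance, sorted."""
    seen, dupes = set(), set()
    for _, name in resolved:
        if name is None:
            continue  # unnamed queries are not compared in this sketch
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return sorted(dupes)


if __name__ == "__main__":
    example_1 = """
    SET streaming.query.name=one_test_job
    -- query 1
    INSERT INTO tb_test_1 SELECT ...
    -- query 2
    INSERT INTO tb_test_2 SELECT ...
    """
    # Both queries resolve to one_test_job, so the name is reported as a duplicate.
    print(duplicate_names(resolve_query_names(example_1)))
```

Running the same check on Example 2 resolves query 1 to one_test_job_2 and reports no duplicates, which matches the comments in the examples.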

Configuration parameters

| Parameter | DataFrame API | SET statement format | Description | Required |
| --- | --- | --- | --- | --- |
| queryName | writeStream.queryName(...) | SET streaming.query.name=$queryName | The name of a streaming query. The name is used to distinguish the parameters of a query from those of other queries. | Yes |
| option | writeStream.option(...) | SET spark.sql.streaming.query.options.$queryName.$optionName=$optionValue | checkpointLocation: the directory of the checkpoint file. | Yes |
| | | | A custom directory. | No |
| outputMode | writeStream.outputMode(...) | SET spark.sql.streaming.query.outputMode.$queryName=$outputMode | The output mode of the query result. Default value: append. | No |
| trigger | writeStream.trigger(...) | SET spark.sql.streaming.query.trigger.$queryName=$triggerType | The execution mode of the query. Default value: ProcessingTime. | No |
| | | SET spark.sql.streaming.query.trigger.intervalMs.$queryName=$intervalMs | The interval between queries. Unit: milliseconds. Default value: 0. | No |
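
To show how $queryName is substituted into each SET statement format listed above, the following minimal sketch builds the statements for one query instance. The helper render_set_statements is hypothetical and not part of EMR or Spark. The checkpoint path and the 30000 ms interval are placeholder values chosen for illustration, and the order in which the statements are emitted is an assumption.

```python
def render_set_statements(query_name, checkpoint_location,
                          output_mode="append",
                          trigger_type="ProcessingTime",
                          interval_ms=0):
    """Build the SET statements for one query instance.

    Defaults mirror the documented defaults: append, ProcessingTime, 0 ms.
    """
    # queryName and checkpointLocation are the two required settings.
    if not query_name:
        raise ValueError("query_name is required")
    if not checkpoint_location:
        raise ValueError("checkpoint_location is required")
    return [
        f"SET streaming.query.name={query_name}",
        f"SET spark.sql.streaming.query.options.{query_name}"
        f".checkpointLocation={checkpoint_location}",
        f"SET spark.sql.streaming.query.outputMode.{query_name}={output_mode}",
        f"SET spark.sql.streaming.query.trigger.{query_name}={trigger_type}",
        f"SET spark.sql.streaming.query.trigger.intervalMs.{query_name}"
        f"={interval_ms}",
    ]


if __name__ == "__main__":
    # Hypothetical values: the checkpoint path and interval are placeholders.
    for statement in render_set_statements(
        "one_test_job", "/tmp/checkpoints/one_test_job", interval_ms=30000
    ):
        print(statement)
```

Because every parameter key except streaming.query.name embeds the query name, the settings of one query instance do not collide with those of another, as long as each instance has a unique name.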