This topic describes how to submit a job and view information about the job.

Prerequisites

You have logged on to an E-MapReduce (EMR) cluster. For more information, see Log on to a cluster.

Submit a job

Method 1: Submit a job by directly executing statements

  1. After you log on to a cluster, run the following command to start Streaming SQL:
    streaming-sql
  2. Enter the DDL or DML statements supported by Streaming SQL.
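For example, an interactive session might look like the following sketch. The table name, Kafka endpoint, and topic are hypothetical placeholders, not values from your cluster; replace them with your own, and adjust the options to the data source that you use:

```sql
-- Hypothetical example: define a source table over a Kafka topic
-- (table name, bootstrap servers, and topic are placeholders).
CREATE TABLE kafka_src
USING kafka
OPTIONS (
  kafka.bootstrap.servers = "emr-header-1:9092",
  subscribe = "test_topic"
);

-- Inspect the stream; adapt the statement to your own schema.
SELECT * FROM kafka_src;
```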

Method 2: Submit a job by using a file

Write the DDL or DML statements that you want to execute in a file. In this example, a file named test.sql is used. Then, use one of the following methods to start a streaming job:

  • Submit a job in yarn-client mode (default)
    streaming-sql -f test.sql
  • Submit a job in yarn-cluster mode
    streaming-sql --master yarn --deploy-mode cluster -f test.sql
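For reference, test.sql might contain statements like the following sketch. The query name, checkpoint location, and table names are hypothetical placeholders; replace them with your own:

```sql
-- Hypothetical contents of test.sql: name the query, set a
-- checkpoint location, then continuously write from a source
-- table to a sink table.
SET streaming.query.name=test_query;
SET spark.sql.streaming.checkpointLocation.test_query=/tmp/test_query_checkpoint;

INSERT INTO sink_table
SELECT col1, col2 FROM source_table;
```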

View information about a job

Open-source Spark does not provide a web UI for viewing information about Structured Streaming jobs. EMR V3.21.0 and later versions provide a preview release of the Spark Streaming SQL feature, which extends the Spark web UI so that you can view statistics about structured streaming queries and information about structured streaming jobs.

Access the Spark web UI

  1. Go to the Public Connect Strings page for the cluster in the EMR console and click the link for YARN UI.
    To access the YARN web UI by using your Knox account, you must obtain the username and password of the Knox account. For more information, see Manage user accounts.
  2. In the Hadoop console, click ApplicationMaster in the Tracking UI column of the desired job.
  3. On the page that appears, click the Structured Streaming tab.
    On the Streaming Query page, you can view the list of queries.
    You can click the ID of a query in the Run ID column to view the details of that query.
The preceding example shows how to access the Spark web UI by using Knox. If Knox is not enabled for your cluster, you can enter http://${baseUrl}/streamingsql in the address bar of a browser to access the Spark web UI.
Note ${baseUrl} indicates the base URL of the Spark web UI for your cluster.

View the list of queries

The Streaming Query page provides the Active Streaming Queries and Completed Streaming Queries sections.
  • Active Streaming Queries: displays the streaming queries that are running.
  • Completed Streaming Queries: displays the completed queries, including the finished and failed queries.
Note The parameters that are displayed in the Active Streaming Queries and Completed Streaming Queries sections vary based on the version of Spark. The following list describes the parameters:
  • Query Name: The name of the query. You can specify the name by running the SET streaming.query.name=${QUERY_NAME} statement.
  • Status: The status of the query. Valid values: RUNNING, FAILED, and FINISHED.
  • Id: The unique ID of the query. The ID is saved in the checkpoint directory and remains unchanged across multiple runs of the query.
  • Run ID: The run ID of the query. A new run ID is generated each time the query starts.
  • Submit Time: The time when the query was submitted.
  • Duration: The duration for which the query has been running.
  • Avg Input PerSec: The average input rate over the most recent batches. The number of batches is specified by the spark.sql.streaming.numRecentProgressUpdates parameter, which defaults to 100.
  • Avg Process PerSec: The average processing rate over the same set of recent batches.
  • Total Input Rows: The total number of input rows over the same set of recent batches, which is different from the total number of rows processed during the entire lifetime of the query.
  • Last Batch ID: The ID of the most recent batch that finished running.
  • Last Progress: Details about the most recent batch that was run.
  • ERROR: The error message that is returned if the query fails.

On the Streaming Query page, you can terminate a query. After the query is terminated, the status of the query changes to FINISHED.
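To make a query easy to identify in the Query Name column, set its name before starting it. The following sketch uses a hypothetical query name and statement:

```sql
-- The name set here appears in the Query Name column on the
-- Streaming Query page (name and tables are placeholders).
SET streaming.query.name=order_count_query;

INSERT INTO result_table
SELECT COUNT(*) FROM orders_stream;
```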

View the query statistics

On the Streaming Query page, click the ID of a query in the Run ID column. The Streaming Query Statistics page shows the details of the query, including line charts and histograms for the Input Rate, Process Rate, and Input Rows metrics. It also shows a stacked column chart of the Duration metric, broken down into the following phases: WalCommit, QueryPlanning, GetOffset, GetBatch, and AddBatch.
The charts show the time consumed by each batch in each phase at different points in time. The Streaming Query Statistics page displays statistics for only a limited number of recent batches, as specified by the spark.sql.streaming.numRecentProgressUpdates parameter. To view statistics for more batches, set this parameter to a larger value. However, a larger value consumes more memory.
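For example, you might raise the parameter before starting the query. The value below is illustrative only; choose it based on the memory available to your driver:

```sql
-- Keep progress updates for the 200 most recent batches
-- instead of the default 100 (value is an example).
SET spark.sql.streaming.numRecentProgressUpdates=200;
```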