This topic describes how to submit a job and view information about the job.
Prerequisites
You have logged on to an E-MapReduce (EMR) cluster. For more information, see Log on to a cluster.
Submit a job
Method 1: Submit a job by directly executing statements
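With this method, you start an interactive streaming-sql session and execute statements directly. The following is a minimal sketch; the query name and table name are placeholders:

```sql
-- Executed in an interactive streaming-sql session
SET streaming.query.name=test_query;  -- name the query so it can be identified on the web UI
SELECT * FROM source_table;           -- any DDL or DML statement can be executed here
```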
Method 2: Submit a job by using a file
Write the DDL or DML statements that you want to execute to a file. In this example, a file named test.sql is used. Then, use one of the following methods to start a streaming job:
- Submit a job in yarn-client mode (default)
streaming-sql -f test.sql
- Submit a job in yarn-cluster mode
streaming-sql --master yarn --deploy-mode cluster -f test.sql
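For reference, the test.sql file might contain statements such as the following. This is a sketch; the query name and table names are placeholders:

```sql
-- test.sql: a minimal Streaming SQL job
SET streaming.query.name=sample_query;

-- Continuously read from the source table and write to the sink table
INSERT INTO sink_table
SELECT * FROM source_table;
```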
View information about a job
The open source Spark web UI does not provide a page for viewing information about Spark Structured Streaming jobs. EMR V3.21.0 and later versions provide a preview release of the Spark Streaming SQL feature, which extends the Spark web UI so that you can view statistics about structured streaming queries. You can also view general information about a structured streaming job on the Spark web UI.
Access the Spark web UI
${baseUrl} indicates the URL that you want to use.
View the list of queries
- Active Streaming Queries: displays the streaming queries that are running.
- Completed Streaming Queries: displays the completed queries, including the finished and failed queries.
| Parameter | Description |
| --- | --- |
| Query Name | The name of the query. The name is specified by running the SET streaming.query.name=${QUERY_NAME} command. |
| Status | The status of the query. Valid values: RUNNING, FAILED, and FINISHED. |
| Id | The ID of the query. The ID is saved to a checkpoint file and remains unchanged even if the query is run multiple times. |
| Run ID | The run ID of the query. A new run ID is generated each time the query is run. |
| Submit Time | The time when the query was submitted. |
| Duration | The duration for which the query has been running. |
| Avg Input PerSec | The average data input rate over the most recent batches. The number of batches is specified by the spark.sql.streaming.numRecentProgressUpdates parameter. Default value: 100. |
| Avg Process PerSec | The average data processing rate over the most recent batches. The number of batches is specified by the spark.sql.streaming.numRecentProgressUpdates parameter. Default value: 100. |
| Total Input Rows | The total number of input rows over the most recent batches, not over the entire lifetime of the query. The number of batches is specified by the spark.sql.streaming.numRecentProgressUpdates parameter. Default value: 100. |
| Last Batch ID | The ID of the most recent batch that finished running. |
| Last Progress | The progress information about the most recent batch. |
| ERROR | The error message that is returned if the query fails. |
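If the default window of 100 batches is not suitable for these statistics, the parameter can be changed in the SQL script before the query starts. The following is a sketch; the value 200 is arbitrary:

```sql
-- Average rates and totals on the web UI will then cover the most recent 200 batches
SET spark.sql.streaming.numRecentProgressUpdates=200;
```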
On the Streaming Query page, you can terminate a query. After the query is terminated, the status of the query changes to FINISHED.