You can use Spark SQL to develop streaming analytics jobs in EMR V3.21.0 and later. This topic describes the data sources supported by Spark SQL and the methods that you can use to process data from these data sources.

Data sources

| Data source | Batch read    | Batch write   | Streaming read | Streaming write |
|-------------|---------------|---------------|----------------|-----------------|
| Kafka       | Supported     | Not supported | Supported      | Supported       |
| Loghub      | Supported     | Supported     | Supported      | Supported       |
| Tablestore  | Supported     | Supported     | Supported      | Supported       |
| DataHub     | Not supported | Not supported | Supported      | Supported       |
| HBase       | Supported     | Supported     | Not supported  | Supported       |
| JDBC        | Supported     | Supported     | Not supported  | Supported       |
| Druid       | Not supported | Not supported | Not supported  | Supported       |
| Redis       | Not supported | Not supported | Not supported  | Supported       |
| Kudu        | Supported     | Supported     | Not supported  | Supported       |
| DTS         | Supported     | Not supported | Supported      | Not supported   |
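As a hedged illustration of how one of these sources is typically declared before it can be read or written, the following sketch registers a Loghub table in Spark SQL. The provider name `loghub` and the option keys (`sls.project`, `sls.store`, `access.key.id`, `access.key.secret`, `endpoint`) are assumptions for illustration; check the release notes of your EMR version for the exact names.

```sql
-- Hypothetical sketch: register a Loghub Logstore as a Spark SQL table.
-- Option keys below are illustrative assumptions, not a confirmed API.
CREATE TABLE loghub_source
USING loghub
OPTIONS (
  "sls.project" = "your_project",        -- Log Service project (placeholder)
  "sls.store" = "your_logstore",         -- Logstore name (placeholder)
  "access.key.id" = "your_access_key",   -- credentials (placeholders)
  "access.key.secret" = "your_secret",
  "endpoint" = "your_region_endpoint"
);
```

Once such a table is registered, batch and streaming queries can reference it by name like any other Spark SQL table, subject to the read/write support listed in the table above.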

Methods to process data in the data sources

You can use one of the following methods to process data in the data sources:

  • Command line

    1. Download the precompiled data source JAR package.

      The JAR package contains the implementations of the Loghub, Tablestore, HBase, JDBC, and Redis data sources, together with their dependencies. The packages for the Kafka and Druid data sources are not included in this JAR package and will be added later. For more information, see Release notes.

    2. Use the streaming-sql command line for interactive development.
      [hadoop@emr-header-1 ~]# streaming-sql --master yarn-client --jars emr-datasources_shaded_2.11-${version}.jar --driver-class-path emr-datasources_shaded_2.11-${version}.jar
  • Workflow

    For more information, see Configure a Streaming SQL job.
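After you start the interactive command line, a streaming job is written as plain SQL statements. The following sketch shows the general shape of such a job, assuming the `CREATE SCAN ... USING STREAM` and `CREATE STREAM` statements described in the EMR Streaming SQL documentation; the table names, query name, and checkpoint path are placeholders, not part of any confirmed example.

```sql
-- Hypothetical sketch of a streaming job in the streaming-sql shell.
-- `loghub_source` and `sink_table` are assumed to be tables that were
-- registered earlier with CREATE TABLE ... USING <data source>.

-- Declare a streaming scan over the source table (syntax assumed).
CREATE SCAN stream_input ON loghub_source USING STREAM;

-- Define and start the streaming query; the checkpointLocation option
-- and its path are illustrative placeholders.
CREATE STREAM example_job
OPTIONS ("checkpointLocation" = "/tmp/spark/example_job")
INSERT INTO sink_table
SELECT *
FROM stream_input;
```

The same statements can also be saved in a script and submitted as a Streaming SQL job through the workflow method described above.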