You can use Spark SQL to develop streaming analytics jobs in EMR V3.21.0 and later. This topic describes the data sources that Spark SQL supports and the methods that you can use to process data in these data sources.
Data sources
| Data source | Batch read | Batch write | Streaming read | Streaming write |
| --- | --- | --- | --- | --- |
| Kafka | Supported | Not supported | Supported | Supported |
| Loghub | Supported | Supported | Supported | Supported |
| Tablestore | Supported | Supported | Supported | Supported |
| DataHub | Not supported | Not supported | Supported | Supported |
| HBase | Supported | Supported | Not supported | Supported |
| JDBC | Supported | Supported | Not supported | Supported |
| Druid | Not supported | Not supported | Not supported | Supported |
| Redis | Not supported | Not supported | Not supported | Supported |
| Kudu | Supported | Supported | Not supported | Supported |
| DTS | Supported | Not supported | Supported | Not supported |
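As an illustration of the batch-read capability listed in the table, a Kafka topic can be queried with plain Spark SQL. The following is a minimal sketch: the table name, broker address, and topic are placeholders, and the `kafka.bootstrap.servers` and `subscribe` option keys follow the open source Spark Kafka connector.

```sql
-- Batch read from Kafka (listed as "Supported" in the table above).
-- The broker address and topic name are placeholders.
CREATE TABLE kafka_events
USING kafka
OPTIONS (
  kafka.bootstrap.servers = "emr-worker-1:9092",
  subscribe = "events"
);

-- A one-off (batch) query over the topic's current contents.
SELECT COUNT(*) FROM kafka_events;
```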
Methods to process data in the data sources
You can use one of the following methods to process data in the data sources:
- Command line
  - Download the precompiled data source JAR package.
    The JAR package contains the implementation packages and related dependencies of the Loghub, Tablestore, HBase, JDBC, and Redis data sources. The packages for the Kafka and Druid data sources are not included in this JAR package and will be added later. For more information, see Release notes.
  - Use the streaming-sql command line for interactive development.
    ```shell
    [hadoop@emr-header-1 ~]# streaming-sql --master yarn-client --jars emr-datasources_shaded_2.11-${version}.jar --driver-class-path emr-datasources_shaded_2.11-${version}.jar
    ```
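Once the shell is running, a streaming job can be declared entirely in SQL. The following is a minimal sketch that assumes EMR's streaming SQL extensions (`CREATE SCAN ... USING STREAM` and `CREATE STREAM`); all table names, Loghub option values, the sink table, and the checkpoint path are placeholder assumptions — consult the documentation for each data source for the exact option keys.

```sql
-- Source table over a Loghub logstore; option values are placeholders.
CREATE TABLE loghub_source
USING loghub
OPTIONS (
  sls.project = "my_project",
  sls.store = "my_logstore",
  access.key.id = "xxx",
  access.key.secret = "xxx",
  endpoint = "http://cn-hangzhou.log.aliyuncs.com"
);

-- Declare a streaming scan over the source table.
CREATE SCAN loghub_stream ON loghub_source USING STREAM;

-- Start a streaming job that continuously inserts into a sink table,
-- recording progress at the given checkpoint location.
CREATE STREAM job_etl
OPTIONS (checkpointLocation = "/tmp/job_etl")
INSERT INTO sink_table
SELECT * FROM loghub_stream;
```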
- Workflow
For more information, see Configure a Streaming SQL job.