This topic describes how to run Spark Streaming jobs to process Kafka data in the
E-MapReduce (EMR) console.
Prerequisites
- EMR is activated.
- The Alibaba Cloud account is authorized. For more information, see Authorize roles.
- PuTTY and SSH Secure File Transfer Client are installed on your on-premises machine.
Step 1: Create a Hadoop cluster and a Kafka cluster
Create a Hadoop cluster and a Kafka cluster that belong to the same security group.
For more information, see Create a cluster.
- Log on to the Alibaba Cloud EMR console.
- Create a Hadoop cluster.
- Create a Kafka cluster.
Step 2: Obtain the required JAR package and upload it to the Hadoop cluster
- Download the examples-1.2.0-shaded-2.jar.zip package and decompress it to obtain the JAR file.
- Use SSH Secure File Transfer Client to upload the JAR file to the /home/hadoop path of the master node in the Hadoop cluster.
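If you prefer the command line, an scp command can be used instead of SSH Secure File Transfer Client. The following is a sketch only: <master-node-ip> is a placeholder for the IP address of the master node, and the root user is an assumption that may differ in your environment.
# Upload the JAR file to /home/hadoop on the master node (placeholders assumed)
scp examples-1.2.0-shaded-2.jar root@<master-node-ip>:/home/hadoop/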
Step 3: Create a topic on the Kafka cluster
In this example, a topic named test is created. The topic has 10 partitions and 2
replicas.
- Log on to the master node of the Kafka cluster. For more information, see Connect to the master node of an EMR cluster in SSH mode.
- Run the following command to create a topic:
/usr/lib/kafka-current/bin/kafka-topics.sh --partitions 10 --replication-factor 2 --zookeeper emr-header-1:2181/kafka-1.0.0 --topic test --create
Note: After you create the topic, keep the logon window open for later use.
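Optionally, you can confirm that the topic was created as expected. The following command reuses the ZooKeeper address from the creation command and prints the partition and replica assignment of the topic:
# Describe the test topic to verify its 10 partitions and 2 replicas
/usr/lib/kafka-current/bin/kafka-topics.sh --describe --zookeeper emr-header-1:2181/kafka-1.0.0 --topic test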
Step 4: Run a Spark Streaming job
In this example, a WordCount job is run for streaming data.
- Log on to the master node of the Hadoop cluster. For more information, see Connect to the master node of an EMR cluster in SSH mode.
- Run the following command to submit a WordCount job for streaming data:
spark-submit --class com.aliyun.emr.example.spark.streaming.KafkaSample /home/hadoop/examples-1.2.0-shaded-2.jar 192.168.xxx.xxx:9092 test 5
The following table describes the parameters.
Parameter | Description
192.168.xxx.xxx | The internal IP address of a Kafka broker component in the Kafka cluster. For more information, see Figure 1.
test | The name of the topic.
5 | The interval, in seconds, at which the Spark Streaming job processes data.
Figure 1. List of components in the Kafka cluster
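If you cannot check the component list in the console, the broker addresses can also be read from ZooKeeper. The following sketch uses the zookeeper-shell.sh script that ships with Kafka and assumes the same ZooKeeper address and chroot as in Step 3; broker ID 0 is an example and may differ in your cluster.
# List the IDs of the brokers registered in ZooKeeper
/usr/lib/kafka-current/bin/zookeeper-shell.sh emr-header-1:2181/kafka-1.0.0 ls /brokers/ids
# Show the host and port of the broker whose ID is 0
/usr/lib/kafka-current/bin/zookeeper-shell.sh emr-header-1:2181/kafka-1.0.0 get /brokers/ids/0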
Step 5: Use Kafka to publish messages
- In the command-line interface (CLI) of the Kafka cluster, run the following command
to start the Kafka producer:
/usr/lib/kafka-current/bin/kafka-console-producer.sh --topic test --broker-list emr-worker-1:9092
- Enter text in the logon window of the Kafka cluster. Word count statistics for the text are displayed in the logon window of the Hadoop cluster in real time.
  For example, if you enter a few lines of text in the logon window of the Kafka cluster, the counts of the words that you entered appear in the logon window of the Hadoop cluster within the configured interval. You can also verify that the messages reach the topic by using a console consumer, as shown below.
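To verify the published messages independently of the Spark Streaming job, you can start a Kafka console consumer in a separate window. This step is optional and assumes the same broker address as the producer command:
# Read all messages in the test topic from the beginning
/usr/lib/kafka-current/bin/kafka-console-consumer.sh --topic test --bootstrap-server emr-worker-1:9092 --from-beginning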

Step 6: View the status of the Spark Streaming job
- Click the Cluster Management tab in the EMR console.
- On the Cluster Management page, find the Hadoop cluster you created and click Details in the Actions column.
- In the left-side navigation pane of the Cluster Overview page, click Connect Strings.
- Click the Spark History Server UI link.
- On the History Server page, click the App ID of the Spark Streaming job that you want to view.
You can view the status of the Spark Streaming job.
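Alternatively, you can check the job from the command line. The following commands, run on the master node of the Hadoop cluster, list the running YARN applications and stop the streaming job when you no longer need it; <application-id> is a placeholder for the ID shown in the list.
# List running applications; the Spark Streaming job appears with its application ID
yarn application -list
# Stop the Spark Streaming job when you are done
yarn application -kill <application-id>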
