With the rapid development of big data, stream processing has become essential for real-time data analysis. E-MapReduce (EMR) Serverless Spark provides a powerful and scalable platform that simplifies real-time data processing and eliminates the need to manage servers, which improves efficiency. This article describes how to use EMR Serverless Spark to submit a PySpark streaming job and demonstrates the usability and maintainability of EMR Serverless Spark in stream processing.
A workspace is created. For more information, see Create a workspace.
1. On the EMR on ECS page, create a Dataflow cluster that contains the Kafka service. For more information, see Create a cluster.
2. Log on to the master node of the Dataflow cluster. For more information, see Log on to a cluster.
3. Run the following command to switch to the directory:
cd /var/log/emr/taihao_exporter
4. Run the following command to create a topic:
# Create a topic named taihaometrics with 10 partitions and a replication factor of 2.
kafka-topics.sh --partitions 10 --replication-factor 2 --bootstrap-server core-1-1:9092 --topic taihaometrics --create
5. Run the following command to send messages:
# Use the kafka-console-producer CLI to send messages to the taihaometrics topic.
tail -f metrics.log | kafka-console-producer.sh --broker-list core-1-1:9092 --topic taihaometrics
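The preceding steps do not include a verification step. If you want to confirm that messages are arriving in the topic before you submit the Spark job, you can run a command similar to the following on the Dataflow cluster; this check is optional and not part of the original procedure:
# Optional sanity check: consume a few messages from the taihaometrics topic.
kafka-console-consumer.sh --bootstrap-server core-1-1:9092 --topic taihaometrics --from-beginning --max-messages 5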
1. Go to the Connections page.
2. On the Connections page, click Create Connection.
3. In the Create Connection dialog box, configure the parameters described in the following table and click OK.
| Parameter | Description |
| --- | --- |
| Name | Enter the connection name. Example: connection_to_emr_kafka. |
| VPC | Select the same virtual private cloud (VPC) as the Dataflow cluster. If no VPC is available, click Create a VPC to create one in the VPC console. For more information, see Create and manage a VPC. |
| vSwitches | Select a vSwitch that is deployed in the same VPC as the Dataflow cluster. If no vSwitch is available in the current zone, click vSwitch to create one in the VPC console. For more information, see Create and manage a vSwitch. |
If Running is displayed in the Status column of the connection, the connection is created.
1. Obtain the CIDR block of the vSwitch to which a cluster node is connected. On the Nodes tab, click the name of a node group to view the associated vSwitch. Then, log on to the VPC console and obtain the CIDR block of the vSwitch on the VSwitch page.
2. Configure security group rules.
| Parameter | Description |
| --- | --- |
| Port Range | Enter the port number 9092. |
| Authorization Object | Enter the vSwitch CIDR block obtained in the previous step. Note: To prevent attacks from external users, do not set Authorization Object to 0.0.0.0/0. |
Upload all JAR packages in kafka.zip to Object Storage Service (OSS). For more information, see Simple upload.
1. In the left-side navigation pane of the EMR Serverless Spark page, click Artifacts.
2. On the Artifacts page, click Upload.
3. In the Upload Artifact dialog box, click the dotted-line area to select the pyspark_ss_demo.py file, or drag the file to that area.
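The contents of pyspark_ss_demo.py are not shown in this article. The following is only an illustrative sketch of what such a script might look like, assuming that it receives the Kafka broker IP address as a command-line argument, subscribes to the taihaometrics topic, and writes the received messages to the console; the actual script that you upload may differ.
# pyspark_ss_demo.py (illustrative sketch only, not the original script)
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The private IP address of the core-1-1 node is assumed to be passed in
    # through the Execution Parameters of the job.
    kafka_host = sys.argv[1]

    spark = SparkSession.builder.appName("pyspark_ss_demo").getOrCreate()

    # Subscribe to the taihaometrics topic as a streaming DataFrame.
    messages = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", kafka_host + ":9092")
        .option("subscribe", "taihaometrics")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

    # Write the incoming messages to the console so that the output appears
    # in the job's log files.
    query = (
        messages.writeStream.outputMode("append")
        .format("console")
        .option("truncate", "false")
        .start()
    )
    query.awaitTermination()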
1. In the left-side navigation pane of the EMR Serverless Spark page, click Drafts.
2. On the Drafts tab, click Create.
3. Specify a name in the Name field, set Type to Application(Batch) > PySpark, and then click OK.
4. Configure the parameters described in the following table for the job and click Save. You do not need to configure other parameters.
| Parameter | Description |
| --- | --- |
| Main Python Resources | Select the path of the pyspark_ss_demo.py file that you uploaded on the Artifacts page in the previous step. |
| Engine Version | Select the Spark version. For more information, see Engine versions. |
| Execution Parameters | Enter the private IP address of the core-1-1 node in the Dataflow cluster. You can go to the Nodes tab of the Dataflow cluster and click the + icon to the left of the core node group to view the private IP address of the core-1-1 node. |
| Spark Configuration | Enter the Spark configurations. An example is provided after this table. Note: spark.jars specifies the paths of the external JAR packages to load when the Spark job runs. In this example, the OSS paths of the packages that you uploaded in Step 4 are used. Replace the value of spark.jars with the actual paths. |
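The Spark configuration example referenced in the table is not reproduced in this article. The following is a hypothetical example; the bucket name, directory, and JAR file names are placeholders, so replace them with the actual OSS paths of the JAR packages that you extracted from kafka.zip and uploaded:
# Hypothetical example. Replace the placeholder paths with the actual OSS paths
# of the JAR packages that you uploaded.
spark.jars oss://<yourBucketName>/kafka-lib/kafka-clients-<version>.jar,oss://<yourBucketName>/kafka-lib/spark-sql-kafka-0-10_2.12-<version>.jar,oss://<yourBucketName>/kafka-lib/commons-pool2-<version>.jar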
5. Click Publish.
6. In the Publish dialog box, click OK.
7. Start the streaming job.
In the log file, you can view information about how the application runs and the returned results.