In the age of big data, stream processing is vital for real-time data analytics. EMR Serverless Spark is a powerful, scalable platform that simplifies real-time data processing and improves efficiency by removing the need to manage servers. This topic describes how to submit a PySpark streaming job in EMR Serverless Spark and demonstrates how the platform simplifies the development and maintenance of stream processing applications.
Prerequisites
A workspace has been created. For more information, see Create a workspace.
Procedure
Step 1: Create a real-time Dataflow cluster and generate messages
On the EMR on ECS page, create a Dataflow cluster that includes the Kafka service. For more information, see Create a cluster.
Log on to the master node of the EMR cluster. For more information, see Log on to a cluster.
Run the following command to switch the directory.
cd /var/log/emr/taihao_exporter
Run the following command to create a topic.
# Create a topic named taihaometrics with 10 partitions and a replication factor of 2.
kafka-topics.sh --partitions 10 --replication-factor 2 --bootstrap-server core-1-1:9092 --topic taihaometrics --create
Run the following command to send messages.
# Use kafka-console-producer to send messages to the taihaometrics topic.
tail -f metrics.log | kafka-console-producer.sh --broker-list core-1-1:9092 --topic taihaometrics
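Optionally, verify that messages are flowing into the topic before you move on. The following is a minimal sketch, assuming the open source kafka-python package is installed on a host that can reach core-1-1:9092; the kafka-console-consumer.sh script that ships with Kafka works equally well.
from kafka import KafkaConsumer

# Connect to the broker used above and read from the beginning of the topic.
consumer = KafkaConsumer(
    "taihaometrics",
    bootstrap_servers="core-1-1:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating after 10 seconds of silence
)

for record in consumer:
    # Each record value is raw bytes; the producer sends lines of metrics.log.
    print(record.value.decode("utf-8"))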
Step 2: Create a network connection
Go to the Network Connections page.
In the navigation pane on the left of the EMR console, choose EMR Serverless > Spark.
On the Spark page, click the name of the target workspace.
On the EMR Serverless Spark page, click Network Connection in the navigation pane on the left.
On the Network Connection page, click Create Network Connection.
In the Create Network Connection dialog box, configure the following parameters and click OK.
Name: Enter a name for the new connection. For example, connection_to_emr_kafka.
VPC: Select the same VPC as the EMR cluster. If no VPC is available, click Create VPC to go to the VPC console and create one. For more information, see Create and manage a VPC.
vSwitch: Select a vSwitch that is in the same VPC as the EMR cluster. If no vSwitch is available in the current zone, go to the VPC console and create one. For more information, see Create and manage a vSwitch.
When the Status changes to Succeeded, the network connection is created.
Step 3: Add a security group rule for the EMR cluster
Obtain the CIDR block of the vSwitch for the cluster node.
On the Nodes page, click a node group name to view the associated vSwitch information. Then, log on to the VPC console and obtain the CIDR block of the vSwitch on the vSwitch page.

Add a security group rule.
On the EMR on ECS page, click the ID of the target cluster.
On the Basic Information page, click the link next to Cluster Security Group.
On the Security Group Details page, in the Rules section, click Add Rule. Configure the following parameters and click OK.
Source: Enter the CIDR block of the vSwitch that you obtained in the previous step.
Important: To prevent security risks from external attacks, do not set the authorization object to 0.0.0.0/0.
Destination (Current Instance): Enter port 9092, which is the port that the Kafka broker listens on.
Step 4: Upload JAR packages to OSS
Decompress the kafka.zip file and upload all the JAR packages to Object Storage Service (OSS). For more information, see Simple upload.
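If you prefer to script the upload, the following is a minimal sketch that uses oss2, the Alibaba Cloud OSS SDK for Python. The credentials, region endpoint, bucket name, and local directory are placeholders that you must replace with your own values.
import glob
import os

import oss2  # pip install oss2

# Placeholders: replace with your own credentials, region endpoint, and bucket.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")

# Upload every JAR decompressed from kafka.zip under an example kafka-lib/ prefix.
for jar in glob.glob("kafka/*.jar"):
    key = "kafka-lib/" + os.path.basename(jar)
    bucket.put_object_from_file(key, jar)
    print("Uploaded oss://<your-bucket>/" + key)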
Step 5: Upload the resource file
On the EMR Serverless Spark page, click Artifacts in the navigation pane on the left.
On the Artifacts page, click Upload File.
In the Upload File dialog box, click the upload area and select the pyspark_ss_demo.py file.
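This topic does not list the contents of pyspark_ss_demo.py. For orientation, the following is a minimal sketch of what such a Structured Streaming script could look like, assuming that it receives the internal IP address of the Kafka broker as its first command-line argument (see Execution Parameters in Step 6) and reads the taihaometrics topic created in Step 1. The actual demo file may differ.
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The broker IP address (core-1-1) is passed in through Execution Parameters.
    broker = sys.argv[1] + ":9092"

    spark = SparkSession.builder.appName("pyspark_ss_demo").getOrCreate()

    # Read the taihaometrics topic as a streaming DataFrame.
    df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", broker)
        .option("subscribe", "taihaometrics")
        .option("startingOffsets", "latest")
        .load()
    )

    # Decode the raw Kafka payload and print each micro-batch to the driver log.
    query = (
        df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream.outputMode("append")
        .format("console")
        .start()
    )
    query.awaitTermination()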
Step 6: Create and start a streaming job
On the EMR Serverless Spark page, click Development in the navigation pane on the left.
On the Development tab, click the Create icon.
In the dialog box that appears, enter a name, select the streaming PySpark job type, and then click OK.
In the new development tab, configure the following parameters, leave the other parameters at their default values, and then click Save.
Main Python Resource: Select the pyspark_ss_demo.py file that you uploaded on the Artifacts page in the previous step.
Engine Version: The Spark version. For more information, see Engine versions.
Execution Parameters: Enter the internal IP address of the core-1-1 node of the EMR cluster, which is the Kafka broker that the script connects to. You can view the address under the Core node group on the Nodes page of the EMR cluster.
Spark Configurations: The Spark configuration information. The following code provides an example.
spark.jars oss://path/to/commons-pool2-2.11.1.jar,oss://path/to/kafka-clients-2.8.1.jar,oss://path/to/spark-sql-kafka-0-10_2.12-3.3.1.jar,oss://path/to/spark-token-provider-kafka-0-10_2.12-3.3.1.jar
spark.emr.serverless.network.service.name connection_to_emr_kafka
Note:
spark.jars: Specifies the paths of the external JAR packages to load at runtime. Replace the paths with the actual OSS paths of all the JAR packages that you uploaded in Step 4.
spark.emr.serverless.network.service.name: Specifies the name of the network connection. Replace the value with the name of the network connection that you created in Step 2.
Click Publish.
In the Publish dialog box, click OK.
Start the streaming job.
Click Go To O&M.
Click Start.
Step 7: View logs
Click the Log Exploration tab.
On the Log Exploration tab, you can view information about the application's execution and the returned results.

References
For more information about the PySpark development process, see PySpark development quick start.