With the rapid development of big data, stream processing has become essential for real-time data analysis. E-MapReduce (EMR) Serverless Spark provides a powerful and scalable platform that simplifies real-time data processing and eliminates the need to manage servers, which improves efficiency. This article describes how to use EMR Serverless Spark to submit a PySpark streaming job and demonstrates the usability and maintainability of EMR Serverless Spark in stream processing.
A workspace is created. For more information, see Create a workspace.
1. On the EMR on ECS page, create a Dataflow cluster that contains the Kafka service. For more information, see Create a cluster.
2. Log on to the master node of the Dataflow cluster. For more information, see Log on to a cluster.
3. Run the following command to switch to the directory:
cd /var/log/emr/taihao_exporter
4. Run the following command to create a topic:
# Create a topic named taihaometrics with 10 partitions and a replication factor of 2.
kafka-topics.sh --partitions 10 --replication-factor 2 --bootstrap-server core-1-1:9092 --topic taihaometrics --create
5. Run the following command to send messages:
# Use the kafka-console-producer CLI to send messages to the taihaometrics topic.
tail -f metrics.log | kafka-console-producer.sh --broker-list core-1-1:9092 --topic taihaometrics
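The preceding steps do not include a verification step. If you want to confirm that messages are arriving in the topic before you submit the Spark job, you can run a command similar to the following on the Dataflow cluster; this check is optional and not part of the original procedure:
# Optional sanity check: consume a few messages from the taihaometrics topic.
kafka-console-consumer.sh --bootstrap-server core-1-1:9092 --topic taihaometrics --from-beginning --max-messages 5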
1. Go to the Connections page.
2. On the Connections page, click Create Connection.
3. In the Create Connection dialog box, configure the parameters described in the following table and click OK.
| Parameter | Description |
| --- | --- |
| Name | Enter the connection name. Example: connection_to_emr_kafka. |
| VPC | Select the same virtual private cloud (VPC) as the Dataflow cluster. If no VPC is available, click Create a VPC to create one in the VPC console. For more information, see Create and manage a VPC. |
| vSwitches | Select a vSwitch that is deployed in the same VPC as the Dataflow cluster. If no vSwitch is available in the current zone, click vSwitch to create one in the VPC console. For more information, see Create and manage a vSwitch. |
If Running is displayed in the Status column of the connection, the connection is created.
1. Obtain the CIDR block of the vSwitch to which a cluster node is connected. On the Nodes tab, click the name of a node group to view the associated vSwitch. Then, log on to the VPC console and obtain the CIDR block of the vSwitch on the VSwitch page.
2. Configure security group rules.
| Parameter | Description |
| --- | --- |
| Port Range | Enter the port number 9092. |
| Authorization Object | Enter the vSwitch CIDR block obtained in the previous step. Note: To prevent attacks from external users, do not set Authorization Object to 0.0.0.0/0. |
Upload all JAR packages in kafka.zip to Object Storage Service (OSS). For more information, see Simple upload.
1. In the left-side navigation pane of the EMR Serverless Spark page, click Artifacts.
2. On the Artifacts page, click Upload.
3. In the Upload Artifact dialog box, click the dotted-line area to select the pyspark_ss_demo.py file, or drag the file to that area.
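The contents of pyspark_ss_demo.py are not shown in this article. The following is only an illustrative sketch of what such a script might look like, assuming that it receives the Kafka broker IP address as a command-line argument, subscribes to the taihaometrics topic, and writes the received messages to the console; the actual script that you upload may differ.
# pyspark_ss_demo.py (illustrative sketch only, not the original script)
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The private IP address of the core-1-1 node is assumed to be passed in
    # through the Execution Parameters of the job.
    kafka_host = sys.argv[1]

    spark = SparkSession.builder.appName("pyspark_ss_demo").getOrCreate()

    # Subscribe to the taihaometrics topic as a streaming DataFrame.
    messages = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", kafka_host + ":9092")
        .option("subscribe", "taihaometrics")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

    # Write the incoming messages to the console so that the output appears
    # in the job's log files.
    query = (
        messages.writeStream.outputMode("append")
        .format("console")
        .option("truncate", "false")
        .start()
    )
    query.awaitTermination()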
1. In the left-side navigation pane of the EMR Serverless Spark page, click Drafts.
2. On the Drafts tab, click Create.
3. Specify a name in the Name field, set Type to Application(Batch) > PySpark, and then click OK.
4. Configure the parameters described in the following table for the job and click Save. You do not need to configure other parameters.
| Parameter | Description |
| --- | --- |
| Main Python Resources | Select the path of the pyspark_ss_demo.py file that you uploaded on the Artifacts page in the previous step. |
| Engine Version | Select the Spark version. For more information, see Engine versions. |
| Execution Parameters | Enter the private IP address of the core-1-1 node in the Dataflow cluster. You can go to the Nodes tab of the Dataflow cluster and click the + icon to the left of the core node group to view the private IP address of the core-1-1 node. |
| Spark Configuration | Enter the Spark configurations. An example is provided after this table. Note: spark.jars specifies the paths of the external JAR packages to load when the Spark job runs. In this example, the OSS paths of the packages that you uploaded in Step 4 are used. Replace the value of spark.jars with the actual paths. |
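The Spark configuration example referenced in the table is not reproduced in this article. The following is a hypothetical example; the bucket name, directory, and JAR file names are placeholders, so replace them with the actual OSS paths of the JAR packages that you extracted from kafka.zip and uploaded:
# Hypothetical example. Replace the placeholder paths with the actual OSS paths
# of the JAR packages that you uploaded.
spark.jars oss://<yourBucketName>/kafka-lib/kafka-clients-<version>.jar,oss://<yourBucketName>/kafka-lib/spark-sql-kafka-0-10_2.12-<version>.jar,oss://<yourBucketName>/kafka-lib/commons-pool2-<version>.jar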
5. Click Publish.
6. In the Publish dialog box, click OK.
7. Start the streaming job.
In the log file, you can view information about how the application runs and the returned results.