MongoDB provides Spark connectors that allow you to connect E-MapReduce (EMR) Serverless Spark to MongoDB. To connect EMR Serverless Spark to MongoDB, you only need to add specific configurations when you develop a job. This topic describes how to read data from and write data to MongoDB in EMR Serverless Spark.
Prerequisites
An EMR Serverless Spark workspace is created. For more information, see Create a workspace.
A MongoDB database is created. For more information, see Getting Started with MongoDB.
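If your MongoDB database does not contain data yet, you can insert sample documents by using the pymongo library. The following is a minimal sketch; the database name test_db and the collection name test_collection are hypothetical names used only for illustration.
from pymongo import MongoClient

# Connect to MongoDB. Replace the IP address with that of your MongoDB instance.
client = MongoClient("mongodb://<IP address of MongoDB>:27017")

# test_db and test_collection are hypothetical names used only for illustration.
collection = client["test_db"]["test_collection"]
collection.insert_many([
    {"name": "Sam", "age": 25, "city": "New York"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
])
print(collection.count_documents({}))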
Limits
The engine version of Serverless Spark must meet the following requirements:
esr-4.x: esr-4.1.0 or later
esr-3.x: esr-3.1.0 or later
esr-2.x: esr-2.5.0 or later
Procedure
Step 1: Obtain the JAR packages of MongoDB and the Spark connector and upload the packages to OSS
Download the required dependencies from the Maven repository based on the versions of Spark and MongoDB. For more information, see Getting Started with the Spark Connector. In this topic, the MongoDB Spark Connector 10.4.1 and the MongoDB driver 5.0.1 are used. The following JAR packages are downloaded:
mongo-spark-connector_2.12-10.4.1.jar
mongodb-driver-core-5.0.1.jar
mongodb-driver-sync-5.0.1.jar
bson-5.0.1.jar
Upload the downloaded JAR packages of MongoDB and the Spark connector to Alibaba Cloud Object Storage Service (OSS). For more information, see Simple upload.
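If you prefer to upload the packages programmatically instead of in the OSS console, you can use the OSS SDK for Python (oss2). The following is a minimal sketch; the AccessKey pair, the endpoint, and the object path are placeholders that you must replace with your own values.
import oss2

# Replace the placeholders with your AccessKey pair, OSS endpoint, and bucket name.
auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
bucket = oss2.Bucket(auth, "<yourEndpoint>", "<bucketname>")

# Upload each downloaded JAR package to the target OSS path.
for jar in [
    "mongo-spark-connector_2.12-10.4.1.jar",
    "mongodb-driver-core-5.0.1.jar",
    "mongodb-driver-sync-5.0.1.jar",
    "bson-5.0.1.jar",
]:
    bucket.put_object_from_file("path/to/" + jar, jar)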
Step 2: Create a network connection
EMR Serverless Spark can access MongoDB only if the connection between EMR Serverless Spark and MongoDB is established. For more information, see Network connectivity between EMR Serverless Spark and other VPCs.
When you configure a security group rule, set the Port Range parameter based on your business requirements. Valid values: 1 to 65535. The default port of MongoDB is 27017.
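To quickly check that the network connection and the security group rule take effect, you can test TCP connectivity to the MongoDB port from an environment that can reach the VPC. The following is a minimal sketch based on the Python standard library.
import socket

# Attempt a TCP connection to the MongoDB port. Replace the IP address
# with that of your MongoDB instance. 27017 is the default MongoDB port.
with socket.create_connection(("<IP address of MongoDB>", 27017), timeout=5):
    print("MongoDB port is reachable")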
Step 3: Read data from MongoDB in EMR Serverless Spark
Create a notebook session. For more information, see Manage notebook sessions.
On the Create Notebook Session page, set the Engine Version parameter to the version that meets the requirements in the Limits section, select the created network connection from the Network Connection drop-down list, and then add the following code to the Spark Configuration section to load the Spark connector:
spark.mongodb.write.connection.uri mongodb://<IP address of MongoDB>:27017
spark.mongodb.read.connection.uri mongodb://<IP address of MongoDB>:27017
spark.emr.serverless.user.defined.jars oss://<bucketname>/path/to/mongo-spark-connector_2.12-10.4.1.jar,oss://<bucketname>/path/to/mongodb-driver-core-5.0.1.jar,oss://<bucketname>/path/to/mongodb-driver-sync-5.0.1.jar,oss://<bucketname>/path/to/bson-5.0.1.jar
The following table describes the parameters in the preceding code. You can configure the parameters based on your business requirements.
Parameter: spark.mongodb.write.connection.uri and spark.mongodb.read.connection.uri
Description: The Uniform Resource Identifier (URI) used by Spark to read data from and write data to MongoDB. <IP address of MongoDB>: the IP address of MongoDB. 27017: the default port of MongoDB.
Example: mongodb://192.168.x.x:27017

Parameter: spark.emr.serverless.user.defined.jars
Description: The external dependencies required by Spark.
Example: oss://<yourBucketname>/spark/mongodb/mongo-spark-connector_2.12-10.4.1.jar

On the Data Development page, create a notebook job. In the upper-right corner of the configuration tab of the job, select the created notebook session. For more information, see Manage notebook sessions.
Copy the following code to the Python cell of the created notebook, modify the parameters based on your business requirements, and then click Run.
df = spark.read \
    .format("mongodb") \
    .option("database", "<yourDatabase>") \
    .option("collection", "<yourCollection>") \
    .load()

df.printSchema()
df.show()
The following table describes the parameters in the preceding code. You can configure the parameters based on your business requirements.
Parameter: <yourDatabase>
Description: The name of the MongoDB database. Example: mongo_table.

Parameter: <yourCollection>
Description: The name of the MongoDB collection. Example: MongoCollection.
If existing data is returned as expected, the configurations are correct.
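After the data is read, you can apply standard Spark transformations to the returned DataFrame. The following is a minimal sketch that filters the data; the age field is an assumption based on the sample data that is written in Step 4.
# Filter the DataFrame that is read from MongoDB.
# The age field is an assumption based on the sample data in Step 4.
df.filter(df.age > 30).show()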

Step 4: Write data to MongoDB in EMR Serverless Spark
Copy the following code to the Python cell of the created notebook, modify the parameters based on your business requirements, and then click Run.
from pyspark.sql import Row

# Sample data to write to MongoDB.
data = [
    Row(name="Sam", age=25, city="New York"),
    Row(name="Charlie", age=35, city="Chicago")
]
df = spark.createDataFrame(data)
df.show()

# Append the DataFrame to the specified MongoDB collection.
df.write \
    .format("mongodb") \
    .option("database", "<yourDatabase>") \
    .option("collection", "<yourCollection>") \
    .mode("append") \
    .save()
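To verify the result, you can read the data back from the same collection, in the same way as in Step 3. A minimal sketch:
# Read the written data back from MongoDB to verify the write operation.
spark.read \
    .format("mongodb") \
    .option("database", "<yourDatabase>") \
    .option("collection", "<yourCollection>") \
    .load() \
    .show()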
If the written data is returned as expected, the configurations are correct.
