
E-MapReduce:Read data from and write data to MongoDB

Last Updated: Oct 29, 2025

MongoDB provides a Spark connector that allows you to connect E-MapReduce (EMR) Serverless Spark to MongoDB. To establish the connection, you only need to add specific configurations when you develop a job. This topic describes how to read data from and write data to MongoDB in EMR Serverless Spark.

Limits

The engine version of Serverless Spark must meet the following requirements:

  • esr-4.x: esr-4.1.0 or later

  • esr-3.x: esr-3.1.0 or later

  • esr-2.x: esr-2.5.0 or later

Procedure

Step 1: Obtain the JAR packages of MongoDB and the Spark connector and upload the packages to OSS

  1. Download the required dependencies from the Maven project based on the versions of Spark and MongoDB. For more information, see Getting Started with the Spark Connector. In this topic, MongoDB 5.0.1 is used. The following JAR packages are downloaded:

    • mongo-spark-connector_2.12-10.4.1.jar

    • mongodb-driver-core-5.0.1.jar

    • mongodb-driver-sync-5.0.1.jar

    • bson-5.0.1.jar

  2. Upload the downloaded JAR packages of MongoDB and the Spark connector to Alibaba Cloud Object Storage Service (OSS). For more information, see Simple upload.
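For reference, the download URLs for the four packages follow the standard Maven Central group/artifact/version layout. The following sketch derives them in Python; the repository URL and coordinate-to-path mapping are standard Maven conventions, not specific to this topic:

```python
# Maven Central layout: <repo>/<group path>/<artifact>/<version>/<artifact>-<version>.jar
REPO = "https://repo1.maven.org/maven2"

# (group, artifact, version) coordinates for the packages used in this topic.
artifacts = [
    ("org.mongodb.spark", "mongo-spark-connector_2.12", "10.4.1"),
    ("org.mongodb", "mongodb-driver-core", "5.0.1"),
    ("org.mongodb", "mongodb-driver-sync", "5.0.1"),
    ("org.mongodb", "bson", "5.0.1"),
]

urls = [
    f"{REPO}/{group.replace('.', '/')}/{artifact}/{version}/{artifact}-{version}.jar"
    for group, artifact, version in artifacts
]
for url in urls:
    print(url)
```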

Step 2: Create a network connection

EMR Serverless Spark can access MongoDB only if the connection between EMR Serverless Spark and MongoDB is established. For more information, see Network connectivity between EMR Serverless Spark and other VPCs.

Important

When you configure a security group rule, set the Port Range parameter based on your business requirements. Valid values: 1 to 65535. Make sure that the range includes the port on which MongoDB listens (27017 by default).

Step 3: Read data from MongoDB in EMR Serverless Spark

  1. Create a notebook session. For more information, see Manage notebook sessions.

    On the Create Notebook Session page, set the Engine Version parameter to the version that meets the requirements in the Limits section, select the created network connection from the Network Connection drop-down list, and then add the following code to the Spark Configuration section to load the Spark connector:

    spark.mongodb.write.connection.uri                mongodb://<IP address of MongoDB>:27017
    spark.mongodb.read.connection.uri                 mongodb://<IP address of MongoDB>:27017
    spark.emr.serverless.user.defined.jars            oss://<bucketname>/path/to/mongo-spark-connector_2.12-10.4.1.jar,oss://<bucketname>/path/to/mongodb-driver-core-5.0.1.jar,oss://<bucketname>/path/to/mongodb-driver-sync-5.0.1.jar,oss://<bucketname>/path/to/bson-5.0.1.jar

    The following parameters are used in the preceding code. You can configure the parameters based on your business requirements.

      • spark.mongodb.write.connection.uri and spark.mongodb.read.connection.uri: the Uniform Resource Identifier (URI) used by Spark to write data to or read data from MongoDB. In the URI, <IP address of MongoDB> is the IP address of MongoDB and 27017 is the default port of MongoDB. Example: mongodb://192.168.x.x:27017

      • spark.emr.serverless.user.defined.jars: the external dependencies required by Spark, specified as a comma-separated list of OSS paths. Example: oss://<yourBucketname>/spark/mongodb/mongo-spark-connector_2.12-10.4.1.jar
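If you prefer, the configuration values can be assembled programmatically before you paste them into the session. A minimal Python sketch, in which the host IP, bucket name, and path prefix are placeholder assumptions:

```python
# Placeholder values; substitute your MongoDB host and OSS bucket/path.
mongo_host = "192.168.0.10"
uri = f"mongodb://{mongo_host}:27017"

bucket = "my-bucket"
jars = [
    "mongo-spark-connector_2.12-10.4.1.jar",
    "mongodb-driver-core-5.0.1.jar",
    "mongodb-driver-sync-5.0.1.jar",
    "bson-5.0.1.jar",
]
# spark.emr.serverless.user.defined.jars takes a comma-separated list of OSS paths.
user_defined_jars = ",".join(f"oss://{bucket}/spark/mongodb/{j}" for j in jars)

conf = {
    "spark.mongodb.write.connection.uri": uri,
    "spark.mongodb.read.connection.uri": uri,
    "spark.emr.serverless.user.defined.jars": user_defined_jars,
}
for key, value in conf.items():
    print(key, value)
```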

  2. On the Data Development page, create a notebook job. In the upper-right corner of the configuration tab of the job, select the created notebook session.

    For more information, see Manage notebook sessions.

  3. Copy the following code to the Python cell of the created notebook, modify the parameters based on your business requirements, and then click Run.

    # Read the specified collection through the MongoDB Spark connector.
    df = spark.read \
        .format("mongodb") \
        .option("database", "<yourDatabase>") \
        .option("collection", "<yourCollection>") \
        .load()

    # Inspect the inferred schema and preview the returned documents.
    df.printSchema()
    df.show()
    

    The following parameters are used in the preceding code. You can configure the parameters based on your business requirements.

      • <yourDatabase>: the name of the MongoDB database. Example: mongo_table.

      • <yourCollection>: the name of the MongoDB collection. Example: MongoCollection.

    If existing data is returned as expected, the configurations are correct.

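Beyond a full collection scan, the Spark connector also accepts a MongoDB aggregation pipeline so that filtering runs inside MongoDB rather than in Spark. The following sketch only builds the pipeline JSON; the commented spark.read call shows where it would be used. The aggregation.pipeline option name follows the MongoDB Spark connector 10.x configuration, and the age filter is a hypothetical example:

```python
import json

# Hypothetical filter: keep only documents where age > 30.
pipeline = [{"$match": {"age": {"$gt": 30}}}]
pipeline_json = json.dumps(pipeline)
print(pipeline_json)

# In a Spark cell (requires the session configured as in Step 3):
#   df = (spark.read.format("mongodb")
#         .option("database", "<yourDatabase>")
#         .option("collection", "<yourCollection>")
#         .option("aggregation.pipeline", pipeline_json)
#         .load())
```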

Step 4: Write data to MongoDB in EMR Serverless Spark

Copy the following code to the Python cell of the created notebook, modify the parameters based on your business requirements, and then click Run.

from pyspark.sql import Row

# Build sample rows to write to MongoDB.
data = [
    Row(name="Sam", age=25, city="New York"),
    Row(name="Charlie", age=35, city="Chicago")
]

df = spark.createDataFrame(data)
df.show()

# Append the rows to the specified collection as documents.
df.write \
    .format("mongodb") \
    .option("database", "<yourDatabase>") \
    .option("collection", "<yourCollection>") \
    .mode("append") \
    .save()

If the written data is returned as expected, the configurations are correct.
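Before handing rows to Spark, you can validate the record shape in plain Python; spark.createDataFrame also accepts a list of dicts and infers the schema from it. A minimal sketch using the same sample fields as the example above:

```python
# Sample records matching the write example; check their shape
# before calling spark.createDataFrame(records).
records = [
    {"name": "Sam", "age": 25, "city": "New York"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
]

required = {"name", "age", "city"}
for r in records:
    missing = required - r.keys()
    if missing:
        raise ValueError(f"record {r} is missing fields: {missing}")
print(f"{len(records)} records validated")
```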
