AnalyticDB for MySQL: Access ApsaraDB for MongoDB

Last Updated: Jun 20, 2023

This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark to access ApsaraDB for MongoDB.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.

  • A database account is created.

  • A job resource group is created. For more information, see Create a resource group.

  • An ApsaraDB for MongoDB instance is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. A database and a collection are created in the ApsaraDB for MongoDB instance, and data is written to the database. For more information, see Quick start.

  • The CIDR block of the vSwitch to which the ApsaraDB for MongoDB instance belongs is added to a whitelist of the instance. For more information, see Configure a whitelist for an ApsaraDB for MongoDB instance.

    Note

    To view the vSwitch ID, log on to the ApsaraDB for MongoDB console and go to the Basic Information page. To view the CIDR block of the vSwitch, log on to the Virtual Private Cloud (VPC) console.

  • An Elastic Compute Service (ECS) security group is added to the ApsaraDB for MongoDB instance as a whitelist. The rules of the security group must allow inbound and outbound traffic on the ports of the ApsaraDB for MongoDB instance. For more information, see Configure an ECS security group and Add a security group rule.

Procedure

  1. Download the JAR packages that are required for AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB by using the following links: mongo-spark-connector_2.12-10.1.1.jar, mongodb-driver-sync-4.8.2.jar, bson-4.8.2.jar, bson-record-codec-4.8.2.jar, and mongodb-driver-core-4.8.2.jar.

  2. Add the following dependency to the pom.xml file:

      <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.12</artifactId>
        <version>10.1.1</version>
      </dependency>
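
    The connector dependency alone is not sufficient to compile the sample program in the next step, which also imports Spark classes. If you build with Maven, you can additionally declare Spark SQL at the provided scope. The following is a minimal sketch; the version shown is an assumption and should be replaced with the Spark version that matches your cluster:

      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <!-- Assumed version: replace it with the Spark version of your cluster. -->
        <version>3.2.0</version>
        <scope>provided</scope>
      </dependency>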
  3. Write and package a program to access ApsaraDB for MongoDB. In this example, the generated package is named spark-mongodb.jar. Sample code:

    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkOnMongoDB {
      def main(args: Array[String]): Unit = {
        // Specify the VPC endpoint of the ApsaraDB for MongoDB instance. You can view the VPC endpoint on the Database Connections page of the ApsaraDB for MongoDB console. 
        val connectionUri = args(0)
        // Specify the name of the database in the ApsaraDB for MongoDB instance. 
        val database = args(1)
        // Specify the name of the collection in the ApsaraDB for MongoDB instance. 
        val collection = args(2)
        
        val spark = SparkSession.builder()
          .appName("MongoSparkConnectorIntro")
          .config("spark.mongodb.read.connection.uri", connectionUri)
          .config("spark.mongodb.write.connection.uri", connectionUri)
          .getOrCreate()
    
        val df = spark.read.format("mongodb").option("database", database).option("collection", collection).load()
        
        df.show()
        
        spark.stop()
      }
    }
    Note

    For more information about the settings, see Configuration Options, Write to MongoDB, and Read from MongoDB.
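
    The same connector can also write data back to ApsaraDB for MongoDB. The following snippet is a minimal sketch that reuses the SparkSession, the connection URI settings, and the database variable from the sample above; the sample rows and the target collection name are hypothetical:

    // A minimal write sketch. It reuses `spark` and `database` from the sample
    // above; the rows and the "people_copy" collection name are hypothetical.
    import spark.implicits._
    
    val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
    
    people.write
      .format("mongodb")
      .mode("append")
      .option("database", database)
      .option("collection", "people_copy")
      .save()

    Because the write connection URI is already configured on the SparkSession, only the database and collection options need to be specified at write time.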

  4. Upload the JAR packages that are obtained from Steps 1 and 3 to Object Storage Service (OSS). For more information, see Simple upload.

  5. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select the region where the cluster resides. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  6. Select the job resource group and a job type for the Spark job. In this example, the batch type is selected.

  7. Enter the following code in the Spark editor.

    Important
    • You can access ApsaraDB for MongoDB from AnalyticDB for MySQL over a VPC or the Internet.

    • We recommend that you access ApsaraDB for MongoDB from AnalyticDB for MySQL over a VPC.

    {
      "args": [
        -- Specify the VPC endpoint of the ApsaraDB for MongoDB instance. You can view the VPC endpoint on the Database Connections page of the ApsaraDB for MongoDB console.
        "mongodb://<username>:<password>@<host1>:<port1>,<host2>:<port2>,...,<hostN>:<portN>/<database_name>",
        -- Specify the name of the database in the ApsaraDB for MongoDB instance.
        "<database_name>",
        -- Specify the name of the collection in the ApsaraDB for MongoDB instance.
        "<collection_name>"
      ],
      "file": "oss://<bucket_name>/spark-mongodb.jar",
      "jars": [
        "oss://<bucket_name>/mongo-spark-connector_2.12-10.1.1.jar",
        "oss://<bucket_name>/mongodb-driver-sync-4.8.2.jar",
        "oss://<bucket_name>/bson-4.8.2.jar",
        "oss://<bucket_name>/bson-record-codec-4.8.2.jar",
        "oss://<bucket_name>/mongodb-driver-core-4.8.2.jar"
      ],
      "name": "MongoSparkConnectorIntro",
      "className": "com.aliyun.spark.SparkOnMongoDB",
      "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 2,
        "spark.executor.resourceSpec": "medium",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp14pj8h0****",
        "spark.adb.eni.securityGroupId": "sg-bp11m93k021tp****"
      }
    }

    The following list describes the parameters.

    • args: The arguments that are passed to the main class of the program. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    • file: The OSS path of spark-mongodb.jar.

    • jars: The OSS paths of the JAR packages that are required for the Spark job.

    • name: The name of the Spark job.

    • className: The entry class of the Java or Scala program. The entry class is not required for a Python program.

    • conf: The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters. The following conf parameters are used in this example:

      • spark.adb.eni.enabled: Specifies whether to enable the elastic network interface (ENI) feature. When you use Data Lakehouse Edition (V3.0) Spark to access ApsaraDB for MongoDB, you must enable ENI.

      • spark.adb.eni.vswitchId: The vSwitch ID of the ApsaraDB for MongoDB instance. To view the vSwitch ID, log on to the ApsaraDB for MongoDB console and go to the Basic Information page.

      • spark.adb.eni.securityGroupId: The ID of the ECS security group that is added to the ApsaraDB for MongoDB instance as a whitelist. For more information, see Configure an ECS security group.

  8. Click Run Now.

  9. After the state of the Spark application changes to Completed, click Log in the Actions column of the application on the Applications tab to view the data of the ApsaraDB for MongoDB collection.