This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark to access ApsaraDB for MongoDB.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.
A database account is created.
If you use an Alibaba Cloud account, you need to create only a privileged database account. For more information, see Create a database account.
If you use a Resource Access Management (RAM) user, you must create both a privileged database account and a standard database account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
A job resource group is created. For more information, see Create a resource group.
An ApsaraDB for MongoDB instance is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. A database and a collection are created in the ApsaraDB for MongoDB instance, and data is written to the database. For more information, see Quick start.
The CIDR block of the vSwitch of the ApsaraDB for MongoDB instance is added to a whitelist of the instance. For more information, see Configure a whitelist for an ApsaraDB for MongoDB instance.
Note: To view the vSwitch ID, log on to the ApsaraDB for MongoDB console and go to the Basic Information page. To view the CIDR block of the vSwitch, log on to the Virtual Private Cloud (VPC) console.
An Elastic Compute Service (ECS) security group is added to a whitelist of the ApsaraDB for MongoDB instance. The rules of the security group must allow inbound and outbound traffic on the ports of the ApsaraDB for MongoDB instance. For more information, see Configure an ECS security group and Add a security group rule.
Procedure
Download the JAR packages that are required for AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB by using the following links: mongo-spark-connector_2.12-10.1.1.jar, mongodb-driver-sync-4.8.2.jar, bson-4.8.2.jar, bson-record-codec-4.8.2.jar, and mongodb-driver-core-4.8.2.jar.
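The five packages above follow standard Maven coordinates. As a hedged sketch (assuming the artifacts are published to Maven Central under their usual coordinates, which you should verify before relying on the URLs), the download URLs can be derived programmatically:

```python
# Hypothetical helper, not part of this product's tooling: builds standard
# Maven Central download URLs for the five JAR packages listed above.
MAVEN_REPO = "https://repo1.maven.org/maven2"

def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Return the Maven Central URL of a JAR from its Maven coordinates."""
    group_path = group_id.replace(".", "/")
    return (f"{MAVEN_REPO}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}.jar")

# Coordinates of the packages named in this step.
ARTIFACTS = [
    ("org.mongodb.spark", "mongo-spark-connector_2.12", "10.1.1"),
    ("org.mongodb", "mongodb-driver-sync", "4.8.2"),
    ("org.mongodb", "bson", "4.8.2"),
    ("org.mongodb", "bson-record-codec", "4.8.2"),
    ("org.mongodb", "mongodb-driver-core", "4.8.2"),
]

urls = [maven_jar_url(g, a, v) for g, a, v in ARTIFACTS]
```

The same coordinates appear in the pom.xml dependency in the next step, so the connector version stays consistent between the two.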
Add the following dependencies to the pom.xml file:
<dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.12</artifactId>
    <version>10.1.1</version>
</dependency>
Write and package a program to access ApsaraDB for MongoDB. In this example, the generated package is named spark-mongodb.jar. Sample code:

package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkOnMongoDB {
  def main(args: Array[String]): Unit = {
    // Specify the VPC endpoint of the ApsaraDB for MongoDB instance. You can view the
    // VPC endpoint on the Database Connections page of the ApsaraDB for MongoDB console.
    val connectionUri = args(0)
    // Specify the name of the database in the ApsaraDB for MongoDB instance.
    val database = args(1)
    // Specify the name of the collection in the ApsaraDB for MongoDB instance.
    val collection = args(2)

    val spark = SparkSession.builder()
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.read.connection.uri", connectionUri)
      .config("spark.mongodb.write.connection.uri", connectionUri)
      .getOrCreate()

    val df = spark.read
      .format("mongodb")
      .option("database", database)
      .option("collection", collection)
      .load()
    df.show()

    spark.stop()
  }
}
Note: For more information about the settings, see Configuration Options, Write to MongoDB, and Read from MongoDB.
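The first argument to the sample program is the MongoDB connection URI. As a minimal sketch (the helper below is hypothetical, not part of the sample project, and the host and port are placeholder values), the URI can be assembled with the credentials percent-encoded, because characters such as @ and : are reserved in MongoDB URIs:

```python
from urllib.parse import quote_plus

def build_mongodb_uri(username: str, password: str, hosts, database: str) -> str:
    """Assemble a mongodb:// connection URI.

    hosts is a list of (host, port) tuples. The username and password are
    percent-encoded so that reserved characters such as '@' and ':' do not
    break URI parsing.
    """
    host_part = ",".join(f"{h}:{p}" for h, p in hosts)
    return (f"mongodb://{quote_plus(username)}:{quote_plus(password)}"
            f"@{host_part}/{database}")

# Placeholder host and port for illustration only; use the VPC endpoint shown
# on the Database Connections page of the ApsaraDB for MongoDB console.
uri = build_mongodb_uri("root", "p@ss",
                        [("dds-bp1****.mongodb.rds.aliyuncs.com", 3717)],
                        "test_db")
```

The resulting string matches the mongodb://<username>:<password>@<host>:<port>/<database_name> format used in the job configuration later in this topic.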
Upload the JAR packages that are obtained from Steps 1 and 3 to Object Storage Service (OSS). For more information, see Simple upload.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select the region where the cluster resides. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development.
Select the job resource group and a job type for the Spark job. In this example, the batch type is selected.
Enter the following code in the Spark editor.
Important: You can access ApsaraDB for MongoDB from AnalyticDB for MySQL over a VPC or the Internet.
We recommend that you access ApsaraDB for MongoDB from AnalyticDB for MySQL over a VPC.
{
  "args": [
    -- Specify the VPC endpoint of the ApsaraDB for MongoDB instance. You can view the VPC endpoint on the Database Connections page of the ApsaraDB for MongoDB console.
    "mongodb://<username>:<password>@<host1>:<port1>,<host2>:<port2>,...,<hostN>:<portN>/<database_name>",
    -- Specify the name of the database in the ApsaraDB for MongoDB instance.
    "<database_name>",
    -- Specify the name of the collection in the ApsaraDB for MongoDB instance.
    "<collection_name>"
  ],
  "file": "oss://<bucket_name>/spark-mongodb.jar",
  "jars": [
    "oss://<bucket_name>/mongo-spark-connector_2.12-10.1.1.jar",
    "oss://<bucket_name>/mongodb-driver-sync-4.8.2.jar",
    "oss://<bucket_name>/bson-4.8.2.jar",
    "oss://<bucket_name>/bson-record-codec-4.8.2.jar",
    "oss://<bucket_name>/mongodb-driver-core-4.8.2.jar"
  ],
  "name": "MongoSparkConnectorIntro",
  "className": "com.aliyun.spark.SparkOnMongoDB",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.executor.resourceSpec": "medium",
    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-bp14pj8h0****",
    "spark.adb.eni.securityGroupId": "sg-bp11m93k021tp****"
  }
}
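The -- comment lines in the job definition above follow the convention used in this topic's Spark editor samples, but they are not valid JSON. If you keep such job definitions in files and process them programmatically, a minimal sketch (assuming every comment occupies a whole line, which holds for the sample above) strips them before parsing:

```python
import json

def parse_spark_job_config(text: str) -> dict:
    """Parse a Spark job definition, ignoring whole-line '--' comments."""
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith("--")]
    return json.loads("\n".join(lines))

# Abbreviated illustration of the job definition shown above.
sample = """
{
  "args": [
    -- Connection URI of the ApsaraDB for MongoDB instance.
    "mongodb://user:pass@host:3717/db",
    "db",
    "coll"
  ],
  "name": "MongoSparkConnectorIntro"
}
"""
config = parse_spark_job_config(sample)
```

This only covers whole-line comments; a comment appended after a JSON value would require a more careful parser.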
The following table describes the parameters.

args: The arguments that are required for the use of the JAR packages. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).
file: The OSS path of the spark-mongodb.jar package.
jars: The OSS paths of the JAR packages that are required for the Spark job.
name: The name of the Spark job.
className: The entry class of the Java or Scala program. The entry class is not required for a Python program.
spark.adb.eni.enabled: Specifies whether to enable an elastic network interface (ENI). When you use Data Lakehouse Edition (V3.0) Spark to access ApsaraDB for MongoDB, you must set this parameter to true.
spark.adb.eni.vswitchId: The vSwitch ID of the ApsaraDB for MongoDB instance. To view the vSwitch ID, log on to the ApsaraDB for MongoDB console and go to the Basic Information page.
spark.adb.eni.securityGroupId: The ID of the ECS security group that is added to a whitelist of the ApsaraDB for MongoDB instance. For more information, see Configure an ECS security group.
conf: The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters.

Click Run Now.
After the state of the Spark application changes to Completed, find the application on the Applications tab and click Log in the Actions column to view the data of the ApsaraDB for MongoDB collection.