Use Spark to read Elasticsearch data

This topic shows you how to use the Spark engine in AnalyticDB for MySQL to read data from an Alibaba Cloud Elasticsearch data source over an elastic network interface (ENI).

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you only need to create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
The AnalyticDB for MySQL cluster and the Alibaba Cloud Elasticsearch cluster are in the same region. For more information, see Create an Alibaba Cloud Elasticsearch cluster.
The IP address of the AnalyticDB for MySQL cluster has been added to the whitelist of the Alibaba Cloud Elasticsearch cluster. For more information, see Configure a public or private IP address whitelist for an Elasticsearch cluster.

Preparations

In the Elasticsearch console, go to the Basic Information page and obtain the vSwitch ID.
In the Elastic Compute Service (ECS) console, go to the Security Group page and obtain the security group ID of the Alibaba Cloud Elasticsearch cluster. If no security group is added, see Create a security group.

Connect to Alibaba Cloud Elasticsearch with Scala

Download the JAR package that matches the version of your Alibaba Cloud Elasticsearch cluster. For the download link, see Elasticsearch Spark. This example uses elasticsearch-spark-30_2.12-7.17.9.jar.

Add the required dependencies to the dependencies section of your pom.xml file.

<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30 -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_2.12</artifactId>
    <version>7.17.9</version>
    <scope>provided</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
    <scope>provided</scope>
</dependency>

Important

Make sure that the elasticsearch-spark-30_2.12 version in the pom.xml file matches the version of your Alibaba Cloud Elasticsearch cluster, and that the spark-core_2.12 version matches the Spark version of AnalyticDB for MySQL.
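
The version-matching rule above can be sketched as a small helper that assembles the connector JAR file name from the versions involved. This is only an illustration: the function name `connector_jar_name` and its parameters are hypothetical, and the naming pattern is inferred from the artifactId and version in the pom.xml snippet of this topic.

```python
def connector_jar_name(es_version, scala_version="2.12", spark_line="30"):
    """Build the elasticsearch-spark JAR file name for the given versions.

    es_version    -- version of the Elasticsearch cluster (assumed input)
    scala_version -- Scala binary version that the Spark runtime is built for
    spark_line    -- Spark generation suffix used in the artifact name
    """
    return f"elasticsearch-spark-{spark_line}_{scala_version}-{es_version}.jar"


# The versions used in this topic.
print(connector_jar_name("7.17.9"))
```

If your cluster runs a different Elasticsearch version, pass that version instead and check that a JAR with the resulting name exists on the download page before you use it.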

Write, compile, and package the following sample code. For this example, name the output JAR file spark-example.jar.

package org.example

import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // Create a DataFrame.
    val columns = Seq("language","users_count")
    val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
    val writeDF = spark.createDataFrame(data).toDF(columns:_*)

    // Write data.
    writeDF.write.format("es").mode(SaveMode.Overwrite)
    // The private endpoint of the Alibaba Cloud Elasticsearch cluster.
    .option("es.nodes", "es-cn-nwy34drji0003****.elasticsearch.aliyuncs.com")
    // The port number of the private endpoint.
    .option("es.port", "9200")
    // The username for the Alibaba Cloud Elasticsearch cluster. The username must be elastic.
    .option("es.net.http.auth.user", "elastic")
    // The password for the Alibaba Cloud Elasticsearch cluster.
    .option("es.net.http.auth.pass", "password")
    // This must be set to true when connecting to an Alibaba Cloud Elasticsearch cluster. 
    .option("es.nodes.wan.only", "true")
    // This must be set to false when connecting to an Alibaba Cloud Elasticsearch cluster.
    .option("es.nodes.discovery", "false")
    // The resource to write to, in the format <index>/<type>.
    .save("spark/_doc")

    // Read data.
    spark.read.format("es")
    // The private endpoint of the Alibaba Cloud Elasticsearch cluster.
    .option("es.nodes", "es-cn-nwy34drji0003****.elasticsearch.aliyuncs.com")
    // The port number of the private endpoint.
    .option("es.port", "9200")
    // The username for the Alibaba Cloud Elasticsearch cluster. The username must be elastic.
    .option("es.net.http.auth.user", "elastic")
    // The password for the Alibaba Cloud Elasticsearch cluster.
    .option("es.net.http.auth.pass", "password")
    // This must be set to true when connecting to an Alibaba Cloud Elasticsearch cluster. 
    .option("es.nodes.wan.only", "true")
    // This must be set to false when connecting to an Alibaba Cloud Elasticsearch cluster.
    .option("es.nodes.discovery", "false")
    // The data source to read from, in the format of <index>/<type>.
    .load("spark/_doc").show()
  }
}

Upload the JAR file downloaded in Step 1 and the sample program spark-example.jar to an OSS bucket. For more information, see Upload objects.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, click Job Development > Spark JAR Development.
Above the editor, select a job resource group and a Spark application type. This topic uses the Batch type as an example.

In the editor, enter the following job configuration.

{
    "name": "ES-SPARK-EXAMPLE",
    "className": "org.example.SparkEs",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "small",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp17jqw3lrrobn6y4****",
        "spark.adb.eni.securityGroupId": "sg-bp163uxgt4zandx1****"
    },
    "file": "oss://testBucketName/spark-example.jar",
    "jars": "oss://testBucketName/elasticsearch-spark-30_2.12-7.17.9.jar"
}

The following table describes the parameters.

Parameter	Description
name	The name of the Spark job.
className	The entry class for the Java or Scala program. This parameter is not required for Python jobs.
conf	The configuration settings for the Spark application, similar to open source Spark. These settings are provided as `key:value` pairs in the conf object. For information about configuration parameters that differ from open source Spark or are specific to AnalyticDB for MySQL, see Spark application configuration parameters.
spark.adb.eni.enabled	Specifies whether to enable ENI access. Set this parameter to true to access Elasticsearch data sources from Spark in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
spark.adb.eni.vswitchId	The vSwitch ID of the Alibaba Cloud Elasticsearch cluster. For more information, see Preparations.
spark.adb.eni.securityGroupId	The security group ID of the Alibaba Cloud Elasticsearch cluster. For more information, see Preparations.
file	The OSS path of the sample program `spark-example.jar`.
jars	The OSS path to the JAR packages required by the Spark job.
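
The job configuration above can also be generated rather than typed by hand, which helps keep the ENI settings and OSS paths consistent across jobs. The following is a minimal sketch, not part of the AnalyticDB for MySQL tooling: the function name `build_spark_job_config` and its parameters are hypothetical, and the resource values simply repeat the ones shown in this topic.

```python
import json


def build_spark_job_config(name, file, jars, vswitch_id, security_group_id,
                           class_name=None):
    """Assemble a job configuration with the same keys as the example above."""
    job = {"name": name}
    # className applies only to Java or Scala programs, so it is optional here.
    if class_name is not None:
        job["className"] = class_name
    job["conf"] = {
        "spark.driver.resourceSpec": "small",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "small",
        # ENI access is required to reach the Elasticsearch cluster.
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": vswitch_id,
        "spark.adb.eni.securityGroupId": security_group_id,
    }
    job["file"] = file
    job["jars"] = jars
    return job


scala_job = build_spark_job_config(
    name="ES-SPARK-EXAMPLE",
    file="oss://testBucketName/spark-example.jar",
    jars="oss://testBucketName/elasticsearch-spark-30_2.12-7.17.9.jar",
    vswitch_id="vsw-bp17jqw3lrrobn6y4****",
    security_group_id="sg-bp163uxgt4zandx1****",
    class_name="org.example.SparkEs",
)

# Paste the printed JSON into the editor.
print(json.dumps(scala_job, indent=4))
```

Replace the masked vSwitch ID and security group ID with the values that you obtained in Preparations.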

Click Run Now.

Connect to Alibaba Cloud Elasticsearch with PySpark

Download the JAR package that matches the version of your Alibaba Cloud Elasticsearch cluster. For the download link, see Elasticsearch Spark. This example uses elasticsearch-spark-30_2.12-7.17.9.jar.

Add the required dependencies to the dependencies section of your pom.xml file.

<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30 -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_2.12</artifactId>
    <version>7.17.9</version>
    <scope>provided</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
    <scope>provided</scope>
</dependency>

Write the following sample code and save it as es-spark-example.py.

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .getOrCreate()

    # Create a DataFrame.
    dept = [("Finance", 10),
            ("Marketing", 20),
            ("Sales", 30),
            ("IT", 40)
            ]
    deptColumns = ["dept_name", "dept_id"]
    deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
    deptDF.printSchema()
    deptDF.show(truncate=False)

    # Write data.
    # The call chains are wrapped in parentheses so that comments can sit between the calls.
    (deptDF.write.format("es").mode("overwrite")
        # The private endpoint of the Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes", "es-cn-nwy34drji0003****.elasticsearch.aliyuncs.com")
        # The port number of the private endpoint.
        .option("es.port", "9200")
        # The username for the Alibaba Cloud Elasticsearch cluster. The username must be elastic.
        .option("es.net.http.auth.user", "elastic")
        # The password for the Alibaba Cloud Elasticsearch cluster.
        .option("es.net.http.auth.pass", "password")
        # This must be set to true when connecting to an Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes.wan.only", "true")
        # This must be set to false when connecting to an Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes.discovery", "false")
        # The resource to write to, in the format <index>/<type>.
        .save("spark/_doc"))

    # Read data.
    df = (spark.read.format("es")
        # The private endpoint of the Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes", "es-cn-nwy34drji0003****.elasticsearch.aliyuncs.com")
        # The port number of the private endpoint.
        .option("es.port", "9200")
        # The username for the Alibaba Cloud Elasticsearch cluster. The username must be elastic.
        .option("es.net.http.auth.user", "elastic")
        # The password for the Alibaba Cloud Elasticsearch cluster.
        .option("es.net.http.auth.pass", "password")
        # This must be set to true when connecting to an Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes.wan.only", "true")
        # This must be set to false when connecting to an Alibaba Cloud Elasticsearch cluster.
        .option("es.nodes.discovery", "false")
        # The data source to read from, in the format <index>/<type>.
        .load("spark/_doc"))
    df.show()
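
The write and read calls above repeat the same six connection options. One way to avoid the duplication is to collect them in a dictionary and pass it through the `options` method of the PySpark reader and writer. The sketch below is an illustration only: the helper name `es_connection_options` is hypothetical, and the values repeat the placeholders used in this topic.

```python
def es_connection_options(endpoint, password, user="elastic", port="9200"):
    """Return the connector options shared by the read and write calls."""
    return {
        # The private endpoint and port of the Elasticsearch cluster.
        "es.nodes": endpoint,
        "es.port": str(port),
        # Credentials for the Elasticsearch cluster.
        "es.net.http.auth.user": user,
        "es.net.http.auth.pass": password,
        # Values that this topic requires for Alibaba Cloud Elasticsearch.
        "es.nodes.wan.only": "true",
        "es.nodes.discovery": "false",
    }


opts = es_connection_options(
    "es-cn-nwy34drji0003****.elasticsearch.aliyuncs.com", "password"
)

# Inside the job, with the SparkSession `spark` and the DataFrame `deptDF`
# from the sample program, the options could be applied like this:
#   deptDF.write.format("es").mode("overwrite").options(**opts).save("spark/_doc")
#   spark.read.format("es").options(**opts).load("spark/_doc").show()
```

Keeping the options in one place also makes it easier to change the endpoint or credentials without editing both calls.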

Upload the JAR file downloaded in Step 1 and the es-spark-example.py program to an OSS bucket. For more information, see Upload objects.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, click Job Development > Spark JAR Development.
Above the editor, select a job resource group and a Spark application type. This topic uses the Batch type as an example.

In the editor, enter the following job configuration.

{
    "name": "ES-SPARK-EXAMPLE",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "small",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp17jqw3lrrobn6y4****",
        "spark.adb.eni.securityGroupId": "sg-bp163uxgt4zandx1****"
    },
    "file": "oss://testBucketName/es-spark-example.py",
    "jars": "oss://testBucketName/elasticsearch-spark-30_2.12-7.17.9.jar"
}

The following table describes the parameters.

Parameter	Description
name	The name of the Spark job.
conf	The configuration settings for the Spark application, similar to open source Spark. These settings are provided as `key:value` pairs in the conf object. For information about configuration parameters that differ from open source Spark or are specific to AnalyticDB for MySQL, see Spark application configuration parameters.
spark.adb.eni.enabled	Specifies whether to enable ENI access. Set this parameter to true to access Elasticsearch data sources from Spark in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
spark.adb.eni.vswitchId	The vSwitch ID of the Alibaba Cloud Elasticsearch cluster. For more information, see Preparations.
spark.adb.eni.securityGroupId	The security group ID of the Alibaba Cloud Elasticsearch cluster. For more information, see Preparations.
file	The OSS path of the `es-spark-example.py` program.
jars	The OSS path to the JAR packages required by the Spark job.
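
Before you run the job, it can help to check the configuration against the parameter descriptions above. The following sketch is a hypothetical local check, not an AnalyticDB for MySQL feature. The function name `check_job_config` is made up for this example, and the `vsw-`, `sg-`, and `oss://` prefix checks are assumptions inferred from the sample values in this topic.

```python
def check_job_config(job):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    conf = job.get("conf", {})

    # ENI access must be enabled to reach the Elasticsearch cluster.
    if str(conf.get("spark.adb.eni.enabled", "")).lower() != "true":
        problems.append('spark.adb.eni.enabled should be "true"')
    # Prefixes below are assumed from the example IDs, not from a specification.
    if not str(conf.get("spark.adb.eni.vswitchId", "")).startswith("vsw-"):
        problems.append("spark.adb.eni.vswitchId does not look like a vSwitch ID")
    if not str(conf.get("spark.adb.eni.securityGroupId", "")).startswith("sg-"):
        problems.append("spark.adb.eni.securityGroupId does not look like a security group ID")

    # The program and its dependencies are read from OSS.
    for key in ("file", "jars"):
        if not str(job.get(key, "")).startswith("oss://"):
            problems.append(f"{key} should be an oss:// path")

    # className is needed for Java or Scala programs but not for Python jobs.
    if str(job.get("file", "")).endswith(".jar") and not job.get("className"):
        problems.append("className is missing for a JAR job")
    return problems


python_job = {
    "name": "ES-SPARK-EXAMPLE",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "small",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp17jqw3lrrobn6y4****",
        "spark.adb.eni.securityGroupId": "sg-bp163uxgt4zandx1****",
    },
    "file": "oss://testBucketName/es-spark-example.py",
    "jars": "oss://testBucketName/elasticsearch-spark-30_2.12-7.17.9.jar",
}

for problem in check_job_config(python_job):
    print(problem)
```

The check is deliberately shallow: it cannot confirm that the IDs belong to the Elasticsearch cluster or that the OSS objects exist.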

Click Run Now.