An elastic network interface (ENI) is a virtual network interface controller (NIC) that can be bound to a VPC-type ECS instance. You can use ENIs to deploy high-availability clusters and perform low-cost failover and fine-grained network management. This topic describes how to use Serverless Spark to access Hadoop clusters in your VPC. Serverless Spark cannot access Hadoop clusters on which Kerberos authentication is enabled.

Prerequisites

Serverless Spark can access your VPC. For more information, see Access your VPC.
Notice You can use the VSwitch and security group that are associated with an existing ENI in your cluster.

Procedure

  1. Add the CIDR block of the VSwitch where the ENI resides to a whitelist or security group of Hadoop Distributed File System (HDFS).
    • If you need to access HDFS in an EMR cluster, you must add the CIDR block of the VSwitch where the ENI resides to the inbound rules of the security group of the EMR cluster and allow access to the required HDFS ports, which are 8020, 50010, and 50020 by default. The ports may vary if you change the defaults.
    • If you need to access HDFS in Xpack-Spark, you can add the CIDR block of the VSwitch where the ENI resides to a whitelist of the Xpack-Spark cluster. For more information, see Configure a whitelist or a security group.
      Notice If you are an Xpack-Spark user, contact Xpack-Spark technical support over DingTalk (ID: dgw-jk1ia6xzp) to activate HDFS first. Because an Xpack-Spark cluster may become unstable or even be attacked after HDFS is activated, Xpack-Spark users are not allowed to activate HDFS by themselves.
    • If you need to access a self-managed Hadoop cluster, you can add the CIDR block of the VSwitch where the ENI resides to the security group to which the cluster belongs, and allow access to the required HDFS ports, including ports 8020, 50010, and 50020. The ports may vary if you change the default ones.
  2. Obtain the parameters that need to be configured in Serverless Spark.
    Note If you cannot run a Spark job on your Hadoop cluster, skip this step.
    You must obtain the parameter configurations of your Hadoop cluster because Serverless Spark cannot automatically read them from the HADOOP_CONF_DIR directory of the cluster. Alibaba Cloud provides a tool that reads these configurations: run the following command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar package, upload the package to OSS, and then submit a Spark job to your cluster. The required configurations are displayed in the output of the job.
    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • Access HDFS in an EMR cluster: After you upload the JAR package to OSS, run the following command to submit the job that obtains the configurations:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hadoop

      After the job is executed, you can view the configurations in the stdout output of the driver on the SparkUI or in the job logs.

    • Access HDFS in an Xpack-Hadoop cluster: After you upload the JAR package to the resource management directory, run the following command to submit the job that obtains the configurations:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hadoop

      After the job is executed, you can view the stdout output of the driver on the SparkUI.

    • Access HDFS in other types of Hadoop clusters: If the HADOOP_CONF_DIR environment variable is not configured on the cluster, you must specify the Hadoop configuration directory by using the --hadoop-conf-dir option, as shown in the following command:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hadoop-conf-dir </path/to/your/hadoop/conf/dir> hadoop
  3. Write a Spark application to access HDFS.
    The following sample code writes data to the HDFS directory that is passed in as a job argument, reads the data back from that directory, and prints it:
    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHDFS {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HDFS TEST")
          .getOrCreate()
    
        val welcome = "hello, dla-spark"
    
        //Specifies the HDFS directory to store required data.
        val hdfsPath = args(0)
        //Stores the welcome string to the specified HDFS directory.
        sparkSession.sparkContext.parallelize(Seq(welcome)).saveAsTextFile(hdfsPath)
        //Reads data from the specified HDFS directory and prints the data.
        sparkSession.sparkContext.textFile(hdfsPath).collect.foreach(println)
      }
    }
  4. Upload the JAR package of the Spark application and dependencies to OSS.
    For more information, see Upload objects.
    Note The region where OSS resides must be the same as the region where Serverless Spark resides.
  5. Submit a job in Serverless Spark and perform data computations.
    • The following code shows how to access an HDFS cluster that is deployed in non-HA mode. For more information, see Create and run Spark jobs.
      {
          "args": [
              "${fs.defaultFS}/tmp/dla_spark_test"
          ],
          "name": "spark-on-hdfs",
          "className": "com.aliyun.spark.SparkHDFS",
          "conf": {
          "spark.dla.eni.enable": "true",
          "spark.dla.eni.vswitch.id": "{ID of the VSwitch you selected}",
          "spark.dla.eni.security.group.id": "{ID of the security group you selected}",
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium"
          },
          "file": "oss://{OSS directory in which your JAR package is stored}"
      }
      The following table describes the parameters.
      Parameter: fs.defaultFS
      Description: The value of fs.defaultFS in the core-site.xml file of your HDFS cluster. If the value contains the domain name of a machine, replace the domain name with the IP address that the domain name resolves to. Typical format: hdfs://${IP address that corresponds to the domain name}:9000/path/to/dir
      Remarks: You can log on to the primary node of the cluster and view the mappings between domain names and IP addresses in the /etc/hosts file. You can also ping the domain name or perform Step 2 to obtain the value of this parameter.
      Parameter: spark.dla.eni.vswitch.id
      Description: The ID of the VSwitch that you selected.
      Parameter: spark.dla.eni.security.group.id
      Description: The ID of the security group that you selected.
      Parameter: spark.dla.eni.enable
      Description: Specifies whether to enable ENI.
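      For example, if the value of fs.defaultFS in core-site.xml is hdfs://emr-header-1:9000 and the domain name emr-header-1 resolves to 192.168.1.10 (both values are hypothetical and used only for illustration), the args parameter would be similar to the following snippet:
      "args": [
          "hdfs://192.168.1.10:9000/tmp/dla_spark_test"
      ]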
      After the job succeeds, click Log in the Operation column to view job logs.
    • The following code shows how to access an HDFS cluster that is deployed in HA mode:
      {
          "args": [
              "${fs.defaultFS}/tmp/test"
          ],
          "name": "spark-on-hdfs",
          "className": "com.aliyun.spark.SparkHDFS",
          "conf": {
              "spark.dla.eni.enable": "true",
              "spark.dla.eni.vswitch.id": "{ID of the VSwitch you selected}",
              "spark.dla.eni.security.group.id": "{ID of the security group you selected}",
              "spark.driver.resourceSpec": "medium",  
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium",
              "spark.hadoop.dfs.nameservices":"{Name of your nameservices}",
                      "spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}":"{Full path name of the implementation class of the failover proxy provider}",
                      "spark.hadoop.dfs.ha.namenodes.${nameservices}":"{List of name nodes to which your nameservices belongs}",
                      "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1}":"IP address of namenode0:Port of namenode0",
                      "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn2}":"IP address of namenode1:Port of namenode1",
          },
          "file": "oss://{{OSS directory in which your JAR package is stored}"
      }
      Note Serverless Spark cannot read the HADOOP_CONF_DIR directory of your cluster. If the cluster is deployed in HA mode, the parameters described in the following table are also required. You can obtain their values from the hdfs-site.xml file of your cluster.
      Parameter: spark.hadoop.dfs.nameservices
      Description: Corresponds to dfs.nameservices in hdfs-site.xml.
      Parameter: spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}
      Description: Corresponds to dfs.client.failover.proxy.provider.${nameservices} in hdfs-site.xml.
      Parameter: spark.hadoop.dfs.ha.namenodes.${nameservices}
      Description: Corresponds to dfs.ha.namenodes.${nameservices} in hdfs-site.xml.
      Parameter: spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2}
      Description: Corresponds to dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} in hdfs-site.xml.
      Remarks: Specify the IP address that corresponds to the domain name of the namenode, followed by the port, in the format of IP address:port. You can view the mappings between domain names and IP addresses in the /etc/hosts file on the primary node of your cluster. You can also perform Step 2 to obtain the value.
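      For example, assume a hypothetical nameservice named emr-cluster with two namenodes, nn1 and nn2, whose RPC addresses in hdfs-site.xml resolve to 192.168.1.10:8020 and 192.168.1.11:8020, and assume that the cluster uses the commonly used ConfiguredFailoverProxyProvider class. Under these assumptions, the HA-related entries in the conf parameter would look similar to the following sketch; replace every value with the one from your own hdfs-site.xml file.
      "spark.hadoop.dfs.nameservices": "emr-cluster",
      "spark.hadoop.dfs.client.failover.proxy.provider.emr-cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
      "spark.hadoop.dfs.ha.namenodes.emr-cluster": "nn1,nn2",
      "spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn1": "192.168.1.10:8020",
      "spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn2": "192.168.1.11:8020"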