This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Hadoop clusters on which Kerberos authentication is not enabled.

Prerequisites

  • DLA is activated and a Spark virtual cluster (VC) is created in the DLA console. For more information about how to activate DLA, see Activate Data Lake Analytics.
  • Object Storage Service (OSS) is activated. For more information, see Sign up for OSS.
  • The vSwitch ID and security group ID that are required for creating a Spark compute node are obtained. You can use the IDs of an existing vSwitch and security group, or create a vSwitch and a security group and use their IDs. The vSwitch and security group that you select must meet the following conditions:
    • The vSwitch must be in the same virtual private cloud (VPC) as the Hadoop cluster. You can use the vSwitch ID that is configured in the console for managing the Hadoop cluster. The following figures show the vSwitch IDs of the Hadoop cluster in E-MapReduce (EMR) and the Hadoop cluster in X-Pack Spark of ApsaraDB for HBase.
    • The security group must also be in the same VPC as the Hadoop cluster. To find one, log on to the Elastic Compute Service (ECS) console, choose Network & Security > Security Groups in the left-side navigation pane, and enter the VPC ID in the search box on the Security Groups page to list the security groups in that VPC. Then, select the ID of one of these security groups.
    • If the Hadoop cluster is configured with a whitelist for access control, you must add the CIDR block of the vSwitch to the whitelist.
    Notice If you want to access a Hadoop cluster in X-Pack Spark of ApsaraDB for HBase, first join the DingTalk group (ID: dgw-jk1ia6xzp) to have Hadoop Distributed File System (HDFS) access enabled. HDFS access is disabled by default because enabling it can leave the X-Pack Spark Hadoop cluster unstable or even exposed to attacks.

Procedure

  1. Obtain the settings of the Hadoop cluster that you must configure on the serverless Spark engine of DLA.
    Note If the Hadoop cluster cannot run Spark jobs, skip this step.
    To obtain the settings of the Hadoop cluster that you want to access, run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar file and upload the file to OSS. Then, submit a Spark job to the Hadoop cluster; the job prints the settings that you need.
    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • If you want to access a Hadoop cluster in EMR, you can upload the spark-examples-0.0.1-SNAPSHOT-shaded.jar file to OSS, and run the following commands to submit a job to the Hadoop cluster and obtain job configurations.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hadoop

      After the job succeeds, you can view the configurations from the stdout output of the driver on the SparkUI or from the logs on the job details page.

    • If you want to access a Hadoop cluster in X-Pack Spark of ApsaraDB for HBase, you can upload the spark-examples-0.0.1-SNAPSHOT-shaded.jar file to the resource management directory, and run the following commands to submit a job to the Hadoop cluster and obtain job configurations.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hadoop

      After the job succeeds, you can view the configurations from the stdout output of the driver on the SparkUI.

    • If you want to access a Hadoop cluster of another type and you have not specified the HADOOP_CONF_DIR environment variable on the cluster, you must manually specify this variable.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hadoop-conf-dir </path/to/your/hadoop/conf/dir> hadoop
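If you prefer to read the values directly from the cluster's configuration files instead of submitting the job above, the key setting lives in core-site.xml. The sketch below extracts fs.defaultFS and strips it down to the hostname so that you can look up the matching IP address in /etc/hosts; the file content, hostname, and port are illustrative, not taken from your environment.

```shell
# The sample file stands in for /etc/hadoop/conf/core-site.xml on your cluster
# (a typical, but not guaranteed, location for the Hadoop client configuration).
cat > core-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://emr-header-1:9000</value>
  </property>
</configuration>
EOF

# Pull the <value> that follows the fs.defaultFS <name> element.
fs=$(sed -n '/<name>fs.defaultFS<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' core-site-sample.xml)
echo "$fs"     # hdfs://emr-header-1:9000

# Strip the scheme and port to get the hostname, then resolve it to an IP
# address via the /etc/hosts file on the primary node (or by pinging it).
host=${fs#hdfs://}
host=${host%%:*}
echo "$host"   # emr-header-1
```

On a real cluster, you would point the sed command at the actual core-site.xml on the primary node instead of the sample file.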
  2. Edit the code in the Spark application file to access the Hadoop cluster.
    The following sample code writes a string to the HDFS directory that is passed in as an argument, reads the data back from that directory, and prints it:
    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHDFS {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HDFS TEST")
          .getOrCreate()
    
        val welcome = "hello, dla-spark"
    
        // Specifies the HDFS directory to store required data.
        val hdfsPath = args(0)
        // Stores the welcome string to the specified HDFS directory.
        sparkSession.sparkContext.parallelize(Seq(welcome)).saveAsTextFile(hdfsPath)
        // Reads data from the specified HDFS directory and displays the data.
        sparkSession.sparkContext.textFile(hdfsPath).collect.foreach(println)
      }
    }
  3. Upload the JAR file of the Spark application and dependencies to OSS.
    For more information, see Upload objects.
    Note OSS and the serverless Spark engine of DLA must be deployed in the same region.
  4. Submit a job in the serverless Spark engine of DLA and perform data computations.
    • The following code snippet provides an example if you want to access a non-high-availability Hadoop cluster that has only one primary node or NameNode. For more information, see Create and run Spark jobs and Configure a Serverless Spark job.
      {
          "args": [
              "${fs.defaultFS}/tmp/dla_spark_test"
          ],
          "name": "spark-on-hdfs",
          "className": "com.aliyun.spark.SparkHDFS",
          "conf": {
              "spark.dla.eni.enable": "true",
              "spark.dla.eni.vswitch.id": "{vSwitch ID}",
              "spark.dla.eni.security.group.id": "{Security group ID}",
              "spark.dla.job.log.oss.uri": "oss://{OSS URI where SparkUI logs are saved}",
              "spark.driver.resourceSpec": "medium",
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium"
          },
          "file": "oss://{OSS directory where your JAR file is stored}"
      }
      The following table describes the parameters that are used in the preceding code.

      | Parameter | Description | Remarks |
      | --- | --- | --- |
      | fs.defaultFS | The value of fs.defaultFS in the core-site.xml configuration file of the Hadoop cluster. | If the value contains a machine's domain name, replace the domain name with its IP address. Typical format: hdfs://${IP address that corresponds to the domain name}:9000/path/to/dir. To find the mapping between domain names and IP addresses, check the /etc/hosts file on the primary node of the cluster, ping the domain name, or perform Step 1. |
      | spark.dla.eni.vswitch.id | The ID of the vSwitch that you selected. | N/A |
      | spark.dla.eni.security.group.id | The ID of the security group that you selected. | N/A |
      | spark.dla.eni.enable | Specifies whether to enable an elastic network interface (ENI). Set the value to true in this scenario. | N/A |

      After the job succeeds, find the job and click Log in the Operation column to view the logs of the job.
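Because the submission fails if the job description is not well-formed JSON, it can be worth linting the file locally before pasting it into the console. The following sketch assumes python3 is available on your machine; the vSwitch ID, security group ID, HDFS address, and OSS path are placeholders.

```shell
# Write the job description to a file and check that it parses as JSON.
cat > job.json <<'EOF'
{
    "args": ["hdfs://192.168.0.10:9000/tmp/dla_spark_test"],
    "name": "spark-on-hdfs",
    "className": "com.aliyun.spark.SparkHDFS",
    "conf": {
        "spark.dla.eni.enable": "true",
        "spark.dla.eni.vswitch.id": "vsw-placeholder",
        "spark.dla.eni.security.group.id": "sg-placeholder",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium"
    },
    "file": "oss://your-bucket/path/to/spark-examples.jar"
}
EOF

# json.tool exits non-zero and reports the offending line on a parse error.
python3 -m json.tool job.json > /dev/null && echo "job.json parses as valid JSON"
```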

    • The following code snippet provides an example if you want to access a high-availability Hadoop cluster that has more than one primary node or NameNode.
      {
          "args": [
              "${fs.defaultFS}/tmp/test"
          ],
          "name": "spark-on-hdfs",
          "className": "com.aliyun.spark.SparkHDFS",
          "conf": {
              "spark.dla.eni.enable": "true",
              "spark.dla.eni.vswitch.id": "{vSwitch ID}",
              "spark.dla.eni.security.group.id": "{Security group ID}",
              "spark.driver.resourceSpec": "medium",
              "spark.dla.job.log.oss.uri": "oss://{OSS URI where SparkUI logs are saved}",
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium",
              "spark.hadoop.dfs.nameservices": "{Names of your nameservices}",
              "spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}": "{Full path name of the implementation class of the failover proxy provider}",
              "spark.hadoop.dfs.ha.namenodes.${nameservices}": "{List of NameNodes to which your nameservices belong}",
              "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1}": "{IP address of namenode0}:{Port number of namenode0}",
              "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn2}": "{IP address of namenode1}:{Port number of namenode1}"
          },
          "file": "oss://{OSS directory where your JAR file is stored}"
      }
    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | spark.hadoop.dfs.nameservices | Corresponds to dfs.nameservices in the hdfs-site.xml file. | N/A |
    | spark.hadoop.dfs.client.failover.proxy.provider.${nameservices} | Corresponds to dfs.client.failover.proxy.provider.${nameservices} in the hdfs-site.xml file. | N/A |
    | spark.hadoop.dfs.ha.namenodes.${nameservices} | Corresponds to dfs.ha.namenodes.${nameservices} in the hdfs-site.xml file. | N/A |
    | spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} | Corresponds to dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} in the hdfs-site.xml file. | Specify the value as {IP address of the NameNode}:{port}, where the IP address corresponds to the NameNode's domain name. Check the /etc/hosts file on the primary node of the Hadoop cluster for the mapping, or perform Step 1 to obtain the configuration. |
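    As an illustration, if hdfs-site.xml on the cluster defines a nameservice named emr-cluster served by NameNodes nn1 and nn2 (the nameservice name, IP addresses, and ports here are hypothetical), the HA entries in "conf" would look like the following. The proxy provider class shown is Hadoop's standard ConfiguredFailoverProxyProvider:

```json
"spark.hadoop.dfs.nameservices": "emr-cluster",
"spark.hadoop.dfs.client.failover.proxy.provider.emr-cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"spark.hadoop.dfs.ha.namenodes.emr-cluster": "nn1,nn2",
"spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn1": "192.168.0.10:8020",
"spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn2": "192.168.0.11:8020"
```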