An elastic network interface (ENI) is a virtual network interface controller (NIC) that can be bound to a VPC-type ECS instance. You can use ENIs to deploy high-availability clusters and perform low-cost failover and fine-grained network management. This topic describes how to use the serverless Spark engine to access Hive clusters in your VPC through ENIs.

Prerequisites

The serverless Spark engine is granted access to your VPC.
Notice You can reuse the VSwitch and security group that are used by the existing ENI in your cluster.

Procedure

  1. Add the CIDR block of the ENI VSwitch to the whitelist or security group of the Hive cluster.
    • If you want to access a Hive cluster in E-MapReduce (EMR), add the CIDR block of the ENI VSwitch to an inbound rule of the security group of the Hive cluster, and open Hive metastore port 9083 and HDFS ports 8020, 50010, and 50020 to this CIDR block. If you have changed the default HDFS ports, open the ports that you actually use.
    • If you want to access a Hive cluster in X-Pack Spark, you can add the CIDR block of the ENI VSwitch to the whitelist for access control of the Hive cluster. For more information, see Configure a whitelist or a security group.
      Notice In this case, you must contact technical support (DingTalk ID: dgw-jk1ia6xzp) to enable HDFS. HDFS is disabled by default because it is prone to malicious attacks, which may interrupt cluster services or even cause damage.
    • If you want to access a self-managed Hive cluster, add the CIDR block of the ENI VSwitch to the security group of the Hive cluster, and open Hive metastore port 9083 and HDFS ports 8020, 50010, and 50020 to this CIDR block. If you have changed the default HDFS ports, open the ports that you actually use. You can use the connectivity check sketch after this list to verify that the ports are reachable.
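    After you configure the rules, you can verify that the ports are reachable from the ENI VSwitch. The following minimal Scala sketch is a suggestion rather than part of the official procedure: it opens a TCP connection to each port from an ECS instance that uses the same VSwitch and security group. The IP addresses are placeholders and must be replaced with the addresses of your Hive metastore and HDFS nodes.
      import java.net.{InetSocketAddress, Socket}

      object CheckHivePorts {
        def main(args: Array[String]): Unit = {
          // Placeholder addresses: replace them with the nodes of your Hive cluster.
          val targets = Seq(
            ("192.168.0.10", 9083),  // Hive metastore
            ("192.168.0.10", 8020),  // HDFS NameNode RPC
            ("192.168.0.11", 50010), // HDFS DataNode data transfer
            ("192.168.0.11", 50020)  // HDFS DataNode IPC
          )
          targets.foreach { case (host, port) =>
            val socket = new Socket()
            try {
              socket.connect(new InetSocketAddress(host, port), 3000)
              println(s"$host:$port is reachable")
            } catch {
              case e: Exception => println(s"$host:$port is NOT reachable: ${e.getMessage}")
            } finally {
              socket.close()
            }
          }
        }
      }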
  2. Obtain the parameters that must be configured in the serverless Spark engine.
    Note Skip this step if you already know the required configurations of your Hive cluster.
    The serverless Spark engine cannot automatically obtain configurations from the directory specified by HIVE_CONF_DIR in your Hive cluster, so you must provide the required parameters yourself. You can read the values from the configuration files of your cluster (a sketch for doing so follows the commands below) or use the tool provided by Alibaba Cloud: run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar package and upload it to OSS. Then, submit a Spark job to your cluster and obtain the required configurations from the output of the job.
    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • Hive clusters in EMR: You can run the following command to submit the job after the JAR package is uploaded to OSS:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job succeeds, you can use SparkUI to view the stdout output of the driver. You can also view the output from the job logs on the job details page.

    • Hive clusters in X-Pack Spark: You can upload the JAR package to the resource management directory and then run the following command to submit the job and obtain configurations.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job succeeds, you can view the configurations from the stdout output of the driver on SparkUI.

    • Self-managed Hive clusters: If you have not specified the HIVE_CONF_DIR environment variable for the cluster, specify the Hive configuration directory explicitly by using the --hive-conf-dir option, as shown in the following command:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hive-conf-dir </path/to/your/hive/conf/dir> hive hadoop
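    If you can log on to a node of your Hive cluster, you can also read the values directly from the configuration files instead of submitting the job above. The following minimal Scala sketch assumes placeholder file paths and requires the Hadoop client libraries on the classpath; it prints the configuration items that are referenced in Step 5.
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path

      object PrintHiveClusterConf {
        def main(args: Array[String]): Unit = {
          // Placeholder paths: replace them with the configuration files of your cluster.
          val conf = new Configuration(false)
          conf.addResource(new Path("/path/to/hive/conf/hive-site.xml"))
          conf.addResource(new Path("/path/to/hadoop/conf/hdfs-site.xml"))
          conf.addResource(new Path("/path/to/hadoop/conf/core-site.xml"))
          // Keys that appear in the job configuration examples in Step 5.
          Seq("hive.metastore.uris", "fs.defaultFS", "dfs.nameservices").foreach { key =>
            println(s"$key = ${conf.get(key)}")
          }
        }
      }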
  3. Edit the code in the SparkApplication file to access the Hive cluster.
    The following sample code creates a table in the default database of your Hive cluster by using the table name that you pass in. The table contains a single column named welcome_col whose value is hello, dla-spark. The job then reads the column from the table and writes the output to stdout:
    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()
    
        val welcome = "hello, dla-spark"
    
        // The name of the table in your Hive cluster.
        val tableName = args(0)
    
        import sparkSession.implicits._
        // Save DataFrame df that has only one row and one column of data to your Hive cluster. The table name is the one you entered, and the column name is welcome_col.
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)
    
        // Read the table you specified from your Hive cluster.
        val dfFromHive = sparkSession.sql(
          s"""
            |select * from $tableName
            |""".stripMargin)
        dfFromHive.show(10)
      }
    }
  4. Upload the SparkApplication JAR package and dependencies to OSS.
    For more information, see Upload objects.
    Note The region where your OSS bucket resides must be the same as the region where the serverless Spark engine is deployed.
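    If you build the sample application in Step 3 with sbt, a minimal build.sbt might look like the following sketch. The project name, Scala version, and Spark version are assumptions and must match your environment; the Spark dependencies are marked as provided on the assumption that the serverless Spark engine supplies them at runtime.
      // Minimal sbt build for the sample job; adjust names and versions to your environment.
      name := "spark-on-hive-example"
      version := "0.0.1"
      scalaVersion := "2.11.12"

      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-sql"  % "2.4.5" % "provided",
        "org.apache.spark" %% "spark-hive" % "2.4.5" % "provided"
      )
    Run sbt package and upload the generated JAR file to OSS as described in this step.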
  5. Submit a job in the serverless Spark engine and perform computation.
    • The following code snippet provides an example of the job configuration for accessing a high-availability Hive cluster. For more information, see Create and run Spark jobs.
      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
          "spark.dla.eni.vswitch.id": "{The ID of your ENI VSwitch}",
          "spark.dla.eni.security.group.id": "{The ID of your security group}",
          "spark.dla.eni.enable": "true",
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium",
          "spark.Hive.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
          "spark.Hive.dfs.nameservices":"{The name of your nameservices}",
          "spark.Hive.dfs.client.failover.proxy.provider.${nameservices}":"{The full path name of the implementation class of your failover proxy provider}",
          "spark.Hive.dfs.ha.namenodes.${nameservices}":"{The namenode list to which your nameservices belongs}",
          "spark.Hive.dfs.namenode.rpc-address.${nameservices}.${nn1}":"The IP address and port number of namenode0",
          "spark.Hive.dfs.namenode.rpc-address.${nameservices}.${nn2}":"The IP address and port number of namenode1
      "
          },
          "file": "oss://{The OSS directory where your JAR package is saved}"
      }
      The following list describes the parameters.
      spark.hadoop.hive.metastore.uris: The uniform resource identifiers (URIs) of the Hive metastore service that you want to access. This parameter corresponds to hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the Domain name:Port format, and you must convert it to the IP address:Port format for this parameter. To obtain the mappings between domain names and IP addresses, log on to the master node of your cluster and view the /etc/hosts file, or ping the domain names on the master node. You can also obtain the mappings from the output of the job in Step 2.
      spark.dla.eni.vswitch.id: The ID of your ENI VSwitch.
      spark.dla.eni.security.group.id: The ID of your security group.
      spark.dla.eni.enable: Specifies whether to enable ENI access.
      spark.hadoop.dfs.nameservices: Corresponds to dfs.nameservices in the hdfs-site.xml file.
      spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}: Corresponds to dfs.client.failover.proxy.provider.${nameservices} in the hdfs-site.xml file.
      spark.hadoop.dfs.ha.namenodes.${nameservices}: Corresponds to dfs.ha.namenodes.${nameservices} in the hdfs-site.xml file.
      spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2}: Corresponds to dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} in the hdfs-site.xml file. The value of this parameter must be in the IP address:Port format, whereas the value of dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} is in the Domain name:Port format. To obtain the mappings between domain names and IP addresses, log on to the master node of your cluster and view the /etc/hosts file, or ping the domain names on the master node. You can also obtain the mappings from the output of the job in Step 2.
      After the job succeeds, find the job in the job list and click Log in the Operation column to view the job logs.
    • The following code snippet provides an example of the job configuration for accessing a non-high-availability Hive cluster.
      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
              "spark.dla.eni.vswitch.id": "{The ID of your ENI VSwitch}",
              "spark.dla.eni.security.group.id": "{The ID of your security group}",
                "spark.dla.eni.enable": "true",
              "spark.driver.resourceSpec": "medium",
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium",
              "spark.Hive.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
              "spark.dla.eni.extra.hosts":"${ip0} ${hostname_0} ${hostname_1} ${hostname_n}"
          },
          "file": "oss://{The OSS directory where your JAR package is saved}"
      }
      The following list describes the parameters.
      spark.hadoop.hive.metastore.uris: The URIs of the Hive metastore service that you want to access. This parameter corresponds to hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the Domain name:Port format, and you must convert it to the IP address:Port format for this parameter. To obtain the mappings between domain names and IP addresses, log on to the master node of your cluster and view the /etc/hosts file, or ping the domain names on the master node. You can also obtain the mappings from the output of the job in Step 2.
      spark.dla.eni.vswitch.id: The ID of your ENI VSwitch.
      spark.dla.eni.security.group.id: The ID of your security group.
      spark.dla.eni.enable: Specifies whether to enable ENI access.
      spark.dla.eni.extra.hosts: The mappings between IP addresses and hostnames that the serverless Spark engine needs to resolve the domain names in the location of a Hive table. Separate an IP address and its hostname with a space, and separate multiple IP-hostname pairs with commas (,), for example, ip0 master0,ip1 master1. You can obtain the hostname from fs.defaultFS in the ${HADOOP_CONF_DIR}/core-site.xml file. For example, if fs.defaultFS is hdfs://master-1:9000, set spark.dla.eni.extra.hosts to ${IP address of master-1} master-1. To obtain the mapping between the domain name and the IP address, log on to the master node of your cluster and view the /etc/hosts file. You can also obtain the mappings from the output of the job in Step 2.
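    The spark.hadoop.* items in the preceding examples are standard Spark properties that are copied into the Hadoop and Hive client configuration of the job. If a job fails with connection errors, one way to check whether the values actually reached the job is to print them from the Hadoop configuration of the Spark session. The following minimal sketch is a debugging suggestion rather than part of the procedure; the key names are taken from the examples above, and dfs.nameservices is set only for high-availability clusters.
      import org.apache.spark.sql.SparkSession

      object PrintEffectiveConf {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("Print effective conf").enableHiveSupport().getOrCreate()
          // spark.hadoop.* job configurations are visible here without the spark.hadoop. prefix.
          val hadoopConf = spark.sparkContext.hadoopConfiguration
          Seq("hive.metastore.uris", "dfs.nameservices").foreach { key =>
            println(s"$key = ${hadoopConf.get(key)}")
          }
          spark.stop()
        }
      }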