This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Hive clusters in your virtual private cloud (VPC).

Prerequisites

  • DLA is activated and a Spark virtual cluster (VC) is created in the DLA console. For more information about how to activate DLA, see Activate DLA.
  • Object Storage Service (OSS) is activated. For more information, see Sign up for OSS.
  • The vSwitch ID and security group ID that are required to create a Spark compute node are obtained. You can use the IDs of an existing vSwitch and security group, or create a vSwitch and a security group and use their IDs. The vSwitch and security group that you select must meet the following conditions:
    • The vSwitch must be in the same VPC as the Hive cluster.
    • The security group must be in the same VPC as the Hive cluster. To find a suitable security group, log on to the Elastic Compute Service (ECS) console. In the left-side navigation pane, choose Network & Security > Security Groups. On the Security Groups page, enter the VPC ID in the search box to search for the security groups that are associated with the VPC, and select the ID of one of these security groups.
    • If the Hive cluster is configured with a whitelist for access control, you must add the CIDR block of the vSwitch to the whitelist.

Procedure

  1. If Hive metadata is stored in an independent ApsaraDB RDS database and table data is stored in OSS, you can directly use the following configurations and skip subsequent steps. Otherwise, proceed with the following steps.
    {
        "name": "spark-on-hive",
        "className": "com.aliyun.spark.SparkHive", # The name of the main class of the Spark application. Change it based on your application.
        "jars": [
            "oss://path/to/mysql-connector-java-5.1.47.jar"
        ],
        "conf": {
            "spark.dla.eni.vswitch.id": "{vSwitch ID}",    
            "spark.dla.eni.security.group.id": "{Security group ID}",
            "spark.dla.eni.enable": "true",
            "spark.driver.resourceSpec": "medium",
            "spark.dla.connectors": "oss",
            "spark.executor.instances": 1,
            "spark.sql.catalogImplementation": "hive",
            "spark.executor.resourceSpec": "medium",
            "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
            "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_user_name>", # The username that is used to log on to the ApsaraDB RDS database in the Hive cluster. 
            "spark.hadoop.javax.jdo.option.ConnectionPassword": "<your_pass_word>", # The password that is used to log on to the ApsaraDB RDS database in the Hive cluster.
            "spark.hadoop.javax.jdo.option.ConnectionURL": "<Java Database Connectivity (JDBC) uniform resource locator (URL)>", # The JDBC URL of the ApsaraDB RDS database in the Hive cluster.
            "spark.dla.job.log.oss.uri": "<OSS uniform resource identifier (URI) that specifies where logs are saved>"
        },
        "file": "<oss://Directory where the JAR file of the main resource is saved>"
    }
    Note The JAR file specified in jars is the JDBC driver that is used to connect to the ApsaraDB RDS for MySQL database that stores the Hive metadata. You can obtain the JAR file from the official Maven repository and upload it to OSS.
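    Before you run your own workload with the preceding configuration, you can verify that the serverless Spark engine reaches the RDS-hosted metastore. The following is a minimal sketch, assuming you compile it into the JAR referenced in file; the class name com.aliyun.spark.SparkHiveSmokeTest is hypothetical and not part of the DLA samples. It only lists the databases and tables that the metastore exposes.
    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHiveSmokeTest {
      def main(args: Array[String]): Unit = {
        // enableHiveSupport() makes Spark use the metastore that is configured
        // through the spark.hadoop.javax.jdo.option.* parameters above.
        val spark = SparkSession.builder()
          .appName("Hive metastore smoke test")
          .enableHiveSupport()
          .getOrCreate()

        // If the JDBC connection settings are correct, these statements succeed
        // and print the databases and tables that are registered in the metastore.
        spark.sql("SHOW DATABASES").show(false)
        spark.sql("SHOW TABLES").show(false)

        spark.stop()
      }
    }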
  2. Obtain the parameter settings of the Hive cluster that you must configure on the serverless Spark engine of DLA.
    Note If you cannot execute Spark jobs in the Hive cluster, you can skip this step.
    Run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar file and upload the file to OSS. Then, submit a Spark job to the Hive cluster to obtain the parameter settings of the cluster.
    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • If you want to access a Hive cluster in EMR, you can upload the spark-examples-0.0.1-SNAPSHOT-shaded.jar file to OSS, and run the following command to submit a job to the Hive cluster and obtain job configurations.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job succeeds, you can view the configurations from the stdout output of the driver on the SparkUI or from the logs on the job details page.

    • If you want to access a Hive cluster in X-Pack Spark of ApsaraDB for HBase, you can upload the spark-examples-0.0.1-SNAPSHOT-shaded.jar file to the resource management directory, and run the following command to submit a job to the Hive cluster and obtain job configurations.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job succeeds, you can view the configurations from the stdout output of the driver on the SparkUI.

    • If you want to access a Hive cluster of another type and the HIVE_CONF_DIR environment variable is not set in the cluster, you must specify the configuration directory manually by using the --hive-conf-dir option.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hive-conf-dir </path/to/your/hive/conf/dir> hive hadoop
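    The GetConfForServerlessSpark utility is provided in the shaded JAR. Conceptually, it loads the Hive and Hadoop configuration files from the configuration directory and prints the keys that you later copy into the DLA job configuration. The following sketch is an illustrative reimplementation under that assumption, not the actual class, and it requires hadoop-common on the classpath; the class name DumpHiveAndHdfsConf is hypothetical.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    object DumpHiveAndHdfsConf {
      def main(args: Array[String]): Unit = {
        // Directory that holds hive-site.xml; hdfs-site.xml and core-site.xml are
        // assumed to be in the same directory for this sketch.
        val confDir = args.headOption.getOrElse(sys.env("HIVE_CONF_DIR"))

        val conf = new Configuration()
        conf.addResource(new Path(s"$confDir/hive-site.xml"))
        conf.addResource(new Path(s"$confDir/hdfs-site.xml"))
        conf.addResource(new Path(s"$confDir/core-site.xml"))

        // Keys that map directly to the spark.hadoop.* parameters used in Step 5.
        Seq("hive.metastore.uris", "fs.defaultFS", "dfs.nameservices")
          .foreach(key => println(s"$key = ${conf.get(key)}"))
      }
    }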
  3. Edit the code in the Spark application file to access the Hive cluster.
    In the following sample code, a table is created in the default database based on the table name that is passed in as an argument. The table contains a single column of the STRING type whose value is hello, dla-spark. The code then reads this column back and prints it to stdout.
    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()
    
        val welcome = "hello, dla-spark"
    
        // The name of the Hive table.
        val tableName = args(0)
    
        import sparkSession.implicits._
        // Save a DataFrame that contains a single row with one column named welcome_col to the Hive table specified by tableName.
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)
    
        // Read the data back from the Hive table tableName.
        val dfFromHive = sparkSession.sql(
          s"""
            |select * from $tableName
            |""".stripMargin)
        dfFromHive.show(10)
      }
    }
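    To build the JAR file that you upload in the next step, the application only needs the Spark SQL and Hive modules at compile time because the serverless Spark engine provides them at run time. The following build.sbt is a minimal sketch under that assumption; the version numbers are illustrative and must match the Spark and Scala versions of your VC.
    name := "spark-on-hive"
    version := "0.1"
    scalaVersion := "2.11.12"

    // Provided scope: the serverless Spark engine ships these modules at run time.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "2.4.5" % "provided",
      "org.apache.spark" %% "spark-hive" % "2.4.5" % "provided"
    )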
  4. Upload the JAR file of the Spark application and its dependencies to OSS.
    For more information, see Upload objects.
    Note OSS and the serverless Spark engine of DLA must be deployed in the same region.
  5. Submit a job in the serverless Spark engine of DLA and perform data computations.
    • The following code snippet shows how to submit the job if you want to access a high-availability Hive cluster that has more than one primary node or NameNode. For more information, see Create and run Spark jobs and Configure a Spark job.
      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
          "spark.sql.catalogImplementation":"hive",
          "spark.dla.eni.vswitch.id": "{vSwitch ID}",
          "spark.dla.eni.security.group.id": "{Security group ID}",
          "spark.dla.eni.enable": "true",
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium",
          "spark.dla.job.log.oss.uri": "oss://<OSS URI that specifies where SparkUI logs are saved/>",
          "spark.hadoop.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
          "spark.hadoop.dfs.nameservices":"{Names of your nameservices}",
          "spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}":"{Full path name of the implementation class of the failover proxy provider}",
          "spark.hadoop.dfs.ha.namenodes.${nameservices}":"{List of NameNodes to which your nameservices belong}",
          "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1}":"IP address of namenode0:Port number of namenode0",
          "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn2}":"IP address of namenode1:Port number of namenode1"
          },
          "file": "oss://{OSS directory where the JAR file is stored}"
      }
      The following list describes the parameters that are used in the preceding code.
      • spark.hadoop.hive.metastore.uris: The URIs of the Hive metastore service that you want to access. This parameter corresponds to hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the format of Domain name:Port number, but the value of spark.hadoop.hive.metastore.uris must be in the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the primary node of the cluster and check the /etc/hosts file, or ping a domain name on the primary node. You can also obtain the mapping from Step 2.
      • spark.dla.eni.vswitch.id: The ID of the vSwitch that you selected.
      • spark.dla.eni.security.group.id: The ID of the security group that you selected.
      • spark.dla.eni.enable: Specifies whether to enable an elastic network interface (ENI).
      • spark.hadoop.dfs.nameservices: Corresponds to dfs.nameservices in the hdfs-site.xml file.
      • spark.dla.job.log.oss.uri: The OSS URI that specifies the directory where SparkUI logs are saved.
      • spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}: Corresponds to dfs.client.failover.proxy.provider.${nameservices} in the hdfs-site.xml file.
      • spark.hadoop.dfs.ha.namenodes.${nameservices}: Corresponds to dfs.ha.namenodes.${nameservices} in the hdfs-site.xml file.
      • spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2}: Corresponds to dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} in the hdfs-site.xml file. The value must be in the format of IP address:Port number, whereas the value of dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} is in the format of Domain name:Port number. To view the mapping between domain names and IP addresses, log on to the primary node of the Hive cluster and check the /etc/hosts file, or ping a domain name on the primary node. You can also obtain the mapping from Step 2.

      After the job succeeds, find the job and click Log in the Operation column to view the logs of the job.
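      If the job fails with an unknown-host error on the nameservice name, a quick check is to print the dfs.* values that the driver actually received. The following is a minimal sketch under the assumption that the spark.hadoop.* parameters above are passed to the job; the class name PrintHdfsHaConf is hypothetical and not part of the DLA samples.
      package com.aliyun.spark

      import org.apache.spark.sql.SparkSession

      object PrintHdfsHaConf {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("Print HDFS HA configuration")
            .enableHiveSupport()
            .getOrCreate()

          // Keys that are passed as spark.hadoop.* arrive here with the prefix
          // stripped, so they can be read back under their hdfs-site.xml names.
          val hadoopConf = spark.sparkContext.hadoopConfiguration
          val nameservice = hadoopConf.get("dfs.nameservices")
          println(s"dfs.nameservices = $nameservice")

          Option(hadoopConf.get(s"dfs.ha.namenodes.$nameservice"))
            .toSeq
            .flatMap(_.split(","))
            .foreach { nn =>
              val key = s"dfs.namenode.rpc-address.$nameservice.$nn"
              println(s"$key = ${hadoopConf.get(key)}")
            }

          spark.stop()
        }
      }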

    • The following code snippet shows how to submit the job if you want to access a non-high-availability Hive cluster that has only one primary node or NameNode.
      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
              "spark.sql.catalogImplementation":"hive",
              "spark.dla.eni.vswitch.id": "{vSwitch ID}",
              "spark.dla.eni.security.group.id": "{Security group ID}",
              "spark.dla.eni.enable": "true",
              "spark.driver.resourceSpec": "medium",
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium",
              "spark.dla.job.log.oss.uri": "oss://<OSS URI that specifies where SparkUI logs are saved/>","
              "spark.hadoop.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
              "spark.dla.eni.extra.hosts":"${ip0} ${hostname_0} ${hostname_1} ${hostname_n}"
          },
          "file": "oss://{OSS directory where the JAR file is stored}"
      }
      The following list describes the parameters that are used in the preceding code.
      • spark.hadoop.hive.metastore.uris: The URIs of the Hive metastore service that you want to access. This parameter corresponds to hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the format of Domain name:Port number, but the value of spark.hadoop.hive.metastore.uris must be in the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the primary node of the cluster and check the /etc/hosts file, or ping a domain name on the primary node. You can also obtain the mapping from Step 2.
      • spark.dla.job.log.oss.uri: The OSS URI that specifies the directory where SparkUI logs are saved.
      • spark.dla.eni.vswitch.id: The ID of the vSwitch that you selected.
      • spark.dla.eni.security.group.id: The ID of the security group that you selected.
      • spark.dla.eni.enable: Specifies whether to enable an ENI.
      • spark.dla.eni.extra.hosts: The mapping between IP addresses and hostnames. This mapping allows the serverless Spark engine of DLA to correctly resolve domain names when it parses the location of the Hive table.
        Notice A space is required between an IP address and a domain name. Separate multiple pairs of IP addresses and domain names with commas (,), for example, ip0 master0,ip1 master1.
        You can obtain the hostnames from fs.defaultFS in the core-site.xml file under ${HIVE_CONF_DIR} of the Hive cluster. For example, if the value of fs.defaultFS is hdfs://master-1:9000, you must set spark.dla.eni.extra.hosts to ${IP address of master-1} master-1. To view the mapping between domain names and IP addresses, log on to the primary node of your self-managed Hive cluster and check the /etc/hosts file. You can also obtain the mapping from Step 2.
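      If reads from the table location still fail with an unknown-host error after you set spark.dla.eni.extra.hosts, you can check whether the driver resolves the hostnames from fs.defaultFS with a lookup like the following sketch. The class name CheckHostResolution and the hostname master-1 are placeholders, not part of the DLA samples.
      package com.aliyun.spark

      import java.net.InetAddress

      object CheckHostResolution {
        def main(args: Array[String]): Unit = {
          // Hostnames taken from fs.defaultFS or hive.metastore.uris; master-1 is a placeholder.
          val hosts = if (args.nonEmpty) args.toSeq else Seq("master-1")

          hosts.foreach { host =>
            try {
              val address = InetAddress.getByName(host)
              println(s"$host resolves to ${address.getHostAddress}")
            } catch {
              case _: java.net.UnknownHostException =>
                println(s"$host cannot be resolved; check spark.dla.eni.extra.hosts")
            }
          }
        }
      }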