Data Lake Analytics - Deprecated: Access Hive clusters

Last Updated: Feb 19, 2024

This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Hive clusters in your virtual private cloud (VPC).

Important

DLA is discontinued. AnalyticDB for MySQL Data Lakehouse Edition supports the features of DLA and provides more features and better performance. For more information about how to access Hive clusters from AnalyticDB for MySQL Spark, see Access a Hive data source.

Prerequisites

  • DLA is activated. For more information, see Activate DLA.

  • A Spark virtual cluster is created in the DLA console. For more information, see Create a virtual cluster.

  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.

  • The vSwitch ID and security group ID that are required for creating a Spark compute node are obtained. You can select an existing vSwitch and security group or create new ones. For a command-line sketch that lists the vSwitches and security groups in a VPC, see the example after this list. The vSwitch and security group that you select must meet the following conditions:

    • The vSwitch must be in the same VPC as the Hive cluster that you want to access. You can use the ID of the vSwitch in which the cluster is deployed. To obtain the vSwitch ID, go to the details page of the cluster in the Alibaba Cloud Management Console.

    • The security group must be in the same VPC as the Hive cluster that you want to access. To obtain the ID of a security group, go to the Elastic Compute Service (ECS) console. In the left-side navigation pane, choose Network & Security > Security Groups. On the Security Group page, enter the VPC ID in the search box to search for the security groups associated with the VPC and select the ID of a security group.

    • If a whitelist is configured for the Hive cluster for access control, you must add the CIDR block of the vSwitch to the whitelist.
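
    If you prefer the command line, the following sketch shows one possible way to list the vSwitches and security groups in a VPC by using the Alibaba Cloud CLI. The region ID and VPC ID are placeholders that you must replace with your own values.

      # List the vSwitches in the VPC. Replace the region ID and VPC ID with your actual values.
      aliyun vpc DescribeVSwitches --RegionId cn-hangzhou --VpcId vpc-xxxxxxxx
      # List the security groups that are associated with the VPC.
      aliyun ecs DescribeSecurityGroups --RegionId cn-hangzhou --VpcId vpc-xxxxxxxx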

Procedure

  1. If your Hive metadata is stored in an independent ApsaraDB RDS database and table data is stored in OSS, you can use the configurations in the following code and skip subsequent steps. Otherwise, proceed with the following steps.

    {
        "name": "spark-on-hive",
        "className": "com.aliyun.spark.SparkHive", # The entry class of the test code. You can modify this parameter based on your business requirements.
        "jars": [
            "oss://path/to/mysql-connector-java-5.1.47.jar"
        ],
        "conf": {
            "spark.dla.eni.vswitch.id": "{vSwitch ID}",    
            "spark.dla.eni.security.group.id": "{Security group ID}",
            "spark.dla.eni.enable": "true",
            "spark.driver.resourceSpec": "medium",
            "spark.dla.connectors": "oss",
            "spark.executor.instances": 1,
            "spark.sql.catalogImplementation": "hive",
            "spark.executor.resourceSpec": "medium",
            "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
            "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_user_name>", # The username that is used to log on to the ApsaraDB RDS database in which Hive metadata is stored. 
            "spark.hadoop.javax.jdo.option.ConnectionPassword": "<your_pass_word>", # The password that is used to log on to the ApsaraDB RDS database in which Hive metadata is stored.
            "spark.hadoop.javax.jdo.option.ConnectionURL": "<Java Database Connectivity (JDBC) URL>", # The JDBC URL of the ApsaraDB RDS database in which Hive metadata is stored.
            "spark.dla.job.log.oss.uri": "<OSS path in which logs are stored>"
        },
        "file": "<oss://Directory in which the JAR file of the main resource is stored>"
    }
    Note

    The JAR file specified in the jars parameter is the MySQL JDBC driver. You can download it from the official Maven repository and upload it to OSS, as shown in the following sketch.
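
    For example, the following commands show one possible way to download the driver and upload it to OSS by using the ossutil tool. The OSS bucket and path are placeholders that you must replace with your own values.

      # Download the MySQL JDBC driver from Maven Central. The version matches the jars parameter in the preceding code.
      wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar
      # Upload the driver to your OSS bucket. Replace the bucket name and path with your actual values.
      ossutil cp mysql-connector-java-5.1.47.jar oss://your-bucket/path/to/mysql-connector-java-5.1.47.jar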

  2. Obtain the required parameter configurations of the Hive cluster to configure the serverless Spark engine of DLA.

    Note

    If you cannot run Spark jobs in the Hive cluster, skip this step.

    To obtain the parameter configurations of the Hive cluster that you want to access, run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar file. Then, submit a Spark job that runs this JAR file to the Hive cluster. The job prints the parameter configurations of the cluster.

    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • If you want to access an E-MapReduce (EMR) cluster, you can upload the JAR file to OSS and run the following command to submit a job to the EMR cluster:

      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job is complete, you can view the configurations from the stdout output of the driver on the SparkUI or from the logs on the job details page.

    • If you want to access an ApsaraDB for HBase cluster of X-Pack Spark, you can upload the JAR file to the resource management directory and run the following command to submit a job to the cluster:

      --class com.aliyun.spark.util.GetConfForServerlessSpark
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hive hadoop

      After the job is complete, you can view the configurations from the stdout output of the driver on the SparkUI.

    • If you want to access a Hive cluster of another type and the HIVE_CONF_DIR environment variable is not specified in the cluster, you must manually specify the directory by using the --hive-conf-dir option, as shown in the following command (a complete spark-submit sketch follows this list).

      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hive-conf-dir </path/to/your/hive/conf/dir> hive hadoop
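
    The preceding commands list only the class, deploy mode, JAR file, and job arguments. If you submit the job directly from a node of a self-managed cluster, a complete spark-submit invocation might look like the following sketch. The JAR path and the Hive configuration directory are placeholders that depend on your environment.

      # Minimal sketch: submit the configuration-collection job from a node of the cluster.
      # Replace the JAR path and the Hive configuration directory with your actual paths.
      spark-submit \
        --class com.aliyun.spark.util.GetConfForServerlessSpark \
        --deploy-mode client \
        /path/to/spark-examples-0.0.1-SNAPSHOT-shaded.jar \
        get --hive-conf-dir /etc/hive/conf hive hadoop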
  3. Write the code in the file of a Spark application to access the Hive cluster.

    In the following sample code, a table whose name is passed as a job argument is created in the default namespace. The table contains a single STRING column named welcome_col with one row whose value is hello, dla-spark. The data is then read from the table and displayed in the stdout output.

    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()
    
        val welcome = "hello, dla-spark"
    
        // The name of the Hive table.
        val tableName = args(0)
    
        import sparkSession.implicits._
        // Create a DataFrame that contains one row and one column named welcome_col, and save it to the Hive table specified by tableName.
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)
    
        // Read the data back from the Hive table.
        val dfFromHive = sparkSession.sql(
          s"""
            |select * from $tableName
            |""".stripMargin)
        dfFromHive.show(10)
      }
    }
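
    Before you upload the application in the next step, package it into a JAR file. The following sketch assumes an sbt-based project; the build tool, project layout, and Scala version are assumptions that depend on your environment.

      # Package the Scala application into a JAR file, assuming an sbt project.
      sbt package
      # The resulting JAR file is typically placed under target/scala-<version>/, depending on your build configuration.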
  4. Upload the JAR file and dependencies of the Spark application to OSS.

    For more information, see Upload objects. For a command-line sketch that uses ossutil, see the example after the following note.

    Note

    OSS and the serverless Spark engine of DLA must be deployed in the same region.
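
    For example, if you use the ossutil command-line tool, the upload might look like the following sketch. The local file names and the OSS bucket and directory are placeholders that you must replace with your own values.

      # Upload the application JAR file to OSS with ossutil.
      ossutil cp spark-hive-test.jar oss://your-bucket/jars/spark-hive-test.jar
      # Upload a directory of dependency JAR files recursively, if you have any.
      ossutil cp -r deps/ oss://your-bucket/jars/deps/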

  5. Submit a job in the serverless Spark engine of DLA and perform data computing.

    • The following sample code provides an example of how to access a high-availability Hive cluster that has more than one master node or NameNode. For more information, see Create and run Spark jobs and Configure a Spark job.

      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
          "spark.sql.catalogImplementation":"hive",
          "spark.dla.eni.vswitch.id": "{vSwitch ID}",
          "spark.dla.eni.security.group.id": "{Security group ID}",
          "spark.dla.eni.enable": "true",
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium",
          "spark.dla.job.log.oss.uri": "oss://<Directory in which your SparkUI logs are stored/>",
          "spark.hadoop.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
          "spark.hadoop.dfs.nameservices":"{Names of your nameservices}",
          "spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}":"{Full path name of the implementation class of the failover proxy provider}",
          "spark.hadoop.dfs.ha.namenodes.${nameservices}":"{NameNodes to which your nameservices belong}",
          "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1}":"IP address of NameNode 0:Port number of NameNode 0",
          "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn2}":"IP address of NameNode 1:Port number of NameNode 1"
          },
          "file": "oss://{Directory in which your JAR file is stored}"
      }

      The following list describes the parameters.

      • spark.hadoop.hive.metastore.uris: The URIs of the Hive metastore service that you want to access. These URIs correspond to the value of hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. That value is in the format of Domain name:Port number, and you must convert it to the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the master node of the cluster and view the /etc/hosts file, or ping a domain name on the master node. You can also obtain the mapping by performing the operations in Step 2 or by using the commands in the sketch that follows this parameter list.

      • spark.dla.eni.vswitch.id: The ID of the vSwitch that you select.

      • spark.dla.eni.security.group.id: The ID of the security group that you select.

      • spark.dla.eni.enable: Specifies whether to enable an elastic network interface (ENI).

      • spark.hadoop.dfs.nameservices: Corresponds to the dfs.nameservices parameter in the hdfs-site.xml file.

      • spark.dla.job.log.oss.uri: The OSS path in which your SparkUI logs are stored.

      • spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}: Corresponds to the dfs.client.failover.proxy.provider.${nameservices} parameter in the hdfs-site.xml file.

      • spark.hadoop.dfs.ha.namenodes.${nameservices}: Corresponds to the dfs.ha.namenodes.${nameservices} parameter in the hdfs-site.xml file.

      • spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2}: Corresponds to the dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} parameter in the hdfs-site.xml file. The value must be in the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the master node of the Hive cluster and view the /etc/hosts file, or ping a domain name on the master node. You can also obtain the mapping by performing the operations in Step 2.

      After the job is complete, find the job and click Log in the Operation column to view the logs of the job.
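
      To convert a domain name from hive.metastore.uris or from the hdfs-site.xml file to an IP address, you can, for example, run the following commands on the master node of the Hive cluster. The hostname emr-header-1 is a placeholder for illustration.

        # Run on the master node of the Hive cluster. The hostname is an example placeholder.
        cat /etc/hosts                  # View the full mapping between IP addresses and hostnames.
        getent hosts emr-header-1       # Resolve a single hostname to its IP address.
        ping -c 1 emr-header-1          # Alternatively, ping the hostname and read the IP address from the output.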

    • The following sample code provides an example of how to access a non-high-availability Hive cluster that has only one master node or NameNode:

      {
          "args": [
              "hello_dla"
          ],
          "name": "spark-on-hive",
          "className": "com.aliyun.spark.SparkHive",
          "conf": {
              "spark.sql.catalogImplementation":"hive",
              "spark.dla.eni.vswitch.id": "{vSwitch ID}",
              "spark.dla.eni.security.group.id": "{Security group ID}",
              "spark.dla.eni.enable": "true",
              "spark.driver.resourceSpec": "medium",
              "spark.executor.instances": 1,
              "spark.executor.resourceSpec": "medium",
              "spark.dla.job.log.oss.uri": "oss://<Directory in which your SparkUI logs are stored/>","
              "spark.hadoop.hive.metastore.uris":"thrift://${ip}:${port},thrift://${ip}:${port}",
              "spark.dla.eni.extra.hosts":"${ip0} ${hostname_0} ${hostname_1} ${hostname_n}"
          },
          "file": "oss://{Directory in which your JAR file is stored}"
      }

      The following list describes the parameters.

      • spark.hadoop.hive.metastore.uris: The URIs of the Hive metastore service that you want to access. These URIs correspond to the value of hive.metastore.uris in the ${HIVE_CONF_DIR}/hive-site.xml file. That value is in the format of Domain name:Port number, and you must convert it to the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the master node of the cluster and view the /etc/hosts file, or ping a domain name on the master node. You can also obtain the mapping by performing the operations in Step 2.

      • spark.dla.job.log.oss.uri: The OSS path in which your SparkUI logs are stored.

      • spark.dla.eni.vswitch.id: The ID of the vSwitch that you select.

      • spark.dla.eni.security.group.id: The ID of the security group that you select.

      • spark.dla.eni.enable: Specifies whether to enable an ENI.

      • spark.dla.eni.extra.hosts: The mapping between IP addresses and hostnames. The serverless Spark engine of DLA uses this mapping to resolve domain names when it parses the location of the Hive table. Important: Separate an IP address from its hostnames with spaces, and separate groups of IP addresses and hostnames with commas (,). Example: ip0 master0, ip1 master1. You can obtain the value of this parameter from the fs.defaultFS property in the core-site.xml file under the ${HIVE_CONF_DIR} directory of the Hive cluster. For example, if the value of fs.defaultFS is hdfs://master-1:9000, set this parameter to ${IP address of master-1} master-1. To view the mapping between domain names and IP addresses, log on to the master node of your self-managed Hive cluster and view the /etc/hosts file. You can also obtain the mapping by performing the operations in Step 2 or by using the commands in the sketch that follows this parameter list.
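
      For example, the following commands show one possible way to derive the value of spark.dla.eni.extra.hosts. The configuration file path and the hostname master-1 are placeholders for illustration.

        # Run on the master node of the self-managed Hive cluster. Paths and hostnames are example placeholders.
        # 1. Find the HDFS endpoint hostname from fs.defaultFS in core-site.xml.
        grep -A 1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml    # for example: <value>hdfs://master-1:9000</value>
        # 2. Resolve the hostname to an IP address.
        getent hosts master-1                                      # for example: 192.168.0.10   master-1
        # 3. Set spark.dla.eni.extra.hosts to the resolved pair, for example: "192.168.0.10 master-1".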