
Access a Hive data source

Last Updated: Sep 10, 2024

AnalyticDB for MySQL Spark allows you to access Hive data sources over the Thrift or Java Database Connectivity (JDBC) protocol. For security purposes, you can enable the Kerberos authentication feature for Hive data sources so that only Kerberos-authenticated clients can access the Hive service. This topic uses the Hive service of an E-MapReduce (EMR) cluster as an example to describe how to access a Hive data source.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A database account is created for the AnalyticDB for MySQL cluster.

  • The AnalyticDB for MySQL cluster resides in the same region as an Object Storage Service (OSS) bucket.

  • An EMR cluster is created in the same region as the AnalyticDB for MySQL cluster. For more information, see Create a cluster.

  • The EMR cluster meets the following requirements:

    • The resource form of the EMR cluster is EMR on ECS.

    • The business scenario of the EMR cluster is set to Data Lake. The Hadoop-Common, Hadoop Distributed File System (HDFS), YARN, and Hive services are selected for the cluster. The metadata of the EMR cluster is stored in a self-managed ApsaraDB RDS database or a built-in MySQL database.

    Important

    To access a Kerberos-encrypted Hive data source, make sure that the Kerberos authentication feature is enabled for the EMR cluster.

Preparations

  1. Download the JAR package that AnalyticDB for MySQL Spark uses to connect to Hive data sources. For more information, see MySQL Connector Java.

  2. (Conditionally required) If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, log on to the master node of the EMR cluster, go to the /opt/apps/HIVE/hive-current/lib directory, and then download all JAR packages. For information about how to log on to the master node of an EMR cluster, see Log on to a cluster.

  3. (Conditionally required) If the Kerberos authentication feature is enabled for the EMR cluster, download the configuration files required for Kerberos authentication.

    1. Log on to the EMR cluster. For more information, see the "Log on to the master node of the cluster" section of the Log on to a cluster topic.

    2. Download the krb5.conf file. For more information, see the "Configuration files" section of the Basic operations on Kerberos topic.

    3. Download the hadoop.keytab, core-site.xml, and hdfs-site.xml files. Perform the following steps:

      1. Run the following command to find the environment variables and configuration information related to Hadoop:

        env | grep hadoop

        Sample result:

        HADOOP_HOME=/opt/apps/HADOOP-COMMON/hadoop-common-current/
        HADOOP_CONF_DIR=/etc/taihao-apps/hadoop-conf
        PATH=/opt/apps/JINDOSDK/jindosdk-current/bin:/opt/apps/HADOOP-COMMON/hadoop-common-current/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/apps/HIVE/hive-current/bin:/opt/apps/SPARK-EXTENSION/spark-extension-current/bin:/opt/apps/SPARK3/spark-current/bin:/root/bin
      2. Go to the directory indicated by the HADOOP_CONF_DIR environment variable and download the hadoop.keytab, core-site.xml, and hdfs-site.xml files.

    4. Run the following command to query the principal:

      listprincs

      In the returned information, the string whose prefix is hadoop/master is the required principal.

  4. Upload the JAR packages and configuration files that you obtained in Steps 1 to 3 to OSS. For more information, see Simple upload.
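
    The following is a minimal sketch of the download and upload steps above, run from a machine that can reach the EMR master node and has the ossutil tool configured. The host name, bucket name, and local directories are placeholders, and the remote paths follow the directories mentioned above; adjust them to your environment.

    # (Conditionally required) Copy the Hive library JAR packages from the EMR master node.
    scp -r root@<emr-master-ip>:/opt/apps/HIVE/hive-current/lib ./hive_lib

    # (Conditionally required) Copy the Kerberos-related files from the EMR master node.
    # krb5.conf is typically located at /etc/krb5.conf, and the other files are located in
    # the directory indicated by the HADOOP_CONF_DIR environment variable (see the sample output above).
    mkdir -p ./hadoop
    scp root@<emr-master-ip>:/etc/krb5.conf ./hadoop/
    scp root@<emr-master-ip>:/etc/taihao-apps/hadoop-conf/{hadoop.keytab,core-site.xml,hdfs-site.xml} ./hadoop/

    # Upload the files to OSS. mysql-connector-java.jar is assumed to be in the current directory.
    ossutil cp mysql-connector-java.jar oss://<bucket_name>/mysql-connector-java.jar
    ossutil cp -r ./hive_lib oss://<bucket_name>/hive_lib/
    ossutil cp -r ./hadoop oss://<bucket_name>/hadoop/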

Use Spark JAR to access a Hive data source

  1. Write a program that is used to access the Hive data source and compile the program into a JAR package. In this example, the JAR package is named hive_test.jar. Upload the JAR package to OSS. For more information, see Simple upload. Sample code:

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()

        val welcome = "hello, adb-spark"

        // The name of the Hive table. 
        val tableName = args(0)

        import sparkSession.implicits._
        // Save a DataFrame named df that contains only one row and one column of data to the Hive table. The table name is the value of the tableName parameter, and the column name is welcome_col. 
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)

        // Read data from the Hive table whose name is the value of the tableName parameter. 
        val dfFromHive = sparkSession.sql(
          s"""
             |select * from $tableName
             |""".stripMargin)
        dfFromHive.show(10)
      }
    }
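
    A minimal sketch of compiling the sample program into hive_test.jar and uploading it to OSS, assuming an sbt project in which the source above is saved as src/main/scala/com/aliyun/spark/SparkHive.scala and the Spark SQL and Hive dependencies are declared as provided; the Scala version, artifact path, and bucket name are placeholders.

    # Package the program into a JAR file (the sbt project layout and versions are assumptions).
    sbt package

    # Rename the artifact and upload it to OSS. The Scala version in the path is a placeholder.
    cp target/scala-2.12/*.jar hive_test.jar
    ossutil cp hive_test.jar oss://<bucket_name>/hive_test.jar
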
  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  4. Select a job resource group and the Batch job type for the Spark job.

    Access a Hive data source

    You can connect to a metadatabase over different protocols.

    Connect to a metadatabase over Thrift

    {
      "args": [
        "hello_adb"
      ],
      "jars": [
        # The OSS path of the JAR package that is used to connect to Hive data sources from AnalyticDB for MySQL Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The OSS path of the JAR packages that are required by the Hive service of the EMR cluster. 
        "oss://<bucket_name>/hive_lib/*" 
      ],
      "file": "oss://<bucket_name>/hive_test.jar" ,
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive",
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",    
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.hadoop.hive.metastore.uris":"thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083",
        "spark.adb.eni.extraHosts":"172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com", 
        # Configure the following parameters when the EMR cluster is a high-availability cluster: 
        "spark.hadoop.dfs.nameservices":"<Name of the HDFS service>",
        "spark.hadoop.dfs.client.failover.proxy.provider.<Name of the HDFS service>":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "spark.hadoop.dfs.ha.namenodes.<Name of the HDFS service>":"<Name of the NameNode>",
        "spark.hadoop.dfs.namenode.rpc-address.<Name of the HDFS service>.<Name of the NameNode>":"master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000",
        # If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual Hive version>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }

    Parameters:

    Note

    The Required column in the following table applies only to the use case in this topic.

    Parameter

    Required

    Description

    args

    Yes

    The arguments that are required for the use of the JAR packages. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    name

    Yes

    The name of the Spark job.

    className

    Yes

    The entry class of the Java or Scala application. The entry class is not required for a Python application.

    jars

    Yes

    The OSS path of the JAR packages that are used to run the Spark job.

    file

    Yes

    The OSS path of the hive_test.jar package.

    spark.adb.eni.enabled

    Yes

    Specifies whether to enable ENI. A value of true specifies that ENI is enabled.

    spark.adb.eni.adbHostAlias.enabled

    Yes

    Specifies whether to resolve the domain name of the Hive table. Set this parameter to true.

    spark.adb.eni.vswitchId

    Yes

    The vSwitch ID of the EMR cluster.

    To view the vSwitch ID, you can log on to the VPC console and go to the Resource Management page of the VPC in which the EMR cluster resides.

    spark.adb.eni.securityGroupId

    Yes

    The ID of the security group to which the EMR cluster belongs. To view the security group ID, you can go to the Basic Information page of the EMR cluster.

    spark.adb.eni.extraHosts

    No

    The mapping between the IP address and the domain name of the Hive data source. This is required for Spark to correctly resolve the domain name of the Hive table.

    • To obtain the domain name, view the value of the fs.defaultFS parameter in the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster. For example, if the value of the fs.defaultFS parameter is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the domain name is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

    • To obtain the IP address, connect to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

    Note
    • This parameter is required if the data source is a self-managed Hive cluster and its domain name cannot be resolved. This parameter is optional if the data source is an EMR cluster that uses DNS to resolve the domain name.

    • Separate IP addresses and domain names with spaces. Separate multiple groups of IP addresses and domain names with commas (,). Example: "ip0 master0, ip1 master1".

    spark.hadoop.hive.metastore.uris

    Yes

    The URI of the Hive metastore.

    To view the URI, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the hive.metastore.uris parameter.

    spark.hadoop.dfs.nameservices

    No

    The name of the HDFS service.

    To obtain the name, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HDFS and view the value of the dfs.nameservices parameter.

    Note

    If the EMR cluster is a high-availability cluster, you must specify this parameter.

    spark.hadoop.dfs.client.failover.proxy.provider.<Name of the HDFS service>

    No

    The provider that the client uses to connect to the NameNode.

    To obtain the provider name, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HDFS and view the value of the dfs.client.failover.proxy.provider.<Name of the HDFS service> in the HDFS service section. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

    Note

    If the EMR cluster is a high-availability cluster, you must specify this parameter.

    spark.hadoop.dfs.ha.namenodes.<Name of the HDFS service>

    No

    The name of the NameNode.

    To obtain the name, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HDFS and view the value of the dfs.ha.namenodes.<Name of the HDFS service> parameter.

    Note

    If the EMR cluster is a high-availability cluster, you must specify this parameter.

    spark.hadoop.dfs.namenode.rpc-address.<Name of the HDFS service>.<Name of the NameNode>

    No

    The address that is used by the NameNode for remote procedure calls (RPCs).

    To obtain the address, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HDFS and view the value of the dfs.namenode.rpc-address.<Name of the HDFS service>.<Name of the NameNode> parameter.

    Note

    If the EMR cluster is a high-availability cluster, you must specify this parameter.

    spark.sql.hive.metastore.jars

    No

    Set the value to path.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.version

    No

    The version number of the Hive service of the EMR cluster.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.jars.path

    No

    Set the value to /tmp/*/*.jar.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    conf

    Yes

    The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.

    Connect to a metadatabase over JDBC

    {
      "args": [
        "hello_adb"
      ],
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive", 
      "jars": [
        # The OSS path of the JAR package that is used to connect to Hive data sources from AnalyticDB for MySQL Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The OSS path of the JAR packages that are required by the Hive service of the EMR cluster. 
        "oss://<bucket_name>/hive_lib/*" 
      ],
      "file": "oss://<bucket_name>/hive_test.jar" ,
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",    
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.extraHosts":"172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com", 
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_username>",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "<hive_password>",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/<Database name>", 
        # If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual Hive version>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }

    Parameters:

    Note

    The Required column in the following table applies only to the use case in this topic.

    Parameter

    Required

    Description

    args

    Yes

    The arguments that are required for the use of the JAR packages. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    name

    Yes

    The name of the Spark job.

    className

    Yes

    The entry class of the Java or Scala application. The entry class is not required for a Python application.

    jars

    Yes

    The OSS path of the JAR packages that are used to run the Spark job.

    file

    Yes

    The OSS path of the hive_test.jar package.

    spark.adb.eni.enabled

    Yes

    Specifies whether to enable ENI. A value of true specifies that ENI is enabled.

    spark.adb.eni.vswitchId

    Yes

    The vSwitch ID of the EMR cluster.

    To view the vSwitch ID, you can log on to the VPC console and go to the Resource Management page of the VPC in which the EMR cluster resides.

    spark.adb.eni.securityGroupId

    Yes

    The ID of the security group to which the EMR cluster belongs.

    To view the security group ID, you can go to the Basic Information page of the EMR cluster.

    spark.adb.eni.adbHostAlias.enabled

    Yes

    Specifies whether to resolve the domain name of the Hive table. Set this parameter to true.

    spark.adb.eni.extraHosts

    No

    The mapping between the IP address and the domain name of the Hive data source. This is required for Spark to correctly resolve the domain name of the Hive table.

    • To obtain the domain name, view the value of the fs.defaultFS parameter in the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster. For example, if the value of the fs.defaultFS parameter is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the domain name is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

    • To obtain the IP address, connect to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

    Note
    • This parameter is required if the data source is a self-managed Hive cluster and its domain name cannot be resolved. This parameter is optional if the data source is an EMR cluster that uses DNS to resolve the domain name.

    • Separate IP addresses and domain names with spaces. Separate multiple groups of IP addresses and domain names with commas (,). Example: "ip0 master0, ip1 master1".

    spark.hadoop.javax.jdo.option.ConnectionDriverName

    Yes

    The name of the connection driver.

    To view the name, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionDriverName parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionUserName

    Yes

    The name of the database account of the self-managed ApsaraDB RDS database or built-in MySQL database.

    To view the database account, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionUserName parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionPassword

    Yes

    The password of the database account.

    To view the password, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionPassword parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionURL

    Yes

    The connection URL of the database that contains the database name.

    Specify the connection URL in the jdbc:mysql://rm-xxxxxx.mysql.rds.aliyuncs.com/<Database name> format.

    To view the connection URL, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionURL parameter on the hivemetastore-site tab.

    spark.sql.hive.metastore.jars

    No

    Set the value to path.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.version

    No

    The version number of the Hive service of the EMR cluster.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.jars.path

    No

    Set the value to /tmp/*/*.jar.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    conf

    Yes

    The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.

    Access a Kerberos-encrypted Hive data source

    {
      "args": ["hello_adb"],
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive", 
      "jars": [
        # The OSS path of the JAR package that is used to connect to Hive data sources from AnalyticDB for MySQL Spark. 
        "oss://testBucketname/mysql-connector-java.jar",
        # The OSS path of the JAR packages that are required by the Hive service of the EMR cluster. Specify this parameter if the Hive version of the EMR cluster is incompatible with the Hive version of AnalyticDB for MySQL. 
        "oss://testBucketname/hive_lib/*" ],
      "file": "oss://testBucketname/hive_test.jar" ,
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",    
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.extraHosts":"172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com", 
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "hive_username",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "hive_password",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com/hivemeta?createDatabaseIfNotExist=true&characterEncoding=UTF-8", 
        # If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual Hive version>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
        # Kerberos-related parameters. 
        "spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES": "oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml",
        "spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES": "oss://testBucketname/hadoop/krb5.conf, oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml",
        "spark.kubernetes.driverEnv.HADOOP_CONF_DIR": "/tmp/testBucketname/hadoop",
        "spark.executorEnv.HADOOP_CONF_DIR": "/tmp/testBucketname/hadoop",
        "spark.kerberos.keytab": "local:///tmp/testBucketname/hadoop/hadoop.keytab",
        "spark.executor.extraJavaOptions":"-Djava.security.krb5.conf=/tmp/testBucketname/hadoop/krb5.conf",
        "spark.kubernetes.kerberos.krb5.path": "oss://testBucketname/hadoop/krb5.conf",
        "spark.kerberos.principal":"hadoop/master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com@EMR.C-49F95900****.COM"
      }
    }

    Parameters:

    Parameter

    Required

    Description

    args

    Yes

    The arguments that are required for the use of the JAR packages. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    className

    Yes

    The entry class of the Java or Scala application.

    jars

    Yes

    The OSS path of the JAR packages that are used to run the Spark job.

    file

    Yes

    The OSS path of the hive_test.jar package.

    spark.adb.eni.enabled

    Yes

    Specifies whether to enable Elastic Network Interface (ENI).

    If you want to access a Kerberos-encrypted Hive data source, set this parameter to true.

    spark.adb.eni.vswitchId

    Yes

    The vSwitch ID of the EMR cluster.

    To view the vSwitch ID, you can log on to the VPC console and go to the Resource Management page of the VPC in which the EMR cluster resides.

    spark.adb.eni.securityGroupId

    Yes

    The ID of the security group to which the EMR cluster belongs. To view the security group ID, you can go to the Basic Information page of the EMR cluster.

    spark.adb.eni.adbHostAlias.enabled

    Yes

    Specifies whether to resolve the domain name of the Hive table. Set this parameter to true.

    spark.sql.catalogImplementation

    Yes

    If you want to access a Hive data source, set this parameter to hive.

    spark.hadoop.javax.jdo.option.ConnectionDriverName

    Yes

    The name of the connection driver.

    To view the name, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionDriverName parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionUserName

    Yes

    The name of the database account of the self-managed ApsaraDB RDS database or built-in MySQL database.

    To view the database account, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionUserName parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionPassword

    Yes

    The password of the database account.

    To view the password, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionPassword parameter on the hivemetastore-site tab.

    spark.hadoop.javax.jdo.option.ConnectionURL

    Yes

    The connection URL of the database that contains the database name.

    To view the connection URL, you can log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Services tab, click Configure to the right of HIVE and view the value of the javax.jdo.option.ConnectionURL parameter on the hivemetastore-site tab.

    spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES

    Yes

    The OSS path of the hadoop.keytab, core-site.xml, and hdfs-site.xml files, which is a Spark driver parameter.

    spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES

    Yes

    The OSS path of the krb5.conf, hadoop.keytab, core-site.xml, and hdfs-site.xml files, which is a Spark executor parameter.

    spark.kubernetes.driverEnv.HADOOP_CONF_DIR

    Yes

    The on-premises directory in which the hadoop.keytab, core-site.xml, and hdfs-site.xml files are stored, which is a Spark driver parameter. Format: /tmp/<OSS directory in which files are stored>.

    For example, if the hadoop.keytab, core-site.xml, and hdfs-site.xml files are stored in the oss://testBucketname/hadoop directory, the on-premises directory of the files is /tmp/testBucketname/hadoop.

    spark.executorEnv.HADOOP_CONF_DIR

    Yes

    The on-premises directory in which the hadoop.keytab, core-site.xml, and hdfs-site.xml files are stored, which is a Spark executor parameter. Format: /tmp/<OSS directory in which files are stored>.

    For example, if the hadoop.keytab, core-site.xml, and hdfs-site.xml files are stored in the oss://testBucketname/hadoop directory, the on-premises directory of the files is /tmp/testBucketname/hadoop.

    spark.kerberos.keytab

    Yes

    The on-premises path of the hadoop.keytab file. Format: local:///tmp/<OSS path of the hadoop.keytab file>.

    For example, if the OSS path of the hadoop.keytab file is oss://testBucketname/hadoop/hadoop.keytab, the on-premises path of the file is local:///tmp/testBucketname/hadoop/hadoop.keytab.

    spark.executor.extraJavaOptions

    Yes

    The on-premises path of the krb5.conf file, which is a Spark executor parameter. Format: /tmp/<OSS path of the krb5.conf file>.

    For example, if the OSS path of the krb5.conf file is oss://testBucketname/hadoop/krb5.conf, the on-premises path of the file is /tmp/testBucketname/hadoop/krb5.conf.

    spark.kubernetes.kerberos.krb5.path

    Yes

    The OSS path of the krb5.conf file.

    spark.kerberos.principal

    Yes

    The principal that you obtained during preparations. You can also confirm it from the hadoop.keytab file, as shown in the sketch after this table.

    spark.adb.eni.extraHosts

    No

    The mapping between the IP address and the domain name of the Hive data source. This is required for Spark to correctly resolve the domain name of the Hive table.

    • To obtain the domain name, view the value of the fs.defaultFS parameter in the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster. For example, if the value of the fs.defaultFS parameter is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the domain name is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

    • To obtain the IP address, connect to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

    Note
    • This parameter is required if the data source is a self-managed Hive cluster and its domain name cannot be resolved. This parameter is optional if the data source is an EMR cluster that uses DNS to resolve the domain name.

    • Separate IP addresses and domain names with spaces. Separate multiple groups of IP addresses and domain names with commas (,). Example: "ip0 master0, ip1 master1".

    spark.sql.hive.metastore.jars

    No

    Set the value to path.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.version

    No

    The version number of the Hive service of the EMR cluster.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    spark.sql.hive.metastore.jars.path

    No

    Set the value to /tmp/*/*.jar.

    Note

    If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, you must specify this parameter.

    conf

    No

    The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.
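
    If you are unsure which principal to specify for spark.kerberos.principal, you can list the entries in the hadoop.keytab file that you downloaded during preparations. This is a hedged sketch that uses the klist tool from the MIT Kerberos client utilities, which may need to be installed separately.

    # List the principals contained in the keytab file that was downloaded from the EMR cluster.
    klist -kt ./hadoop/hadoop.keytab

    # In the output, use the entry that starts with hadoop/master as the value of spark.kerberos.principal.
    # Example (placeholder): hadoop/master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com@EMR.C-49F95900****.COM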

Use Spark SQL to access a Hive data source

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > SQL Development.

  3. Select the Spark engine and a job resource group. Write an SQL job and click Run Now.

    Access a Hive data source

    You can connect to a metadatabase over different protocols.

    Connect to a metadatabase over Thrift

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****; 
    SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.driver.resourceSpec=medium;
    SET spark.executor.instances=1;
    SET spark.executor.resourceSpec=medium;
    SET spark.sql.catalogImplementation=hive;
    SET spark.hadoop.hive.metastore.uris=thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083;
    SET spark.sql.hive.metastore.version=2.3.9.adb;
    -- Configure the following parameters when the EMR cluster is a high-availability cluster: 
    SET spark.hadoop.dfs.nameservices=<Name of the HDFS service>;
    SET spark.hadoop.dfs.client.failover.proxy.provider.<Name of the HDFS service>=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
    SET spark.hadoop.dfs.ha.namenodes.<Name of the HDFS service>=<Name of the NameNode>;
    SET spark.hadoop.dfs.namenode.rpc-address.<Name of the HDFS service>.<Name of the NameNode>=master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000;
    -- If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
    SET spark.sql.hive.metastore.jars=path;
    SET spark.sql.hive.metastore.version=<Actual Hive version>;
    SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;
    -- The SQL statement. 
    SHOW databases;

    For more information, see the "Parameters" section of this topic.

    Connect to a metadatabase over JDBC

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****; 
    SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.driver.resourceSpec=medium;
    SET spark.executor.instances=1;
    SET spark.executor.resourceSpec=medium;
    SET spark.sql.catalogImplementation=hive;
    SET spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver;
    SET spark.hadoop.javax.jdo.option.ConnectionUserName=hive_username;
    SET spark.hadoop.javax.jdo.option.ConnectionPassword=hive_password;
    SET spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/dbname;
    SET spark.sql.hive.metastore.version=2.3.9.adb;
    -- If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
    SET spark.sql.hive.metastore.jars=path;
    SET spark.sql.hive.metastore.version=<Actual Hive version>;
    SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;
    -- The SQL statement. 
    SHOW databases;

    For more information, see the "Parameters" section of this topic.

    Access a Kerberos-encrypted Hive data source

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****; 
    SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.driver.resourceSpec=medium;
    SET spark.executor.instances=1;
    SET spark.executor.resourceSpec=medium;
    SET spark.sql.catalogImplementation=hive;
    SET spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver;
    SET spark.hadoop.javax.jdo.option.ConnectionUserName=hive_username;
    SET spark.hadoop.javax.jdo.option.ConnectionPassword=hive_password;
    SET spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/dbname;
    SET spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES=oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml;
    SET spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES=oss://testBucketname/hadoop/krb5.conf, oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml;
    SET spark.kubernetes.driverEnv.HADOOP_CONF_DIR=/tmp/testBucketname/hadoop;
    SET spark.executorEnv.HADOOP_CONF_DIR=/tmp/testBucketname/hadoop;
    SET spark.kerberos.keytab=local:///tmp/testBucketname/hadoop/hadoop.keytab;
    SET spark.kubernetes.kerberos.krb5.path=oss://testBucketname/hadoop/krb5.conf;
    SET spark.executor.extraJavaOptions=-Djava.security.krb5.conf=/tmp/testBucketname/hadoop/krb5.conf;
    SET spark.kerberos.principal=hadoop/master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com@EMR.C-49F95900****.COM;
    SET spark.sql.hive.metastore.version=2.3.9.adb;
    -- If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, configure parameters for the Hive service. 
    SET spark.sql.hive.metastore.jars=path;
    SET spark.sql.hive.metastore.version=<Actual Hive version>;
    SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;
    -- The SQL statement. 
    SHOW databases;

    For more information, see the "Parameters" section of this topic.