
AnalyticDB for MySQL: Access a Hive data source

Last Updated: Sep 26, 2023

This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) to access a Hive data source. In this example, the Hive service of an E-MapReduce (EMR) cluster is used.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.

  • A database account is created.

  • A job resource group is created. For more information, see Create a resource group.

  • An EMR cluster is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. The business scenario of the EMR cluster is set to New Data Lake. The Hadoop-Common, HDFS, YARN, and Hive services are selected for the cluster. The metadata of the cluster is stored in a self-managed ApsaraDB RDS database or built-in MySQL database. For more information, see Create a cluster. If you use a self-managed Hive data source, a self-managed Hive cluster is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

    Important

    A Hive data source in an EMR cluster that has Metadata set to DLF Unified Metadata cannot be accessed from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark over an elastic network interface (ENI).

  • Object Storage Service (OSS) is activated and a bucket is created. For more information, see Activate OSS and Create buckets.

Procedure

  1. Download the JAR package that is required to access a Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. For more information, see MySQL Connector Java.

  2. If the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package, log on to the master node of the EMR cluster, go to the /opt/apps/HIVE/hive-current/lib directory, and then download all JAR packages in that directory. For more information about how to log on to the master node of an EMR cluster, see Log on to a cluster.

  3. Write a program that accesses the Hive data source and package it into the JAR file that is required for the Spark job. In this example, the generated package is named hive_test.jar. Sample code is shown below, followed by a packaging sketch:

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()

        val welcome = "hello, adb-spark"

        // The name of the Hive table, passed as the first job argument. 
        val tableName = args(0)

        import sparkSession.implicits._
        // Save a DataFrame that contains a single row and a single column named welcome_col
        // to the Hive table specified by tableName. 
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)

        // Read the data back from the Hive table. 
        val dfFromHive = sparkSession.sql(
          s"""
             |select * from $tableName
             |""".stripMargin)
        dfFromHive.show(10)
      }
    }
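
    The program must be compiled for the Scala and Spark versions that your Spark environment supports. The following build.sbt is a minimal packaging sketch and is not part of the original procedure: the Scala and Spark version numbers are assumptions that you must replace with the versions used by your cluster, and the Spark dependencies are marked as provided because the Spark runtime supplies them.

    // Minimal build.sbt sketch. The Scala and Spark versions below are assumptions;
    // replace them with the versions that match your Spark environment.
    name := "hive-test"
    version := "1.0"
    scalaVersion := "2.12.15"

    // The Spark runtime provides these libraries, so they are not packaged into hive_test.jar.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.2.0" % "provided",
      "org.apache.spark" %% "spark-hive" % "3.2.0" % "provided"
    )

    Run the sbt package command in the project directory to generate the JAR file, and use the generated file as hive_test.jar in the following steps.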
  4. Upload the JAR packages that are obtained from Steps 1 to 3 to OSS. For more information, see Simple upload.

  5. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select the region where the cluster resides. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  6. Select the job resource group and the Batch job type for the Spark job.

  7. Enter and run the following code in the Spark editor.

    You can connect to a Hive metastore or a self-managed ApsaraDB RDS database over different protocols.

    Connect to a Hive metastore over the Thrift protocol

    {
      "args": [
        "hello_adb"
      ],
      "jars": [
        # The JAR package that is required to access the Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The JAR packages that are required for the Hive server. 
        "oss://<bucket_name>/hive_lib/*"
      ],
      "file": "oss://<bucket_name>/hive_test.jar",
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive",
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.hadoop.hive.metastore.uris": "thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083",
        "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
        # Specify the following parameters only when the EMR cluster is a high-availability cluster. 
        "spark.hadoop.dfs.nameservices": "<Name of dfs.nameservices>",
        "spark.hadoop.dfs.client.failover.proxy.provider.<Name of dfs.nameservices>": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "spark.hadoop.dfs.ha.namenodes.<Name of dfs.nameservices>": "<Name of NameNode>",
        "spark.hadoop.dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode>": "master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000",
        # Specify the following custom parameters only if the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual version of Hive>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }
    Note

    Whether a parameter is required in the following list applies only to the use case in this topic.

    • args (required): The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    • name (required): The name of the Spark job.

    • className (required): The entry class of the Java or Scala program. The entry class is not required for a Python program.

    • jars (required): The OSS paths of the JAR packages that are required for the Spark job.

    • file (required): The OSS path of hive_test.jar.

    • spark.adb.eni.enabled (required): Specifies whether to enable ENI. Set this parameter to true to enable ENI.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to resolve the domain name in the location of the Hive table. Set this parameter to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the EMR cluster. To view the ID, log on to the Virtual Private Cloud (VPC) console and go to the Resources page of the VPC in which the EMR cluster resides.

    • spark.adb.eni.securityGroupId (required): The ID of the security group to which the EMR cluster belongs. To view the ID, go to the Basic Information page of the EMR cluster.

    • spark.adb.eni.extraHosts (optional): The mappings between IP addresses and hostnames that Spark uses to correctly resolve the location of the Hive table.

      • Hostname: In the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster, obtain the hostname from the value of fs.defaultFS. For example, if the value of fs.defaultFS is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the hostname is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

      • IP address: Log on to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

      Note

      • This parameter is required if the data source is a self-managed Hive cluster whose location cannot be resolved. It is optional if the data source is an EMR cluster that uses DNS to resolve the location.

      • Separate an IP address and its hostname with a space. Separate IP address and hostname pairs with commas (,). Example: "ip0 master0, ip1 master1".

    • spark.hadoop.hive.metastore.uris (required): The URI of the Hive metastore. To view the URI, log on to the EMR console and click EMR on ECS. On the Services page, view the value of hive.metastore.uris in the Hive service section.

    • spark.hadoop.dfs.nameservices (optional): The name of the HDFS service. To obtain the name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.nameservices in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.client.failover.proxy.provider.<Name of dfs.nameservices> (optional): The provider that the client uses to connect to the NameNode. To obtain the provider name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.client.failover.proxy.provider.<Name of dfs.nameservices> in the HDFS service section. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.ha.namenodes.<Name of dfs.nameservices> (optional): The names of the NameNodes. To obtain the names, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.ha.namenodes.<Name of dfs.nameservices> in hdfs-site.xml in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode> (optional): The address that the NameNode uses for remote procedure calls (RPCs). To obtain the address, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode> in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.sql.hive.metastore.jars (optional): Set the value to path.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.version (optional): The version number of the Hive server.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.jars.path (optional): Set the value to /tmp/*/*.jar.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • conf (required): The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters.

    Connect to a self-managed ApsaraDB RDS database over the JDBC protocol

    {
      "args": [
        "hello_adb"
      ],
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive",
      "jars": [
        # The JAR package that is required to access the Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The JAR packages that are required for the Hive server. 
        "oss://<bucket_name>/hive_lib/*"
      ],
      "file": "oss://<bucket_name>/hive_test.jar",
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_username>",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "<hive_password>",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/<Database name>",
        # Specify the following custom parameters only if the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual version of Hive>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }
    Note

    Whether a parameter is required in the following list applies only to the use case in this topic.

    • args (required): The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    • name (required): The name of the Spark job.

    • className (required): The entry class of the Java or Scala program. The entry class is not required for a Python program.

    • jars (required): The OSS paths of the JAR packages that are required for the Spark job.

    • file (required): The OSS path of hive_test.jar.

    • spark.adb.eni.enabled (required): Specifies whether to enable ENI. Set this parameter to true to enable ENI.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the EMR cluster. To view the ID, log on to the VPC console and go to the Resources page of the VPC in which the EMR cluster resides.

    • spark.adb.eni.securityGroupId (required): The ID of the security group to which the EMR cluster belongs. To view the ID, go to the Basic Information page of the EMR cluster.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to resolve the domain name in the location of the Hive table. Set this parameter to true.

    • spark.adb.eni.extraHosts (optional): The mappings between IP addresses and hostnames that Spark uses to correctly resolve the location of the Hive table.

      • Hostname: In the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster, obtain the hostname from the value of fs.defaultFS. For example, if the value of fs.defaultFS is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the hostname is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

      • IP address: Log on to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

      Note

      • This parameter is required if the data source is a self-managed Hive cluster whose location cannot be resolved. It is optional if the data source is an EMR cluster that uses DNS to resolve the location.

      • Separate an IP address and its hostname with a space. Separate IP address and hostname pairs with commas (,). Example: "ip0 master0, ip1 master1".

    • spark.hadoop.javax.jdo.option.ConnectionDriverName (required): The name of the connection driver. To view the name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionDriverName in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionUserName (required): The database account of the self-managed ApsaraDB RDS database or built-in MySQL database. To view the account, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionUserName in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionPassword (required): The password of the database account. To view the password, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionPassword in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionURL (required): The database address that contains the database name. Specify the address in the jdbc:mysql://rm-xxxxxx.mysql.rds.aliyuncs.com/<Database name> format. To view the address, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionURL in hivemetastore-site.xml in the Hive service section.

    • spark.sql.hive.metastore.jars (optional): Set the value to path.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.version (optional): The version number of the Hive server.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.jars.path (optional): Set the value to /tmp/*/*.jar.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • conf (required): The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters.

  8. Click Run Now.
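
    After the job finishes, you can verify the write by reading the table back from another Spark batch job. The following Scala snippet is a minimal verification sketch and is not part of the original procedure: it assumes the table name hello_adb that is passed in the args parameter of this example. Package and submit it in the same way as hive_test.jar, with the same conf settings as the job above.

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHiveCheck {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE CHECK")
          .enableHiveSupport()
          .getOrCreate()

        // Read back the table that was written by the sample job.
        // hello_adb is the table name that is passed in the args parameter in this example.
        sparkSession.sql("select * from hello_adb").show()
      }
    }

    If the write succeeded, the output of the show() call contains a single row with the value hello, adb-spark in the welcome_col column.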