
AnalyticDB for MySQL: Access a Hive data source

Last Updated: Sep 26, 2023

This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) to access a Hive data source. In this example, the Hive service of an E-MapReduce (EMR) cluster is used.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.

  • A database account is created.

  • A job resource group is created. For more information, see Create a resource group.

  • An EMR cluster is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. The business scenario of the EMR cluster is set to New Data Lake. The Hadoop-Common, HDFS, YARN, and Hive services are selected for the cluster. The metadata of the cluster is stored in a self-managed ApsaraDB RDS database or built-in MySQL database. For more information, see Create a cluster. If you use a self-managed Hive data source, a self-managed Hive cluster is created in the same region as the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

    Important

    A Hive data source in an EMR cluster that has Metadata set to DLF Unified Metadata cannot be accessed from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark over an elastic network interface (ENI).

  • Object Storage Service (OSS) is activated and a bucket is created. For more information, see Activate OSS and Create buckets.

Procedure

  1. Download the JAR package that is required to access a Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. For more information, see MySQL Connector Java.

  2. If the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package, log on to the master node of the EMR cluster, go to the /opt/apps/HIVE/hive-current/lib directory, and then download all JAR packages in that directory. For more information about how to log on to the master node of an EMR cluster, see Log on to a cluster.

  3. Write a program that accesses the Hive data source and package it into the JAR file that is required for the Spark job. In this example, the generated package is named hive_test.jar. Sample code is shown below, followed by a packaging sketch:

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHive {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE TEST")
          .enableHiveSupport()
          .getOrCreate()

        val welcome = "hello, adb-spark"

        // The name of the Hive table, passed as the first job argument. 
        val tableName = args(0)

        import sparkSession.implicits._
        // Save a DataFrame that contains a single row and a single column named welcome_col
        // to the Hive table specified by tableName. 
        val df = Seq(welcome).toDF("welcome_col")
        df.write.format("hive").mode("overwrite").saveAsTable(tableName)

        // Read the data back from the Hive table. 
        val dfFromHive = sparkSession.sql(
          s"""
             |select * from $tableName
             |""".stripMargin)
        dfFromHive.show(10)
      }
    }
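
    The program must be compiled for the Scala and Spark versions that your Spark environment supports. The following build.sbt is a minimal packaging sketch and is not part of the original procedure: the Scala and Spark version numbers are assumptions that you must replace with the versions used by your cluster, and the Spark dependencies are marked as provided because the Spark runtime supplies them.

    // Minimal build.sbt sketch. The Scala and Spark versions below are assumptions;
    // replace them with the versions that match your Spark environment.
    name := "hive-test"
    version := "1.0"
    scalaVersion := "2.12.15"

    // The Spark runtime provides these libraries, so they are not packaged into hive_test.jar.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.2.0" % "provided",
      "org.apache.spark" %% "spark-hive" % "3.2.0" % "provided"
    )

    Run the sbt package command in the project directory to generate the JAR file, and use the generated file as hive_test.jar in the following steps.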
  4. Upload the JAR packages that are obtained from Steps 1 to 3 to OSS. For more information, see Simple upload.

  5. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select the region where the cluster resides. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  6. Select the job resource group and the Batch job type for the Spark job.

  7. Enter and run the following code in the Spark editor.

    You can connect to a Hive metastore or a self-managed ApsaraDB RDS database over different protocols.

    Connect to a Hive metastore over the Thrift protocol

    {
      "args": [
        "hello_adb"
      ],
      "jars": [
        # The JAR package that is required to access the Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The JAR packages that are required for the Hive server. 
        "oss://<bucket_name>/hive_lib/*"
      ],
      "file": "oss://<bucket_name>/hive_test.jar",
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive",
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.hadoop.hive.metastore.uris": "thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083",
        "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
        # Specify the following parameters only when the EMR cluster is a high-availability cluster. 
        "spark.hadoop.dfs.nameservices": "<Name of dfs.nameservices>",
        "spark.hadoop.dfs.client.failover.proxy.provider.<Name of dfs.nameservices>": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "spark.hadoop.dfs.ha.namenodes.<Name of dfs.nameservices>": "<Name of NameNode>",
        "spark.hadoop.dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode>": "master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000",
        # Specify the following custom parameters only if the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual version of Hive>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }
    Note

    Whether a parameter is required in the following list applies only to the use case in this topic.

    • args (required): The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    • name (required): The name of the Spark job.

    • className (required): The entry class of the Java or Scala program. The entry class is not required for a Python program.

    • jars (required): The OSS paths of the JAR packages that are required for the Spark job.

    • file (required): The OSS path of hive_test.jar.

    • spark.adb.eni.enabled (required): Specifies whether to enable ENI. Set this parameter to true to enable ENI.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to resolve the domain name in the location of the Hive table. Set this parameter to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the EMR cluster. To view the ID, log on to the Virtual Private Cloud (VPC) console and go to the Resources page of the VPC in which the EMR cluster resides.

    • spark.adb.eni.securityGroupId (required): The ID of the security group to which the EMR cluster belongs. To view the ID, go to the Basic Information page of the EMR cluster.

    • spark.adb.eni.extraHosts (optional): The mappings between IP addresses and hostnames that Spark uses to correctly resolve the location of the Hive table.

      • Hostname: In the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster, obtain the hostname from the value of fs.defaultFS. For example, if the value of fs.defaultFS is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the hostname is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

      • IP address: Log on to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

      Note

      • This parameter is required if the data source is a self-managed Hive cluster whose location cannot be resolved. It is optional if the data source is an EMR cluster that uses DNS to resolve the location.

      • Separate an IP address and its hostname with a space. Separate IP address and hostname pairs with commas (,). Example: "ip0 master0, ip1 master1".

    • spark.hadoop.hive.metastore.uris (required): The URI of the Hive metastore. To view the URI, log on to the EMR console and click EMR on ECS. On the Services page, view the value of hive.metastore.uris in the Hive service section.

    • spark.hadoop.dfs.nameservices (optional): The name of the HDFS service. To obtain the name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.nameservices in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.client.failover.proxy.provider.<Name of dfs.nameservices> (optional): The provider that the client uses to connect to the NameNode. To obtain the provider name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.client.failover.proxy.provider.<Name of dfs.nameservices> in the HDFS service section. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.ha.namenodes.<Name of dfs.nameservices> (optional): The names of the NameNodes. To obtain the names, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.ha.namenodes.<Name of dfs.nameservices> in hdfs-site.xml in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.hadoop.dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode> (optional): The address that the NameNode uses for remote procedure calls (RPCs). To obtain the address, log on to the EMR console and click EMR on ECS. On the Services page, view the value of dfs.namenode.rpc-address.<Name of dfs.nameservices>.<Name of NameNode> in the HDFS service section.

      Note

      This parameter is required if the EMR cluster is a high-availability cluster.

    • spark.sql.hive.metastore.jars (optional): Set the value to path.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.version (optional): The version number of the Hive server.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.jars.path (optional): Set the value to /tmp/*/*.jar.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • conf (required): The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters.

    Connect to a self-managed ApsaraDB RDS database over the JDBC protocol

    {
      "args": [
        "hello_adb"
      ],
      "name": "spark-on-hive",
      "className": "com.aliyun.spark.SparkHive",
      "jars": [
        # The JAR package that is required to access the Hive data source from AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark. 
        "oss://<bucket_name>/mysql-connector-java.jar",
        # The JAR packages that are required for the Hive server. 
        "oss://<bucket_name>/hive_lib/*"
      ],
      "file": "oss://<bucket_name>/hive_test.jar",
      "conf": {
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
        "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
        "spark.adb.eni.adbHostAlias.enabled": "true",
        "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_username>",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "<hive_password>",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/<Database name>",
        # Specify the following custom parameters only if the Hive server runs a version that is earlier than Hive 2.3 and that version is incompatible with the JAR package. 
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": "<Actual version of Hive>",
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
      }
    }
    Note

    Whether a parameter is required in the following list applies only to the use case in this topic.

    • args (required): The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

    • name (required): The name of the Spark job.

    • className (required): The entry class of the Java or Scala program. The entry class is not required for a Python program.

    • jars (required): The OSS paths of the JAR packages that are required for the Spark job.

    • file (required): The OSS path of hive_test.jar.

    • spark.adb.eni.enabled (required): Specifies whether to enable ENI. Set this parameter to true to enable ENI.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the EMR cluster. To view the ID, log on to the VPC console and go to the Resources page of the VPC in which the EMR cluster resides.

    • spark.adb.eni.securityGroupId (required): The ID of the security group to which the EMR cluster belongs. To view the ID, go to the Basic Information page of the EMR cluster.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to resolve the domain name in the location of the Hive table. Set this parameter to true.

    • spark.adb.eni.extraHosts (optional): The mappings between IP addresses and hostnames that Spark uses to correctly resolve the location of the Hive table.

      • Hostname: In the <Hive_CONF_DIR>/core-site.xml file of the self-managed Hive cluster, obtain the hostname from the value of fs.defaultFS. For example, if the value of fs.defaultFS is hdfs://master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com:9000, the hostname is master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com.

      • IP address: Log on to the master node of the self-managed Hive cluster and view the IP address in the /etc/hosts file.

      Note

      • This parameter is required if the data source is a self-managed Hive cluster whose location cannot be resolved. It is optional if the data source is an EMR cluster that uses DNS to resolve the location.

      • Separate an IP address and its hostname with a space. Separate IP address and hostname pairs with commas (,). Example: "ip0 master0, ip1 master1".

    • spark.hadoop.javax.jdo.option.ConnectionDriverName (required): The name of the connection driver. To view the name, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionDriverName in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionUserName (required): The database account of the self-managed ApsaraDB RDS database or built-in MySQL database. To view the account, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionUserName in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionPassword (required): The password of the database account. To view the password, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionPassword in hivemetastore-site.xml in the Hive service section.

    • spark.hadoop.javax.jdo.option.ConnectionURL (required): The database address that contains the database name. Specify the address in the jdbc:mysql://rm-xxxxxx.mysql.rds.aliyuncs.com/<Database name> format. To view the address, log on to the EMR console and click EMR on ECS. On the Services page, view the value of javax.jdo.option.ConnectionURL in hivemetastore-site.xml in the Hive service section.

    • spark.sql.hive.metastore.jars (optional): Set the value to path.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.version (optional): The version number of the Hive server.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • spark.sql.hive.metastore.jars.path (optional): Set the value to /tmp/*/*.jar.

      Note

      This parameter is required if the data source runs a Hive version that is earlier than 2.3 and that version is incompatible with the JAR package.

    • conf (required): The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key:value format. Separate multiple parameters with commas (,). For more information, see Conf configuration parameters.

  8. Click Run Now.
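
    After the job finishes, you can verify the write by reading the table back from another Spark batch job. The following Scala snippet is a minimal verification sketch and is not part of the original procedure: it assumes the table name hello_adb that is passed in the args parameter of this example. Package and submit it in the same way as hive_test.jar, with the same conf settings as the job above.

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkHiveCheck {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HIVE CHECK")
          .enableHiveSupport()
          .getOrCreate()

        // Read back the table that was written by the sample job.
        // hello_adb is the table name that is passed in the args parameter in this example.
        sparkSession.sql("select * from hello_adb").show()
      }
    }

    If the write succeeded, the output of the show() call contains a single row with the value hello, adb-spark in the welcome_col column.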