AnalyticDB for MySQL Spark allows you to access Hive data sources over the Thrift or Java Database Connectivity (JDBC) protocol. For security purposes, you can enable the Kerberos authentication feature for Hive data sources. This ensures that only Kerberos-authenticated clients can access the Hive data source when they submit jobs. This topic uses the Hive service of an E-MapReduce (EMR) cluster as an example to describe how to access a Hive data source.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
The AnalyticDB for MySQL cluster resides in the same region as an Object Storage Service (OSS) bucket.
An EMR cluster is created in the same region as the AnalyticDB for MySQL cluster. For more information, see Create a cluster.
The EMR cluster meets the following requirements:
The resource form of the EMR cluster is EMR on ECS.
The business scenario of the EMR cluster is set to Data Lake. The Hadoop-Common, Hadoop Distributed File System (HDFS), YARN, and Hive services are selected for the cluster. The metadata of the EMR cluster is stored in a self-managed ApsaraDB RDS database or a built-in MySQL database.
Important: To access a Hive data source that uses Kerberos authentication, make sure that the Kerberos authentication feature is enabled for the EMR cluster.
Preparations
Download the JAR package that AnalyticDB for MySQL Spark uses to connect to Hive data sources. For more information, see MySQL Connector Java.
(Conditionally required) If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, log on to the master node of the EMR cluster, go to the /opt/apps/HIVE/hive-current/lib directory, and then download all JAR packages. For information about how to log on to the master node of an EMR cluster, see Log on to a cluster.
(Conditionally required) If the Kerberos authentication feature is enabled for the EMR cluster, download the configuration files required for Kerberos authentication.
Log on to the EMR cluster. For more information, see the "Log on to the master node of the cluster" section of the Log on to a cluster topic.
Download the krb5.conf file. For more information, see the "Configuration files" section of the Basic operations on Kerberos topic.
Download the hadoop.keytab, core-site.xml, and hdfs-site.xml files. Perform the following steps:
Run the following command to find the environment variables and configuration information related to Hadoop:
env | grep hadoop
Sample result:
HADOOP_HOME=/opt/apps/HADOOP-COMMON/hadoop-common-current/
HADOOP_CONF_DIR=/etc/taihao-apps/hadoop-conf
PATH=/opt/apps/JINDOSDK/jindosdk-current/bin:/opt/apps/HADOOP-COMMON/hadoop-common-current/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/apps/HIVE/hive-current/bin:/opt/apps/SPARK-EXTENSION/spark-extension-current/bin:/opt/apps/SPARK3/spark-current/bin:/root/bin
Go to the directory indicated by the HADOOP_CONF_DIR environment variable and download the hadoop.keytab, core-site.xml, and hdfs-site.xml files (a command sketch follows the next step).
Run the following command to query the principal:
listprincs
In the returned information, the string whose prefix is hadoop/master is the required principal.
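The following commands are a minimal sketch of the two steps above. They assume the HADOOP_CONF_DIR value shown in the sample result, that the MIT Kerberos administration tools are available on the master node, and that /tmp/hive-kerberos is a hypothetical local working directory.
# Collect the configuration files from the directory indicated by HADOOP_CONF_DIR.
mkdir -p /tmp/hive-kerberos
cp /etc/taihao-apps/hadoop-conf/hadoop.keytab /tmp/hive-kerberos/
cp /etc/taihao-apps/hadoop-conf/core-site.xml /tmp/hive-kerberos/
cp /etc/taihao-apps/hadoop-conf/hdfs-site.xml /tmp/hive-kerberos/
# Query the principals. listprincs runs in the Kerberos administration shell;
# the entry that starts with hadoop/master is the required principal.
kadmin.local -q "listprincs" | grep "^hadoop/master"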
Upload the JAR packages and configuration files that you obtained in Steps 1 to 3 to OSS. For more information, see Simple upload.
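For example, if you use the ossutil command-line tool, the uploads might look like the following commands. The bucket name testBucketName and the object paths are hypothetical placeholders; replace them with your own values.
# Upload the Kerberos configuration files and JAR packages to OSS.
ossutil cp /tmp/hive-kerberos/krb5.conf oss://testBucketName/kerberos/krb5.conf
ossutil cp /tmp/hive-kerberos/hadoop.keytab oss://testBucketName/kerberos/hadoop.keytab
ossutil cp /tmp/hive-kerberos/core-site.xml oss://testBucketName/kerberos/core-site.xml
ossutil cp /tmp/hive-kerberos/hdfs-site.xml oss://testBucketName/kerberos/hdfs-site.xml
ossutil cp mysql-connector-java.jar oss://testBucketName/jars/mysql-connector-java.jar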
Use Spark JAR to access a Hive data source
Write a program that is used to access the Hive data source and compile the program into a JAR package. In this example, the JAR package is named hive_test.jar. Upload the JAR package to OSS. For more information, see Simple upload.
Sample code:
package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkHive {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("Spark HIVE TEST")
      .enableHiveSupport()
      .getOrCreate()

    val welcome = "hello, adb-spark"

    // The name of the Hive table.
    val tableName = args(0)

    import sparkSession.implicits._
    // Save a DataFrame named df that contains only one row and one column of data to the Hive table. The table name is the value of the tableName parameter, and the column name is welcome_col.
    val df = Seq(welcome).toDF("welcome_col")
    df.write.format("hive").mode("overwrite").saveAsTable(tableName)

    // Read data from the Hive table whose name is the value of the tableName parameter.
    val dfFromHive = sparkSession.sql(
      s"""
         |select * from $tableName
         |""".stripMargin)
    dfFromHive.show(10)
  }
}
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
Select a job resource group and the Batch type for the Spark job, write the Spark job in the editor, and then click Run Now.
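The job content is a JSON configuration. The following is a minimal sketch rather than a definitive template: the bucket name, metastore URI, principal, keytab path, and resource specifications are placeholders or example values, and the Kerberos-related properties are standard Spark properties that are assumed to apply here. Confirm the exact configuration items that your cluster requires against the AnalyticDB for MySQL Spark documentation.
{
  "name": "spark-on-hive-kerberos",
  "file": "oss://testBucketName/hive_test.jar",
  "className": "com.aliyun.spark.SparkHive",
  "args": ["hive_table_name"],
  "jars": ["oss://testBucketName/jars/mysql-connector-java.jar"],
  "files": [
    "oss://testBucketName/kerberos/krb5.conf",
    "oss://testBucketName/kerberos/hadoop.keytab",
    "oss://testBucketName/kerberos/core-site.xml",
    "oss://testBucketName/kerberos/hdfs-site.xml"
  ],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.hadoop.hive.metastore.uris": "thrift://<EMR master node address>:9083",
    "spark.kerberos.principal": "<principal obtained in the Preparations section>",
    "spark.kerberos.keytab": "<path to the hadoop.keytab file>"
  }
}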
Use Spark SQL to access a Hive data source
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
Select the Spark engine and a Job Resource Group. Write an SQL job and click Run Now.
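For example, a simple SQL job that verifies access to the Hive data source might look like the following statements; hive_db and hive_table are hypothetical placeholder names for a Hive database and table.
-- hive_db and hive_table are placeholders; replace them with your own names.
SHOW DATABASES;
SELECT * FROM hive_db.hive_table LIMIT 10;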