AnalyticDB for MySQL Spark can read from and write to Hive data sources over the Thrift or Java Database Connectivity (JDBC) protocol. For security-sensitive environments, Kerberos authentication ensures that only authenticated clients can access the cluster and submit jobs. This topic uses the Hive service in E-MapReduce (EMR) as an example.
How it works
Spark connects to the Hive metastore in one of two modes:
Remote mode (Thrift): Spark connects to the Hive metastore service over the Thrift protocol. Use this when the metastore runs as a standalone service (the default for EMR clusters).
Local mode (JDBC): Spark connects directly to the underlying metastore database (ApsaraDB RDS or built-in MySQL) over JDBC. Use this when you need direct database-level access or when the Thrift service is unavailable.
Both modes support optional Kerberos authentication for encrypted clusters.
Prerequisites
Before you begin, ensure that you have:
An AnalyticDB for MySQL Data Lakehouse Edition clusterData Lakehouse Edition
A database account for the cluster:
Alibaba Cloud account: a privileged account. For more information, see the "Create a privileged account" section of Create a database account.
Resource Access Management (RAM) user: a privileged account and a standard account, with the standard account associated with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
The AnalyticDB for MySQL cluster and an Object Storage Service (OSS) bucket in the same region
An EMR cluster in the same region as the AnalyticDB for MySQL cluster. For more information, see Create a cluster.
The EMR cluster configured as follows:
Resource form: EMR on ECS
Business scenario: Data Lake
Services: Hadoop-Common, Hadoop Distributed File System (HDFS), YARN, and Hive
Metadata stored in a self-managed ApsaraDB RDS database or a built-in MySQL database
To access a Kerberos-encrypted Hive data source, make sure that the Kerberos authentication feature is enabled for the EMR cluster.
Prepare files
Before running any Spark job, download the required JAR files and configuration files, then upload them to OSS.
Step 1: Download the MySQL connector JAR file
Download the JAR file for connecting to Hive data sources from MySQL Connector Java.
Step 2: Download Hive JAR files (conditional)
If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, download the Hive JAR files from the EMR master node:
Log on to the master node of the EMR cluster. For more information, see Log on to a cluster.
Go to the
/opt/apps/HIVE/hive-current/libdirectory and download all JAR files.
Step 3: Download Kerberos configuration files (conditional)
If Kerberos authentication is enabled for the EMR cluster, download the following configuration files:
Log on to the master node. For more information, see the "Log on to the master node of the cluster" section of Log on to a cluster.
Download
krb5.conf. For more information, see the "Configuration files" section of Basic operations on Kerberos.Download
hadoop.keytab,core-site.xml, andhdfs-site.xml:Go to the directory shown in
HADOOP_CONF_DIRand downloadhadoop.keytab,core-site.xml, andhdfs-site.xml.
env | grep hadoopSample output:
HADOOP_HOME=/opt/apps/HADOOP-COMMON/hadoop-common-current/ HADOOP_CONF_DIR=/etc/taihao-apps/hadoop-confRun the following command to find the Kerberos principal:
listprincsIn the output, the string with the prefix
hadoop/masteris the required principal.
Step 4: Upload files to OSS
Upload all downloaded JAR files and configuration files to OSS. For more information, see Simple upload.
Troubleshooting
JDBC connection fails with JDOFatalDataStoreException
Symptom: The job fails with an error similar to:
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database.
JDBC url = jdbc:mysql://...Cause: The JDBC connection parameters are misconfigured.
Fix: Verify the following parameters in the EMR console under Services > HIVE > Configure > hivemetastore-site:
spark.hadoop.javax.jdo.option.ConnectionURL— check the hostname, port, and database namespark.hadoop.javax.jdo.option.ConnectionUserName— check the database account namespark.hadoop.javax.jdo.option.ConnectionPassword— check the passwordspark.hadoop.javax.jdo.option.ConnectionDriverName— must becom.mysql.jdbc.Driver
Kerberos authentication fails
Symptom: The job fails with an error containing GSS initiate failed or KrbException.
Fix:
Confirm that the
krb5.conf,hadoop.keytab,core-site.xml, andhdfs-site.xmlfiles are uploaded to the OSS paths specified inspark.executorEnv.ADB_SPARK_DOWNLOAD_FILES.Verify that
spark.kerberos.principalmatches the output oflistprincson the EMR master node (the string prefixed withhadoop/master).Check that the format of
spark.kerberos.keytabislocal:///tmp/<OSS path>, not the OSS path itself.
Domain name resolution fails
Symptom: The job fails with a UnknownHostException for the Hive metastore hostname.
Fix: Set spark.adb.eni.extraHosts to map the metastore's IP address to its hostname:
"spark.adb.eni.extraHosts": "<ip_address> <hostname>"To find the hostname, check fs.defaultFS in core-site.xml on the EMR cluster. To find the IP address, check /etc/hosts on the master node.