AnalyticDB: Access a Hive data source

Last Updated: Mar 28, 2026

AnalyticDB for MySQL Spark can read from and write to Hive data sources over the Thrift or Java Database Connectivity (JDBC) protocol. For security-sensitive environments, Kerberos authentication ensures that only authenticated clients can access the cluster and submit jobs. This topic uses the Hive service in E-MapReduce (EMR) as an example.

How it works

Spark connects to the Hive metastore in one of two modes:

  • Remote mode (Thrift): Spark connects to the Hive metastore service over the Thrift protocol. Use this when the metastore runs as a standalone service (the default for EMR clusters).

  • Local mode (JDBC): Spark connects directly to the underlying metastore database (ApsaraDB RDS or built-in MySQL) over JDBC. Use this when you need direct database-level access or when the Thrift service is unavailable.

Both modes support optional Kerberos authentication for encrypted clusters.
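
The difference between the two modes shows up in which Spark properties you set. The following Python sketch is illustrative only: the property names are the ones used in the job configurations in this topic, while the helper name metastore_conf and its arguments are hypothetical and not part of any AnalyticDB for MySQL API.

```python
def metastore_conf(mode, *, thrift_uri=None, jdbc_url=None,
                   username=None, password=None,
                   driver="com.mysql.jdbc.Driver"):
    """Return the Spark properties that select a Hive metastore connection mode.

    Hypothetical helper: only the property names are taken from this topic.
    """
    if mode == "thrift":
        if not thrift_uri:
            raise ValueError("thrift mode requires thrift_uri")
        # Remote mode: talk to the standalone metastore service.
        return {"spark.hadoop.hive.metastore.uris": thrift_uri}
    if mode == "jdbc":
        if not (jdbc_url and username and password):
            raise ValueError("jdbc mode requires jdbc_url, username, and password")
        # Local mode: talk to the metastore database directly.
        return {
            "spark.sql.catalogImplementation": "hive",
            "spark.hadoop.javax.jdo.option.ConnectionDriverName": driver,
            "spark.hadoop.javax.jdo.option.ConnectionUserName": username,
            "spark.hadoop.javax.jdo.option.ConnectionPassword": password,
            "spark.hadoop.javax.jdo.option.ConnectionURL": jdbc_url,
        }
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned dictionary can be merged into the conf object of a job configuration alongside the network and resource settings.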

Prerequisites

Before you begin, ensure that you have:

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster

  • A database account for the cluster

  • The AnalyticDB for MySQL cluster and an Object Storage Service (OSS) bucket in the same region

  • An EMR cluster in the same region as the AnalyticDB for MySQL cluster. For more information, see Create a cluster.

  • The EMR cluster configured as follows:

    • Resource form: EMR on ECS

    • Business scenario: Data Lake

    • Services: Hadoop-Common, Hadoop Distributed File System (HDFS), YARN, and Hive

    • Metadata stored in a self-managed ApsaraDB RDS database or a built-in MySQL database

Important

To access a Kerberos-encrypted Hive data source, make sure that the Kerberos authentication feature is enabled for the EMR cluster.

Prepare files

Before running any Spark job, download the required JAR files and configuration files, then upload them to OSS.

Step 1: Download the MySQL connector JAR file

Download the MySQL connector JAR file, which is required to access Hive data sources, from MySQL Connector Java.

Step 2: Download Hive JAR files (conditional)

If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, download the Hive JAR files from the EMR master node:

  1. Log on to the master node of the EMR cluster. For more information, see Log on to a cluster.

  2. Go to the /opt/apps/HIVE/hive-current/lib directory and download all JAR files.

Step 3: Download Kerberos configuration files (conditional)

If Kerberos authentication is enabled for the EMR cluster, download the following configuration files:

  1. Log on to the master node. For more information, see the "Log on to the master node of the cluster" section of Log on to a cluster.

  2. Download krb5.conf. For more information, see the "Configuration files" section of Basic operations on Kerberos.

  3. Download hadoop.keytab, core-site.xml, and hdfs-site.xml:

    1. Run the following command to find the Hadoop configuration directory:

    env | grep hadoop

    Sample output:

    HADOOP_HOME=/opt/apps/HADOOP-COMMON/hadoop-common-current/
    HADOOP_CONF_DIR=/etc/taihao-apps/hadoop-conf

    2. Go to the directory shown in HADOOP_CONF_DIR and download hadoop.keytab, core-site.xml, and hdfs-site.xml.

  4. Run the following command to find the Kerberos principal:

    listprincs

    In the output, the string with the prefix hadoop/master is the required principal.
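
If you save the listprincs output to a file or variable, a small script can pick out the principal. The Python sketch below assumes the output lists one principal per line. The function name find_hadoop_principal and the sample principals are made up for illustration.

```python
def find_hadoop_principal(listprincs_output, prefix="hadoop/master"):
    """Return the first principal that starts with the given prefix, or None."""
    for line in listprincs_output.splitlines():
        candidate = line.strip()
        if candidate.startswith(prefix):
            return candidate
    return None


# Made-up sample output for illustration; real principals will differ.
sample = """\
K/M@EXAMPLE.COM
hadoop/master-1-1.example.internal@EXAMPLE.COM
hive/master-1-1.example.internal@EXAMPLE.COM
"""
print(find_hadoop_principal(sample))
```

The value found this way is what you later pass as spark.kerberos.principal.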

Step 4: Upload files to OSS

Upload all downloaded JAR files and configuration files to OSS. For more information, see Simple upload.

Use Spark JAR to access a Hive data source

Step 1: Write and compile the Spark application

Write a Spark application that accesses the Hive data source and compile it into a JAR file. The following example writes one row to a Hive table and then reads it back. In this example, the JAR file is named hive_test.jar.

package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkHive {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("Spark HIVE TEST")
      .enableHiveSupport()
      .getOrCreate()

    val welcome = "hello, adb-spark"

    // The name of the Hive table.
    val tableName = args(0)

    import sparkSession.implicits._
    // Write one row to the Hive table.
    val df = Seq(welcome).toDF("welcome_col")
    df.write.format("hive").mode("overwrite").saveAsTable(tableName)

    // Read back all rows from the Hive table.
    val dfFromHive = sparkSession.sql(
      s"""
         |select * from $tableName
         |""".stripMargin)
    dfFromHive.show(10)
  }
}

Upload hive_test.jar to OSS. For more information, see Simple upload.

Step 2: Open Spark JAR development in the console

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the target cluster and click its ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. Select a job resource group and set the job type to Batch.

Step 3: Configure and submit the job

Use one of the following configurations depending on your connection protocol and security requirements.

Connect over Thrift

Thrift (remote mode) connects to the Hive metastore service. All examples use a JSON job configuration submitted to the Spark JAR Development page.

{
  "args": ["hello_adb"],
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",
  "jars": [
    "oss://<bucket_name>/mysql-connector-java.jar",
    "oss://<bucket_name>/hive_lib/*"
  ],
  "file": "oss://<bucket_name>/hive_test.jar",
  "conf": {
    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.adbHostAlias.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
    "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
    "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 1,
    "spark.executor.resourceSpec": "medium",
    "spark.hadoop.hive.metastore.uris": "thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083",

    "spark.hadoop.dfs.nameservices": "<HDFS service name>",
    "spark.hadoop.dfs.client.failover.proxy.provider.<HDFS service name>": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "spark.hadoop.dfs.ha.namenodes.<HDFS service name>": "<NameNode name>",
    "spark.hadoop.dfs.namenode.rpc-address.<HDFS service name>.<NameNode name>": "master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000",

    "spark.sql.hive.metastore.jars": "path",
    "spark.sql.hive.metastore.version": "<actual Hive version>",
    "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
  }
}

Replace the placeholders:

| Placeholder | Description | Example |
| --- | --- | --- |
| <bucket_name> | Your OSS bucket name. | my-bucket |
| <HDFS service name> | The HDFS service name from EMR. In the EMR console, go to Services > HDFS > Configure and find the value of dfs.nameservices. Required only for high-availability clusters. | emr-cluster |
| <NameNode name> | The NameNode name from EMR. In the EMR console, go to Services > HDFS > Configure and find the value of dfs.ha.namenodes.<HDFS service name>. Required only for high-availability clusters. | nn1,nn2 |
| <actual Hive version> | The Hive version of your EMR cluster. Required only when the Hive version is earlier than 2.3. | 2.1.0 |

For all parameters, see the Parameters section below.
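
For high-availability clusters, the four HDFS properties all derive from the service name and the NameNode names. The following Python sketch shows one way to generate them. It assumes one rpc-address entry per NameNode, as in standard HDFS high-availability configuration. The helper name ha_hdfs_conf and the host names in the usage note are hypothetical.

```python
FAILOVER_PROVIDER = (
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)


def ha_hdfs_conf(nameservice, rpc_addresses):
    """Build the HDFS high-availability properties for one nameservice.

    rpc_addresses maps each NameNode name (for example "nn1") to "host:port".
    """
    if not rpc_addresses:
        raise ValueError("at least one NameNode is required")
    conf = {
        "spark.hadoop.dfs.nameservices": nameservice,
        f"spark.hadoop.dfs.client.failover.proxy.provider.{nameservice}":
            FAILOVER_PROVIDER,
        # Dictionaries keep insertion order, so the names appear as given.
        f"spark.hadoop.dfs.ha.namenodes.{nameservice}": ",".join(rpc_addresses),
    }
    for namenode, address in rpc_addresses.items():
        key = f"spark.hadoop.dfs.namenode.rpc-address.{nameservice}.{namenode}"
        conf[key] = address
    return conf
```

For example, ha_hdfs_conf("emr-cluster", {"nn1": "host-a:9000", "nn2": "host-b:9000"}) produces five properties, with dfs.ha.namenodes.emr-cluster set to nn1,nn2.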

Connect over JDBC

JDBC (local mode) connects directly to the underlying metastore database. Replace <hive_username>, <hive_password>, and the connection URL with values from the hivemetastore-site tab in the EMR console (Services > HIVE > Configure).

{
  "args": ["hello_adb"],
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",
  "jars": [
    "oss://<bucket_name>/mysql-connector-java.jar",
    "oss://<bucket_name>/hive_lib/*"
  ],
  "file": "oss://<bucket_name>/hive_test.jar",
  "conf": {
    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.adbHostAlias.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
    "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
    "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 1,
    "spark.executor.resourceSpec": "medium",
    "spark.sql.catalogImplementation": "hive",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_username>",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "<hive_password>",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/<database_name>",

    "spark.sql.hive.metastore.jars": "path",
    "spark.sql.hive.metastore.version": "<actual Hive version>",
    "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar"
  }
}

For all parameters, see the Parameters section below.

Connect to a Kerberos-encrypted Hive data source

Add the following Kerberos-related parameters to either the Thrift or JDBC configuration above. The example below extends the JDBC configuration.

{
  "args": ["hello_adb"],
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",
  "jars": [
    "oss://testBucketname/mysql-connector-java.jar",
    "oss://testBucketname/hive_lib/*"
  ],
  "file": "oss://testBucketname/hive_test.jar",
  "conf": {
    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.adbHostAlias.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-bp1mbnyrjtf3ih1****",
    "spark.adb.eni.securityGroupId": "sg-bp180fryne3qle****",
    "spark.adb.eni.extraHosts": "172.24.xx.xx master-1.c-9c9b322d****.cn-hangzhou.emr.aliyuncs.com",
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 1,
    "spark.executor.resourceSpec": "medium",
    "spark.sql.catalogImplementation": "hive",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "hive_username",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "hive_password",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com/hivemeta?createDatabaseIfNotExist=true&characterEncoding=UTF-8",

    "spark.sql.hive.metastore.jars": "path",
    "spark.sql.hive.metastore.version": "<actual Hive version>",
    "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar",

    "spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES": "oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml",
    "spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES": "oss://testBucketname/hadoop/krb5.conf, oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml",
    "spark.kubernetes.driverEnv.HADOOP_CONF_DIR": "/tmp/testBucketname/hadoop",
    "spark.executorEnv.HADOOP_CONF_DIR": "/tmp/testBucketname/hadoop",
    "spark.kerberos.keytab": "local:///tmp/testBucketname/hadoop/hadoop.keytab",
    "spark.executor.extraJavaOptions": "-Djava.security.krb5.conf=/tmp/testBucketname/hadoop/krb5.conf",
    "spark.kubernetes.kerberos.krb5.path": "oss://testBucketname/hadoop/krb5.conf",
    "spark.kerberos.principal": "hadoop/master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com@EMR.C-49F95900****.COM"
  }
}

Parameters

The Required column reflects the requirements for the use cases in this topic.

Common parameters (all configurations)

| Parameter | Required | Description |
| --- | --- | --- |
| args | Yes | Arguments passed to the JAR file. Separate multiple arguments with commas (,). |
| name | Yes | The name of the Spark job. |
| className | Yes | The main class of the Java or Scala application. Not required for Python applications. |
| jars | Yes | OSS paths of the JAR files required to run the Spark job. |
| file | Yes | OSS path of hive_test.jar. |
| conf | Yes | Configuration parameters for the Spark job, in key:value format. Separate multiple entries with commas (,). For more information, see Spark application configuration parameters. |
| spark.adb.eni.enabled | Yes | Set to true to enable Elastic Network Interface (ENI). |
| spark.adb.eni.adbHostAlias.enabled | Yes | Set to true to enable domain name resolution for the Hive data source. |
| spark.adb.eni.vswitchId | Yes | The vSwitch ID of the EMR cluster. To find it, go to the VPC console and open the Resource Management page of the VPC containing the EMR cluster. |
| spark.adb.eni.securityGroupId | Yes | The security group ID of the EMR cluster. To find it, go to the Basic Information page of the EMR cluster. |
| spark.adb.eni.extraHosts | No | IP-to-hostname mappings for the Hive data source, required when the domain name cannot be resolved via DNS. Format: "<ip0> <hostname0>, <ip1> <hostname1>". To get the hostname, check the fs.defaultFS value in <Hive_CONF_DIR>/core-site.xml. To get the IP address, check /etc/hosts on the master node. Required for self-managed Hive clusters; optional for EMR clusters that use DNS. |

Thrift-specific parameters

| Parameter | Required | Description |
| --- | --- | --- |
| spark.hadoop.hive.metastore.uris | Yes | The Thrift URI of the Hive metastore. In the EMR console, go to Services > HIVE > Configure and find the value of hive.metastore.uris. |
| spark.hadoop.dfs.nameservices | No | The HDFS service name. Required for high-availability clusters. In the EMR console, go to Services > HDFS > Configure and find the value of dfs.nameservices. |
| spark.hadoop.dfs.client.failover.proxy.provider.<HDFS service name> | No | The failover proxy provider class. Required for high-availability clusters. Default: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
| spark.hadoop.dfs.ha.namenodes.<HDFS service name> | No | The NameNode names. Required for high-availability clusters. In the EMR console, go to Services > HDFS > Configure and find the value of dfs.ha.namenodes.<HDFS service name>. |
| spark.hadoop.dfs.namenode.rpc-address.<HDFS service name>.<NameNode name> | No | The RPC address of the NameNode. Required for high-availability clusters. In the EMR console, go to Services > HDFS > Configure and find the value of dfs.namenode.rpc-address.<HDFS service name>.<NameNode name>. |

JDBC-specific parameters

| Parameter | Required | Description |
| --- | --- | --- |
| spark.sql.catalogImplementation | Yes | Set to hive to access a Hive data source. |
| spark.hadoop.javax.jdo.option.ConnectionDriverName | Yes | The JDBC driver class name. In the EMR console, go to Services > HIVE > Configure > hivemetastore-site and find the value of javax.jdo.option.ConnectionDriverName. |
| spark.hadoop.javax.jdo.option.ConnectionUserName | Yes | The database account name for the ApsaraDB RDS or built-in MySQL database. In the EMR console, go to Services > HIVE > Configure > hivemetastore-site and find the value of javax.jdo.option.ConnectionUserName. |
| spark.hadoop.javax.jdo.option.ConnectionPassword | Yes | The database account password. In the EMR console, go to Services > HIVE > Configure > hivemetastore-site and find the value of javax.jdo.option.ConnectionPassword. |
| spark.hadoop.javax.jdo.option.ConnectionURL | Yes | The JDBC connection URL, including the database name. Format: jdbc:mysql://rm-xxxxxx.mysql.rds.aliyuncs.com/<database_name>. In the EMR console, go to Services > HIVE > Configure > hivemetastore-site and find the value of javax.jdo.option.ConnectionURL. |

Hive version compatibility parameters

If the Hive version of the EMR cluster is earlier than 2.3 and incompatible with the Hive version of AnalyticDB for MySQL, add the following three parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| spark.sql.hive.metastore.jars | path | Tells Spark to load Hive metastore JARs from a local path. |
| spark.sql.hive.metastore.version | The actual Hive version of your EMR cluster (for example, 2.1.0) | The Hive metastore version to use. |
| spark.sql.hive.metastore.jars.path | /tmp/*/*.jar | The local path pattern for the Hive JAR files you downloaded. |
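
The version rule can be expressed as a small check. The Python sketch below adds the three properties only when the Hive version is earlier than 2.3. It assumes a version string of the form major.minor followed by anything, and it does not test the separate compatibility condition, which you still need to judge yourself. The helper name hive_compat_conf is hypothetical.

```python
def hive_compat_conf(hive_version, threshold=(2, 3)):
    """Return the three compatibility properties when hive_version is earlier than 2.3."""
    parts = hive_version.split(".")
    major_minor = (int(parts[0]), int(parts[1]))
    if major_minor >= threshold:
        # Hive 2.3 or later: no extra properties are needed.
        return {}
    return {
        "spark.sql.hive.metastore.jars": "path",
        "spark.sql.hive.metastore.version": hive_version,
        "spark.sql.hive.metastore.jars.path": "/tmp/*/*.jar",
    }
```

With this sketch, hive_compat_conf("2.1.0") returns all three properties, and hive_compat_conf("3.1.2") returns an empty dictionary.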

Kerberos parameters

| Parameter | Required | Description |
| --- | --- | --- |
| spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES | Yes | OSS paths of hadoop.keytab, core-site.xml, and hdfs-site.xml for the Spark driver. |
| spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES | Yes | OSS paths of krb5.conf, hadoop.keytab, core-site.xml, and hdfs-site.xml for the Spark executor. |
| spark.kubernetes.driverEnv.HADOOP_CONF_DIR | Yes | Local directory for the driver's Hadoop configuration files. Format: /tmp/<OSS directory>. For example, if files are in oss://testBucketname/hadoop, use /tmp/testBucketname/hadoop. |
| spark.executorEnv.HADOOP_CONF_DIR | Yes | Local directory for the executor's Hadoop configuration files. Same format as spark.kubernetes.driverEnv.HADOOP_CONF_DIR. |
| spark.kerberos.keytab | Yes | Local path of hadoop.keytab. Format: local:///tmp/<OSS path of hadoop.keytab>. For example, local:///tmp/testBucketname/hadoop/hadoop.keytab. |
| spark.executor.extraJavaOptions | Yes | JVM option pointing to the local krb5.conf path. Format: -Djava.security.krb5.conf=/tmp/<OSS path of krb5.conf>. |
| spark.kubernetes.kerberos.krb5.path | Yes | OSS path of krb5.conf. |
| spark.kerberos.principal | Yes | The Kerberos principal obtained during preparation (the string prefixed with hadoop/master). |
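
Because the local paths follow mechanically from the OSS paths, the Kerberos properties can be generated from one OSS directory and the principal. The Python sketch below assumes that all four files sit in the same OSS directory, as in the example configuration. The helper names oss_to_local and kerberos_conf are hypothetical.

```python
def oss_to_local(oss_path):
    """Map oss://<bucket>/<key> to the /tmp/<bucket>/<key> path used in this topic."""
    prefix = "oss://"
    if not oss_path.startswith(prefix):
        raise ValueError(f"not an OSS path: {oss_path!r}")
    return "/tmp/" + oss_path[len(prefix):]


def kerberos_conf(oss_dir, principal):
    """Build the Kerberos properties, assuming all four files are in oss_dir."""
    oss_dir = oss_dir.rstrip("/")
    names = ("krb5.conf", "hadoop.keytab", "core-site.xml", "hdfs-site.xml")
    files = {name: f"{oss_dir}/{name}" for name in names}
    # The driver list omits krb5.conf; the executor list includes it.
    driver_files = [files[n] for n in names[1:]]
    executor_files = [files[n] for n in names]
    local_dir = oss_to_local(oss_dir)
    return {
        "spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES": ", ".join(driver_files),
        "spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES": ", ".join(executor_files),
        "spark.kubernetes.driverEnv.HADOOP_CONF_DIR": local_dir,
        "spark.executorEnv.HADOOP_CONF_DIR": local_dir,
        "spark.kerberos.keytab": "local://" + oss_to_local(files["hadoop.keytab"]),
        "spark.executor.extraJavaOptions":
            "-Djava.security.krb5.conf=" + oss_to_local(files["krb5.conf"]),
        "spark.kubernetes.kerberos.krb5.path": files["krb5.conf"],
        "spark.kerberos.principal": principal,
    }
```

Calling kerberos_conf("oss://testBucketname/hadoop", "<principal>") reproduces the Kerberos entries of the example configuration in this topic.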

Use Spark SQL to access a Hive data source

Step 1: Open SQL development in the console

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the target cluster and click its ID.

  2. In the left-side navigation pane, choose Job Development > SQL Development.

  3. Select the Spark engine and a Job Resource Group. Write your SQL and click Run Now.

Step 2: Configure and run the SQL job

Use SET statements to configure Spark parameters before your SQL statement.

Connect over Thrift

SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****;
SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
SET spark.adb.eni.adbHostAlias.enabled=true;
SET spark.driver.resourceSpec=medium;
SET spark.executor.instances=1;
SET spark.executor.resourceSpec=medium;
SET spark.hadoop.hive.metastore.uris=thrift://master-1-1.c-9c9b32****.cn-hangzhou.emr.aliyuncs.com:9083;
SET spark.sql.hive.metastore.version=2.3.9.adb;

-- High-availability clusters only:
SET spark.hadoop.dfs.nameservices=<HDFS service name>;
SET spark.hadoop.dfs.client.failover.proxy.provider.<HDFS service name>=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
SET spark.hadoop.dfs.ha.namenodes.<HDFS service name>=<NameNode name>;
SET spark.hadoop.dfs.namenode.rpc-address.<HDFS service name>.<NameNode name>=master-1-1.c-9c9b322****.cn-hangzhou.emr.aliyuncs.com:9000;

-- Hive version earlier than 2.3 only:
SET spark.sql.hive.metastore.jars=path;
SET spark.sql.hive.metastore.version=<actual Hive version>;
SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;

SHOW databases;

Connect over JDBC

SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****;
SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
SET spark.adb.eni.adbHostAlias.enabled=true;
SET spark.driver.resourceSpec=medium;
SET spark.executor.instances=1;
SET spark.executor.resourceSpec=medium;
SET spark.sql.catalogImplementation=hive;
SET spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver;
SET spark.hadoop.javax.jdo.option.ConnectionUserName=hive_username;
SET spark.hadoop.javax.jdo.option.ConnectionPassword=hive_password;
SET spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/dbname;
SET spark.sql.hive.metastore.version=2.3.9.adb;

-- Hive version earlier than 2.3 only:
SET spark.sql.hive.metastore.jars=path;
SET spark.sql.hive.metastore.version=<actual Hive version>;
SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;

SHOW databases;

Connect to a Kerberos-encrypted Hive data source

SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp1mbnyrjtf3ih1****;
SET spark.adb.eni.securityGroupId=sg-bp180fryne3qle****;
SET spark.adb.eni.adbHostAlias.enabled=true;
SET spark.driver.resourceSpec=medium;
SET spark.executor.instances=1;
SET spark.executor.resourceSpec=medium;
SET spark.sql.catalogImplementation=hive;
SET spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver;
SET spark.hadoop.javax.jdo.option.ConnectionUserName=hive_username;
SET spark.hadoop.javax.jdo.option.ConnectionPassword=hive_password;
SET spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://rm-bp1h5d11r8qtm****.mysql.rds.aliyuncs.com/dbname;
SET spark.kubernetes.driverEnv.ADB_SPARK_DOWNLOAD_FILES=oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml;
SET spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES=oss://testBucketname/hadoop/krb5.conf, oss://testBucketname/hadoop/hadoop.keytab, oss://testBucketname/hadoop/core-site.xml, oss://testBucketname/hadoop/hdfs-site.xml;
SET spark.kubernetes.driverEnv.HADOOP_CONF_DIR=/tmp/testBucketname/hadoop;
SET spark.executorEnv.HADOOP_CONF_DIR=/tmp/testBucketname/hadoop;
SET spark.kerberos.keytab=local:///tmp/testBucketname/hadoop/hadoop.keytab;
SET spark.kubernetes.kerberos.krb5.path=oss://testBucketname/hadoop/krb5.conf;
SET spark.executor.extraJavaOptions=-Djava.security.krb5.conf=/tmp/testBucketname/hadoop/krb5.conf;
SET spark.kerberos.principal=hadoop/master-1-1.c-49f95900****.cn-beijing.emr.aliyuncs.com@EMR.C-49F95900****.COM;
SET spark.sql.hive.metastore.version=2.3.9.adb;

-- Hive version earlier than 2.3 only:
SET spark.sql.hive.metastore.jars=path;
SET spark.sql.hive.metastore.version=<actual Hive version>;
SET spark.sql.hive.metastore.jars.path=/tmp/*/*.jar;

SHOW databases;

For parameter descriptions, see the Parameters section.
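
If you already keep the properties as key-value pairs (for example, the conf object of a Spark JAR job), the SET prefix can be generated rather than typed by hand. The Python sketch below is a minimal illustration. The helper name to_set_statements is hypothetical, and values are written exactly as given, so pass "true" as a string rather than a Python boolean.

```python
def to_set_statements(conf, query):
    """Render Spark properties as SET statements followed by the query."""
    lines = [f"SET {key}={value};" for key, value in conf.items()]
    # Normalize the query so it ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


example = to_set_statements(
    {"spark.adb.eni.enabled": "true", "spark.executor.instances": 1},
    "SHOW databases",
)
print(example)
```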

Troubleshooting

JDBC connection fails with JDOFatalDataStoreException

Symptom: The job fails with an error similar to:

Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database.
JDBC url = jdbc:mysql://...

Cause: The JDBC connection parameters are misconfigured.

Fix: Verify the following parameters in the EMR console under Services > HIVE > Configure > hivemetastore-site:

  • spark.hadoop.javax.jdo.option.ConnectionURL — check the hostname, port, and database name

  • spark.hadoop.javax.jdo.option.ConnectionUserName — check the database account name

  • spark.hadoop.javax.jdo.option.ConnectionPassword — check the password

  • spark.hadoop.javax.jdo.option.ConnectionDriverName — must be com.mysql.jdbc.Driver
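
To check the hostname, port, and database name quickly, you can split the connection URL before submitting the job. The Python sketch below uses the standard urllib.parse module. The helper name describe_jdbc_url and the host name in the usage note are hypothetical.

```python
from urllib.parse import urlparse


def describe_jdbc_url(jdbc_url):
    """Split a jdbc:mysql:// URL into host, port, and database name."""
    scheme_prefix = "jdbc:"
    if not jdbc_url.startswith(scheme_prefix + "mysql://"):
        raise ValueError("expected a URL starting with jdbc:mysql://")
    # Drop the "jdbc:" prefix so the rest parses as an ordinary URL.
    parsed = urlparse(jdbc_url[len(scheme_prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port,  # None when the URL does not specify a port
        "database": parsed.path.lstrip("/") or None,
    }
```

For example, describe_jdbc_url("jdbc:mysql://db.example.internal:3306/hivemeta?characterEncoding=UTF-8") reports the host, port 3306, and database hivemeta. A database value of None means that the URL is missing the database name.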

Kerberos authentication fails

Symptom: The job fails with an error containing GSS initiate failed or KrbException.

Fix:

  1. Confirm that the krb5.conf, hadoop.keytab, core-site.xml, and hdfs-site.xml files are uploaded to the OSS paths specified in spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES.

  2. Verify that spark.kerberos.principal matches the output of listprincs on the EMR master node (the string prefixed with hadoop/master).

  3. Check that the format of spark.kerberos.keytab is local:///tmp/<OSS path>, not the OSS path itself.
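
The three checks above can be partly automated against the job configuration itself. The Python sketch below only inspects the property values and cannot confirm that the files actually exist in OSS. The helper name kerberos_conf_problems is hypothetical.

```python
REQUIRED_FILES = ("krb5.conf", "hadoop.keytab", "core-site.xml", "hdfs-site.xml")


def kerberos_conf_problems(conf):
    """Return a list of human-readable problems found in the Kerberos properties."""
    problems = []
    downloads = conf.get("spark.executorEnv.ADB_SPARK_DOWNLOAD_FILES", "")
    # Keep only the file name of each comma-separated OSS path.
    names = {
        path.strip().rsplit("/", 1)[-1]
        for path in downloads.split(",")
        if path.strip()
    }
    for required in REQUIRED_FILES:
        if required not in names:
            problems.append(f"executor download list is missing {required}")
    keytab = conf.get("spark.kerberos.keytab", "")
    if not keytab.startswith("local:///tmp/"):
        problems.append("spark.kerberos.keytab must start with local:///tmp/")
    principal = conf.get("spark.kerberos.principal", "")
    if not principal.startswith("hadoop/master"):
        problems.append("spark.kerberos.principal should start with hadoop/master")
    return problems
```

An empty list means that none of these checks found a problem.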

Domain name resolution fails

Symptom: The job fails with an UnknownHostException for the Hive metastore hostname.

Fix: Set spark.adb.eni.extraHosts to map the metastore's IP address to its hostname:

"spark.adb.eni.extraHosts": "<ip_address> <hostname>"

To find the hostname, check fs.defaultFS in core-site.xml on the EMR cluster. To find the IP address, check /etc/hosts on the master node.
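
If you copy the contents of /etc/hosts from the master node, the mapping can be extracted with a short script. The Python sketch below assumes the usual hosts-file layout (an IP address followed by one or more names, with # starting a comment). The helper names and the addresses in the usage note are made up for illustration.

```python
def extra_hosts_entry(hosts_text, hostname):
    """Find hostname in /etc/hosts-style text and return "<ip> <hostname>"."""
    for line in hosts_text.splitlines():
        # Ignore comments, then split into the IP address and its names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return f"{fields[0]} {hostname}"
    return None


def extra_hosts_value(entries):
    """Join several "<ip> <hostname>" entries into one extraHosts value."""
    return ", ".join(entries)
```

For example, given the line "192.168.0.10 master-1-1.example.internal master-1-1", extra_hosts_entry returns "192.168.0.10 master-1-1.example.internal" for that hostname, and None for a hostname that is not listed.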

What's next