This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Hive clusters in your virtual private cloud (VPC).
DLA is discontinued. AnalyticDB for MySQL Data Lakehouse Edition supports the features of DLA and provides more features and better performance. For more information about how to access Hive clusters from AnalyticDB for MySQL Spark, see Access a Hive data source.
Prerequisites
DLA is activated. For more information, see Activate DLA.
You have logged on to the DLA console and created a Spark virtual cluster in the console.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
The vSwitch ID and security group ID that are required for creating a Spark compute node are obtained. You can select an existing vSwitch and an existing security group. You can also create a vSwitch and a security group. The vSwitch and security group that you select must meet the following conditions:
The vSwitch must be in the same VPC as the Hive cluster that you want to access. You can use the ID of the vSwitch in which the cluster is deployed. To obtain the vSwitch ID, go to the details page of the cluster in the Alibaba Cloud Management Console.
The security group must be in the same VPC as the Hive cluster that you want to access. To obtain the ID of a security group, go to the Elastic Compute Service (ECS) console. In the left-side navigation pane, choose Network & Security > Security Groups. On the Security Group page, enter the VPC ID in the search box to search for the security groups associated with the VPC and select the ID of a security group.
If a whitelist is configured for the Hive cluster for access control, you must add the CIDR block of the vSwitch to the whitelist.
Procedure
If your Hive metadata is stored in an independent ApsaraDB RDS database and table data is stored in OSS, you can use the configurations in the following code and skip subsequent steps. Otherwise, proceed with the following steps.
{
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",  # The entry class of the test code. You can modify this parameter based on your business requirements.
  "jars": [
    "oss://path/to/mysql-connector-java-5.1.47.jar"
  ],
  "conf": {
    "spark.dla.eni.vswitch.id": "{vSwitch ID}",
    "spark.dla.eni.security.group.id": "{Security group ID}",
    "spark.dla.eni.enable": "true",
    "spark.driver.resourceSpec": "medium",
    "spark.dla.connectors": "oss",
    "spark.executor.instances": 1,
    "spark.sql.catalogImplementation": "hive",
    "spark.executor.resourceSpec": "medium",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "<hive_user_name>",  # The username that is used to log on to the ApsaraDB RDS database in which Hive metadata is stored.
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "<your_pass_word>",  # The password that is used to log on to the ApsaraDB RDS database in which Hive metadata is stored.
    "spark.hadoop.javax.jdo.option.ConnectionURL": "<JDBC URL>",  # The Java Database Connectivity (JDBC) URL of the ApsaraDB RDS database in which Hive metadata is stored.
    "spark.dla.job.log.oss.uri": "<OSS path in which logs are stored>"
  },
  "file": "<oss://Directory in which the JAR file of the main resource is stored>"
}
Note: The JAR file specified in the jars parameter is the JDBC driver for MySQL databases. You can obtain the JAR file from the official Maven Repository website and upload the file to OSS.
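As a hedged illustration of the ConnectionURL parameter in the preceding configuration: the value is a standard MySQL JDBC URL that points to the metastore database. The endpoint and database name below are placeholders, not values from your environment:

```json
"spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://rm-xxxxxx.mysql.rds.aliyuncs.com:3306/hive_metastore_db"
```

Typically, you use the internal (VPC) endpoint of the ApsaraDB RDS instance so that the Spark compute node can reach the database over the ENI.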
Obtain the required parameter configurations of the Hive cluster to configure the serverless Spark engine of DLA.
Note: If you cannot run Spark jobs in the Hive cluster, skip this step.
To obtain the parameter configurations of the Hive cluster that you want to access, run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar file and upload the file to OSS. Then, submit a Spark job to the Hive cluster to obtain the parameter configurations of the Hive cluster.

wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
If you want to access an E-MapReduce (EMR) cluster, you can upload the JAR file to OSS and run the following command to submit a job to the EMR cluster:
--class com.aliyun.spark.util.GetConfForServerlessSpark --deploy-mode client ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar get hive hadoop
After the job is complete, you can view the configurations from the stdout output of the driver on the SparkUI or from the logs on the job details page.
If you want to access an ApsaraDB for HBase cluster of X-Pack Spark, you can upload the JAR file to the resource management directory and run the following command to submit a job to the cluster:
--class com.aliyun.spark.util.GetConfForServerlessSpark /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar get hive hadoop
After the job is complete, you can view the configurations from the stdout output of the driver on the SparkUI.
If you want to access a Hive cluster of another type and the HIVE_CONF_DIR environment variable is not specified in the cluster, you must manually specify this variable by using the --hive-conf-dir option:

--class com.aliyun.spark.util.GetConfForServerlessSpark --deploy-mode client /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar get --hive-conf-dir </path/to/your/hive/conf/dir> hive hadoop
Write the code in the file of a Spark application to access the Hive cluster.
In the following sample code, a table is created in the default namespace based on the table name that is passed as an argument. The table contains only one column of data of the STRING type, whose value is hello, dla-spark. The data in this column is read and then displayed in the stdout output.

package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkHive {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("Spark HIVE TEST")
      .enableHiveSupport()
      .getOrCreate()

    val welcome = "hello, dla-spark"

    // The name of the Hive table.
    val tableName = args(0)

    import sparkSession.implicits._
    // Save a DataFrame that contains only one row and one column of data to the Hive table. The table name is the same as the value of the tableName variable, and the column name is welcome_col.
    val df = Seq(welcome).toDF("welcome_col")
    df.write.format("hive").mode("overwrite").saveAsTable(tableName)

    // Read the data back from the Hive table.
    val dfFromHive = sparkSession.sql(
      s"""
         |select * from $tableName
         |""".stripMargin)
    dfFromHive.show(10)
  }
}
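If the job succeeds, the dfFromHive.show(10) call in the sample code prints the table content to the driver's stdout. The output should look similar to the following sketch (exact column padding depends on your Spark version):

```
+----------------+
|     welcome_col|
+----------------+
|hello, dla-spark|
+----------------+
```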
Upload the JAR file and dependencies of the Spark application to OSS.
For more information, see Upload objects.
Note: OSS and the serverless Spark engine of DLA must be deployed in the same region.
Submit a job in the serverless Spark engine of DLA and perform data computing.
The following sample code shows how to access a high-availability Hive cluster that has more than one master node or NameNode. For more information, see Create and run Spark jobs and Configure a Spark job.
{
  "args": [
    "hello_dla"
  ],
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",
  "conf": {
    "spark.sql.catalogImplementation": "hive",
    "spark.dla.eni.vswitch.id": "{vSwitch ID}",
    "spark.dla.eni.security.group.id": "{Security group ID}",
    "spark.dla.eni.enable": "true",
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 1,
    "spark.executor.resourceSpec": "medium",
    "spark.dla.job.log.oss.uri": "oss://<Directory in which your SparkUI logs are stored>",
    "spark.hadoop.hive.metastore.uris": "thrift://${ip}:${port},thrift://${ip}:${port}",
    "spark.hadoop.dfs.nameservices": "{Names of your nameservices}",
    "spark.hadoop.dfs.client.failover.proxy.provider.${nameservices}": "{Full path name of the implementation class of the failover proxy provider}",
    "spark.hadoop.dfs.ha.namenodes.${nameservices}": "{NameNodes to which your nameservices belong}",
    "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1}": "IP address of NameNode 0:Port number of NameNode 0",
    "spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn2}": "IP address of NameNode 1:Port number of NameNode 1"
  },
  "file": "oss://{Directory in which your JAR file is stored}"
}
The following table describes the parameters.
| Parameter | Description | Note |
| --- | --- | --- |
| spark.hadoop.hive.metastore.uris | The URIs of the Hive metastore service that you want to access. These URIs correspond to the hive.metastore.uris configuration item in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the format of Domain name:Port number. You must convert the value to the format of IP address:Port number. | To view the mapping between domain names and IP addresses, log on to the master node of the cluster and view the hosts file in the /etc directory, or ping the domain name on the master node. You can also obtain the mapping by performing the operations in Step 2. |
| spark.dla.eni.vswitch.id | The ID of the vSwitch that you select. | None |
| spark.dla.eni.security.group.id | The ID of the security group that you select. | None |
| spark.dla.eni.enable | Specifies whether to enable an elastic network interface (ENI). | None |
| spark.hadoop.dfs.nameservices | This parameter corresponds to the dfs.nameservices parameter in the hdfs-site.xml file. | None |
| spark.dla.job.log.oss.uri | The OSS path in which your SparkUI logs are stored. | None |
| spark.hadoop.dfs.client.failover.proxy.provider.${nameservices} | This parameter corresponds to the dfs.client.failover.proxy.provider.${nameservices} parameter in the hdfs-site.xml file. | None |
| spark.hadoop.dfs.ha.namenodes.${nameservices} | This parameter corresponds to the dfs.ha.namenodes.${nameservices} parameter in the hdfs-site.xml file. | None |
| spark.hadoop.dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} | This parameter corresponds to the dfs.namenode.rpc-address.${nameservices}.${nn1/nn2} parameter in the hdfs-site.xml file. | The value must be in the format of IP address:Port number. To view the mapping between domain names and IP addresses, log on to the master node of the Hive cluster and view the hosts file in the /etc directory, or ping the domain name on the master node. You can also obtain the mapping by performing the operations in Step 2. |
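As a hedged example of how the HA-related parameters map to hdfs-site.xml, assume a nameservice named emr-cluster with NameNodes nn1 and nn2. The class name shown is the default HDFS failover proxy provider, and the IP addresses and port number are placeholders:

```json
"spark.hadoop.dfs.nameservices": "emr-cluster",
"spark.hadoop.dfs.client.failover.proxy.provider.emr-cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"spark.hadoop.dfs.ha.namenodes.emr-cluster": "nn1,nn2",
"spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn1": "192.168.0.1:8020",
"spark.hadoop.dfs.namenode.rpc-address.emr-cluster.nn2": "192.168.0.2:8020"
```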
After the job is complete, find the job and click Log to view the logs of the job.

The following sample code shows how to access a non-high-availability Hive cluster that has only one master node or NameNode:
{
  "args": [
    "hello_dla"
  ],
  "name": "spark-on-hive",
  "className": "com.aliyun.spark.SparkHive",
  "conf": {
    "spark.sql.catalogImplementation": "hive",
    "spark.dla.eni.vswitch.id": "{vSwitch ID}",
    "spark.dla.eni.security.group.id": "{Security group ID}",
    "spark.dla.eni.enable": "true",
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 1,
    "spark.executor.resourceSpec": "medium",
    "spark.dla.job.log.oss.uri": "oss://<Directory in which your SparkUI logs are stored>",
    "spark.hadoop.hive.metastore.uris": "thrift://${ip}:${port},thrift://${ip}:${port}",
    "spark.dla.eni.extra.hosts": "${ip0} ${hostname_0},${ip1} ${hostname_1}"
  },
  "file": "oss://{Directory in which your JAR file is stored}"
}
The following table describes the parameters.

| Parameter | Description | Note |
| --- | --- | --- |
| spark.hadoop.hive.metastore.uris | The URIs of the Hive metastore service that you want to access. These URIs correspond to the hive.metastore.uris configuration item in the ${HIVE_CONF_DIR}/hive-site.xml file. The value of hive.metastore.uris is in the format of Domain name:Port number. You must convert the value to the format of IP address:Port number. | To view the mapping between domain names and IP addresses, log on to the master node of the cluster and view the hosts file in the /etc directory, or ping the domain name on the master node. You can also obtain the mapping by performing the operations in Step 2. |
| spark.dla.job.log.oss.uri | The OSS path in which your SparkUI logs are stored. | None |
| spark.dla.eni.vswitch.id | The ID of the vSwitch that you select. | None |
| spark.dla.eni.security.group.id | The ID of the security group that you select. | None |
| spark.dla.eni.enable | Specifies whether to enable an ENI. | None |
| spark.dla.eni.extra.hosts | The mappings between IP addresses and hostnames. Based on these mappings, the serverless Spark engine of DLA can correctly resolve the domain names when it parses the location of the Hive table. Important: Separate an IP address and a hostname with a space. Separate groups of IP addresses and hostnames with commas (,). Example: ip0 master0,ip1 master1. | You can obtain the value of this parameter from the fs.defaultFS property in the core-site.xml file in the ${HIVE_CONF_DIR} directory of the Hive cluster. For example, if the value of fs.defaultFS is hdfs://master-1:9000, set this parameter to ${IP address of master-1} master-1. To view the mapping between domain names and IP addresses, log on to the master node of your self-managed Hive cluster and view the hosts file in the /etc directory. You can also obtain the mapping by performing the operations in Step 2. |
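As a hedged example of this derivation: if fs.defaultFS in core-site.xml is hdfs://master-1:9000 and the hosts file on the master node maps 192.168.0.10 to master-1 (both values are placeholders), the parameter would be set as follows:

```json
"spark.dla.eni.extra.hosts": "192.168.0.10 master-1"
```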