Configure Spark Metadata via DLF or Hive Metastore - E-MapReduce

E-MapReduce (EMR) on ACK supports two metadata management options for Spark clusters: Data Lake Formation (DLF), a managed service, or a self-managed Hive metastore. This topic describes how to configure each option.

Choose a metadata management method

Item	DLF (recommended)	Self-managed Hive metastore
Management	Fully managed by Alibaba Cloud	Managed by you
Best for	Production environments in which you do not want to maintain a separate metadata database; environments that use multiple big data compute engines, such as MaxCompute, Hologres, and Machine Learning Platform for AI; or environments with multiple EMR clusters	Existing Hive metastore deployments that you want to reuse
Setup effort	Enable with one click in the console	Configure a Thrift URI and deploy client configuration

Prerequisites

Before you begin, ensure that you have:

A Spark cluster created on the EMR on ACK page of the E-MapReduce console. For more information, see Step 2: Create a cluster.
(If using DLF) Data Lake Formation (DLF) activated. For more information, see Quick start.
(If using a self-managed Hive metastore) A self-managed Hive metastore that is accessible from the Container Service for Kubernetes (ACK) clusters that you created.
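
The reachability requirement in the last prerequisite can be checked before you change any cluster configuration. The following is a minimal sketch, not an official tool: it assumes that you can run Python from a pod or node inside the ACK cluster, and the function name `metastore_reachable` is hypothetical. It only tests whether a TCP connection to the metastore port can be opened; it does not verify that the service speaks the Hive metastore Thrift protocol.

```python
import socket


def metastore_reachable(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    9083 is the default Hive metastore port; pass another value if your
    metastore listens elsewhere.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `metastore_reachable("192.0.2.10")` probes a placeholder address. Replace it with the IP address of your own metastore. A result of `False` usually points to a network or security group rule that blocks the port.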

Method 1 (recommended): Manage metadata by using DLF

Log on to the EMR on ACK console. On the EMR on ACK page, find your Spark cluster and click its name.
On the Cluster Details tab, click Enable next to Data Lake Formation (DLF).
In the Enable DLF dialog, click OK.

After DLF is enabled, the metadata of jobs submitted to the Spark cluster is automatically stored in DLF.

Method 2: Manage metadata by using a self-managed Hive metastore

Log on to the EMR on ACK console. On the EMR on ACK page, find your Spark cluster and click Configure in the Actions column.
On the Configure tab, click the spark-defaults.conf tab.

Click Add Configuration Item and set the following parameters:

Parameter	Value
Key	`spark.hadoop.hive.metastore.uris`
Value	`thrift://<IP address of the self-managed Hive metastore>:9083`

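Replace <IP address of the self-managed Hive metastore> with the IP address of your Hive metastore. The URI uses the Thrift protocol, and 9083 is the default port of the Hive metastore service. If your metastore listens on a different port, use that port instead.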

Click OK. In the dialog that appears, enter a reason in the Execution Reason field and click Save.
At the bottom of the Configure tab, click Deploy Client Configuration. In the dialog that appears, enter a reason in the Execution Reason field, click OK, and then click OK in the Confirm dialog.
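
The key and value entered in the configuration step above follow a fixed pattern, which can be sketched as a small helper. This is an illustrative sketch only: the function names `metastore_uri` and `spark_defaults_entry` are hypothetical and are not part of EMR or Spark. It assumes a single metastore address and the default port 9083 unless you pass another port.

```python
def metastore_uri(host: str, port: int = 9083) -> str:
    """Build the Thrift URI for a Hive metastore at host:port."""
    host = host.strip()
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"thrift://{host}:{port}"


def spark_defaults_entry(host: str, port: int = 9083) -> str:
    """Return the key and value as one spark-defaults.conf style line."""
    # spark-defaults.conf separates the key and the value with whitespace.
    return f"spark.hadoop.hive.metastore.uris {metastore_uri(host, port)}"
```

Calling `spark_defaults_entry("192.0.2.10")` with a placeholder address yields the same key and value pair that the console fields expect, which can help when you review or script the setting.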

After the client configuration is deployed, the metadata of jobs submitted to the Spark cluster is automatically stored in the self-managed Hive metastore.
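
To confirm that Spark jobs see the metastore you configured, a job can list the databases that the metastore exposes. The following PySpark sketch is one possible check, not a prescribed procedure: it assumes that PySpark is available in the job image and that the job runs on the Spark cluster configured above. The function names and the application name are hypothetical.

```python
def build_session():
    """Create a Hive-enabled Spark session (assumes PySpark is installed)."""
    # Imported lazily so that list_databases below has no PySpark dependency.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("metastore-check")  # hypothetical application name
        .enableHiveSupport()  # use the Hive metastore as the catalog
        .getOrCreate()
    )


def list_databases(spark) -> list:
    """Return the database names visible through the configured metastore."""
    # The name is in the first column; the column label differs across
    # Spark versions, so it is read by position rather than by name.
    return [row[0] for row in spark.sql("SHOW DATABASES").collect()]
```

In a job, `print(list_databases(build_session()))` prints the database names. If databases that already exist in your metastore appear in the output, the job is reading from the expected metastore.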