E-MapReduce (EMR) on ACK supports two metadata management options for Spark clusters: Data Lake Formation (DLF), a managed service, and a self-managed Hive metastore. This topic describes how to configure each option.
Choose a metadata management method
| | DLF (recommended) | Self-managed Hive metastore |
|---|---|---|
| Management | Fully managed by Alibaba Cloud | Managed by you |
| Best for | Production environments where you do not want to maintain a separate metadata database; environments with multiple big data compute engines (MaxCompute, Hologres, Machine Learning Platform for AI); or environments with multiple EMR clusters | Existing Hive metastore deployments that you want to reuse |
| Setup effort | Enable with one click in the console | Configure a Thrift URI and deploy client configuration |
Prerequisites
Before you begin, ensure that you have:
-
A Spark cluster created on the EMR on ACK page of the E-MapReduce console. For more information, see Step 2: Create a cluster.
-
(If using DLF) Data Lake Formation (DLF) activated. For more information, see Quick start.
-
(If using a self-managed Hive metastore) A self-managed Hive metastore that is accessible from the Container Service for Kubernetes (ACK) cluster you created.
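The accessibility prerequisite for a self-managed Hive metastore can be checked with a plain TCP connection test. The following is a minimal sketch, not an official tool: the function name `metastore_reachable` is hypothetical, port 9083 is the metastore port used later in this topic, and the check is only meaningful when run from a pod or node inside the ACK cluster. A successful TCP connection shows that the port is open, not that the Thrift service behind it is healthy.

```python
import socket


def metastore_reachable(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False


# Example call (the address is a placeholder, replace it with your metastore IP):
# metastore_reachable("192.0.2.10")
```

If the function returns False from inside the cluster, check the network path between the ACK cluster and the metastore host before you continue with Method 2.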
Method 1 (recommended): Manage metadata by using DLF
-
Log on to the EMR on ACK console. On the EMR on ACK page, find your Spark cluster and click its name.
-
On the Cluster Details tab, click Enable next to Data Lake Formation (DLF).
-
In the Enable DLF dialog, click OK.
After DLF is enabled, metadata generated by jobs submitted to the Spark cluster is automatically stored in DLF.
Method 2: Manage metadata by using a self-managed Hive metastore
-
Log on to the EMR on ACK console. On the EMR on ACK page, find your Spark cluster and click Configure in the Actions column.
-
On the Configure tab, click the spark-defaults.conf tab.
-
Click Add Configuration Item and set the following parameters:
| Parameter | Value |
|---|---|
| Key | spark.hadoop.hive.metastore.uris |
| Value | thrift://<IP address of the self-managed Hive metastore>:9083 |

Replace <IP address of the self-managed Hive metastore> with the IP address of your Hive metastore. The value uses the Thrift protocol on port 9083.
-
Click OK. In the dialog that appears, enter a reason in the Execution Reason field and click Save.
-
At the bottom of the Configure tab, click Deploy Client Configuration. In the dialog that appears, enter a reason in the Execution Reason field, click OK, and then click OK in the Confirm dialog.
After the client configuration is deployed, metadata generated by jobs submitted to the Spark cluster is automatically stored in the self-managed Hive metastore.
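The key and value entered in Method 2 can be assembled and sanity-checked before you type them into the console. The sketch below is illustrative only: the helper names `metastore_uri` and `spark_defaults_entry` are hypothetical, the example address is a placeholder, and the code assumes an IPv4 address and the port 9083 described above.

```python
import ipaddress
from typing import Tuple

# Configuration key from the spark-defaults.conf step above.
METASTORE_URIS_KEY = "spark.hadoop.hive.metastore.uris"


def metastore_uri(ip: str, port: int = 9083) -> str:
    """Build the Thrift URI for a Hive metastore at the given IPv4 address."""
    # IPv4Address raises ValueError if the string is not a valid IPv4 address.
    addr = ipaddress.IPv4Address(ip)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"thrift://{addr}:{port}"


def spark_defaults_entry(ip: str, port: int = 9083) -> Tuple[str, str]:
    """Return the (key, value) pair to enter as a configuration item."""
    return METASTORE_URIS_KEY, metastore_uri(ip, port)


# Example with a placeholder address:
key, value = spark_defaults_entry("192.0.2.10")
print(f"{key} {value}")
```

Validating the address up front catches typos such as a missing octet before the configuration is deployed to clients.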