This topic describes the types of metadatabases that are supported by E-MapReduce (EMR) and compares their advantages.

Metadatabase types

EMR Hive metadata can be stored in the following types of metadatabases: Data Lake Formation (DLF), self-managed ApsaraDB RDS, and built-in MySQL.

DLF

Metadata is stored in Data Lake Formation (DLF), which provides an O&M-free, highly available, and high-performance unified metadata service. The metadata service is compatible with multiple versions of the Hive metastore, integrates seamlessly with the open source compute engines in EMR, and supports version management of Hive metastores and data profiling. DLF also provides features such as data exploration, data lake management, and data permission management, and integrates with other Alibaba Cloud computing services such as MaxCompute, Databricks DataInsight (DDI), and Hologres, so you can use DLF in a wide range of computing scenarios. For more information, see Overview.

If you use DLF to store Hive metadata, you do not need to deploy a Hive metastore in an EMR cluster. This is the major advantage of DLF over self-managed ApsaraDB RDS and built-in MySQL. The services that are used to query metadata and store metadata are hosted in DLF. This helps eliminate O&M costs. In addition, DLF supports various engines, such as MaxCompute, Realtime Compute for Apache Flink, DDI, and Hologres, and allows these engines to share metadata in lakehouse solutions or among multiple clusters. The DLF client SDK provides APIs that are compatible with Hive metastores. This way, the engines can directly use the DLF client SDK to access metadata in DLF. Users can also use the DLF client to access metadata in DLF.

Figure 1. Deployment architecture of DLF in a single cluster
Figure 2. Deployment architecture of DLF in multiple clusters

Self-managed ApsaraDB RDS

Metadata is stored in ApsaraDB RDS. The deployment architecture of self-managed ApsaraDB RDS in EMR clusters is similar to that of built-in MySQL. The difference is that built-in MySQL stores metadata in a MySQL database deployed inside the cluster, whereas self-managed ApsaraDB RDS stores metadata in an ApsaraDB RDS for MySQL database outside the cluster. Self-managed ApsaraDB RDS supports metadata sharing among multiple clusters, as shown in Figure 4: the Hive metastore in each cluster accesses the same ApsaraDB RDS database.

Figure 3. Deployment architecture of self-managed ApsaraDB RDS in a single cluster
Figure 4. Deployment architecture of self-managed ApsaraDB RDS in multiple clusters

Built-in MySQL

Metadata is stored in MySQL. MySQL Server instances are deployed in EMR clusters, usually on the master nodes of the clusters. Engines such as Hive, Spark, and Presto do not read MySQL directly. Instead, they access metadata through a Hive metastore: an engine uses the Thrift protocol to access the Hive metastore, and the Hive metastore uses the JDBC protocol to access MySQL.
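The two hops described above can be sketched as Hive configuration properties. This is a minimal illustration, not the configuration that EMR generates: hive.metastore.uris and javax.jdo.option.ConnectionURL are standard Hive property names, but the host name master-1, the database name hivemeta, and the helper function names are assumptions made for the example.

```python
# Sketch of the access chain: engine --Thrift--> Hive metastore --JDBC--> MySQL.
# Host names and the database name are hypothetical placeholders.


def engine_side_properties(metastore_host: str, port: int = 9083) -> dict:
    """Property that an engine such as Hive or Spark uses to reach the metastore over Thrift.

    9083 is the conventional default port of the Hive metastore service.
    """
    return {"hive.metastore.uris": f"thrift://{metastore_host}:{port}"}


def metastore_side_properties(mysql_host: str, database: str, port: int = 3306) -> dict:
    """Property that the Hive metastore uses to reach MySQL over JDBC."""
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{mysql_host}:{port}/{database}"
    }


if __name__ == "__main__":
    hops = (
        engine_side_properties("master-1"),
        metastore_side_properties("master-1", "hivemeta"),
    )
    for props in hops:
        for name, value in props.items():
            print(f"{name} = {value}")
```

Keeping the two property sets separate mirrors the architecture: only the metastore needs the database address, and engines need only the Thrift address of the metastore.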

You can also use the MySQL client to connect to the MySQL Server instances and view metadata directly. Metadata cannot be shared among multiple clusters because each cluster has its own MySQL database, as shown in Figure 6.
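As a rough illustration of viewing metadata directly, the following sketch lists the tables recorded in a metastore database. It rests on several assumptions: DBS and TBLS are tables of the Hive metastore schema, but only a few of their columns are reproduced here, the sample rows are invented, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a cluster. Against a real cluster, you would run the same SELECT statement in the MySQL client.

```python
import sqlite3

# Simplified stand-in for two tables of the Hive metastore schema.
# The real DBS and TBLS tables have more columns than shown here.
SCHEMA = """
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT, DB_LOCATION_URI TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT, TBL_TYPE TEXT);
"""

# Lists every table together with the database that it belongs to.
LIST_TABLES = """
SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
ORDER BY d.NAME, t.TBL_NAME
"""


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with invented sample metadata."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO DBS VALUES (1, 'default', 'hdfs:///user/hive/warehouse')")
    conn.executemany(
        "INSERT INTO TBLS VALUES (?, ?, ?, ?)",
        [(1, 1, "orders", "MANAGED_TABLE"), (2, 1, "clicks", "EXTERNAL_TABLE")],
    )
    return conn


def list_tables(conn: sqlite3.Connection) -> list:
    """Return (database, table, table type) rows."""
    return conn.execute(LIST_TABLES).fetchall()


if __name__ == "__main__":
    for row in list_tables(demo_connection()):
        print(row)
```

Reading the tables this way is useful for inspection only. Changes to metadata should go through the Hive metastore so that the schema stays consistent.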

Figure 5. Deployment architecture of built-in MySQL in a single cluster
Figure 6. Deployment architecture of built-in MySQL in multiple clusters

Advantages of different types of metadatabases

Differences between built-in MySQL and self-managed ApsaraDB RDS

If you select self-managed ApsaraDB RDS as the metadatabase, metadata can be shared among clusters.

Self-managed ApsaraDB RDS also outperforms built-in MySQL in terms of availability, reliability, and performance. For more information, see Competitive advantages of ApsaraDB RDS instances over self-managed databases.

Differences between DLF and self-managed ApsaraDB RDS

Item | DLF | Self-managed ApsaraDB RDS
Usability | If DLF is activated, a DLF metadatabase can be directly used in an EMR cluster. | If an ApsaraDB RDS instance is purchased, a self-managed ApsaraDB RDS metadatabase can be directly used in an EMR cluster.
Metadata management | DLF provides various capabilities such as visualized metadata retrieval, metadata management, multi-version management, data statistics, and lifecycle management. | None.
Support for various engines | Hive, Spark, and Presto are supported. MaxCompute and Hologres are supported. | Hive, Spark, and Presto are supported. MaxCompute and Hologres are not supported.
Billing | DLF is free of charge. For more information, see Billing. | The subscription and pay-as-you-go billing methods are supported.
O&M costs | Auto scaling is supported. You do not need to perform O&M operations. | You must perform O&M operations, such as upgrade and scale-out.
High availability | Active/standby disaster recovery is supported. | Active/standby disaster recovery is supported.
Performance | The performance is high, and Hive metastores in EMR clusters are optimized. | The performance is high, but Hive metastores in EMR clusters are not optimized.
Others | Fine-grained data permission management, analysis of data stored in data lakes, and data lake format management are supported. | None.

FAQ: How do non-EMR clusters access metadata in DLF?

To access metadata in DLF from a local test environment or from another cloud service, you must integrate the DLF client SDK into that environment. For more information, see DLF.

Note: Before you access DLF, you must obtain the endpoint, username, and password that are required to access DLF.
  • The endpoint varies based on the region.
  • The username and password are the AccessKey ID and AccessKey secret of the Alibaba Cloud account that you used to activate DLF.
  • You must perform the following operation to change the storage type of the metadata:

    On the Hive service page, set the value of the hive.imetastoreclient.factory.class parameter to com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory. For more information, see Modify parameters.
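For an environment where the Hive configuration is edited by hand instead of on the Hive service page, the parameter above could be written as a hive-site.xml property. The following is a minimal sketch that renders that one property. The parameter name and class name are taken from this topic, while the helper function names are hypothetical. The endpoint and AccessKey settings are deliberately omitted because this topic does not give their parameter names.

```python
import xml.etree.ElementTree as ET

# Parameter name and value as stated in this topic.
FACTORY_PARAM = "hive.imetastoreclient.factory.class"
DLF_FACTORY = "com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory"


def property_element(name: str, value: str) -> ET.Element:
    """Build one <property> element in the hive-site.xml layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def dlf_property_xml() -> str:
    """Render the property that switches the metastore client factory to DLF."""
    return ET.tostring(property_element(FACTORY_PARAM, DLF_FACTORY), encoding="unicode")


if __name__ == "__main__":
    print(dlf_property_xml())
```

The rendered element belongs inside the configuration root element of hive-site.xml, alongside whatever endpoint and credential settings the DLF client SDK documentation specifies.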