E-MapReduce (EMR) metadata includes core information about data storage, data structure, and access permissions within EMR clusters. EMR supports three metadata services: Data Lake Formation (DLF), self-managed ApsaraDB RDS, and built-in MySQL. This topic compares the three services to help you choose the right one for your workload.
Which service should I use?
| Situation | Recommended service |
|---|---|
| New cluster for test or production use | DLF |
| Existing clusters that need cross-cluster metadata sharing | DLF |
| Fine-grained control over the metadata store with your own RDS instance | Self-managed ApsaraDB RDS |
| Short-term proof-of-concept (POC) testing on a single cluster | Built-in MySQL |
For most workloads, use DLF. It requires no operations and maintenance (O&M) effort, supports Hive, Spark, Presto, MaxCompute, and Hologres, and provides built-in high availability (HA).
Do not use built-in MySQL in test or production environments. The MySQL database runs on a single master node without HA, which can cause service instability, and it does not support cross-cluster metadata sharing.
Comparison of metadata services
| Item | DLF | Self-managed ApsaraDB RDS | Built-in MySQL |
|---|---|---|---|
| Backend storage | Data is stored in DLF. | Data is stored in ApsaraDB RDS for MySQL instances. Purchase and configure an instance before cluster creation. | Data is stored in the MySQL instance of an EMR cluster. |
| Applicable environment | Test and production | Test and production | POC testing of a single cluster only |
| Cross-cluster metadata sharing | Supported | Supported | Not supported |
| Engine compatibility | Hive, Spark, Presto, MaxCompute, and Hologres | Hive, Spark, and Presto | Hive, Spark, and Presto |
| Metadata management | Visualized metadata retrieval, metadata management, multi-version management, data statistics, and lifecycle management | None | None |
| High availability | Primary/secondary disaster recovery | Primary/secondary disaster recovery | Not supported |
| O&M cost | No O&M required. Auto scaling is supported. | Manual O&M required (upgrades and scale-out). Suitable for fine-grained cluster management. | Increases upgrade costs. MySQL runs on a single cluster node. |
| Billing | Currently free. See Billing for details. | Charged by computing and storage resources. See Billable items for details. | Free |
Check supported regions before selecting a metadata service. For DLF supported regions, see Supported regions and endpoints.
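The comparison above can be read as a simple filter over requirements. The following Python sketch is only an illustration: it encodes the environment, sharing, engine, and HA rows of the table literally, and the `MetadataService` class and `candidates` function are hypothetical names, not part of any EMR or DLF SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataService:
    """One column of the comparison table (hypothetical helper type)."""
    name: str
    environments: frozenset
    cross_cluster_sharing: bool
    engines: frozenset
    high_availability: bool


BASE_ENGINES = frozenset({"Hive", "Spark", "Presto"})

# Values copied from the comparison table; nothing else is assumed.
SERVICES = [
    MetadataService("DLF", frozenset({"test", "production"}), True,
                    BASE_ENGINES | {"MaxCompute", "Hologres"}, True),
    MetadataService("Self-managed ApsaraDB RDS", frozenset({"test", "production"}), True,
                    BASE_ENGINES, True),
    MetadataService("Built-in MySQL", frozenset({"poc"}), False,
                    BASE_ENGINES, False),
]


def candidates(environment, engines, need_sharing=False, need_ha=False):
    """Return the names of services whose table rows satisfy every requirement."""
    required = frozenset(engines)
    return [
        s.name
        for s in SERVICES
        if environment in s.environments
        and required <= s.engines
        and (s.cross_cluster_sharing or not need_sharing)
        and (s.high_availability or not need_ha)
    ]


print(candidates("production", ["Hive", "Hologres"], need_sharing=True))
```

A production workload that needs Hologres leaves DLF as the only candidate, while a single-cluster POC that needs only Hive matches built-in MySQL. Region availability, which the table does not capture, still has to be checked separately.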
Deployment architecture
DLF
Metadata is stored in DLF and shared across multiple clusters. The DLF client SDK exposes APIs compatible with Hive Metastore, so engines can access metadata directly through the SDK. For more information, see the Product introduction documentation.
[Figure: Deployment architecture of DLF in a single cluster]
[Figure: Deployment architecture of DLF in multiple clusters]
Self-managed ApsaraDB RDS
Metadata is stored in ApsaraDB RDS for MySQL instances and shared across multiple clusters via Hive Metastore.
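Hive Metastore reaches a MySQL-compatible backend through the standard `javax.jdo.option.*` JDBC properties in `hive-site.xml`. The Python sketch below shows, under stated assumptions, how those properties might be assembled for an RDS instance. The host name, database name, user, and password are placeholders, and the helper function names are hypothetical. The exact values and the place to set them in the EMR console are described in the linked configuration topic.

```python
from xml.sax.saxutils import escape


def metastore_jdbc_properties(host, port, database, user, password):
    """Build the standard Hive Metastore JDBC properties for a MySQL backend."""
    url = (
        f"jdbc:mysql://{host}:{port}/{database}"
        "?createDatabaseIfNotExist=true&characterEncoding=UTF-8"
    )
    return {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


def to_hive_site_fragment(props):
    """Render properties as hive-site.xml <property> elements, XML-escaping values."""
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)


# All argument values below are placeholders, not real endpoints or credentials.
props = metastore_jdbc_properties(
    host="rds-host.example.com",
    port=3306,
    database="hivemeta",
    user="hive_user",
    password="change-me",
)
print(to_hive_site_fragment(props))
```

Because every cluster that points at the same connection URL reads the same metastore database, this is what allows metadata to be shared across clusters.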
[Figure: Deployment architecture of self-managed ApsaraDB RDS in a single cluster]
[Figure: Deployment architecture of self-managed ApsaraDB RDS in multiple clusters]
Built-in MySQL
Metadata is stored in MySQL instances deployed in EMR clusters, usually on the master nodes of the clusters. Because each cluster has its own MySQL database, metadata cannot be shared across clusters.
The default credentials for built-in MySQL are username `root` and password `EMRroot1234`.
[Figure: Deployment architecture of built-in MySQL in a single cluster]
[Figure: Deployment architecture of built-in MySQL in multiple clusters]
What's next
- To switch to DLF for unified metadata storage, see Use DLF for unified metadata storage.
- To configure a self-managed ApsaraDB RDS for MySQL database as the metadata store, see Configure self-managed RDS.