A data catalog is the top-level metadata entity in Data Lake Formation (DLF). Each catalog can contain multiple databases and acts as an isolation boundary — binding different E-MapReduce (EMR) clusters to separate catalogs keeps their metadata invisible to each other.
Manage catalogs
Create a catalog
The Location field only accepts Object Storage Service (OSS) paths. If your default storage path is not on OSS, leave this field blank.
- Log on to the Data Lake Formation console.
- In the left-side navigation pane, choose Metadata > Metadata.
- Click the Catalog List tab, and click New Catalog.
- Configure the following fields, and click OK.
| Field | Required | Description |
|---|---|---|
| Catalog ID | Yes | A unique identifier for this catalog. Cannot be duplicated. |
| Description | No | A description of the catalog. |
| Location | No | The default storage path for this catalog. Only OSS paths are supported. Leave blank if your default storage is not OSS. |
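For reference, an OSS Location value follows the `oss://<bucket>/<path>/` format. The bucket and path below are placeholders, not real values:

```
oss://my-bucket/dlf/my-catalog/
```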
View catalogs
- In the left-side navigation pane, choose Metadata > Metadata.
- Click the Catalog List tab to see all catalogs.
Edit a catalog
Only Description and Location are editable.
- In the left-side navigation pane, choose Metadata > Metadata.
- Click the Catalog List tab.
- In the Actions column, click Edit.
- Update Description or Location, and click OK.
Delete a catalog
Deleting a catalog is irreversible. The data cannot be recovered.
- In the left-side navigation pane, choose Metadata > Metadata.
- Click the Catalog List tab.
- In the Actions column, click Delete.
- In the confirmation dialog box, click Delete.
Bind an EMR cluster to a catalog
Each EMR cluster reads metadata from the catalog specified in its compute engine configuration.
Switching to a different catalog invalidates all existing database and table references in the cluster, and any running jobs that depend on those references will fail. Evaluate the impact carefully before you switch.
The following table shows which engines require separate configuration and which inherit Hive settings automatically.
| Engine | Config file | Needs separate config | Version notes |
|---|---|---|---|
| Hive | core-site.xml | Yes | — |
| Spark | hive-site.xml | Yes | EMR 5.6.0, 3.40.0, and earlier use the Hive config |
| Presto | hive.properties | Yes | Supported only in EMR 5.8.0, 3.42.0, and later |
| Impala | — | No | Uses the Hive config automatically |
Hive engine
- In the core-site.xml file of the Hive service, add the following configuration item. For more information, see Manage configuration items.

  | Key | Value |
  |---|---|
  | dlf.catalog.id | The Catalog ID of the DLF catalog |

- Save and deploy the configuration.
  - Click Save, and then click Deploy Client Configuration.
  - In the dialog box, enter an Execution Reason and click OK.
- Restart the Hive service.
  - On the Hive service configuration page, click More > Restart.
  - In the dialog box, enter an Execution Reason and click OK.

  After a successful restart, the Hive service status changes to Healthy.
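As a sketch, the dlf.catalog.id item takes the standard Hadoop property form inside core-site.xml; the value `my_catalog` below is a placeholder for your own Catalog ID:

```xml
<!-- Bind this cluster's Hive service to a DLF catalog.
     Replace my_catalog with your own Catalog ID. -->
<property>
  <name>dlf.catalog.id</name>
  <value>my_catalog</value>
</property>
```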
Spark engine
Modify the hive-site.xml file of the Spark service using the same steps as Hive engine.
For EMR 5.6.0, 3.40.0, and earlier versions, Spark uses the Hive configuration automatically. No separate Spark configuration is needed.
Presto engine
Modify the hive.properties file of the Presto service using the same steps as Hive engine.
Presto catalog binding is supported only in EMR 5.8.0, 3.42.0, and later versions.
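Because hive.properties uses key=value syntax rather than XML, the same setting would look like the following sketch; `my_catalog` is a placeholder for your own Catalog ID:

```properties
# Bind the Presto service's metadata to a DLF catalog.
# Replace my_catalog with your own Catalog ID.
dlf.catalog.id=my_catalog
```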
Impala engine
No configuration changes are needed for Impala. It uses the Hive configuration automatically.