What is a data catalog? - Data Lake Formation - Alibaba Cloud Documentation Center

A data catalog is the top-level metadata entity in Data Lake Formation (DLF) and can contain multiple databases. You can create, view, edit, and delete data catalogs, and bind them to compute engines for metadata isolation.

Use cases

Data catalogs are primarily used for metadata isolation. For example, multiple E-MapReduce (EMR) clusters can each be bound to a different data catalog, so that each cluster sees only the metadata in its own catalog and not the metadata of other clusters.
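
The isolation idea can be pictured with a toy model. This is only an illustration: the catalog, database, and cluster names are hypothetical, and the dictionaries stand in for DLF metadata rather than calling any real API.

```python
# Toy model of metadata isolation (all names are hypothetical).
# Each data catalog owns its own set of databases.
catalogs = {
    "catalog_a": {"sales_db", "orders_db"},
    "catalog_b": {"hr_db"},
}

# Each EMR cluster is bound to exactly one data catalog.
bindings = {
    "emr_cluster_1": "catalog_a",
    "emr_cluster_2": "catalog_b",
}


def visible_databases(cluster: str) -> set:
    """Return the databases a cluster can see through its bound catalog."""
    return set(catalogs[bindings[cluster]])


print(sorted(visible_databases("emr_cluster_1")))
print(sorted(visible_databases("emr_cluster_2")))
```

In this model, a cluster never sees databases that belong to a catalog it is not bound to, which is the isolation behavior described above.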

Basic operations

Create a data catalog

Log on to the Data Lake Formation console.
In the left-side navigation pane, choose Metadata > Metadata.
Click the [Catalogs] tab, and then click [New Catalog].
Configure the following parameters and click [OK].
- [Catalog ID]: Required. A unique identifier of the data catalog.
- [Description]: Optional. A description of the data catalog.
- [Location]: Optional. The default storage path. Only Object Storage Service (OSS) paths are supported.
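
The parameter rules above can be sketched as a small pre-check to run before filling in the console form. This is an illustrative helper only: `CatalogSpec` and `validate_catalog_spec` are hypothetical names and not part of any DLF SDK, and treating an OSS path as a string that starts with `oss://` is an assumption. Uniqueness of the Catalog ID cannot be checked locally and is left to the console.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CatalogSpec:
    """Hypothetical container that mirrors the console form fields."""
    catalog_id: str
    description: Optional[str] = None
    location: Optional[str] = None


def validate_catalog_spec(spec: CatalogSpec) -> List[str]:
    """Return a list of problems; an empty list means the spec looks valid."""
    problems = []
    # Catalog ID is the only required field.
    if not spec.catalog_id or not spec.catalog_id.strip():
        problems.append("Catalog ID is required")
    # Location is optional, but when present it must be an OSS path.
    # Assumption: an OSS path is written as oss://<bucket>/<path>.
    if spec.location is not None and not spec.location.startswith("oss://"):
        problems.append("Location must be an OSS path")
    return problems


print(validate_catalog_spec(CatalogSpec("sales_catalog", location="oss://my-bucket/warehouse")))
print(validate_catalog_spec(CatalogSpec("", location="hdfs://namenode/warehouse")))
```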

View data catalogs

In the left-side navigation pane, choose Metadata > Metadata.
Click the [Catalogs] tab to view the list of data catalogs.

Edit a data catalog

In the left-side navigation pane, choose Metadata > Metadata.
Click the [Catalogs] tab.
In the data catalog list, find the catalog to edit and click [Modify] in the Actions column.
Modify the parameters as needed and click [OK].
- [Description]: Optional. A description of the data catalog.
- [Location]: Optional. The default storage path. Only OSS paths are supported.

Delete a data catalog

Warning

This action is irreversible. A deleted data catalog and its data cannot be recovered. Proceed with caution.

In the left-side navigation pane, choose Metadata > Metadata.
Click the [Catalogs] tab.
In the data catalog list, find the catalog to delete and click [Delete] in the Actions column.
In the dialog box, click [Delete].

Compute engine integration

Change an E-MapReduce cluster's data catalog

Important

After you change the DLF Catalog ID that is bound to an EMR cluster, the cluster points to the new data catalog. Operations on databases and tables in the original data catalog no longer work, and running jobs fail. Make sure that you fully understand the impact before you proceed.

Hive engine integration

In the Hive service's [core-site.xml] file, add the following configuration item. For more information, see Add configuration items.

Parameter	Value
dlf.catalog.id	The ID of the DLF data catalog.
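
For orientation, the sketch below renders this parameter in the standard Hadoop-style XML property layout (`<property>` with `<name>` and `<value>` children). It is a minimal illustration, not the console's own mechanism: `render_property` is a hypothetical helper, `my_catalog_id` is a placeholder value, and how the EMR console writes the item into the file is not described in this document.

```python
import xml.etree.ElementTree as ET


def render_property(name: str, value: str) -> str:
    """Build a Hadoop-style <property> element and return it as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# "my_catalog_id" is a placeholder; use the ID of your own DLF data catalog.
snippet = render_property("dlf.catalog.id", "my_catalog_id")
print(snippet)
```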

Apply the configuration.
1. Click [Save]. After you save the configuration, click [Deploy Client Configuration].
2. In the dialog box, enter an [Execution Reason] and click [OK].
Restart the Hive service.
1. On the configuration page of the Hive service, choose More > Restart.
2. In the dialog box, enter an [Execution Reason] and click [OK].
  
  After the service restarts, the Hive service status changes to Healthy, which confirms that the Catalog ID was successfully changed.

Spark engine integration

Modify the [Spark] service's [hive-site.xml] file. For detailed steps, see the Hive engine integration section.

Note

For EMR 5.6.0 and earlier 5.x versions, and EMR 3.40.0 and earlier 3.x versions, you only need to modify the Hive configuration because Spark automatically uses it.

Presto engine integration

Modify the Presto service's [hive.properties] file. For detailed steps, see the Hive engine integration section.

Note

This feature is supported only in EMR 5.8.0 and later 5.x versions, and EMR 3.42.0 and later 3.x versions.

Impala engine integration

Note

You only need to modify the Hive configuration because Impala automatically uses it.
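
The per-engine rules in this section can be summarized in a small lookup. This is a sketch under stated assumptions, not an official tool: `config_target` is a hypothetical helper, version strings are assumed to look like `5.8.0`, and the version notes are read as applying separately to the 3.x and 5.x release lines. The file names are the ones given in the sections above.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Thresholds taken from the notes above, keyed by EMR major version.
# Assumption: the 3.x and 5.x release lines are compared separately.
SPARK_USES_HIVE_CONFIG_UP_TO = {5: (5, 6, 0), 3: (3, 40, 0)}
PRESTO_SUPPORTED_FROM = {5: (5, 8, 0), 3: (3, 42, 0)}


def parse_version(text: str) -> Version:
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def config_target(engine: str, emr_version: str) -> Optional[str]:
    """Return '<service>/<file>' to modify, or None if the engine is unsupported."""
    version = parse_version(emr_version)
    major = version[0]
    if major not in SPARK_USES_HIVE_CONFIG_UP_TO:
        raise ValueError(f"EMR major version {major} is not covered by this sketch")
    hive_target = "Hive/core-site.xml"
    if engine in ("hive", "impala"):
        # Impala automatically uses the Hive configuration.
        return hive_target
    if engine == "spark":
        # Older releases: Spark automatically uses the Hive configuration.
        if version <= SPARK_USES_HIVE_CONFIG_UP_TO[major]:
            return hive_target
        return "Spark/hive-site.xml"
    if engine == "presto":
        if version >= PRESTO_SUPPORTED_FROM[major]:
            return "Presto/hive.properties"
        return None
    raise ValueError(f"unknown engine: {engine}")


for engine, version in [("spark", "5.6.0"), ("spark", "5.8.0"), ("presto", "3.41.0")]:
    print(engine, version, config_target(engine, version))
```

A `None` result means the sketch found no supported way to change the catalog for that engine on that EMR version.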