
Data Lake Formation: Migrate EMR metadata to DLF

Last Updated: Feb 01, 2024

The EMR+DLF data lake solution provides unified metadata management and permission management of data lakes for enterprises, and supports multi-source data ingestion and end-to-end data exploration. This solution supports the migration of existing E-MapReduce (EMR) cluster metadata in self-managed ApsaraDB RDS or built-in MySQL databases to DLF.

This topic describes how to migrate metadata stored by the Hive metastore service in MySQL databases or ApsaraDB RDS databases to DLF. This topic also describes how to configure and use DLF for unified storage of metadata of an EMR cluster.

Scenarios

  1. You want to migrate metadata from third-party big data clusters to Alibaba Cloud EMR.

  2. You want to migrate the data and metadata from an EMR cluster that uses a MySQL database for metadata storage to an EMR cluster that uses DLF for metadata storage.

  3. You want to change the metadata storage for an EMR cluster from a MySQL database to DLF.

    Note: The EMR cluster must run EMR V3.33 or a later V3.X version, EMR V4.6 or a later V4.X version, or EMR V5.1 or a later V5.X version. To migrate metadata from EMR clusters of earlier versions to DLF, join the DingTalk group 33719678.

Migrate metadata

Preparations

Before you migrate metadata, you must check the remote access permissions on the metadatabase.

  • Log on to your ApsaraDB RDS metadatabase or MySQL metadatabase and execute the following statement to grant remote access permissions (the root account and a database named hivemeta are used in this example):

GRANT ALL PRIVILEGES ON hivemeta.* TO 'root'@'%' IDENTIFIED BY 'xxxx' WITH GRANT OPTION;
FLUSH PRIVILEGES;
  • For an ApsaraDB RDS metadatabase, you can also view and modify the access permissions in the ApsaraDB RDS console.

image
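Before you create the migration task, it can help to confirm that the metadatabase port is reachable from a remote host. The following is a minimal sketch that uses only the Python standard library. The function name can_reach and the sample address are illustrative assumptions, and the check covers network reachability only, not the MySQL privileges granted above.

```python
import socket


def can_reach(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical address, replace with your metadatabase host):
# can_reach("192.168.0.10", 3306)
```

A result of False usually points to a security group, whitelist, or bind-address restriction rather than to the GRANT statement.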

Start migration

DLF provides a visualized metadata migration feature to quickly migrate the metadata in a Hive metastore to DLF.

Create a migration task

Log on to the DLF console, switch to the region of the EMR cluster, choose Metadata > Migrate Metadata in the left-side navigation tree, and click Create Migration Task. See the following figure.

image.png

Configure the source database

image

  1. Database Type: Select MySQL.

  2. MySQL Type: Select an option based on the Hive metadata type.

    1. For a built-in MySQL database of the cluster, select Other MySQL Databases. In this case, JDBC URL, Username, and Password are required. We recommend that you enter an internal IP address for JDBC URL and select Alibaba Cloud VPC for Network Type. If you want to select Internet for Network Type, enter a public IP address for JDBC URL.

    2. For an independent ApsaraDB RDS metadatabase, select Alibaba Cloud RDS. In this case, RDS Instance, Database Name, Username, and Password are required. ApsaraDB RDS metadatabases can be accessed only by using Alibaba Cloud VPCs.

  3. Network Type: You can select Alibaba Cloud VPC or Internet. Select an option based on the setting of MySQL Type.

    1. Alibaba Cloud VPC: specifies the VPC of the EMR cluster or ApsaraDB RDS instance.

    2. Internet: If you select Internet, you must add a rule in the EMR console that opens port 3306, the default MySQL port of the EMR cluster, to the elastic IP address (EIP) of DLF in the region. The EIP 121.41.166.235 in the China (Hangzhou) region is used in the following example.

image

The following table lists the DLF EIPs in different regions.

Region                   EIP
China (Hangzhou)         121.41.166.235
China (Shanghai)         47.103.63.0
China (Beijing)          47.94.234.203
China (Shenzhen)         39.108.114.206
Singapore                161.117.233.48
Germany (Frankfurt)      8.211.38.47
China (Zhangjiakou)      8.142.121.7
China (Hong Kong)        8.218.148.213
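If you script the rule for port 3306, the EIP table above can be kept as a small lookup. The following sketch is illustrative only: the names DLF_EIPS and allow_rule_source are hypothetical, and the /32 suffix assumes that the rule accepts a single address in CIDR notation.

```python
# DLF EIPs by region, copied from the table above.
DLF_EIPS = {
    "China (Hangzhou)": "121.41.166.235",
    "China (Shanghai)": "47.103.63.0",
    "China (Beijing)": "47.94.234.203",
    "China (Shenzhen)": "39.108.114.206",
    "Singapore": "161.117.233.48",
    "Germany (Frankfurt)": "8.211.38.47",
    "China (Zhangjiakou)": "8.142.121.7",
    "China (Hong Kong)": "8.218.148.213",
}


def allow_rule_source(region: str) -> str:
    """Return the DLF EIP of a region as a single-address CIDR block.

    Raises KeyError if the region is not listed in the table.
    """
    return f"{DLF_EIPS[region]}/32"
```

For example, allow_rule_source("China (Hangzhou)") yields the source address to allow on port 3306 for a cluster in that region.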

Configure the migration task

image

  1. Task Name: Enter a name for the metadata migration task.

  2. Task Description: This parameter is optional. You can enter a description for the task.

  3. Conflict Resolution Strategy:

    1. Update Original Metadata: updates the original metadata in DLF. This option is recommended.

    2. Delete Original Metadata and Create Metadata: deletes the original metadata from DLF and then creates new metadata.

  4. Log Storage Path: The system records each metadata object, migration status, and error logs (if any) in the specified OSS path.

  5. Object to Synchronize: The objects that can be synchronized include databases, functions, tables, and partitions. In most cases, click Select All.

  6. Location Replacement: You must set this parameter if you want to replace the location of a table or database during migration.
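This topic does not describe how DLF applies Location Replacement internally. As a rough illustration of the idea, the following sketch assumes a simple prefix match: a location that starts with the old prefix gets the new prefix, and any other location is left unchanged. The function name and the sample HDFS and OSS paths are hypothetical.

```python
def replace_location(location: str, old_prefix: str, new_prefix: str) -> str:
    """Swap old_prefix for new_prefix if the location starts with old_prefix."""
    if location.startswith(old_prefix):
        return new_prefix + location[len(old_prefix):]
    # Locations outside the old prefix are returned as-is.
    return location


# Hypothetical example: move a warehouse from HDFS to OSS.
old = "hdfs://emr-header-1:9000/user/hive/warehouse"
new = "oss://my-bucket/warehouse"
table_location = replace_location(old + "/sales.db/orders", old, new)
# table_location is now "oss://my-bucket/warehouse/sales.db/orders"
```

Before you run the task, check a few migrated table locations in the log storage path to confirm that the replacement behaves as you expect.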

Run the migration task

The migration task that you created is displayed on the Migration Task tab. Click Run in the Actions column to run the task. See the following figure.

image

View running records and logs

Click the Execution History tab to view task running details.

image

Click View Logs in the Actions column of the task to view running logs. See the following figure.

Verify the metadata migration

In the left-side navigation tree of the DLF console, choose Metadata > Metadata. The metadatabase that you migrated is displayed on the Metadata page. See the following figure.

image

Use DLF for unified metadata storage for an EMR cluster

This section describes how to configure and use DLF for unified metadata storage for an EMR cluster.

Change metadata storage for compute engines

Hive

In the EMR console, add the following configurations to the hive-site.xml file, enable Save and Deliver Configuration, click Save, and then restart the Hive service.

<!-- Configure the URL of the DLF metadata service. Replace regionId with the region ID of the desired cluster, such as cn-hangzhou. -->
dlf.catalog.endpoint=dlf-vpc.{regionId}.aliyuncs.com
<!-- Note: Check the configuration after you copy and paste it. The configuration must not contain spaces. -->
hive.imetastoreclient.factory.class=com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory
dlf.catalog.akMode=EMR_AUTO
dlf.catalog.proxyMode=DLF_ONLY

<!-- Configuration for Hive 3 -->
hive.notification.event.poll.interval=0s
<!-- Configuration for EMR V3.X versions earlier than V3.33 and EMR V4.X versions earlier than V4.6.0 -->
dlf.catalog.sts.isNewMode=false

image
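Because a stray space in a key or value can break the Hive configuration, it may be convenient to generate the settings and check them before you paste them into the EMR console. The following sketch only assembles the keys and values shown above. The function name hive_dlf_settings and its parameters are hypothetical.

```python
def hive_dlf_settings(region_id: str, hive3: bool = False,
                      legacy_sts: bool = False) -> dict:
    """Build the hive-site.xml settings for DLF and reject any whitespace."""
    settings = {
        "dlf.catalog.endpoint": f"dlf-vpc.{region_id}.aliyuncs.com",
        "hive.imetastoreclient.factory.class":
            "com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory",
        "dlf.catalog.akMode": "EMR_AUTO",
        "dlf.catalog.proxyMode": "DLF_ONLY",
    }
    if hive3:
        # Only for Hive 3.
        settings["hive.notification.event.poll.interval"] = "0s"
    if legacy_sts:
        # Only for the earlier EMR versions named in the configuration above.
        settings["dlf.catalog.sts.isNewMode"] = "false"
    for key, value in settings.items():
        if any(ch.isspace() for ch in key + value):
            raise ValueError(f"whitespace found in setting {key!r}")
    return settings


# Print key=value lines for a cluster in cn-hangzhou.
for key, value in hive_dlf_settings("cn-hangzhou").items():
    print(f"{key}={value}")
```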

Presto

In the EMR console, add the following configurations to the hive.properties file, enable Save and Deliver Configuration, click Save, and then restart the Presto service.

hive.metastore=dlf
<!-- Configure the URL of the DLF metadata service. Replace regionId with the region ID of the desired cluster, such as cn-hangzhou. -->
dlf.catalog.endpoint=dlf-vpc.{regionId}.aliyuncs.com
dlf.catalog.akMode=EMR_AUTO
dlf.catalog.proxyMode=DLF_ONLY
 
<!-- See the value of hive.metastore.warehouse.dir configured in the hive-site.xml file of Hive. -->
dlf.catalog.default-warehouse-dir= <!-- Set it to the same value as hive.metastore.warehouse.dir. -->

<!-- Configuration for EMR V3.X versions earlier than V3.33 and EMR V4.X versions earlier than V4.6.0 -->
dlf.catalog.sts.isNewMode=false
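To fill in dlf.catalog.default-warehouse-dir, you need the value of hive.metastore.warehouse.dir. The following sketch assumes that you have a copy of hive-site.xml in the standard Hadoop configuration XML layout (property elements with name and value children). The function name and the sample value are illustrative, and the file path on your cluster is not covered in this topic.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def warehouse_dir(root: ET.Element) -> Optional[str]:
    """Return the value of hive.metastore.warehouse.dir, or None if absent."""
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name == "hive.metastore.warehouse.dir":
            return prop.findtext("value", default="").strip()
    return None


# Hypothetical sample; for a real file use ET.parse(path).getroot().
sample = ET.fromstring(
    "<configuration>"
    "<property>"
    "<name>hive.metastore.warehouse.dir</name>"
    "<value>/user/hive/warehouse</value>"
    "</property>"
    "</configuration>"
)
presto_line = f"dlf.catalog.default-warehouse-dir={warehouse_dir(sample)}"
```

The resulting presto_line is the entry to add to the hive.properties file of Presto.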

Spark

In the EMR console, click Deploy Client Configuration and restart Spark.

image

Impala

In the EMR console, click Deploy Client Configuration and restart Impala.

Verify the change of metadata storage for the compute engines

Hive is used in the following example. You can use a similar method for other engines.

1. Log on to the cluster and run the hive command.

2. Create a database by executing the following statement: create database dlf_test_db;

3. Log on to the DLF console and check whether the database exists.

4. Delete the database by executing the following statement: drop database dlf_test_db;

image

FAQ

  1. Why should I run a metadata migration task multiple times?

Each run of a metadata migration task migrates the metadata that is stored in the ApsaraDB RDS or MySQL database at that time. You can run the task multiple times to ensure eventual consistency between the metadata in the source database and the metadata in DLF.

References

For more details about the best practices, see Best practices for migrating EMR metadata to DLF.