Export DLF Metadata to Hive Metastore via Spark for Seamless Migration - Data Lake Formation

Data Lake Formation (DLF) provides a metadata export tool that copies DLF metadata to a Hive metastore. The tool runs as a Spark job on an E-MapReduce (EMR) cluster, uses DLF as the baseline, and syncs metadata to the target Hive metastore.

How it works: The tool compares DLF metadata against existing Hive metastore metadata, then writes DLF records to the Hive metastore (ApsaraDB RDS for MySQL). It does not delete DLF-side data. After export, both DLF and the Hive metastore contain the same metadata.

Limitations:

Cross-cluster metadata export is not supported when metadata locations are in Apsara File Storage for HDFS (HDFS). If metadata is stored in HDFS, the HDFS namespace for all database and table locations must match the namespace of the EMR cluster you are exporting to.
Metadata stored in Object Storage Service (OSS) does not have this restriction.

Prerequisites

Before you begin, make sure you have:

An EMR cluster with the Metastore service configured. The ApsaraDB RDS database used by the export tool must be the metadatabase configured for that Metastore service.
An ApsaraDB RDS database with metadata tables initialized. For initialization steps, see Configure a self-managed ApsaraDB RDS for MySQL database.
Read and write permissions on all DLF metadatabases and tables, including ListDatabase and ListTables. Configure these permissions on the Data Permissions page in the DLF console.

Step 1: Prepare a configuration file

Create a YAML configuration file and upload it to an OSS path.

The following is a complete example:

!!com.aliyun.dlf.migrator.app.config.MigratorConfig
clientInfo:
  accessKeyId: <Your AccessKey ID.>
  accessKeySecret: <Your AccessKey secret.>
  endPoint: dlf-vpc.cn-hangzhou.aliyuncs.com
  regionId: cn-hangzhou
  catalogId: <Your catalog ID. The default value is the user ID of your Alibaba Cloud account.>
mysql:
  connectionUri: jdbc:mysql://emr-header-1:3306/hivemeta
  driver: com.mysql.cj.jdbc.Driver
  userName: root
  password: xxxx
runOptions:
  batchSize: 100
  lowerCaseTableNames: false
  schema: hivemeta
  records: oss://xxxx/migrator/validate/log/
  objectTypes:
    - database
    - table
    - partition
    - function
  operations:
    - validate
  fixIfInConsistence: true
  fixMode: to_hive
  validateDatabases: [db1,db2]
  excludeTables: [aa,bb]
  excludeTablePrefixes: [xx,yy]
  ignoreValidateCreateTime: true
  skipFixTime: 1
  ignoreDropOperation: false

Configuration reference

clientInfo

Parameter	Required	Description
`accessKeyId`	Yes	The AccessKey ID of your Alibaba Cloud account.
`accessKeySecret`	Yes	The AccessKey secret of your Alibaba Cloud account.
`endPoint`	Yes	The DLF endpoint. Format: `dlf-vpc.<region-id>.aliyuncs.com`.
`regionId`	Yes	The region ID, for example, `cn-hangzhou`.
`catalogId`	Yes	The DLF data catalog ID. Defaults to the user ID of your Alibaba Cloud account.

mysql

Parameter	Required	Description
`connectionUri`	Yes	The JDBC URL for connecting to the MySQL metadatabase, for example, `jdbc:mysql://emr-header-1:3306/hivemeta`.
`driver`	Yes	The MySQL driver class. The default value `com.mysql.cj.jdbc.Driver` works in most cases.
`userName`	Yes	The username for accessing the metadatabase.
`password`	Yes	The password for accessing the metadatabase.

runOptions

Parameter	Required	Default	Description
`schema`	Yes	—	The name of the Hive metadatabase. Must match the database name in `connectionUri`. If you change one, update the other.
`batchSize`	Yes	—	The number of records per DLF SDK call. Cannot exceed 500. Use 100 for most cases — larger values may cause timeouts, and smaller values reduce throughput.
`lowerCaseTableNames`	Yes	—	Whether table names in the ApsaraDB RDS metadatabase are lowercase. Set to `false` if table names are uppercase.
`records`	Yes	—	The OSS path where the tool stores run logs, including processing records and errors.
`objectTypes`	Yes	—	The types of objects to export. Valid values: `database`, `table`, `partition`, `function`.
`operations`	Yes	—	Fixed value: `validate`.
`fixIfInConsistence`	Yes	—	Fixed value: `true`.
`fixMode`	Yes	—	Fixed value: `to_hive`. The tool compares DLF metadata against Hive metadata using DLF as the baseline, then syncs DLF metadata to the Hive metastore.
`validateDatabases`	No	—	The names of databases to export. Only the listed databases are processed. Omit to export all databases.
`excludeTables`	No	—	The names of tables to exclude from the export.
`excludeTablePrefixes`	No	—	Table name prefixes to exclude. Tables whose names start with any listed prefix are skipped.
`compareTotalNumber`	No	`false`	Whether to return a summary comparing the total count of DLF metadata and ApsaraDB RDS metadata.
`ignoreValidateCreateTime`	No	—	Whether to ignore table creation time during comparison. Set to `true` if table creation timestamps differ between DLF and Hive but the tables are otherwise identical.
`skipFixTime`	No	`240`	Only metadata created before this many minutes ago is compared. Metadata created within this window is skipped. Reduce this value (for example, to `1`) if you want to include recently created metadata. Unit: minutes.
`ignoreDropOperation`	No	—	Set to `true` to retain metadata in the Hive metastore after export instead of deleting it. Use this when you want to keep Hive metastore records even if the corresponding DLF metadata no longer exists.
`locationMappings.source`	No	—	The source path to replace during export. End the path with a forward slash (`/`). Example: `hdfs://emr-header-1:9000/user/hive/`.
`locationMappings.target`	No	—	The destination path to replace with. Example: `oss://hive/warehouse/`.
`hiveConfPath`	No	`/etc/ecm/hive-conf/hive-site.xml`	The path to the Hive configuration file in the cluster.
`kerberosInfo.principal`	No	—	The Kerberos principal for clusters with Kerberos enabled. Example: `xxx/xxx@xxx.COM`.
`kerberosInfo.keytab`	No	—	The path to the Kerberos keytab file. Example: `/xxx/xxx.keytab`.

Location mapping example:

runOptions:
  locationMappings:
    source: hdfs://emr-header-1:9000/user/hive/
    target: oss://hive/warehouse/

Kerberos example:

runOptions:
  kerberosInfo:
    principal: xxx/xxx@xxx.COM
    keytab: /xxx/xxx.keytab

Step 2: Set up the Spark job

Log on to the header node of the EMR cluster.

Download the export tool JAR package.

wget https://dlf-lib.oss-cn-hangzhou.aliyuncs.com/migrator_jar/application-1.1.jar

Create the Spark job script.
```
vim /root/migrator_validate.sh
```

Add the following content to the script. Replace the OSS path at the end with the path where you uploaded the configuration file in Step 1.

#!/bin/bash
source /etc/profile
spark-submit --master yarn --deploy-mode client --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 4 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class com.aliyun.dlf.migrator.app.MigratorApplication application-1.1.jar oss://xxxx/migrator_config_validate.yml

Make the script executable.
```
chmod +x /root/migrator_validate.sh
```

Step 3: Run the job

Run immediately:

/root/migrator_validate.sh

Schedule with cron:

crontab -e

Add the following line to run the job daily at 22:00:

0 22 * * * nohup sh /root/migrator_validate.sh > /root/validate.txt 2>&1 &