Data Lake Formation (DLF) provides a metadata export tool that copies DLF metadata to a Hive metastore. The tool runs as a Spark job on an E-MapReduce (EMR) cluster, uses DLF as the baseline, and syncs metadata to the target Hive metastore.
How it works: The tool compares DLF metadata against existing Hive metastore metadata, then writes DLF records to the Hive metastore (ApsaraDB RDS for MySQL). It does not delete DLF-side data. After export, both DLF and the Hive metastore contain the same metadata.
Limitations:
Cross-cluster metadata export is not supported when metadata locations are in Apsara File Storage for HDFS (HDFS). If metadata is stored in HDFS, the HDFS namespace for all database and table locations must match the namespace of the EMR cluster you are exporting to.
Metadata stored in Object Storage Service (OSS) does not have this restriction.
Prerequisites
Before you begin, make sure you have:
An EMR cluster with the Metastore service configured. The ApsaraDB RDS database used by the export tool must be the metadatabase configured for that Metastore service.
An ApsaraDB RDS database with metadata tables initialized. For initialization steps, see Configure a self-managed ApsaraDB RDS for MySQL database.
Read and write permissions on all DLF metadatabases and tables, including
ListDatabaseandListTables. Configure these permissions on the Data Permissions page in the DLF console.
Step 1: Prepare a configuration file
Create a YAML configuration file and upload it to an OSS path.
The following is a complete example:
!!com.aliyun.dlf.migrator.app.config.MigratorConfig
clientInfo:
accessKeyId: <Your AccessKey ID.>
accessKeySecret: <Your AccessKey secret.>
endPoint: dlf-vpc.cn-hangzhou.aliyuncs.com
regionId: cn-hangzhou
catalogId: <Your catalog ID. The default value is the user ID of your Alibaba Cloud account.>
mysql:
connectionUri: jdbc:mysql://emr-header-1:3306/hivemeta
driver: com.mysql.cj.jdbc.Driver
userName: root
password: xxxx
runOptions:
batchSize: 100
lowerCaseTableNames: false
schema: hivemeta
records: oss://xxxx/migrator/validate/log/
objectTypes:
- database
- table
- partition
- function
operations:
- validate
fixIfInConsistence: true
fixMode: to_hive
validateDatabases: [db1,db2]
excludeTables: [aa,bb]
excludeTablePrefixes: [xx,yy]
ignoreValidateCreateTime: true
skipFixTime: 1
ignoreDropOperation: falseConfiguration reference
clientInfo
| Parameter | Required | Description |
|---|---|---|
accessKeyId | Yes | The AccessKey ID of your Alibaba Cloud account. |
accessKeySecret | Yes | The AccessKey secret of your Alibaba Cloud account. |
endPoint | Yes | The DLF endpoint. Format: dlf-vpc.<region-id>.aliyuncs.com. |
regionId | Yes | The region ID, for example, cn-hangzhou. |
catalogId | Yes | The DLF data catalog ID. Defaults to the user ID of your Alibaba Cloud account. |
mysql
| Parameter | Required | Description |
|---|---|---|
connectionUri | Yes | The JDBC URL for connecting to the MySQL metadatabase, for example, jdbc:mysql://emr-header-1:3306/hivemeta. |
driver | Yes | The MySQL driver class. The default value com.mysql.cj.jdbc.Driver works in most cases. |
userName | Yes | The username for accessing the metadatabase. |
password | Yes | The password for accessing the metadatabase. |
runOptions
| Parameter | Required | Default | Description |
|---|---|---|---|
schema | Yes | — | The name of the Hive metadatabase. Must match the database name in connectionUri. If you change one, update the other. |
batchSize | Yes | — | The number of records per DLF SDK call. Cannot exceed 500. Use 100 for most cases — larger values may cause timeouts, and smaller values reduce throughput. |
lowerCaseTableNames | Yes | — | Whether table names in the ApsaraDB RDS metadatabase are lowercase. Set to false if table names are uppercase. |
records | Yes | — | The OSS path where the tool stores run logs, including processing records and errors. |
objectTypes | Yes | — | The types of objects to export. Valid values: database, table, partition, function. |
operations | Yes | — | Fixed value: validate. |
fixIfInConsistence | Yes | — | Fixed value: true. |
fixMode | Yes | — | Fixed value: to_hive. The tool compares DLF metadata against Hive metadata using DLF as the baseline, then syncs DLF metadata to the Hive metastore. |
validateDatabases | No | — | The names of databases to export. Only the listed databases are processed. Omit to export all databases. |
excludeTables | No | — | The names of tables to exclude from the export. |
excludeTablePrefixes | No | — | Table name prefixes to exclude. Tables whose names start with any listed prefix are skipped. |
compareTotalNumber | No | false | Whether to return a summary comparing the total count of DLF metadata and ApsaraDB RDS metadata. |
ignoreValidateCreateTime | No | — | Whether to ignore table creation time during comparison. Set to true if table creation timestamps differ between DLF and Hive but the tables are otherwise identical. |
skipFixTime | No | 240 | Only metadata created before this many minutes ago is compared. Metadata created within this window is skipped. Reduce this value (for example, to 1) if you want to include recently created metadata. Unit: minutes. |
ignoreDropOperation | No | — | Set to true to retain metadata in the Hive metastore after export instead of deleting it. Use this when you want to keep Hive metastore records even if the corresponding DLF metadata no longer exists. |
locationMappings.source | No | — | The source path to replace during export. End the path with a forward slash (/). Example: hdfs://emr-header-1:9000/user/hive/. |
locationMappings.target | No | — | The destination path to replace with. Example: oss://hive/warehouse/. |
hiveConfPath | No | /etc/ecm/hive-conf/hive-site.xml | The path to the Hive configuration file in the cluster. |
kerberosInfo.principal | No | — | The Kerberos principal for clusters with Kerberos enabled. Example: xxx/xxx@xxx.COM. |
kerberosInfo.keytab | No | — | The path to the Kerberos keytab file. Example: /xxx/xxx.keytab. |
Location mapping example:
runOptions:
locationMappings:
source: hdfs://emr-header-1:9000/user/hive/
target: oss://hive/warehouse/Kerberos example:
runOptions:
kerberosInfo:
principal: xxx/xxx@xxx.COM
keytab: /xxx/xxx.keytabStep 2: Set up the Spark job
Log on to the header node of the EMR cluster.
Download the export tool JAR package.
wget https://dlf-lib.oss-cn-hangzhou.aliyuncs.com/migrator_jar/application-1.1.jarCreate the Spark job script.
vim /root/migrator_validate.shAdd the following content to the script. Replace the OSS path at the end with the path where you uploaded the configuration file in Step 1.
#!/bin/bash source /etc/profile spark-submit --master yarn --deploy-mode client --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 4 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class com.aliyun.dlf.migrator.app.MigratorApplication application-1.1.jar oss://xxxx/migrator_config_validate.ymlMake the script executable.
chmod +x /root/migrator_validate.sh
Step 3: Run the job
Run immediately:
/root/migrator_validate.shSchedule with cron:
crontab -eAdd the following line to run the job daily at 22:00:
0 22 * * * nohup sh /root/migrator_validate.sh > /root/validate.txt 2>&1 &