Data Lake Formation: Export metadata

Last Updated: Mar 26, 2026

Data Lake Formation (DLF) provides a metadata export tool that copies DLF metadata to a Hive metastore. The tool runs as a Spark job on an E-MapReduce (EMR) cluster, uses DLF as the baseline, and syncs metadata to the target Hive metastore.

How it works: The tool compares DLF metadata against existing Hive metastore metadata, then writes DLF records to the Hive metastore (ApsaraDB RDS for MySQL). It does not delete DLF-side data. After export, both DLF and the Hive metastore contain the same metadata.
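
The baseline comparison can be pictured as a set difference over object names. The following is a conceptual sketch only, not the tool's implementation: `plan_sync` and the one-name-per-line list files are hypothetical, and the real tool also compares the attributes of objects that exist on both sides. Whether Hive-only records are dropped depends on the ignoreDropOperation setting described in the configuration reference.

```shell
#!/bin/sh
# Conceptual sketch: plan a DLF-to-Hive sync from two lists of object names.
# plan_sync and the list-file format are hypothetical, for illustration only.

# Usage: plan_sync <dlf_list> <hive_list>
# Each file holds one object name per line.
plan_sync() {
    dlf_sorted=$(mktemp)
    hive_sorted=$(mktemp)
    sort -u "$1" > "$dlf_sorted"
    sort -u "$2" > "$hive_sorted"

    # Names only in DLF (the baseline) must be written to the Hive metastore.
    comm -23 "$dlf_sorted" "$hive_sorted" | sed 's/^/create /'
    # Names only in Hive have no DLF counterpart; they are candidates for
    # removal when ignoreDropOperation is false.
    comm -13 "$dlf_sorted" "$hive_sorted" | sed 's/^/drop /'

    rm -f "$dlf_sorted" "$hive_sorted"
}
```

Nothing in this plan ever modifies the DLF side, which mirrors the one-way direction of the export.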

Limitations:

  • Cross-cluster metadata export is not supported when metadata locations are in Apsara File Storage for HDFS (HDFS). If metadata is stored in HDFS, the HDFS namespace for all database and table locations must match the namespace of the EMR cluster you are exporting to.

  • Metadata stored in Object Storage Service (OSS) does not have this restriction.

Prerequisites

Before you begin, make sure you have:

  • An EMR cluster with the Metastore service configured. The ApsaraDB RDS database used by the export tool must be the metadatabase configured for that Metastore service.

  • An ApsaraDB RDS database with metadata tables initialized. For initialization steps, see Configure a self-managed ApsaraDB RDS for MySQL database.

  • Read and write permissions on all DLF metadatabases and tables, including ListDatabase and ListTables. Configure these permissions on the Data Permissions page in the DLF console.

Step 1: Prepare a configuration file

Create a YAML configuration file and upload it to an OSS path.

The following is a complete example:

!!com.aliyun.dlf.migrator.app.config.MigratorConfig
clientInfo:
  accessKeyId: <Your AccessKey ID.>
  accessKeySecret: <Your AccessKey secret.>
  endPoint: dlf-vpc.cn-hangzhou.aliyuncs.com
  regionId: cn-hangzhou
  catalogId: <Your catalog ID. The default value is the user ID of your Alibaba Cloud account.>
mysql:
  connectionUri: jdbc:mysql://emr-header-1:3306/hivemeta
  driver: com.mysql.cj.jdbc.Driver
  userName: root
  password: xxxx
runOptions:
  batchSize: 100
  lowerCaseTableNames: false
  schema: hivemeta
  records: oss://xxxx/migrator/validate/log/
  objectTypes:
    - database
    - table
    - partition
    - function
  operations:
    - validate
  fixIfInConsistence: true
  fixMode: to_hive
  validateDatabases: [db1,db2]
  excludeTables: [aa,bb]
  excludeTablePrefixes: [xx,yy]
  ignoreValidateCreateTime: true
  skipFixTime: 1
  ignoreDropOperation: false
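
Before uploading the file, it can help to check two constraints from the configuration reference: schema must match the database name in connectionUri, and batchSize cannot exceed 500. The following is a minimal sketch, assuming the flat layout of the example above. `check_config` is a hypothetical helper that uses simple line matching rather than a real YAML parser.

```shell
#!/bin/sh
# Sketch of a pre-flight check for the export configuration file.
# check_config is a hypothetical helper; it matches lines with sed and
# assumes each key appears once, as in the example configuration.

# Usage: check_config <config.yml>
# Prints each problem found and returns non-zero if there is any.
check_config() {
    cfg=$1
    rc=0

    # The database name is the last path segment of the JDBC URL,
    # with any "?key=value" options removed.
    uri=$(sed -n 's/^ *connectionUri: *//p' "$cfg")
    db=${uri##*/}
    db=${db%%\?*}
    schema=$(sed -n 's/^ *schema: *//p' "$cfg")
    if [ "$db" != "$schema" ]; then
        echo "schema '$schema' does not match database '$db' in connectionUri"
        rc=1
    fi

    batch=$(sed -n 's/^ *batchSize: *//p' "$cfg")
    if [ "${batch:-0}" -gt 500 ]; then
        echo "batchSize $batch exceeds the maximum of 500"
        rc=1
    fi
    return $rc
}
```

For example, `check_config migrator_config_validate.yml && echo OK` prints OK only when both checks pass.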

Configuration reference

clientInfo

| Parameter | Required | Description |
| --- | --- | --- |
| accessKeyId | Yes | The AccessKey ID of your Alibaba Cloud account. |
| accessKeySecret | Yes | The AccessKey secret of your Alibaba Cloud account. |
| endPoint | Yes | The DLF endpoint. Format: dlf-vpc.<region-id>.aliyuncs.com. |
| regionId | Yes | The region ID, for example, cn-hangzhou. |
| catalogId | Yes | The DLF data catalog ID. Defaults to the user ID of your Alibaba Cloud account. |

mysql

| Parameter | Required | Description |
| --- | --- | --- |
| connectionUri | Yes | The JDBC URL for connecting to the MySQL metadatabase, for example, jdbc:mysql://emr-header-1:3306/hivemeta. |
| driver | Yes | The MySQL driver class. The default value com.mysql.cj.jdbc.Driver works in most cases. |
| userName | Yes | The username for accessing the metadatabase. |
| password | Yes | The password for accessing the metadatabase. |

runOptions

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| schema | Yes | | The name of the Hive metadatabase. Must match the database name in connectionUri. If you change one, update the other. |
| batchSize | Yes | | The number of records per DLF SDK call. Cannot exceed 500. Use 100 for most cases: larger values may cause timeouts, and smaller values reduce throughput. |
| lowerCaseTableNames | Yes | | Whether table names in the ApsaraDB RDS metadatabase are lowercase. Set to false if table names are uppercase. |
| records | Yes | | The OSS path where the tool stores run logs, including processing records and errors. |
| objectTypes | Yes | | The types of objects to export. Valid values: database, table, partition, function. |
| operations | Yes | | Fixed value: validate. |
| fixIfInConsistence | Yes | | Fixed value: true. |
| fixMode | Yes | | Fixed value: to_hive. The tool compares DLF metadata against Hive metadata using DLF as the baseline, then syncs DLF metadata to the Hive metastore. |
| validateDatabases | No | | The names of databases to export. Only the listed databases are processed. Omit to export all databases. |
| excludeTables | No | | The names of tables to exclude from the export. |
| excludeTablePrefixes | No | | Table name prefixes to exclude. Tables whose names start with any listed prefix are skipped. |
| compareTotalNumber | No | false | Whether to return a summary comparing the total count of DLF metadata and ApsaraDB RDS metadata. |
| ignoreValidateCreateTime | No | | Whether to ignore table creation time during comparison. Set to true if table creation timestamps differ between DLF and Hive but the tables are otherwise identical. |
| skipFixTime | No | 240 | Unit: minutes. Only metadata created more than this many minutes ago is compared. Metadata created within the last skipFixTime minutes is skipped. Reduce this value (for example, to 1) to include recently created metadata. |
| ignoreDropOperation | No | | Set to true to keep Hive metastore records whose corresponding DLF metadata no longer exists, instead of deleting them during export. |
| locationMappings.source | No | | The source path to replace during export. End the path with a forward slash (/). Example: hdfs://emr-header-1:9000/user/hive/. |
| locationMappings.target | No | | The destination path to replace with. Example: oss://hive/warehouse/. |
| hiveConfPath | No | /etc/ecm/hive-conf/hive-site.xml | The path to the Hive configuration file in the cluster. |
| kerberosInfo.principal | No | | The Kerberos principal for clusters with Kerberos enabled. Example: xxx/xxx@xxx.COM. |
| kerberosInfo.keytab | No | | The path to the Kerberos keytab file. Example: /xxx/xxx.keytab. |
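
The skipFixTime rule described above amounts to an age check on each object. The following sketch illustrates it. `is_compared` is a hypothetical helper, not part of the export tool, and treating an object whose age equals the threshold exactly as "compared" is an assumption.

```shell
#!/bin/sh
# Sketch of the skipFixTime window. is_compared is a hypothetical helper
# that mirrors the documented rule; it is not part of the export tool.

# Usage: is_compared <created_epoch_seconds> <now_epoch_seconds> <skipFixTime_minutes>
# Returns 0 if the object is old enough to be compared, 1 if it is skipped.
is_compared() {
    age_seconds=$(( $2 - $1 ))
    # Assumption: an age exactly equal to the threshold counts as compared.
    [ "$age_seconds" -ge $(( $3 * 60 )) ]
}
```

With the default of 240, a table created 10 minutes ago is skipped. With skipFixTime set to 1, the same table is compared.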

Location mapping example:

runOptions:
  locationMappings:
    source: hdfs://emr-header-1:9000/user/hive/
    target: oss://hive/warehouse/
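
The effect of a location mapping can be sketched as a prefix replacement on each database or table location. `map_location` is a hypothetical helper for illustration, and the tool's actual matching rules may differ.

```shell
#!/bin/sh
# Sketch of prefix-based location rewriting. map_location is a hypothetical
# helper; the export tool's actual matching rules may differ.

# Usage: map_location <location> <source_prefix> <target_prefix>
map_location() {
    case $1 in
        "$2"*)
            # The location starts with the source prefix: swap the prefix.
            rest=${1#"$2"}
            printf '%s%s\n' "$3" "$rest"
            ;;
        *)
            # No match: leave the location unchanged.
            printf '%s\n' "$1"
            ;;
    esac
}
```

Ending the source path with a forward slash keeps a prefix such as /user/hive/ from also matching a sibling directory such as /user/hive2/.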

Kerberos example:

runOptions:
  kerberosInfo:
    principal: xxx/xxx@xxx.COM
    keytab: /xxx/xxx.keytab

Step 2: Set up the Spark job

  1. Log on to the header node of the EMR cluster.

  2. Download the export tool JAR package.

    wget https://dlf-lib.oss-cn-hangzhou.aliyuncs.com/migrator_jar/application-1.1.jar

  3. Create the Spark job script.

    vim /root/migrator_validate.sh
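  4. Add the following content to the script. Replace the OSS path at the end with the path where you uploaded the configuration file in Step 1. The script references application-1.1.jar by a relative path, so run it from the directory that contains the JAR package, or change the path to an absolute one.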

    #!/bin/bash
    source /etc/profile
    spark-submit --master yarn --deploy-mode client \
      --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 4 \
      --conf spark.sql.shuffle.partitions=200 \
      --conf spark.sql.adaptive.enabled=false \
      --class com.aliyun.dlf.migrator.app.MigratorApplication \
      application-1.1.jar oss://xxxx/migrator_config_validate.yml

  5. Make the script executable.

    chmod +x /root/migrator_validate.sh

Step 3: Run the job

Run immediately:

/root/migrator_validate.sh
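
For a manual run, it can be useful to keep the job output and check how the script exited. The following is a minimal sketch. `run_and_log` is a hypothetical wrapper, and it assumes that the job script exits with a non-zero status when spark-submit fails.

```shell
#!/bin/sh
# Sketch: run a job script, keep its output, and report its exit status.
# run_and_log is a hypothetical wrapper; choose any log path you like.

# Usage: run_and_log <script> <log_file>
run_and_log() {
    status=0
    # Send both stdout and stderr of the job to the log file.
    sh "$1" > "$2" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "job failed with exit status $status; see $2" >&2
    fi
    return "$status"
}
```

For example, `run_and_log /root/migrator_validate.sh /root/validate.txt` writes the job output to the same file that the cron entry in this step uses.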

Schedule with cron:

crontab -e

Add the following line to run the job daily at 22:00:

0 22 * * * nohup sh /root/migrator_validate.sh > /root/validate.txt 2>&1 &