All Products
Search
Document Center

Data Lake Formation:Migrate metadata from AWS Glue Data Catalog to Alibaba Cloud DLF

Last Updated:Mar 26, 2026

Migrate metadata—databases, tables, partitions, and functions—from Amazon Web Services (AWS) Glue Data Catalog to Alibaba Cloud Data Lake Formation (DLF) using a Spark-based migration tool. The tool compares source and destination on each run and applies only the differences, so you can schedule it as a recurring job without duplicating data.

How it works

The migration tool runs a three-phase Spark job:

  1. Extract — Connects to the AWS Glue API and reads metadata for databases, tables, partitions, and functions.

  2. Transform — Converts the extracted metadata to match DLF's data format and structure. The migration logic can be incremental update or full synchronization.

  3. Load — Uploads the processed metadata to DLF via the DLF API.

On each run, the tool compares the source (AWS Glue) against the destination (DLF) and applies only the differences:

  • Metadata that exists in AWS Glue but not in DLF is created in DLF.

  • Metadata that exists in both but differs is updated in DLF.

  • Metadata that no longer exists in AWS Glue is deleted from DLF.

image

Performance considerations

Scenario Estimated duration
First migration (~5,000,000 partitions) Approximately 2 to 3 hours (based on customer experience)
Subsequent incremental runs Within 20 minutes

To speed up large migrations, increase the value of spark.executor.instances in the spark-submit command.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud E-MapReduce (EMR) cluster with Spark 2 and internet access, or access to a Spark 2 environment in AWS EMR

  • An Alibaba Cloud account with permissions to access DLF

  • An AWS account with access to AWS Glue Data Catalog

  • An Object Storage Service (OSS) bucket to store the configuration file and (optionally) migration logs

Migrate metadata

Step 1: Download the JAR package

Download glue-1.0-SNAPSHOT-jar-with-dependencies.jar using one of the following methods:

Step 2: Prepare the configuration file

Create a file named application.properties on your local machine.

Required settings

## AWS credentials
## Leave blank if submitting the job from an AWS EMR cluster; otherwise, specify your AccessKey pair.
aws.accessKeyId=<your-aws-access-key-id>
aws.secretAccessKey=<your-aws-access-key-secret>

## AWS region
aws.region=eu-central-1

## ID of the AWS Glue Data Catalog. Defaults to your AWS account ID.
aws.catalogId=<your-aws-catalog-id>

## Migration mode. Use fixed values.
mode=increment
comparePartition=true

## Alibaba Cloud credentials. The account must have DLF access permissions.
aliyun.accessKeyId=<your-aliyun-access-key-id>
aliyun.secretAccessKey=<your-aliyun-access-key-secret>
aliyun.region=eu-central-1
aliyun.endpoint=dlf.eu-central-1.aliyuncs.com

## DLF catalog ID. Defaults to your Alibaba Cloud account UID.
aliyun.catalogId=<your-dlf-catalog-id>

## Table location mapping: maps S3 paths to OSS paths.
## Format: {"<source-s3-path>":"<destination-oss-path>"}
location.convertMap={"s3://xxx-eu-central-1/glue-migrate/":"oss://xxx-eu-central-1/glue-migrate/"}

Optional settings

Parameter Description Example
aws.databases Databases to migrate. If not set, all databases are migrated. Comma-separated. db1,db2
aws.table.filter.prefix Migrate only tables whose names start with the specified prefix. Comma-separated, no spaces. ods_,adm_
aws.partition.filter.<db>.<table> Partition filter condition for a specific table. If not set, all partitions are migrated. dt>=20220101
result.output.path OSS path for storing detailed migration logs. oss://<bucket>/path1/

After creating the file, upload it to an accessible location—such as an OSS bucket, Hadoop Distributed File System (HDFS), or Amazon S3.

For example: oss://<bucket>/path/application.properties

Step 3: Submit the Spark job

Submit the Spark job on the master node of your Alibaba Cloud EMR cluster. For instructions on creating and logging in to an EMR cluster, see Create a cluster and Log on to a cluster.

You can also submit the job from an AWS EMR cluster, as long as the Spark environment has internet access.

Run the following command on the master node:

spark-submit \
  --name migrate \
  --class com.aliyun.dlf.migrator.glue.GlueToDLFApplication \
  --deploy-mode cluster \
  --conf spark.executor.instances=10 \
  glue-1.0-SNAPSHOT-jar-with-dependencies.jar \
  oss://<bucket>/path/application.properties

Replace oss://<bucket>/path/application.properties with the path where you uploaded the configuration file in Step 2.

Parameter reference:

Parameter Description
--name Job name. Set to any value.
--class Entry point class. Always set to com.aliyun.dlf.migrator.glue.GlueToDLFApplication.
--deploy-mode Deployment mode. Valid values: cluster, client.
spark.executor.instances Number of executors. Increase this value to speed up the job.

Step 4: Verify the migration results

Open the YARN web UI and view the stdout log of the job driver. For instructions on accessing the YARN web UI, see Access the web UIs of open source components in the EMR console.

image

The log contains the following information.

image
Field Description
itemType Metadata type. Valid values: database, table, function, partition.
itemName Name of the database, table, or partition.
diffResult Difference detected between AWS Glue and DLF. Valid values: dlfNew (exists in DLF but not in AWS Glue — you must call the delete operation in DLF), glueNew (exists in AWS Glue but not in DLF — you must call the create operation in DLF), needUpdate (exists in both but content differs — you must call the update operation to synchronize the metadata).
isSuccess Whether the operation succeeded.
errorMsg Error details if the operation failed.