AnalyticDB: Migrate data from Azure Databricks Delta Lake tables

Last Updated: Oct 31, 2025

This topic describes how to use PySpark to migrate data from Azure Databricks Delta Lake tables to AnalyticDB for MySQL.

Advantages

Unlike other data lake table formats, Delta Lake tables use a design pattern with one copy of data and two copies of metadata. Based on this design, this topic provides a PySpark solution that migrates data from Azure Databricks Delta Lake tables to AnalyticDB for MySQL. During migration, you can first import a small amount of data to the destination to test the process.

This solution has the following advantages:

  • The Spark engine in AnalyticDB for MySQL can directly integrate with Azure storage accounts. This integration bypasses the Databricks Java Database Connectivity (JDBC) endpoint, resulting in higher throughput.

  • The data read process is loosely coupled with the Databricks Unity Catalog API: you only need the storage path of the Delta Lake table to migrate data. This provides greater flexibility, and the data synchronization code can be easily reused to migrate Databricks table data in other cloud environments. A minimal read-by-path sketch is provided after this list.

  • During data synchronization, you only need to specify the Databricks schema or table name; the corresponding schema and table are automatically created in AnalyticDB for MySQL. This makes it easy to migrate additional data to AnalyticDB for MySQL with the same script.
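The following is a minimal PySpark sketch of the direct-read approach. The storage account name, access key, container, and table path are placeholders, and the sketch assumes that the Hadoop-Azure and Delta Lake dependencies described in the "Preparations" section are available to the Spark engine.

```python
# Minimal sketch: read a Delta Lake table directly from Azure storage by path.
# The storage account name, access key, container, and table path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta_direct_read_sketch")
    # ABFS credential setting provided by the hadoop-azure package that is
    # listed in the "Preparations" section.
    .config(
        "spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
        "<storage-access-key>",
    )
    .getOrCreate()
)

# Storage path of the Delta Lake table, as shown in the table details or as
# returned by the Unity Catalog API.
table_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/<path-to-table>"

df = spark.read.format("delta").load(table_path)
df.show(10)
```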

Prerequisites

Preparations

  1. Download the JAR packages that the AnalyticDB for MySQL Spark engine requires to access the Azure storage account, and then upload the packages to OSS. A minimal upload sketch that uses the oss2 SDK follows the download links.

    Download links: jetty-util-ajax-9.4.51.v20230217.jar, jetty-server-9.4.51.v20230217.jar, jetty-io-9.4.51.v20230217.jar, jetty-util-9.4.51.v20230217.jar, azure-storage-8.6.0.jar, hadoop-azure-3.3.0.jar, and hadoop-azure-datalake-3.3.0.jar.
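    You can upload the downloaded JAR packages by using the OSS console, ossutil, or the oss2 Python SDK. The following is a minimal sketch based on the oss2 SDK; the AccessKey pair, endpoint, bucket name, and local directory are placeholders that you must replace.

    ```python
    # Minimal sketch: upload the downloaded JAR packages to OSS with the oss2 SDK.
    # The AccessKey pair, endpoint, bucket name, and local directory are placeholders.
    import os

    import oss2

    auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
    # Use the OSS endpoint of the region where your AnalyticDB for MySQL cluster resides.
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "testBucketName")

    local_jar_dir = "/path/to/downloaded/jars"
    for name in os.listdir(local_jar_dir):
        if name.endswith(".jar"):
            # Objects are written under the jars/ prefix, which matches the
            # OSS_JAR_ROOT_LOCATION parameter used later in this topic.
            bucket.put_object_from_file(f"jars/{name}", os.path.join(local_jar_dir, name))
            print(f"uploaded oss://testBucketName/jars/{name}")
    ```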

  2. Create a Notebook workspace.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select the region where your cluster is located. In the navigation pane on the left, click Cluster List and then click the ID of the target cluster.

    2. In the navigation pane on the left, click Job Development > Notebook Development. Make sure that the preparations are complete, and then click Go To DMS Notebook.


    3. Create a workspace.

  3. Import a data source.


  4. Create a Spark cluster.

    1. Click the resource management icon to go to the Resource Management page. Then, click Compute Cluster.

    2. Click the Spark Cluster tab. Click Create Cluster and set the following parameters:


      | Parameter | Description | Example |
      | --- | --- | --- |
      | Cluster Name | Enter a name for the cluster that helps you identify its scenario. | spark_test |
      | Runtime Environment | The following images are supported: adb-spark:v3.3-python3.9-scala2.12 and adb-spark:v3.5-python3.9-scala2.12. | adb-spark:v3.5-python3.9-scala2.12 |
      | AnalyticDB For MySQL Instance | Select an AnalyticDB for MySQL cluster from the drop-down list. | amv-uf6i4bi88**** |
      | AnalyticDB For MySQL Resource Group | Select a job resource group from the drop-down list. | testjob |
      | Spark App Executor Size | Select the resource specification for the Spark executor. Different models correspond to different specifications. For more information, see the Model column in Spark application configuration parameters. | large |
      | VSwitch | Select a vSwitch in the current VPC. | vsw-uf6n9ipl6qgo**** |

  5. Create and start a Notebook session. The first startup takes about 5 minutes.

    1. Click the resource management icon to go to the Resource Management page. Then, click Notebook Session.

    2. Click Create Session and set the following parameters:


      Parameter descriptions:

      | Parameter | Description | Example |
      | --- | --- | --- |
      | Cluster | Select the Spark cluster created in Step 4. | spark_test |
      | Session Name | Enter a custom name for the session. | new_session |
      | Image | Select an image specification: Spark3.5_Scala2.12_Python3.9:1.0.9 (recommended) or Spark3.3_Scala2.12_Python3.9:1.0.9. | Spark3.5_Scala2.12_Python3.9:1.0.9 |
      | Size | The resource specification of the kernel: 1 Core 4 GB, 2 Cores 8 GB, 4 Cores 16 GB, 8 Cores 32 GB, or 16 Cores 64 GB. | 4C16G |
      | Configuration | The profile resource. You can edit the profile name, resource release duration, data storage location, PyPI package management, and environment variable information. Important: If a resource is idle for longer than the specified release duration, it is automatically released. If you set the duration to 0, the resource is never automatically released. | default_profile |

Procedure

  1. Configure the Internet environment.

    1. Create an Internet NAT gateway. For more information, see Create an Internet NAT gateway.

      The Internet NAT gateway must reside in the same region as the AnalyticDB for MySQL cluster.

    2. Associate an elastic IP address (EIP) with the Internet NAT gateway. For more information, see Associate an EIP with the Internet NAT gateway.

    3. Create an SNAT entry.

      Important

      Create the SNAT entry at the vSwitch level. To ensure that data can be read from Azure Blob Storage and imported into AnalyticDB for MySQL, the vSwitch that you specify when you create the SNAT entry must be the same as the vSwitch that is used to create the Spark cluster. After the SNAT entry takes effect, you can verify outbound connectivity from a notebook cell, as shown in the sketch below.
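      The following is a minimal connectivity check that you can run from a notebook cell. The host names are placeholders for your Databricks workspace and Azure Data Lake Storage endpoints.

      ```python
      # Minimal connectivity check: confirm that the notebook session can reach the
      # Azure endpoints over the Internet through the NAT gateway.
      # The host names are placeholders; replace them with your actual endpoints.
      import socket

      hosts = [
          ("adb-28****.9.azuredatabricks.net", 443),        # Databricks workspace (placeholder)
          ("<storage-account>.dfs.core.windows.net", 443),  # Azure Data Lake Storage (placeholder)
      ]

      for host, port in hosts:
          try:
              with socket.create_connection((host, port), timeout=5):
                  print(f"OK: {host}:{port} is reachable")
          except OSError as err:
              print(f"FAILED: {host}:{port} is not reachable ({err})")
      ```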

  2. Download the sample file DirectlyReadDatabricksDeltaTable_V01.ipynb.

    The DirectlyReadDatabricksDeltaTable_V01.ipynb script includes the following steps (a condensed sketch of the core logic follows this list):

    1. Install the Python dependencies required to access Azure Databricks storage.

    2. Initialize the Delta Lake file path parser and test network connectivity to Azure.

    3. Define the identity credentials for accessing Azure Databricks and AnalyticDB for MySQL.

    4. Start the Spark application.

    5. Automatically create a database with the same name, read the Azure Databricks table as a DataFrame, and save the DataFrame to AnalyticDB for MySQL as a Delta Lake table with the same name.
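    The following is a condensed, hedged sketch of that flow. It is not the downloaded notebook itself: the exact dependency installation, connectivity checks, and write options in the script may differ, and the Unity Catalog call shown here is only one way to resolve the table's storage path. All values are placeholders that correspond to the parameters described in the next step.

    ```python
    # Condensed sketch of the notebook's core logic (not the downloaded script itself).
    # All values below are placeholders; replace them with your actual values, as
    # described in the parameter table in the next step.
    import requests
    from pyspark.sql import SparkSession

    DBRX_WORKSPACE_URL = "https://adb-28****.9.azuredatabricks.net"
    DBRX_ACCESS_TOKEN = "dapi****"
    DBRX_CATALOG = "markov****"
    DBRX_SCHEMA = "db_test_migration"
    DBRX_TABLE = "flat_tpch_1"
    AZURE_STORAGE_ACCOUNT_NAME = "mark****"
    AZURE_STORAGE_ACCESS_KEY = "lx8z****"
    OSS_TARGET_LOCATION = "oss://testBucketName/db_test_migration/"
    MAX_ROWS_COUNT = 10000

    # 1. Resolve the storage location of the Delta Lake table through the
    #    Unity Catalog REST API (GET /api/2.1/unity-catalog/tables/{full_name}).
    full_name = f"{DBRX_CATALOG}.{DBRX_SCHEMA}.{DBRX_TABLE}"
    resp = requests.get(
        f"{DBRX_WORKSPACE_URL}/api/2.1/unity-catalog/tables/{full_name}",
        headers={"Authorization": f"Bearer {DBRX_ACCESS_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    table_location = resp.json()["storage_location"]  # abfss://<container>@<account>.dfs.core.windows.net/...

    # 2. Configure the Spark session so that it can read from Azure Data Lake Storage.
    spark = (
        SparkSession.builder
        .config(
            f"spark.hadoop.fs.azure.account.key.{AZURE_STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
            AZURE_STORAGE_ACCESS_KEY,
        )
        .getOrCreate()
    )

    # 3. Read the Delta Lake table directly by its storage path and cap the row count.
    df = spark.read.format("delta").load(table_location).limit(MAX_ROWS_COUNT)

    # 4. Create a database with the same name in AnalyticDB for MySQL and save the
    #    DataFrame as a Delta Lake table with the same name (data is stored in OSS).
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {DBRX_SCHEMA} LOCATION '{OSS_TARGET_LOCATION}'")
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("path", f"{OSS_TARGET_LOCATION}{DBRX_TABLE}/")
        .saveAsTable(f"{DBRX_SCHEMA}.{DBRX_TABLE}")
    )
    ```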

  3. Upload the DirectlyReadDatabricksDeltaTable_V01.ipynb file to the Default folder.

    1. Go to DMS Notebook. In the navigation pane on the left, click the resource manager icon to go to the Resource Manager page.

    2. Hover over Default, right-click, and then click Upload File. Select the sample file that you downloaded in Step 2.


  4. Execute the DirectlyReadDatabricksDeltaTable_V01.ipynb file to migrate the Azure Databricks data.

    1. Open the DirectlyReadDatabricksDeltaTable_V01.ipynb file and replace the parameter values in the third cell with your actual values.


      Parameter descriptions:

      | Parameter | Description | Example |
      | --- | --- | --- |
      | DBRX_WORKSPACE_URL | The workspace URL of Azure Databricks, in the format https://adb-28****.9.azuredatabricks.net. For more information, see Workspace URL. | https://adb-28****.9.azuredatabricks.net |
      | DBRX_ACCESS_TOKEN | The access token for Azure Databricks. Make sure that the token has read permissions on the catalog or schema. For more information, see Manage personal access token permissions. | dapi**** |
      | DBRX_CATALOG | The name of the Azure Databricks catalog. | markov**** |
      | DBRX_SCHEMA | The name of the Azure Databricks schema. | db_test_migration |
      | DBRX_TABLE | The name of the Azure Databricks table. The table must be a Delta Lake table. | flat_tpch_1 |
      | AZURE_STORAGE_ACCOUNT_NAME | The name of the Azure Data Lake Storage account that stores the Azure Databricks table. | mark**** |
      | AZURE_STORAGE_ACCESS_KEY | The access key of the Azure Data Lake Storage account. | lx8z**** |
      | OSS_TARGET_LOCATION | The OSS path in which the Delta Lake table data is stored in AnalyticDB for MySQL. | oss://testBucketName/db_test_migration/ |
      | OSS_JAR_ROOT_LOCATION | The OSS folder that stores the JAR packages required to access Azure Databricks. These are the packages that you uploaded in the "Preparations" section. | oss://testBucketName/jars/ |
      | MAX_ROWS_COUNT | The maximum number of rows to read during migration. Default value: 10000. To migrate a larger amount of data, increase the EIP bandwidth to 200 Mbit/s or higher. Import efficiency depends on the available bandwidth. | 10000 |

    2. Click Run All to execute the file sequentially.

  5. View the migration result.

    After the data migration is complete, you can view the migrated database and table in the AnalyticDB for MySQL cluster. You can also run a quick query from the notebook session, as shown in the sketch after these steps.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

    2. In the navigation pane on the left, click Data Management > Data Catalog.

    3. Click the target database and table to view details, such as the table type, storage data size, and column names.

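    As an optional check, you can also query the migrated table from the same notebook session. The following sketch uses the example database and table names from this topic.

    ```python
    # Optional check: confirm that the migrated table exists and inspect its row count.
    # db_test_migration and flat_tpch_1 are the example names used in this topic.
    from pyspark.sql import SparkSession

    # In a DMS Notebook session, a SparkSession is usually already available;
    # getOrCreate() reuses it if so.
    spark = SparkSession.builder.getOrCreate()

    spark.sql("SHOW TABLES IN db_test_migration").show()
    spark.sql("SELECT COUNT(*) AS row_count FROM db_test_migration.flat_tpch_1").show()
    ```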