This topic describes how to use PySpark to migrate data from Azure Databricks Delta Lake tables to AnalyticDB for MySQL.
Advantages
Unlike other data lake table formats, Delta Lake tables use a design pattern with one data copy and two metadata copies. This topic provides a PySpark-based solution to migrate data from Azure Databricks Delta Lake tables to AnalyticDB for MySQL. During the migration, you can first import a small amount of data to the destination to test the process.
This solution has the following advantages:
- The Spark engine in AnalyticDB for MySQL can directly integrate with Azure storage accounts. This integration bypasses the Databricks Java Database Connectivity (JDBC) endpoint, resulting in higher throughput (see the sketch after this list).
- The data read process is loosely coupled with the Databricks Unity Catalog API. You only need the storage path of the Delta Lake table to migrate data, which provides greater flexibility. The data synchronization code can be easily reused to migrate Databricks table data in other cloud environments.
- During data synchronization, you only need to specify the Databricks schema or table name, and the schema and table are automatically migrated to AnalyticDB for MySQL. You can reuse the script to migrate more data to AnalyticDB for MySQL in an easy and flexible manner.
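To make the first advantage concrete, the following is a minimal sketch of reading a Delta Lake table directly from its Azure storage path with PySpark, without going through the Databricks JDBC endpoint. It is not the notebook code itself; the storage account name, access key, container, and table path in angle brackets are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read a Delta Lake table directly from its Azure storage path,
# bypassing the Databricks JDBC endpoint. Angle-bracket values are placeholders.
spark = SparkSession.builder.appName("delta-direct-read").getOrCreate()

# Provide the storage account key so that the ABFS driver can authenticate.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-access-key>",
)

df = spark.read.format("delta").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/<path-to-delta-table>"
)
df.show(10)
```

Because only the storage path and credentials are required, the same read logic can be reused for Delta Lake tables stored in other clouds, provided that the corresponding Hadoop connector JARs are available.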
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
If you do not have a cluster that meets these requirements, you can log on to the AnalyticDB for MySQL console to create an Enterprise Edition or Basic Edition cluster.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
A job resource group is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
Preparations
Download the JAR packages that the AnalyticDB for MySQL Spark engine requires to access the Azure storage account, and then upload the packages to OSS.
Download links: jetty-util-ajax-9.4.51.v20230217.jar, jetty-server-9.4.51.v20230217.jar, jetty-io-9.4.51.v20230217.jar, jetty-util-9.4.51.v20230217.jar, azure-storage-8.6.0.jar, hadoop-azure-3.3.0.jar, and hadoop-azure-datalake-3.3.0.jar.
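For reference, the following is a hedged sketch of how the uploaded JAR packages are typically referenced through the spark.jars property when the Spark application runs. The OSS bucket name and prefix are placeholders; set the resulting value in the notebook or in the Spark application configuration.

```python
# Sketch: build the spark.jars value from the Azure connector JARs uploaded to OSS.
# The bucket name and prefix are placeholders.
jar_root = "oss://<your-bucket>/jars/"
jar_names = [
    "jetty-util-ajax-9.4.51.v20230217.jar",
    "jetty-server-9.4.51.v20230217.jar",
    "jetty-io-9.4.51.v20230217.jar",
    "jetty-util-9.4.51.v20230217.jar",
    "azure-storage-8.6.0.jar",
    "hadoop-azure-3.3.0.jar",
    "hadoop-azure-datalake-3.3.0.jar",
]
# Pass this comma-separated list as the value of the "spark.jars" configuration item.
spark_jars = ",".join(jar_root + name for name in jar_names)
```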
Create a Notebook workspace.
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select the region where your cluster is located. In the navigation pane on the left, click Cluster List and then click the ID of the target cluster.
In the navigation pane on the left, go to the Notebook page. Make sure that the preparations are complete, and then click Go To DMS Notebook.


Create a Spark cluster.
Click the icon to go to the Resource Management page. Then, click Compute Cluster. On the Spark Cluster tab, click Create Cluster and set the following parameters:

- Cluster Name: Enter a name for the cluster that helps you identify its scenario. Example: spark_test.
- Runtime Environment: The following images are supported: adb-spark:v3.3-python3.9-scala2.12 and adb-spark:v3.5-python3.9-scala2.12. Example: adb-spark:v3.5-python3.9-scala2.12.
- AnalyticDB for MySQL Instance: Select an AnalyticDB for MySQL cluster from the drop-down list. Example: amv-uf6i4bi88****.
- AnalyticDB for MySQL Resource Group: Select a job resource group from the drop-down list. Example: testjob.
- Spark App Executor Size: Select the resource specification for the Spark executor. Different models correspond to different specifications. For more information, see the Model column in Spark application configuration parameters. Example: large.
- VSwitch: Select a vSwitch in the current VPC. Example: vsw-uf6n9ipl6qgo****.
Create and start a Notebook session. The first startup takes about 5 minutes.
Click the icon to go to the Resource Management page. Then, click Notebook Session. Click Create Session and set the following parameters:

Parameter descriptions:
- Cluster: Select the Spark cluster that you created in the previous step. Example: spark_test.
- Session Name: Enter a custom name for the session. Example: new_session.
- Image: Select an image specification. Supported images: Spark3.5_Scala2.12_Python3.9:1.0.9 (Recommended) and Spark3.3_Scala2.12_Python3.9:1.0.9. Example: Spark3.5_Scala2.12_Python3.9:1.0.9.
- Size: The resource specification of the kernel. Valid values: 1 Core 4 GB, 2 Cores 8 GB, 4 Cores 16 GB, 8 Cores 32 GB, and 16 Cores 64 GB. Example: 4C16G.
- Configuration: The profile resource. You can edit the profile name, resource release duration, data storage location, PyPI package management, and environment variable information. Important: If a resource is idle for longer than the specified resource release duration, it is automatically released. If you set this duration to 0, the resource is never automatically released. Example: default_profile.
Procedure
Configure the Internet environment.
Create an Internet NAT gateway. For more information, see Create an Internet NAT gateway.
The Internet NAT gateway must reside in the same region as the AnalyticDB for MySQL cluster.
Associate an elastic IP address (EIP) with the Internet NAT gateway. For more information, see Associate an EIP with the Internet NAT gateway.
Create an SNAT entry at the vSwitch level.
Important: To ensure that data can be read from and imported into Azure Blob Storage, the vSwitch that you specify when you create the SNAT entry must be the same as the vSwitch that is used to create the Spark cluster.
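After the NAT gateway, EIP, and SNAT entry are in place, you can run a quick connectivity check from a notebook cell before starting the migration. The following is a minimal sketch; the storage account name is a placeholder, and the endpoint shown is the standard Azure Data Lake Storage Gen2 endpoint.

```python
import socket

# Minimal connectivity check: confirm that the notebook can reach the Azure Data
# Lake Storage Gen2 endpoint on port 443 through the NAT gateway and SNAT entry.
endpoint = "<storage-account>.dfs.core.windows.net"
try:
    socket.create_connection((endpoint, 443), timeout=10).close()
    print(f"Connection to {endpoint}:443 succeeded.")
except OSError as exc:
    print(f"Connection to {endpoint}:443 failed: {exc}")
```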
Download the sample file DirectlyReadDatabricksDeltaTable_V01.ipynb.
The DirectlyReadDatabricksDeltaTable_V01.ipynb script includes the following steps (a simplified sketch follows this list):
Install the Python dependencies required to access Azure Databricks storage.
Initialize the Delta Lake file path parser and test network connectivity to Azure.
Define the identity credentials for accessing Azure Databricks and AnalyticDB for MySQL.
Start the Spark application.
Automatically create a database with the same name. The script reads the Azure Databricks table as a DataFrame. Then, the script saves the data from the DataFrame to AnalyticDB for MySQL and creates a Delta Lake table with the same name.
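The following is a simplified, hedged sketch of that flow, not the exact notebook content. The variable names and example values match the parameters described later in this topic. The Unity Catalog REST call used to resolve the table's storage location and the CREATE DATABASE ... LOCATION statement are assumptions about how the storage path and the destination database can be handled; the notebook may implement these steps differently.

```python
import requests
from pyspark.sql import SparkSession

# Hedged sketch of the migration flow. Variable names match the parameters in
# the "Parameter descriptions" section; replace the values with your own.
DBRX_WORKSPACE_URL = "https://adb-28****.9.azuredatabricks.net"
DBRX_ACCESS_TOKEN = "dapi****"
DBRX_CATALOG = "markov****"
DBRX_SCHEMA = "db_test_migration"
DBRX_TABLE = "flat_tpch_1"
AZURE_STORAGE_ACCOUNT_NAME = "mark****"
AZURE_STORAGE_ACCESS_KEY = "lx8z****"
OSS_TARGET_LOCATION = "oss://testBucketName/db_test_migration/"
MAX_ROWS_COUNT = 10000

# 1. Resolve the Delta Lake table's storage path through the Unity Catalog REST
#    API, so that only the path is needed to read the data.
full_name = f"{DBRX_CATALOG}.{DBRX_SCHEMA}.{DBRX_TABLE}"
response = requests.get(
    f"{DBRX_WORKSPACE_URL}/api/2.1/unity-catalog/tables/{full_name}",
    headers={"Authorization": f"Bearer {DBRX_ACCESS_TOKEN}"},
    timeout=30,
)
storage_location = response.json()["storage_location"]

# 2. Configure the Azure storage credentials and read the table as a DataFrame,
#    capped at MAX_ROWS_COUNT rows.
spark = SparkSession.builder.getOrCreate()
spark.conf.set(
    f"fs.azure.account.key.{AZURE_STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
    AZURE_STORAGE_ACCESS_KEY,
)
df = spark.read.format("delta").load(storage_location).limit(MAX_ROWS_COUNT)

# 3. Create a database with the same name in AnalyticDB for MySQL and save the
#    DataFrame as a Delta Lake table with the same name, stored in OSS.
spark.sql(
    f"CREATE DATABASE IF NOT EXISTS {DBRX_SCHEMA} LOCATION '{OSS_TARGET_LOCATION}'"
)
df.write.format("delta").mode("overwrite").saveAsTable(f"{DBRX_SCHEMA}.{DBRX_TABLE}")
```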
Upload the DirectlyReadDatabricksDeltaTable_V01.ipynb file to the Default folder.
Go to DMS Notebook. In the navigation pane on the left, click the icon to go to the Resource Manager page.
Hover over Default, right-click, and then click Upload File. Select the sample file that you downloaded in Step 2.

Execute the DirectlyReadDatabricksDeltaTable_V01.ipynb file to migrate the Azure Databricks data.
Open the DirectlyReadDatabricksDeltaTable_V01.ipynb file and replace the parameter values in the third cell with your actual values.
Parameter descriptions:
- DBRX_WORKSPACE_URL: The workspace URL of Azure Databricks, in the format https://adb-28****.9.azuredatabricks.net. For more information, see Workspace URL. Example: https://adb-28****.9.azuredatabricks.net.
- DBRX_ACCESS_TOKEN: The access token for Azure Databricks. Make sure that the token has read permissions on the catalog or schema. For more information, see Manage personal access token permissions. Example: dapi****.
- DBRX_CATALOG: The name of the Azure Databricks catalog. Example: markov****.
- DBRX_SCHEMA: The name of the Azure Databricks schema. Example: db_test_migration.
- DBRX_TABLE: The name of the Azure Databricks table. The table must be a Delta Lake table. Example: flat_tpch_1.
- AZURE_STORAGE_ACCOUNT_NAME: The name of the Azure Data Lake Storage account where the Azure Databricks table is stored. Example: mark****.
- AZURE_STORAGE_ACCESS_KEY: The access key of the Azure Data Lake Storage account. Example: lx8z****.
- OSS_TARGET_LOCATION: The OSS path in which the Delta Lake table data is stored in AnalyticDB for MySQL. Example: oss://testBucketName/db_test_migration/.
- OSS_JAR_ROOT_LOCATION: The path of the OSS folder that stores the JAR packages required to access Azure Databricks, which you downloaded in the "Preparations" section. Example: oss://testBucketName/jars/.
- MAX_ROWS_COUNT: The maximum number of rows to read during migration. Default value: 10000. To migrate a larger amount of data, increase the EIP bandwidth to 200 Mbit/s or more; import efficiency depends on the available bandwidth. Example: 10000.
Click Run All to execute the file sequentially.
After the data migration is complete, you can view the migrated database and table in the AnalyticDB for MySQL cluster.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the navigation pane on the left, find the migrated database.
Click the target database and table to view details such as the table type, storage data size, and column names.
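If you prefer to verify the result from the notebook itself, a quick row-count check is usually enough. The following is a minimal sketch; DBRX_SCHEMA and DBRX_TABLE are the same values that you used in the migration script.

```python
from pyspark.sql import SparkSession

# Minimal verification sketch: check the row count of the migrated table in
# AnalyticDB for MySQL and compare it with the number of rows read from Databricks.
spark = SparkSession.builder.getOrCreate()
DBRX_SCHEMA = "db_test_migration"
DBRX_TABLE = "flat_tpch_1"

migrated_count = spark.sql(
    f"SELECT COUNT(*) AS cnt FROM {DBRX_SCHEMA}.{DBRX_TABLE}"
).collect()[0]["cnt"]
print(f"Rows in {DBRX_SCHEMA}.{DBRX_TABLE}: {migrated_count}")
```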
