This topic describes how to migrate scheduling workflows from Azure Data Factory (ADF) to DataWorks using the LHM migration tool. The process involves three steps: exporting tasks from ADF, converting the tasks, and importing the tasks into DataWorks.
1. Environment preparation
1.1. Runtime environment preparation
No. | Item | Specifications | Quantity | Notes |
1 | ECS | 4-core 16 GB or higher | 1 | CentOS or AliyunOS images are supported. |
2 | JDK 17 | | | |
3 | Runtime package | Download link | | |
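To confirm that the environment meets the JDK requirement, you can check the installed version on the ECS instance:
# The migration tool requires JDK 17; the command should report version 17.x.
java -version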
1.2. Network preparation
The VPC of the ECS instance must have Internet access.
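A quick way to verify outbound access from the ECS instance is to request the Azure management endpoint that is used later in the export configuration (adjust the URL for Azure China, and add the proxy option only if you access Azure through a proxy):
# Check that the Azure Resource Manager endpoint is reachable; any HTTP status code means the connection works.
# Add -x http://<proxy_host>:<proxy_port> if a proxy is required.
curl -sS -o /dev/null -w "HTTP %{http_code}\n" https://management.azure.com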
1.3. Account and permission preparation
1.3.1. Use the service registration authentication method
Register an application.
Obtain a client secret.
1.3.2. Configure Azure Data Factory permissions
1.3.3. Configure blob permissions
Optional. This step is required only if a node reads file content from a blob. Grant the registered application permissions on the blob in the same way that you granted it permissions on Azure Data Factory.
1.3.4. Configure Databricks permissions
Optional. This step is required to read external script files for Databricks nodes in ADF.
Go to the Databricks workspace.
Click User Settings.
Go to the Developer page on the left.
Click Generate new token.
2. Export tasks from Azure Data Factory
You can export tasks using a software development kit (SDK). The dependencies are as follows:
<dependency>
<groupId>com.azure.resourcemanager</groupId>
<artifactId>azure-resourcemanager-datafactory</artifactId>
</dependency>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-identity</artifactId>
</dependency>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-storage-blob</artifactId>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>databricks-sdk-java</artifactId>
</dependency>
2.1. Configuration file
{
"schedule_datasource": {
"name": "name",
"type": "adf",
"properties": {
"isUseProxy": false,
"proxyHost": "proxy.msl.cn",
"proxyPort": "8080",
"AzureCloud": "AZURE_CHINA_CLOUD",
"apiMode": "sdk"
"endpoint": "https://management.azure.com",
"subscriptionId": "xxx",
"factory": "bigdata-adf-jiman",
"project": "bigdata-adf-jiman",
"resourceGroupName": "biadata",
"tenantId": "xxx",
"clientId": "xxx",
"clientSecretValue": "xxxx",
"dbr_endpoint": "https://adb-xxxxx.16.azuredatabricks.net",
"dbr_token": "xxxxx",
"pipelineNameWhite": "dbr-demo"
},
"operaterType": "AUTO"
}
}
2.2. Parameter description
No. | Parameter | Required | Example | Notes |
1 | isUseProxy | No | false | Required when you use a proxy to access Azure. |
2 | proxyHost | No | proxy.msl.cn | Required when you use a proxy to access Azure. |
3 | proxyPort | No | 8080 | Required when you use a proxy to access Azure. |
4 | AzureCloud | No | AZURE_CHINA_CLOUD | The Azure cloud environment, which corresponds to the region type of your Azure account. Default value: AZURE_CHINA_CLOUD. |
5 | endpoint | No | https://management.azure.com | The management endpoint. It is optional and corresponds to the AzureCloud setting. |
6 | subscriptionId | Yes | xxxxx | The subscription ID. You can obtain it from the overview page of the data factory. |
7 | resourceGroupName | Yes | xxxxx | The resource group name. You can obtain it from the overview page of the data factory. |
8 | factory | Yes | bigdata-adf-jiman | Mainly used as an identifier. |
9 | project | No | bigdata-adf-jiman | Keep this consistent with the factory parameter. |
10 | tenantId | Yes | xxxxx | You can obtain this during the service registration step. |
11 | clientId | Yes | xxxxx | You can obtain this during the service registration step. |
12 | clientSecretValue | Yes | xxxxx | You can obtain this during the service registration step. |
13 | dbr_endpoint | No | xxxxx | The Databricks workspace URL. Required only if the migration includes Databricks nodes. |
14 | dbr_token | No | xxxxx | The Databricks access token generated in section 1.3.4. Required only if the migration includes Databricks nodes. |
15 | pipelineNameWhite | No | dbr-demo | A whitelist. To specify multiple pipeline names, separate them with commas (,). |
2.3. Run the export command
mkdir result
sh ./bin/run.sh read \
-c ./conf/<your_config_file>.json \
-o ./data/1_ReaderOutput/<source_export_package>.zip \
-t adf-reader
Command parameter descriptions
No. | Parameter | Required | Example | Notes |
1 | -c | Yes | ./conf/<your_config_file>.json | The path of the configuration file. |
2 | -o | Yes | ./data/1_ReaderOutput/<source_export_package>.zip | The output path. |
3 | -t | Yes | adf-reader | The plugin type. This is a static field. |
Example of a complete command
mkdir result
sh ./bin/run.sh read -c ./conf/read.json -f ./result/temp.zip -o ./result/read_out.zip -t adf-reader
2.4. View the export results
Open the generated package ReaderOutput.zip in the ./data/1_ReaderOutput/ directory to preview the export results. The statistical report provides a summary of basic information about workflows, nodes, resources, functions, and data sources. The data/project folder contains the standardized data structure for the scheduling information.
The statistical report provides two special features:
1. You can change some properties of workflows and nodes in the report. The editable fields are identified by blue font. When you import the tasks into DataWorks in the next stage, the tool applies these property changes.
2. You can skip certain workflows during the import to DataWorks by deleting their corresponding rows in the workflow child table. This acts as a workflow blacklist. Note: If workflows have dependencies on each other, they must be imported in the same batch. Do not use the blacklist to separate them, because this will cause an error.
For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.
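If you want to inspect the export package from the command line before converting it, you can list its contents. The path below follows the complete export example above:
# Preview the package contents: the statistical report plus the data/project folder.
unzip -l ./result/read_out.zip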
3. Convert scheduling tasks
3.1. Configuration file
{
"conf": {
"locale": "zh_CN"
},
"self": {
"if.use.default.convert": false,
"if.use.migrationx.before": false,
"if.use.dataworks.newidea": true,
"filter.rule": []
},
"schedule_datasource": {
"name": "adf",
"type": "Adf"
},
"target_schedule_datasource": {
"name": "name",
"type": "DataWorks"
}
}
3.2. Parameter description
The default settings are sufficient.
No. | Parameter | Required | Example | Notes |
1 | filter.rule | No | [ { "type": "black", "element": "node", "field": "name", "value": "DataXNode" } ] | The filter rule. type specifies a blacklist or whitelist (BLACK or WHITE). element specifies the element type to filter (NODE or WORKFLOW). field specifies whether to filter by ID or by name (ID or NAME). value specifies the value; to specify multiple values, separate them with commas (,). |
schedule_datasource.name identifies the data source, which corresponds to the root directory of all workflows.
3.3. Built-in node conversion and mapping logic
3.3.1. General node mapping logic
ADF node type | DataWorks node type | Notes |
Copy | DI | Currently, only the original JSON script is available. The specific content of the Data Integration (DI) node depends on the data source type. The conversion requires information about all DI-related data source types. |
Delete | DIDE_SHELL | A shell script that concatenates the file paths for deletion. |
SqlServerStoredProcedure | ODPS_SQL | Converted to an ODPS SQL node by default. If the conversion logic identifies the target as SQL Server, a SQL Server node is used instead. |
DatabricksSparkJar | ODPS_SPARK | Converted to an ODPS Spark node using the available parameters. Some information may be missing. |
DatabricksSparkPython | ODPS_SPARK | Converted to an ODPS Spark node using the available parameters. Some information may be missing. |
DatabricksNotebook | NOTEBOOK | Scala-related cells are not processed. |
SynapseNotebook | NOTEBOOK | Scala-related cells are not processed. |
SparkJob | ODPS_SPARK | Currently converted to a virtual node as a placeholder. |
AppendVariable | CONTROLLER_ASSIGNMENT | Assignment node. |
ExecutePipeline | SUB_PROCESS | |
Script | ODPS_SQL | |
Wait | DIDE_SHELL | |
WebActivity | DIDE_SHELL | |
IfCondition | CONTROLLER_BRANCH | Converted to a branch, a subprocess, and a merge node. |
ForEach | CONTROLLER_TRAVERSE | Traverse (foreach) node. |
Switch | CONTROLLER_BRANCH | Converted to a branch, a subprocess, and a merge node. |
Until | CONTROLLER_CYCLE | do-while node. |
Lookup | CONTROLLER_ASSIGNMENT | |
Filter | CONTROLLER_ASSIGNMENT | |
GetMetadata | CONTROLLER_ASSIGNMENT | |
SetVariable | CONTROLLER_ASSIGNMENT | |
HDInsightHive | ODPS_SQL | |
HDInsightSpark | ODPS_SPARK | |
HDInsightMapReduce | ODPS_MR | |
Other | VIRTUAL | |
Note: Resource files must be uploaded manually.
3.3.2. Logical node mapping logic
IfCondition
ForEach
Switch
3.4. Run the conversion command
sh ./bin/run.sh convert \
-c ./conf/<your_config_file>.json \
-f ./data/1_ReaderOutput/<source_export_package>.zip \
-o ./data/2_ConverterOutput/<conversion_output_package>.zip \
-t adf-dw-converter
No. | Parameter | Required | Example | Notes |
1 | -c | Yes | ./conf/<your_config_file>.json | The path of the configuration file. |
2 | -f | Yes | ./data/1_ReaderOutput/<source_export_package>.zip | The exported package from the source. |
3 | -o | Yes | ./result/convert_out.zip | The path for the output package. |
4 | -t | Yes | adf-dw-converter | The plugin type. This is a static field. |
3.5. View the conversion results
Open the generated package ConverterOutput.zip in the ./data/2_ConverterOutput/ directory to preview the conversion results.
The statistical report provides a summary of basic information about the converted workflows, nodes, resources, functions, and data sources.
The data/project folder contains the main converted scheduling migration package.
The statistical report provides two special features:
1. You can change some properties of workflows and nodes in the report. The editable fields are identified by blue font. When you import the tasks into DataWorks in the next stage, the tool applies these property changes.
2. You can skip certain workflows during the import to DataWorks by deleting their corresponding rows in the workflow child table. This acts as a workflow blacklist. Note: If workflows have dependencies on each other, they must be imported in the same batch. Do not use the blacklist to separate them, because this will cause an error.
For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.
4. Import tasks to DataWorks
The LHM migration tool converts scheduling elements from the source to the DataWorks scheduling format. The tool provides a unified entry point for uploading that lets you import workflows into DataWorks for different migration scenarios.
The import tool supports multiple write operations and automatically creates or updates workflows in overwrite mode.
1. Prerequisites
1.1. Successful conversion
The conversion tool must have run successfully, converting the source scheduling information into the DataWorks scheduling format and generating the ConverterOutput.zip package.
(Optional, recommended) Open the conversion output package and view the statistical report to verify that the migration scope was converted successfully.
1.2. DataWorks configuration
In DataWorks, perform the following actions:
1. Create a workspace.
2. Create an AccessKey pair (AccessKey ID and AccessKey secret) and ensure it has administrator permissions for the workspace. (We strongly recommend creating an AccessKey pair that is bound to your account for easier troubleshooting.)
3. In the workspace, create data sources, attach computing resources, and create resource groups.
4. In the workspace, upload resource files and create user-defined functions (UDFs).
1.3. Network connectivity check
Verify the connection to the DataWorks endpoint.
For a list of service endpoints, see:
ping dataworks.aliyuncs.com
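Because ICMP traffic is blocked in some environments, you can also confirm HTTPS connectivity to the regional endpoint that you plan to configure, for example the China (Hangzhou) endpoint used in the sample configuration below:
# Any HTTP status code in the response means the DataWorks endpoint is reachable.
curl -sS -o /dev/null -w "HTTP %{http_code}\n" https://dataworks.cn-hangzhou.aliyuncs.com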
2. Import configuration items
Create a JSON configuration file for the import, such as writer.json, in the tool's conf folder.
Before you use the file, delete the comments from the JSON.
{
"schedule_datasource": {
"name": "YourDataWorks", // Give your DataWorks data source a name.
"type": "DataWorks",
"properties": {
"endpoint": "dataworks.cn-hangzhou.aliyuncs.com", // Service endpoint
"project_id": "YourProjectId", // Workspace ID
"project_name": "YourProject", // Workspace name
"ak": "************", // AccessKey ID
"sk": "************" // AccessKey secret
},
"operaterType": "MANUAL"
},
"conf": {
"di.resource.group.identifier": "Serverless_res_group_***_***", // Scheduling resource group
"resource.group.identifier": "Serverless_res_group_***_***", // Data Integration resource group
"dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls", // Path of the DataWorks node type table
"qps.limit": 5 // The upper limit for queries per second (QPS) of API requests sent to DataWorks
}
}
2.1. Service endpoint
Select a service endpoint based on the region where your DataWorks workspace is located. For more information, see:
2.2. Workspace ID and name
Open the DataWorks console. Go to the workspace details page. Obtain the workspace ID and name from the basic information on the right.
2.3. Create and authorize an AccessKey pair
On the user page, create an AccessKey pair that has administrator read and write permissions for the target DataWorks workspace.
Permission management involves two locations. If you are using a Resource Access Management (RAM) user, you must first grant the RAM user permissions to operate DataWorks.
Policy page: https://ram.console.aliyun.com/policies
Then, in the DataWorks workspace, assign workspace permissions to the account.
Note: You can set a network access control policy for the AccessKey. Make sure that the IP address of the machine where the migration tool is located has access.
2.4. Resource group
From the menu bar on the left of the DataWorks workspace details page, go to the resource group page. Attach a resource group and obtain its ID.
A general-purpose resource group can be used for node scheduling and data integration. You can set the scheduling resource group (resource.group.identifier) and the data integration resource group (di.resource.group.identifier) to the same general-purpose resource group.
2.5. QPS settings
The tool imports data by calling DataWorks APIs. Different DataWorks editions have different queries per second (QPS) and daily call limits for read and write OpenAPI operations. For more information, see Limits.
For DataWorks Basic Edition, Standard Edition, and Professional Edition, we recommend setting "qps.limit" to 5. For Enterprise Edition, we recommend setting "qps.limit" to 20.
Note: Avoid running multiple import tools at the same time.
2.6. DataWorks node type ID settings
In DataWorks, some node types are assigned different type IDs in different regions. The specific type ID is determined by the Data Development interface in DataWorks. This mainly applies to database nodes. For more information, see Database nodes.
For example, the NodeTypeId for a MySQL node is 1000039 in the Hangzhou region and 1000041 in the Shenzhen region.
To adapt to these regional differences, the tool lets you configure the node type ID table that it uses.
The table is imported using a configuration item in the import tool:
"conf": {
"dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls" // Path of the DataWorks node type table
}
To obtain a node type ID from the DataWorks Data Development interface, create a new workflow, add a new node to the workflow, save it, and then view the workflow's Spec.
If the node type is configured incorrectly, an error is reported when the workflow is published.
3. Run the DataWorks import tool
The import tool is called from the command line. The command is as follows:
sh ./bin/run.sh write \
-c ./conf/<your_config_file>.json \
-f ./data/2_ConverterOutput/<conversion_output_package>.zip \
-o ./data/4_WriterOutput/<import_result_package>.zip \
-t dw-newide-writer
In the command, -c is the path of the configuration file, -f is the storage path of the ConverterOutput package, -o is the storage path of the WriterOutput package, and -t is the name of the submission plugin.
For example, to import Project A into DataWorks:
sh ./bin/run.sh write \
-c ./conf/projectA_write.json \
-f ./data/2_ConverterOutput/projectA_ConverterOutput.zip \
-o ./data/4_WriterOutput/projectA_WriterOutput.zip \
-t dw-newide-writer
The import tool prints progress information while it runs. Check the output for errors. After the import is complete, statistics on successful and failed imports are printed to the command line. The failure of individual nodes does not stop the overall import. If a small number of nodes fail to import, you can fix them manually in DataWorks.
4. View the import results
After the import is complete, you can view the results in DataWorks. You can also monitor the workflow import process. If you find a problem and need to stop the import, run the jps command to find BwmClientApp and then run kill -9 to terminate the process.
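For example, the following commands find and terminate the import process:
# Find the PID of the import tool process, then terminate it.
jps | grep BwmClientApp
kill -9 <pid>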
5. Q&A
5.1. The source is under continuous development. How are these increments and changes submitted to DataWorks?
The migration tool operates in overwrite mode. You can rerun the export, conversion, and import processes to submit incremental changes from the source to DataWorks. Note that the tool matches workflows by their full path to decide whether to create or update them. To ensure changes are migrated correctly, do not move the workflows.
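For example, an incremental rerun repeats the same three commands with the same configuration files. The paths and the convert.json file name below are illustrative and follow the earlier examples; adjust them to your environment:
# 1. Re-export the tasks from ADF.
sh ./bin/run.sh read -c ./conf/read.json -o ./result/read_out.zip -t adf-reader
# 2. Re-convert the export package.
sh ./bin/run.sh convert -c ./conf/convert.json -f ./result/read_out.zip -o ./result/convert_out.zip -t adf-dw-converter
# 3. Re-import the converted package into DataWorks (existing workflows are updated in overwrite mode).
sh ./bin/run.sh write -c ./conf/writer.json -f ./result/convert_out.zip -o ./result/write_out.zip -t dw-newide-writer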
5.2. The source is under continuous development, and workflows in DataWorks are being transformed and governed. Will incremental migration overwrite the changes in DataWorks?
Yes, it will. The migration tool operates in overwrite mode. We recommend that you perform subsequent transformations in DataWorks only after the migration is complete. Alternatively, you can use a phased migration approach. After a batch of workflows is migrated and you confirm they will not be overwritten again, you can start transforming them in DataWorks. Different batches do not affect each other.
5.3. The entire package takes too long to import. Can I import only a part of it?
Yes, you can. You can manually trim the package to perform a partial import. In the data/project/workflow folder, keep the workflows you want to import and delete the others. Then, recompress the folder into a package and run the import tool. Note that workflows with mutual dependencies must be imported together. Otherwise, the node lineage between them will be lost.
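The following is a sketch of the trimming steps, using a hypothetical package name, a hypothetical workflow folder name, and the folder layout described above:
# Unpack the conversion output package.
unzip ./data/2_ConverterOutput/projectA_ConverterOutput.zip -d ./projectA_partial
# Delete the workflows that you do not want to import; keep workflows with mutual dependencies together.
rm -rf ./projectA_partial/data/project/workflow/<unwanted_workflow>
# Repack the remaining content into a new package and run the import tool against it.
cd ./projectA_partial && zip -r ../projectA_partial.zip . && cd ..
sh ./bin/run.sh write -c ./conf/projectA_write.json -f ./projectA_partial.zip -o ./data/4_WriterOutput/projectA_partial_WriterOutput.zip -t dw-newide-writer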