Migration Hub: Azure Data Factory (ADF) - DataWorks

Last Updated: Sep 17, 2025

This topic describes how to migrate scheduling workflows from Azure Data Factory (ADF) to DataWorks using the LHM migration tool. The process involves three steps: exporting tasks from ADF, converting the tasks, and importing the tasks into DataWorks.

1. Environment preparation

1.1. Runtime environment preparation

| No. | Item | Specifications | Quantity | Notes |
| --- | --- | --- | --- | --- |
| 1 | ECS | 4-core 16 GB or higher | 1 | CentOS or AliyunOS images are supported. |
| 2 | JDK | 17 | | |
| 3 | Runtime package | | | Download link |

1.2. Network preparation

  • The VPC of the ECS instance must have Internet access.
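
To verify connectivity before you run the migration tool, you can send a test request from the ECS instance. The following check is a minimal sketch that assumes curl is installed; the Azure Resource Manager endpoint shown is for the public cloud, so substitute the endpoint that matches your AzureCloud setting. Any HTTP response indicates that the endpoint is reachable.

# Check outbound access to Azure and to DataWorks from the ECS instance.
curl -sI https://management.azure.com | head -n 1
curl -sI https://dataworks.aliyuncs.com | head -n 1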

1.3. Account and permission preparation

1.3.1. Use the service registration authentication method

  • Register an application.

  • Obtain a client secret.

1.3.2. Configure Azure Data Factory permissions

1.3.3. Configure blob permissions

Optional. To read file content from a blob in a node, you must grant permissions to the blob in the same way that you grant permissions for ADF service registration.

1.3.4. Configure Databricks permissions

Optional. This step is required to read external script files for Databricks nodes in ADF.

  1. Go to the Databricks workbench.

  2. Click User Settings.

  3. Go to the Developer page on the left.

  4. Click Generate new token.
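
(Optional) After you generate the token, you can confirm that it works by calling any Databricks REST API with it. The following check is a minimal sketch; the workspace URL and token placeholders correspond to the dbr_endpoint and dbr_token parameters described in section 2.2 and must be replaced with your own values.

# Any successful response confirms that the token is valid for the workspace.
curl -s -H "Authorization: Bearer <dbr_token>" \
  "https://adb-xxxxx.16.azuredatabricks.net/api/2.0/clusters/list"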

2. Export tasks from Azure Data Factory

You can export tasks using a software development kit (SDK). The dependencies are as follows:

<dependency>
  <groupId>com.azure.resourcemanager</groupId>
  <artifactId>azure-resourcemanager-datafactory</artifactId>
</dependency>
<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-identity</artifactId>
</dependency>
<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-storage-blob</artifactId>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-sdk-java</artifactId>
</dependency>

2.1. Configuration file

{
  "schedule_datasource": {
    "name": "name",
    "type": "adf",
    "properties": {
      "isUseProxy": false,
      "proxyHost": "proxy.msl.cn",
      "proxyPort": "8080",
      "AzureCloud": "AZURE_CHINA_CLOUD",
      "apiMode": "sdk"
      "endpoint": "https://management.azure.com",
      "subscriptionId": "xxx",
      "factory": "bigdata-adf-jiman",
      "project": "bigdata-adf-jiman",
      "resourceGroupName": "biadata",
      "tenantId": "xxx",
      "clientId": "xxx",
      "clientSecretValue": "xxxx",
      "dbr_endpoint": "https://adb-xxxxx.16.azuredatabricks.net",
      "dbr_token": "xxxxx",
      "pipelineNameWhite": "dbr-demo"
    },
    "operaterType": "AUTO"
  }
}

2.2. Parameter description

| No. | Parameter | Required | Example | Notes |
| --- | --- | --- | --- | --- |
| 1 | isUseProxy | No | false | Specifies whether to use a proxy to access Azure. |
| 2 | proxyHost | No | proxy.msl.cn | Required when you use a proxy to access Azure. |
| 3 | proxyPort | No | 8080 | Required when you use a proxy to access Azure. |
| 4 | AzureCloud | No | AZURE_CHINA_CLOUD | Valid values: AZURE_PUBLIC_CLOUD, AZURE_CHINA_CLOUD, and AZURE_US_GOVERNMENT_CLOUD. The values correspond to different regions. Default value: AZURE_CHINA_CLOUD. |
| 5 | endpoint | No | https://management.azure.com | This parameter is associated with AzureCloud and is optional. |
| 6 | subscriptionId | Yes | xxxxx | The subscription ID. You can obtain it from the overview page of the ADF homepage. |
| 7 | resourceGroupName | Yes | xxxxx | The resource group name. You can obtain it from the overview page of the ADF homepage. |
| 8 | factory | Yes | bigdata-adf-jiman | Mainly used as an identifier. |
| 9 | project | No | bigdata-adf-jiman | Keep this consistent with the factory parameter. |
| 10 | tenantId | Yes | xxxxx | You can obtain this during the service registration step. |
| 11 | clientId | Yes | xxxxx | You can obtain this during the service registration step. |
| 12 | clientSecretValue | Yes | xxxxx | You can obtain this during the service registration step. |
| 13 | dbr_endpoint | No | xxxxx | Obtain this from the Databricks homepage. |
| 14 | dbr_token | No | xxxxx | Obtain this from the Databricks homepage. |
| 15 | pipelineNameWhite | No | dbr-demo | A pipeline name whitelist. To specify multiple pipeline names, separate them with commas (,). |

2.3. Run the export command

mkdir result
sh ./bin/run.sh read \
-c ./conf/<your_config_file>.json \
-o ./data/1_ReaderOutput/<source_export_package>.zip \
-t adf-reader

Command parameter descriptions

| No. | Parameter | Required | Example | Notes |
| --- | --- | --- | --- | --- |
| 1 | -c | Yes | ./conf/<your_config_file>.json | The path of the configuration file. |
| 2 | -o | Yes | ./data/1_ReaderOutput/<source_export_package>.zip | The output path. |
| 3 | -t | Yes | adf-reader | The plugin type. This is a static field. |

Example of a complete command

mkdir result
sh ./bin/run.sh read -c ./conf/read.json -f ./result/temp.zip -o ./result/read_out.zip  -t adf-reader

2.4. View the export results

Open the generated package ReaderOutput.zip in the ./data/1_ReaderOutput/ directory to preview the export results. The statistical report provides a summary of basic information about workflows, nodes, resources, functions, and data sources. The data/project folder contains the standardized data structure for the scheduling information.
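
For example, you can list the package contents from the command line before opening the report. This is a minimal sketch that assumes the output path and package name used in this topic.

# Confirm that the statistical report and the data/project folder were generated.
unzip -l ./data/1_ReaderOutput/ReaderOutput.zip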

The statistical report provides two special features:

1. You can change some properties of workflows and nodes in the report. The editable fields are identified by blue font. When you import the tasks into DataWorks in the next stage, the tool applies these property changes.

2. You can skip certain workflows during the import to DataWorks by deleting their corresponding rows in the workflow child table. This acts as a workflow blacklist. Note: If workflows have dependencies on each other, they must be imported in the same batch. Do not use the blacklist to separate them, because this will cause an error.

For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.

3. Convert scheduling tasks

3.1. Configuration file

{
  "conf": {
    "locale": "zh_CN"
  },
  "self": {
    "if.use.default.convert": false,
    "if.use.migrationx.before": false,
    "if.use.dataworks.newidea": true,
    "filter.rule": []
  },
  "schedule_datasource": {
    "name": "adf",
    "type": "Adf"
  },
  "target_schedule_datasource": {
    "name": "name",
    "type": "DataWorks"
  }
}

3.2. Parameter description

The default settings are sufficient.

| No. | Parameter | Required | Example | Notes |
| --- | --- | --- | --- | --- |
| 1 | filter.rule | No | [{"type": "black", "element": "node", "field": "name", "value": "DataXNode"}] | The filter rule. This is optional. See the field descriptions below. |

In a filter rule:

  • type specifies a blacklist or whitelist. Valid values: BLACK and WHITE.

  • element specifies the element type to filter. Valid values: NODE and WORKFLOW.

  • field specifies whether to filter by name or ID. Valid values: ID and NAME.

  • value specifies the value. To specify multiple values, separate them with commas (,).

schedule_datasource.name identifies the data source, which corresponds to the root directory of all workflows.

3.3. Built-in node conversion and mapping logic

3.3.1. General node mapping logic

| ADF node type | DataWorks node type | Notes |
| --- | --- | --- |
| Copy | DI | Currently, only the original JSON script is available. The specific content of the Data Integration (DI) node depends on the data source type. The conversion requires information about all DI-related data source types. |
| Delete | DIDE_SHELL | A shell script that concatenates the file paths for deletion. |
| SqlServerStoredProcedure | ODPS_SQL | The default is an ODPS SQL node. If the logic determines that the source is SQL Server, a SQL Server node is used instead. |
| DatabricksSparkJar | ODPS_SPARK | The ODPS Spark node is converted using existing parameters. Some information may be missing. |
| DatabricksSparkPython | ODPS_SPARK | The ODPS Spark node is converted using existing parameters. Some information may be missing. |
| DatabricksNotebook | NOTEBOOK | Scala-related cells are not processed. |
| SynapseNotebook | NOTEBOOK | Scala-related cells are not processed. |
| SparkJob | ODPS_SPARK | Temporarily converted to a virtual node. |
| AppendVariable | CONTROLLER_ASSIGNMENT | Assignment node. |
| ExecutePipeline | SUB_PROCESS | |
| Script | ODPS_SQL | |
| Wait | DIDE_SHELL | |
| WebActivity | DIDE_SHELL | |
| IfCondition | CONTROLLER_BRANCH | Converted to a branch, a subprocess, and a merge node. |
| ForEach | CONTROLLER_TRAVERSE | Foreach node. |
| Switch | CONTROLLER_BRANCH | Converted to a branch, a subprocess, and a merge node. |
| Until | CONTROLLER_CYCLE | Do-while node. |
| Lookup | CONTROLLER_ASSIGNMENT | |
| Filter | CONTROLLER_ASSIGNMENT | |
| GetMeta | CONTROLLER_ASSIGNMENT | |
| SetVariable | CONTROLLER_ASSIGNMENT | |
| HDInsightHive | ODPS_SQL | |
| HDInsightSpark | ODPS_SPARK | |
| HDInsightMapReduce | ODPS_MR | |
| Other | VIRTUAL | |

Note: Resource files must be uploaded manually.

3.3.2. Logical node mapping logic

  • ifcondition

  • foreach

  • switch

3.3.3. Run the conversion command

sh ./bin/run.sh convert \
-c ./conf/<your_config_file>.json \
-f ./data/1_ReaderOutput/<source_export_package>.zip \
-o ./data/2_ConverterOutput/<conversion_output_package>.zip \
-t adf-dw-converter

| No. | Parameter | Required | Example | Notes |
| --- | --- | --- | --- | --- |
| 1 | -c | Yes | ./conf/<your_config_file>.json | The path of the configuration file. |
| 2 | -f | Yes | ./data/1_ReaderOutput/<source_export_package>.zip | The exported package from the source. |
| 3 | -o | Yes | ./result/convert_out.zip | The path for the output package. |
| 4 | -t | Yes | adf-dw-converter | The plugin type. This is a static field. |

3.3.4. View the conversion results

Open the generated package ConverterOutput.zip in the ./data/2_ConverterOutput/ directory to preview the conversion results.

The statistical report provides a summary of basic information about the converted workflows, nodes, resources, functions, and data sources.

The data/project folder contains the main converted scheduling migration package.

The statistical report provides two special features:

1. You can change some properties of workflows and nodes in the report. The editable fields are identified by blue font. When you import the tasks into DataWorks in the next stage, the tool applies these property changes.

2. You can skip certain workflows during the import to DataWorks by deleting their corresponding rows in the workflow child table. This acts as a workflow blacklist. Note: If workflows have dependencies on each other, they must be imported in the same batch. Do not use the blacklist to separate them, because this will cause an error.

For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.

4. Import tasks to DataWorks

The LHM migration tool converts scheduling elements from the source to the DataWorks scheduling format. The tool provides a unified entry point for uploading that lets you import workflows into DataWorks for different migration scenarios.

The import tool supports multiple write operations and automatically creates or updates workflows in overwrite mode.

1. Prerequisites

1.1. Successful conversion

The conversion tool must have run successfully, converting the source scheduling information into the DataWorks scheduling format and generating the ConverterOutput.zip package.

(Optional, recommended) Open the conversion output package and view the statistical report to verify that the migration scope was converted successfully.

1.2. DataWorks configuration

In DataWorks, perform the following actions:

1. Create a workspace.

2. Create an AccessKey pair (AccessKey ID and AccessKey secret) and ensure it has administrator permissions for the workspace. (We strongly recommend creating an AccessKey pair that is bound to your account for easier troubleshooting.)

3. In the workspace, create data sources, attach computing resources, and create resource groups.

4. In the workspace, upload resource files and create user-defined functions (UDFs).

1.3. Network connectivity check

Verify the connection to the DataWorks endpoint.

For a list of service endpoints, see:

Service endpoints

ping dataworks.aliyuncs.com
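
If ICMP is blocked in your environment, you can check HTTPS reachability instead. The following check is a minimal sketch; replace the endpoint with the one for your region (the example matches the endpoint used in the configuration file below).

# Any HTTP response confirms that the DataWorks endpoint is reachable over port 443.
curl -sI https://dataworks.cn-hangzhou.aliyuncs.com | head -n 1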

2. Import configuration items

Create a JSON import configuration file, such as writer.json, in the project's conf folder.

  • Before you use the file, delete the comments from the JSON.

{
  "schedule_datasource": {
    "name": "YourDataWorks", // Give your DataWorks data source a name.
    "type": "DataWorks",
    "properties": {
      "endpoint": "dataworks.cn-hangzhou.aliyuncs.com", // Service endpoint
      "project_id": "YourProjectId", // Workspace ID
      "project_name": "YourProject", // Workspace name
      "ak": "************", // AccessKey ID
      "sk": "************" // AccessKey secret
    },
    "operaterType": "MANUAL"
  },
  "conf": {
    "di.resource.group.identifier": "Serverless_res_group_***_***", // Scheduling resource group
    "resource.group.identifier": "Serverless_res_group_***_***", // Data Integration resource group
    "dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls", // Path of the DataWorks node type table
    "qps.limit": 5 // The upper limit for queries per second (QPS) of API requests sent to DataWorks
  }
}

2.1. Service endpoint

Select a service endpoint based on the region where your DataWorks workspace is located. For more information, see:

Service endpoints

2.2. Workspace ID and name

Open the DataWorks console and go to the workspace details page. You can obtain the workspace ID and name from the basic information section on the right.

2.3. Create and authorize an AccessKey pair

On the user page, create an AccessKey pair that has administrator read and write permissions for the target DataWorks workspace.

Permission management involves two locations. If you are using a Resource Access Management (RAM) user, you must first grant the RAM user permissions to operate DataWorks.

Policy page: https://ram.console.aliyun.com/policies

Then, in the DataWorks workspace, assign workspace permissions to the account.

Note: You can set a network access control policy for the AccessKey. Make sure that the IP address of the machine where the migration tool is located has access.

2.4. Resource group

From the menu bar on the left of the DataWorks workspace details page, go to the resource group page. Attach a resource group and obtain its ID.

A general-purpose resource group can be used for node scheduling and data integration. You can set the scheduling resource group (resource.group.identifier) and the data integration resource group (di.resource.group.identifier) to the same general-purpose resource group.

2.5. QPS settings

The tool imports data by calling DataWorks APIs. Different DataWorks editions have different queries per second (QPS) and daily call limits for read and write OpenAPI operations. For more information, see Limits.

For DataWorks Basic Edition, Standard Edition, and Professional Edition, we recommend setting "qps.limit" to 5. For Enterprise Edition, we recommend setting "qps.limit" to 20.

Note: Avoid running multiple import tools at the same time.

2.6. DataWorks node type ID settings

In DataWorks, some node types are assigned different TypeIds in different regions. The specific TypeId is determined by the Data Development interface in DataWorks. This characteristic mainly applies to database nodes. For more information, see Database nodes.

For example, the NodeTypeId for a MySQL node is 1000039 in the Hangzhou region and 1000041 in the Shenzhen region.

To adapt to these regional differences in DataWorks, the tool provides a configurable method to set the node TypeId table used by the tool.

The table is imported using a configuration item in the import tool:

"conf": {
    "dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls" // Path of the DataWorks node type table
 }

To obtain a node type ID from the DataWorks Data Development interface, create a new workflow, add a new node to the workflow, save it, and then view the workflow's Spec.

If the node type is configured incorrectly, an error is reported when the workflow is published.

3. Run the DataWorks import tool

The import tool is called from the command line. The command is as follows:

sh ./bin/run.sh write \
-c ./conf/<your_config_file>.json \
-f ./data/2_ConverterOutput/<conversion_output_package>.zip \
-o ./data/4_WriterOutput/<import_result_package>.zip \
-t dw-newide-writer

In the command, -c is the path of the configuration file, -f is the storage path of the ConverterOutput package, -o is the storage path of the WriterOutput package, and -t is the name of the submission plugin.

For example, to import Project A into DataWorks:

sh ./bin/run.sh write \
-c ./conf/projectA_write.json \
-f ./data/2_ConverterOutput/projectA_ConverterOutput.zip \
-o ./data/4_WriterOutput/projectA_WriterOutput.zip \
-t dw-newide-writer

The import tool prints process information during runtime. Check for any errors during the process. After the import is complete, statistics on successful and failed imports are printed in the command line. Note that the failure of some nodes to import does not stop the overall import process. If a small number of nodes fail to import, you can modify them manually in DataWorks.

4. View the import results

After the import is complete, you can view the results in DataWorks. You can also monitor the workflow import process. If you find a problem and need to stop the import, run the jps command to find BwmClientApp and then run kill -9 to terminate the process.
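
For example, the following minimal sketch stops a running import; the process name BwmClientApp comes from the tool itself, and the process ID placeholder must be replaced with the value that jps prints.

# Find the import process and terminate it.
jps | grep BwmClientApp
kill -9 <pid_of_BwmClientApp>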

5. Q&A

5.1. The source is under continuous development. How are these increments and changes submitted to DataWorks?

The migration tool operates in overwrite mode. You can rerun the export, conversion, and import processes to submit incremental changes from the source to DataWorks. Note that the tool matches workflows by their full path to decide whether to create or update them. To ensure changes are migrated correctly, do not move the workflows.
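
For example, an incremental rerun repeats the same three commands used for the initial migration. The following sketch reuses the example paths from earlier in this topic; the conversion configuration file name convert.json is an assumed example.

# Re-export, re-convert, and re-import. Existing workflows are updated in overwrite mode.
sh ./bin/run.sh read -c ./conf/read.json -o ./data/1_ReaderOutput/ReaderOutput.zip -t adf-reader
sh ./bin/run.sh convert -c ./conf/convert.json -f ./data/1_ReaderOutput/ReaderOutput.zip -o ./data/2_ConverterOutput/ConverterOutput.zip -t adf-dw-converter
sh ./bin/run.sh write -c ./conf/writer.json -f ./data/2_ConverterOutput/ConverterOutput.zip -o ./data/4_WriterOutput/WriterOutput.zip -t dw-newide-writer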

5.2. The source is under continuous development, and workflows in DataWorks are being transformed and governed. Will incremental migration overwrite the changes in DataWorks?

Yes, it will. The migration tool operates in overwrite mode. We recommend that you perform subsequent transformations in DataWorks only after the migration is complete. Alternatively, you can use a phased migration approach. After a batch of workflows is migrated and you confirm they will not be overwritten again, you can start transforming them in DataWorks. Different batches do not affect each other.

5.3. The entire package takes too long to import. Can I import only a part of it?

Yes, you can. You can manually trim the package to perform a partial import. In the data/project/workflow folder, keep the workflows you want to import and delete the others. Then, recompress the folder into a package and run the import tool. Note that workflows with mutual dependencies must be imported together. Otherwise, the node lineage between them will be lost.
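
For example, the trimming can be done with standard zip tools. The following is a minimal sketch: the workflow folder name is a placeholder, and it assumes that the repacked package keeps the same internal layout as the original conversion output.

# Unpack the conversion output, remove the workflows you do not want to import, and repack it.
unzip ./data/2_ConverterOutput/ConverterOutput.zip -d ./ConverterOutput_partial
rm -rf ./ConverterOutput_partial/data/project/workflow/<workflow_to_skip>
cd ./ConverterOutput_partial && zip -r ../ConverterOutput_partial.zip . && cd ..
# Import the trimmed package.
sh ./bin/run.sh write -c ./conf/writer.json -f ./ConverterOutput_partial.zip -o ./data/4_WriterOutput/WriterOutput_partial.zip -t dw-newide-writer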