Migrate DolphinScheduler Jobs to DataWorks via LHM - Migration Hub

This topic describes how to migrate DolphinScheduler scheduling workflows to DataWorks using the LHM scheduling migration tool. The process involves exporting tasks from DolphinScheduler, converting them, and then importing them into DataWorks.

1. Export DolphinScheduler scheduling workflows

The export tool retrieves project, workflow definition, data source definition, and resource file information by calling the DolphinScheduler API. The tool supports all versions of DolphinScheduler, including 1.x, 2.x, and 3.x. To export your workflows, follow these steps.

1. Prerequisites

Prepare a JDK 17 runtime environment. Ensure network connectivity between the runtime environment and DolphinScheduler. Download the scheduling migration tool and decompress it locally.

To test the network connectivity, call the DolphinScheduler ListProject API and verify that it returns information and that the returned list contains the project you want to migrate. For information about how to obtain a token, see section 2.2, Get the token.

# DolphinScheduler 1.x
curl -H "token:<YourToken>" -X GET http://<YourIp>:12345/dolphinscheduler/projects/query-project-list

# DolphinScheduler 2.x
curl -H "token:<YourToken>" -X GET http://<YourIp>:12345/dolphinscheduler/projects/list

# DolphinScheduler 3.x
curl -H "token:<YourToken>" -X GET http://<YourIp>:12345/dolphinscheduler/projects/list
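The same check can be scripted. The following Python sketch is an illustration only and is not part of the migration tool. The helper names `project_list_url` and `fetch_projects` are hypothetical, and the API paths are copied from the curl commands above. If your DolphinScheduler API module is customized, the paths may differ.

```python
import json
import urllib.request

# Project-list API path per DolphinScheduler major version,
# copied from the curl commands in this section.
PROJECT_LIST_PATHS = {
    1: "/dolphinscheduler/projects/query-project-list",
    2: "/dolphinscheduler/projects/list",
    3: "/dolphinscheduler/projects/list",
}


def project_list_url(endpoint: str, major_version: int) -> str:
    """Build the project-list URL for an endpoint such as http://host:12345."""
    return endpoint.rstrip("/") + PROJECT_LIST_PATHS[major_version]


def fetch_projects(endpoint: str, token: str, major_version: int, timeout: float = 10.0):
    """Call the project-list API with the token header and return the decoded JSON."""
    request = urllib.request.Request(
        project_list_url(endpoint, major_version),
        headers={"token": token},
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The sketch makes no assumption about the layout of the response. Inspect the returned JSON to confirm that it lists the project you want to migrate.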

2. Configure connection information

In the `conf` folder of the project directory, create an export configuration file in JSON format, such as `read.json`.

Delete the comments from the JSON file before use.

{
  "schedule_datasource": {
    "name": "YourDolphin", // Give your DolphinScheduler data source a name.
    "type": "DolphinScheduler", // Data source type (DolphinScheduler)
    "properties": {
      "endpoint": "http://localhost:12345", // Endpoint
      "project": "Comprehensive Test", // Project name
      "token": "***********************" // Token
    },
    "operaterType": "AUTO" // Connection type (AUTO: Automatically get scheduling information through the API)
  },
  "conf": {

  }
}
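Deleting the comments by hand is error-prone because the endpoint value itself contains `//`. The following Python sketch shows one possible way to strip the comments safely. It is not part of the migration tool, `strip_line_comments` is a hypothetical helper, and it assumes that only `//` line comments are used, as in the templates in this topic.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    result = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            result.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            result.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            result.append(ch)
            i += 1
    return "".join(result)


def load_commented_json(path: str):
    """Read a commented template and parse it as strict JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.loads(strip_line_comments(handle.read()))
```

Parsing the result with `json.loads` also confirms that the file is valid JSON before you pass it to the tool. The sketch does not repair other problems, such as trailing commas.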

2.1. Get the endpoint

The endpoint is the API endpoint, which is usually the same as the frontend page address. For example, the endpoint is `http://120.55.X.XXX:12345` in the following figure.

If the DolphinScheduler address is `http://your-company:12345/dolphinscheduler/ui/home`, the endpoint is `http://your-company:12345`.
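As a small illustration of this rule, the following Python sketch keeps only the scheme, host, and port of a page address. The function name `endpoint_from_ui_url` is hypothetical.

```python
from urllib.parse import urlsplit


def endpoint_from_ui_url(ui_url: str) -> str:
    """Keep only the scheme and host:port of a DolphinScheduler page address."""
    parts = urlsplit(ui_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {ui_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


print(endpoint_from_ui_url("http://your-company:12345/dolphinscheduler/ui/home"))
```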

Because DolphinScheduler is an open-source scheduling engine, its API module may be customized. If an API call fails, you can go to the Swagger page and run simple tests to confirm the API attributes.

2.2. Get the token

On the Token Management page in the Security Center, create a token and set a long expiration time.

Note: The user token must have permissions for the project that you want to migrate.

2.3. Get the project

Open the project management page. Copy the name of the project that you want to migrate and enter it as the value for the `project` parameter.

3. Run the scheduling discovery tool

Each time you run the scheduling discovery tool, it generates two packages:

1. The ApiOutput package, which stores the raw information that is output by the DolphinScheduler API.
2. The ReaderOutput package, which the discovery tool parses from the raw information to standardize its data structure.

The ReaderOutput package is the final result of the scheduling export. The ApiOutput package is an intermediate result used only for troubleshooting during the export process.

Run the discovery tool from the command line. The command is as follows:

sh ./bin/run.sh read \
-c ./conf/<your_config_file>.json \
-f ./data/0_OriginalPackage/<api_raw_info_package>.zip \
-o ./data/1_ReaderOutput/<source_discovery_export_package>.zip \
-t <PluginName>

In the command, `-c` specifies the configuration file path, `-f` specifies the storage path for the ApiOutput package, `-o` specifies the storage path for the ReaderOutput package, and `-t` specifies the discovery plugin name.

The export plugins for DolphinScheduler 1.x, 2.x, and 3.x are `dolphinv1-reader`, `dolphinv2-reader`, and `dolphinv3-reader`, respectively.

For example, to export Project A from DolphinScheduler 3.2.0:

sh ./bin/run.sh read \
-c ./conf/projectA_read.json \
-f ./data/0_OriginalPackage/projectA_ApiOutput.zip \
-o ./data/1_ReaderOutput/projectA_ReaderOutput.zip \
-t dolphinv3-reader
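If you script the export, you can derive the plugin name from the DolphinScheduler version. The following Python sketch is an illustration only. `reader_plugin` is a hypothetical helper, and it assumes that the version string starts with the major version number, as in `3.2.0`.

```python
def reader_plugin(version: str) -> str:
    """Map a DolphinScheduler version such as "3.2.0" to its export plugin name."""
    major = version.strip().split(".")[0]
    if major not in {"1", "2", "3"}:
        raise ValueError(f"unsupported DolphinScheduler version: {version!r}")
    return f"dolphinv{major}-reader"


print(reader_plugin("3.2.0"))
```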

4. View the export results

Open the generated `ReaderOutput.zip` package in the `./data/1_ReaderOutput/` directory to preview the export results.

The statistical report summarizes the basic information about workflows, nodes, resources, functions, and data sources in DolphinScheduler.

The `data/project` folder contains the standardized data structure of the DolphinScheduler scheduling information.

Statistical report:

Sheet 1, named 'Overview', displays a summary of the Reader export results. Sheets named 'WORKFLOW', 'WORKFLOWNODE', and so on contain specific information about workflows, nodes, resources, functions, and data sources.

The statistical report provides two special features:

1. You can change some properties of workflows and nodes in the report. The editable fields are displayed in blue font. In the next stage, which is scheduling conversion, the tool retrieves the property changes from the table and applies them during initialization.

2. The report lets you skip workflows during conversion by deleting rows in the workflow sub-table. This is also known as creating a workflow blacklist. Note: If workflows have dependencies on each other, the related workflows must be converted in the same batch. Do not separate them using the blacklist. Separating them will cause an error.

For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.

5. Q&A

5.1. (Batch discovery) Can I discover multiple projects at once?

Yes, you can. You can enter multiple project names for the `project` configuration item. Separate the names with a comma without spaces. Because DolphinScheduler project names can contain spaces, any spaces are treated as part of the name.

Delete the comments from the JSON file before use.

{
  "schedule_datasource": {
    "name": "YourDolphin", // Give your DolphinScheduler data source a name.
    "type": "DolphinScheduler", // Data source type (DolphinScheduler)
    "properties": {
      "endpoint": "http://localhost:12345", // Endpoint
      "project": "Project1,Project2", // Project name
      "token": "***********************" // Token
    },
    "operaterType": "AUTO" // Connection type (AUTO: Automatically get scheduling information through the API)
  },
  "conf": {

  }
}
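The comma rule is easy to get wrong when the list is typed by hand. The following Python sketch builds the `project` value from a list of names and shows what happens when a space follows the comma. Both helpers are hypothetical. `split_projects` only illustrates the rule stated above and is not the tool's actual parser. Rejecting names that contain a comma is an assumption of this sketch.

```python
from typing import List


def join_projects(names: List[str]) -> str:
    """Join project names with bare commas; spaces inside a name are preserved."""
    for name in names:
        if "," in name:
            # Assumption: a name that contains a comma cannot be expressed in the list.
            raise ValueError(f"project name contains a comma: {name!r}")
    return ",".join(names)


def split_projects(value: str) -> List[str]:
    """Split on commas only, so any surrounding space stays part of the name."""
    return value.split(",")


print(join_projects(["Project1", "Comprehensive Test"]))
print(split_projects("Project1, Project2"))
```

In the second call, the space after the comma becomes part of the second name, which would not match a project that is named `Project2`.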

In the run command, the `-f` and `-o` input parameters must be folder paths. The tool automatically creates a separate export package for each project.

sh ./bin/run.sh read \
-c ./conf/<your_config_file>.json \
-f ./data/0_OriginalPackage/ \
-o ./data/1_ReaderOutput/ \
-t <dolphinv1/2/3-reader>

5.2. (Manual mode) What if there is no API?

Some developers remove the DolphinScheduler API module, which prevents you from obtaining scheduling information through an API connection. As an alternative, you can manually create the raw information package in the `./data/0_OriginalPackage/` directory and set `operaterType` to `MANUAL` in the configuration file. The tool then uses the manually created raw package as input to complete the DolphinScheduler discovery.

Delete the comments from the JSON file before use.

{
  "schedule_datasource": {
    "name": "YourDolphin", // Give your DolphinScheduler data source a name.
    "type": "DolphinScheduler", // Data source type (DolphinScheduler)
    "properties": {
      "endpoint": "http://localhost:12345", // Endpoint
      "project": "Comprehensive Test", // Project name
      "token": "***********************" // Token
    },
    "operaterType": "MANUAL" // Connection type (MANUAL: Offline mode)
  },
  "conf": {

  }
}

Example of the raw package structure:

.
├── package_info.json
├── projects.json
├── projects
│   └── Comprehensive Test
│       └── processDefinition
│           └── process_definitions_page_1.json
├── datasource
│   └── datasource_page_1.json
├── resource
│   └── resources.json
└── udfFunction
    └── udf_function_page_1.json

`package_info.json` contains package information, including the DolphinScheduler version.

{
  "version": "3.2.0"
}

The `projects.json` file contains project information. When you create this file manually, focus on filling in the `id`, `userId`, `code`, and `name` fields.

[
  {
    "id": 2,
    "userId": 1,
    "code": 16372996967936,
    "name": "Comprehensive Test",
    "description": "",
    "createTime": "2025-01-20 11:40:39",
    "updateTime": "2025-01-20 11:40:39",
    "perm": 0,
    "defCount": 0,
    "instRunningCount": 0
  }
]

The `projects` folder stores workflow definitions. When you create this folder manually, name its subdirectory after the project. Then, export the workflow definitions from the DolphinScheduler interface, rename them sequentially to `process_definitions_page_*.json`, and place them in the `processDefinition` directory.

`datasource`, `resource`, and `udfFunction` contain data source, resource file, and UDF information, respectively. Because the DolphinScheduler interface does not have an export feature for these elements, you can omit them. To do this, fill the `datasource_page_1.json`, `resources.json`, and `udf_function_page_1.json` files with an empty array `[]`. Omitting these elements has a minor impact on the workflow migration details. This impact involves the mapping of SQL nodes to data sources, the mapping of DataX nodes (in non-custom template mode) to data sources, and the migration of node-to-resource reference relationships. The affected nodes are created in DataWorks as expected, but you must manually configure the binding of these nodes to data sources and resources in DataWorks.
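The folder layout described above can be created with a short script. The following Python sketch is an illustration only. The function names are hypothetical, the layout is copied from the example structure in this section, and the exported workflow definition files must still be added and the folder packaged by you.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_json(path: Path, payload) -> None:
    """Write payload as UTF-8 JSON, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")


def create_raw_package_skeleton(root: str, version: str, projects: List[Dict]) -> Path:
    """Create the folder layout of a manually built raw package."""
    base = Path(root)
    write_json(base / "package_info.json", {"version": version})
    write_json(base / "projects.json", projects)
    for project in projects:
        # Exported workflow definitions go here as process_definitions_page_*.json.
        target = base / "projects" / project["name"] / "processDefinition"
        target.mkdir(parents=True, exist_ok=True)
    # Elements that cannot be exported from the interface are left as empty arrays.
    write_json(base / "datasource" / "datasource_page_1.json", [])
    write_json(base / "resource" / "resources.json", [])
    write_json(base / "udfFunction" / "udf_function_page_1.json", [])
    return base
```

For example, `create_raw_package_skeleton("./raw", "3.2.0", [{"id": 2, "userId": 1, "code": 16372996967936, "name": "Comprehensive Test"}])` creates the structure shown above under a hypothetical `./raw` folder.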

5.3. What if the token is valid but parts of the exported workflow are missing?

First, check if the token has permissions for the project.

In addition, we have found that the APIs of some minor versions of DolphinScheduler 1.x may cause data loss during export. You can use the statistical report in the export results to identify and complete the missing workflows.

2. Convert DolphinScheduler workflows to DataWorks workflows

DolphinScheduler is a popular open-source scheduling engine. DataWorks fully supports the scheduling capabilities of DolphinScheduler. After the migration tool converts the workflows, they can run in DataWorks with the same effect as in DolphinScheduler.

1. Prerequisites

The discovery tool has run successfully, the DolphinScheduler scheduling information has been exported, and the `ReaderOutput.zip` file has been generated.

(Optional, recommended) Open the discovery export package and view the statistical report to check whether the full scope of the migration has been exported.

2. Conversion configuration items

2.1. Conversion configuration template

Delete the comments from the JSON file before use.

{
  "conf": {},
  "self": {
    "if.use.default.convert": false,
    "if.use.migrationx.before": false,
    "if.use.dataworks.newidea": true,
    "owner.map": [ // Owner mapping
      {
        "src": "1", // DolphinScheduler user ID
        "tgt": "202006995118212119" // DataWorks user ID
      }
    ],
    "conf": [
      {
        "nodes": "all", // Scope of the rule group
        "rule": {
          "settings": {
            // Convert DolphinScheduler Shell nodes to DataWorks Shell nodes
            "workflow.converter.shellNodeType": "DIDE_SHELL",
            // Convert unknown nodes to DataWorks virtual nodes by default
            "workflow.converter.target.unknownNodeTypeAs": "VIRTUAL",
            // Convert DolphinScheduler SQL nodes to corresponding DataWorks SQL or database nodes based on the data source type
            "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
              "CLICKHOUSE": "CLICK_SQL",
              "HIVE": "ODPS_SQL",
              "STARROCKS": "StarRocks",
              "DORIS": "HOLOGRES_SQL",
              "MYSQL": "MYSQL",
              "REDSHIFT": "Redshift",
              "SQLSERVER": "SQLSERVER",
              "PRESTO": "EMR_PRESTO",
              "POSTGRESQL": "POSTGRESQL",
              "ORACLE": "Oracle",
              "ATHENA": "MYSQL"
            },
            // Mapping of DolphinScheduler and DataWorks data source names
            "workflow.converter.connection.mapping": {
              "mysqlDb1": "dataworks_mysqlDb1",
              "srDb1": "dataworks_srDb1"
            },
            // Main compute engine attached to DataWorks (EMR, MaxCompute, or Hologres)
            "workflow.converter.target.engine.type": "EMR",
            // Convert DolphinScheduler Spark nodes to DataWorks MaxCompute Spark nodes
            "workflow.converter.sparkSubmitAs": "ODPS_SPARK",
            "workflow.converter.sparkVersion": "3.x"
          }
        }
      }
    ]
  },
  "schedule_datasource": {
    "name": "DsProject",
    "type": "DolphinScheduler"
  },
  "target_schedule_datasource": {}
}

2.2. Owner mapping

DolphinScheduler records the owner of each workflow. The owner is important information for team development. The tool supports mapping DolphinScheduler users to DataWorks users to mark the corresponding owners for workflows and nodes.

Obtain the DolphinScheduler username and ID from the user management page.

You can add users as members to a DataWorks workspace. Obtain the user ID from the upper-right corner.

You can also obtain the ID from the Owner drop-down list on the Data Development page.
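If you need to map many users, you can generate the `owner.map` entries instead of typing them. The following Python sketch is an illustration only. `build_owner_map` is a hypothetical helper, and it writes both IDs as strings because the configuration template does so.

```python
import json
from typing import Dict, List


def build_owner_map(user_ids: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn {DolphinScheduler user ID: DataWorks user ID} into owner.map entries."""
    return [{"src": str(src), "tgt": str(tgt)} for src, tgt in user_ids.items()]


# The IDs below are the sample values from the configuration template.
print(json.dumps({"owner.map": build_owner_map({"1": "202006995118212119"})}, indent=2))
```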

2.3. Node conversion rules

2.3.1. Scope of rules

You can set the scope for node conversion rules. For example, to convert all nodes according to a uniform rule, you can configure `"nodes": "all"` and fill in `settings`. Typically, you only need to configure one `all` rule group.

Delete the comments from the JSON file before use.

{
  "conf": {},
  "self": {
    "conf": [
      {
        "nodes": "all", // The scope of the rule group is ALL. All nodes are converted according to this rule.
        "rule": {
          "settings": {
            // Settings
          }
        }
      }
    ]
  }
}

If some nodes require separate conversion rules, you can specify the scope of the rule by entering the task ID or name in `nodes`. To set a rule for a batch of nodes, separate the IDs or names with commas. We recommend that you use IDs to specify the scope because using names may lead to incorrect settings. You can also use a regular expression to match node names. We also strongly recommend that you set a `normal` rule group to provide a default conversion rule for the remaining nodes.

Delete the comments from the JSON file before use.

{
  "conf": {},
  "self": {
    "conf": [
      {
        "nodes": "node1Name, node2Id", // The scope of the rule group is node1 and node2.
        "rule": {
          "settings": {
            // Settings 1
          }
        }
      },
      {
        "nodes": "node3Name, node4Id", // The scope of the rule group is node3 and node4.
        "rule": {
          "settings": {
            // Settings 2
          }
        }
      },
      {
        "nodes": "regexExpression", // Supports filtering node names with a regular expression.
        "rule": {
          "settings": {
            // Settings 3
          }
        }
      },
      {
        "nodes": "normal", // Conversion rule for other nodes.
        "rule": {
          "settings": {
            // Settings 4
          }
        }
      }
    ]
  }
}
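To reason about which rule group applies to a node, it can help to model the scopes in code. The following Python sketch is a simplified model, not the tool's implementation. The matching order, the trimming of spaces around commas, and the use of a full-match regular expression on node names are all assumptions of this sketch, and `select_rule_group` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional


def select_rule_group(node_id: str, node_name: str, groups: List[Dict]) -> Optional[Dict]:
    """Return the first group whose scope covers the node, else the `normal` group."""
    fallback = None
    for group in groups:
        scope = group["nodes"]
        if scope == "all":
            return group
        if scope == "normal":
            fallback = group
            continue
        # Assumption: entries are comma-separated IDs or names, with spaces trimmed.
        entries = [entry.strip() for entry in scope.split(",")]
        if node_id in entries or node_name in entries:
            return group
        try:
            # Assumption: otherwise the scope is a pattern matched against the node name.
            if re.fullmatch(scope, node_name):
                return group
        except re.error:
            continue
    return fallback
```

In this model, a node that matches no explicit scope falls back to the `normal` group, which is why a `normal` group is recommended whenever you use node-specific rules.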

2.3.2. Conversion rules

DolphinScheduler 1.x, 2.x, and 3.x support different node types. Therefore, the conversion solutions and configuration items vary. The details are as follows.

2.3.2.1. DolphinScheduler 3.x conversion configuration items

The tool currently supports the conversion of the following DolphinScheduler 3.x node types:

SHELL, SQL, PYTHON, DATAX, SQOOP, SEATUNNEL, HIVECLI, SPARK (Java, Python, Sql), MR, PROCEDURE, HTTP, CONDITIONS, SWITCH, DEPENDENT, and SUB_PROCESS.

You can configure DataWorks mapping rules for the following types:

SHELL (workflow.converter.shellNodeType):

We recommend converting them to DIDE_SHELL, EMR_SHELL, or VIRTUAL nodes.

SQL (workflow.converter.dolphinscheduler.sqlNodeTypeMapping):

We recommend converting them to various SQL nodes or database nodes.

PROCEDURE (workflow.converter.dolphinscheduler.sqlNodeTypeMapping):

We recommend converting them to various SQL nodes or database nodes.

PYTHON (workflow.converter.pyNodeType):

We recommend converting them to PYTHON, PYODPS, PYODPS3, or EMR_SHELL nodes.

HIVECLI (workflow.converter.dolphinscheduler.sqlNodeTypeMapping/HIVE):

We recommend converting them to EMR_HIVE or ODPS_SQL nodes.

SPARK (workflow.converter.sparkSubmitAs):

We recommend converting SparkJava and SparkPython nodes to ODPS_SPARK or EMR_SPARK nodes.

We recommend converting SparkSql nodes to ODPS_SQL or EMR_SPARK_SQL nodes.

MR (workflow.converter.mrNodeType):

We recommend converting them to ODPS_MR or EMR_MR nodes.

For more information about DataWorks node types, see the following enumeration class:

https://github.com/aliyun/dataworks-spec/blob/b0f4a4fd769215d5f81c0bbe990addd7498df5f4/spec/src/main/java/com/aliyun/dataworks/common/spec/domain/dw/types/CodeProgramType.java#L180

Node types with fixed conversion rules:

DATAX: Converted to DI nodes. Both custom template mode (JSON Script mode) and regular mode (frontend entry mode) are supported.

The following data source reader plugin configuration conversions are supported: MYSQL to mysql, POSTGRESQL to postgresql, ORACLE to oracle, SQLSERVER to sqlserver, ODPS to odps, OSS to oss, HIVE to hdfs, HDFS to hdfs, CLICKHOUSE to clickhouse, and MONGODB to mongodb.

The following data source writer plugin configuration conversions are supported: MYSQL to mysql, POSTGRESQL to postgresql, ORACLE to oracle, SQLSERVER to sqlserver, ODPS to odps, OSS to oss, HIVE to hdfs, HDFS to hdfs, CLICKHOUSE to clickhouse, and MONGODB to mongodb.

SQOOP: Converted to DI nodes.

The following data source reader plugin configuration conversions are supported: Mysql to mysql, Hive to hive, and HDFS to hdfs.

The following data source writer plugin configuration conversions are supported: Mysql to mysql, Hive to hive, and HDFS to hdfs.

SEATUNNEL: Converted to DI nodes.

Script conversion is not yet supported. Only nodes and scheduling information are converted.

HTTP: Converted to DIDE_SHELL (general Shell) nodes. The migration tool automatically concatenates the request parameters into a curl command.
SWITCH: Converted to CONTROLLER_BRANCH (branch) nodes. The functionality is the same before and after the migration.
SUB_PROCESS: Converted to SUB_PROCESS nodes. The functionality is the same before and after the migration. Note: When you import the workflow to DataWorks, the migration tool enables the 'Can be referenced' switch for the referenced workflow. The referenced workflow can be started only by a SUB_PROCESS call and cannot be scheduled to start on its own.

DEPENDENT: Converted to VIRTUAL nodes. The dependency is converted to a node lineage dependency. For example, if a Dependent node depends on Workflow A, the dependency is converted to a lineage from the tail node of Workflow A to the Dependent node. If a Dependent node depends on Node A, the dependency is converted to a lineage from Node A to the Dependent node. The following figure provides an illustration.

CONDITIONS: The node contains two layers of logic, which are implemented using two-layer CONTROLLER_JOIN (merge) nodes. In the case shown in the figure below, the CONDITIONS node has two upstream nodes (A and B) and two downstream nodes (C and D). The logical expression is `((!A&B)|(A&!B)|(!A&!B))`. If the expression is true, the flow proceeds to C. If the expression is false, the flow proceeds to D. At the upper layer, three merge nodes are generated to calculate the results of `!A&B`, `A&!B`, and `!A&!B`. At the lower layer, two nodes are generated. One node triggers the downstream node C to execute when `((!A&B)|(A&!B)|(!A&!B))==true`. The other node triggers the downstream node D to execute when `(!(!A&B)&!(A&!B)&!(!A&!B))==true`. This process replicates the effect of the CONDITIONS node.
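The two-layer logic in this example can be checked with a truth table. The following Python sketch models the three upper-layer merge nodes and the two lower-layer nodes as Boolean functions. The function names are hypothetical, and the sketch only verifies the logic of the expressions above. It does not reproduce how DataWorks evaluates merge nodes.

```python
from itertools import product


def upper_layer(a: bool, b: bool):
    """The three upper-layer merge nodes: !A&B, A&!B, and !A&!B."""
    return (not a and b, a and not b, not a and not b)


def to_c(a: bool, b: bool) -> bool:
    """Lower-layer node that triggers C: the OR of the upper-layer results."""
    return any(upper_layer(a, b))


def to_d(a: bool, b: bool) -> bool:
    """Lower-layer node that triggers D: every upper-layer result is false."""
    return all(not term for term in upper_layer(a, b))


for a, b in product([False, True], repeat=2):
    # Exactly one downstream branch fires for every combination of A and B.
    assert to_c(a, b) != to_d(a, b)
    print(f"A={a!s:5} B={b!s:5} -> {'C' if to_c(a, b) else 'D'}")
```

For this particular expression, the flow proceeds to D only when both A and B are true, and to C in the other three cases.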

2.3.2.2. DolphinScheduler 2.x conversion configuration items

The tool currently supports the conversion of the following DolphinScheduler 2.x node types:

SHELL, SQL, PYTHON, DATAX, SQOOP, HIVECLI, SPARK (Java, Python, Sql), MR, PROCEDURE, HTTP, CONDITIONS, SWITCH, DEPENDENT, and SUB_PROCESS.

Compared with version 2.x, DolphinScheduler 3.x adds only the SEATUNNEL node type. The conversion solutions and configuration items for the other nodes are the same as those for DolphinScheduler 3.x. For more information, see the previous section.

2.3.2.3. DolphinScheduler 1.x conversion configuration items

The tool currently supports the conversion of the following DolphinScheduler 1.x node types:

SHELL, SQL, PYTHON, DATAX, SQOOP, SPARK (Java, Python, Sql), MR, CONDITIONS, DEPENDENT, and SUB_PROCESS.

The conversion solutions and configuration items for these nodes are the same as those for DolphinScheduler 3.x. For more information, see section 2.3.2.1, DolphinScheduler 3.x conversion configuration items.

3. Run the scheduling conversion tool

Run the conversion tool from the command line. The command is as follows:

sh ./bin/run.sh convert \
-c ./conf/<your_config_file>.json \
-f ./data/1_ReaderOutput/<source_discovery_export_package>.zip \
-o ./data/2_ConverterOutput/<conversion_result_output_package>.zip \
-t <PluginName>

In the command, `-c` specifies the configuration file path, `-f` specifies the storage path for the ReaderOutput package, `-o` specifies the storage path for the ConverterOutput package, and `-t` specifies the conversion plugin name. The conversion plugins for DolphinScheduler 1.x, 2.x, and 3.x are `dolphinv1-dw-converter`, `dolphinv2-dw-converter`, and `dolphinv3-dw-converter`, respectively.

For example, to convert DolphinScheduler 3.x Project A:

sh ./bin/run.sh convert \
-c ./conf/projectA_convert.json \
-f ./data/1_ReaderOutput/projectA_ReaderOutput.zip \
-o ./data/2_ConverterOutput/projectA_ConverterOutput.zip \
-t dolphinv3-dw-converter

The conversion tool prints process information during operation. Check for any errors during the process. After the conversion is complete, statistics on successful and failed conversions are printed on the command line. Note that the failure of some node conversions does not affect the overall conversion process. If a few nodes fail to convert, you can manually modify them after you migrate them to DataWorks.

4. View the conversion results

Open the generated `ConverterOutput.zip` package in the `./data/2_ConverterOutput/` directory to preview the conversion results.

The statistical report summarizes the basic information about the converted workflows, nodes, resources, functions, and data sources.

The `data/project` folder is the converted scheduling migration package.

The statistical report provides two special features:

1. You can change some properties of workflows and nodes in the report. The editable fields are displayed in blue font. In the next stage, which is importing to DataWorks, the tool retrieves the property changes from the table and applies them.

2. The report lets you skip workflows when importing to DataWorks by deleting rows in the workflow sub-table (workflow blacklist). Note: If workflows have dependencies on each other, the related workflows must be imported in the same batch. Do not separate them using the blacklist. Separating them will cause an error.

For more information, see Use the overview report in scheduling migration to add or modify scheduling properties.

3. Import to DataWorks

The heterogeneous conversion feature of the LHM migration tool transforms the source scheduling elements into the DataWorks scheduling format. The tool provides a unified upload entry for different migration scenarios to import workflows into DataWorks.

The import tool supports multiple rounds of writing. It automatically chooses whether to create or update workflows (overwrite mode).

1. Prerequisites

1.1. Successful conversion

The conversion tool has run successfully, the source scheduling information has been converted to DataWorks scheduling information, and the `ConverterOutput.zip` file has been generated.

(Optional, recommended) Open the conversion output package and view the statistical report to check whether the full scope of the migration has been successfully converted.

1.2. DataWorks configuration

Perform the following actions in DataWorks:

1. Create a workspace.

2. Create an AccessKey pair and ensure that it has administrator permissions for the workspace. We strongly recommend that you create an AccessKey pair that is bound to your account to help with troubleshooting if writing issues occur.

3. In the workspace, create data sources, attach computing resources, and create resource groups.

4. In the workspace, upload resource files and create UDFs.

1.3. Network connectivity check

Verify that you can connect to the DataWorks endpoint.

For the list of service endpoints, see Service endpoints.

ping dataworks.aliyuncs.com

2. Import configuration items

In the `conf` folder of the project directory, create an import configuration file in JSON format, such as `writer.json`.

Delete the comments from the JSON file before use.

{
  "schedule_datasource": {
    "name": "YourDataWorks", // Give your DataWorks data source a name.
    "type": "DataWorks",
    "properties": {
      "endpoint": "dataworks.cn-hangzhou.aliyuncs.com", // Service endpoint
      "project_id": "YourProjectId", // Workspace ID
      "project_name": "YourProject", // Workspace name
      "ak": "************", // AccessKey ID
      "sk": "************" // AccessKey secret
    },
    "operaterType": "MANUAL"
  },
  "conf": {
    "di.resource.group.identifier": "Serverless_res_group_***_***", // Data integration resource group
    "resource.group.identifier": "Serverless_res_group_***_***", // Scheduling resource group
    "dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls", // Path to the DataWorks node type table
    "qps.limit": 5 // QPS limit for sending API requests to DataWorks
  }
}

2.1. Service endpoint

Select a service endpoint based on the region where your DataWorks workspace is located. For more information, see Service endpoints.

2.2. Workspace ID and name

Open the DataWorks console. Go to the workspace product page. Obtain the workspace ID and name from the basic information on the right.

2.3. Create and grant permissions to an AccessKey pair

On the user page, create an AccessKey pair that has administrator read and write permissions for the target DataWorks workspace.

Permission management involves two locations. If the account is a Resource Access Management (RAM) user, you must first grant the RAM user permissions to perform DataWorks operations.

Access policy page: https://ram.console.alibabacloud.com/policies

Then, in the DataWorks workspace, assign workspace permissions to the account.

Note: You can set a network access control policy for an AccessKey pair. Make sure that the policy allows access from the IP address of the machine where the migration tool runs.

2.4. Resource groups

In the navigation pane on the left of the DataWorks workspace product page, go to the resource group page. Attach a resource group and obtain its ID.

A general-purpose resource group can be used for node scheduling and data integration. You can set both the scheduling resource group (`resource.group.identifier`) and the data integration resource group (`di.resource.group.identifier`) to the same general-purpose resource group in the configuration.

2.5. QPS settings

The tool imports data by calling DataWorks APIs. Different DataWorks editions have different queries-per-second (QPS) limits and daily call limits for read and write OpenAPI calls. For more information, see Limits.

For DataWorks Basic Edition, Standard Edition, and Professional Edition, we recommend setting `"qps.limit": 5`. For Enterprise Edition, we recommend setting `"qps.limit": 20`.

Note: Avoid running multiple import tools at the same time.
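The resource group and QPS recommendations above can be combined when you generate the `conf` block. The following Python sketch is an illustration only. `build_import_conf` and the lowercase edition keys are hypothetical, and the QPS values are the ones recommended in this section.

```python
from typing import Dict

# QPS values recommended in this topic, keyed by DataWorks edition.
RECOMMENDED_QPS = {"basic": 5, "standard": 5, "professional": 5, "enterprise": 20}


def build_import_conf(resource_group: str, edition: str, node_type_xls: str) -> Dict:
    """Build the `conf` block, reusing one general-purpose resource group for both roles."""
    return {
        # Data integration resource group.
        "di.resource.group.identifier": resource_group,
        # Scheduling resource group.
        "resource.group.identifier": resource_group,
        # Path to the DataWorks node type table.
        "dataworks.node.type.xls": node_type_xls,
        "qps.limit": RECOMMENDED_QPS[edition.lower()],
    }
```

An unknown edition name raises a `KeyError`, which is deliberate in this sketch so that a typo does not silently produce a wrong limit.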

2.6. DataWorks node type ID settings

In DataWorks, some node types are assigned different TypeIds in different regions. The specific TypeId depends on the DataWorks Data Development interface. This characteristic mainly applies to database nodes. For more information, see Database nodes.

For example, a MySQL node has a NodeTypeId of 1000039 in the Hangzhou region and 1000041 in the Shenzhen region.

To adapt to these differences between DataWorks regions, the tool provides a configurable method for you to set the node TypeId table that the tool uses.

The table is imported using the import tool's configuration items:

"conf": {
    "dataworks.node.type.xls": "/Software/bwm-client/conf/CodeProgramType.xls" // Path to the DataWorks node type table
 }

To obtain the node type ID from the DataWorks Data Development interface, you can create a new workflow in the interface, create a new node in the workflow, and then click Save to view the workflow's spec.

If the node type is configured incorrectly, the following error is reported when the workflow is published.

3. Run the DataWorks import tool

Run the import tool from the command line. The command is as follows:

sh ./bin/run.sh write \
-c ./conf/<your_config_file>.json \
-f ./data/2_ConverterOutput/<conversion_result_output_package>.zip \
-o ./data/4_WriterOutput/<import_result_storage_package>.zip \
-t dw-newide-writer

In the command, `-c` specifies the configuration file path, `-f` specifies the storage path for the ConverterOutput package, `-o` specifies the storage path for the WriterOutput package, and `-t` specifies the submission plugin name.

For example, to import Project A to DataWorks:

sh ./bin/run.sh write \
-c ./conf/projectA_write.json \
-f ./data/2_ConverterOutput/projectA_ConverterOutput.zip \
-o ./data/4_WriterOutput/projectA_WriterOutput.zip \
-t dw-newide-writer
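The export, conversion, and import commands in this topic follow the same pattern, so they can be chained in a script. The following Python sketch is an illustration only. The function names are hypothetical, the file names follow the Project A examples in this topic, and the sketch assumes that `run.sh` exits with a non-zero status when a stage fails, which this topic does not confirm.

```python
import subprocess
from typing import List


def stage_command(stage: str, config: str, source: str, target: str, plugin: str) -> List[str]:
    """Assemble one run.sh invocation as an argument list."""
    return ["sh", "./bin/run.sh", stage, "-c", config, "-f", source, "-o", target, "-t", plugin]


def pipeline_commands(project: str, major: int) -> List[List[str]]:
    """Return the read, convert, and write commands for one project."""
    api_out = f"./data/0_OriginalPackage/{project}_ApiOutput.zip"
    reader_out = f"./data/1_ReaderOutput/{project}_ReaderOutput.zip"
    converter_out = f"./data/2_ConverterOutput/{project}_ConverterOutput.zip"
    writer_out = f"./data/4_WriterOutput/{project}_WriterOutput.zip"
    return [
        stage_command("read", f"./conf/{project}_read.json", api_out, reader_out,
                      f"dolphinv{major}-reader"),
        stage_command("convert", f"./conf/{project}_convert.json", reader_out, converter_out,
                      f"dolphinv{major}-dw-converter"),
        stage_command("write", f"./conf/{project}_write.json", converter_out, writer_out,
                      "dw-newide-writer"),
    ]


def run_pipeline(project: str, major: int, tool_dir: str) -> None:
    """Run the three stages in order from the tool's directory."""
    for command in pipeline_commands(project, major):
        # check=True stops the chain if a stage exits with a non-zero status.
        subprocess.run(command, cwd=tool_dir, check=True)
```

Running the stages back to back skips the review of the statistical reports between stages, so a chained run is better suited to repeat migrations than to the first pass.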

The import tool prints process information during operation. Check for any errors during the process. After the import is complete, statistics on successful and failed imports are printed on the command line. Note that the failure of some node imports does not affect the overall import process. If a few nodes fail to import, you can manually modify them in DataWorks.

4. View the import results

After the import is complete, you can view the results in DataWorks. You can also monitor the workflows as they are imported one by one. If you find a problem and need to stop the import, you can run the `jps` command to find `BwmClientApp` and then run the `kill -9` command to stop the import.
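The `jps` and `kill -9` steps can also be scripted. The following Python sketch is an illustration only. The function names are hypothetical, the sketch assumes the default `jps` output format of one `<pid> <main class>` pair per line, and `signal.SIGKILL` is available only on Unix-like systems.

```python
import os
import signal
import subprocess
from typing import List


def find_pids(jps_output: str, main_class: str = "BwmClientApp") -> List[int]:
    """Extract the PIDs of JVMs whose main class matches, from `jps` output."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == main_class:
            pids.append(int(parts[0]))
    return pids


def stop_import() -> List[int]:
    """Run `jps`, then send SIGKILL (the equivalent of `kill -9`) to each match."""
    output = subprocess.run(["jps"], capture_output=True, text=True, check=True).stdout
    pids = find_pids(output)
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
    return pids
```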

5. Q&A

5.1. The source is under continuous development. How can I submit these increments and changes to DataWorks?

The migration tool runs in overwrite mode. You can rerun the export, conversion, and import processes to submit incremental changes from the source to DataWorks. Note that the tool matches workflows by their full path to decide whether to create or update them. To migrate changes, do not move the workflows.

5.2. The source is under continuous development, and I am also modifying and managing workflows on DataWorks. Will incremental migration overwrite the changes on DataWorks?

Yes, it will. The migration tool runs in overwrite mode. We recommend that you make further modifications in DataWorks after the migration is complete. Alternatively, you can migrate in batches. After you confirm that a batch of migrated workflows will not be overwritten again, you can start modifying them in DataWorks. Different batches do not affect each other.

5.3. The entire package takes too long to import. Can I import only a part of it?

Yes, you can. You can manually crop the package to be imported to perform a partial import. In the `data/project/workflow` folder, keep the workflows that you need to import and delete the others. Recompress the folder into a package and then run the import tool. Note that workflows with mutual dependencies must be imported together. Otherwise, the node lineage between the workflows is lost.
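Cropping the package can be scripted instead of done by hand. The following Python sketch is an illustration only. `crop_package` is a hypothetical helper, and it assumes that the archive entries start with `data/project/workflow/` and that each workflow is stored in a subfolder named after the workflow. Check the layout of your own package before you rely on it.

```python
import zipfile
from typing import Iterable

WORKFLOW_PREFIX = "data/project/workflow/"


def crop_package(source_zip: str, target_zip: str, keep: Iterable[str]) -> int:
    """Copy source_zip to target_zip, dropping workflow entries not listed in keep."""
    keep_set = set(keep)
    copied = 0
    with zipfile.ZipFile(source_zip) as src, zipfile.ZipFile(target_zip, "w") as dst:
        for info in src.infolist():
            name = info.filename
            if name.startswith(WORKFLOW_PREFIX) and name != WORKFLOW_PREFIX:
                # The first path component below the workflow folder identifies the workflow.
                workflow = name[len(WORKFLOW_PREFIX):].split("/", 1)[0]
                if workflow not in keep_set:
                    continue
            # Entries outside the workflow folder are copied unchanged.
            dst.writestr(info, src.read(info))
            copied += 1
    return copied
```

The function returns the number of entries that were copied. Include every workflow of a dependency chain in `keep`, because workflows with mutual dependencies must be imported together.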