
DataWorks: Export tasks from open source engines

Last Updated: Aug 17, 2023

DataWorks allows you to migrate tasks from open source scheduling engines, such as Oozie, Azkaban, Airflow, and DolphinScheduler, to DataWorks. This topic describes how to export tasks from open source scheduling engines.

Background information

Before you import a task of an open source scheduling engine to DataWorks, you must export the task to your on-premises machine or Object Storage Service (OSS). For more information about the import procedure, see Import tasks of open source engines.

Limits

For Airflow, only tasks of Airflow 1.10.x can be exported. In addition, the export of Airflow tasks requires Python 3.6 or later.
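For example, you can run the following commands in the Airflow runtime environment to confirm that these requirements are met. This is a minimal sketch: the python3 command name is an assumption, and you should use the interpreter that runs your Airflow deployment.

    # Minimal sanity check (assumed interpreter name: python3).
    python3 --version                                          # must report Python 3.6 or later
    python3 -c "import airflow; print(airflow.__version__)"    # must report 1.10.x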

Export a task from Oozie

Export requirements

The package must be in the ZIP format and contain XML-formatted definition files and configuration files of a task.

Structure of the package

Oozie task descriptions are stored in a Hadoop Distributed File System (HDFS) directory. For example, in the examples package that is provided on the Apache Oozie official website, each subdirectory under the apps directory is an Oozie task. Each subdirectory contains the XML-formatted definition files and configuration files of the task. The following figure shows the structure of the exported package.

Figure: Directory structure of the exported package
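For reference, the following commands sketch one way to assemble such a package. The HDFS path and the map-reduce application directory are assumptions based on the layout of the Oozie examples package; replace them with the directory that contains your workflow.xml and job.properties files.

    # Copy an Oozie application directory from HDFS to the local machine
    # (paths are placeholders based on the Oozie examples package).
    hdfs dfs -get /user/<user>/examples/apps/map-reduce ./map-reduce

    # Package the directory into a ZIP file that can be imported into DataWorks.
    zip -r oozie_tasks.zip ./map-reduce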

Export a job from Azkaban

Download a flow

You can download a specific flow in the Azkaban console.

  1. Log on to the Azkaban console and go to the Projects page.

  2. Select a project whose package you want to download. On the project page, click Flows to show all flows of the project.

  3. Click Download in the upper-right corner of the page to download the package of the project.

    Native Azkaban packages can be exported, and no format limits are imposed on them. The exported ZIP package contains information about all jobs in a specific Azkaban project and the relationships between the jobs. You can directly upload the ZIP package exported from the Azkaban console on the Scheduling Engine Import page in DataWorks.
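Before you upload the package, you can optionally list its contents to confirm that the expected job definition files are included. This is a sketch; the file name project.zip is a placeholder for the package that you downloaded.

    # List the contents of the downloaded Azkaban project package.
    # Azkaban job definitions typically appear as .job (and, for Flow 2.0, .flow) files.
    unzip -l project.zip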

Conversion logic

The following list describes the mappings between Azkaban items and DataWorks items and the conversion logic.

  • Flow → Workflow in DataStudio: Jobs in a flow are placed in the workflow that corresponds to the flow and are used as nodes in the workflow. Nested flows in a flow are converted into separate workflows in DataWorks. After the conversion, the dependencies between the nodes in the workflow are automatically established.

  • Command-type job → Shell node: In DataWorks on EMR mode, a command-type job is converted into an E-MapReduce (EMR) Shell node. You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box. If you call other scripts in the CLI of a command-type job, the script file obtained after analysis can be registered as a resource file of DataWorks, and the resource file is referenced in the converted Shell code.

  • Hive-type job → ODPS SQL node: In DataWorks on MaxCompute mode, a Hive-type job is converted into an ODPS SQL node. You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box.

  • Other types of nodes that are not supported by DataWorks → Zero load node or Shell node: You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box.


Export a task from Airflow

Procedure

  1. Go to the runtime environment of Airflow.

  2. Use the Python library of Airflow to load the directed acyclic graph (DAG) folder that is scheduled on Airflow. The DAG Python file is stored in the DAG folder.

  3. Use the export tool to read the task information and dependencies stored in the DAG Python file based on the Python library of Airflow in memory. Then, write the generated DAG information to a JSON file and export the file.

    You can download the export tool on the Scheduling Engine Export page of Cloud tasks in DataWorks Migrant Assistant. For information about how to go to the Scheduling Engine Export page, see the Export tasks from other open source engines section of this topic.

Usage notes for the export tool

To use the export tool, perform the following steps:

  1. Run the following command to decompress the airflow-exporter.tgz package:

    tar zxvf airflow-exporter.tgz
  2. Run the following command to set the PYTHONPATH parameter to the directory of the Python library:

    export PYTHONPATH=/usr/local/lib/python3.6/site-packages
  3. Run the following command to export the task from Airflow:

    cd airflow-exporter
    python3.6 ./parser -d /path/to/airflow/dag/folder/ -o output.json
  4. Run the following command to compress the exported output.json file into a ZIP package:

    zip out.zip output.json
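Before you upload the package, you can optionally confirm that the exported output.json file is valid JSON. This is a generic sanity check and is not part of the export tool.

    # Optional check: confirm that output.json parses as JSON before uploading out.zip.
    python3 -m json.tool output.json > /dev/null && echo "output.json is valid JSON"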

After the ZIP package is generated, you can perform the following operations to create an import task to import the ZIP package to DataWorks: Go to the Migrant Assistant page in the DataWorks console. In the left-side navigation pane, choose Cloud tasks > Scheduling Engine Import. For more information, see Import tasks of open source engines.

Export a node from DolphinScheduler

How it works

The DataWorks export tool obtains the JSON configurations of a process in DolphinScheduler by calling the API operation that is used to export multiple processes from DolphinScheduler at a time. The export tool generates a ZIP file based on the JSON configurations. In the left-side navigation pane of the Migrant Assistant page, you can choose Cloud tasks > Scheduling Engine Import, and create an import task of the DolphinScheduler type to import the ZIP file. DataWorks Migration Assistant parses and converts code and dependencies of nodes in a process in the ZIP file into valid file configurations for related DataWorks nodes. For information about how to import a task of an open source scheduling engine, see Scheduling Engine Import.
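For reference, the following curl call sketches the kind of request that is issued to export process definitions. The endpoint path, port, header name, and parameters are assumptions based on the DolphinScheduler 1.3.x API and may differ in your deployment; in practice, the export tool (reader.py, shown later in this section) makes this call for you.

    # Hypothetical sketch of a DolphinScheduler 1.3.x batch-export request.
    # Host, port, token, project name, and process definition IDs are placeholders.
    curl -H "token: <your_token>" \
      "http://<dolphinscheduler-host>:12345/dolphinscheduler/projects/<project_name>/process/export?processDefinitionIds=1,2" \
      -o processes.json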

Limits

  • Limits on version: Only nodes of DolphinScheduler 1.3.x can be exported.

  • Limits on node type conversion:

    • SQL nodes: Only some types of compute engines support the conversion of SQL nodes. During the node type conversion, the syntax of SQL code is not converted and the SQL code is not modified.

    • Cron expressions: In specific scenarios, cron expressions may be pruned or may not be supported. You must check whether the configured scheduling time meets your business requirements. For information about scheduling time, see Configure time properties.

    • Python nodes: DataWorks does not provide Python nodes. DataWorks can convert a Python node in DolphinScheduler into a Python file resource and a Shell node that references the Python file resource. However, issues may occur when scheduling parameters of the Python node are passed. Therefore, debugging and checks are required. For information about scheduling parameters, see Configure and use scheduling parameters.

    • Depend nodes: DataWorks cannot retain the cross-cycle scheduling dependencies that are configured for Depend nodes. If cross-cycle scheduling dependencies are configured for a Depend node, DataWorks converts them into same-cycle scheduling dependencies of the mapped auto triggered node in DataWorks. For information about how to configure same-cycle scheduling dependencies, see Configure same-cycle scheduling dependencies.

Conversion logic

The following list describes the mappings between DolphinScheduler items and DataWorks items and the conversion logic.

  • Process → Workflow: Nodes in a DolphinScheduler process are converted into nodes in a DataWorks workflow. In addition, zero load nodes that serve as the start and end nodes of the DataWorks workflow are automatically added.

  • SubProcess node → Zero load node (see Create and use a zero load node): When the process to which a SubProcess node belongs is converted into a DataWorks workflow, two zero load nodes are added to the workflow. One is used as the start node and the other is used as the end node. The SubProcess node is converted into a zero load node in the DataWorks workflow. The start node depends on this zero load node, and the descendant nodes of the SubProcess node depend on the end node.

  • Conditions node → Merge node (see Configure a merge node): A dependency item and a conditional relation of a Conditions node are converted into a DataWorks merge node and the related logic. The outermost logical dependency configured for the Conditions node uses the mapped merge node and another two merge nodes to determine whether the success path or the failure path is used. Note: DataWorks cannot retain the cross-cycle scheduling dependencies that are configured for Conditions nodes. If cross-cycle scheduling dependencies are configured for a Conditions node, DataWorks converts them into same-cycle scheduling dependencies of the mapped auto triggered node in DataWorks.

  • Depend node → Zero load node (see Create and use a zero load node): The dependencies of a Depend node are converted into the input of the mapped zero load node based on the scheduling configurations. The input is concatenated in the following format: {current_dataworks_project_name}.{dolphin_project_name}.{dolphin_process_name}.{dolphin_task_name}. For example, if the DataWorks workspace is named my_dw_project and the DolphinScheduler project, process, and task are named proj_a, proc_b, and task_c, the generated input is my_dw_project.proj_a.proc_b.task_c. (These names are placeholders.)

  • SQL node: Only the HIVE, SPARK, CLICKHOUSE, and POSTGRESQL compute engine types support the conversion of SQL nodes. You can specify the type of the mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task. Note: The syntax of the SQL code is not converted and the SQL code is not modified.

  • Python node → A Python file resource and a Shell node that references the Python file resource: Issues may occur when scheduling parameters of the Python node are passed. Therefore, debugging and checks are required. For information about scheduling parameters, see Configure and use scheduling parameters.

  • MR node: You can specify the type of the mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task.

  • Spark node: You can specify the type of the mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task.

  • Sqoop node → Batch synchronization node that is configured in script mode: The data sources used for the batch synchronization node vary based on your business requirements. For information about how to configure a batch synchronization node by using the code editor, see Configure a batch synchronization node by using the code editor.

  • Other types of nodes that are not supported by DataWorks → Zero load node: N/A.

Environment preparations

Export procedure

  1. Decompress the export tool package.

    Run the following commands to decompress the export tool package:

    $ tar xzvf migrationx-reader.zip
    $ cd migrationx-reader/
  2. Generate a token used to call DolphinScheduler APIs.

    For information about how to generate a token, see DolphinScheduler documentation.

  3. Export a ZIP file.

    Run the following command to export the required ZIP file:

    $ python ./bin/reader.py -a dolphinscheduler -e http://dolphinschedulerhost:port -t token -v 1.3.9 -p project_name -f ds_dump.zip

After you export the ZIP file, choose Cloud tasks > Scheduling Engine Import in the left-side navigation pane of the Migrant Assistant page, and create an import task of the DolphinScheduler type to import the ZIP file. DataWorks Migration Assistant parses and converts code and dependencies of nodes in a process in the ZIP file into valid file configurations for related DataWorks nodes. For information about how to import a task of an open source scheduling engine, see Scheduling Engine Import.

Export tasks from other open source engines

DataWorks provides a standard template for you to export tasks of open source engines other than Oozie, Azkaban, Airflow, and DolphinScheduler. Before you export a task, you must download the standard template and modify the content based on the file structure in the template. You can go to the Scheduling Engine Export page to download the standard template and view the file structure.

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. On the DataStudio page, click the icon in the upper-left corner and choose All Products > More > Migration Assistant.

  3. In the left-side navigation pane, choose Cloud tasks > Scheduling Engine Export.

  4. Click the Standard Template tab.

  5. On the Standard Template tab, click the standard format template to download it.

  6. Modify the content of the template to generate a package that you want to export.