You can migrate tasks from open source scheduling engines, such as Oozie, Azkaban, and Airflow, to DataWorks. This topic describes how to export such tasks and the requirements for the exported packages.

Background information

Before you import a task of an open source scheduling engine to DataWorks, you must export the task to your on-premises machine or Object Storage Service (OSS). For more information about the import procedure, see Import tasks of open source engines.

Export a task from Oozie

Requirements and structure of the exported package:
  • Requirements

    The package must contain XML-formatted definition files and configuration items of a flow task. The package is exported in the ZIP format.

  • Structure
    Oozie task descriptions are saved in an HDFS directory. For example, each subdirectory under the apps directory in the Examples package on the official Apache Oozie website is an Oozie flow task. Each subdirectory contains the XML-formatted definition files and configuration items of that flow task.
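The packaging requirement above can be sketched in Python. This is a minimal sketch, not a DataWorks tool: it assumes that the flow task directory has already been copied from HDFS to the local machine, and the function name zip_oozie_app and the directory names are hypothetical. Whether the importer expects the top-level directory name to be kept inside the ZIP package is also an assumption.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_oozie_app(app_dir: str, zip_path: str) -> List[str]:
    """Pack every file under app_dir into zip_path and return the stored names.

    Hypothetical helper: app_dir is a local copy of one Oozie flow task
    directory. zip_path must be outside app_dir, otherwise the archive
    would try to include itself.
    """
    root = Path(app_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent directory so that the
                # flow task directory name is kept inside the archive.
                arcname = path.relative_to(root.parent).as_posix()
                archive.write(path, arcname)
                added.append(arcname)
    return added
```

For example, zip_oozie_app("apps/demo-flow", "oozie-export.zip") would store the definition and configuration files of a directory named demo-flow under demo-flow/ in the package.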

Export a task from Azkaban

You can download a specific flow task in the Azkaban console.

  1. Log on to the Azkaban console to go to the Projects page.
  2. Select a project whose package you want to download. On the page for the project, click Flows to show all flow tasks under the project.
  3. Click Download in the upper-right corner of the page to download the package of the project.

    No limit is imposed on the exported packages of Azkaban. The exported package in the ZIP format contains information about all tasks and relationships under a specific project of Azkaban.
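Before you import the downloaded package, you may want to check which tasks and relationships it contains. The following is a minimal sketch under one assumption: the project uses the classic Azkaban .job property files, in which a dependencies key lists upstream jobs separated by commas. Projects defined with YAML .flow files are not handled, and the function name read_job_dependencies is hypothetical.

```python
import zipfile
from typing import Dict, List


def read_job_dependencies(zip_path: str) -> Dict[str, List[str]]:
    """Map each job name to the jobs it depends on, based on .job files."""
    graph: Dict[str, List[str]] = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".job"):
                continue
            # The job name is the file name without directories or extension.
            job_name = name.rsplit("/", 1)[-1][: -len(".job")]
            deps: List[str] = []
            for line in archive.read(name).decode("utf-8").splitlines():
                key, sep, value = line.partition("=")
                if sep and key.strip() == "dependencies":
                    deps = [d.strip() for d in value.split(",") if d.strip()]
            graph[job_name] = deps
    return graph
```

Calling read_job_dependencies on the downloaded ZIP package returns a dictionary that can be compared with the flow graph shown in the Azkaban console.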

Export a task from Airflow

  • To export a task from Airflow, perform the following steps:
    1. Go to the running environment of Airflow.
    2. Use the Python library of Airflow to load the directed acyclic graph (DAG) folder that is scheduled on Airflow. The DAG Python file is stored in the DAG folder.
    3. Use the export tool, which relies on the Python library of Airflow, to read in memory the task information and dependencies that are defined in the DAG Python file. Then, write the generated DAG information to a JSON file and export the file.

      You can download the export tool on the Schemes of Scheduling Engine Export page of Migration Assistant in the DataWorks console. For more information about how to go to the Schemes of Scheduling Engine Export page, see Export a task of another open source engine.

  • Usage notes of the export tool
    1. Execute the following statement to decompress the airflow-exporter.tgz package:
      tar zxvf airflow-exporter.tgz
    2. Execute the following statement to set PYTHONPATH to the directory of the Python library:
      export PYTHONPATH=/usr/local/lib/python3.6/site-packages
    3. Execute the following statement to export the task from Airflow:
      cd airflow-exporter
      python3.6 ./parser -d /path/to/airflow/dag/folder/ -o output.json
    4. Go to the Scheduling Engine Import tab of the Migration Assistant page in the DataWorks console to import the task.
      1. Execute the following statement to compress the exported output.json file into a ZIP package:
        zip out.zip output.json
      2. Go to the Import Tasks page of Migration Assistant in the DataWorks console to import the generated out.zip package. For more information, see Import tasks of open source engines.
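To illustrate what the export steps do conceptually, the following sketch loads a DAG folder with the Python library of Airflow and writes task IDs and upstream dependencies to a JSON file. It is not the airflow-exporter tool and does not replace it: the function names and the JSON layout are hypothetical, and the layout is not claimed to match the format that Migration Assistant expects. It assumes that Airflow is installed in the environment where the sketch runs.

```python
import json
from typing import Any, Dict, Iterable


def dags_to_dict(dags: Iterable[Any]) -> Dict[str, Dict[str, list]]:
    """Collect task IDs and their upstream task IDs from DAG-like objects.

    Hypothetical layout: {dag_id: {task_id: [upstream_task_id, ...]}}.
    """
    result = {}
    for dag in dags:
        result[dag.dag_id] = {
            task.task_id: sorted(task.upstream_task_ids) for task in dag.tasks
        }
    return result


def export_dag_folder(dag_folder: str, json_path: str) -> None:
    """Load all DAGs in dag_folder and write their structure to json_path."""
    # Imported here so that dags_to_dict can be used without Airflow installed.
    from airflow.models import DagBag

    dag_bag = DagBag(dag_folder=dag_folder, include_examples=False)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(dags_to_dict(dag_bag.dags.values()), handle, indent=2)
```

Keeping the Airflow import inside export_dag_folder lets the conversion logic be checked with simple stand-in objects before it is run against a real DAG folder.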

Export a task of another open source engine

DataWorks provides a standard template for you to export tasks of open source engines except for Oozie, Azkaban, and Airflow. Before you run an export task, you must download the standard template and modify the content based on the file structure in the template. You can go to the Schemes of Scheduling Engine Export page to download the standard template and view the file structure.

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. Click the icon in the upper-left corner. Then, choose All Products > Other > Migration Assistant.
  3. In the left-side navigation pane, choose Cloud tasks > Scheduling Engine Export to go to the Schemes of Scheduling Engine Export page.
  4. Click Standard Template.
  5. On the Standard Template tab, click standard format Template to download the template.
  6. Modify the content in the template to generate a package that you want to export.