This topic describes the basic features of the scheduling migration tool in Lakehouse Migration.
Function overview
Lakehouse Migration (LHM) lets you quickly migrate jobs from open source and other cloud scheduling engines to DataWorks.
Full process control. The migration consists of three steps: exporting jobs from the source, transforming the heterogeneous jobs, and importing the jobs into DataWorks. The intermediate results of each step are accessible, which gives you full control over the migration.
Flexible transformation. Configurable transformation rules support multiple compute engines in DataWorks, such as MaxCompute, EMR, and Hologres.
Lightweight deployment. The tool requires only a JDK 17 runtime environment and network connectivity.
Enhanced data security. The migration runs locally, and intermediate results are not uploaded.
Architecture diagram:
Scheduling migration steps
The LHM scheduling migration tool migrates and transforms jobs from a supported scheduling engine to DataWorks in three steps.
Export scheduling tasks from the migration source (source discovery).
The tool retrieves scheduling task information from the source and parses it into the standard LHM data structure for scheduling workflows. This gives the subsequent steps a uniform structure to work on, regardless of the source engine.
Transform scheduling properties from the migration source to DataWorks properties.
Based on the standard LHM data structure for scheduling workflows, source scheduling task properties are transformed into DataWorks task properties, including task types, scheduling settings, task parameters, and, for some task types, scripts.
Import scheduling tasks into DataWorks.
The tool automatically builds DataWorks workflow definitions and imports the tasks by calling the DataWorks SDK. For each task, it determines whether to create a new task or update an existing one, which supports multiple migration rounds and keeps DataWorks synchronized with changes at the source. The sketch after these steps illustrates the flow.
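The following is a minimal sketch, in Java (matching the tool's JDK 17 runtime), of how such an export-transform-import pipeline fits together. All names in it, including StandardWorkflow, SourceExporter, DataWorksImporter, and the node-type mappings, are hypothetical illustrations, not the actual LHM data structure or the DataWorks SDK API; the point is the create-or-update decision that makes repeated migration rounds safe.

```java
// Hypothetical sketch of the three-step flow: export -> transform -> import.
// None of these types or mappings are the real LHM or DataWorks SDK API.
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MigrationSketch {

    // Assumed stand-in for the standard LHM data structure: a workflow is
    // a named list of tasks plus the dependencies between them.
    record StandardTask(String name, String type, String script,
                        Map<String, String> scheduleProperties) {}
    record StandardWorkflow(String name, List<StandardTask> tasks,
                            Map<String, List<String>> dependencies) {}

    // Step 1: export from the source engine into the standard structure.
    interface SourceExporter {
        List<StandardWorkflow> exportWorkflows();
    }

    // Step 2: map standard tasks to DataWorks task definitions.
    record DataWorksTask(String name, String nodeType, String script,
                         Map<String, String> scheduleProperties) {}

    static DataWorksTask transform(StandardTask task) {
        // Real transformations also rewrite scripts for some task types;
        // here only the task type is mapped, and the mappings are assumed.
        String nodeType = switch (task.type()) {
            case "SQL"   -> "ODPS_SQL";    // assumed mapping
            case "Shell" -> "DIDE_SHELL";  // assumed mapping
            default      -> "CUSTOM";
        };
        return new DataWorksTask(task.name(), nodeType, task.script(),
                                 task.scheduleProperties());
    }

    // Step 3: import, creating or updating so that reruns stay idempotent.
    interface DataWorksImporter {
        Optional<String> findTaskIdByName(String name);
        void createTask(DataWorksTask task);
        void updateTask(String taskId, DataWorksTask task);
    }

    static void importAll(SourceExporter source, DataWorksImporter target) {
        for (StandardWorkflow workflow : source.exportWorkflows()) {
            for (StandardTask task : workflow.tasks()) {
                DataWorksTask dwTask = transform(task);
                // Create-or-update: this is what allows multiple migration
                // rounds and synchronization of source changes.
                target.findTaskIdByName(dwTask.name())
                      .ifPresentOrElse(
                          id -> target.updateTask(id, dwTask),
                          () -> target.createTask(dwTask));
            }
        }
    }
}
```

Because the import step looks up existing tasks before writing, rerunning the migration after changes at the source updates tasks in place instead of creating duplicates.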
Scheduling migration capability matrix
The LHM tool currently supports automated migration of tasks from the following scheduling engines to DataWorks.
Scheduling migration from open source engines to DataWorks
Source type | Source version | Supported node types for transformation |
DolphinScheduler | 1.x | Shell, SQL, Python, DataX, Sqoop, Spark (Java, Python, SQL), MapReduce, Conditions, Dependent, SubProcess |
DolphinScheduler | 2.x | Shell, SQL, Python, DataX, Sqoop, HiveCLI, Spark (Java, Python, SQL), MapReduce, Procedure, HTTP, Conditions, Switch, Dependent, SubProcess |
DolphinScheduler | 3.x | Shell, SQL, Python, DataX, Sqoop, SeaTunnel, HiveCLI, Spark (Java, Python, SQL), MapReduce, Procedure, HTTP, Conditions, Switch, Dependent, SubProcess (renamed to SubWorkflow in version 3.3.0-alpha) |
Airflow | 2.x | EmptyOperator, DummyOperator, ExternalTaskSensor, BashOperator, HiveToMySqlTransfer, PrestoToMySqlTransfer, PythonOperator, HiveOperator, SqoopOperator, SparkSqlOperator, SparkSubmitOperator, SQLExecuteQueryOperator, PostgresOperator, MySqlOperator |
Azkaban (Beta) | 3.x | Noop, Shell, Subprocess |
Oozie (Beta) | 5.x | Start, End, Kill, Decision, Fork, Join, MapReduce, Pig, FS, SubWorkflow, Java |
HUE (Beta) | Latest | Fork, Join, OK, Error, Sqoop, Hive, Hive2, Shell |
In the tables in this topic, Latest refers to the latest version as of May 2025.
Scheduling migration from other cloud scheduling engines to DataWorks
Source type | Source version | Supported node types for transformation |
DataArts (DGC) | Latest | CDMJob, HiveSQL, DWSSQL, DLISQL, RDSSQL, SparkSQL, Shell, DLISpark, MRSSpark, DLFSubJob, RESTAPI, Note, Dummy |
WeData | Latest | Shell, HiveSql, JDBCSql, Python, SparkPy, SparkSql, Foreach, ForeachStart, ForeachEnd, Offline Sync |
Azure Data Factory (ADF, Beta) | Latest | DatabricksNotebook, ExecutePipeline, Copy, Script, Wait, WebActivity, AppendVariable, Delete, DatabricksSparkJar, DatabricksSparkPython, Fail, Filter, ForEach, GetMetadata, HDInsightHive, HDInsightMapReduce, HDInsightSpark, IfCondition, Lookup, SetVariable, SqlServerStoredProcedure, Switch, Until, Validation, SparkJob |
Scheduling migration from EMR Workflow to DataWorks
Source type | Source version | Supported node types for transformation |
EMR Workflow | 2024.03 (Latest) | Shell, SQL, Python, DataX, Sqoop, SeaTunnel, HiveCLI, Spark, ImpalaShell, RemoteShell, MapReduce, Procedure, HTTP, Conditions, Switch, Dependent, SubProcess |
DataWorks like-for-like migration path
Source type | Source version | Supported node types for transformation |
DataWorks | New version | All nodes included in a periodically scheduled workflow |
DataWorks Spec | New version | All nodes included in a periodically scheduled workflow |