This topic describes how to quickly get started with the lake warehouse one-stop migration center.
Getting started: Create a migration scenario
Use the exploration Agent to discover the big data components and clusters on the customer's source side.
Add a migration scenario under the "One-stop Migration" menu. Configure the source data source, target data source, and the mapping relationship between the source and target. A migration scenario can include multiple data migrations and multiple scheduling cluster migrations.
Step 1. Create basic scenario information
Step 2. Create source data source
Create a data source: Create a source data source to be migrated. Data source types are divided into data cluster data sources and scheduling cluster data sources.
Hive data source configuration (as an example):
Configuration item name | Subkey name | Description |
Data source name | | Cannot contain special characters other than letters, numbers, and Chinese characters. Cannot be modified when editing. Must be globally unique. |
Data source type | | Hive |
Hive JDBC URL | | JDBC address for accessing HiveServer2, for example: jdbc:hive2://ip:port/database. Note that if Kerberos authentication is enabled, the JDBC connection string format is: jdbc:hive2://ip:port/database;principal=your principal |
Database name | | Hive database |
Hive version | | 1.x, 2.x, 3.x |
Hive login method | | Hive JDBC login method: anonymous, LDAP, or Kerberos authentication |
Configuration | Format options | Configures the required authentication data, such as the LDAP username and password or the authentication files required for Kerberos login. This configuration item has multiple options. |
| LDAP - username | LDAP username |
| LDAP - password | LDAP password |
| Kerberos authentication - keytab file | Upload the keytab file |
| Kerberos authentication - conf file | Upload the conf file |
| Kerberos authentication - principal | Kerberos principal |
Metadata type and access configuration | 1. Hive Metastore | |
| Does metadata access require Kerberos authentication | Check whether hive.metastore.sasl.enabled is true in the cluster configuration. If so, select Kerberos authentication here and configure the corresponding Kerberos authentication file information in the Configuration item above. |
| metastoreUris | Hive Metastore URI, for example: thrift://10.0.2.165:9083 |
| 2. DLF | Obtain the Metastore connection through DLF |
| Endpoint | Endpoint for accessing DLF. The machine where the tool is located must be able to connect to it. |
| AccessKey ID | AccessKey ID used to access DLF |
| AccessKey Secret | AccessKey Secret used to access DLF |
| Data catalog | The catalog where the data to be verified is located. Note: the CatalogId must correspond to the JDBC URL above. |
| DLF Settings | Other DLF configuration information |
| 3. MySQL metadata warehouse | |
| jdbcURL | Example: jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8 |
| Username | |
| Password | |
| 4. PG metadata warehouse | |
| jdbcURL | Example: jdbc:postgresql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8 |
| Username | |
| Password | |
| 5. AWS Glue | |
| AccessKey ID | |
| Access Secret | |
| Region | |
MaxCompute data source configuration:
Configuration item name | Description |
Data source name | Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing |
Data source type | MaxCompute |
ODPS Endpoint | Endpoint (mc.endpoint) for accessing MaxCompute. The machine where the tool is located must be able to connect to it. Example: http://service.cn-shanghai.maxcompute.aliyun.com/api |
AccessKey ID | AccessKey ID used to access MaxCompute |
AccessKey Secret | AccessKey Secret used to access MaxCompute |
Default Project | MMA will use this project's quota to execute SQL on MaxCompute |
Tunnel Endpoint | Data verification does not depend on this item; any value can be filled in. |
Scheduling cluster data source, with DolphinScheduler as an example:
You can either directly upload a zip package of source information exported from the DolphinScheduler system, or connect to its API to obtain the relevant migration information.
Configuration item name | Description |
Data source name | Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing |
Data source type | DolphinScheduler |
Version | DolphinScheduler version selection |
Connection address | DolphinScheduler system service access address |
Token | Access Token (AccessToken): when using API operations to obtain data from the DolphinScheduler system, an access token obtained through mechanisms such as OAuth 2.0 is required. |
Project name | The project space information in the DolphinScheduler system, required when accessing the related API |
After creating the data source, you need to test its connectivity.
Step 3. Create target data source
Creation method is the same as Step 2.
Step 4. Associate target
This list takes the source data source (cluster) as the main perspective and displays its association with the target data source (cluster). Click "Associate Target Data Source" in the operation column and select the corresponding target data source to associate. This mapping will serve as the source-to-target correspondence during the actual migration.
Click "Complete" to save.
Set up Pipeline (one-stop migration)
After creating the scenario, you need to set it up.
We will initialize the migration Pipeline DSL based on the migration scenario; you only need to configure migration preferences in the following interface:
1. Cluster exploration settings
Data cluster exploration: Taking Hive to MC migration as an example, supports configuring cron expressions for timed scheduling.
Scheduling cluster exploration: Taking Dolphin to DataWorks migration as an example, supports configuring cron expressions for timed scheduling.
2. Metadata migration & data migration settings
Taking the Hive2MC scenario as an example, metadata migration and data migration share the same settings.
1) Fill in the mapping relationship from source Hive DB to target MaxCompute Project. You can click "Download Template", fill in the information, and then upload the file.
2) Blacklist: objects listed here are ignored during migration. Use the format dbname.tablename, or dbname.* (representing all tables under the db).
3) Select migration method
Hive UDTF(Tunnel):
1. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.
2. Create tables and partitions on the MaxCompute side based on the obtained metadata information (and the migration scope issued by the workbench task instructions).
3. Submit SQL commands to execute UDTF to the source Hive.
4. UDTF calls MaxCompute's Tunnel SDK to write data to the target.
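A minimal sketch of what steps 3 and 4 could look like on the Hive side, assuming a hypothetical UDTF named hive_to_mc_transfer that wraps the MaxCompute Tunnel SDK; the function name, JAR path, class, table, and columns are placeholders, and the real statements are generated by the tool:
-- Register the transfer UDTF (hypothetical JAR path and class name)
ADD JAR /path/to/migration-udtf.jar;
CREATE TEMPORARY FUNCTION hive_to_mc_transfer AS 'com.example.migration.HiveToMcTransferUDTF';
-- Push one partition of a hypothetical source table; the UDTF writes each row
-- to the target MaxCompute table through the Tunnel SDK
SELECT hive_to_mc_transfer(order_id, amount, pt)
FROM src_db.orders
WHERE pt = '20240101';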
OSS transfer, MaxCompute external table to internal table:
1. Use Alibaba Cloud OSS Data Online Migration service to migrate Hive data (HDFS/S3/OBS, etc.) to Alibaba Cloud OSS.
2. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.
3. Create OSS external tables and corresponding standard tables in the target MaxCompute based on the obtained Hive metadata and OSS path information (corresponding to HDFS).
4. Query data from the external tables and insert it into the standard tables (insert into the standard table by selecting from the OSS external table), importing the data from OSS into MaxCompute, as sketched below.
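A hedged sketch of steps 3 and 4 on the MaxCompute side, using hypothetical table names, columns, and OSS path; the exact DDL options (file format, OSS endpoint, partition handling) depend on your data and environment:
-- Step 3: map the data already copied to OSS as an external table (hypothetical names and path)
CREATE EXTERNAL TABLE IF NOT EXISTS orders_oss_ext (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (pt STRING)
STORED AS PARQUET
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/my-bucket/hive/orders/';
-- Register the partitions that exist under the OSS path
ALTER TABLE orders_oss_ext ADD IF NOT EXISTS PARTITION (pt = '20240101');
-- Create the corresponding standard (internal) table
CREATE TABLE IF NOT EXISTS orders (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (pt STRING);
-- Step 4: import the data from OSS into MaxCompute
INSERT OVERWRITE TABLE orders PARTITION (pt)
SELECT order_id, amount, pt FROM orders_oss_ext;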
4) Advanced settings
Global parameters:
hive.task.partition.max.num=50: Maximum number of partitions processed by a single task (optional)
hive.task.partition.max.size=5: Maximum data size processed by a single task (unit: G) (optional)
Custom parameters:
e.g.: mapreduce.map.speculative=false: Whether to enable speculative execution in the map phase, default is true
mapreduce.map.memory.mb=8192: Memory upper limit (MB) that a single map task can use
yarn.scheduler.maximum-allocation-mb=8192: Maximum memory (MB) that can be requested for a single YARN container
hive.fetch.task.conversion=none: All queries that read from HDFS go through MapReduce tasks
mapreduce.task.timeout=3600000 (unit: ms)
mapreduce.job.running.map.limit=100: Maximum number of map tasks running simultaneously
mapreduce.max.split.size=512000000
mapreduce.map.maxattempts=0
hive.execution.engine=mr
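As an illustration only, assuming the tool applies these parameters as Hive session settings before submitting the migration SQL (hypothetical source table; the actual submission logic belongs to the tool):
-- Apply the custom parameters for this Hive session
SET hive.execution.engine=mr;
SET mapreduce.map.speculative=false;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.task.timeout=3600000;
SET mapreduce.job.running.map.limit=100;
-- Statements submitted afterwards in the same session run with these settings
SELECT COUNT(*) FROM src_db.orders WHERE pt = '20240101';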
3. Data verification settings
1) Create a new verification engine
2) Select verification rules:
Left variable: Table name, data type in table
Logical condition: Equal to, contains, does not contain
Right variable: When the left variable is table name, fill in keywords; when the left variable is data type, fill in the data type, the source verification execution engine, and the target verification engine
Verification mode: Data volume (count) comparison, metric comparison, weak content comparison, custom comparison, full text comparison, percentage of null values comparison (a count comparison is sketched after this list)
Difference tolerance rate: Supports filling in rules, and supports adding rows and deleting existing rows
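For example, a data volume (count) comparison conceptually issues the same count on both sides and compares the difference against the configured tolerance rate (hypothetical table and partition; the actual statements are generated by the verification engine):
-- On the source Hive cluster
SELECT COUNT(*) FROM src_db.orders WHERE pt = '20240101';
-- On the target MaxCompute project
SELECT COUNT(*) FROM target_project.orders WHERE pt = '20240101';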
4. SQL code conversion settings
Upload a table mapping file that provides the mapping relationship between the source triple and the target triple, such as the mapping between db.tablename of the Hive cluster and project.tablename of the target MaxCompute cluster, as illustrated below.
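A small illustration of the effect of the table mapping on converted SQL, using hypothetical database, project, and table names:
-- Source Hive SQL
SELECT order_id, amount FROM hive_sales_db.orders WHERE pt = '20240101';
-- Converted MaxCompute SQL, after applying the mapping hive_sales_db.orders -> mc_sales_project.orders
SELECT order_id, amount FROM mc_sales_project.orders WHERE pt = '20240101';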
5. Task scheduling migration settings
1) Blacklist: Supports setting blacklists for objects such as workflow, tasknode, etc.
2) Supports submission through the migration assistant, and also supports submission through the DataWorks OpenAPI (recommended)
3) You need to select the target engine information
4) Edit default rules
{
"format": "SPEC",
"locale": "zh_CN",
"skipUnSupportType": true,
"transformContinueWithError": true,
"specContinueWithError": true,
"tempTaskTypes": [
"SPARK",
"HIVECLI"
],
"skipTaskCodes": [],
"settings": {
"workflow.converter.shellNodeType": "EMR_SHELL",
"workflow.converter.commandSqlAs": "EMR_SQL",
"workflow.converter.sparkSubmitAs": "EMR_SPARK",
"workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
"workflow.converter.mrNodeType": "EMR_MR",
"workflow.converter.target.engine.type": "EMR",
"workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
"POSTGRESQL": "EMR_HIVE",
"MYSQL": "EMR_HIVE",
"HIVE": "EMR_HIVE",
"CLICKHOUSE": "CLICK_SQL"
}
}
}
Fill in the node types that need custom processing in tempTaskTypes. If not filled, they will be converted according to default rules (if there are corresponding node conversion rules in settings). If filled, the custom rules will prevail and the original content will be preserved.
5) Custom conversion plugin & rules
To use custom conversion rules, you need to first download the conversion plugin custom project and then upload it.
Custom conversion rule configuration:
{
"if.use.migrationx.before": false,
"if.use.default.convert": true,
"conf": [
{
"rule": {
"settings": {
"workflow.converter.shellNodeType": "DIDE_SHELL",
"workflow.converter.commandSqlAs": "DIDE_SHELL",
"workflow.converter.sparkSubmitAs": "ODPS_SPARK",
"workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
"workflow.converter.mrNodeType": "ODPS_MR",
"workflow.converter.target.engine.type": "MaxCompute",
"workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
"POSTGRESQL": "ODPS_SQL",
"MYSQL": "ODPS_SQL",
"HIVE": "ODPS_SQL"
}
}
},
"nodes": "all, name, idsStr"
}
]
}
You can customize node type conversion rules in the project. When there is a conflict with the default rules, the custom rules prevail.
You can configure multiple JSON segments, each selecting the conversion rules applied to a chosen range of nodes. This means that, by selecting the appropriate nodes range, different types of shell nodes on the source side can be converted to different node types on the target side.
6) Submission settings
Manage preferences for the status of the corresponding elements when submitting to the target DataWorks. After completing the above configuration, click Save.
Run Pipeline one-stop migration
1. Start
Click "Running Status" to view the one-stop migration progress. You can see the corresponding exploration tasks according to the time set for the cluster exploration task, and the results of each exploration.
2. Cluster exploration
Data cluster exploration example:
Exploration result details:
Scheduling cluster exploration example:
Exploration result details:
3. Data migration
The first exploration of the source cluster's data will automatically generate a stock data migration task. Each subsequent exploration will compare with the previous cluster exploration task results to determine the incremental part and automatically generate incremental data migration tasks. Atomic-level objects (tables or partitions) will be executed in the order of schema migration -> data migration -> data verification. When all three tasks are completed and pass verification, the migration of a single object is considered successful.
You can view the details of this migration task through the following entry:
Taking the data migration details interface as an example:
It displays statistical information for a single task and provides a "retry" entry for atomic objects whose migration failed.
4. Scheduling migration
Each exploration of the source cluster's scheduling automatically generates the corresponding batch of task instances to be migrated. You need to manually execute the conversion and submit them to the target.
Conversion report: