
Migration Hub: Lakehouse Migration Center quick start

Last Updated: Feb 28, 2026

This guide walks through an end-to-end migration from Apache Hive to MaxCompute using the Lakehouse Migration Center. The migration pipeline covers cluster exploration, metadata migration, data migration, data verification, SQL code conversion, and task scheduling migration.

Prerequisites

  • An Alibaba Cloud account with RAM permissions for Migration Hub

  • AccessKey ID and AccessKey Secret for Alibaba Cloud services

  • Network connectivity between the migration tool machine and the source cluster (Apache Hive, Apache DolphinScheduler, or other supported systems)

  • Network connectivity to target Alibaba Cloud services (MaxCompute, DataWorks)

  • Access credentials for source systems (JDBC URLs, Kerberos keytab files, or LDAP credentials as applicable)

Migration workflow

| Phase | Action |
| --- | --- |
| 1. Create a migration scenario | Define source and target data sources and their mapping relationships |
| 2. Configure the migration pipeline | Set preferences for cluster exploration, data migration, data verification, SQL conversion, and scheduling migration |
| 3. Run the pipeline | Execute the migration and monitor progress |

Step 1: Create a migration scenario

A migration scenario defines source data sources, target data sources, and the mapping between them. A single scenario supports multiple data migrations and multiple scheduling cluster migrations.

Before creating a scenario, use the Exploration Agent to discover big data components and clusters in the source environment.

Go to One-stop Migration in the console and add a migration scenario.

1a. Enter basic scenario information

Provide a name and description for the migration scenario.

1b. Create a source data source

Data sources fall into two categories:

| Category | Description | Example |
| --- | --- | --- |
| Data cluster data sources | Storage and compute clusters | Apache Hive, HDFS |
| Scheduling cluster data sources | Workflow orchestration systems | Apache DolphinScheduler |

Hive data source fields

| Field | Subfield | Description | Required |
| --- | --- | --- | --- |
| Data source name | | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | | Select Hive. | Yes |
| Hive JDBC URL | | JDBC address for HiveServer2. Format: jdbc:hive2://<ip>:<port>/<database>. With Kerberos: jdbc:hive2://<ip>:<port>/<database>;principal=<your_principal> | Yes |
| Database name | | The Hive database to migrate. | Yes |
| Hive version | | Select 1.x, 2.x, or 3.x. | Yes |
| Hive logon method | | JDBC authentication method: anonymous, LDAP, or Kerberos. | Yes |
| Configuration | Format options | Authentication data for the selected logon method. See the authentication scenarios below. | Conditional |
| | LDAP - username | LDAP username. | If LDAP |
| | LDAP - password | LDAP password. | If LDAP |
| | Kerberos - keytab file | Upload the keytab file. | If Kerberos |
| | Kerberos - conf file | Upload the Kerberos conf file. | If Kerberos |
| | Kerberos - principal | The Kerberos principal. | If Kerberos |

Authentication scenarios:

| JDBC authentication | Metastore authentication | Configuration |
| --- | --- | --- |
| Username and password | Any | Select LDAP and enter the username and password |
| Kerberos | Kerberos | Select Kerberos authentication and enter the authentication details |
| Username and password | Kerberos | Enter both LDAP and Kerberos details |
| Anonymous | No Kerberos | No additional configuration required |
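As a sanity check before filling in the form, the JDBC URL variants above can be assembled as follows. This is a minimal sketch; the helper name and the host, port, database, and principal values are placeholders, not part of the product.

```python
def hive_jdbc_url(host, port, database, principal=None):
    """Build a HiveServer2 JDBC URL in the format the Hive JDBC URL field
    expects. With Kerberos logon, the HiveServer2 principal is appended
    as a ;principal= option; LDAP credentials go in separate fields."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal:
        url += f";principal={principal}"
    return url

# Anonymous or LDAP logon: credentials are supplied outside the URL.
print(hive_jdbc_url("10.0.2.165", 10000, "default"))
# Kerberos logon: include the HiveServer2 principal in the URL.
print(hive_jdbc_url("10.0.2.165", 10000, "default",
                    principal="hive/_HOST@EXAMPLE.COM"))
```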

Hive metadata access

Select a metadata type:

| Metadata type | Key fields | Notes |
| --- | --- | --- |
| Hive Metastore | metastoreUris: Hive Metastore URL (e.g., thrift://10.0.2.165:9083). If hive.metastore.sasl.enabled equals true in the cluster configuration, select Kerberos authentication and provide the keytab, conf, and principal in the Configuration section above. | Direct connection to Hive Metastore |
| DLF | Endpoint: DLF access endpoint (the migration tool machine must have connectivity). AccessKey ID / AccessKey Secret: credentials for DLF access. Data catalog: the catalog containing the data to verify; the CatalogId must match the JDBC URL above. DLF Settings: additional DLF configuration. | Access the Metastore through Data Lake Formation |
| MySQL metadata warehouse | jdbcURL: e.g., jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8. Username and Password. | MySQL-backed Hive Metastore |
| PostgreSQL metadata warehouse | jdbcURL: e.g., jdbc:postgresql://rm-xxxx.pg.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8. Username and Password. | PostgreSQL-backed Hive Metastore |
| AWS Glue | AccessKey ID, Access Secret, Region. | AWS Glue Data Catalog |

MaxCompute data source fields

| Field | Description | Required |
| --- | --- | --- |
| Data source name | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | Select MaxCompute. | Yes |
| ODPS Endpoint | The MaxCompute endpoint. ODPS (Open Data Processing Service) is the former name of MaxCompute. Example: http://service.cn-shanghai.maxcompute.aliyun.com/api | Yes |
| AccessKey ID | AccessKey ID for MaxCompute access. | Yes |
| AccessKey Secret | AccessKey Secret for MaxCompute access. | Yes |
| Default Project | MaxCompute Migration Assist (MMA) uses this project's quota to run SQL on MaxCompute. | Yes |
| Tunnel Endpoint | The MaxCompute Tunnel endpoint. Not required for data verification. | No |

DolphinScheduler data source fields

Provide the source information either by uploading a zip package or by connecting through the API.

| Field | Description | Required |
| --- | --- | --- |
| Data source name | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | Select DolphinScheduler. | Yes |
| Version | The DolphinScheduler version. | Yes |
| Connection address | The DolphinScheduler service access URL. | Yes |
| Token | Access token obtained through OAuth 2.0 for API-based data retrieval. | Yes |
| Project name | The DolphinScheduler project space required for API access. | Yes |

After creating each data source, test the connection to verify connectivity.

1c. Create a target data source

Create the target data source using the same process described in Step 1b. For a Hive-to-MaxCompute migration, create a MaxCompute data source as the target.

1d. Associate source and target data sources

The association list displays source data sources (clusters) and their relationships to target data sources (clusters).

  1. In the Operation column, click Associate Target Data Source.

  2. Select the target data source for the association.

  3. Click Complete to save.

Step 2: Configure the migration pipeline

The system initializes a pipeline DSL (Domain-Specific Language) based on the migration scenario. Configure migration preferences in the following sections.

2.1 Cluster exploration

Configure how and when the system discovers data and scheduling clusters.

| Exploration type | Example | Scheduling |
| --- | --- | --- |
| Data cluster exploration | Hive-to-MaxCompute migration | Supports cron expressions for scheduled discovery |
| Scheduling cluster exploration | DolphinScheduler-to-DataWorks migration | Supports cron expressions for scheduled discovery |

2.2 Metadata migration and data migration

The following example uses the Hive-to-MaxCompute (Hive2MC) scenario. Metadata migration and data migration share the same settings.

Source-to-target mapping

Specify the mapping between source Hive databases and target MaxCompute projects. To configure in bulk, click Download Template, fill in the mapping information, and upload the file.

Blacklist

Specify tables to skip during migration. Use the format dbname.tablename for individual tables, or dbname.* to exclude all tables in a database.
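The two blacklist entry forms behave like simple glob patterns over qualified names. A minimal sketch of the matching semantics (the function name and sample table names are illustrative, not part of the product):

```python
import fnmatch

def is_blacklisted(qualified_name, blacklist):
    """Return True if a dbname.tablename string matches any blacklist
    entry. Entries are either exact dbname.tablename pairs or
    dbname.* wildcards that exclude every table in a database."""
    return any(fnmatch.fnmatch(qualified_name, pattern) for pattern in blacklist)

blacklist = ["sales.tmp_orders", "staging.*"]
print(is_blacklisted("sales.tmp_orders", blacklist))    # skipped: exact match
print(is_blacklisted("staging.daily_load", blacklist))  # skipped: database wildcard
print(is_blacklisted("sales.orders", blacklist))        # migrated: no match
```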

Migration method

Hive UDTF (Tunnel)

  1. Retrieve Hive metadata through the Hive Metastore (table names, schemas, partitions, data volume, update times).

  2. Create tables and partitions in MaxCompute.

  3. Submit SQL commands that run a UDTF on the source Hive cluster.

  4. The UDTF calls the MaxCompute Tunnel SDK to write data to the target.

OSS transfer

  1. Use the Alibaba Cloud OSS Data Online Migration service to move Hive data from HDFS, Amazon S3, or OBS to Alibaba Cloud OSS.

  2. Retrieve Hive metadata through the Hive Metastore.

  3. Create OSS external tables and corresponding internal (standard) tables in MaxCompute based on Hive metadata and OSS path mappings.

  4. Run INSERT INTO <standard_table> SELECT * FROM <oss_external_table> to import data from OSS into MaxCompute.

Advanced settings -- global parameters (optional)

| Parameter | Default | Description |
| --- | --- | --- |
| hive.task.partition.max.num | 50 | Maximum number of partitions processed by a single task |
| hive.task.partition.max.size | 5 | Maximum data size (in GB) processed by a single task |
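Together, the two limits bound how many partitions and how much data one migration task handles. The batching they control can be sketched as a greedy grouping; this is a simplified illustration under assumed semantics, not the tool's actual scheduler:

```python
def split_into_tasks(partitions, max_num=50, max_size_gb=5.0):
    """Greedily group (partition_name, size_gb) pairs into tasks so that
    no task exceeds max_num partitions (hive.task.partition.max.num) or
    max_size_gb of data (hive.task.partition.max.size)."""
    tasks, current, current_size = [], [], 0.0
    for name, size_gb in partitions:
        if current and (len(current) >= max_num
                        or current_size + size_gb > max_size_gb):
            tasks.append(current)          # close the full task
            current, current_size = [], 0.0
        current.append(name)
        current_size += size_gb
    if current:
        tasks.append(current)
    return tasks

# Five 2 GB daily partitions split into tasks of at most 5 GB each.
parts = [(f"ds=2024-01-{d:02d}", 2.0) for d in range(1, 6)]
print(split_into_tasks(parts, max_num=50, max_size_gb=5.0))
```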

Advanced settings -- custom parameters (optional)

| Parameter | Example value | Description |
| --- | --- | --- |
| mapreduce.map.speculative | false | Whether to enable speculative execution in the map phase. Default: true. |
| mapreduce.map.memory.mb | 8192 | Maximum memory (in MB) for a single map task |
| yarn.scheduler.maximum-allocation-mb | 8192 | Maximum memory YARN can allocate per request |
| hive.fetch.task.conversion | none | Force all HDFS reads through MapReduce tasks |
| mapreduce.task.timeout | 3600000 | Task timeout in milliseconds |
| mapreduce.job.running.map.limit | 100 | Maximum number of concurrent map tasks |
| mapreduce.max.split.size | 512000000 | Maximum split size in bytes |
| mapreduce.map.maxattempts | 0 | Maximum retry attempts for map tasks |
| hive.execution.engine | mr | The Hive execution engine to use |

2.3 Data verification

Configure verification rules to validate data integrity after migration.

  1. Create a verification engine.

  2. Define verification rules:

    • Left variable: Table name or data type in the table.

    • Logical condition: Equal to, contains, or does not contain.

    • Right variable: Keywords (when the left variable is a table name) or the source/target verification execution engine (when the left variable is a data type).

  3. Select a verification mode: data volume (count) comparison, metric comparison, weak content comparison, custom comparison, full text comparison, or percentage of null values comparison.

  4. Set the Tolerance for difference: Define acceptable variance rules. Add or remove rules as needed.
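The simplest verification mode, data volume (count) comparison with a tolerance, can be sketched as follows. This is an illustrative approximation (function name, counts, and percentage-based tolerance are assumptions), not the service's actual rule engine:

```python
def counts_match(source_count, target_count, tolerance_pct=0.0):
    """Count comparison: pass when the relative difference between the
    source and target row counts is within the tolerance percentage."""
    if source_count == 0:
        return target_count == 0
    diff_pct = abs(source_count - target_count) / source_count * 100
    return diff_pct <= tolerance_pct

print(counts_match(1_000_000, 1_000_000))                    # exact match passes
print(counts_match(1_000_000, 999_500, tolerance_pct=0.1))   # 0.05% difference passes
print(counts_match(1_000_000, 980_000, tolerance_pct=0.1))   # 2% difference fails
```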

2.4 SQL code conversion

Upload a table mapping file that specifies the mapping between source and target fully qualified names. For example, map the Hive qualified name db.tablename to the MaxCompute qualified name project.tablename.
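The effect of the mapping file on converted SQL can be sketched as a qualified-name substitution. This is a simplified illustration (the real converter parses the SQL rather than doing textual replacement, and the names here are placeholders):

```python
import re

def apply_table_mapping(sql, mapping):
    """Replace Hive qualified names (db.tablename) with their MaxCompute
    counterparts (project.tablename). Word boundaries prevent partial
    names from matching."""
    for source, target in mapping.items():
        sql = re.sub(rf"\b{re.escape(source)}\b", target, sql)
    return sql

mapping = {"sales_db.orders": "sales_project.orders"}
print(apply_table_mapping("SELECT * FROM sales_db.orders", mapping))
```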

2.5 Task scheduling migration

  1. Blacklist: Specify objects to skip, such as workflows or task nodes.

  2. Submission method: Submit through the migration assistant, or submit through the DataWorks OpenAPI (recommended).

  3. Target engine: Select the target engine information.

  4. Default conversion rules: Edit the JSON-based conversion rules. The default configuration:

{
  "format": "SPEC",
  "locale": "zh_CN",
  "skipUnSupportType": true,
  "transformContinueWithError": true,
  "specContinueWithError": true,
  "tempTaskTypes": [
    "SPARK",
    "HIVECLI"
  ],
  "skipTaskCodes": [],
  "settings": {
    "workflow.converter.shellNodeType": "EMR_SHELL",
    "workflow.converter.commandSqlAs": "EMR_SQL",
    "workflow.converter.sparkSubmitAs": "EMR_SPARK",
    "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
    "workflow.converter.mrNodeType": "EMR_MR",
    "workflow.converter.target.engine.type": "EMR",
    "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
      "POSTGRESQL": "EMR_HIVE",
      "MYSQL": "EMR_HIVE",
      "HIVE": "EMR_HIVE",
      "CLICKHOUSE": "CLICK_SQL"
    }
  }
}

Specify node types that require custom processing in the tempTaskTypes array. If the array is left empty, all nodes are converted according to the default rules defined in settings. For node types listed in tempTaskTypes, custom processing takes precedence and the original task content is preserved.
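The precedence above can be sketched as a lookup over the rule JSON. This is a hedged illustration using a subset of the default configuration shown; the converter's real logic covers many more node types:

```python
import json

# A subset of the default conversion rules shown above.
rules = json.loads("""{
  "tempTaskTypes": ["SPARK", "HIVECLI"],
  "settings": {
    "workflow.converter.shellNodeType": "EMR_SHELL",
    "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL"
  }
}""")

def convert_node_type(task_type, rules):
    """tempTaskTypes wins and preserves the original content; otherwise
    fall back to the settings mapping, then to the unknown-node default."""
    if task_type in rules["tempTaskTypes"]:
        return ("custom", task_type)  # original content preserved
    settings = rules["settings"]
    if task_type == "SHELL":
        return ("converted", settings["workflow.converter.shellNodeType"])
    return ("converted", settings["workflow.converter.target.unknownNodeTypeAs"])

print(convert_node_type("SPARK", rules))   # handled by custom rules
print(convert_node_type("SHELL", rules))   # mapped to EMR_SHELL
print(convert_node_type("FLINK", rules))   # unknown type uses the fallback
```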

  5. Custom conversion plugin and rules: To apply custom conversion rules, first download the conversion plugin project, customize it, and upload it.

Custom conversion rule example:

{
  "if.use.migrationx.before": false,
  "if.use.default.convert": true,
  "conf": [
    {
      "rule": {
        "settings": {
          "workflow.converter.shellNodeType": "DIDE_SHELL",
          "workflow.converter.commandSqlAs": "DIDE_SHELL",
          "workflow.converter.sparkSubmitAs": "ODPS_SPARK",
          "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
          "workflow.converter.mrNodeType": "ODPS_MR",
          "workflow.converter.target.engine.type": "MaxCompute",
          "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
            "POSTGRESQL": "ODPS_SQL",
            "MYSQL": "ODPS_SQL",
            "HIVE": "ODPS_SQL"
          }
        }
      },
      "nodes": "all, name, idsStr"
    }
  ]
}

Custom rules override default rules when conflicts exist. Configure multiple JSON segments to apply different conversion rules to different node ranges. This allows source shell tasks of different types to convert to different target node types based on the selected node scope.
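The key-by-key override behavior can be sketched as a dictionary merge. A minimal illustration under assumed semantics (settings keys are from the examples above; the merge helper is hypothetical):

```python
def merge_rules(default_settings, custom_settings):
    """Custom rules override defaults key by key; keys absent from the
    custom rule keep their default value."""
    merged = dict(default_settings)
    merged.update(custom_settings)
    return merged

default = {
    "workflow.converter.shellNodeType": "EMR_SHELL",
    "workflow.converter.mrNodeType": "EMR_MR",
}
custom = {"workflow.converter.shellNodeType": "DIDE_SHELL"}
print(merge_rules(default, custom))  # shell type overridden, MR type kept
```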

  6. Submission settings: Configure status preferences for elements submitted to the target DataWorks environment.

After completing all configuration, click Save.

Step 3: Run the migration pipeline

3.1 Start the pipeline

Click Running Status to view the end-to-end migration progress. Exploration tasks run according to the cron schedule configured in Step 2.1.

3.2 Cluster exploration

  • Data cluster exploration: The Exploration Agent scans the source cluster and reports results. View exploration result details for each run.

  • Scheduling cluster exploration: The agent scans scheduling workflows. View exploration result details for each run.

3.3 Data migration

The first exploration of a source data cluster automatically generates a full data migration task. Each subsequent exploration compares results with the previous run and generates incremental data migration tasks for changed objects.
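The full-versus-incremental behavior can be sketched as a diff between exploration snapshots. This is a simplified model (object names and fingerprints are assumptions; the service compares richer metadata):

```python
def incremental_objects(previous, current):
    """Compare two exploration snapshots, each mapping an object name to a
    fingerprint of its metadata, and return the objects an incremental
    migration task would cover: anything new or changed since the last run.
    With an empty previous snapshot this yields a full migration."""
    return sorted(
        name for name, fingerprint in current.items()
        if previous.get(name) != fingerprint
    )

first  = {"db1.orders": "v1", "db1.users": "v1"}
second = {"db1.orders": "v2", "db1.users": "v1", "db1.events": "v1"}
print(incremental_objects({}, first))      # first run: everything migrates
print(incremental_objects(first, second))  # later run: changed + new objects only
```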

Each atomic object (table or partition) follows this execution sequence:

| Phase | Action |
| --- | --- |
| 1. Schema migration | Create the table structure in the target |
| 2. Data migration | Transfer the data |
| 3. Data verification | Validate the migrated data |

A single object is considered successfully migrated only when all three phases complete and pass verification.
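The per-object success rule amounts to a conjunction over the three phases. A trivial sketch (phase names and status strings are illustrative):

```python
PHASES = ("schema_migration", "data_migration", "data_verification")

def object_migrated(phase_results):
    """An atomic object (table or partition) counts as migrated only when
    schema migration, data migration, and data verification all passed."""
    return all(phase_results.get(phase) == "success" for phase in PHASES)

print(object_migrated({p: "success" for p in PHASES}))       # fully migrated
print(object_migrated({"schema_migration": "success",
                       "data_migration": "success",
                       "data_verification": "failed"}))      # not migrated
```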

View migration task details to see statistical information for each task. For atomic objects that fail migration, click Retry to re-attempt them.

3.4 Scheduling migration

Each scheduling cluster exploration generates a batch of task instances to migrate. Manually execute the following:

  1. Run the conversion to transform source scheduling definitions into the target format.

  2. Review the conversion report.

  3. Submit the converted definitions to the target DataWorks environment.

What to do next

  • Monitor pipeline health using the Running Status view to track ongoing migrations and identify failed objects.

  • Fine-tune migration parameters for performance and reliability in the pipeline settings.

  • After the initial load completes, configure incremental migrations to keep source and target in sync.

  • Review the conversion report after scheduling migration to verify that all task nodes converted correctly.