Migration Hub: Lake warehouse migration center quick start

Last Updated: Jun 25, 2025

This topic describes how to quickly get started with the lake warehouse one-stop migration center.

Getting started: Create a migration scenario

Use the exploration Agent to discover the big data components and clusters on the customer's source side.

Add a migration scenario under the "One-stop Migration" menu. Configure the source data source, target data source, and the mapping relationship between the source and target. A migration scenario can include multiple data migrations and multiple scheduling cluster migrations.

Step 1. Create basic scenario information

Step 2. Create source data source

Create a data source: Create a source data source to be migrated. Data source types are divided into data cluster data sources and scheduling cluster data sources.

Hive data source configuration (as an example):

  • Data source name: Cannot contain special characters other than letters, numbers, and Chinese characters. Cannot be modified when editing. Must be globally unique.

  • Data source type: Hive

  • Hive JDBC URL: JDBC address for accessing HiveServer2, for example: jdbc:hive2://ip:port/database. Note that if Kerberos authentication is enabled, the JDBC connection string format is: jdbc:hive2://ip:port/database;principal=your principal

  • Database name: Hive database

  • Hive version: 1.x, 2.x, or 3.x

  • Hive logon method: Hive JDBC logon method; one of anonymous, LDAP, or Kerberos authentication

  • Configuration (format options): Mainly configures the required authentication data, such as the LDAP username and password or the authentication files required for Kerberos logon. This configuration item has multiple options:

      • If Hive JDBC uses a username and password to log on, select LDAP and fill in the username and password.

      • If Hive JDBC and the Metastore both have Kerberos authentication enabled, select Kerberos authentication and fill in the authentication information.

      • If Hive JDBC uses a username and password and the Metastore uses Kerberos authentication, both need to be filled in.

      • If Hive JDBC uses anonymous logon and the Metastore does not have Kerberos authentication enabled, neither needs to be selected.

    Subkeys:

      • LDAP - username: LDAP username

      • LDAP - password: LDAP password

      • Kerberos authentication - keytab file: Upload the keytab file

      • Kerberos authentication - conf file: Upload the conf file

      • Kerberos authentication - principal: Kerberos principal

  • Metadata type and access configuration:

    1. Hive Metastore

      • Does metadata access require Kerberos authentication: Check whether hive.metastore.sasl.enabled is true in the cluster configuration. If so, select Kerberos authentication here and configure the corresponding Kerberos authentication files in the configuration above.

      • metastoreUris: Hive Metastore URI, for example: thrift://10.0.2.165:9083

    2. DLF (obtain the Metastore connection through DLF)

      • Endpoint: Endpoint for accessing DLF; the machine where the tool runs must be able to reach it

      • AccessKey ID: AccessKey ID used to access DLF

      • AccessKey Secret: AccessKey Secret used to access DLF

      • Data catalog: The catalog where the data to be verified is located. Note: the CatalogId must correspond to the JDBC URL above

      • DLF Settings: Other DLF configuration information

    3. MySQL metadata warehouse

      • jdbcURL: Example: jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8

      • Username

      • Password

    4. PG metadata warehouse

      • jdbcURL: Example: jdbc:postgresql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8

      • Username

      • Password

    5. AWS Glue

      • AccessKey ID

      • Access Secret

      • Region
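For reference, the Hive JDBC URL for each logon method might look like the following (host, port, database, and principal are placeholders):

  • Anonymous or LDAP logon: jdbc:hive2://10.0.0.1:10000/default

  • Kerberos logon: jdbc:hive2://10.0.0.1:10000/default;principal=hive/_HOST@EXAMPLE.COM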

MaxCompute data source configuration:

  • Data source name: Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing.

  • Data source type: MaxCompute

  • ODPS Endpoint: Endpoint for accessing MaxCompute (mc.endpoint); the machine where the tool runs must be able to reach it. Example: http://service.cn-shanghai.maxcompute.aliyun.com/api

  • AccessKey ID: AccessKey ID used to access MaxCompute

  • AccessKey Secret: AccessKey Secret used to access MaxCompute

  • Default Project: MMA uses this project's quota to execute SQL on MaxCompute

  • Tunnel Endpoint: Data verification does not depend on this item; any value can be filled in

Scheduling cluster data source configuration (DolphinScheduler as an example):

You can either directly upload a zip package of source information from the DolphinScheduler system or connect to its API to obtain the relevant migration information.

  • Data source name: Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing.

  • Data source type: DolphinScheduler

  • Version: DolphinScheduler version selection

  • Connection address: Access address of the DolphinScheduler system service

  • Token: Access token (AccessToken). When using API operations to obtain data from the DolphinScheduler system, an access token obtained through mechanisms such as OAuth 2.0 is required.

  • Project name: The DolphinScheduler project space information required to access the related APIs
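As a rough illustration, the connection address is typically the base URL of the DolphinScheduler API service, for example http://10.0.0.2:12345/dolphinscheduler (host and port are placeholders; the actual port and context path depend on your deployment).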

After creating the data source, you need to test whether the connectivity is normal.

Step 3. Create target data source

The creation method is the same as in Step 2.

Step 4. Associate target

This list takes the source data source (cluster) as the main perspective and displays its association with the target data source (cluster). Click "Associate Target Data Source" in the operation column to select the corresponding target data source for association. This mapping will serve as the source-to-target correspondence during the actual migration.

Click "Complete" to save.

Set up Pipeline (one-stop migration)

After creating the scenario, you need to set up the migration Pipeline.

We will initialize the migration Pipeline DSL based on the migration scenario. You only need to configure migration preferences in the following interface:

1. Cluster exploration settings

Data cluster exploration: taking Hive to MC migration as an example, cron expressions can be configured for scheduled exploration runs.

Scheduling cluster exploration: taking Dolphin to DataWorks migration as an example, cron expressions can be configured for scheduled exploration runs.
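For example, to run exploration once a day at 02:00, the cron expression might look like one of the following, depending on the cron syntax the scheduler accepts (a standard five-field expression first, a Quartz-style equivalent second; both are illustrative assumptions):

  0 2 * * *

  0 0 2 * * ?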

2. Metadata migration & data migration settings

Taking the Hive2MC scenario as an example, metadata migration and data migration settings are reused.

1) Fill in the mapping from source Hive DB to target MaxCompute Project. You can click "Download Template", maintain the information in the template, and then upload it.

2) Blacklist: objects listed here are ignored during migration. Use the format dbname.tablename or dbname.* (representing all tables under the db).

3) Select migration method

Hive UDTF(Tunnel):

  • 1. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.

  • 2. Create tables and partitions on the MaxCompute side based on the obtained metadata information (and the migration scope issued by the workbench task instructions).

  • 3. Submit SQL commands that execute the UDTF to the source Hive cluster.

  • 4. UDTF calls MaxCompute's Tunnel SDK to write data to the target.

OSS transfer (MaxCompute external table to internal table):

  • 1. Use Alibaba Cloud OSS Data Online Migration service to migrate Hive data (HDFS/S3/OBS, etc.) to Alibaba Cloud OSS.

  • 2. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.

  • 3. Create OSS external tables and corresponding standard tables in the target MaxCompute based on the obtained Hive metadata and OSS path information (corresponding to HDFS).

  • 4. Query data from the external tables and insert it into the standard tables (insert into the standard table by selecting from the OSS external table), importing the data from OSS into MaxCompute; see the sketch below.
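A minimal sketch of step 4, assuming a hypothetical internal table orders and its OSS external table orders_oss_ext (actual project and table names come from your migration configuration):

  -- Load data from the OSS external table into the MaxCompute internal (standard) table
  INSERT OVERWRITE TABLE my_project.orders
  SELECT * FROM my_project.orders_oss_ext;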

4) Advanced settings

Global parameters:

  • hive.task.partition.max.num=50: Maximum number of partitions processed by a single task (optional)

  • hive.task.partition.max.size=5: Maximum data size processed by a single task, in GB (optional)

Custom parameters, for example:

  • mapreduce.map.speculative=false: Whether to enable speculative execution in the map phase; the default is true

  • mapreduce.map.memory.mb=8192: Upper limit on the resources a single map task can use

  • yarn.scheduler.maximum-allocation-mb=8192: Maximum memory that YARN can allocate to a single request

  • hive.fetch.task.conversion=none: All queries that read from HDFS go through MapReduce tasks

  • mapreduce.task.timeout=3600000: Task timeout (unit: ms)

  • mapreduce.job.running.map.limit=100: Number of map tasks executed simultaneously

  • mapreduce.max.split.size=512000000: Maximum input split size, in bytes

  • mapreduce.map.maxattempts=0: Maximum number of attempts for a map task

  • hive.execution.engine=mr: Use MapReduce as the Hive execution engine

3. Data verification settings

1) Create a new verification engine

2) Select verification rules:

  • Left variable: table name, or a data type in the table

  • Logical condition: equal to, contains, does not contain

  • Right variable: when the left variable is the table name, fill in keywords; when the left variable is a data type, fill in the data type

  • Source verification execution engine and target verification engine: select the engines used to run the verification on the source and target sides

  • Verification mode: data volume (count) comparison, metric comparison, weak content comparison, custom comparison, full text comparison, percentage-of-null-values comparison

  • Difference tolerance rate: supports filling in rules, and supports adding rows and deleting existing rows
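For instance, a rule such as "left variable = table name, logical condition = does not contain, right variable = tmp, verification mode = data volume (count) comparison" would count-compare every table whose name does not contain "tmp" (the keyword here is only an illustrative placeholder).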

4. SQL code conversion settings

Upload a table mapping file that provides the mapping between the source and target table triples, for example the mapping between db.tablename in the Hive cluster and project.tablename in the target MaxCompute cluster.
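Conceptually, each entry maps one source table to one target table, for example hive_sales_db.orders -> sales_project.orders (the names here are placeholders; the actual file layout follows the template provided by the tool).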

5. Task scheduling migration settings

1) Blacklist: Supports setting blacklists for objects such as workflow, tasknode, etc.

2) Supports submission through the migration assistant; submission through the DataWorks OpenAPI is also supported (recommended)

3) Select the target engine information

4) Edit default rules

{
  "format": "SPEC",
  "locale": "zh_CN",
  "skipUnSupportType": true,
  "transformContinueWithError": true,
  "specContinueWithError": true,
  "tempTaskTypes": [
    "SPARK",
    "HIVECLI"
  ],
  "skipTaskCodes": [],
  "settings": {
    "workflow.converter.shellNodeType": "EMR_SHELL",
    "workflow.converter.commandSqlAs": "EMR_SQL",
    "workflow.converter.sparkSubmitAs": "EMR_SPARK",
    "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
    "workflow.converter.mrNodeType": "EMR_MR",
    "workflow.converter.target.engine.type": "EMR",
    "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
      "POSTGRESQL": "EMR_HIVE",
      "MYSQL": "EMR_HIVE",
      "HIVE": "EMR_HIVE",
      "CLICKHOUSE": "CLICK_SQL"
    }
  }
}

Fill in tempTaskTypes with the node types that need custom processing. If it is left empty, nodes are converted according to the default rules (when corresponding node conversion rules exist in settings). If it is filled in, the custom rules prevail and the original node content is preserved.

5) Custom conversion plugin & rules

To use custom conversion rules, you need to first download the custom conversion plugin project and upload it before it can be used.

Custom conversion rule configuration:

{
  "if.use.migrationx.before": false,
  "if.use.default.convert": true,
  "conf": [
    {
      "rule": {
        "settings": {
          "workflow.converter.shellNodeType": "DIDE_SHELL",
          "workflow.converter.commandSqlAs": "DIDE_SHELL",
          "workflow.converter.sparkSubmitAs": "ODPS_SPARK",
          "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
          "workflow.converter.mrNodeType": "ODPS_MR",
          "workflow.converter.target.engine.type": "MaxCompute",
          "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
            "POSTGRESQL": "ODPS_SQL",
            "MYSQL": "ODPS_SQL",
            "HIVE": "ODPS_SQL"
          }
        }
      },
      "nodes": "all, name, idsStr"
    }
  ]
}

You can customize node type conversion rules in the project. When there is a conflict with the default rules, the custom rules prevail.

You can configure multiple JSON segments, each selecting the conversion rules applied to a chosen range of nodes. This way, different types of shell nodes on the source side can be converted to different node types on the target side by selecting the appropriate nodes range; see the sketch below.
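A hedged sketch of two rule segments, each applying different settings to a different nodes selection (the "nodes" values are placeholders; use the format shown above, i.e. all, name, or idsStr):

{
  "if.use.migrationx.before": false,
  "if.use.default.convert": true,
  "conf": [
    {
      "rule": {
        "settings": {
          "workflow.converter.shellNodeType": "DIDE_SHELL"
        }
      },
      "nodes": "<first nodes selection>"
    },
    {
      "rule": {
        "settings": {
          "workflow.converter.shellNodeType": "EMR_SHELL"
        }
      },
      "nodes": "<second nodes selection>"
    }
  ]
}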

6) Submission settings

Manages preferences for the status of the corresponding elements when they are submitted to the target DataWorks. After completing the above configuration, click Save.

Run Pipeline one-stop migration

1. Start

Click "Running Status" to view the one-stop migration progress. You can see the corresponding exploration tasks according to the time set for the cluster exploration task, and the results of each exploration.

2. Cluster exploration

Data cluster exploration example:

Exploration result details:

Scheduling cluster exploration example:

Exploration result details:

3. Data migration

The first exploration of the source cluster's data automatically generates a full (stock) data migration task. Each subsequent exploration is compared with the results of the previous cluster exploration to determine the incremental part, and incremental data migration tasks are generated automatically. Atomic-level objects (tables or partitions) are executed in the order schema migration -> data migration -> data verification. When all three tasks are completed and verification passes, the migration of a single object is considered successful.

You can view the details of this migration task through the following entry:

Taking the data migration details interface as an example:

Supports statistical information display for a single task and provides a "retry" entry for atomic objects that failed migration.

4. Scheduling migration

Each exploration of the source cluster's scheduling automatically generates the corresponding batch of task instances to be migrated. You need to manually run the conversion and submit them to the target.

Conversion report: