This guide walks through an end-to-end migration from Apache Hive to MaxCompute using the Lakehouse Migration Center. The migration pipeline covers cluster exploration, metadata migration, data migration, data verification, SQL code conversion, and task scheduling migration.
Prerequisites
- An Alibaba Cloud account with RAM permissions for Migration Hub
- AccessKey ID and AccessKey Secret for Alibaba Cloud services
- Network connectivity between the migration tool machine and the source cluster (Apache Hive, Apache DolphinScheduler, or other supported systems)
- Network connectivity to target Alibaba Cloud services (MaxCompute, DataWorks)
- Access credentials for source systems (JDBC URLs, Kerberos keytab files, or LDAP credentials as applicable)
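The connectivity prerequisites above can be pre-checked from the migration tool machine before any configuration work begins. The sketch below is illustrative and not part of the migration tool: `parse_endpoint` and `check_tcp` are hypothetical helpers that test whether a service address, such as a Hive Metastore thrift URL, is reachable over TCP.

```python
import socket
from urllib.parse import urlparse

def parse_endpoint(url: str) -> tuple[str, int]:
    """Extract (host, port) from a service URL such as thrift://10.0.2.165:9083."""
    parsed = urlparse(url)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"endpoint must include host and port: {url}")
    return parsed.hostname, parsed.port

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host, port = parse_endpoint("thrift://10.0.2.165:9083")
    print(f"Metastore reachable: {check_tcp(host, port)}")
```

Run this once per source and target endpoint; a failed check usually points at a security group, firewall, or VPC routing issue rather than a credential problem.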
Migration workflow
| Phase | Action |
|---|---|
| 1. Create a migration scenario | Define source and target data sources and their mapping relationships |
| 2. Configure the migration pipeline | Set preferences for cluster exploration, data migration, data verification, SQL conversion, and scheduling migration |
| 3. Run the pipeline | Execute the migration and monitor progress |
Step 1: Create a migration scenario
A migration scenario defines source data sources, target data sources, and the mapping between them. A single scenario supports multiple data migrations and multiple scheduling cluster migrations.
Before creating a scenario, use the Exploration Agent to discover big data components and clusters in the source environment.
Go to One-stop Migration in the console and add a migration scenario.
1a. Enter basic scenario information
Provide a name and description for the migration scenario.
1b. Create a source data source
Data sources fall into two categories:
| Category | Description | Example |
|---|---|---|
| Data cluster data sources | Storage and compute clusters | Apache Hive, HDFS |
| Scheduling cluster data sources | Workflow orchestration systems | Apache DolphinScheduler |
Hive data source fields
| Field | Subfield | Description | Required |
|---|---|---|---|
| Data source name | | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | | Select Hive. | Yes |
| Hive JDBC URL | | JDBC address for HiveServer2. Format: `jdbc:hive2://<ip>:<port>/<database>`. With Kerberos: `jdbc:hive2://<ip>:<port>/<database>;principal=<your_principal>` | Yes |
| Database name | | The Hive database to migrate. | Yes |
| Hive version | | Select 1.x, 2.x, or 3.x. | Yes |
| Hive logon method | | JDBC authentication method: anonymous, LDAP, or Kerberos. | Yes |
| Configuration | | Authentication data for the selected logon method. See the authentication scenarios below. | Conditional |
| | LDAP - username | LDAP username. | If LDAP |
| | LDAP - password | LDAP password. | If LDAP |
| | Kerberos - keytab file | Upload the keytab file. | If Kerberos |
| | Kerberos - conf file | Upload the Kerberos conf file. | If Kerberos |
| | Kerberos - principal | The Kerberos principal. | If Kerberos |
Authentication scenarios:
| JDBC authentication | Metastore authentication | Configuration |
|---|---|---|
| Username and password | Any | Select LDAP and enter the username and password |
| Kerberos | Kerberos | Select Kerberos authentication and enter the authentication details |
| Username and password | Kerberos | Enter both LDAP and Kerberos details |
| Anonymous | No Kerberos | No additional configuration required |
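Because the JDBC URL format above drives which authentication fields are required, it can help to validate the URL before saving the data source. The parser below is a hypothetical illustration of the documented format, not part of the migration tool; `parse_hive_jdbc` is an assumed helper name.

```python
import re

# Matches jdbc:hive2://<ip>:<port>/<database> with an optional
# ;principal=<principal> suffix used for Kerberos authentication.
_HIVE_JDBC = re.compile(
    r"^jdbc:hive2://(?P<host>[^:/;]+):(?P<port>\d+)/(?P<db>[^;]*)"
    r"(?:;principal=(?P<principal>.+))?$"
)

def parse_hive_jdbc(url: str) -> dict:
    """Split a HiveServer2 JDBC URL into host, port, database, and principal."""
    m = _HIVE_JDBC.match(url)
    if not m:
        raise ValueError(f"not a valid hive2 JDBC URL: {url}")
    parts = m.groupdict()
    parts["port"] = int(parts["port"])
    return parts
```

A URL that parses with a non-empty `principal` implies the Kerberos fields in the Configuration section must be filled in.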
Hive metadata access
Select a metadata type:
| Metadata type | Key fields | Notes |
|---|---|---|
| Hive Metastore | `metastoreUris`: Hive Metastore URL (e.g., `thrift://10.0.2.165:9083`). If `hive.metastore.sasl.enabled` is true in the cluster configuration, select Kerberos authentication and provide the keytab, conf, and principal in the Configuration section above. | Direct connection to the Hive Metastore |
| DLF | Endpoint: the DLF access endpoint (the migration tool machine must have connectivity). AccessKey ID / AccessKey Secret: credentials for DLF access. Data catalog: the catalog containing the data to verify. The CatalogId must match the JDBC URL above. DLF Settings: additional DLF configuration. | Access the Metastore through Data Lake Formation |
| MySQL metadata warehouse | `jdbcURL`: e.g., `jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8`. Username and password. | MySQL-backed Hive Metastore |
| PostgreSQL metadata warehouse | `jdbcURL`: e.g., `jdbc:postgresql://rm-xxxx.pg.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8`. Username and password. | PostgreSQL-backed Hive Metastore |
| AWS Glue | AccessKey ID, AccessKey Secret, Region. | AWS Glue Data Catalog |
MaxCompute data source fields
| Field | Description | Required |
|---|---|---|
| Data source name | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | Select MaxCompute. | Yes |
| ODPS Endpoint | The MaxCompute endpoint. ODPS (Open Data Processing Service) is the former name of MaxCompute. Example: `http://service.cn-shanghai.maxcompute.aliyun.com/api` | Yes |
| AccessKey ID | AccessKey ID for MaxCompute access. | Yes |
| AccessKey Secret | AccessKey Secret for MaxCompute access. | Yes |
| Default Project | MaxCompute Migration Assist (MMA) uses this project's quota to run SQL on MaxCompute. | Yes |
| Tunnel Endpoint | The MaxCompute Tunnel endpoint. Not required for data verification. | No |
DolphinScheduler data source fields
Provide the source information either by uploading a zip package or by connecting through the API.
| Field | Description | Required |
|---|---|---|
| Data source name | Must be globally unique. Digits, English letters, and underscores only. Cannot be changed after creation. | Yes |
| Data source type | Select DolphinScheduler. | Yes |
| Version | The DolphinScheduler version. | Yes |
| Connection address | The DolphinScheduler service access URL. | Yes |
| Token | Access token obtained through OAuth 2.0 for API-based data retrieval. | Yes |
| Project name | The DolphinScheduler project space required for API access. | Yes |
After creating each data source, test the connection to verify connectivity.
1c. Create a target data source
Create the target data source using the same process described in Step 1b. For a Hive-to-MaxCompute migration, create a MaxCompute data source as the target.
1d. Associate source and target data sources
The association list displays source data sources (clusters) and their relationships to target data sources (clusters).
1. In the Operation column, click Associate Target Data Source.
2. Select the target data source for the association.
3. Click Complete to save.
Step 2: Configure the migration pipeline
The system initializes a pipeline DSL (Domain-Specific Language) based on the migration scenario. Configure migration preferences in the following sections.
2.1 Cluster exploration
Configure how and when the system discovers data and scheduling clusters.
| Exploration type | Example | Scheduling |
|---|---|---|
| Data cluster exploration | Hive-to-MaxCompute migration | Supports cron expressions for scheduled discovery |
| Scheduling cluster exploration | DolphinScheduler-to-DataWorks migration | Supports cron expressions for scheduled discovery |
2.2 Metadata migration and data migration
The following example uses the Hive-to-MaxCompute (Hive2MC) scenario. Metadata migration and data migration share the same settings.
Source-to-target mapping
Specify the mapping between source Hive databases and target MaxCompute projects. To configure in bulk, click Download Template, fill in the mapping information, and upload the file.
Blacklist
Specify tables to skip during migration. Use the format `dbname.tablename` for individual tables, or `dbname.*` to exclude all tables in a database.
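The two blacklist forms above can be evaluated with a simple matcher. This is an illustrative sketch of the documented semantics, not the migration tool's implementation; `is_blacklisted` is an assumed name.

```python
def is_blacklisted(qualified_name: str, blacklist: list[str]) -> bool:
    """Check a dbname.tablename string against entries like
    'sales.tmp_orders' (one table) or 'staging.*' (whole database)."""
    db, _, table = qualified_name.partition(".")
    for entry in blacklist:
        entry_db, _, entry_table = entry.partition(".")
        if entry_db != db:
            continue
        if entry_table == "*" or entry_table == table:
            return True
    return False
```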
Migration method
| Method | Steps |
|---|---|
| Hive UDTF (Tunnel) | 1. Retrieve Hive metadata through Hive Metastore (table names, schemas, partitions, data volume, update times). 2. Create tables and partitions in MaxCompute. 3. Submit SQL commands that run a UDTF on the source Hive cluster. 4. The UDTF calls the MaxCompute Tunnel SDK to write data to the target. |
| OSS transfer | 1. Use the Alibaba Cloud OSS Data Online Migration service to move Hive data from HDFS, Amazon S3, or OBS to Alibaba Cloud OSS. 2. Retrieve Hive metadata through Hive Metastore. 3. Create OSS external tables and corresponding internal (standard) tables in MaxCompute based on Hive metadata and OSS path mappings. 4. Run INSERT INTO <standard_table> SELECT * FROM <oss_external_table> to import data from OSS into MaxCompute. |
Advanced settings -- global parameters (optional)
| Parameter | Default | Description |
|---|---|---|
| `hive.task.partition.max.num` | 50 | Maximum number of partitions processed by a single task |
| `hive.task.partition.max.size` | 5 | Maximum data size (in GB) processed by a single task |
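The two limits above bound each migration task in both directions: partition count and total data size. The sketch below shows one way the limits can interact when partitions are grouped into tasks; it assumes a simple greedy batching strategy and is not the tool's actual scheduler.

```python
def batch_partitions(sizes_gb: list[float], max_num: int = 50, max_size_gb: float = 5.0):
    """Greedily group partition sizes into tasks so that no task exceeds
    max_num partitions or max_size_gb of data. A single partition larger
    than max_size_gb still gets a task of its own."""
    tasks, current, current_size = [], [], 0.0
    for size in sizes_gb:
        if current and (len(current) >= max_num or current_size + size > max_size_gb):
            tasks.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        tasks.append(current)
    return tasks
```

Lowering either parameter produces more, smaller tasks, which reduces per-task retry cost at the price of more task overhead.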
Advanced settings -- custom parameters (optional)
| Parameter | Example value | Description |
|---|---|---|
| `mapreduce.map.speculative` | false | Whether to enable speculative execution in the map phase. Default: true; set to false to disable. |
| `mapreduce.map.memory.mb` | 8192 | Maximum memory (in MB) for a single map task |
| `yarn.scheduler.maximum-allocation-mb` | 8192 | Maximum memory (in MB) YARN can allocate per container request |
| `hive.fetch.task.conversion` | none | Setting this to none forces all HDFS reads through MapReduce tasks instead of direct fetch |
| `mapreduce.task.timeout` | 3600000 | Task timeout in milliseconds |
| `mapreduce.job.running.map.limit` | 100 | Maximum number of concurrently running map tasks |
| `mapreduce.max.split.size` | 512000000 | Maximum input split size in bytes |
| `mapreduce.map.maxattempts` | 0 | Maximum retry attempts for map tasks |
| `hive.execution.engine` | mr | The Hive execution engine to use |
2.3 Data verification
Configure verification rules to validate data integrity after migration.
1. Create a verification engine.
2. Define verification rules:
   - Left variable: table name or data type in the table.
   - Logical condition: equal to, contains, or does not contain.
   - Right variable: keywords (when the left variable is a table name) or the source/target verification execution engine (when the left variable is a data type).
3. Select a verification mode: data volume (count) comparison, metric comparison, weak content comparison, custom comparison, full text comparison, or percentage-of-null-values comparison.
4. Set the tolerance for difference: define acceptable variance rules, adding or removing rules as needed.
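As a concrete illustration of the data volume (count) comparison mode with a tolerance, the helper below passes verification when the row-count difference stays within a given percentage of the source count. The function name and tolerance semantics are assumptions for illustration, not the product's exact rule engine.

```python
def counts_match(source_count: int, target_count: int, tolerance_pct: float = 0.0) -> bool:
    """Pass if the row-count difference is within tolerance_pct percent
    of the source count; a zero tolerance requires an exact match."""
    if source_count == 0:
        return target_count == 0
    diff_pct = abs(source_count - target_count) / source_count * 100
    return diff_pct <= tolerance_pct
```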
2.4 SQL code conversion
Upload a table mapping file that specifies the mapping between source and target fully qualified names. For example, map the Hive qualified name `db.tablename` to the MaxCompute qualified name `project.tablename`.
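Conceptually, the converter applies such a mapping to every qualified table reference in the SQL text. The naive regex rewrite below sketches that idea under stated assumptions: it is not the converter's actual algorithm, and it would mis-handle names appearing inside string literals or comments.

```python
import re

def rewrite_qualified_names(sql: str, mapping: dict[str, str]) -> str:
    """Replace source db.tablename references with target project.tablename."""
    # Process longer names first to avoid partial overlaps between entries.
    for src in sorted(mapping, key=len, reverse=True):
        sql = re.sub(rf"\b{re.escape(src)}\b", mapping[src], sql)
    return sql
```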
2.5 Task scheduling migration
- Blacklist: Specify objects to skip, such as workflows or task nodes.
- Submission method: Submit through the migration assistant, or submit through the DataWorks OpenAPI (recommended).
- Target engine: Select the target engine information.
- Default conversion rules: Edit the JSON-based conversion rules. The default configuration:
```json
{
  "format": "SPEC",
  "locale": "zh_CN",
  "skipUnSupportType": true,
  "transformContinueWithError": true,
  "specContinueWithError": true,
  "tempTaskTypes": [
    "SPARK",
    "HIVECLI"
  ],
  "skipTaskCodes": [],
  "settings": {
    "workflow.converter.shellNodeType": "EMR_SHELL",
    "workflow.converter.commandSqlAs": "EMR_SQL",
    "workflow.converter.sparkSubmitAs": "EMR_SPARK",
    "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
    "workflow.converter.mrNodeType": "EMR_MR",
    "workflow.converter.target.engine.type": "EMR",
    "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
      "POSTGRESQL": "EMR_HIVE",
      "MYSQL": "EMR_HIVE",
      "HIVE": "EMR_HIVE",
      "CLICKHOUSE": "CLICK_SQL"
    }
  }
}
```

Specify node types that require custom processing in the `tempTaskTypes` array. If the array is left empty, nodes are converted according to the default rules defined in `settings`. If node types are specified, the custom rules take precedence and the original task content is preserved.
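The precedence just described can be summarized as: `tempTaskTypes` wins, then the per-type `settings` mapping, then the unknown-type fallback. The sketch below re-implements that documented behavior for illustration only; the key-derivation scheme is simplified and is not the converter's actual code.

```python
def resolve_node_type(task_type: str, settings: dict, temp_task_types: list[str]) -> str:
    """Mirror the documented precedence: types listed in tempTaskTypes keep
    their original content for custom processing; everything else falls back
    to the default mapping in settings, then to the unknown-type default."""
    if task_type in temp_task_types:
        return task_type  # original content preserved for custom handling
    key = f"workflow.converter.{task_type.lower()}NodeType"
    unknown = settings.get("workflow.converter.target.unknownNodeTypeAs", "DIDE_SHELL")
    return settings.get(key, unknown)
```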
Custom conversion plugin and rules: To apply custom conversion rules, first download the conversion plugin project, customize it, and upload it.
Custom conversion rule example:
```json
{
  "if.use.migrationx.before": false,
  "if.use.default.convert": true,
  "conf": [
    {
      "rule": {
        "settings": {
          "workflow.converter.shellNodeType": "DIDE_SHELL",
          "workflow.converter.commandSqlAs": "DIDE_SHELL",
          "workflow.converter.sparkSubmitAs": "ODPS_SPARK",
          "workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
          "workflow.converter.mrNodeType": "ODPS_MR",
          "workflow.converter.target.engine.type": "MaxCompute",
          "workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
            "POSTGRESQL": "ODPS_SQL",
            "MYSQL": "ODPS_SQL",
            "HIVE": "ODPS_SQL"
          }
        }
      },
      "nodes": "all, name, idsStr"
    }
  ]
}
```

Custom rules override default rules when conflicts exist. Configure multiple JSON segments to apply different conversion rules to different node ranges. This allows source shell tasks of different types to convert to different target node types based on the selected node scope.
Submission settings: Configure status preferences for elements submitted to the target DataWorks environment.
After completing all configuration, click Save.
Step 3: Run the migration pipeline
3.1 Start the pipeline
Click Running Status to view the end-to-end migration progress. Exploration tasks run according to the cron schedule configured in Step 2.1.
3.2 Cluster exploration
Data cluster exploration: The Exploration Agent scans the source cluster and reports results. View exploration result details for each run.
Scheduling cluster exploration: The agent scans scheduling workflows. View exploration result details for each run.
3.3 Data migration
The first exploration of a source data cluster automatically generates a full data migration task. Each subsequent exploration compares its results with the previous run and generates incremental data migration tasks only for the objects that changed.
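The incremental behavior amounts to diffing two exploration snapshots. The sketch below illustrates that idea under a stated assumption: each snapshot is a dict mapping object name to its last-update timestamp. It is not the tool's internal representation.

```python
def incremental_objects(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return objects that are new, or whose update time changed,
    since the previous exploration run."""
    return sorted(
        name for name, updated in current.items()
        if previous.get(name) != updated
    )
```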
Each atomic object (table or partition) follows this execution sequence:
| Phase | Action |
|---|---|
| 1. Schema migration | Create the table structure in the target |
| 2. Data migration | Transfer the data |
| 3. Data verification | Validate the migrated data |
A single object is considered successfully migrated only when all three phases complete and pass verification.
View migration task details to see statistical information for each task. For atomic objects that fail migration, click retry to re-attempt.
3.4 Scheduling migration
Each scheduling cluster exploration generates a batch of task instances to migrate. Manually execute the following:
1. Run the conversion to transform source scheduling definitions into the target format.
2. Review the conversion report.
3. Submit the converted definitions to the target DataWorks environment.
What to do next
- Monitor pipeline health using the Running Status view to track ongoing migrations and identify failed objects.
- Fine-tune migration parameters for performance and reliability in the pipeline settings.
- After the initial load completes, configure incremental migrations to keep source and target in sync.
- Review the conversion report after scheduling migration to verify that all task nodes converted correctly.