This topic describes how to quickly get started with the lake warehouse one-stop migration center.
Getting started: Create a migration scenario
Use the exploration Agent to discover the big data components and clusters on the customer's source side.
Add a migration scenario under the "One-stop Migration" menu. Configure the source data source, target data source, and the mapping relationship between the source and target. A migration scenario can include multiple data migrations and multiple scheduling cluster migrations.
Step 1. Create basic scenario information
Step 2. Create source data source
Create a data source: Create a source data source to be migrated. Data source types are divided into data cluster data sources and scheduling cluster data sources.
Hive data source configuration (as an example):
Configuration item name | Subkey name | Description |
Data source name | | Cannot contain special characters other than letters, numbers, and Chinese characters. Cannot be modified when editing. Must be globally unique. |
Data source type | | Hive |
Hive JDBC URL | | JDBC address for accessing HiveServer2, for example: jdbc:hive2://ip:port/database. Note that if Kerberos authentication is enabled, the JDBC connection string format is: jdbc:hive2://ip:port/database;principal=your principal |
Database name | | Hive database |
Hive version | | 1.x, 2.x, 3.x |
Hive login method | | Hive JDBC login method: anonymous, LDAP, or Kerberos authentication |
Configuration | Format options | Configures the required authentication data, such as the LDAP username and password or the authentication files required for Kerberos login. This configuration item has multiple options. |
| LDAP - username | LDAP username |
| LDAP - password | LDAP password |
| Kerberos authentication - keytab file | Upload the keytab file |
| Kerberos authentication - conf file | Upload the conf file |
| Kerberos authentication - principal | Kerberos principal |
Metadata type and access configuration | 1. Hive Metastore | |
| Does metadata access require Kerberos authentication | Check whether hive.metastore.sasl.enabled is true in the cluster configuration. If so, select Kerberos authentication here and configure the corresponding Kerberos authentication file information in the Configuration item above. |
| metastoreUris | Hive Metastore URI, for example: thrift://10.0.2.165:9083 |
| 2. DLF | Obtain the Metastore connection through DLF |
| Endpoint | Endpoint for accessing DLF. The machine where the tool is located must be able to connect to it. |
| AccessKey ID | AccessKey ID used to access DLF |
| AccessKey Secret | AccessKey Secret used to access DLF |
| Data catalog | The catalog where the data to be verified is located. Note: the CatalogId must correspond to the JDBC URL above. |
| DLF Settings | Other DLF configuration information |
| 3. MySQL metadata warehouse | |
| jdbcURL | Example: jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8 |
| Username | |
| Password | |
| 4. PG metadata warehouse | |
| jdbcURL | Example: jdbc:postgresql://rm-xxxx.mysql.rds.aliyuncs.com/hive_meta_db?createDatabaseIfNotExist=true&characterEncoding=UTF-8 |
| Username | |
| Password | |
| 5. AWS Glue | |
| AccessKey ID | |
| Access Secret | |
| Region | |
MaxCompute data source configuration:
Configuration item name | Description |
Data source name | Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing |
Data source type | MaxCompute |
ODPS Endpoint | Endpoint (mc.endpoint) for accessing MaxCompute. The machine where the tool is located must be able to connect to it. Example: http://service.cn-shanghai.maxcompute.aliyun.com/api |
AccessKey ID | AccessKey ID used to access MaxCompute |
AccessKey Secret | AccessKey Secret used to access MaxCompute |
Default Project | MMA will use this project's quota to execute SQL on MaxCompute |
Tunnel Endpoint | Data verification does not depend on this item; any value can be filled in. |
Scheduling cluster data source, with DolphinScheduler as an example:
You can either directly upload a zip package of source information exported from the DolphinScheduler system, or connect to its API to obtain the relevant migration information.
Configuration item name | Description |
Data source name | Cannot contain special characters other than English letters, numbers, and Chinese characters. Cannot be modified when editing |
Data source type | DolphinScheduler |
Version | DolphinScheduler version selection |
Connection address | DolphinScheduler system service access address |
Token | Access Token (AccessToken): when using API operations to obtain data from the DolphinScheduler system, an access token obtained through mechanisms such as OAuth 2.0 is required. |
Project name | The project space information in the DolphinScheduler system, required when accessing the related API |
After creating the data source, you need to test its connectivity.
Step 3. Create target data source
Creation method is the same as Step 2.
Step 4. Associate target
This list takes the source data source (cluster) as the main perspective and displays its association with the target data source (cluster). Click "Associate Target Data Source" in the operation column and select the corresponding target data source to associate. This mapping will serve as the source-to-target correspondence during the actual migration.
Click "Complete" to save.
Set up Pipeline (one-stop migration)
After creating the scenario, you need to set it up.
We will initialize the migration Pipeline DSL based on the migration scenario; you only need to configure migration preferences in the following interface:
1. Cluster exploration settings
Data cluster exploration: Taking Hive to MC migration as an example, supports configuring cron expressions for timed scheduling.
Scheduling cluster exploration: Taking Dolphin to DataWorks migration as an example, supports configuring cron expressions for timed scheduling.
2. Metadata migration & data migration settings
Taking the Hive2MC scenario as an example, metadata migration and data migration share the same settings.
1) Fill in the mapping relationship from source Hive DB to target MaxCompute Project. You can click "Download Template", fill in the information, and then upload the file.
2) Blacklist: objects listed here are ignored during migration. Use the format dbname.tablename, or dbname.* (representing all tables under the db).
3) Select migration method
Hive UDTF(Tunnel):
1. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.
2. Create tables and partitions on the MaxCompute side based on the obtained metadata information (and the migration scope issued by the workbench task instructions).
3. Submit SQL commands to execute UDTF to the source Hive.
4. UDTF calls MaxCompute's Tunnel SDK to write data to the target.
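A minimal sketch of what steps 3 and 4 could look like on the Hive side, assuming a hypothetical UDTF named hive_to_mc_transfer that wraps the MaxCompute Tunnel SDK; the function name, JAR path, class, table, and columns are placeholders, and the real statements are generated by the tool:
-- Register the transfer UDTF (hypothetical JAR path and class name)
ADD JAR /path/to/migration-udtf.jar;
CREATE TEMPORARY FUNCTION hive_to_mc_transfer AS 'com.example.migration.HiveToMcTransferUDTF';
-- Push one partition of a hypothetical source table; the UDTF writes each row
-- to the target MaxCompute table through the Tunnel SDK
SELECT hive_to_mc_transfer(order_id, amount, pt)
FROM src_db.orders
WHERE pt = '20240101';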
OSS transfer, MaxCompute external table to internal table:
1. Use Alibaba Cloud OSS Data Online Migration service to migrate Hive data (HDFS/S3/OBS, etc.) to Alibaba Cloud OSS.
2. Obtain Hive metadata through Hive MetaStore, including all table names, table schemas, partition information, data volume, update time, etc.
3. Create OSS external tables and corresponding standard tables in the target MaxCompute based on the obtained Hive metadata and OSS path information (corresponding to HDFS).
4. Query data from the external tables and insert it into the standard tables (insert into the standard table by selecting from the OSS external table), importing the data from OSS into MaxCompute, as sketched below.
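A hedged sketch of steps 3 and 4 on the MaxCompute side, using hypothetical table names, columns, and OSS path; the exact DDL options (file format, OSS endpoint, partition handling) depend on your data and environment:
-- Step 3: map the data already copied to OSS as an external table (hypothetical names and path)
CREATE EXTERNAL TABLE IF NOT EXISTS orders_oss_ext (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (pt STRING)
STORED AS PARQUET
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/my-bucket/hive/orders/';
-- Register the partitions that exist under the OSS path
ALTER TABLE orders_oss_ext ADD IF NOT EXISTS PARTITION (pt = '20240101');
-- Create the corresponding standard (internal) table
CREATE TABLE IF NOT EXISTS orders (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (pt STRING);
-- Step 4: import the data from OSS into MaxCompute
INSERT OVERWRITE TABLE orders PARTITION (pt)
SELECT order_id, amount, pt FROM orders_oss_ext;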
4) Advanced settings
Global parameters:
hive.task.partition.max.num=50: Maximum number of partitions processed by a single task (optional)
hive.task.partition.max.size=5: Maximum data size processed by a single task (unit: G) (optional)
Custom parameters:
e.g.: mapreduce.map.speculative=false: Whether to enable speculative execution in the map phase, default is true
mapreduce.map.memory.mb=8192: Memory upper limit (MB) that a single map task can use
yarn.scheduler.maximum-allocation-mb=8192: Maximum memory (MB) that can be requested for a single YARN container
hive.fetch.task.conversion=none: All queries that read from HDFS go through MapReduce tasks
mapreduce.task.timeout=3600000 (unit: ms)
mapreduce.job.running.map.limit=100: Maximum number of map tasks running simultaneously
mapreduce.max.split.size=512000000
mapreduce.map.maxattempts=0
hive.execution.engine=mr
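As an illustration only, assuming the tool applies these parameters as Hive session settings before submitting the migration SQL (hypothetical source table; the actual submission logic belongs to the tool):
-- Apply the custom parameters for this Hive session
SET hive.execution.engine=mr;
SET mapreduce.map.speculative=false;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.task.timeout=3600000;
SET mapreduce.job.running.map.limit=100;
-- Statements submitted afterwards in the same session run with these settings
SELECT COUNT(*) FROM src_db.orders WHERE pt = '20240101';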
3. Data verification settings
1) Create a new verification engine
2) Select verification rules:
Left variable: Table name, data type in table
Logical condition: Equal to, contains, does not contain
Right variable: When the left variable is table name, fill in keywords; when the left variable is data type, fill in the data type, the source verification execution engine, and the target verification engine
Verification mode: Data volume (count) comparison, metric comparison, weak content comparison, custom comparison, full text comparison, percentage of null values comparison (a count comparison is sketched after this list)
Difference tolerance rate: Supports filling in rules, and supports adding rows and deleting existing rows
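For example, a data volume (count) comparison conceptually issues the same count on both sides and compares the difference against the configured tolerance rate (hypothetical table and partition; the actual statements are generated by the verification engine):
-- On the source Hive cluster
SELECT COUNT(*) FROM src_db.orders WHERE pt = '20240101';
-- On the target MaxCompute project
SELECT COUNT(*) FROM target_project.orders WHERE pt = '20240101';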
4. SQL code conversion settings
Upload a table mapping file that provides the mapping relationship between the source triple and the target triple, such as the mapping between db.tablename of the Hive cluster and project.tablename of the target MaxCompute cluster, as illustrated below.
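A small illustration of the effect of the table mapping on converted SQL, using hypothetical database, project, and table names:
-- Source Hive SQL
SELECT order_id, amount FROM hive_sales_db.orders WHERE pt = '20240101';
-- Converted MaxCompute SQL, after applying the mapping hive_sales_db.orders -> mc_sales_project.orders
SELECT order_id, amount FROM mc_sales_project.orders WHERE pt = '20240101';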
5. Task scheduling migration settings
1) Blacklist: Supports setting blacklists for objects such as workflow, tasknode, etc.
2) Supports submission through the migration assistant, and also supports submission through the DataWorks OpenAPI (recommended)
3) You need to select the target engine information
4) Edit default rules
{
"format": "SPEC",
"locale": "zh_CN",
"skipUnSupportType": true,
"transformContinueWithError": true,
"specContinueWithError": true,
"tempTaskTypes": [
"SPARK",
"HIVECLI"
],
"skipTaskCodes": [],
"settings": {
"workflow.converter.shellNodeType": "EMR_SHELL",
"workflow.converter.commandSqlAs": "EMR_SQL",
"workflow.converter.sparkSubmitAs": "EMR_SPARK",
"workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
"workflow.converter.mrNodeType": "EMR_MR",
"workflow.converter.target.engine.type": "EMR",
"workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
"POSTGRESQL": "EMR_HIVE",
"MYSQL": "EMR_HIVE",
"HIVE": "EMR_HIVE",
"CLICKHOUSE": "CLICK_SQL"
}
}
}
Fill in the node types that need custom processing in tempTaskTypes. If not filled, they will be converted according to default rules (if there are corresponding node conversion rules in settings). If filled, the custom rules will prevail and the original content will be preserved.
5) Custom conversion plugin & rules
To use custom conversion rules, you need to first download the conversion plugin custom project and then upload it.
Custom conversion rule configuration:
{
"if.use.migrationx.before": false,
"if.use.default.convert": true,
"conf": [
{
"rule": {
"settings": {
"workflow.converter.shellNodeType": "DIDE_SHELL",
"workflow.converter.commandSqlAs": "DIDE_SHELL",
"workflow.converter.sparkSubmitAs": "ODPS_SPARK",
"workflow.converter.target.unknownNodeTypeAs": "DIDE_SHELL",
"workflow.converter.mrNodeType": "ODPS_MR",
"workflow.converter.target.engine.type": "MaxCompute",
"workflow.converter.dolphinscheduler.sqlNodeTypeMapping": {
"POSTGRESQL": "ODPS_SQL",
"MYSQL": "ODPS_SQL",
"HIVE": "ODPS_SQL"
}
}
},
"nodes": "all, name, idsStr"
}
]
}
You can customize node type conversion rules in the project. When there is a conflict with the default rules, the custom rules prevail.
You can configure multiple JSON segments, each selecting the conversion rules applied to a chosen range of nodes. This means that, by selecting the appropriate nodes range, different types of shell nodes on the source side can be converted to different node types on the target side.
6) Submission settings
Manage preferences for the status of the corresponding elements when submitting to the target DataWorks. After completing the above configuration, click Save.
Run Pipeline one-stop migration
1. Start
Click "Running Status" to view the one-stop migration progress. You can see the corresponding exploration tasks according to the time set for the cluster exploration task, and the results of each exploration.
2. Cluster exploration
Data cluster exploration example:
Exploration result details:
Scheduling cluster exploration example:
Exploration result details:
3. Data migration
The first exploration of the source cluster's data will automatically generate a stock data migration task. Each subsequent exploration will compare with the previous cluster exploration task results to determine the incremental part and automatically generate incremental data migration tasks. Atomic-level objects (tables or partitions) will be executed in the order of schema migration -> data migration -> data verification. When all three tasks are completed and pass verification, the migration of a single object is considered successful.
You can view the details of this migration task through the following entry:
Taking the data migration details interface as an example:
It displays statistical information for a single task and provides a "retry" entry for atomic objects whose migration failed.
4. Scheduling migration
Each exploration of the source cluster's scheduling automatically generates the corresponding batch of task instances to be migrated. You need to manually execute the conversion and submit them to the target.
Conversion report: