This topic describes how to use a Hive user-defined table-generating function (UDTF) to migrate Hive data to MaxCompute.
Preparations
Prepare a network environment that meets the following requirements:
Each node in the Hive cluster can access MaxCompute.
The server on which MaxCompute Migration Assist (MMA) is installed can access the Hive metastore server and Hive server.
Create a Hive UDTF for data migration.
On the Help page of MMA, download the JAR package of the UDTF of the required version, such as mma-udtf.jar.
Run the following command to upload the mma-udtf.jar package to Hadoop Distributed File System (HDFS):
hdfs dfs -put -f mma-udtf.jar hdfs:///tmp/
Use the Beeline client or Hive CLI to log on to Hive and create a Hive UDTF.
DROP FUNCTION IF EXISTS default.odps_data_dump_multi;
CREATE FUNCTION default.odps_data_dump_multi AS 'com.aliyun.odps.mma.io.McDataTransmissionUDTF' USING JAR 'hdfs:///tmp/mma-udtf.jar';
If Kerberos authentication is configured for Hive, copy the following files to the server on which MMA is installed:
hive.keytab
gss-jaas.conf
Note: The gss-jaas.conf file contains the path of the keytab file. Make sure that this path is the same as the path of the hive.keytab file on the server on which MMA is installed.
krb5.conf
Note: The krb5.conf file contains the IP address of the Key Distribution Center (KDC) server. The server on which MMA is installed must be able to access this IP address.
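The keytab path requirement for gss-jaas.conf can be checked before you configure the data source. The following Python sketch is illustrative only and is not part of MMA. It assumes that the file uses the standard keyTab="..." option of the JAAS Krb5LoginModule. The function names, the sample principal, and the /etc/mma path are hypothetical.

```python
import re
from typing import List

# Matches the keyTab option of a JAAS Krb5LoginModule entry,
# for example: keyTab="/etc/mma/hive.keytab"
KEYTAB_PATTERN = re.compile(r'keyTab\s*=\s*"([^"]+)"')


def keytab_paths(jaas_text: str) -> List[str]:
    """Return every keytab path referenced in the JAAS configuration text."""
    return KEYTAB_PATTERN.findall(jaas_text)


def mismatched_keytabs(jaas_text: str, expected_path: str) -> List[str]:
    """Return the referenced keytab paths that differ from expected_path."""
    return [path for path in keytab_paths(jaas_text) if path != expected_path]


# Hypothetical gss-jaas.conf content, used only for illustration.
SAMPLE = '''
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/mma/hive.keytab"
  principal="hive/host@EXAMPLE.COM";
};
'''

# An empty list means every keyTab entry points at the expected location.
print(mismatched_keytabs(SAMPLE, "/etc/mma/hive.keytab"))
```

In practice you would read the text from the copied gss-jaas.conf file and pass the location at which hive.keytab was placed on the MMA server.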
Procedure
Add a data source.
In the left-side navigation pane of the MMA console, click Data Sources. The Data Sources page appears.
Click Add Data Source to go to the Add Data Source page.
Set the Data Source Type parameter to HIVE and click Next.
Configure the parameters for the data source. The following table describes the parameters.
Parameter
Description
Data Source Name
The name of the data source. You can specify a custom name.
hive metastore url
The URL of the Hive metastore. Enter the actual URL of your Hive metastore, such as thrift://192.168.0.212:9083.
hive metastore client socket timeout
Default value: 600.
hive jdbc url
The Java Database Connectivity (JDBC) URL that is used to connect to Hive. The URL is in the jdbc:hive2://localhost:10000/default format.
hive jdbc user name
The username for the Hive JDBC connection. This parameter is required.
hive jdbc password
The password for the Hive JDBC connection. This parameter is optional.
Enable Kerberos Authentication for Hive Metastore
If you select this option, you must configure the following Kerberos-related parameters.
kerberos principal
Make sure that the value of this parameter is the same as the value of the kdc_realm parameter in the krb5.conf file.
Kerberos keytab File Path
The path of the keytab file on the server on which MMA is installed.
Kerberos gss-jaas.conf File Path
The path of the gss-jaas.conf file on the server on which MMA is installed.
Kerberos krb5.conf File Path
The path of the krb5.conf file on the server on which MMA is installed.
Maximum Number of Partitions Processed in a Task
Default value: 50. This parameter specifies the maximum number of partitions whose data can be migrated in a single MMA task. Migrating multiple partitions in one task reduces the number of Hive SQL statements that are submitted and the time spent submitting them.
Maximum Size Processed in a Task (GB)
Unit: GB. Default value: 5. This parameter specifies the maximum total size of data, summed across all partitions, that a single MMA task can migrate.
Hive Job Configurations for Engines Such As MR, Spark, and Tez
By default, this parameter contains configurations for MapReduce (MR) tasks.
Note: If the engine used by Hive is not MR, set the hive.execution.engine parameter to specify the engine and adjust the task parameters for that engine. You can also use this parameter to, for example, specify a Spark queue or resolve issues such as insufficient memory in YARN containers.
Database Whitelist
The names of Hive databases whose data you want to migrate. Separate multiple names with commas (,).
Database Blacklist
The names of Hive databases whose data you do not want to migrate. Separate multiple names with commas (,).
Meta API Access Concurrency
The number of concurrent accesses to a Hive metastore. You can configure this parameter to increase the speed at which Hive metadata is obtained.
Table Blacklist (dbname.tablename Format)
The Hive database tables whose data you do not want to migrate. The value of this parameter for a single table is in the dbname.tablename format. Separate multiple table names with commas (,).
Table Whitelist (dbname.tablename Format)
The Hive database tables whose data you want to migrate. The value of this parameter for a single table is in the dbname.tablename format. Separate multiple table names with commas (,).
Click Submit at the bottom of the page.
Note: If the configurations are correct and the server on which MMA is installed can access the metastore URL and JDBC URL, MMA uses the metastore URL to pull Hive metadata, which includes information about databases, tables, and partitions. Otherwise, an error occurs. In this case, check the configurations, enter correct values, and submit the configurations again.
When the progress bar for pulling metadata reaches 100%, you are redirected to the Data Sources page.
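Because a submission fails when a URL or list value is malformed, it can help to check the values before you enter them. The following Python sketch is an assumption-laden illustration and not part of MMA. It covers only the single-host URL forms shown in the parameter table (thrift://host:port and jdbc:hive2://host:port/database) and the comma-separated dbname.tablename format. All function names and the sample table names are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def parse_endpoint(url: str, scheme: str) -> Tuple[str, int]:
    """Return (host, port) if url has the form scheme://host:port[/...]."""
    parts = urlsplit(url)
    if parts.scheme != scheme or not parts.hostname or parts.port is None:
        raise ValueError(f"expected {scheme}://host:port, got {url!r}")
    return parts.hostname, parts.port


def parse_metastore_url(url: str) -> Tuple[str, int]:
    """Parse a metastore URL such as thrift://192.168.0.212:9083."""
    return parse_endpoint(url, "thrift")


def parse_jdbc_url(url: str) -> Tuple[str, int]:
    """Parse a JDBC URL such as jdbc:hive2://localhost:10000/default."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError(f"expected a URL that starts with {prefix!r}, got {url!r}")
    return parse_endpoint(url[len(prefix):], "hive2")


def parse_table_list(value: str) -> List[Tuple[str, str]]:
    """Split a comma-separated list of dbname.tablename entries."""
    tables = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        db, sep, table = item.partition(".")
        if not sep or not db or not table or "." in table:
            raise ValueError(f"expected dbname.tablename, got {item!r}")
        tables.append((db, table))
    return tables


print(parse_metastore_url("thrift://192.168.0.212:9083"))
print(parse_jdbc_url("jdbc:hive2://localhost:10000/default"))
print(parse_table_list("sales.orders, sales.items"))
```

The returned host and port pairs can also be used for a connectivity test from the server on which MMA is installed.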
Create a data migration task.
MMA allows you to create migration tasks for a single database, multiple tables, and multiple partitions.
Note: A migration task for a single database migrates data from a single database.
A migration task for multiple tables migrates data from one or more tables.
A migration task for multiple partitions migrates data from one or more partitions.
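The data source parameters Maximum Number of Partitions Processed in a Task and Maximum Size Processed in a Task (GB) both limit how much work one task handles. The following Python sketch shows one way such limits could interact, using a simple greedy grouping. It is a hypothetical illustration: the source does not describe how MMA actually groups partitions, and the function name and partition values are made up.

```python
from typing import List, Sequence, Tuple


def group_partitions(
    partitions: Sequence[Tuple[str, float]],
    max_partitions: int = 50,   # default of the partition-count limit
    max_size_gb: float = 5.0,   # default of the size limit, in GB
) -> List[List[str]]:
    """Greedily pack (partition spec, size in GB) pairs into task groups.

    A new group starts when adding the next partition would exceed either
    limit. A partition larger than max_size_gb gets a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0
    for spec, size_gb in partitions:
        too_many = len(current) >= max_partitions
        too_big = bool(current) and current_size + size_gb > max_size_gb
        if too_many or too_big:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(spec)
        current_size += size_gb
    if current:
        groups.append(current)
    return groups


sample = [("dt=20240101", 2.0), ("dt=20240102", 2.0), ("dt=20240103", 2.0)]
# The third partition would push the first group past 5 GB, so it starts a new group.
print(group_partitions(sample))
# prints [['dt=20240101', 'dt=20240102'], ['dt=20240103']]
```

Under this model, raising either limit produces fewer, larger groups, which matches the stated goal of reducing the number of Hive SQL statements that are submitted.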
Migrate data from multiple tables.
In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.
On the details page of the selected data source, click the name of the database whose data you want to migrate.
Select the tables whose data you want to migrate and click Create Migration Task.
In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.
Parameter
Description
Name
The name of the migration task. We recommend that you enter a task name that can help you organize migration records.
Task Type
The type of the task. You can configure this parameter based on your business requirements. Valid values:
MaxCompute Projects in the Same Region.
MaxCompute Projects Across Regions.
MaxCompute Verification. This type of task compares the data of tables that exist in both the source and destination projects.
MaxCompute Project
The destination MaxCompute project.
Tables
A list of tables whose data you want to migrate. Separate multiple table names with commas (,).
Enable Verification
By default, this feature is enabled.
Incremental Update
By default, this feature is enabled. If this feature is enabled, partitions whose data has already been migrated are not migrated again.
Migrate Schema Only
Specifies whether to migrate only the table schema and partition values. You can determine whether to enable this feature based on your business requirements.
Partition Filter
For more information, see Partition filter expressions.
Table Name Mapping
The name of the table in the destination project after data of the table is migrated.
Click OK.
Note: If the configuration of the migration task is correct, you can view the migration task in the Tasks section of the MMA console and the related subtasks in the Subtasks section.
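The effect of the Incremental Update option can be pictured as a filter over the partitions of the selected tables. The following Python sketch is a hypothetical illustration only. The source does not describe how MMA records which partitions have been migrated, so the function name and the partition values are assumptions.

```python
from typing import Iterable, List


def partitions_to_migrate(
    source_partitions: Iterable[str],
    migrated_partitions: Iterable[str],
    incremental_update: bool = True,
) -> List[str]:
    """Return the partitions that a new task would still need to migrate.

    With incremental update enabled, partitions that were already migrated
    are skipped. With it disabled, every source partition is returned.
    """
    if not incremental_update:
        return list(source_partitions)
    done = set(migrated_partitions)
    return [p for p in source_partitions if p not in done]


source = ["dt=20240101", "dt=20240102", "dt=20240103"]
migrated = ["dt=20240101", "dt=20240102"]
print(partitions_to_migrate(source, migrated))
# prints ['dt=20240103']
```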
Migrate data from multiple partitions.
In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.
On the details page of the selected data source, click the name of the database whose data you want to migrate.
Click the Partitions tab. On this tab, select the partitions whose data you want to migrate.
Click Create Migration Task. In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.
Parameter
Description
Name
The name of the migration task. We recommend that you enter a task name that can help you organize migration records.
Task Type
The type of the task. You can configure this parameter based on your business requirements. Valid values:
MaxCompute Projects in the Same Region.
MaxCompute Projects Across Regions.
MaxCompute Verification. This type of task compares the data of tables that exist in both the source and destination projects.
MaxCompute Project
The destination MaxCompute project.
Enable Verification
By default, this feature is enabled.
Migrate Schema Only
Specifies whether to migrate only the table schema and partition values. You can determine whether to enable this feature based on your business requirements.
Partitions
Retain the default value.
Table Name Mapping
The name of the table in the destination project after data of the table is migrated.
Click OK.
Note: If the configuration of the migration task is correct, you can view the migration task in the Tasks section of the MMA console and the related subtasks in the Subtasks section.
Migrate data from a single database.
In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.
Find the database whose data you want to migrate and click Migrate.
In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.
Parameter
Description
Name
The name of the migration task. We recommend that you enter a task name that can help you organize migration records.
Task Type
The type of the task. You can configure this parameter based on your business requirements. Valid values:
MaxCompute Projects in the Same Region.
MaxCompute Projects Across Regions.
MaxCompute Verification. This type of task compares the data of tables that exist in both the source and destination projects.
MaxCompute Project
The destination MaxCompute project.
Table Whitelist
A list of tables whose data you want to migrate. Separate multiple tables with commas (,).
Enable Verification
By default, this feature is enabled.
Incremental Update
By default, this feature is enabled. If this feature is enabled, partitions whose data has already been migrated are not migrated again.
Migrate Schema Only
Specifies whether to migrate only the table schema and partition values.
Partition Filter
For more information, see Partition filter expressions.
Table Name Mapping
The name of the table in the destination project after data of the table is migrated.
Click OK.
Note: If the configuration of the migration task is correct, you can view the migration task in the Tasks section of the MMA console and the related subtasks in the Subtasks section.