MaxCompute: Migrate Hive data

Last Updated: Jan 22, 2025

This topic describes how to migrate Hive data to MaxCompute by using a Hive user-defined table-generating function (UDTF).

Preparations

  • Prepare a network environment that meets the following requirements:

    • Each node in the Hive cluster can access MaxCompute.

    • The server on which MaxCompute Migration Assist (MMA) is installed can access the Hive metastore server and Hive server.

  • Create a Hive UDTF for data migration.

    1. On the Help page of MMA, download the JAR package of the UDTF of the required version, such as mma-udtf.jar.

    2. Run the following command to upload the mma-udtf.jar package to Hadoop Distributed File System (HDFS):

      hdfs dfs -put -f mma-udtf.jar hdfs:///tmp/

    3. Use the Beeline client or Hive CLI to log on to Hive and create a Hive UDTF.

      DROP FUNCTION IF EXISTS default.odps_data_dump_multi;
      CREATE FUNCTION default.odps_data_dump_multi as 'com.aliyun.odps.mma.io.McDataTransmissionUDTF' USING JAR 'hdfs:///tmp/mma-udtf.jar';
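After you create the function, you may want to confirm that the JAR package is in HDFS and that Hive can resolve the function. The following is a minimal sketch under stated assumptions: `JDBC_URL` is a placeholder for your HiveServer2 address, `verify_udtf` is a hypothetical helper name, and the host that runs the check must have the `hdfs` and `beeline` clients installed.

```shell
# Sketch: confirm the JAR upload and the function registration.

# Placeholder JDBC URL; replace it with the URL of your HiveServer2 instance.
JDBC_URL="${JDBC_URL:-jdbc:hive2://localhost:10000/default}"

# Statement that asks Hive to describe the UDTF created in the previous step.
CHECK_SQL="DESCRIBE FUNCTION EXTENDED default.odps_data_dump_multi;"

# verify_udtf
# Lists the uploaded JAR in HDFS, then runs the DESCRIBE statement in Beeline.
verify_udtf() {
  hdfs dfs -ls hdfs:///tmp/mma-udtf.jar &&
    beeline -u "$JDBC_URL" -e "$CHECK_SQL"
}

# Example usage:
#   verify_udtf
```

If the function is missing, Hive typically reports that it does not exist. In that case, run the DROP FUNCTION and CREATE FUNCTION statements again.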

  • If Kerberos authentication is configured for Hive, copy the following files to the server on which MMA is installed:

    1. hive.keytab

    2. gss-jaas.conf

      Note

      The gss-jaas.conf file contains the path of the keytab file. Make sure that this path matches the actual location of the hive.keytab file on the server on which MMA is installed.

    3. krb5.conf

      Note

      The krb5.conf file contains the IP address of the Key Distribution Center (KDC) server. The server on which MMA is installed must be able to access this IP address.
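The keytab path mismatch described in the note above is easy to catch with a short script on the MMA server. The following is a minimal sketch, not part of MMA: the function names are hypothetical, and the sketch assumes that gss-jaas.conf sets the keytab with a quoted `keyTab="..."` option, which is the usual form for the Java Krb5LoginModule.

```shell
# Sketch: check that the copied Kerberos files are consistent on the MMA server.

# keytab_from_jaas FILE
# Prints the first keyTab path configured in a JAAS file (quoted form only).
keytab_from_jaas() {
  sed -n 's/.*keyTab="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# check_kerberos_files JAAS_FILE KRB5_FILE
# Returns 0 only if both files are readable and the keytab that the
# JAAS file points to is readable on this server.
check_kerberos_files() {
  local jaas="$1" krb5="$2" keytab
  [ -r "$jaas" ] || { echo "missing: $jaas"; return 1; }
  [ -r "$krb5" ] || { echo "missing: $krb5"; return 1; }
  keytab="$(keytab_from_jaas "$jaas")"
  [ -n "$keytab" ] || { echo "no keyTab entry in $jaas"; return 1; }
  [ -r "$keytab" ] || { echo "keytab not found at $keytab"; return 1; }
  echo "ok: keytab=$keytab"
}

# Example usage (placeholder paths):
#   check_kerberos_files /etc/mma/gss-jaas.conf /etc/mma/krb5.conf
```

The script checks only that the files exist and agree on the keytab location. It does not contact the KDC, so you still need to confirm separately that the KDC address in krb5.conf is reachable.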

Procedure

  1. Add a data source.

    1. In the left-side navigation pane of the MMA console, click Data Sources. The Data Sources page appears.

    2. Click Add Data Source to go to the Add Data Source page.

    3. Set the Data Source Type parameter to HIVE and click Next.

    4. Configure the parameters for the data source. The following table describes the parameters.

      | Parameter | Description |
      | --- | --- |
      | Data Source Name | The name of the data source. You can specify a custom name. |
      | hive metastore url | The URL of your Hive metastore, such as thrift://192.168.0.212:9083. |
      | hive metastore client socket timeout | Default value: 600. |
      | hive jdbc url | The Hive Java Database Connectivity (JDBC) connection URL, such as jdbc:hive2://localhost:10000/default. |
      | hive jdbc user name | The username for the Hive JDBC connection. This parameter is required. |
      | hive jdbc password | The password for the Hive JDBC connection. This parameter is optional. |
      | Enable Kerberos Authentication for Hive Metastore | If you select this option, you must configure the Kerberos-related parameters in the following rows. |
      | kerberos principal | Make sure that the value of this parameter is the same as the value of the kdc_realm parameter in the krb5.conf file. |
      | Kerberos keytab File Path | The path of the keytab file on the server on which MMA is installed. |
      | Kerberos gss-jaas.conf File Path | The path of the gss-jaas.conf file on the server on which MMA is installed. |
      | Kerberos krb5.conf File Path | The path of the krb5.conf file on the server on which MMA is installed. |
      | Maximum Number of Partitions Processed in a Task | Default value: 50. The maximum number of partitions whose data can be migrated in a single MMA task. Migrating multiple partitions in one task reduces the number of Hive SQL statements that are submitted and the time spent submitting them. |
      | Maximum Size Processed in a Task (GB) | Default value: 5. Unit: GB. The maximum total size of data, across all partitions, that a single MMA task can migrate. |
      | Hive Job Configurations for Engines Such As MR, Spark, and Tez | By default, this parameter contains configurations for MapReduce (MR) jobs. Note: If Hive does not use MR as its execution engine, set the hive.execution.engine parameter to the engine that Hive uses and adjust the engine-specific job parameters. You can use this configuration, for example, to specify a Spark queue or to resolve issues such as insufficient memory in the YARN container. |
      | Database Whitelist | The names of the Hive databases whose data you want to migrate. Separate multiple names with commas (,). |
      | Database Blacklist | The names of the Hive databases whose data you do not want to migrate. Separate multiple names with commas (,). |
      | Meta API Access Concurrency | The number of concurrent requests sent to the Hive metastore. Increase this value to obtain Hive metadata faster. |
      | Table Blacklist (dbname.tablename Format) | The Hive tables whose data you do not want to migrate. Specify each table in the dbname.tablename format. Separate multiple tables with commas (,). |
      | Table Whitelist (dbname.tablename Format) | The Hive tables whose data you want to migrate. Specify each table in the dbname.tablename format. Separate multiple tables with commas (,). |

    5. Click Submit at the bottom of the page.

      Note

      If the configurations are correct and the server on which MMA is installed can access the metastore URL and the JDBC URL, MMA uses the metastore URL to pull Hive metadata, including information about databases, tables, and partitions. If an error occurs instead, check the configurations, correct the values, and submit again.

    6. When the progress bar for pulling metadata reaches 100%, you are redirected to the Data Sources page.
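The Table Whitelist and Table Blacklist parameters of the data source expect comma-separated entries in the dbname.tablename format, and a malformed entry is easy to overlook in the form. The following is a minimal sketch of a pre-check that you can run before you paste a list into the console. The function name `validate_table_list` is hypothetical, and the rule that names contain only letters, digits, and underscores is an assumption of this sketch, not a documented MMA rule.

```shell
# Sketch: validate a comma-separated list of dbname.tablename entries.

# validate_table_list "db1.t1,db2.t2"
# Prints every entry that does not look like dbname.tablename and
# returns 1 if at least one such entry is found. Empty entries are ignored.
validate_table_list() {
  local invalid
  # Put one entry on each line, then keep only the lines that do NOT match.
  invalid="$(printf '%s\n' "$1" | tr ',' '\n' |
    grep -Ev '^[A-Za-z0-9_]+\.[A-Za-z0-9_]+$' || true)"
  if [ -n "$invalid" ]; then
    printf '%s\n' "$invalid" | sed 's/^/invalid entry: /'
    return 1
  fi
}

# Example usage (hypothetical table names):
#   validate_table_list "sales.orders,sales.items"
#   validate_table_list "sales.orders,items"
```

The second call in the usage comment flags `items` because the database name is missing. Spaces around a comma are also flagged, which is a deliberately strict choice in this sketch.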

  2. Create a data migration task.

    MMA allows you to create migration tasks for a single database, multiple tables, and multiple partitions.

    Note

    • A migration task for a single database migrates data from a single database.

    • A migration task for multiple tables migrates data from one or more tables.

    • A migration task for multiple partitions migrates data from one or more partitions.

    • Migrate data from multiple tables.

      1. In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.

      2. On the details page of the selected data source, click the name of the database whose data you want to migrate.

      3. Select the tables whose data you want to migrate and click Create Migration Task.

      4. In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.

        | Parameter | Description |
        | --- | --- |
        | Name | The name of the migration task. We recommend a descriptive name that helps you organize migration records. |
        | Task Type | The type of the task. Valid values: MaxCompute Projects in the Same Region; MaxCompute Projects Across Regions; MaxCompute Verification, which compares the data of the tables that exist in both the source and destination projects. |
        | MaxCompute Project | The destination MaxCompute project. |
        | Tables | The tables whose data you want to migrate. Separate multiple table names with commas (,). |
        | Enable Verification | By default, this feature is enabled. |
        | Incremental Update | By default, this feature is enabled. When it is enabled, partitions whose data has already been migrated are not migrated again. |
        | Migrate Schema Only | Specifies whether to migrate only the table schema and partition values. |
        | Partition Filter | For more information, see Partition filter expressions. |
        | Table Name Mapping | The name of the table in the destination project after the table data is migrated. |

      5. Click OK.

        Note

        If the configuration of the migration task is correct, the task appears in the Tasks section of the Migration Tasks page, and its subtasks appear in the Subtasks section of the same page.

    • Migrate data from multiple partitions.

      1. In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.

      2. On the details page of the selected data source, click the name of the database whose data you want to migrate.

      3. Click the Partitions tab. On this tab, select the partitions whose data you want to migrate.

      4. Click Create Migration Task. In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.

        | Parameter | Description |
        | --- | --- |
        | Name | The name of the migration task. We recommend a descriptive name that helps you organize migration records. |
        | Task Type | The type of the task. Valid values: MaxCompute Projects in the Same Region; MaxCompute Projects Across Regions; MaxCompute Verification, which compares the data of the tables that exist in both the source and destination projects. |
        | MaxCompute Project | The destination MaxCompute project. |
        | Enable Verification | By default, this feature is enabled. |
        | Migrate Schema Only | Specifies whether to migrate only the table schema and partition values. |
        | Partitions | Retain the default value. |
        | Table Name Mapping | The name of the table in the destination project after the table data is migrated. |

      5. Click OK.

        Note

        If the configuration of the migration task is correct, the task appears in the Tasks section of the Migration Tasks page, and its subtasks appear in the Subtasks section of the same page.

    • Migrate data from a single database.

      1. In the left-side navigation pane of the MMA console, click Data Sources. On the Data Sources page, click the name of the data source whose data you want to migrate.

      2. Find the database whose data you want to migrate and click Migrate.

      3. In the Create Migration Task dialog box, configure the parameters based on your business requirements. The following table describes the parameters.

        | Parameter | Description |
        | --- | --- |
        | Name | The name of the migration task. We recommend a descriptive name that helps you organize migration records. |
        | Task Type | The type of the task. Valid values: MaxCompute Projects in the Same Region; MaxCompute Projects Across Regions; MaxCompute Verification, which compares the data of the tables that exist in both the source and destination projects. |
        | MaxCompute Project | The destination MaxCompute project. |
        | Table Whitelist | The tables whose data you want to migrate. Separate multiple tables with commas (,). |
        | Enable Verification | By default, this feature is enabled. |
        | Incremental Update | By default, this feature is enabled. When it is enabled, partitions whose data has already been migrated are not migrated again. |
        | Migrate Schema Only | Specifies whether to migrate only the table schema and partition values. |
        | Partition Filter | For more information, see Partition filter expressions. |
        | Table Name Mapping | The name of the table in the destination project after the table data is migrated. |

      4. Click OK.

        Note

        If the configuration of the migration task is correct, the task appears in the Tasks section of the Migration Tasks page, and its subtasks appear in the Subtasks section of the same page.