AnalyticDB for MySQL:Import data from a Hive data source

Last Updated: Jun 28, 2023

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to migrate data from a Hive data source to an Object Storage Service (OSS) bucket. This topic describes how to use the data migration feature of AnalyticDB for MySQL Data Lakehouse Edition (V3.0) to migrate data from a Hive data source to an OSS bucket, analyze data, and manage the data migration job.

Overview

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to migrate data from a Hive data source to an OSS bucket. You can migrate the metadata and data from the data source to OSS with one click, or migrate multiple databases and tables from the data source to OSS in parallel.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a Data Lakehouse Edition (V3.0) cluster.
  • A job resource group is created. For more information, see Create a resource group.

  • A database account is created.
  • An E-MapReduce (EMR) cluster or an ECS-based Cloudera Distributed Hadoop (CDH) cluster is created in the same region as the Data Lakehouse Edition (V3.0) cluster.

    • If an EMR cluster is created, the business scenario of the EMR cluster is set to New Data Lake. The metadata of the cluster is stored in a self-managed ApsaraDB RDS database or built-in MySQL database. The Hive service is selected for the cluster. Additionally, the Hive Storage Mode parameter is set to HDFS, and the Data Lake Storage check box is cleared. For more information, see Create a cluster.

      Important
      • Data of a Hive data source in an EMR cluster that has Metadata set to DLF Unified Metadata cannot be migrated to an OSS bucket.

      • Data of a Hive data source in an EMR cluster that has Hive Storage Mode set to Data Lake Storage is already stored in OSS. You can use metadata discovery to import the data to Data Lakehouse Edition (V3.0). For more information, see Use metadata discovery to import data to Data Lakehouse Edition.

  • A partitioned table is created in the Hive data source. For more information, see Use Hive to perform basic operations. A minimal example is sketched after this list.

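The following HiveQL sketch shows one way to create such a partitioned table and load a sample partition. The database, table, and column names (adb_demo_db, orders, dt) are hypothetical placeholders; adjust them to match your environment.

    -- Hypothetical example: create a partitioned Hive table and insert data into one partition.
    CREATE DATABASE IF NOT EXISTS adb_demo_db;

    CREATE TABLE IF NOT EXISTS adb_demo_db.orders (
      order_id BIGINT,
      customer_id BIGINT,
      amount DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    INSERT INTO adb_demo_db.orders PARTITION (dt = '2023-06-01')
    VALUES (1, 100, 59.9), (2, 101, 120.0);
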
Procedure

Create a Hive data source

Note

If you want to migrate data from an existing Hive data source, skip this step and create a data migration job. For more information, see the "Create a data migration job" section of this topic.

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Data Ingestion > Data Sources.

  3. In the upper-right corner, click Create Data Source.

  4. On the Create Data Source page, configure the parameters that are described in the following table.

    EMR cluster

    • Data Source Type: The type of the data source. Select Hive.

    • Data Source Name: The name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: The description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: The deployment mode of the EMR cluster. Select Alibaba Cloud Instance.

    • Instance: The EMR cluster to which the Hive data source belongs.

    • Hive Metastore Uri: The Uniform Resource Identifier (URI) of the Hive metastore. Specify this parameter in the thrift://<IP address of the master node>:<Port number> format. The default port number is 9083. To view the private IP address of the master node, log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Nodes page, click the expand icon to the left of the emr-master node.

    ECS-based CDH cluster

    • Data Source Type: The type of the data source. Select Hive.

    • Data Source Name: The name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: The description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: The deployment mode of the CDH cluster. Select ECS-based CDH.

    • Instance: The Elastic Compute Service (ECS) instance that hosts the CDH cluster.

    • Hive Metastore Uri: The URI of the Hive metastore. In this case, specify the public IP address of the ECS instance that hosts the CDH cluster. Specify this parameter in the thrift://<IP address of the ECS instance>:<Port number> format. The default port number is 9083.

    • Host Configuration Information: The host-to-IP mappings. Specify each mapping on a separate line. Example:

      192.168.2.153 master.cdh
      192.168.2.154 node1.cdh
      192.168.2.155 node2.cdh

  5. Click Create.

Create a data migration job

  1. In the left-side navigation pane, click Data Migration.

  2. In the upper-right corner, click Create Migration Job.

  3. On the Hive Data Source tab of the Create Migration Job page, configure the parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.

    Parameters in the Source and Destination Settings section

    • Data Job Name: The name of the data migration job. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source: The data source. You can select an existing Hive data source or create a new data source.

    • Destination Type: The storage type of the destination in AnalyticDB for MySQL. Only Data Lake - OSS Storage is supported.

    • OSS Path: The OSS storage path of the AnalyticDB for MySQL cluster data.

      Important
      • All buckets that reside in the same region as the AnalyticDB for MySQL cluster are displayed. You can select one of these buckets based on your business requirements. After you configure this parameter, it cannot be modified.

      • We recommend that you select an empty directory that does not have nested relationships with the directories of other data migration jobs. This prevents historical data from being overwritten. For example, if two data migration jobs use the OSS storage paths oss://adb_demo/test/sls1/ and oss://adb_demo/test/, data overwriting occurs during migration because the two paths are nested.

    Parameters in the Database/Table Migration Settings section

    Important

    If a database name or table name is added to both the whitelist and the blacklist, the blacklist prevails and the system does not migrate the database or table.

    • Database/Table Migration Whitelist: The names of the databases and tables that you want to migrate. Specify the names by using regular expressions. Separate multiple regular expressions with commas (,).

    • Database/Table Migration Blacklist: The names of the databases and tables that you do not want to migrate. Specify the names by using regular expressions. Separate multiple regular expressions with commas (,).

    Parameters in the Migration Settings section

    • Handling Same Named Destination Table: The way in which to handle a table that uses the same name as another table in the destination OSS bucket. Valid values:

      • Skip Migration: skips the table and continues to migrate the other tables.

      • Report Error and Abort Migration: pauses the entire data migration job.

    • Job Resource Group: The job resource group that runs the data migration job.

    • Required ACUs: The number of AnalyticDB compute units (ACUs) that are required for the job resource group to run the data migration job. Valid values: 4 to the maximum number of computing resources that are available in the job resource group. To improve the stability and performance of data migration, we recommend that you specify a relatively large number of ACUs.

    • Parallel Tasks: The number of tasks that run in parallel. Valid values: 1 to 8. Default value: 1. Each task migrates one table and requires at least 4 ACUs. If the available ACUs cannot support the specified number of parallel tasks, the tasks run in sequence. For example, if only 8 ACUs are available and Parallel Tasks is set to 4, at most two tasks run at the same time.

    • Advanced Settings: The personalized settings for the data migration job. If you want to configure personalized settings, contact technical support.

  4. Click Submit.

Start the data migration job

  1. On the Data Migration page, find the data migration job that you created and click Start in the Actions column.

  2. In the upper-right corner, click Search. If the state of the job changes to Starting, the data migration job is started.

Analyze data

After the data migration job is completed, you can use metadata discovery to import data from OSS to Data Lakehouse Edition (V3.0). Then, you can use Spark JAR Development to analyze the data that is imported to Data Lakehouse Edition (V3.0). For more information about Spark development, see Spark development editor and Spark batch application development.

  Procedure

    1. In the left-side navigation pane, choose Job Development > Spark JAR Development.

    2. Enter SQL statements in the default template and click Run Now. A follow-up query sketch against a migrated table is provided after this procedure.

      -- Here is just an example of SparkSQL. Modify the content and run your spark program.
      
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Spark SQL Test;
      conf spark.adb.connectors=oss;
      
      -- Here are your sql statements
      show tables from lakehouse20220413156_adbTest;
    3. (Optional) On the Applications tab, find an application and click Log in the Actions column to view the Spark SQL running log of the application.
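    After metadata discovery creates the lakehouse database and tables, you can replace the template statements with a query against a migrated table. The following sketch reuses the conf lines from the default template; the database and table names (adb_demo_db, orders) and the columns (dt, amount) are hypothetical placeholders.

      -- Hypothetical example: query a table that was migrated from Hive and imported through metadata discovery.
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Query Migrated Hive Data;
      conf spark.adb.connectors=oss;

      SELECT dt, COUNT(*) AS order_count, SUM(amount) AS total_amount
      FROM adb_demo_db.orders
      GROUP BY dt
      ORDER BY dt;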

Manage the data migration job

On the Data Migration page, you can perform the following operations in the Actions column.

• Start: starts the data migration job.

• Execution Details: allows you to view the details of the data migration job, such as the job settings and the number of migrated tables.

• Edit: allows you to edit the properties of the data migration job.

• Pause: pauses the data migration job.

• Delete: deletes the data migration job.