AnalyticDB for MySQL:Import data from a Hive data source

Last Updated: Jun 28, 2023

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to migrate data from a Hive data source to an Object Storage Service (OSS) bucket. This topic describes how to use the data migration feature of AnalyticDB for MySQL Data Lakehouse Edition (V3.0) to migrate data from a Hive data source to an OSS bucket, analyze data, and manage the data migration job.

Overview

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to migrate data from a Hive data source to an OSS bucket. You can migrate the metadata and data from the data source to OSS with one click, or migrate multiple databases and tables from the data source to OSS in parallel.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a Data Lakehouse Edition (V3.0) cluster.
  • A job resource group is created. For more information, see Create a resource group.

  • A database account is created.
  • An E-MapReduce (EMR) cluster or an ECS-based Cloudera Distributed Hadoop (CDH) cluster is created in the same region as the Data Lakehouse Edition (V3.0) cluster.

    • If an EMR cluster is created, the business scenario of the EMR cluster is set to New Data Lake. The metadata of the cluster is stored in a self-managed ApsaraDB RDS database or built-in MySQL database. The Hive service is selected for the cluster. Additionally, the Hive Storage Mode parameter is set to HDFS, and the Data Lake Storage check box is cleared. For more information, see Create a cluster.

      Important
      • Data of a Hive data source in an EMR cluster that has Metadata set to DLF Unified Metadata cannot be migrated to an OSS bucket.

      • Data of a Hive data source in an EMR cluster that has Hive Storage Mode set to Data Lake Storage is already stored in OSS. You can use metadata discovery to import the data to Data Lakehouse Edition (V3.0). For more information, see Use metadata discovery to import data to Data Lakehouse Edition.

  • A partitioned table is created in the Hive data source. For more information, see Use Hive to perform basic operations. A minimal example is sketched after this list.

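The following HiveQL sketch shows one way to create such a partitioned table and load a sample partition. The database, table, and column names (adb_demo_db, orders, dt) are hypothetical placeholders; adjust them to match your environment.

    -- Hypothetical example: create a partitioned Hive table and insert data into one partition.
    CREATE DATABASE IF NOT EXISTS adb_demo_db;

    CREATE TABLE IF NOT EXISTS adb_demo_db.orders (
      order_id BIGINT,
      customer_id BIGINT,
      amount DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    INSERT INTO adb_demo_db.orders PARTITION (dt = '2023-06-01')
    VALUES (1, 100, 59.9), (2, 101, 120.0);
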
Procedure

Create a Hive data source

Note

If you want to migrate data from an existing Hive data source, skip this step and create a data migration job. For more information, see the "Create a data migration job" section of this topic.

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Data Ingestion > Data Sources.

  3. In the upper-right corner, click Create Data Source.

  4. On the Create Data Source page, configure the parameters that are described in the following table.

    EMR cluster

    • Data Source Type: The type of the data source. Select Hive.

    • Data Source Name: The name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: The description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: The deployment mode of the EMR cluster. Select Alibaba Cloud Instance.

    • Instance: The EMR cluster to which the Hive data source belongs.

    • Hive Metastore Uri: The Uniform Resource Identifier (URI) of the Hive metastore. Specify this parameter in the thrift://<IP address of the master node>:<Port number> format. The default port number is 9083. To view the private IP address of the master node, log on to the EMR console and click EMR on ECS in the left-side navigation pane. On the Nodes page, click the expand icon to the left of the emr-master node.

    ECS-based CDH cluster

    • Data Source Type: The type of the data source. Select Hive.

    • Data Source Name: The name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: The description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: The deployment mode of the CDH cluster. Select ECS-based CDH.

    • Instance: The Elastic Compute Service (ECS) instance that hosts the CDH cluster.

    • Hive Metastore Uri: The URI of the Hive metastore. In this case, specify the public IP address of the ECS instance that hosts the CDH cluster. Specify this parameter in the thrift://<IP address of the ECS instance>:<Port number> format. The default port number is 9083.

    • Host Configuration Information: The host-to-IP mappings. Specify each mapping on a separate line. Example:

      192.168.2.153 master.cdh
      192.168.2.154 node1.cdh
      192.168.2.155 node2.cdh

  5. Click Create.

Create a data migration job

  1. In the left-side navigation pane, click Data Migration.

  2. In the upper-right corner, click Create Migration Job.

  3. On the Hive Data Source tab of the Create Migration Job page, configure the parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.

    Parameters in the Source and Destination Settings section

    • Data Job Name: The name of the data migration job. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source: The data source. You can select an existing Hive data source or create a new data source.

    • Destination Type: The storage type of the destination in AnalyticDB for MySQL. Only Data Lake - OSS Storage is supported.

    • OSS Path: The OSS storage path of the AnalyticDB for MySQL cluster data.

      Important
      • All buckets that reside in the same region as the AnalyticDB for MySQL cluster are displayed. You can select one of these buckets based on your business requirements. After you configure this parameter, it cannot be modified.

      • We recommend that you select an empty directory that does not have nested relationships with the directories of other data migration jobs. This prevents historical data from being overwritten. For example, if two data migration jobs use the OSS storage paths oss://adb_demo/test/sls1/ and oss://adb_demo/test/, data overwriting occurs during migration because the two paths are nested.

    Parameters in the Database/Table Migration Settings section

    Important

    If a database name or table name is added to both the whitelist and the blacklist, the blacklist prevails and the system does not migrate the database or table.

    • Database/Table Migration Whitelist: The names of the databases and tables that you want to migrate. Specify the names by using regular expressions. Separate multiple regular expressions with commas (,).

    • Database/Table Migration Blacklist: The names of the databases and tables that you do not want to migrate. Specify the names by using regular expressions. Separate multiple regular expressions with commas (,).

    Parameters in the Migration Settings section

    • Handling Same Named Destination Table: The way in which to handle a table that uses the same name as another table in the destination OSS bucket. Valid values:

      • Skip Migration: skips the table and continues to migrate the other tables.

      • Report Error and Abort Migration: pauses the entire data migration job.

    • Job Resource Group: The job resource group that runs the data migration job.

    • Required ACUs: The number of AnalyticDB compute units (ACUs) that are required for the job resource group to run the data migration job. Valid values: 4 to the maximum number of computing resources that are available in the job resource group. To improve the stability and performance of data migration, we recommend that you specify a relatively large number of ACUs.

    • Parallel Tasks: The number of tasks that run in parallel. Valid values: 1 to 8. Default value: 1. Each task migrates one table and requires at least 4 ACUs. If the available ACUs cannot support the specified number of parallel tasks, the tasks run in sequence. For example, if only 8 ACUs are available and Parallel Tasks is set to 4, at most two tasks run at the same time.

    • Advanced Settings: The personalized settings for the data migration job. If you want to configure personalized settings, contact technical support.

  4. Click Submit.

Start the data migration job

  1. On the Data Migration page, find the data migration job that you created and click Start in the Actions column.

  2. In the upper-right corner, click Search. If the state of the job changes to Starting, the data migration job is started.

Analyze data

After the data migration job is completed, you can use metadata discovery to import data from OSS to Data Lakehouse Edition (V3.0). Then, you can use Spark JAR Development to analyze the data that is imported to Data Lakehouse Edition (V3.0). For more information about Spark development, see Spark development editor and Spark batch application development.

  Procedure

    1. In the left-side navigation pane, choose Job Development > Spark JAR Development.

    2. Enter SQL statements in the default template and click Run Now. A follow-up query sketch against a migrated table is provided after this procedure.

      -- Here is just an example of SparkSQL. Modify the content and run your spark program.
      
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Spark SQL Test;
      conf spark.adb.connectors=oss;
      
      -- Here are your sql statements
      show tables from lakehouse20220413156_adbTest;
    3. (Optional) On the Applications tab, find an application and click Log in the Actions column to view the Spark SQL running log of the application.
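    After metadata discovery creates the lakehouse database and tables, you can replace the template statements with a query against a migrated table. The following sketch reuses the conf lines from the default template; the database and table names (adb_demo_db, orders) and the columns (dt, amount) are hypothetical placeholders.

      -- Hypothetical example: query a table that was migrated from Hive and imported through metadata discovery.
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Query Migrated Hive Data;
      conf spark.adb.connectors=oss;

      SELECT dt, COUNT(*) AS order_count, SUM(amount) AS total_amount
      FROM adb_demo_db.orders
      GROUP BY dt
      ORDER BY dt;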

Manage the data migration job

On the Data Migration page, you can perform the following operations in the Actions column.

• Start: starts the data migration job.

• Execution Details: allows you to view the details of the data migration job, such as the job settings and the number of migrated tables.

• Edit: allows you to edit the properties of the data migration job.

• Pause: pauses the data migration job.

• Delete: deletes the data migration job.