If your Hive tables are stored in an EMR or CDH cluster, you can use the data migration feature of AnalyticDB for MySQL to copy them into an Object Storage Service (OSS) bucket, then query the migrated data using Spark SQL in Data Lakehouse Edition.
Limitations
Review the following constraints before you start:
DLF Unified Metadata not supported: EMR clusters with Metadata set to DLF Unified Metadata cannot be used as a Hive data source for migration.
Data Lake Storage mode not supported: If your EMR cluster has Hive Storage Mode set to Data Lake Storage, the Hive data is already stored in OSS. Use metadata discovery to import the data into Data Lakehouse Edition instead.
Blacklist takes precedence: If a database or table name appears in both the whitelist and the blacklist, the blacklist prevails and the item is not migrated.
OSS path is permanent: After you set the OSS path for a migration job, it cannot be changed.
ACU range: The number of ACUs for a migration job must be between 4 and the maximum computing resources available in the job resource group.
Parallel tasks: Each parallel task migrates one table and requires at least 4 ACUs. If ACUs are insufficient, tasks run sequentially. Default value: 1. Maximum value: 8.
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL Data Lakehouse Edition cluster
A job resource group for the cluster. For more information, see Create a resource group
A database account for the cluster:
If you use an Alibaba Cloud account, create a privileged account. See the Create a privileged account section of the Create a database account topic
If you use a Resource Access Management (RAM) user, create both a privileged account and a standard account, and associate the standard account with the RAM user. See Create a database account and Associate or disassociate a database account with or from a RAM user
One of the following clusters in the same region as the AnalyticDB for MySQL cluster:
An E-MapReduce (EMR) cluster with Business Scenario set to Data Lake, Metadata set to Self-managed RDS or Built-in MySQL, the Hive service configured, and Hive Storage Mode set to HDFS (the Data Lake Storage check box cleared). See Create a cluster
A CDH (Cloudera's Distribution Including Apache Hadoop) cluster deployed on an Elastic Compute Service (ECS) instance
A partitioned table in the Hive data source. See Use Hive to perform basic operations
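If the source cluster does not yet have a partitioned table, the following Hive statements sketch a minimal one; the database, table, and column names are placeholders for illustration:

```sql
-- Hypothetical partitioned Hive table for migration testing.
CREATE DATABASE IF NOT EXISTS adb_demo;

CREATE TABLE IF NOT EXISTS adb_demo.orders (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (dt STRING);

-- Write a few rows into one partition.
INSERT INTO adb_demo.orders PARTITION (dt = '2024-01-01')
VALUES (1, 19.99), (2, 5.50);
```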
Billing
Migrating data from Hive to OSS incurs the following fees:
AnalyticDB for MySQL: ACU elastic resource fees. See Billable items of Data Lakehouse Edition
OSS: Storage fees and request fees for operations such as GET and PUT. See Billing overview
Migration workflow
The migration process has five steps:
Create a Hive data source — Connect AnalyticDB for MySQL to your EMR or CDH cluster by specifying the Hive Metastore Uniform Resource Identifier (URI). Skip this step if the data source already exists.
Create a data migration job — Define the source data source, destination OSS path, database and table filters, and compute resources for the job.
Start the data migration job — Trigger the job to copy Hive tables into the OSS bucket.
Analyze data — Use metadata discovery to import data from OSS to Data Lakehouse Edition, then run Spark SQL to query it.
(Optional) Manage the data migration job — Start, pause, edit, or delete the job as needed.
Create a Hive data source
If you already have a Hive data source, skip this step and go to Create a data migration job.
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click the ID of the cluster you want to manage.
In the left-side navigation pane, choose Data Ingestion > Data Sources.
In the upper-right corner, click Create Data Source.
On the Create Data Source page, configure the parameters for your cluster type:
EMR cluster
| Parameter | Description |
|---|---|
| Data Source Type | Select Hive. |
| Data Source Name | The system generates a default name based on the data source type and current time. Modify it to match your naming convention. |
| Data Source Description | Enter a description, such as the use case and business scope. |
| Deployment Mode | Select Alibaba Cloud Instance. |
| Instance | Select the EMR cluster that hosts the Hive data source. |
| Hive Metastore URI | The URI of the Hive Metastore, in the format thrift://<master-node-IP>:<port>. The default port is 9083. To find the private IP address of the master node, log on to the EMR console, click EMR on ECS in the left-side navigation pane, click the ID of the EMR cluster that you want to manage, open the Nodes tab, and expand the emr-master node. |
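As an example, if the private IP address of the emr-master node were 192.168.2.153 (a placeholder value), the complete Hive Metastore URI would be:

```
thrift://192.168.2.153:9083
```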
ECS-based CDH cluster
| Parameter | Description |
|---|---|
| Data Source Type | Select Hive. |
| Data Source Name | The system generates a default name based on the data source type and current time. Modify it to match your naming convention. |
| Data Source Description | Enter a description, such as the use case and business scope. |
| Deployment Mode | Select ECS-based CDH. |
| Instance | Select the ECS instance where the CDH cluster is deployed. |
| Hive Metastore URI | The URI of the Hive Metastore, using the public IP address of the ECS instance where the CDH cluster is deployed. Format: thrift://<ECS-instance-IP>:<port>. The default port is 9083. |
| Host Configuration Information | The hostname-to-IP mappings, one per line. Example:<br>192.168.2.153 master.cdh<br>192.168.2.154 node1.cdh<br>192.168.2.155 node2.cdh |
Click Create.
Create a data migration job
In the left-side navigation pane, click Data Migration.
In the upper-right corner, click Create Migration Job.
On the Hive Data Source tab of the Create Migration Job page, configure the following sections:
Source and destination settings
| Parameter | Description |
|---|---|
| Job Name | The system generates a default name based on the data source type and current time. Modify it as needed. |
| Data Source | Select an existing Hive data source or create one. |
| Destination Type | Only Data Lake - OSS Storage is supported. |
| OSS Path | The OSS path where migrated data is stored. All buckets in the same region as the cluster are listed. Select an empty directory with no nested relationship to paths used by other migration jobs; nested paths can cause data to be overwritten. This setting cannot be changed after the job is created. |

Database/table migration settings
ImportantIf a database or table name appears in both lists, the blacklist takes precedence and the item is not migrated.
| Parameter | Description |
|---|---|
| Database/Table Migration Whitelist | Names of databases and tables to migrate, specified as regular expressions. Separate multiple expressions with commas. |
| Database/Table Migration Blacklist | Names of databases and tables to exclude from migration, specified as regular expressions. Separate multiple expressions with commas. |

Migration settings
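The whitelist and blacklist accept comma-separated regular expressions. The following configuration is hypothetical, and assumes for illustration that each pattern matches a name in database.table form:

```
# Whitelist: migrate every table in sales_db, plus the orders table in crm_db.
Whitelist: sales_db\..*,crm_db\.orders
# Blacklist: exclude any table whose name starts with tmp_.
# On overlap, the blacklist wins and the table is not migrated.
Blacklist: .*\.tmp_.*
```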
| Parameter | Description |
|---|---|
| Handling Same Named Destination Table | How to handle a table whose name conflicts with an existing table in the destination. Options: Skip Migration (skip the conflicting table and continue migrating other tables) or Report Error and Abort Migration (stop the entire job). |
| Job Resource Group | The job resource group that runs the migration job. |
| Required ACUs | The number of ACUs allocated to the job. Valid values: 4 to the maximum number of ACUs available in the job resource group. For better stability and throughput, allocate more ACUs. |
| Parallel Tasks | The number of tables to migrate concurrently. Default value: 1. Valid values: 1 to 8. Each task requires at least 4 ACUs. If ACUs are insufficient for the configured number of tasks, the tasks run sequentially. |
| Advanced Settings | Custom settings for the migration job. Contact technical support if you need to configure this parameter. |

Click Submit.
Start the data migration job
On the Data Migration page, find the job and click Start in the Actions column.
Click Search in the upper-right corner to refresh the job status. When the status changes to Starting, the migration is underway.
Analyze data
After the migration job completes, use metadata discovery to import data from OSS to Data Lakehouse Edition, then use Spark JAR Development to analyze the imported data.
In the left-side navigation pane, choose Job Development > Spark JAR Development.
Enter your Spark SQL statements in the default template and click Run Now. The following example shows the configuration structure used to query tables in Data Lakehouse Edition:
```sql
-- Configure Spark resources and connectors
conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;

-- Query migrated tables
show tables from lakehouse20220413156_adbTest;
```

(Optional) On the Applications tab, find your application and click Log in the Actions column to view the Spark SQL run log.
For more information about Spark development, see Spark editor and Spark batch application development overview.
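Beyond listing tables, you can query the migrated data directly. In the following sketch, the database name lakehouse20220413156_adbTest comes from the example above, while the table and column names are hypothetical:

```sql
-- Configure Spark resources and the OSS connector, as in the default template.
conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Query;
conf spark.adb.connectors=oss;

-- Count rows per partition of a migrated table (table and columns are placeholders).
SELECT dt, COUNT(*) AS row_count
FROM lakehouse20220413156_adbTest.orders
GROUP BY dt
ORDER BY dt;
```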
Manage the data migration job
On the Data Migration page, use the following operations in the Actions column:
| Operation | Description |
|---|---|
| Start | Starts the migration job. |
| View Details | Shows migration job settings and the number of migrated tables. |
| Edit | Modifies the configuration of the migration job. |
| Pause | Pauses the migration job. |
| Delete | Deletes the migration job. |