If your Hive tables are stored in an EMR or CDH cluster, you can use the data migration feature of AnalyticDB for MySQL to copy them into an Object Storage Service (OSS) bucket, then query the migrated data using Spark SQL in Data Lakehouse Edition.
Limitations
Review the following constraints before you start:
DLF Unified Metadata not supported: EMR clusters with Metadata set to DLF Unified Metadata cannot be used as a Hive data source for migration.
Data Lake Storage mode not supported: If your EMR cluster has Hive Storage Mode set to Data Lake Storage, the Hive data is already stored in OSS. Use metadata discovery to import the data into Data Lakehouse Edition instead.
Blacklist takes precedence: If a database or table name appears in both the whitelist and the blacklist, the blacklist prevails and the item is not migrated.
OSS path is permanent: After you set the OSS path for a migration job, it cannot be changed.
ACU range: The number of ACUs for a migration job must be between 4 and the maximum computing resources available in the job resource group.
Parallel tasks: Each parallel task migrates one table and requires at least 4 ACUs. If ACUs are insufficient, tasks run sequentially. Default value: 1. Maximum value: 8.
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL Data Lakehouse Edition cluster
A job resource group for the cluster. For more information, see Create a resource group
A database account for the cluster:
If you use an Alibaba Cloud account, create a privileged account. See the Create a privileged account section of the Create a database account topic
If you use a Resource Access Management (RAM) user, create both a privileged account and a standard account, and associate the standard account with the RAM user. See Create a database account and Associate or disassociate a database account with or from a RAM user
One of the following clusters in the same region as the AnalyticDB for MySQL cluster:
An E-MapReduce (EMR) cluster with Business Scenario set to Data Lake, Metadata set to Self-managed RDS or Built-in MySQL, the Hive service configured, and Hive Storage Mode set to HDFS (the Data Lake Storage check box cleared). See Create a cluster
A CDH (Cloudera's Distribution Including Apache Hadoop) cluster deployed on an Elastic Compute Service (ECS) instance
A partitioned table in the Hive data source. See Use Hive to perform basic operations
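If the source cluster does not yet have a partitioned table, the following Hive statements sketch a minimal one; the database, table, and column names are placeholders for illustration:

```sql
-- Hypothetical partitioned Hive table for migration testing.
CREATE DATABASE IF NOT EXISTS adb_demo;

CREATE TABLE IF NOT EXISTS adb_demo.orders (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (dt STRING);

-- Write a few rows into one partition.
INSERT INTO adb_demo.orders PARTITION (dt = '2024-01-01')
VALUES (1, 19.99), (2, 5.50);
```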
Billing
Migrating data from Hive to OSS incurs the following fees:
AnalyticDB for MySQL: ACU elastic resource fees. See Billable items of Data Lakehouse Edition
OSS: Storage fees and request fees for operations such as GET and PUT. See Billing overview
Migration workflow
The migration process has five steps:
Create a Hive data source — Connect AnalyticDB for MySQL to your EMR or CDH cluster by specifying the Hive Metastore Uniform Resource Identifier (URI). Skip this step if the data source already exists.
Create a data migration job — Define the source data source, destination OSS path, database and table filters, and compute resources for the job.
Start the data migration job — Trigger the job to copy Hive tables into the OSS bucket.
Analyze data — Use metadata discovery to import data from OSS to Data Lakehouse Edition, then run Spark SQL to query it.
(Optional) Manage the data migration job — Start, pause, edit, or delete the job as needed.
Create a Hive data source
If you already have a Hive data source, skip this step and go to Create a data migration job.
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click the ID of the cluster you want to manage.
In the left-side navigation pane, choose Data Ingestion > Data Sources.
In the upper-right corner, click Create Data Source.
On the Create Data Source page, configure the parameters for your cluster type:
EMR cluster
| Parameter | Description |
|---|---|
| Data Source Type | Select Hive. |
| Data Source Name | The system generates a default name based on the data source type and current time. Modify it to match your naming convention. |
| Data Source Description | Enter a description, such as the use case and business scope. |
| Deployment Mode | Select Alibaba Cloud Instance. |
| Instance | Select the EMR cluster that hosts the Hive data source. |
| Hive Metastore URI | The URI of the Hive Metastore, in the format thrift://<master-node-IP>:<port>. The default port is 9083. To find the private IP address of the master node, log on to the EMR console, click EMR on ECS in the left-side navigation pane, click the ID of the EMR cluster that you want to manage, open the Nodes tab, and expand the emr-master node. |
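As an example, if the private IP address of the emr-master node were 192.168.2.153 (a placeholder value), the complete Hive Metastore URI would be:

```
thrift://192.168.2.153:9083
```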
ECS-based CDH cluster
| Parameter | Description |
|---|---|
| Data Source Type | Select Hive. |
| Data Source Name | The system generates a default name based on the data source type and current time. Modify it to match your naming convention. |
| Data Source Description | Enter a description, such as the use case and business scope. |
| Deployment Mode | Select ECS-based CDH. |
| Instance | Select the ECS instance where the CDH cluster is deployed. |
| Hive Metastore URI | The URI of the Hive Metastore, using the public IP address of the ECS instance where the CDH cluster is deployed. Format: thrift://<ECS-instance-IP>:<port>. The default port is 9083. |
| Host Configuration Information | The hostname-to-IP mappings, one per line. Example:<br>192.168.2.153 master.cdh<br>192.168.2.154 node1.cdh<br>192.168.2.155 node2.cdh |
Click Create.
Create a data migration job
In the left-side navigation pane, click Data Migration.
In the upper-right corner, click Create Migration Job.
On the Hive Data Source tab of the Create Migration Job page, configure the following sections:
Source and destination settings
| Parameter | Description |
|---|---|
| Job Name | The system generates a default name based on the data source type and current time. Modify it as needed. |
| Data Source | Select an existing Hive data source or create one. |
| Destination Type | Only Data Lake - OSS Storage is supported. |
| OSS Path | The OSS path where migrated data is stored. All buckets in the same region as the cluster are listed. Select an empty directory with no nested relationship to paths used by other migration jobs; nested paths can cause data to be overwritten. This setting cannot be changed after the job is created. |

Database/table migration settings
ImportantIf a database or table name appears in both lists, the blacklist takes precedence and the item is not migrated.
| Parameter | Description |
|---|---|
| Database/Table Migration Whitelist | Names of databases and tables to migrate, specified as regular expressions. Separate multiple expressions with commas. |
| Database/Table Migration Blacklist | Names of databases and tables to exclude from migration, specified as regular expressions. Separate multiple expressions with commas. |

Migration settings
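The whitelist and blacklist accept comma-separated regular expressions. The following configuration is hypothetical, and assumes for illustration that each pattern matches a name in database.table form:

```
# Whitelist: migrate every table in sales_db, plus the orders table in crm_db.
Whitelist: sales_db\..*,crm_db\.orders
# Blacklist: exclude any table whose name starts with tmp_.
# On overlap, the blacklist wins and the table is not migrated.
Blacklist: .*\.tmp_.*
```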
| Parameter | Description |
|---|---|
| Handling Same Named Destination Table | How to handle a table whose name conflicts with an existing table in the destination. Options: Skip Migration (skip the conflicting table and continue migrating other tables) or Report Error and Abort Migration (stop the entire job). |
| Job Resource Group | The job resource group that runs the migration job. |
| Required ACUs | The number of ACUs allocated to the job. Valid values: 4 to the maximum number of ACUs available in the job resource group. For better stability and throughput, allocate more ACUs. |
| Parallel Tasks | The number of tables to migrate concurrently. Default value: 1. Valid values: 1 to 8. Each task requires at least 4 ACUs. If ACUs are insufficient for the configured number of tasks, the tasks run sequentially. |
| Advanced Settings | Custom settings for the migration job. Contact technical support if you need to configure this parameter. |

Click Submit.
Start the data migration job
On the Data Migration page, find the job and click Start in the Actions column.
Click Search in the upper-right corner to refresh the job status. When the status changes to Starting, the migration is underway.
Analyze data
After the migration job completes, use metadata discovery to import data from OSS to Data Lakehouse Edition, then use Spark JAR Development to analyze the imported data.
In the left-side navigation pane, choose Job Development > Spark JAR Development.
Enter your Spark SQL statements in the default template and click Run Now. The following example shows the configuration structure used to query tables in Data Lakehouse Edition:
```sql
-- Configure Spark resources and connectors
conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;

-- Query migrated tables
show tables from lakehouse20220413156_adbTest;
```

(Optional) On the Applications tab, find your application and click Log in the Actions column to view the Spark SQL run log.
For more information about Spark development, see Spark editor and Spark batch application development overview.
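Beyond listing tables, you can query the migrated data directly. In the following sketch, the database name lakehouse20220413156_adbTest comes from the example above, while the table and column names are hypothetical:

```sql
-- Configure Spark resources and the OSS connector, as in the default template.
conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Query;
conf spark.adb.connectors=oss;

-- Count rows per partition of a migrated table (table and columns are placeholders).
SELECT dt, COUNT(*) AS row_count
FROM lakehouse20220413156_adbTest.orders
GROUP BY dt
ORDER BY dt;
```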
Manage the data migration job
On the Data Migration page, use the following operations in the Actions column:
| Operation | Description |
|---|---|
| Start | Starts the migration job. |
| View Details | Shows migration job settings and the number of migrated tables. |
| Edit | Modifies the configuration of the migration job. |
| Pause | Pauses the migration job. |
| Delete | Deletes the migration job. |