
AnalyticDB:Import data from a Hive data source

Last Updated: Mar 28, 2026

If your Hive tables are stored in an EMR or CDH cluster, you can use the data migration feature of AnalyticDB for MySQL to copy them into an Object Storage Service (OSS) bucket, then query the migrated data using Spark SQL in Data Lakehouse Edition.

Limitations

Review the following constraints before you start:

  • DLF Unified Metadata not supported: EMR clusters with Metadata set to DLF Unified Metadata cannot be used as a Hive data source for migration.

  • Data Lake Storage mode not supported: If your EMR cluster has Hive Storage Mode set to Data Lake Storage, the Hive data is already stored in OSS. Use metadata discovery to import the data to Data Lakehouse Edition instead.

  • Blacklist takes precedence: If a database or table name appears in both the whitelist and the blacklist, the blacklist prevails and the item is not migrated.

  • OSS path is permanent: After you set the OSS path for a migration job, it cannot be changed.

  • ACU range: The number of ACUs for a migration job must be between 4 and the maximum computing resources available in the job resource group.

  • Parallel tasks: Each parallel task migrates one table and requires at least 4 ACUs. If ACUs are insufficient, tasks run sequentially. Default value: 1. Maximum value: 8.
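The ACU and parallel-task rules above can be sketched as a quick sanity check. This is only an illustration of the documented constraints; the function name and its return value are our own, not part of the product:

```python
def effective_parallel_tasks(acus: int, parallel_tasks: int) -> int:
    """Return how many tables can migrate concurrently under the
    documented rules: a job needs at least 4 ACUs, parallel tasks
    must be between 1 and 8, and each concurrent task requires at
    least 4 ACUs (surplus tasks fall back to running sequentially)."""
    if acus < 4:
        raise ValueError("a migration job needs at least 4 ACUs")
    if not 1 <= parallel_tasks <= 8:
        raise ValueError("parallel tasks must be between 1 and 8")
    # Each concurrent task consumes 4 ACUs; tasks beyond the ACU
    # budget run sequentially rather than in parallel.
    return min(parallel_tasks, acus // 4)

# With 16 ACUs, 4 tables migrate concurrently; with only 8 ACUs,
# a request for 4 parallel tasks effectively runs 2 at a time.
print(effective_parallel_tasks(16, 4))  # 4
print(effective_parallel_tasks(8, 4))   # 2
```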

Prerequisites

Before you begin, make sure you have:

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster

  • A job resource group for the cluster. For more information, see Create a resource group

  • A database account for the cluster

  • One of the following clusters in the same region as the AnalyticDB for MySQL cluster:

    • An E-MapReduce (EMR) cluster with Business Scenario set to Data Lake, Metadata set to Self-managed RDS or Built-in MySQL, the Hive service configured, and Hive Storage Mode set to HDFS (the Data Lake Storage check box cleared). See Create a cluster

    • A CDH (Cloudera's Distribution Including Apache Hadoop) cluster deployed on an Elastic Compute Service (ECS) instance

  • A partitioned table in the Hive data source. See Use Hive to perform basic operations

Billing

Migrating data from Hive to OSS incurs fees for the compute resources that the migration job uses and for OSS storage and requests.

Migration workflow

The migration process has five steps:

  1. Create a Hive data source — Connect AnalyticDB for MySQL to your EMR or CDH cluster by specifying the Hive Metastore Uniform Resource Identifier (URI). Skip this step if the data source already exists.

  2. Create a data migration job — Define the source data source, destination OSS path, database and table filters, and compute resources for the job.

  3. Start the data migration job — Trigger the job to copy Hive tables into the OSS bucket.

  4. Analyze data — Use metadata discovery to import data from OSS to Data Lakehouse Edition, then run Spark SQL to query it.

  5. (Optional) Manage the data migration job — Start, pause, edit, or delete the job as needed.

Create a Hive data source

If you already have a Hive data source, skip this step and go to Create a data migration job.
  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click the ID of the cluster you want to manage.

  2. In the left-side navigation pane, choose Data Ingestion > Data Sources.

  3. In the upper-right corner, click Create Data Source.

  4. On the Create Data Source page, configure the parameters for your cluster type:

EMR cluster

  • Data Source Type: Select Hive.
  • Data Source Name: The system generates a default name based on the data source type and current time. Modify it to match your naming convention.
  • Data Source Description: Enter a description, such as the use case and business scope.
  • Deployment Mode: Select Alibaba Cloud Instance.
  • Instance: Select the EMR cluster that hosts the Hive data source.
  • Hive Metastore URI: The URI of the Hive Metastore, in the format thrift://<master-node-IP>:<port>. The default port is 9083. To find the private IP address of the master node, log on to the EMR console, click EMR on ECS in the left-side navigation pane, click the ID of the EMR cluster that you want to manage, open the Nodes tab, and expand the emr-master node.

ECS-based CDH cluster

  • Data Source Type: Select Hive.
  • Data Source Name: The system generates a default name based on the data source type and current time. Modify it to match your naming convention.
  • Data Source Description: Enter a description, such as the use case and business scope.
  • Deployment Mode: Select ECS-based CDH.
  • Instance: Select the ECS instance where the CDH cluster is deployed.
  • Hive Metastore URI: The URI of the Hive Metastore, using the public IP address of the ECS instance where the CDH cluster is deployed. Format: thrift://<ECS-instance-IP>:<port>. The default port is 9083.
  • Host Configuration Information: The hostname-to-IP mappings, one per line. Example:

    192.168.2.153 master.cdh
    192.168.2.154 node1.cdh
    192.168.2.155 node2.cdh
  5. Click Create.
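For the ECS-based CDH case, the Host Configuration Information field is a list of `IP hostname` pairs, one per line. The following sketch shows how such a mapping can be parsed and checked before pasting it into the console; the helper is purely illustrative and not part of the product:

```python
def parse_host_config(text: str) -> dict[str, str]:
    """Parse hostname-to-IP mappings of the form '<IP> <hostname>',
    one per line, into a {hostname: ip} dict."""
    mapping = {}
    for line in text.strip().splitlines():
        ip, hostname = line.split()  # exactly two fields per line
        mapping[hostname] = ip
    return mapping

hosts = parse_host_config("""
192.168.2.153 master.cdh
192.168.2.154 node1.cdh
192.168.2.155 node2.cdh
""")
print(hosts["master.cdh"])  # 192.168.2.153
```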

Create a data migration job

  1. In the left-side navigation pane, click Data Migration.

  2. In the upper-right corner, click Create Migration Job.

  3. On the Hive Data Source tab of the Create Migration Job page, configure the following sections:

    Source and destination settings

    • Job Name: The system generates a default name based on the data source type and current time. Modify it as needed.
    • Data Source: Select an existing Hive data source or create one.
    • Destination Type: Only Data Lake - OSS Storage is supported.
    • OSS Path: The OSS path where migrated data is stored. All buckets in the same region as the cluster are listed. Select an empty directory with no nested relationship to paths used by other migration jobs — nested paths can cause data to be overwritten. This setting cannot be changed after the job is created.

    Database/table migration settings

    Important

    If a database or table name appears in both lists, the blacklist takes precedence and the item is not migrated.

    • Database/Table Migration Whitelist: Names of databases and tables to migrate, specified as regular expressions. Separate multiple expressions with commas.
    • Database/Table Migration Blacklist: Names of databases and tables to exclude from migration, specified as regular expressions. Separate multiple expressions with commas.

    Migration settings

    • Handling Same Named Destination Table: How to handle a table whose name conflicts with an existing table in the destination. Options: Skip Migration (skip the conflicting table and continue migrating others) or Report Error and Abort Migration (stop the entire job).
    • Job Resource Group: The job resource group that runs the migration job.
    • Required ACUs: The number of ACUs allocated to the job. Valid values: 4 to the maximum number of ACUs available in the job resource group. For better stability and throughput, allocate more ACUs.
    • Parallel Tasks: The number of tables to migrate concurrently. Default: 1. Valid values: 1 to 8. Each task requires at least 4 ACUs. If ACUs are insufficient for the configured number of tasks, the tasks run sequentially.
    • Advanced Settings: Custom settings for the migration job. Contact technical support if you need to configure this.
  4. Click Submit.
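The whitelist/blacklist semantics described above — comma-separated regular expressions, with the blacklist taking precedence on conflict — can be sketched as follows. The helper is illustrative only, not the product's actual matching logic, and whether the product anchors patterns as full matches is an assumption here:

```python
import re

def should_migrate(name: str, whitelist: str, blacklist: str) -> bool:
    """Decide whether a database/table name is migrated.

    whitelist and blacklist are comma-separated regular expressions.
    A name matching any blacklist pattern is excluded even if it also
    matches the whitelist (blacklist takes precedence).
    """
    def matches(patterns: str) -> bool:
        # Assumes full-match anchoring for each pattern.
        return any(re.fullmatch(p.strip(), name)
                   for p in patterns.split(",") if p.strip())
    if matches(blacklist):
        return False
    return matches(whitelist)

# 'sales_2023' matches the whitelist but also the blacklist, so it is
# excluded; 'sales_2024' matches only the whitelist and is migrated.
print(should_migrate("sales_2023", r"sales_.*", r".*_2023"))  # False
print(should_migrate("sales_2024", r"sales_.*", r".*_2023"))  # True
```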

Start the data migration job

  1. On the Data Migration page, find the job and click Start in the Actions column.

  2. Click Search in the upper-right corner to refresh the job list. When the job status changes to Starting, the migration is underway.

Analyze data

After the migration job completes, use metadata discovery to import data from OSS to Data Lakehouse Edition, then use Spark JAR Development to analyze the imported data.

  1. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  2. Enter your Spark SQL statements in the default template and click Run Now. The following example shows the configuration structure used to query tables in Data Lakehouse Edition:

    -- Configure Spark resources and connectors
    conf spark.driver.resourceSpec=medium;
    conf spark.executor.instances=2;
    conf spark.executor.resourceSpec=medium;
    conf spark.app.name=Spark SQL Test;
    conf spark.adb.connectors=oss;
    
    -- Query migrated tables
    show tables from lakehouse20220413156_adbTest;
  3. (Optional) On the Applications tab, find your application and click Log in the Actions column to view the Spark SQL run log.

For more information about Spark development, see Spark editor and Spark batch application development overview.

Manage the data migration job

On the Data Migration page, use the following operations in the Actions column:

  • Start: Starts the migration job.
  • View Details: Shows migration job settings and the number of migrated tables.
  • Edit: Modifies the configuration of the migration job.
  • Pause: Pauses the migration job.
  • Delete: Deletes the migration job.