E-MapReduce: Migrate Hadoop clusters to DataLake clusters

Last Updated: Mar 26, 2026

Migrate your workloads from a Hadoop cluster (old EMR console) to a DataLake cluster (new EMR console). The new EMR console is a next-generation, cloud-native open source big data platform. It provides new cluster types: DataLake, Dataflow, OLAP, and custom. DataLake clusters are an upgraded version of Hadoop clusters. Three factors determine the migration path: cluster version, metadata storage type, and data storage method.

Prerequisites

Before you begin, ensure that you have:

  • Access to both the old EMR console and the new EMR console

  • Permissions to create and release clusters

  • Identified the metadata storage type, data storage architecture, and scheduling system used in your old cluster

Preparations

Assess the old cluster

Analyze your current big data architecture before migration. Identify and record the following:

  • Services and versions: List every service running in the old cluster and its version. This determines upgrade compatibility and required feature updates.

  • Metadata storage type: Determine whether the old cluster uses Data Lake Formation (DLF) or a self-managed ApsaraDB RDS database. This drives the metadata migration path.

  • Data storage architecture: Identify whether data is stored on local Hadoop Distributed File System (HDFS), Object Storage Service (OSS), or JindoFS in block mode. This determines the data migration method.

  • User authentication: Check whether OpenLDAP, Ranger, or Kerberos is deployed. Plan how the new cluster inherits these security configurations.

  • Scheduling system: Identify the development and scheduling platform to maintain task continuity during migration.

If you are migrating multiple old clusters, migrate them one at a time to preserve business continuity.

Collect instance configurations

View the hardware and software configurations of the old cluster on the Basic Information and Nodes tabs of your cluster on the EMR on ECS page. Record the following:

Configuration type | Items to record | Purpose
Software | Cluster version, service versions, components in use, Hive metadata storage type | Determine compatible versions for the new cluster
Hardware | Zone, node group specifications, and billing method | Replicate or optimize hardware in the new cluster

(Optional) Export service configurations

If your old cluster has customized service configurations, export them so you can apply them to the new cluster at creation time rather than reconfiguring manually afterward.

  1. Export the configurations. Follow Export and import service configurations to export configuration files. When exporting, observe these constraints:

    • Select Configuration Files: Select only files that have been edited.

    • Export Mode: This cannot be set to Export Only Custom or Modified Configurations for Hadoop clusters.

    • Export Format: Select JSON to enable direct import to the new cluster.

    The exported files contain the following parameters:

    Parameter | Description
    ApplicationName | Service name
    ConfigFileName | Configuration file name
    ConfigItemKey | Configuration item key
    ConfigItemValue | Configuration item value

  2. Review and clean up the exported files. Remove configurations that do not apply to the new environment:

    • YARN resource parameters: Adjust values to match the actual hardware specifications of the new cluster.

    • JindoFS credential providers: Replace JindoFS-specific credential settings with OSS or OSS-HDFS configurations. See Configure a credential provider for OSS or OSS-HDFS.

  3. Use the cleaned configuration files as preset configurations when creating the new cluster. See (Optional) Custom software configuration later in this guide.
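The review step can be partly scripted. The following is a minimal sketch, assuming the export is a JSON array of flat records that carry the four documented fields; the actual file layout may differ. The prefix list and the function name `split_for_review` are hypothetical and only illustrate the idea of setting aside hardware-dependent YARN parameters for manual review.

```python
import json

# Assumption: the export is a JSON array of flat records with the four
# documented fields. Inspect a real export before relying on this layout.
SAMPLE_EXPORT = """
[
  {"ApplicationName": "YARN", "ConfigFileName": "yarn-site.xml",
   "ConfigItemKey": "yarn.nodemanager.resource.memory-mb",
   "ConfigItemValue": "65536"},
  {"ApplicationName": "HIVE", "ConfigFileName": "hive-site.xml",
   "ConfigItemKey": "hive.exec.parallel", "ConfigItemValue": "true"}
]
"""

# Hypothetical list of key prefixes whose values depend on node hardware.
REVIEW_PREFIXES = ("yarn.nodemanager.resource.", "yarn.scheduler.")


def split_for_review(records, prefixes=REVIEW_PREFIXES):
    """Return (keep, review): items to import as-is and items to re-check."""
    keep, review = [], []
    for item in records:
        target = review if item["ConfigItemKey"].startswith(prefixes) else keep
        target.append(item)
    return keep, review


keep, review = split_for_review(json.loads(SAMPLE_EXPORT))
print(json.dumps(keep, indent=2))  # candidate preset configuration
print(f"{len(review)} item(s) need manual review")
```

Items in `review` are not discarded. Adjust their values to the new hardware, then add them back to the preset configuration.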

(Optional) Review bootstrap actions

Check whether the old cluster has bootstrap action scripts configured. If so, evaluate each script and update it as needed before using it in the new cluster.

After updating, upload the scripts to OSS and reference the new paths when creating the cluster.

Important

Validate all bootstrap action scripts in a test environment before applying them to a production cluster.

(Optional) Review auto scaling rules

If auto scaling is configured on the old cluster, record the following parameters before creating the new cluster, then reconfigure them after the new cluster is ready:

  1. In the old EMR console, go to EMR on ECS, open the cluster, and click the Auto Scaling tab.

  2. Click Configure Rule for the relevant auto scaling group and record:

    • Maximum Number of Instances

    • Minimum Number of Instances

    • Graceful Shutdown

    • Trigger Mode

    • Trigger Rule (Scale Out and Scale In)

For reconfiguration on the new cluster, see Add auto scaling rules.

On the new cluster, you can also configure Instance Type Selection Mode, Billing Method, Instance Type, and Graceful Shutdown in the Configure Auto Scaling panel. See Manage node groups.

(Optional) Assess cluster load

Before sizing the new cluster, review resource utilization in the old cluster.

Activate EMR Doctor before using it in the old console. See Activate EMR Doctor (Hadoop clusters).

Plan the migration

Based on your assessment, decide the following before proceeding:

  • Product version and services: Determine which services and versions to deploy in the new cluster. See Select the product version and optional services in Step 1.

  • Metadata storage: Choose DLF Unified Metadata or Self-managed RDS.

    Built-in MySQL is for test environments only and must not be used in production.

  • Data storage: Choose OSS-HDFS or OSS.

Step 1: Create clusters in the new EMR console

Create a new cluster

Follow Create a cluster and apply the configurations you collected. Pay attention to the following settings.

Product version and optional services (select at least one)

Select services and versions based on compatibility with your old cluster.

Service compatibility

As open source communities release newer versions, some services in a DataLake cluster are at later versions than those in a Hadoop cluster. The following table shows backward compatibility ranges. Within each range, a later version can read data produced by an earlier version.

Service | Range 1 | Range 2 | Range 3 | Range 4
Spark | 2.X | 3.X
Hive | 2.X | 3.X
Tez | All versions compatible
Delta Lake | 0.6.X | 0.8.0–1.1.0
Iceberg | 0.12.X | 0.13.X
Hudi | 0.6.X | 0.8.X | 0.9.X | 0.10.X
Sqoop | All versions compatible
Ranger | 1.X | 2.X
OpenLDAP | All versions compatible

Compatibility information is for reference only. Verify against the official documentation of each service.

Some services available in the old console are not supported in the new console, including Hue, Zeppelin, and Oozie. Migrate those to EMR Notebook or EMR Workflow, or deploy equivalent engines in the new cluster.
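To check a planned version pair against the compatibility ranges, a small lookup can help. The sketch below transcribes the ranges from the table. The function name `can_read`, the tuple encoding of versions, and the `OPEN` sentinel for the "X" wildcard are illustrative assumptions, not part of EMR. As with the table itself, treat the result as a reference only.

```python
import re

OPEN = 10**6  # stands in for the "X" wildcard in a version such as 2.X

# (low, high) inclusive bounds per range; None means all versions compatible.
RANGES = {
    "Spark": [((2,), (2, OPEN)), ((3,), (3, OPEN))],
    "Hive": [((2,), (2, OPEN)), ((3,), (3, OPEN))],
    "Tez": None,
    "Delta Lake": [((0, 6), (0, 6, OPEN)), ((0, 8), (1, 1, 0))],
    "Iceberg": [((0, 12), (0, 12, OPEN)), ((0, 13), (0, 13, OPEN))],
    "Hudi": [((0, 6), (0, 6, OPEN)), ((0, 8), (0, 8, OPEN)),
             ((0, 9), (0, 9, OPEN)), ((0, 10), (0, 10, OPEN))],
    "Sqoop": None,
    "Ranger": [((1,), (1, OPEN)), ((2,), (2, OPEN))],
    "OpenLDAP": None,
}


def parse(version):
    """Turn '2.4.7' into (2, 4, 7) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def can_read(service, writer_version, reader_version):
    """True if both versions share a range and the reader is not older."""
    ranges = RANGES[service]
    if ranges is None:
        return True
    writer, reader = parse(writer_version), parse(reader_version)
    for low, high in ranges:
        if low <= writer <= high and low <= reader <= high:
            return reader >= writer
    return False


print(can_read("Spark", "2.4.7", "3.2.1"))
print(can_read("Delta Lake", "0.8.0", "1.1.0"))
```

The first call crosses from Range 1 to Range 2 for Spark, and the second stays inside one Delta Lake range.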

Product version

Select the EMR version series based on the old cluster version. In data lake scenarios, EMR provides two series: EMR V3.X and EMR V5.X.

Old cluster version | New cluster version series
EMR V3.35.0 (YARN 2.8.5, HDFS 2.8.5, Hive 2.3.7, Spark 2.4.7) | EMR V3.X series
EMR V5.6.0 (YARN 3.2.1, HDFS 3.2.1, Hive 3.1.2, Spark 3.2.1) | EMR V5.X series

When software version requirements are met, select the latest available EMR version to access the newest features.
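If you inventory many old clusters, the series mapping can be applied programmatically. This is a minimal sketch that assumes version strings follow the "EMR V3.35.0" form shown in the table; `target_series` is a hypothetical helper. It covers only the two series named above and rejects anything else.

```python
import re


def target_series(old_version: str) -> str:
    """Map an old cluster version such as 'EMR V3.35.0' to its series."""
    match = re.search(r"V(\d+)\.", old_version)
    if match is None or match.group(1) not in {"3", "5"}:
        raise ValueError(f"no documented target series for {old_version!r}")
    return f"EMR V{match.group(1)}.X"


for version in ("EMR V3.35.0", "EMR V5.6.0"):
    print(version, "->", target_series(version))
```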

Select HDFS or OSS-HDFS

Starting from EMR V5.12.1 and EMR V3.46.1, you can choose HDFS or OSS-HDFS as the underlying storage from optional services.

Select the optional service based on your planned storage solution:

Storage in new console | Service to select
OSS | OSS-HDFS
OSS-HDFS | OSS-HDFS

If you select OSS-HDFS, configure the Root Storage Directory of Cluster parameter to specify a bucket with OSS-HDFS enabled as the cluster's root storage path.

Metadata

Metadata storage type | Description
DLF Unified Metadata (recommended) | Metadata is stored in DLF. If the old cluster already uses DLF, set the DLF Catalog to the same value. Metadata is then shared automatically and no migration is required.
Self-managed RDS | Metadata is stored in an ApsaraDB RDS database you manage. Configure the existing database parameters. See Configure a self-managed ApsaraDB RDS for MySQL database.
Built-in MySQL | Metadata is stored in a local MySQL database on the cluster. For test environments only.

(Optional) Custom software configuration

If you exported service configurations from the old cluster, turn on Custom Software Configuration when creating the new cluster and paste the configuration content into the field. See Customize software configurations.

Hardware configurations

Select hardware for master, core, and task nodes based on the resource utilization data you collected:

  • Use the latest ECS instance families and cloud disk types to access updated hardware capabilities.

  • Add node groups with the same role after cluster creation if needed.

  • Assign Public Network IP: Turn on this switch for the master node group if you need Internet access to the master node or to the web UIs of open source components.

(Optional) Create a gateway

If you use a gateway in the old console to submit jobs and isolate clusters, deploy one in the new environment using EMR-CLI. This tool deploys a gateway on an existing ECS instance and automatically synchronizes cluster configurations. See Use EMR-CLI to deploy a gateway.

Step 2: Migrate and verify data

With the new cluster running, migrate metadata, data, and jobs from the old cluster.

Migrate metadata

Select the migration method based on the metadata management modes in the old and new clusters. DLF Unified Metadata is strongly recommended for the new cluster.

Old metadata mode | New metadata mode | Migration method
DLF | DLF | No migration needed. Set the same DLF catalog on the new cluster.
Unified metabase | DLF | See Migration of EMR metadata.
Local MySQL | DLF | See Migrate metadata.
Self-managed ApsaraDB RDS | DLF | See Migrate metadata.

Migrate data

Select the migration method based on the storage modes in the old and new clusters.

Old storage | New storage | Migration method
OSS | OSS | No migration needed.
OSS | OSS-HDFS | Use Jindo DistCp.
JindoFS in block mode | OSS-HDFS | Use Jindo DistCp.
HDFS | OSS-HDFS | Use Jindo DistCp.
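When planning several datasets at once, the table can be encoded as a lookup so that unsupported combinations fail early. The following is a sketch; `plan_data_migration` and the dictionary are illustrative names, and the entries simply mirror the rows above.

```python
# (old storage, new storage) -> tool, or None when no copy is required.
MIGRATION_METHOD = {
    ("OSS", "OSS"): None,
    ("OSS", "OSS-HDFS"): "Jindo DistCp",
    ("JindoFS in block mode", "OSS-HDFS"): "Jindo DistCp",
    ("HDFS", "OSS-HDFS"): "Jindo DistCp",
}


def plan_data_migration(old_storage, new_storage):
    """Return the documented action for a storage pair, or raise."""
    try:
        tool = MIGRATION_METHOD[(old_storage, new_storage)]
    except KeyError:
        raise ValueError(
            f"no documented path from {old_storage} to {new_storage}"
        ) from None
    return "no migration needed" if tool is None else f"copy with {tool}"


for pair in MIGRATION_METHOD:
    print(pair, "->", plan_data_migration(*pair))
```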

Verify data

Skip this step if no data migration was required.

After migration, verify the correctness of HDFS data and data in Hive databases and tables. If inconsistencies are found, rerun affected tasks or supplement missing data immediately.

Requirement | Verification method
File verification | Calculate checksum values before and after migration and compare them to confirm no changes or data corruption.
Rough data verification | Check table-level statistics (row counts, numeric column sums and averages, and minimum and maximum values) to quickly evaluate overall consistency.
Detailed data verification | Compare each row of data to confirm all records are intact after migration.
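The first two verification methods can be illustrated with a short sketch. It assumes the data is reachable as binary streams and in-memory rows, which is a simplification: on a real cluster you would compute checksums and statistics with the storage and query engines themselves. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import io
from statistics import mean


def stream_checksum(stream, chunk_size=1 << 20):
    """SHA-256 of a binary stream, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def table_stats(rows, column):
    """Row count, sum, average, min, and max of one numeric column."""
    values = [row[column] for row in rows]
    if not values:
        return {"rows": 0}
    return {
        "rows": len(values),
        "sum": sum(values),
        "avg": mean(values),
        "min": min(values),
        "max": max(values),
    }


# File verification: identical bytes give identical checksums.
before = stream_checksum(io.BytesIO(b"id,amount\n1,10\n2,32\n"))
after = stream_checksum(io.BytesIO(b"id,amount\n1,10\n2,32\n"))
print("checksums match:", before == after)

# Rough verification: row order does not affect the statistics.
old_rows = [{"amount": 10}, {"amount": 32}]
new_rows = [{"amount": 32}, {"amount": 10}]
print("stats match:",
      table_stats(old_rows, "amount") == table_stats(new_rows, "amount"))
```

Integer columns compare exactly. For floating-point columns, sums can differ slightly with row order, so compare with a tolerance rather than strict equality.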

Migrate jobs

Migrate jobs based on your scheduling environment.

Step 3: Run parallel verification

Before cutting over traffic, run jobs on both the old and new clusters simultaneously to verify data consistency and business accuracy. This is sometimes referred to as double-run verification.

The specific approach depends on your business architecture, data processing requirements, and risk tolerance. Design a parallel run plan that fits your scenario. The goal is to confirm the new cluster produces results identical to the old cluster before you commit to the switch.
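One way to compare the output of the same job on both clusters is a multiset difference over result rows. The sketch below assumes both result sets fit in memory as hashable, sortable tuples; `diff_results` is a hypothetical helper. For large tables, the same idea applies to per-partition aggregates or row hashes.

```python
from collections import Counter


def diff_results(old_rows, new_rows):
    """Report rows whose multiplicity differs between the two result sets."""
    old_counts, new_counts = Counter(old_rows), Counter(new_rows)
    return {
        # Counter subtraction keeps only positive counts.
        "missing_in_new": sorted((old_counts - new_counts).elements()),
        "extra_in_new": sorted((new_counts - old_counts).elements()),
    }


old_result = [("2024-01-01", 10), ("2024-01-02", 12)]
new_result = [("2024-01-02", 13), ("2024-01-01", 10)]
report = diff_results(old_result, new_result)
print(report)
```

An empty report on both keys means the two runs produced the same rows, regardless of row order and including duplicates.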

Step 4: Cut over and release the old cluster

After parallel verification confirms that the new cluster handles all workloads correctly, schedule a maintenance window and perform the cutover:

  1. Gradually shift jobs from the old cluster to the new cluster.

  2. Increase the job processing volume on the new cluster incrementally.

  3. Monitor the new cluster until all jobs run stably.
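The gradual shift can be made deterministic by assigning each job to a fixed bucket and raising a percentage threshold over time. This is only one possible approach and is not prescribed by EMR. The sketch assumes jobs are identified by a stable name, and both function names are hypothetical.

```python
import hashlib


def bucket(job_name):
    """Stable bucket in [0, 100) derived from the job name."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def target_cluster(job_name, new_cluster_percent):
    """Route a job to 'new' once its bucket falls under the threshold."""
    return "new" if bucket(job_name) < new_cluster_percent else "old"


jobs = ["daily_orders_etl", "hourly_clickstream", "weekly_report"]
for percent in (0, 25, 50, 100):
    routed = {job: target_cluster(job, percent) for job in jobs}
    print(percent, routed)
```

Because the bucket depends only on the job name, a job that has moved to the new cluster stays there as the percentage increases, which keeps monitoring results comparable between stages.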

Once all workloads run on the new cluster and none remain on the old cluster, release the old cluster following Release a cluster.
