Migrate your workloads from a Hadoop cluster (old EMR console) to a DataLake cluster (new EMR console). The new EMR console is a next-generation, cloud-native open source big data platform. It provides new cluster types: DataLake, Dataflow, OLAP, and custom. DataLake clusters are an upgraded version of Hadoop clusters. Three factors determine the migration path: cluster version, metadata storage type, and data storage method.
Prerequisites
Before you begin, ensure that you have:
-
Access to both the old EMR console and the new EMR console
-
Permissions to create and release clusters
-
Identified the metadata storage type, data storage architecture, and scheduling system used in your old cluster
Preparations
Assess the old cluster
Analyze your current big data architecture before migration. Identify and record the following:
-
Services and versions: List every service running in the old cluster and its version. This determines upgrade compatibility and required feature updates.
-
Metadata storage type: Determine whether the old cluster uses Data Lake Formation (DLF) or a self-managed ApsaraDB RDS database. This drives the metadata migration path.
-
Data storage architecture: Identify whether data is stored on local Hadoop Distributed File System (HDFS), Object Storage Service (OSS), or JindoFS in block mode. This determines the data migration method.
-
User authentication: Check whether OpenLDAP, Ranger, or Kerberos is deployed. Plan how the new cluster inherits these security configurations.
-
Scheduling system: Identify the development and scheduling platform to maintain task continuity during migration.
If you are migrating multiple old clusters, migrate them one at a time to preserve business continuity.
Collect instance configurations
On the EMR on ECS page, open the old cluster and view its hardware and software configurations on the Basic Information and Nodes tabs. Record the following:
| Configuration type | Items to record | Purpose |
|---|---|---|
| Software | Cluster version, service versions, components in use, Hive metadata storage type | Determine compatible versions for the new cluster |
| Hardware | Zone, node group specifications, and billing method | Replicate or optimize hardware in the new cluster |
(Optional) Export service configurations
If your old cluster has customized service configurations, export them so you can apply them to the new cluster at creation time rather than reconfiguring manually afterward.
-
Export the configurations. Follow Export and import service configurations to export configuration files. When exporting, observe these constraints:
-
Select Configuration Files: Select only files that have been edited.
-
Export Mode: This cannot be set to Export Only Custom or Modified Configurations for Hadoop clusters.
-
Export Format: Select JSON to enable direct import to the new cluster.
The exported files contain the following parameters:
| Parameter | Description |
|---|---|
| ApplicationName | Service name |
| ConfigFileName | Configuration file name |
| ConfigItemKey | Configuration item key |
| ConfigItemValue | Configuration item value |
-
Review and clean up the exported files. Remove configurations that do not apply to the new environment:
-
YARN resource parameters: Adjust values to match the actual hardware specifications of the new cluster.
-
JindoFS credential providers: Replace JindoFS-specific credential settings with OSS or OSS-HDFS configurations. See Configure a credential provider for OSS or OSS-HDFS.
-
Use the cleaned configuration files as preset configurations when creating the new cluster. See (Optional) Custom software configuration later in this guide.
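The cleanup pass can be partly automated. The following is a minimal sketch, not an official tool: it assumes the JSON export is a list of objects carrying the four parameters listed above, and the key prefixes in `REVIEW_PREFIXES` and the sample values are illustrative assumptions that you should adapt to your own export.

```python
import json

# Assumed layout: a JSON list of objects with the four exported parameters.
# The sample entries are illustrative, not taken from a real cluster.
SAMPLE_EXPORT = """
[
  {"ApplicationName": "YARN", "ConfigFileName": "yarn-site.xml",
   "ConfigItemKey": "yarn.nodemanager.resource.memory-mb", "ConfigItemValue": "24576"},
  {"ApplicationName": "HIVE", "ConfigFileName": "hive-site.xml",
   "ConfigItemKey": "hive.exec.parallel", "ConfigItemValue": "true"},
  {"ApplicationName": "HADOOP-COMMON", "ConfigFileName": "core-site.xml",
   "ConfigItemKey": "fs.jfs.cache.oss.credentials.provider", "ConfigItemValue": "example.Provider"}
]
"""

# Hypothetical prefixes used to flag items that need manual review.
REVIEW_PREFIXES = {
    "yarn.nodemanager.resource.": "resize for new hardware",
    "yarn.scheduler.": "resize for new hardware",
    "fs.jfs.": "replace with OSS or OSS-HDFS credential settings",
}


def split_for_review(items):
    """Separate items that can be reused as-is from items that need review."""
    keep, review = [], []
    for item in items:
        key = item["ConfigItemKey"]
        reason = next(
            (r for prefix, r in REVIEW_PREFIXES.items() if key.startswith(prefix)),
            None,
        )
        if reason is None:
            keep.append(item)
        else:
            review.append((key, reason))
    return keep, review


keep, review = split_for_review(json.loads(SAMPLE_EXPORT))
print([item["ConfigItemKey"] for item in keep])
for key, reason in review:
    print(f"REVIEW {key}: {reason}")
```

The script only flags candidates. Deciding the new YARN values and the replacement credential settings remains a manual step.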
(Optional) Review bootstrap actions
Check whether the old cluster has bootstrap action scripts configured. If so, evaluate each script before using it in the new cluster:
-
Update JAR package names and paths to match the new console's file locations. See Paths of frequently used files.
-
Update any OSS commands used in the scripts. See Manage bootstrap actions.
After updating, upload the scripts to OSS and reference the new paths when creating the cluster.
Validate all bootstrap action scripts in a test environment before applying them to a production cluster.
(Optional) Review auto scaling rules
If auto scaling is configured on the old cluster, record the following parameters before creating the new cluster, then reconfigure them after the new cluster is ready:
-
In the old EMR console, go to EMR on ECS, open the cluster, and click the Auto Scaling tab.
-
Click Configure Rule for the relevant auto scaling group and record:
-
Maximum Number of Instances
-
Minimum Number of Instances
-
Graceful Shutdown
-
Trigger Mode
-
Trigger Rule (Scale Out and Scale In)
-
For reconfiguration on the new cluster, see Add auto scaling rules.
On the new cluster, you can also configure Instance Type Selection Mode, Billing Method, Instance Type, and Graceful Shutdown in the Configure Auto Scaling panel. See Manage node groups.
(Optional) Assess cluster load
Before sizing the new cluster, review resource utilization in the old cluster:
-
Service metrics: View YARN and HDFS resource usage. See View service metrics.
-
EMR Doctor daily reports: Review computing resources, YARN scheduling, and HDFS storage distribution. See View daily cluster reports and analysis results in the reports.
Activate EMR Doctor before using it in the old console. See Activate EMR Doctor (Hadoop clusters).
Plan the migration
Based on your assessment, decide the following before proceeding:
-
Product version and services: Determine which services and versions to deploy in the new cluster. See Select the product version and optional services in Step 1.
-
Metadata storage: Choose DLF Unified Metadata or Self-managed RDS.
Built-in MySQL is for test environments only and must not be used in production.
-
Data storage: Choose OSS-HDFS or OSS.
Step 1: Create clusters in the new EMR console
Create a new cluster
Follow Create a cluster and apply the configurations you collected. Pay attention to the following settings.
Product version and optional services (select at least one)
Select services and versions based on compatibility with your old cluster.
Service compatibility
As open source communities release newer versions, some services in a DataLake cluster are at later versions than those in a Hadoop cluster. The following table shows backward compatibility ranges. Within each range, a later version can read data produced by an earlier version.
| Service | Range 1 | Range 2 | Range 3 | Range 4 |
|---|---|---|---|---|
| Spark | 2.X | 3.X | — | — |
| Hive | 2.X | 3.X | — | — |
| Tez | All versions compatible | — | — | — |
| Delta Lake | 0.6.X | 0.8.0–1.1.0 | — | — |
| Iceberg | 0.12.X | 0.13.X | — | — |
| Hudi | 0.6.X | 0.8.X | 0.9.X | 0.10.X |
| Sqoop | All versions compatible | — | — | — |
| Ranger | 1.X | 2.X | — | — |
| OpenLDAP | All versions compatible | — | — | — |
Compatibility information is for reference only. Verify against the official documentation of each service.
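To check a planned upgrade against the table, you can encode the ranges and test whether the old and new versions fall in the same one. The sketch below is an illustration only: the ranges are transcribed from the table, `X` wildcards are widened to a large number, and the function and variable names are hypothetical.

```python
WILD = 999  # stands in for the "X" wildcard in the table

# service -> list of inclusive (low, high) ranges on (major, minor, patch)
RANGES = {
    "Spark": [((2, 0, 0), (2, WILD, WILD)), ((3, 0, 0), (3, WILD, WILD))],
    "Hive": [((2, 0, 0), (2, WILD, WILD)), ((3, 0, 0), (3, WILD, WILD))],
    "Delta Lake": [((0, 6, 0), (0, 6, WILD)), ((0, 8, 0), (1, 1, 0))],
    "Iceberg": [((0, 12, 0), (0, 12, WILD)), ((0, 13, 0), (0, 13, WILD))],
    "Hudi": [
        ((0, 6, 0), (0, 6, WILD)),
        ((0, 8, 0), (0, 8, WILD)),
        ((0, 9, 0), (0, 9, WILD)),
        ((0, 10, 0), (0, 10, WILD)),
    ],
    "Ranger": [((1, 0, 0), (1, WILD, WILD)), ((2, 0, 0), (2, WILD, WILD))],
}
ALWAYS_COMPATIBLE = {"Tez", "Sqoop", "OpenLDAP"}


def parse(version):
    """Turn '2.3.7' into (2, 3, 7), padding missing parts with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))[:3]


def range_index(service, version):
    """Return the index of the range that contains the version, or None."""
    v = parse(version)
    for i, (low, high) in enumerate(RANGES[service]):
        if low <= v <= high:
            return i
    return None


def can_read(service, old_version, new_version):
    """True if the new version is in the same range and not older than the old one."""
    if service in ALWAYS_COMPATIBLE:
        return True
    old_i = range_index(service, old_version)
    if old_i is None or old_i != range_index(service, new_version):
        return False
    return parse(new_version) >= parse(old_version)


print(can_read("Hive", "2.3.7", "2.3.9"))
print(can_read("Hive", "2.3.7", "3.1.2"))
print(can_read("Delta Lake", "0.8.0", "1.1.0"))
print(can_read("Hudi", "0.9.0", "0.10.0"))
```

A `False` result does not mean the upgrade is impossible. It means the table does not promise that the new version can read the old data, so plan extra validation.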
Some services available in the old console are not supported in the new console, including Hue, Zeppelin, and Oozie. Migrate those to EMR Notebook or EMR Workflow, or deploy equivalent engines in the new cluster.
Product version
Select the EMR version series based on the old cluster version. In data lake scenarios, EMR provides two series: EMR V3.X and EMR V5.X.
| Old cluster version | New cluster version series |
|---|---|
| EMR V3.35.0 (YARN 2.8.5, HDFS 2.8.5, Hive 2.3.7, Spark 2.4.7) | EMR V3.X series |
| EMR V5.6.0 (YARN 3.2.1, HDFS 3.2.1, Hive 3.1.2, Spark 3.2.1) | EMR V5.X series |
When software version requirements are met, select the latest available EMR version to access the newest features.
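If you are inventorying many clusters, the series mapping above can be scripted. This is a small sketch with a hypothetical helper name. It assumes the version string contains a three-part number such as `3.35.0` and covers only the two series named in this section.

```python
import re


def target_series(old_version):
    """Map an old cluster version string to the new cluster version series."""
    match = re.search(r"(\d+)\.\d+\.\d+", old_version)
    if not match:
        raise ValueError(f"no version number found in {old_version!r}")
    major = int(match.group(1))
    if major not in (3, 5):
        raise ValueError(f"no series documented for major version {major}")
    return f"EMR V{major}.X"


print(target_series("EMR V3.35.0"))
print(target_series("EMR-5.6.0"))
```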
Select HDFS or OSS-HDFS
Starting from EMR V5.12.1 and EMR V3.46.1, you can choose HDFS or OSS-HDFS as the underlying storage from optional services.
Select the optional service based on your planned storage solution:
| Storage in new console | Service to select |
|---|---|
| OSS | OSS-HDFS |
| OSS-HDFS | OSS-HDFS |
If you select OSS-HDFS, configure the Root Storage Directory of Cluster parameter to specify a bucket with OSS-HDFS enabled as the cluster's root storage path.
Metadata
| Metadata storage type | Description |
|---|---|
| DLF Unified Metadata (recommended) | Metadata is stored in DLF. If the old cluster already uses DLF, set the DLF Catalog to the same value — metadata is automatically shared and no migration is required. |
| Self-managed RDS | Metadata is stored in an ApsaraDB RDS database you manage. Configure the existing database parameters. See Configure a self-managed ApsaraDB RDS for MySQL database. |
| Built-in MySQL | Metadata is stored in a local MySQL database on the cluster. For test environments only. |
(Optional) Custom software configuration
If you exported service configurations from the old cluster, turn on Custom Software Configuration when creating the new cluster and paste the configuration content into the field. See Customize software configurations.
Hardware configurations
Select hardware for master, core, and task nodes based on the resource utilization data you collected:
-
Use the latest ECS instance families and cloud disk types to access updated hardware capabilities.
-
Add node groups with the same role after cluster creation if needed.
-
Assign Public Network IP: Turn on this switch for the master node group if you need Internet access to the master node or to the web UIs of open source components.
(Optional) Create a gateway
If you use a gateway in the old console to submit jobs and isolate clusters, deploy one in the new environment using EMR-CLI. This tool deploys a gateway on an existing ECS instance and automatically synchronizes cluster configurations. See Use EMR-CLI to deploy a gateway.
Step 2: Migrate and verify data
With the new cluster running, migrate metadata, data, and jobs from the old cluster.
Migrate metadata
Select the migration method based on the metadata management modes in the old and new clusters. DLF Unified Metadata is strongly recommended for the new cluster.
| Old metadata mode | New metadata mode | Migration method |
|---|---|---|
| DLF | DLF | No migration needed. Set the same DLF catalog on the new cluster. |
| Unified metabase | DLF | See Migration of EMR metadata. |
| Local MySQL | DLF | See Migrate metadata. |
| Self-managed ApsaraDB RDS | DLF | See Migrate metadata. |
Migrate data
Select the migration method based on the storage modes in the old and new clusters.
| Old storage | New storage | Migration method |
|---|---|---|
| OSS | OSS | No migration needed. |
| OSS | OSS-HDFS | Use Jindo DistCp. |
| JindoFS in block mode | OSS-HDFS | Use Jindo DistCp. |
| HDFS | OSS-HDFS | Use Jindo DistCp. |
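When the old cluster holds many datasets on different storage types, the table above can drive a simple copy plan. The following sketch is illustrative: the dataset names and function name are hypothetical, and storage pairs that the table does not list raise an error instead of guessing a method.

```python
# (old storage, new storage) -> migration tool, or None when no copy is needed.
# Transcribed from the table above; other pairs are intentionally not covered.
DATA_MIGRATION_METHOD = {
    ("OSS", "OSS"): None,
    ("OSS", "OSS-HDFS"): "Jindo DistCp",
    ("JindoFS in block mode", "OSS-HDFS"): "Jindo DistCp",
    ("HDFS", "OSS-HDFS"): "Jindo DistCp",
}


def plan_copies(datasets, new_storage):
    """Return (dataset name, tool) for every dataset that must be copied."""
    plan = []
    for name, old_storage in datasets.items():
        pair = (old_storage, new_storage)
        if pair not in DATA_MIGRATION_METHOD:
            raise ValueError(
                f"{name}: no documented path for {old_storage} -> {new_storage}"
            )
        tool = DATA_MIGRATION_METHOD[pair]
        if tool is not None:
            plan.append((name, tool))
    return plan


# Hypothetical inventory: dataset name -> storage used in the old cluster.
inventory = {
    "ods_orders": "HDFS",
    "dw_sales": "OSS",
    "tmp_logs": "JindoFS in block mode",
}
plan = plan_copies(inventory, "OSS-HDFS")
print(plan)
```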
Verify data
Skip this step if no data migration was required.
After migration, verify the correctness of HDFS data and data in Hive databases and tables. If you find inconsistencies, rerun the affected tasks or backfill the missing data immediately.
| Requirement | Verification method |
|---|---|
| File verification | Calculate checksum values before and after migration and compare them to confirm no changes or data corruption. |
| Rough data verification | Check table-level statistics — row counts, numeric column sums and averages, and minimum and maximum values — to quickly evaluate overall consistency. |
| Detailed data verification | Compare each row of data to confirm all records are intact after migration. |
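The file and rough verification methods can be sketched as two comparisons. This is a minimal illustration that works on in-memory values: how you collect checksums and table statistics from each cluster is up to you, the checksum must be computed with the same algorithm on both sides, and all names and sample values are hypothetical.

```python
import hashlib
import math


def checksum(data: bytes) -> str:
    """SHA-256 digest of a file's content, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def diff_manifests(before, after):
    """Return paths missing after migration and paths whose checksum changed."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed


def stats_mismatches(old, new, rel_tol=1e-9):
    """Return the names of table statistics that differ between the clusters."""
    bad = []
    for name, old_value in old.items():
        if name not in new or not math.isclose(old_value, new[name], rel_tol=rel_tol):
            bad.append(name)
    return sorted(bad)


# File verification: path -> checksum, collected before and after migration.
before = {"t1/part-0": checksum(b"a,1\n"), "t1/part-1": checksum(b"b,2\n")}
after = {"t1/part-0": checksum(b"a,1\n"), "t1/part-1": checksum(b"b,3\n")}
missing, changed = diff_manifests(before, after)
print(missing, changed)

# Rough data verification: table-level statistics from each cluster.
old_stats = {"row_count": 2, "sum_amount": 3.0, "min_amount": 1.0, "max_amount": 2.0}
new_stats = {"row_count": 2, "sum_amount": 4.0, "min_amount": 1.0, "max_amount": 3.0}
mismatched = stats_mismatches(old_stats, new_stats)
print(mismatched)
```

Matching statistics do not prove that every row is identical. They are a fast screen that tells you where detailed verification is worth the cost.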
Migrate jobs
Migrate jobs based on your scheduling environment:
-
EMR Data Platform (old console): Migrate to EMR Workflow. See Announcement on the migration of data from EMR Data Platform in the old EMR console.
-
DataWorks or a self-managed platform: Follow the migration documentation for your platform. Update key configurations such as the computing cluster endpoint to point to the new cluster.
Step 3: Run parallel verification
Before cutting over traffic, run jobs on both the old and new clusters simultaneously to verify data consistency and business accuracy. This is sometimes referred to as double-run verification.
The specific approach depends on your business architecture, data processing requirements, and risk tolerance. Design a parallel run plan that fits your scenario. The goal is to confirm the new cluster produces results identical to the old cluster before you commit to the switch.
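One way to compare the results of a parallel run is an order-independent, row-level diff of each job's output. The sketch below assumes the output of both clusters fits in memory as lists of tuples. The helper name and sample rows are hypothetical, and large tables would need the same comparison expressed as a distributed job.

```python
from collections import Counter


def compare_outputs(old_rows, new_rows):
    """Return rows found only in the old output and rows found only in the new one.

    Rows are compared as multisets, so ordering is ignored but duplicates count.
    """
    old_counts, new_counts = Counter(old_rows), Counter(new_rows)
    only_old = sorted((old_counts - new_counts).elements())
    only_new = sorted((new_counts - old_counts).elements())
    return only_old, only_new


# Hypothetical output of the same job on the old and new clusters.
old_rows = [("2024-01-01", "A", 10), ("2024-01-01", "B", 5)]
new_rows = [("2024-01-01", "B", 5), ("2024-01-01", "A", 11)]

only_old, only_new = compare_outputs(old_rows, new_rows)
print(only_old)
print(only_new)
```

Two empty lists mean the outputs match for that job. Anything else points to the exact rows to investigate before cutover.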
Step 4: Cut over and release the old cluster
After parallel verification confirms that the new cluster handles all workloads correctly, schedule a maintenance window and perform the cutover:
-
Gradually shift jobs from the old cluster to the new cluster.
-
Increase the job processing volume on the new cluster incrementally.
-
Monitor the new cluster until all jobs run stably.
Once all workloads run on the new cluster and none remain on the old cluster, release the old cluster. See Release a cluster.