E-MapReduce: Migrate data from a Hadoop file system to JindoFS

Last Updated: Mar 26, 2026

This topic describes how to migrate data from Hadoop Distributed File System (HDFS) or Object Storage Service (OSS) to JindoFileSystem (JindoFS), which stores its data in OSS.

Prerequisites

Before you begin, make sure you have:

  • An EMR cluster with JindoFS configured

  • Access to the source data in HDFS or OSS

  • Sufficient permissions to read from the source path and write to the JindoFS target path
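
The checks above can be scripted before a migration starts. The following is a minimal sketch, not an official tool: preflight_check is a hypothetical helper name, and it assumes the hadoop command is on the PATH and that creating and deleting an empty probe file is an acceptable way to test write permission on the target.

```shell
# Pre-flight check (illustrative sketch; preflight_check is a hypothetical helper).
# Arguments: source path, target directory. Both are placeholders.
preflight_check() {
  local src="$1" dst="$2"
  # Name of a temporary, empty probe file under the target directory.
  local probe="${dst%/}/.migration_probe_$$"

  # "hadoop fs -test -e" exits with 0 when the path exists and is reachable.
  if ! hadoop fs -test -e "$src"; then
    echo "source not readable: $src" >&2
    return 1
  fi

  # Create a zero-length file to confirm write permission on the target.
  if ! hadoop fs -touchz "$probe"; then
    echo "target not writable: $dst" >&2
    return 2
  fi

  # Remove the probe file again, bypassing the trash.
  hadoop fs -rm -skipTrash "$probe" >/dev/null
  echo "ok"
}

# Example call with the paths used in this topic:
# preflight_check hdfs://emr-cluster/files jfs://emr-jfs/output
```

The function prints ok and returns 0 when both checks pass. A return value of 1 indicates a problem with the source path, and 2 indicates a problem with the target path.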

Choose a migration method

Select a method based on the amount of data to migrate:

Scenario                         Method             Command
Small datasets or a few files    Hadoop FS shell    hadoop fs -cp
Large datasets                   Hadoop DistCp      hadoop distcp

Use DistCp for large-scale migrations. DistCp runs as a MapReduce job and splits the copy across multiple mappers that run in parallel, which makes it significantly faster than FS shell commands for large volumes of data. It also provides built-in error handling and recovery.

Migrate data with Hadoop FS shell commands

For small amounts of data, use Hadoop FS shell commands to copy files directly.

Copy from HDFS to JindoFS:

hadoop fs -cp hdfs://emr-cluster/README.md jfs://emr-jfs/

Copy from OSS to JindoFS:

hadoop fs -cp oss://oss_bucket/README.md jfs://emr-jfs/
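
When several files need to be copied, the single-file commands can be wrapped in a loop. The sketch below is illustrative only: copy_small_files is a hypothetical helper name, and it uses the generic hadoop fs entry point with the -f option, which overwrites a destination file that already exists. Drop -f if existing files must not be replaced.

```shell
# Copy several small files into one target directory (illustrative sketch).
# copy_small_files is a hypothetical helper, not part of Hadoop or JindoFS.
# Arguments: target directory, followed by one or more source paths.
copy_small_files() {
  local dst="$1"
  shift
  local src
  local failed=0

  for src in "$@"; do
    # -f overwrites the destination if it already exists.
    if hadoop fs -cp -f "$src" "$dst"; then
      echo "copied: $src"
    else
      echo "failed: $src" >&2
      failed=$((failed + 1))
    fi
  done

  # Succeed only if every copy succeeded.
  [ "$failed" -eq 0 ]
}

# Example call with the paths used in this topic:
# copy_small_files jfs://emr-jfs/ hdfs://emr-cluster/README.md oss://oss_bucket/README.md
```

The loop continues after a failed copy, so one unreadable file does not block the rest. The function returns a nonzero status if any copy failed.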

Migrate data with DistCp

For large datasets, use DistCp to run a distributed copy job.

Copy a directory from HDFS to JindoFS:

hadoop distcp hdfs://emr-cluster/files jfs://emr-jfs/output/

Copy a directory from OSS to JindoFS:

hadoop distcp oss://oss_bucket/files jfs://emr-jfs/output/
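
Large copy jobs are often rerun after an interruption, so a wrapper that adds a few standard DistCp options can be convenient. The sketch below is one possible combination, not a recommended configuration: run_distcp is a hypothetical helper name, the default of 20 mappers is an arbitrary placeholder, and the use of -skipcrccheck assumes that file checksums cannot be compared between the source file system and JindoFS. Verify that assumption for your cluster before relying on it.

```shell
# Rerunnable DistCp wrapper (illustrative sketch; run_distcp is a hypothetical helper).
# Arguments: source path, target path, optional maximum number of mappers.
run_distcp() {
  local src="$1" dst="$2"
  local maps="${3:-20}"   # placeholder default, tune for your cluster

  # -update        copy only files that are missing or differ at the target,
  #                so the job can be rerun after a failure
  # -skipcrccheck  skip checksum comparison (valid only together with -update);
  #                ASSUMPTION: source and target checksums are not comparable
  # -m             upper limit on the number of simultaneous map tasks
  hadoop distcp -update -skipcrccheck -m "$maps" "$src" "$dst"
}

# Example calls with the paths used in this topic:
# run_distcp hdfs://emr-cluster/files jfs://emr-jfs/output/
# run_distcp oss://oss_bucket/files jfs://emr-jfs/output/ 50
```

Note that -update changes where files land: the contents of the source directory are copied directly into the target directory, whereas a plain hadoop distcp into an existing target creates the source directory beneath it.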

For all available parameters, see the DistCp Version2 Guide.

Use the cache mode

After migrating data to JindoFS, consider enabling cache mode for frequently accessed data. In cache mode, JindoFS stores files as ordinary objects in OSS and leaves their data and metadata unchanged. When you access these objects, JindoFS caches the data and metadata on the local cluster so that subsequent reads are faster.

For details, see Use the cache mode.

What's next

  • Verify the migration by listing files in the JindoFS target path

  • Configure cache mode to accelerate access to frequently read data
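
Beyond listing files with hadoop fs -ls -R, the verification step can be sketched as a comparison of file counts and total sizes. This is a coarse check under stated assumptions: path_summary and verify_migration are hypothetical helper names, hadoop fs -count is assumed to work against the JindoFS path in the same way as against HDFS, and matching counts and sizes do not prove that file contents are identical.

```shell
# Compare file count and total size of source and target (illustrative sketch).
# path_summary and verify_migration are hypothetical helpers.

# Print "FILE_COUNT CONTENT_SIZE" for a path.
# "hadoop fs -count" prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
path_summary() {
  hadoop fs -count "$1" | awk '{print $2, $3}'
}

# Arguments: source path, migrated target path.
verify_migration() {
  local src_sum dst_sum
  src_sum="$(path_summary "$1")"
  dst_sum="$(path_summary "$2")"

  # An empty summary means the count command produced no output.
  if [ -n "$src_sum" ] && [ "$src_sum" = "$dst_sum" ]; then
    echo "match: $src_sum"
  else
    echo "mismatch: source=[$src_sum] target=[$dst_sum]" >&2
    return 1
  fi
}

# Example call, assuming the directory was copied to jfs://emr-jfs/output/files:
# verify_migration hdfs://emr-cluster/files jfs://emr-jfs/output/files
```

Pass the directory that actually holds the migrated files as the second argument, because the target layout depends on the copy options that were used.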