Migrate data from Hadoop Distributed File System (HDFS) to JindoFileSystem (JindoFS), which stores data in Object Storage Service (OSS).
Prerequisites
Before you begin, make sure you have:
- An EMR cluster with JindoFS configured
- Access to the source data in HDFS or OSS
- Sufficient permissions to read from the source path and write to the JindoFS target path
Choose a migration method
Select a method based on the amount of data to migrate:
| Scenario | Method | Command |
|---|---|---|
| Small datasets or a few files | Hadoop FS shell | hadoop fs -cp |
| Large datasets | Hadoop DistCp | hadoop distcp |
Use DistCp for large-scale migrations. DistCp runs as a MapReduce job and distributes the copy work across multiple mappers that run in parallel. For large volumes of data, this is significantly faster than shell commands, which copy files from a single client process. DistCp also provides built-in error handling and job recovery.
Migrate data with Hadoop FS shell commands
For small amounts of data, use Hadoop FS shell commands to copy files directly.
Copy from HDFS to JindoFS:
hadoop fs -cp hdfs://emr-cluster/README.md jfs://emr-jfs/
Copy from OSS to JindoFS:
hadoop fs -cp oss://oss_bucket/README.md jfs://emr-jfs/
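When a handful of files from different sources need to be copied, the single-file copy can be wrapped in a small loop. The following is a minimal sketch, not an official procedure: `copy_all` and the `HADOOP_BIN` override are hypothetical names introduced here, and the target is the placeholder path used in this topic. It relies on `hadoop fs -cp` returning a non-zero exit status on failure and on the `-f` option, which overwrites an existing destination file.

```shell
# Hypothetical helper: copy each source path to the JindoFS target
# and report the ones that fail.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"   # override hook, assumed for this sketch
TARGET="jfs://emr-jfs/"              # placeholder target from this topic

copy_all() {
  local src failed=0
  for src in "$@"; do
    # -f overwrites a file that already exists at the destination.
    if ! "$HADOOP_BIN" fs -cp -f "$src" "$TARGET"; then
      echo "copy failed: $src" >&2
      failed=1
    fi
  done
  # Non-zero if at least one copy failed.
  return "$failed"
}
```

For example, `copy_all hdfs://emr-cluster/README.md oss://oss_bucket/README.md` copies both sample files and returns a non-zero status if either copy fails.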
Migrate data with DistCp
For large datasets, use DistCp to run a distributed copy job.
Copy a directory from HDFS to JindoFS:
hadoop distcp hdfs://emr-cluster/files jfs://emr-jfs/output/
Copy a directory from OSS to JindoFS:
hadoop distcp oss://oss_bucket/files jfs://emr-jfs/output/
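The basic commands can be extended with standard DistCp options. The sketch below uses `-m` to cap the number of simultaneous map tasks and `-update` to copy only files that are missing or different at the target, which makes a rerun after an interruption cheaper. The mapper count of 20, the `DRY_RUN` variable, and the `run` helper are assumptions made for illustration. By default the script only prints the command, so it can be reviewed before a job is submitted.

```shell
SRC="hdfs://emr-cluster/files"
DST="jfs://emr-jfs/output/"
MAPPERS=20   # assumed value; tune it to the capacity of the cluster

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# -update: copy only files that are missing or differ at the target.
# -m:      maximum number of map tasks that copy in parallel.
run hadoop distcp -update -m "$MAPPERS" "$SRC" "$DST"
```

Set `DRY_RUN=0` in the environment to submit the job. Note that with `-update`, DistCp copies the contents of the source directory into the target directory rather than creating the source directory under it, so check the resulting layout on a small directory first.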
For all available parameters, see the DistCp Version2 Guide.
Use the cache mode
After migrating data to JindoFS, consider enabling cache mode for frequently accessed data. In cache mode, JindoFS stores files as objects in OSS in their original form, leaving both the data and the metadata unchanged. When you access these OSS objects, JindoFS caches their data and metadata on the local cluster, so subsequent reads are faster.
For details, see Use the cache mode.
What's next
- Verify the migration by listing files in the JindoFS target path
- Configure cache mode to accelerate access to frequently read data
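One way to verify the migration is to list the target path and compare the file count and total size reported by `hadoop fs -count` for the source and the target. The following is a sketch under assumptions: `same_files_and_bytes` is a hypothetical helper, the paths are the placeholders used in this topic, and the target path may need adjusting if DistCp created a subdirectory under it. Directory counts are ignored on purpose because the directory layout can differ between the two sides.

```shell
SRC="hdfs://emr-cluster/files"
DST="jfs://emr-jfs/output/"   # adjust if the data landed in a subdirectory

# `hadoop fs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# Succeeds when two such lines agree on file count and total bytes.
same_files_and_bytes() {
  local a b
  a=$(echo "$1" | awk 'NF >= 3 {print $2, $3}')
  b=$(echo "$2" | awk 'NF >= 3 {print $2, $3}')
  # An empty or malformed line never counts as a match.
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Run the checks only on a machine where the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$DST"
  if same_files_and_bytes "$(hadoop fs -count "$SRC")" "$(hadoop fs -count "$DST")"; then
    echo "file count and total size match"
  else
    echo "mismatch between $SRC and $DST" >&2
  fi
fi
```

Matching counts and sizes are a quick sanity check, not proof that every file is identical.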