Use Jindo DistCp to migrate data from Hadoop Distributed File System (HDFS) to OSS-HDFS — Alibaba Cloud's HDFS-compatible interface for Object Storage Service (OSS). Jindo DistCp runs as a MapReduce job, distributing file transfers across your cluster and supporting both full and incremental migrations.
Jindo DistCp supports copy operations between directories in HDFS, between HDFS and OSS, between HDFS and OSS-HDFS, and between buckets in OSS-HDFS. It is 1.59 times faster than Hadoop DistCp, copies files without changing file names to ensure data consistency, and supports Hadoop 2.7.x and Hadoop 3.x. Deep OSS integration allows you to compress data and convert the storage class to Archive during migration.
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud E-MapReduce (EMR) cluster running EMR-5.6.0 or later, or EMR-3.40.0 or later. See Create a cluster
(Self-managed ECS clusters only) A Hadoop 2.7.0 or later (or Hadoop 3.x) environment that can run MapReduce jobs, with JindoData deployed. JindoData includes JindoSDK and JindoFSx. Download the latest version
OSS-HDFS enabled on your destination bucket, with access permissions configured. See Enable OSS-HDFS
Migrate data from HDFS to OSS-HDFS
Step 1: Log in to the EMR cluster
Log in to the EMR console. In the left-side navigation pane, click EMR on ECS.
Click your EMR cluster.
On the Nodes tab, click the icon next to the node group to expand it.
Click the ECS instance ID, then click Connect on the Instances page.
For SSH login instructions, see Log on to a cluster.
Step 2: Verify source access
List the HDFS root directory to verify source access:
hdfs dfs -ls /
Expected output:
Found 8 items
drwxrwxrwx - admin supergroup 0 2023-10-26 10:55 /.sysinfo
drwxrwxrwx - hadoop supergroup 0 2023-10-26 10:55 /apps
drwxrwxrwx - root supergroup 0 2022-08-03 15:54 /data
-rw-r----- 1 root supergroup 13 2022-08-25 11:45 /examplefile.txt
drwxrwxrwx - spark supergroup 0 2023-10-26 14:49 /spark-history
drwx-wx-wx - hive supergroup 0 2023-10-26 13:35 /tmp
drwxrwxrwx - hive supergroup 0 2023-10-26 14:48 /user
drwxrwxrwx - hadoop supergroup 0 2023-10-26 14:48 /yarn
Step 3: Navigate to the Jindo DistCp tools directory
cd /opt/apps/JINDOSDK/jindosdk-current/tools
Run ls to confirm the jar file is present and note the exact version:
ls
Step 4: Run the migration
All migration commands use hadoop jar to submit a MapReduce job. Replace jindo-distcp-tool-6.1.0.jar with the version found in Step 3.
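Because the jar version differs between JindoSDK releases, one option is to resolve the file name once and reuse it in later commands. The following is a minimal sketch, not part of Jindo DistCp: the function name `find_distcp_jar` is hypothetical, and it assumes the jar follows the `jindo-distcp-tool-<version>.jar` naming pattern and that your `sort` supports version ordering with `-V` (as GNU coreutils does).

```shell
# Hypothetical helper: print the path of the highest-versioned
# jindo-distcp-tool jar found directly inside the given directory.
find_distcp_jar() {
  local dir="${1:-.}"
  local jar
  # List matching jars, order them by version, and keep the last (highest) one.
  jar=$(find "$dir" -maxdepth 1 -type f -name 'jindo-distcp-tool-*.jar' | sort -V | tail -n 1)
  if [ -z "$jar" ]; then
    echo "No jindo-distcp-tool jar found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$jar"
}
```

For example, `JAR=$(find_distcp_jar /opt/apps/JINDOSDK/jindosdk-current/tools)` stores the path, and you can then pass `"$JAR"` to `hadoop jar` in place of the hard-coded file name.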
Full data migration
Copy all data from a source directory in HDFS to a destination path in OSS-HDFS:
hadoop jar jindo-distcp-tool-6.1.0.jar \
--src /tmp/ \
--dest oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ \
--hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
--hadoopConf fs.oss.accessKeySecret=<your-access-key-secret> \
--parallelism 10
Replace the placeholders with your actual values:
| Placeholder | Description |
|---|---|
| `<your-access-key-id>` | AccessKey ID used to access OSS-HDFS |
| `<your-access-key-secret>` | AccessKey secret used to access OSS-HDFS |
Parameters:
| Parameter | Description | Example |
|---|---|---|
| `--src` | Source path in HDFS | `/tmp/` |
| `--dest` | Destination path in OSS-HDFS | `oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/` |
| `--hadoopConf` | A Hadoop configuration entry in `key=value` form. Here it passes the AccessKey pair (AccessKey ID and AccessKey secret) used to access OSS-HDFS. | `fs.oss.accessKeyId=LTAI************************` |
| `--parallelism` | Number of data copy threads that run concurrently. Set it based on the resources available in your cluster. | `10` |
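Typing the AccessKey secret as a literal on the command line can leave it in your shell history. As a hedged sketch, the wrapper below reads the AccessKey pair from environment variables and assembles the same full-migration command. The function name `run_distcp` and the variables `OSS_ACCESS_KEY_ID`, `OSS_ACCESS_KEY_SECRET`, `JINDO_DISTCP_JAR`, `DISTCP_PARALLELISM`, and `DRY_RUN` are hypothetical names chosen for illustration, not Jindo DistCp features. Only the options listed in the table are used.

```shell
# Hypothetical wrapper around the full-migration command.
# Usage: run_distcp <src> <dest> [extra Jindo DistCp options...]
run_distcp() {
  if [ "$#" -lt 2 ]; then
    echo "usage: run_distcp <src> <dest> [options...]" >&2
    return 2
  fi
  local src="$1" dest="$2"
  shift 2

  # Fail early if the credentials are missing (variable names are assumptions).
  if [ -z "${OSS_ACCESS_KEY_ID:-}" ] || [ -z "${OSS_ACCESS_KEY_SECRET:-}" ]; then
    echo "OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET must be set" >&2
    return 1
  fi

  # Defaults mirror the example values used in this topic.
  local jar="${JINDO_DISTCP_JAR:-jindo-distcp-tool-6.1.0.jar}"
  local parallelism="${DISTCP_PARALLELISM:-10}"

  # With DRY_RUN=1, print the command with the secret masked instead of running it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    local extra=""
    if [ "$#" -gt 0 ]; then
      extra=" $*"
    fi
    echo "hadoop jar $jar --src $src --dest $dest --hadoopConf fs.oss.accessKeyId=$OSS_ACCESS_KEY_ID --hadoopConf fs.oss.accessKeySecret=*** --parallelism $parallelism$extra"
    return 0
  fi

  # Any extra options are appended unchanged after the standard ones.
  hadoop jar "$jar" \
    --src "$src" \
    --dest "$dest" \
    --hadoopConf "fs.oss.accessKeyId=${OSS_ACCESS_KEY_ID}" \
    --hadoopConf "fs.oss.accessKeySecret=${OSS_ACCESS_KEY_SECRET}" \
    --parallelism "$parallelism" \
    "$@"
}
```

After exporting the two credential variables, `DRY_RUN=1 run_distcp /tmp/ oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/` prints the command for review, and the same call without `DRY_RUN` submits the job. Note that the secret is still passed to `hadoop jar` as a process argument, so this sketch only keeps it out of typed commands.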
Incremental data migration
After a full migration, use --update to copy only incremental data from the source directory:
hadoop jar jindo-distcp-tool-6.1.0.jar \
--src /data/ \
--dest oss://destbucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ \
--hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
--hadoopConf fs.oss.accessKeySecret=<your-access-key-secret> \
--update \
--parallelism 10
What's next
To avoid specifying the endpoint and AccessKey pair in every command, pre-configure them in core-site.xml. See Connect non-EMR clusters to OSS-HDFS.
For additional Jindo DistCp options, including compression, storage class conversion, and more copy policies, see Use Jindo DistCp.