Use Jindo DistCp to migrate data from HDFS to OSS-HDFS

Use Jindo DistCp to migrate data from Hadoop Distributed File System (HDFS) to OSS-HDFS — Alibaba Cloud's HDFS-compatible interface for Object Storage Service (OSS). Jindo DistCp runs as a MapReduce job, distributing file transfers across your cluster and supporting both full and incremental migrations.

Jindo DistCp supports copy operations between directories in HDFS, between HDFS and OSS, between HDFS and OSS-HDFS, and between buckets in OSS-HDFS. It is 1.59 times faster than Hadoop DistCp, copies files without changing file names to ensure data consistency, and supports Hadoop 2.7.x and Hadoop 3.x. Deep OSS integration allows you to compress data and convert the storage class to Archive during migration.

Prerequisites

Before you begin, ensure that you have:

An Alibaba Cloud E-MapReduce (EMR) cluster running EMR-5.6.0 or later, or EMR-3.40.0 or later. See Create a cluster.
(Self-managed ECS clusters only) An environment running Hadoop 2.7.0 or later, or Hadoop 3.x, that can run MapReduce jobs and has JindoData deployed. JindoData includes JindoSDK and JindoFSx. Download the latest version.
OSS-HDFS enabled on your destination bucket, with access permissions configured. See Enable OSS-HDFS.

Migrate data from HDFS to OSS-HDFS

Step 1: Log in to the EMR cluster

Log in to the EMR console. In the left-side navigation pane, click EMR on ECS.
Click your EMR cluster.
On the Nodes tab, click the icon next to the node group to expand it.
Click the ECS instance ID, then click Connect on the Instances page.

For SSH login instructions, see Log on to a cluster.

Step 2: Verify source access

List the HDFS root directory to verify source access:

hdfs dfs -ls /

Expected output:

Found 8 items
drwxrwxrwx   - admin  supergroup          0 2023-10-26 10:55 /.sysinfo
drwxrwxrwx   - hadoop supergroup          0 2023-10-26 10:55 /apps
drwxrwxrwx   - root   supergroup          0 2022-08-03 15:54 /data
-rw-r-----   1 root   supergroup         13 2022-08-25 11:45 /examplefile.txt
drwxrwxrwx   - spark  supergroup          0 2023-10-26 14:49 /spark-history
drwx-wx-wx   - hive   supergroup          0 2023-10-26 13:35 /tmp
drwxrwxrwx   - hive   supergroup          0 2023-10-26 14:48 /user
drwxrwxrwx   - hadoop supergroup          0 2023-10-26 14:48 /yarn
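
If you want a scripted check rather than reading the listing by eye, the following is a minimal sketch that tests whether a source directory exists before you start a migration. It relies on the standard `hdfs dfs -test -d` command; the function name `check_source` is a hypothetical helper, not part of Jindo DistCp.

```shell
# check_source is a hypothetical helper for illustration only.
# It succeeds when the given HDFS path exists and is a directory.
check_source() {
  if hdfs dfs -test -d "$1"; then
    echo "source directory is accessible: $1"
  else
    echo "source directory is missing or not accessible: $1" >&2
    return 1
  fi
}

# Example: check_source /tmp/
```

The function returns a non-zero status on failure, so it can gate a later migration command with `&&`.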

Step 3: Navigate to the Jindo DistCp tools directory

cd /opt/apps/JINDOSDK/jindosdk-current/tools

Run ls to confirm the jar file is present and note the exact version:

ls
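
To avoid retyping the version in later commands, you could capture the jar path in a variable. The sketch below assumes the jar follows the `jindo-distcp-tool-<version>.jar` naming used in this topic; `find_distcp_jar` is a hypothetical helper name.

```shell
# find_distcp_jar is a hypothetical helper for illustration only.
# It prints the first Jindo DistCp jar found in a directory, or fails if none exists.
# The default directory is the tools path from Step 3.
find_distcp_jar() {
  dir="${1:-/opt/apps/JINDOSDK/jindosdk-current/tools}"
  for jar in "$dir"/jindo-distcp-tool-*.jar; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no jindo-distcp-tool jar found in $dir" >&2
  return 1
}

# Example: JAR=$(find_distcp_jar) && echo "Using $JAR"
```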

Step 4: Run the migration

All migration commands use hadoop jar to submit a MapReduce job. Replace jindo-distcp-tool-6.1.0.jar with the version found in Step 3.

Full data migration

Copy all data from a source directory in HDFS to a destination path in OSS-HDFS:

hadoop jar jindo-distcp-tool-6.1.0.jar \
  --src /tmp/ \
  --dest oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ \
  --hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
  --hadoopConf fs.oss.accessKeySecret=<your-access-key-secret> \
  --parallelism 10

Replace the placeholders with your actual values:

Placeholder	Description
`<your-access-key-id>`	AccessKey ID used to access OSS-HDFS
`<your-access-key-secret>`	AccessKey secret used to access OSS-HDFS

Parameters:

Parameter	Description	Example
`--src`	Source path in HDFS	`/tmp/`
`--dest`	Destination path in OSS-HDFS	`oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/`
`--hadoopConf`	A Hadoop configuration item in `key=value` form. In this command it supplies the AccessKey pair (AccessKey ID and AccessKey secret) used to access OSS-HDFS.	`fs.oss.accessKeyId=LTAI************************`
`--parallelism`	The number of copy operations that run concurrently. Set this value based on the resources available in your cluster.	`10`
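
Typing the AccessKey pair inline leaves it in your shell history. As a minimal sketch, the wrapper below reads the pair from environment variables and assembles the same command shown above. The names `run_distcp`, `OSS_ACCESS_KEY_ID`, `OSS_ACCESS_KEY_SECRET`, and `HADOOP_BIN` are assumptions made for this example; only the `hadoop jar` options come from this topic.

```shell
# run_distcp is a hypothetical wrapper for illustration only.
# OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are assumed variable names.
# HADOOP_BIN (assumed) lets you substitute another command, such as echo, for a dry run.
run_distcp() {
  if [ "$#" -lt 3 ]; then
    echo "usage: run_distcp <jar> <src> <dest> [extra options...]" >&2
    return 2
  fi
  if [ -z "${OSS_ACCESS_KEY_ID:-}" ] || [ -z "${OSS_ACCESS_KEY_SECRET:-}" ]; then
    echo "set OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET first" >&2
    return 1
  fi
  jar="$1"; src="$2"; dest="$3"
  shift 3
  # Any remaining arguments (for example --parallelism 10) are passed through unchanged.
  "${HADOOP_BIN:-hadoop}" jar "$jar" \
    --src "$src" \
    --dest "$dest" \
    --hadoopConf "fs.oss.accessKeyId=${OSS_ACCESS_KEY_ID}" \
    --hadoopConf "fs.oss.accessKeySecret=${OSS_ACCESS_KEY_SECRET}" \
    "$@"
}

# Example:
# run_distcp jindo-distcp-tool-6.1.0.jar /tmp/ \
#   oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ --parallelism 10
```

The secret still appears in the arguments of the running process, so treat this as a convenience rather than a security control.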

Incremental data migration

After a full migration, use --update to copy only incremental data from the source directory:

hadoop jar jindo-distcp-tool-6.1.0.jar \
  --src /data/ \
  --dest oss://destbucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ \
  --hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
  --hadoopConf fs.oss.accessKeySecret=<your-access-key-secret> \
  --update \
  --parallelism 10
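
One rough way to judge whether another incremental pass is needed is to compare file counts and total sizes on both sides. The sketch below parses the output of the standard `hadoop fs -count` command, whose columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. It assumes the cluster can already read the `oss://` destination without inline credentials, and that data is copied uncompressed, since compression changes the sizes. `count_files`, `same_counts`, and `HADOOP_BIN` are hypothetical names.

```shell
# count_files and same_counts are hypothetical helpers for illustration only.
# HADOOP_BIN (assumed) lets you substitute another command for testing.

# Print "FILE_COUNT CONTENT_SIZE" for a path, taken from columns 2 and 3
# of the `hadoop fs -count` output.
count_files() {
  "${HADOOP_BIN:-hadoop}" fs -count "$1" | awk '{print $2, $3}'
}

# Return 0 only when both paths report the same file count and total size.
# An empty result (for example, a failed listing) is treated as a mismatch.
same_counts() {
  src_counts=$(count_files "$1")
  dest_counts=$(count_files "$2")
  [ -n "$src_counts" ] && [ "$src_counts" = "$dest_counts" ]
}

# Example:
# same_counts /data/ oss://destbucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ \
#   || echo "counts differ; consider another --update pass"
```

Matching counts do not prove the contents are identical, so use this only as a quick sanity check.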

What's next

To avoid specifying the endpoint and AccessKey pair in every command, pre-configure them in core-site.xml. See Connect non-EMR clusters to OSS-HDFS.
For additional Jindo DistCp options, including compression, storage class conversion, and more copy policies, see Use Jindo DistCp.