Jindo DistCp is a distributed data copy tool built for E-MapReduce (EMR). It transfers files between Hadoop Distributed File System (HDFS) and OSS, and supports filtering, on-the-fly compression, small-file merging, incremental copy, and job-level transaction integrity.
Prerequisites
Before you begin, make sure you have:
- An EMR cluster of version 3.28.0 or later. For more information, see Create a cluster.
- SSH access to the master node of the cluster. For more information, see Log on to a cluster.
Quick start
Log on to the master node and run a basic copy:
jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp
Replace <yourBucketName> with the name of your OSS bucket. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.
To see all available options, run:
jindo distcp --help
Options reference
The following table lists all supported options. All examples in this document use /data/incoming/hourly_table as the source and oss://<yourBucketName>/hourly_table as the destination. Replace these placeholders with your actual values.
| Option | Description | Notes |
|---|---|---|
| --src | Source directory to copy files from | Required |
| --dest | Destination directory to copy files to | Required |
| --parallelism | Number of parallel reduce tasks. Sets mapreduce.job.reduces. Default: 7 | Tune based on available cluster resources |
| --srcPattern | Regular expression to filter source files by full path | — |
| --deleteOnSuccess | Delete source files after a successful copy | Cannot be used with --groupBy or --targetSize |
| --outputCodec | Compression codec for copied files. Valid values: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep | none decompresses already-compressed files; keep preserves the original format |
| --outputManifest | Name of the output manifest file (GZ format only) | Use with --requirePreviousManifest=false to create a new manifest |
| --requirePreviousManifest | Require a previous manifest to be present before running | Set to false when creating the first manifest |
| --previousManifest | Path to an existing manifest file | Use with --outputManifest for incremental copy tracking |
| --copyFromManifest | Copy only the files listed in a manifest file | Use with --previousManifest to replay a manifest |
| --srcPrefixesFile | Path to a text file with one source URI prefix per line | Enables copying from multiple source folders in a single job |
| --groupBy | Regular expression to group input files for merging | Use with --targetSize |
| --targetSize | Target size for merged output files, in MB | Use with --groupBy |
| --enableBalancePlan | Optimizes job allocation when file sizes within each tier (small/large) are relatively uniform | Cannot be used with --groupBy or --targetSize |
| --enableDynamicPlan | Dynamically allocates copy tasks for workloads where file sizes vary significantly | Cannot be used with --groupBy or --targetSize |
| --enableTransaction | Enables job-level transaction integrity | — |
| --diff | Compares source and destination file lists after a copy | If differences exist, generates a manifest file in the destination directory; does not return accurate size comparisons when compression or decompression was applied |
| --cleanUpPending | Removes residual files left by failed uploads, tracked by uploadId | — |
| --key | AccessKey ID for OSS access outside EMR | Use when AccessKey-free access is unavailable |
| --secret | AccessKey secret for OSS access outside EMR | Use when AccessKey-free access is unavailable |
| --endPoint | OSS endpoint | Use when AccessKey-free access is unavailable |
| --policy | OSS storage class for destination. Valid values: archive, ia | — |
Copy files
Use --src and --dest to specify the source and destination directories. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.
The following command copies all files from the /opt/tmp HDFS directory to an OSS bucket:
jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp
Filter files by pattern
Use --srcPattern to copy only files whose full path matches a regular expression.
The /data/incoming/hourly_table/2017-02-01/03 directory contains the following files:
Found 6 items
-rw-r----- 2 root hadoop 2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt
To copy only the .log files, set --srcPattern to .*\.log:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPattern .*\.log --parallelism 20
Verify the result:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03
Only the log files are copied:
Found 2 items
-rw-rw-rw- 1 4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log
-rw-rw-rw- 1 4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log
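The --srcPattern expression is matched against each file's full path, so a pattern can be sanity-checked locally before launching a job. The following Python sketch applies the same regular expression to the example paths above, assuming the pattern must match the entire path (the file list is hard-coded here for illustration):

```python
import re

# Example source paths from the listing above
paths = [
    "/data/incoming/hourly_table/2017-02-01/03/000151.sst",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
    "/data/incoming/hourly_table/2017-02-01/03/2.log",
    "/data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109",
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
]

pattern = re.compile(r".*\.log")
# fullmatch mirrors matching against the full path
matched = [p for p in paths if pattern.fullmatch(p)]
for p in matched:
    print(p)
```

Only the two .log paths survive the filter, matching the copy result shown above.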
Delete source files after copy
Add --deleteOnSuccess to remove source files from the source directory after the copy completes successfully:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --deleteOnSuccess --parallelism 20
Compress files during copy
Use --outputCodec to compress copied files on the fly. The following command compresses all files using gzip:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputCodec=gz --parallelism 20
Verify the result:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03
All files in the destination directory have the .gz extension:
Found 6 items
-rw-rw-rw- 1 938 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109.gz
-rw-rw-rw- 1 506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt.gz
-rw-rw-rw- 1 506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt.gz
Valid values for --outputCodec: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep.
- none: decompresses files if they are already compressed; leaves uncompressed files as-is.
- keep: preserves the original compression format.
To use the LZO codec in an open-source Hadoop cluster, install the native gplcompression library and the hadoop-lzo package.
Track copies with manifest files
A manifest file records the destination path, source path, and size for every file copied in a job. Use manifest files to track incremental copies and verify completeness.
Create a manifest file
Set --requirePreviousManifest=false to generate a new manifest file. The file is always stored in GZ format.
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-17.gz --requirePreviousManifest=false --parallelism 20
Inspect the manifest:
hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz > before.lst
cat before.lst
Each line in the manifest is a JSON record:
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst","baseName":"2017-02-01/03/000151.sst","srcDir":"oss://<yourBucketName>/hourly_table","size":2252}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log","baseName":"2017-02-01/03/1.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log","baseName":"2017-02-01/03/2.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109","baseName":"2017-02-01/03/OPTIONS-000109","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt","baseName":"2017-02-01/03/emp01.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt","baseName":"2017-02-01/03/emp06.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
Copy only new files using a previous manifest
When new files are added to the source directory, use --previousManifest to skip already-copied files and transfer only the new ones. The new manifest (--outputManifest) includes both previously and newly copied files, giving you a complete history.
The following command copies only files added since manifest-2020-04-17.gz:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-18.gz --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --parallelism 20
Compare the two manifests to see what was newly copied:
hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-18.gz > current.lst
diff before.lst current.lst
The diff output shows the two newly added files:
3a4,5
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/5.log","baseName":"2017-02-01/03/5.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/6.log","baseName":"2017-02-01/03/6.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
Copy files from a manifest
Use --copyFromManifest with --previousManifest to copy exactly the files listed in a previously generated manifest:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
Copy files from multiple source folders
--srcPrefixesFile accepts a text file with one source URI prefix per line, allowing a single job to copy from multiple folders.
List the source folders:
hdfs dfs -ls oss://<yourBucketName>/hourly_table
Found 4 items
drwxrwxrwx - 0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-01
drwxrwxrwx - 0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-02
Run the copy and specify the prefixes file:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPrefixesFile file:///opt/folders.txt --parallelism 20
View the folders.txt file to confirm its contents:
cat folders.txt
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-01
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-02
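The prefixes file is plain text with one fully qualified URI per line, so it can be produced with any scripting tool. A hypothetical helper that writes one prefix per date partition (the namenode address is taken from the listing above and is specific to that example cluster):

```python
# Write one fully qualified source prefix per line
base = "hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table"
dates = ["2017-02-01", "2017-02-02"]

with open("folders.txt", "w") as f:
    for d in dates:
        f.write(f"{base}/{d}\n")

with open("folders.txt") as f:
    content = f.read()
print(content)
```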
Merge small files
Reading many small files from HDFS degrades processing performance. Use --groupBy and --targetSize together to merge small files into larger output files. --groupBy is a regular expression that groups files; --targetSize sets the maximum size of each merged file in MB.
The source directory contains eight small files:
hdfs dfs -ls /data/incoming/hourly_table/2017-02-01/03
Found 8 items
-rw-r----- 2 root hadoop 2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r----- 2 root hadoop 4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/5.log
-rw-r----- 2 root hadoop 4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/6.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt
The following command merges all .txt files into a single output file of up to 10 MB:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --targetSize=10 --groupBy='.*/([a-z]+).*.txt' --parallelism 20
Verify the result. The two TXT files are merged into one:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03/
Found 1 items
-rw-rw-rw- 1 2032 2020-04-17 21:18 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp2
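The --groupBy pattern works through its capture group: files whose full paths match the expression are bucketed by the captured text, and each bucket is merged. The sketch below applies the example pattern locally to show which files land in which bucket; the bucketing semantics are inferred from the output above, so treat this as an approximation rather than Jindo DistCp's exact algorithm:

```python
import re
from collections import defaultdict

paths = [
    "/data/incoming/hourly_table/2017-02-01/03/000151.sst",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
]

# Same expression as in the command above
pattern = re.compile(r".*/([a-z]+).*.txt")
groups = defaultdict(list)
for p in paths:
    m = pattern.fullmatch(p)
    if m:
        # Files sharing a capture-group value are merged together
        groups[m.group(1)].append(p)
print(dict(groups))
```

Only the two .txt files match, and both capture the same text ("emp"), so they form a single group, consistent with the single merged output file above.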
Optimize copy performance
Jindo DistCp provides two job allocation strategies to improve performance when copying mixed file sizes. Choose based on your workload, then tune --parallelism to control concurrency.
| Strategy | When to use | Option |
|---|---|---|
| Balance plan | File sizes within the small-file group and within the large-file group are relatively uniform | --enableBalancePlan |
| Dynamic plan | File sizes vary significantly across files, with most files being small | --enableDynamicPlan |
If neither option is specified, files are allocated randomly across reduce tasks.
Neither --enableBalancePlan nor --enableDynamicPlan can be used together with --groupBy or --targetSize.
Balance plan:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableBalancePlan --parallelism 20
Dynamic plan:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableDynamicPlan --parallelism 20
Tune --parallelism to control the number of concurrent reduce tasks. The default is 7. Increase this value when the cluster has more available resources to improve copy throughput.
Enable transaction integrity
--enableTransaction enforces job-level transaction integrity. All files in the job either succeed or are rolled back together:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableTransaction --parallelism 20
Verify copy completeness
After a copy job completes, use --diff to compare the source and destination file lists:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --diff
If all files are successfully copied:
INFO distcp.JindoDistCp: distcp has been done completely
If any files are missing from the destination, Jindo DistCp generates a manifest file in the destination directory listing them. Use --copyFromManifest and --previousManifest to copy those files:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
If Jindo DistCp performed compression or decompression during the copy, --diff does not return accurate file size comparisons. If the destination is an HDFS directory, specify --dest in one of these formats: /path, hdfs://hostname:ip/path, or hdfs://headerIp:ip/path. Other formats such as hdfs:///path and hdfs:/path are not supported.
Check DistCp counters
After a job completes, check the DistCp Counters section in the MapReduce job output to verify how much data was transferred:
Distcp Counters
Bytes Destination Copied=11010048000
Bytes Source Read=11010048000
Files Copied=1001
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
If Jindo DistCp compressed or decompressed files during the copy, Bytes Destination Copied and Bytes Source Read may differ.
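The counters block is plain key=value text, so a post-run completeness check can be scripted. A sketch that parses the sample output above and confirms source and destination byte counts agree, which holds only when no compression or decompression was applied:

```python
# Sample counters copied from the job output above
counters_text = """\
Bytes Destination Copied=11010048000
Bytes Source Read=11010048000
Files Copied=1001
"""

counters = {}
for line in counters_text.splitlines():
    key, _, value = line.partition("=")
    counters[key] = int(value)

# Byte counts match only when no codec changed the files in flight
assert counters["Bytes Destination Copied"] == counters["Bytes Source Read"]
print(f"{counters['Files Copied']} files verified")
```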
Access OSS with an AccessKey pair
By default, EMR uses AccessKey-free access for OSS. If you run Jindo DistCp from outside EMR, or AccessKey-free access is unavailable, specify an AccessKey pair using --key, --secret, and --endPoint:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --key <yourAccessKeyId> --secret <yourAccessKeySecret> --endPoint oss-cn-hangzhou.aliyuncs.com --parallelism 20
Replace <yourAccessKeyId> with your Alibaba Cloud AccessKey ID and <yourAccessKeySecret> with your AccessKey secret.
Write to OSS Archive or Infrequent Access storage
Use --policy to write data directly to a lower-cost OSS storage class during the copy.
- Archive storage:
  jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy archive --parallelism 20
- Infrequent Access (IA) storage:
  jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy ia --parallelism 20
Clean up residual files
When a DistCp task fails partway through, partially uploaded files may remain in the destination. OSS tracks these files internally by uploadId, but they are invisible in directory listings. Add --cleanUpPending to automatically clean them up after the task completes:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --cleanUpPending --parallelism 20
Alternatively, delete residual files manually from the OSS console.
FAQ
Why does `--diff` report incorrect file sizes after compression?
When Jindo DistCp compresses or decompresses files during the copy, the source and destination file sizes differ by design. --diff compares raw file sizes and therefore reports a mismatch. This is expected behavior, not a copy failure. Check the DistCp Counters instead to confirm the number of files copied.
Can I use `--enableBalancePlan` or `--enableDynamicPlan` with `--groupBy`?
No. Both plan options are incompatible with --groupBy and --targetSize. When merging small files, file allocation is determined by the grouping pattern, so the plan options have no effect and are not allowed in the same command.
Can I use `--deleteOnSuccess` with `--groupBy`?
No. --deleteOnSuccess cannot be used with --groupBy or --targetSize. After merging, the output is a new merged file; deleting the original inputs in the same job is not supported.
Why is the destination HDFS path format restricted for `--diff`?
When the destination is an HDFS directory, Jindo DistCp requires --dest to be in the format /path, hdfs://hostname:ip/path, or hdfs://headerIp:ip/path. Formats such as hdfs:///path and hdfs:/path are not supported and will cause the diff operation to fail.