E-MapReduce:Jindo DistCp instructions

Last Updated: Mar 26, 2026

Jindo DistCp is a distributed data copy tool built for E-MapReduce (EMR). It transfers files between Hadoop Distributed File System (HDFS) and Object Storage Service (OSS), and supports filtering, on-the-fly compression, small-file merging, incremental copy, and job-level transaction integrity.

Prerequisites

Before you begin, make sure you have:

  • An EMR cluster of version 3.28.0 or later. For more information, see Create a cluster.

  • SSH access to the master node of the cluster. For more information, see Log on to a cluster.

Quick start

Log on to the master node and run a basic copy:

jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp

Replace <yourBucketName> with the name of your OSS bucket. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.

To see all available options, run:

jindo distcp --help

Options reference

The following table lists all supported options. All examples in this document use /data/incoming/hourly_table as the source and oss://<yourBucketName>/hourly_table as the destination. Replace these placeholders with your actual values.

  • --src: Source directory to copy files from. Required.

  • --dest: Destination directory to copy files to. Required.

  • --parallelism: Number of parallel reduce tasks; sets mapreduce.job.reduces. Default: 7. Tune based on available cluster resources.

  • --srcPattern: Regular expression to filter source files by full path.

  • --deleteOnSuccess: Delete source files after a successful copy. Cannot be used with --groupBy or --targetSize.

  • --outputCodec: Compression codec for copied files. Valid values: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep. none decompresses already-compressed files; keep preserves the original format.

  • --outputManifest: Name of the output manifest file (GZ format only). Use with --requirePreviousManifest=false to create a new manifest.

  • --requirePreviousManifest: Require a previous manifest to be present before running. Set to false when creating the first manifest.

  • --previousManifest: Path to an existing manifest file. Use with --outputManifest for incremental copy tracking.

  • --copyFromManifest: Copy only the files listed in a manifest file. Use with --previousManifest to replay a manifest.

  • --srcPrefixesFile: Path to a text file with one source URI prefix per line. Enables copying from multiple source folders in a single job.

  • --groupBy: Regular expression to group input files for merging. Use with --targetSize.

  • --targetSize: Target size for merged output files, in MB. Use with --groupBy.

  • --enableBalancePlan: Optimizes job allocation when file sizes within each tier (small/large) are relatively uniform. Cannot be used with --groupBy or --targetSize.

  • --enableDynamicPlan: Dynamically allocates copy tasks for workloads where file sizes vary significantly. Cannot be used with --groupBy or --targetSize.

  • --enableTransaction: Enables job-level transaction integrity.

  • --diff: Compares source and destination file lists after a copy. If differences exist, generates a manifest file in the destination directory. Does not return accurate size comparisons when compression or decompression was applied.

  • --cleanUpPending: Removes residual files left by failed uploads, tracked by uploadId.

  • --key: AccessKey ID for OSS access outside EMR. Use when AccessKey-free access is unavailable.

  • --secret: AccessKey secret for OSS access outside EMR. Use when AccessKey-free access is unavailable.

  • --endPoint: OSS endpoint. Use when AccessKey-free access is unavailable.

  • --policy: OSS storage class for destination. Valid values: archive, ia.

Copy files

Use --src and --dest to specify the source and destination directories. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.

The following command copies all files from the /opt/tmp HDFS directory to an OSS bucket:

jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp

Filter files by pattern

Use --srcPattern to copy only files whose full path matches a regular expression.

The /data/incoming/hourly_table/2017-02-01/03 directory contains the following files:

Found 6 items
-rw-r-----   2 root hadoop       2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r-----   2 root hadoop       4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r-----   2 root hadoop       4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r-----   2 root hadoop       4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r-----   2 root hadoop       1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r-----   2 root hadoop       1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt

To copy only the .log files, set --srcPattern to .*\.log:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPattern '.*\.log' --parallelism 20

Verify the result:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03

Only the log files are copied:

Found 2 items
-rw-rw-rw-   1       4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log
-rw-rw-rw-   1       4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log
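Jindo DistCp applies the pattern to the full source path using Java regular expressions; the Python sketch below (file names taken from the listing above) approximates that full-match behavior:

```python
import re

# Files from the example directory above.
paths = [
    "/data/incoming/hourly_table/2017-02-01/03/000151.sst",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
    "/data/incoming/hourly_table/2017-02-01/03/2.log",
    "/data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109",
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
]

# --srcPattern is matched against the whole path, so the pattern must
# cover the entire string, not just the file name.
pattern = re.compile(r".*\.log")
selected = [p for p in paths if pattern.fullmatch(p)]
print(selected)  # only 1.log and 2.log match
```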

Delete source files after copy

Add --deleteOnSuccess to delete the source files after the copy completes successfully:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --deleteOnSuccess --parallelism 20

Compress files during copy

Use --outputCodec to compress copied files on the fly. The following command compresses all files using gzip:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputCodec=gz --parallelism 20

Verify the result:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03

All files in the destination directory have the .gz extension:

Found 6 items
-rw-rw-rw-   1        938 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt.gz

Valid values for --outputCodec: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep.

  • none: decompresses files if they are already compressed; leaves uncompressed files as-is.

  • keep: preserves the original compression format.

To use the LZO codec in an open-source Hadoop cluster, install the native gplcompression library and the hadoop-lzo package.
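To see why the copied .gz objects above are smaller than their sources, here is a standalone Python sketch of gzip compression (illustrative only; Jindo DistCp performs the compression itself during the copy):

```python
import gzip

# Repetitive text, like log lines, compresses well under gzip; this is
# why the *.gz objects in the destination listing are smaller than the
# source files.
data = b"2020-04-17 20:47 INFO some repetitive log line\n" * 100
compressed = gzip.compress(data)
print(len(data), len(compressed))  # compressed size is much smaller
```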

Track copies with manifest files

A manifest file records the destination path, source path, and size for every file copied in a job. Use manifest files to track incremental copies and verify completeness.

Create a manifest file

Set --requirePreviousManifest=false to generate a new manifest file. The file is always stored in GZ format.

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-17.gz --requirePreviousManifest=false --parallelism 20

Inspect the manifest:

hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz > before.lst
cat before.lst

Each line in the manifest is a JSON record:

{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst","baseName":"2017-02-01/03/000151.sst","srcDir":"oss://<yourBucketName>/hourly_table","size":2252}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log","baseName":"2017-02-01/03/1.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log","baseName":"2017-02-01/03/2.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109","baseName":"2017-02-01/03/OPTIONS-000109","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt","baseName":"2017-02-01/03/emp01.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt","baseName":"2017-02-01/03/emp06.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
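Because each line is a complete JSON record, manifests are easy to script against. A minimal Python sketch that totals the copied bytes from two of the records above (the bucket name mybucket is a placeholder):

```python
import json

# Two manifest records in the format shown above.
manifest_lines = [
    '{"path":"oss://mybucket/hourly_table/2017-02-01/03/1.log",'
    '"baseName":"2017-02-01/03/1.log",'
    '"srcDir":"oss://mybucket/hourly_table","size":4891}',
    '{"path":"oss://mybucket/hourly_table/2017-02-01/03/emp01.txt",'
    '"baseName":"2017-02-01/03/emp01.txt",'
    '"srcDir":"oss://mybucket/hourly_table","size":1016}',
]

# Parse one JSON object per line and total the copied bytes.
records = [json.loads(line) for line in manifest_lines]
total_bytes = sum(r["size"] for r in records)
print(total_bytes)  # 5907
```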

Copy only new files using a previous manifest

When new files are added to the source directory, use --previousManifest to skip already-copied files and transfer only the new ones. The new manifest (--outputManifest) includes both previously and newly copied files, giving you a complete history.

The following command copies only files added since manifest-2020-04-17.gz:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-18.gz --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --parallelism 20

Compare the two manifests to see what was newly copied:

hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-18.gz > current.lst
diff before.lst current.lst

The diff output shows the two newly added files:

3a4,5
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/5.log","baseName":"2017-02-01/03/5.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/6.log","baseName":"2017-02-01/03/6.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
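Instead of a raw text diff, the comparison can be done on the baseName field of each record, which is insensitive to record order. A minimal Python sketch using the file names from the example above:

```python
# baseName values drawn from the two manifests in the example above.
before = {"2017-02-01/03/1.log", "2017-02-01/03/2.log"}
current = before | {"2017-02-01/03/5.log", "2017-02-01/03/6.log"}

# Set difference yields exactly the files copied by the incremental run.
newly_copied = sorted(current - before)
print(newly_copied)  # the two newly added log files
```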

Copy files from a manifest

Use --copyFromManifest with --previousManifest to copy exactly the files listed in a previously generated manifest:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --copyFromManifest --parallelism 20

Copy files from multiple source folders

--srcPrefixesFile accepts a text file with one source URI prefix per line, allowing a single job to copy from multiple folders.

Create a folders.txt file that lists one source prefix per line:

cat /opt/folders.txt
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-01
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-02

Run the copy and specify the prefixes file:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPrefixesFile file:///opt/folders.txt --parallelism 20

Verify the destination folders:

hdfs dfs -ls oss://<yourBucketName>/hourly_table
Found 4 items
drwxrwxrwx   -          0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-01
drwxrwxrwx   -          0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-02

Merge small files

Reading many small files from HDFS degrades processing performance. Use --groupBy and --targetSize together to merge small files into larger output files. --groupBy is a regular expression that groups files; --targetSize sets the maximum size of each merged file in MB.

The source directory contains eight small files:

hdfs dfs -ls /data/incoming/hourly_table/2017-02-01/03
Found 8 items
-rw-r-----   2 root hadoop       2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r-----   2 root hadoop       4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r-----   2 root hadoop       4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r-----   2 root hadoop       4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/5.log
-rw-r-----   2 root hadoop       4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/6.log
-rw-r-----   2 root hadoop       4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r-----   2 root hadoop       1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r-----   2 root hadoop       1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt

The following command merges all .txt files into a single output file of up to 10 MB:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --targetSize=10 --groupBy='.*/([a-z]+).*.txt' --parallelism 20

Verify the result. The two TXT files are merged into one:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03/
Found 1 items
-rw-rw-rw-   1       2032 2020-04-17 21:18 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp2
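The way --groupBy collects files can be sketched locally. Jindo DistCp uses a Java regular expression; the Python approximation below (with the dot before txt escaped for precision) groups files by the first capture group, and files that do not match are skipped:

```python
import re
from collections import defaultdict

# Files from the example directory above.
paths = [
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
]

# Files are merged per capture-group value; both .txt files capture
# "emp" and land in one group, while 1.log does not match and is skipped.
pattern = re.compile(r".*/([a-z]+).*\.txt")
groups = defaultdict(list)
for p in paths:
    m = pattern.fullmatch(p)
    if m:
        groups[m.group(1)].append(p)

print(dict(groups))
```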

Optimize copy performance

Jindo DistCp provides two job allocation strategies to improve performance when copying mixed file sizes. Choose based on your workload, then tune --parallelism to control concurrency.

Strategy When to use Option
Balance plan File sizes within the small-file group and within the large-file group are relatively uniform --enableBalancePlan
Dynamic plan File sizes vary significantly across files, with most files being small --enableDynamicPlan

If neither option is specified, files are allocated randomly across reduce tasks.

Neither --enableBalancePlan nor --enableDynamicPlan can be used together with --groupBy or --targetSize.

Balance plan:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableBalancePlan --parallelism 20

Dynamic plan:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableDynamicPlan --parallelism 20

Tune --parallelism to control the number of concurrent reduce tasks. The default is 7. Increase this value when the cluster has more available resources to improve copy throughput.

Enable transaction integrity

--enableTransaction enforces job-level transaction integrity. All files in the job either succeed or are rolled back together:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableTransaction --parallelism 20

Verify copy completeness

After a copy job completes, use --diff to compare the source and destination file lists:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --diff

If all files are successfully copied:

INFO distcp.JindoDistCp: distcp has been done completely

If any files are missing from the destination, Jindo DistCp generates a manifest file in the destination directory listing them. Use --copyFromManifest and --previousManifest to copy those files:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20

If Jindo DistCp performed compression or decompression during the copy, --diff does not return accurate file size comparisons. If the destination is an HDFS directory, specify --dest in one of these formats: /path, hdfs://<hostname>:<port>/path, or hdfs://<headerIp>:<port>/path. Other formats, such as hdfs:///path and hdfs:/path, are not supported.

Check DistCp counters

After a job completes, check the DistCp Counters section in the MapReduce job output to verify how much data was transferred:

Distcp Counters
        Bytes Destination Copied=11010048000
        Bytes Source Read=11010048000
        Files Copied=1001

Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0

If Jindo DistCp compressed or decompressed files during the copy, Bytes Destination Copied and Bytes Source Read may differ.

Access OSS with an AccessKey pair

By default, EMR uses AccessKey-free access for OSS. If you run Jindo DistCp from outside EMR, or AccessKey-free access is unavailable, specify an AccessKey pair using --key, --secret, and --endPoint:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --key <yourAccessKeyId> --secret <yourAccessKeySecret> --endPoint oss-cn-hangzhou.aliyuncs.com --parallelism 20

Replace <yourAccessKeyId> with your Alibaba Cloud AccessKey ID and <yourAccessKeySecret> with your AccessKey secret.

Write to OSS Archive or Infrequent Access storage

Use --policy to write data directly to a lower-cost OSS storage class during the copy.

  • Archive storage:

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy archive --parallelism 20
  • Infrequent Access (IA) storage:

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy ia --parallelism 20

Clean up residual files

When a DistCp task fails partway through, partially uploaded files may remain in the destination. OSS tracks these files internally by uploadId, but they are invisible in directory listings. Add --cleanUpPending to automatically clean them up after the task completes:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --cleanUpPending --parallelism 20

Alternatively, delete residual files manually from the OSS console.

FAQ

Why does `--diff` report incorrect file sizes after compression?

When Jindo DistCp compresses or decompresses files during the copy, the source and destination file sizes differ by design. --diff compares raw file sizes and therefore reports a mismatch. This is expected behavior, not a copy failure. Check the DistCp Counters instead to confirm the number of files copied.

Can I use `--enableBalancePlan` or `--enableDynamicPlan` with `--groupBy`?

No. Both plan options are incompatible with --groupBy and --targetSize. When merging small files, file allocation is determined by the grouping pattern, so the plan options have no effect and are not allowed in the same command.

Can I use `--deleteOnSuccess` with `--groupBy`?

No. --deleteOnSuccess cannot be used with --groupBy or --targetSize. After merging, the output is a new merged file; deleting the original inputs in the same job is not supported.

Why is the destination HDFS path format restricted for `--diff`?

When the destination is an HDFS directory, Jindo DistCp requires --dest to be in the format /path, hdfs://<hostname>:<port>/path, or hdfs://<headerIp>:<port>/path. Formats such as hdfs:///path and hdfs:/path are not supported and cause the diff operation to fail.