Jindo DistCp is a distributed data copy tool built for E-MapReduce (EMR). It transfers files between Hadoop Distributed File System (HDFS) and OSS, and supports filtering, on-the-fly compression, small-file merging, incremental copy, and job-level transaction integrity.
Prerequisites
Before you begin, make sure you have:
- An EMR cluster of version 3.28.0 or later. For more information, see Create a cluster.
- SSH access to the master node of the cluster. For more information, see Log on to a cluster.
Quick start
Log on to the master node and run a basic copy:
jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp
Replace <yourBucketName> with the name of your OSS bucket. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.
To see all available options, run:
jindo distcp --help
Options reference
The following table lists all supported options. All examples in this document use /data/incoming/hourly_table as the source and oss://<yourBucketName>/hourly_table as the destination. Replace these placeholders with your actual values.
| Option | Description | Notes |
|---|---|---|
| --src | Source directory to copy files from | Required |
| --dest | Destination directory to copy files to | Required |
| --parallelism | Number of parallel reduce tasks. Sets mapreduce.job.reduces. Default: 7 | Tune based on available cluster resources |
| --srcPattern | Regular expression to filter source files by full path | — |
| --deleteOnSuccess | Delete source files after a successful copy | Cannot be used with --groupBy or --targetSize |
| --outputCodec | Compression codec for copied files. Valid values: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep | none decompresses already-compressed files; keep preserves the original format |
| --outputManifest | Name of the output manifest file (GZ format only) | Use with --requirePreviousManifest=false to create a new manifest |
| --requirePreviousManifest | Require a previous manifest to be present before running | Set to false when creating the first manifest |
| --previousManifest | Path to an existing manifest file | Use with --outputManifest for incremental copy tracking |
| --copyFromManifest | Copy only the files listed in a manifest file | Use with --previousManifest to replay a manifest |
| --srcPrefixesFile | Path to a text file with one source URI prefix per line | Enables copying from multiple source folders in a single job |
| --groupBy | Regular expression to group input files for merging | Use with --targetSize |
| --targetSize | Target size for merged output files, in MB | Use with --groupBy |
| --enableBalancePlan | Optimizes job allocation when file sizes within each tier (small/large) are relatively uniform | Cannot be used with --groupBy or --targetSize |
| --enableDynamicPlan | Dynamically allocates copy tasks for workloads where file sizes vary significantly | Cannot be used with --groupBy or --targetSize |
| --enableTransaction | Enables job-level transaction integrity | — |
| --diff | Compares source and destination file lists after a copy | If differences exist, generates a manifest file in the destination directory; does not return accurate size comparisons when compression or decompression was applied |
| --cleanUpPending | Removes residual files left by failed uploads, tracked by uploadId | — |
| --key | AccessKey ID for OSS access outside EMR | Use when AccessKey-free access is unavailable |
| --secret | AccessKey secret for OSS access outside EMR | Use when AccessKey-free access is unavailable |
| --endPoint | OSS endpoint | Use when AccessKey-free access is unavailable |
| --policy | OSS storage class for destination. Valid values: archive, ia | — |
Copy files
Use --src and --dest to specify the source and destination directories. By default, all files under --src are copied. If the destination root directory does not exist, Jindo DistCp creates it automatically.
The following command copies all files from the /opt/tmp HDFS directory to an OSS bucket:
jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp
Filter files by pattern
Use --srcPattern to copy only files whose full path matches a regular expression.
The /data/incoming/hourly_table/2017-02-01/03 directory contains the following files:
Found 6 items
-rw-r----- 2 root hadoop 2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt
To copy only the .log files, set --srcPattern to .*\.log:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPattern .*\.log --parallelism 20
Verify the result:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03
Only the log files are copied:
Found 2 items
-rw-rw-rw- 1 4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log
-rw-rw-rw- 1 4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log
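The --srcPattern expression is matched against each file's full path, so a pattern can be sanity-checked locally before launching a job. The following Python sketch applies the same regular expression to the example paths above, assuming the pattern must match the entire path (the file list is hard-coded here for illustration):

```python
import re

# Example source paths from the listing above
paths = [
    "/data/incoming/hourly_table/2017-02-01/03/000151.sst",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
    "/data/incoming/hourly_table/2017-02-01/03/2.log",
    "/data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109",
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
]

pattern = re.compile(r".*\.log")
# fullmatch mirrors matching against the full path
matched = [p for p in paths if pattern.fullmatch(p)]
for p in matched:
    print(p)
```

Only the two .log paths survive the filter, matching the copy result shown above.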
Delete source files after copy
Add --deleteOnSuccess to remove source files from the source directory after the copy completes successfully:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --deleteOnSuccess --parallelism 20
Compress files during copy
Use --outputCodec to compress copied files on the fly. The following command compresses all files using gzip:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputCodec=gz --parallelism 20
Verify the result:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03
All files in the destination directory have the .gz extension:
Found 6 items
-rw-rw-rw- 1 938 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log.gz
-rw-rw-rw- 1 1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109.gz
-rw-rw-rw- 1 506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt.gz
-rw-rw-rw- 1 506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt.gz
Valid values for --outputCodec: gzip, gz, lzo, lzop, snappy, none, keep. Default: keep.
- none: decompresses files if they are already compressed; leaves uncompressed files as-is.
- keep: preserves the original compression format.
To use the LZO codec in an open-source Hadoop cluster, install the native gplcompression library and the hadoop-lzo package.
Track copies with manifest files
A manifest file records the destination path, source path, and size for every file copied in a job. Use manifest files to track incremental copies and verify completeness.
Create a manifest file
Set --requirePreviousManifest=false to generate a new manifest file. The file is always stored in GZ format.
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-17.gz --requirePreviousManifest=false --parallelism 20
Inspect the manifest:
hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz > before.lst
cat before.lst
Each line in the manifest is a JSON record:
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst","baseName":"2017-02-01/03/000151.sst","srcDir":"oss://<yourBucketName>/hourly_table","size":2252}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log","baseName":"2017-02-01/03/1.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log","baseName":"2017-02-01/03/2.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109","baseName":"2017-02-01/03/OPTIONS-000109","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt","baseName":"2017-02-01/03/emp01.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt","baseName":"2017-02-01/03/emp06.txt","srcDir":"oss://<yourBucketName>/hourly_table","size":1016}
Copy only new files using a previous manifest
When new files are added to the source directory, use --previousManifest to skip already-copied files and transfer only the new ones. The new manifest (--outputManifest) includes both previously and newly copied files, giving you a complete history.
The following command copies only files added since manifest-2020-04-17.gz:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-18.gz --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --parallelism 20
Compare the two manifests to see what was newly copied:
hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-18.gz > current.lst
diff before.lst current.lst
The diff output shows the two newly added files:
3a4,5
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/5.log","baseName":"2017-02-01/03/5.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/6.log","baseName":"2017-02-01/03/6.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
Copy files from a manifest
Use --copyFromManifest with --previousManifest to copy exactly the files listed in a previously generated manifest:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
Copy files from multiple source folders
--srcPrefixesFile accepts a text file with one source URI prefix per line, allowing a single job to copy from multiple folders.
List the source folders:
hdfs dfs -ls oss://<yourBucketName>/hourly_table
Found 4 items
drwxrwxrwx - 0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-01
drwxrwxrwx - 0 1970-01-01 08:00 oss://<yourBucketName>/hourly_table/2017-02-02
Run the copy and specify the prefixes file:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPrefixesFile file:///opt/folders.txt --parallelism 20
View the folders.txt file to confirm its contents:
cat folders.txt
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-01
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-02
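The prefixes file is plain text with one fully qualified URI per line, so it can be produced with any scripting tool. A hypothetical helper that writes one prefix per date partition (the namenode address is taken from the listing above and is specific to that example cluster):

```python
# Write one fully qualified source prefix per line
base = "hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table"
dates = ["2017-02-01", "2017-02-02"]

with open("folders.txt", "w") as f:
    for d in dates:
        f.write(f"{base}/{d}\n")

with open("folders.txt") as f:
    content = f.read()
print(content)
```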
Merge small files
Reading many small files from HDFS degrades processing performance. Use --groupBy and --targetSize together to merge small files into larger output files. --groupBy is a regular expression that groups files; --targetSize sets the maximum size of each merged file in MB.
The source directory contains eight small files:
hdfs dfs -ls /data/incoming/hourly_table/2017-02-01/03
Found 8 items
-rw-r----- 2 root hadoop 2252 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/000151.sst
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/1.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/2.log
-rw-r----- 2 root hadoop 4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/5.log
-rw-r----- 2 root hadoop 4891 2020-04-17 21:08 /data/incoming/hourly_table/2017-02-01/03/6.log
-rw-r----- 2 root hadoop 4891 2020-04-17 20:42 /data/incoming/hourly_table/2017-02-01/03/OPTIONS-000109
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp01.txt
-rw-r----- 2 root hadoop 1016 2020-04-17 20:47 /data/incoming/hourly_table/2017-02-01/03/emp06.txt
The following command merges all .txt files into a single output file of up to 10 MB:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --targetSize=10 --groupBy='.*/([a-z]+).*.txt' --parallelism 20
Verify the result. The two TXT files are merged into one:
hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03/
Found 1 items
-rw-rw-rw- 1 2032 2020-04-17 21:18 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp2
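The --groupBy pattern works through its capture group: files whose full paths match the expression are bucketed by the captured text, and each bucket is merged. The sketch below applies the example pattern locally to show which files land in which bucket; the bucketing semantics are inferred from the output above, so treat this as an approximation rather than Jindo DistCp's exact algorithm:

```python
import re
from collections import defaultdict

paths = [
    "/data/incoming/hourly_table/2017-02-01/03/000151.sst",
    "/data/incoming/hourly_table/2017-02-01/03/1.log",
    "/data/incoming/hourly_table/2017-02-01/03/emp01.txt",
    "/data/incoming/hourly_table/2017-02-01/03/emp06.txt",
]

# Same expression as in the command above
pattern = re.compile(r".*/([a-z]+).*.txt")
groups = defaultdict(list)
for p in paths:
    m = pattern.fullmatch(p)
    if m:
        # Files sharing a capture-group value are merged together
        groups[m.group(1)].append(p)
print(dict(groups))
```

Only the two .txt files match, and both capture the same text ("emp"), so they form a single group, consistent with the single merged output file above.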
Optimize copy performance
Jindo DistCp provides two job allocation strategies to improve performance when copying mixed file sizes. Choose based on your workload, then tune --parallelism to control concurrency.
| Strategy | When to use | Option |
|---|---|---|
| Balance plan | File sizes within the small-file group and within the large-file group are relatively uniform | --enableBalancePlan |
| Dynamic plan | File sizes vary significantly across files, with most files being small | --enableDynamicPlan |
If neither option is specified, files are allocated randomly across reduce tasks.
Neither --enableBalancePlan nor --enableDynamicPlan can be used together with --groupBy or --targetSize.
Balance plan:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableBalancePlan --parallelism 20
Dynamic plan:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableDynamicPlan --parallelism 20
Tune --parallelism to control the number of concurrent reduce tasks. The default is 7. Increase this value when the cluster has more available resources to improve copy throughput.
Enable transaction integrity
--enableTransaction enforces job-level transaction integrity. All files in the job either succeed or are rolled back together:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableTransaction --parallelism 20
Verify copy completeness
After a copy job completes, use --diff to compare the source and destination file lists:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --diff
If all files are successfully copied:
INFO distcp.JindoDistCp: distcp has been done completely
If any files are missing from the destination, Jindo DistCp generates a manifest file in the destination directory listing them. Use --copyFromManifest and --previousManifest to copy those files:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
If Jindo DistCp performed compression or decompression during the copy, --diff does not return accurate file size comparisons. If the destination is an HDFS directory, specify --dest in one of these formats: /path, hdfs://hostname:ip/path, or hdfs://headerIp:ip/path. Other formats such as hdfs:///path and hdfs:/path are not supported.
Check DistCp counters
After a job completes, check the DistCp Counters section in the MapReduce job output to verify how much data was transferred:
Distcp Counters
Bytes Destination Copied=11010048000
Bytes Source Read=11010048000
Files Copied=1001
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
If Jindo DistCp compressed or decompressed files during the copy, Bytes Destination Copied and Bytes Source Read may differ.
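The counters block is plain key=value text, so a post-run completeness check can be scripted. A sketch that parses the sample output above and confirms source and destination byte counts agree, which holds only when no compression or decompression was applied:

```python
# Sample counters copied from the job output above
counters_text = """\
Bytes Destination Copied=11010048000
Bytes Source Read=11010048000
Files Copied=1001
"""

counters = {}
for line in counters_text.splitlines():
    key, _, value = line.partition("=")
    counters[key] = int(value)

# Byte counts match only when no codec changed the files in flight
assert counters["Bytes Destination Copied"] == counters["Bytes Source Read"]
print(f"{counters['Files Copied']} files verified")
```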
Access OSS with an AccessKey pair
By default, EMR uses AccessKey-free access for OSS. If you run Jindo DistCp from outside EMR, or AccessKey-free access is unavailable, specify an AccessKey pair using --key, --secret, and --endPoint:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --key <yourAccessKeyId> --secret <yourAccessKeySecret> --endPoint oss-cn-hangzhou.aliyuncs.com --parallelism 20
Replace <yourAccessKeyId> with your Alibaba Cloud AccessKey ID and <yourAccessKeySecret> with your AccessKey secret.
Write to OSS Archive or Infrequent Access storage
Use --policy to write data directly to a lower-cost OSS storage class during the copy.
- Archive storage:
  jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy archive --parallelism 20
- Infrequent Access (IA) storage:
  jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy ia --parallelism 20
Clean up residual files
When a DistCp task fails partway through, partially uploaded files may remain in the destination. OSS tracks these files internally by uploadId, but they are invisible in directory listings. Add --cleanUpPending to automatically clean them up after the task completes:
jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --cleanUpPending --parallelism 20
Alternatively, delete residual files manually from the OSS console.
FAQ
Why does `--diff` report incorrect file sizes after compression?
When Jindo DistCp compresses or decompresses files during the copy, the source and destination file sizes differ by design. --diff compares raw file sizes and therefore reports a mismatch. This is expected behavior, not a copy failure. Check the DistCp Counters instead to confirm the number of files copied.
Can I use `--enableBalancePlan` or `--enableDynamicPlan` with `--groupBy`?
No. Both plan options are incompatible with --groupBy and --targetSize. When merging small files, file allocation is determined by the grouping pattern, so the plan options have no effect and are not allowed in the same command.
Can I use `--deleteOnSuccess` with `--groupBy`?
No. --deleteOnSuccess cannot be used with --groupBy or --targetSize. After merging, the output is a new merged file; deleting the original inputs in the same job is not supported.
Why is the destination HDFS path format restricted for `--diff`?
When the destination is an HDFS directory, Jindo DistCp requires --dest to be in the format /path, hdfs://hostname:ip/path, or hdfs://headerIp:ip/path. Formats such as hdfs:///path and hdfs:/path are not supported and will cause the diff operation to fail.