Jindo DistCp is a distributed copy tool developed by the Alibaba Cloud data lake storage team for transferring large volumes of data between storage systems—Hadoop Distributed File System (HDFS), OSS-HDFS, Object Storage Service (OSS), and Amazon Simple Storage Service (Amazon S3). It uses MapReduce to parallelize transfers, handle errors, and recover from failures. When copying from HDFS to OSS-HDFS, Jindo DistCp uses a custom CopyCommitter to copy files without renaming them, keeping copies consistent with the source. Jindo DistCp supports all features provided by Amazon S3 DistCp and HDFS DistCp. Compared with HDFS DistCp, Jindo DistCp significantly improves the efficiency, stability, and security of data copying.
Prerequisites
Before you begin, ensure that you have:
Java Development Kit (JDK) 1.8.0 installed
The Jindo DistCp JAR file (jindo-distcp-tool-x.x.x.jar):
On EMR V5.6.0 or later (minor versions), or EMR V3.40.0 or later (minor versions): the JAR is pre-installed at /opt/apps/JINDOSDK/jindosdk-current/tools.
On Hadoop 2.3 or later (non-EMR): download the jindosdk-${version}.tar.gz package from Download JindoData, extract it, and locate jindo-distcp-tool-x.x.x.jar in the /tools folder.
Parameters
Run Jindo DistCp with the hadoop jar command:
hadoop jar jindo-distcp-tool-${version}.jar --src <source> --dest <destination> [options]
The following table summarizes all parameters. See the sections below for details on each parameter.
| Parameter | Required | Default | Version | OSS | OSS-HDFS | Description |
|---|---|---|---|---|---|---|
| --src | Yes | None | 4.3.0+ | Supported | Supported | Source path |
| --dest | Yes | None | 4.3.0+ | Supported | Supported | Destination path |
| --bandWidth | No | -1 | 4.3.0+ | Supported | Supported | Per-task bandwidth limit (MB); -1 means no limit |
| --codec | No | keep | 4.3.0+ | Supported | Supported | Compression codec for destination files |
| --policy | No | Standard | 4.3.0+ | Supported | Not supported | OSS storage class for destination files |
| --filters | No | None | 4.3.0+ | Supported | Supported | File containing regex patterns to exclude |
| --srcPrefixesFile | No | None | 4.3.0+ | Supported | Supported | File containing regex patterns to include |
| --parallelism | No | 10 | 4.3.0+ | Supported | Supported | Number of map tasks (equivalent to mapreduce.job.maps) |
| --jobBatch | No | 10000 | 4.5.1+ | Supported | Supported | Maximum files per job |
| --taskBatch | No | 1 | 4.3.0+ | Supported | Supported | Files per map task |
| --tmp | No | /tmp | 4.3.0+ | Supported | Supported | HDFS temporary directory |
| --hadoopConf <key=value> | No | None | 4.3.0+ | Supported | Supported | Inline Hadoop configuration (for credentials) |
| --disableChecksum | No | false | 4.3.0+ | Supported | Supported | Skips post-copy checksum verification |
| --deleteOnSuccess | No | false | 4.3.0+ | Supported | Supported | Deletes source files after a successful copy |
| --enableTransaction | No | false | 4.3.0+ | Supported | Supported | Enables job-level atomicity |
| --ignore | No | false | 4.3.0+ | Supported | Supported | Continues the job when individual file copies fail |
| --enableCMS | No | false | 4.5.1+ | Supported | Supported | Enables CloudMonitor monitoring and alerting |
| --diff | No | DistCpMode.COPY | 4.3.0+ | Supported | Supported | Generates a file recording differences between source and destination |
| --update | No | DistCpMode.COPY | 4.3.0+ | Supported | Supported | Copies only files missing from or differing at the destination |
| --preserveMeta | No | false | 4.4.0+ | Not supported | Supported | Copies file metadata (Owner, Group, Permission, etc.) |
--policy applies to OSS only. --preserveMeta applies to OSS-HDFS only because OSS does not have an equivalent HDFS metadata model. All other parameters work with both OSS and OSS-HDFS.
--src and --dest
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
--src specifies the source path. --dest specifies the destination path. Both parameters support the following prefixes: hdfs://, oss://, s3://, cos://, obs://.
By default, Jindo DistCp copies the contents of the source directory—not the directory itself. If the destination directory does not exist, Jindo DistCp creates it.
Copy a directory:
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table
This copies all files under /data/hourly_table into oss://example-oss-bucket/hourly_table/.
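To make the "contents, not the directory itself" rule concrete, the following sketch shows how a source file path maps to its destination path. The dest_path_for helper is hypothetical and only manipulates strings; it is not part of Jindo DistCp.

```shell
# Hypothetical helper: show where a file under --src would land under --dest.
# It mirrors the rule that the contents of the source directory are copied,
# so the source directory name itself is not appended to the destination.
dest_path_for() {
  local src_root="${1%/}"    # strip one trailing slash, if present
  local dest_root="${2%/}"
  local file="$3"
  # Remove the source root prefix and re-anchor the remainder under --dest.
  printf '%s/%s\n' "$dest_root" "${file#"$src_root"/}"
}

dest_path_for /data/hourly_table oss://example-oss-bucket/hourly_table \
  /data/hourly_table/2017-02-01/03/1.log
# prints: oss://example-oss-bucket/hourly_table/2017-02-01/03/1.log
```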
Copy a single file (specify a destination directory, not a file path):
hadoop jar jindo-distcp-tool-${version}.jar \
--src /test.txt \
--dest oss://example-oss-bucket/tmp
--bandWidth
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Limits the bandwidth used per map task. Unit: MB. Use this to prevent copy jobs from saturating the network. The default value -1 means no limit.
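Because the limit applies to each map task, the aggregate ceiling grows with the number of tasks running at once. The sketch below works backwards from a total budget to a per-task value. It assumes all map tasks run concurrently; the per_task_bandwidth helper and the budget of 120 are illustrative, not part of Jindo DistCp.

```shell
# Hypothetical helper: split a total bandwidth budget across map tasks.
# Arguments: total budget (same unit as --bandWidth), number of map tasks.
per_task_bandwidth() {
  local total="$1" maps="$2"
  local per_task=$(( total / maps ))   # integer division rounds down
  # Keep at least 1 so the result stays a positive limit.
  if [ "$per_task" -lt 1 ]; then
    per_task=1
  fi
  echo "$per_task"
}

per_task_bandwidth 120 20   # prints 6
```

With 20 map tasks, a per-task value of 6 therefore caps the job at roughly 120 in aggregate under that assumption.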
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--bandWidth 6
--codec
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Compresses or decompresses files during the copy. Valid values: gzip, gz, lzo, lzop, snappy, none, keep.
| Value | Behavior |
|---|---|
| keep (default) | Copies files as-is, without compressing or decompressing |
| none | Copies without compression; decompresses the source file if it is compressed |
| gzip, gz, lzo, lzop, snappy | Compresses the destination file using the specified codec |
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--codec gz
After the job completes, destination files have the .gz extension:
oss://example-oss-bucket/hourly_table/2017-02-01/03/000151.sst.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/1.log.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/2.log.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/OPTIONS-000109.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/emp01.txt.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/emp06.txt.gz
To use lzo on an open source Hadoop cluster, install the gplcompression native library and the hadoop-lzo package first. On clusters without these dependencies, use a different codec.
--policy
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Not supported
Sets the storage class for files copied to OSS. The default is Standard. Supported values:
| Value | Storage class | Notes |
|---|---|---|
| (not set) | Standard | Default |
| ia | Infrequent Access (IA) | Lower cost for data accessed less than once a month |
| archive | Archive | For long-term archival; retrieval takes minutes |
| coldArchive | Cold Archive | For rarely accessed data; only supported in specific regions |
Cold Archive example (check region availability first):
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-bucket/hourly_table \
--policy coldArchive \
--parallelism 20
Archive example:
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-bucket/hourly_table \
--policy archive \
--parallelism 20
IA example:
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-bucket/hourly_table \
--policy ia \
--parallelism 20
--filters
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Excludes files from the copy based on regex patterns. Point --filters to a file containing one regex pattern per line. Files whose paths match any pattern are skipped.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--filters filter.txt
If filter.txt contains .*test.*, files with test anywhere in their path are excluded.
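Before launching a long job, it can help to preview which paths a filter file would exclude. The sketch below approximates the matching locally with grep. It assumes each pattern must match the whole path, and Jindo DistCp's own regex engine may differ in details. The sample paths and the preview_excluded name are made up for illustration.

```shell
# Hypothetical helper: read paths (one per line) on stdin and print those
# that the patterns in a filter file would exclude. Uses grep -E -x, that is,
# extended regex with whole-line matching, as an approximation.
preview_excluded() {
  local filter_file="$1"
  grep -E -x -f "$filter_file" || true   # "no match" is not an error here
}

workdir="$(mktemp -d)"
printf '%s\n' '.*test.*' > "$workdir/filter.txt"

printf '%s\n' \
  /data/hourly_table/2017-02-01/03/1.log \
  /data/hourly_table/test/emp01.txt \
  /data/hourly_table/2017-02-01/03/test_2.log \
  | preview_excluded "$workdir/filter.txt"
# prints the two paths that contain "test"

rm -rf "$workdir"
```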
--srcPrefixesFile
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Copies only the files whose paths match the regex patterns in the specified file. This is the inverse of --filters: only matching files are included.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--srcPrefixesFile prefixes.txt
If prefixes.txt contains .*test.*, only files with test in their path are copied.
--parallelism
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Sets the number of map tasks for the DistCp job, equivalent to the mapreduce.job.maps parameter. The default in EMR is 10.
Because files are the smallest unit of work in a DistCp job, increasing the number of map tasks beyond the total file count provides no benefit. Adding more tasks improves throughput only when the cluster has spare CPU and network capacity. Tune this value based on your cluster size, the number of files, and available bandwidth.
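The reasoning above can be sketched as a small calculation: never ask for more map tasks than there are files, and never exceed what the cluster can sustain. The suggest_parallelism helper and the cap of 20 are illustrative assumptions.

```shell
# Hypothetical helper: pick a --parallelism value that does not exceed the
# number of files. Arguments: file count, upper bound the cluster can sustain.
suggest_parallelism() {
  local files="$1" cap="$2"
  if [ "$files" -lt "$cap" ]; then
    echo "$files"
  else
    echo "$cap"
  fi
}

suggest_parallelism 6 20      # prints 6
suggest_parallelism 5000 20   # prints 20
```

On HDFS, the file count can be read from the second column of `hadoop fs -count <path>`.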
hadoop jar jindo-distcp-tool-${version}.jar \
--src /opt/tmp \
--dest oss://example-oss-bucket/tmp \
--parallelism 20
--jobBatch
Version: 4.5.1+ | OSS: Supported | OSS-HDFS: Supported
Sets the maximum number of files processed per DistCp job. Default: 10000. Increase this value when copying large datasets to reduce job overhead.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--jobBatch 50000
--taskBatch
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Sets the number of files processed per map task. Default: 1.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--taskBatch 1
--tmp
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Specifies the HDFS directory used to store temporary data. Default: /tmp (resolves to hdfs:///tmp/).
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--tmp /tmp
--hadoopConf
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Passes Hadoop configuration properties inline. Use this to provide credentials for OSS or OSS-HDFS in non-EMR environments, or when AccessKey-free access is not available.
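One way to keep the AccessKey pair out of scripts and shell history is to read it from environment variables and generate the --hadoopConf arguments from them. The sketch below is a hypothetical wrapper; the variable names OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET and the demo values are assumptions, not names that Jindo DistCp reads itself.

```shell
# Hypothetical wrapper: build the --hadoopConf arguments from environment
# variables. Prints one argument per line and fails if a variable is unset.
oss_credential_args() {
  : "${OSS_ACCESS_KEY_ID:?set OSS_ACCESS_KEY_ID first}"
  : "${OSS_ACCESS_KEY_SECRET:?set OSS_ACCESS_KEY_SECRET first}"
  printf '%s\n' \
    --hadoopConf "fs.oss.accessKeyId=${OSS_ACCESS_KEY_ID}" \
    --hadoopConf "fs.oss.accessKeySecret=${OSS_ACCESS_KEY_SECRET}"
}

# Demo with placeholder values only.
OSS_ACCESS_KEY_ID=demo-id
OSS_ACCESS_KEY_SECRET=demo-secret
oss_credential_args
# prints the four arguments, one per line
```

In bash 4 or later, `mapfile -t cred_args < <(oss_credential_args)` collects the output into an array that can be appended to the hadoop jar command as `"${cred_args[@]}"`. The values still appear in the process arguments of the running command, so this reduces exposure in files and history rather than removing it.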
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
--hadoopConf fs.oss.accessKeySecret=<your-access-key-secret>
To avoid passing credentials on every command, add them to the core-site.xml of the Hadoop-Common service in the EMR console:
<configuration>
<property>
<name>fs.oss.accessKeyId</name>
<value>xxx</value>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>xxx</value>
</property>
</configuration>
--disableChecksum
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Skips checksum verification after copying. By default, Jindo DistCp verifies file checksums to confirm data integrity. Disable this only when checksum verification causes false failures—for example, when copying between storage systems that use incompatible checksum algorithms.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--disableChecksum
--deleteOnSuccess
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Deletes source files after a successful copy—similar to a mv operation. Use this for data migration when source files are no longer needed after transfer.
Deletion is irreversible. Verify that the copy completed successfully before using this flag in production.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--deleteOnSuccess
--enableTransaction
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Enables job-level atomicity. By default, Jindo DistCp ensures data integrity at the task level—a failed task does not affect other tasks. With --enableTransaction, the entire job succeeds or fails as a unit: no partial results are written to the destination.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--enableTransaction
--ignore
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Continues the job when individual file copies fail, rather than stopping the entire job. Failed files are recorded in the Jindo DistCp counters (see COPY_FAILED). If CloudMonitor is enabled, alerts are sent through your configured notification channels.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--ignore
--diff
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Enables diff mode, which checks for differences between source and destination files. If a source file was not copied to the destination directory, Jindo DistCp generates a file in the directory where the command is run that records the differences between the source and destination files.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--diff
Sample output when differences exist:
JindoCounter
DIFF_FILES=1
To include metadata in the comparison, combine --diff with --preserveMeta:
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--diff --preserveMeta
Limitations:
File size differences are not accurate if Jindo DistCp applied compression or decompression during a previous copy.
When --dest is an HDFS path, use /path, hdfs://hostname:port/path, or hdfs://headerIp:port/path. The formats hdfs:///path and hdfs:/path are not supported.
--update
Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported
Copies only files that are missing from the destination or differ from the corresponding destination files. Use this to resume an interrupted job or to sync new files added to the source.
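Resuming an interrupted job can be scripted as a retry loop that adds --update on every attempt after the first. The sketch below is a hypothetical wrapper, not a Jindo DistCp feature, and it assumes the hadoop jar command exits with a non-zero status when the job fails.

```shell
# Hypothetical retry loop: run the first attempt as given, then append
# --update on later attempts so only missing or differing files are copied.
# Usage: copy_with_resume <max_attempts> <command> [args...]
copy_with_resume() {
  local max_attempts="$1"; shift
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if [ "$attempt" -eq 1 ]; then
      "$@" && return 0
    else
      "$@" --update && return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$(( attempt + 1 ))
  done
  return 1   # every attempt failed
}
```

A call might look like `copy_with_resume 3 hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table`.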
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--update
--preserveMeta
Version: 4.4.0+ | OSS: Not supported | OSS-HDFS: Supported
Copies file metadata along with file content. The following metadata attributes are preserved: Owner, Group, Permission, Atime, Mtime, Replication, BlockSize, XAttrs, and ACL.
OSS does not support this parameter because it does not have an equivalent metadata model to HDFS. Use this parameter only when the destination is OSS-HDFS.
hadoop jar jindo-distcp-tool-${version}.jar \
--src /data/hourly_table \
--dest oss://example-oss-bucket/hourly_table \
--preserveMeta
--enableCMS
Version: 4.5.1+ | OSS: Supported | OSS-HDFS: Supported
Enables monitoring and alerting through CloudMonitor. When active, CloudMonitor sends notifications through your configured alert channels when copy errors occur.
Jindo DistCp counters
After each job, Jindo DistCp reports execution statistics through counters. Use these to verify copy results and diagnose issues.
| Counter | Description |
|---|---|
| FILES_EXPECTED | Number of files expected to be copied |
| BYTES_EXPECTED | Number of bytes expected to be copied |
| FILES_COPIED | Number of files successfully copied |
| BYTES_COPIED | Number of bytes successfully copied |
| FILES_SKIPPED | Number of files skipped during incremental update (--update) |
| BYTES_SKIPPED | Number of bytes skipped during incremental update |
| COPY_FAILED | Number of files that failed to copy |
| CHECKSUM_DIFF | Number of files that failed checksum verification; included in COPY_FAILED (in copy mode) and in DIFF_FILES (in --diff mode) |
| DIFF_FILES | Number of files that differ between source and destination (--diff mode) |
| SAME_FILES | Number of files that are identical in source and destination (--diff mode) |
| DST_MISS | Files missing from the destination; included in DIFF_FILES |
| LENGTH_DIFF | Files with different sizes in source and destination; included in DIFF_FILES |
| DIFF_FAILED | Files that could not be compared |
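If the job output is saved to a file, the counters can be checked from a script. The sketch below assumes counters are printed one per line as NAME=VALUE, as in the DIFF_FILES=1 sample output, and that a complete copy means COPY_FAILED is 0 and FILES_EXPECTED equals FILES_COPIED plus FILES_SKIPPED. Both helper names and the sample log values are made up for illustration.

```shell
# Hypothetical helper: read a counter value from saved job output.
# Prints 0 when the counter does not appear in the file.
counter_value() {
  local name="$1" log_file="$2"
  awk -F= -v name="$name" '
    { gsub(/^[ \t]+/, "", $1) }     # tolerate leading indentation
    $1 == name { value = $2 }       # keep the last occurrence
    END { print (value == "" ? 0 : value + 0) }
  ' "$log_file"
}

# Hypothetical check: succeed only if nothing failed and every expected
# file was either copied or skipped.
copy_is_complete() {
  local log_file="$1"
  local expected copied skipped failed
  expected="$(counter_value FILES_EXPECTED "$log_file")"
  copied="$(counter_value FILES_COPIED "$log_file")"
  skipped="$(counter_value FILES_SKIPPED "$log_file")"
  failed="$(counter_value COPY_FAILED "$log_file")"
  [ "$failed" -eq 0 ] && [ "$expected" -eq $(( copied + skipped )) ]
}

# Demo with an invented log excerpt.
log="$(mktemp)"
cat > "$log" <<'EOF'
JindoCounter
  FILES_EXPECTED=6
  FILES_COPIED=6
  COPY_FAILED=0
EOF
counter_value FILES_COPIED "$log"                 # prints 6
copy_is_complete "$log" && echo "copy complete"   # prints "copy complete"
rm -f "$log"
```

A check like this is one way to confirm a copy before relying on --deleteOnSuccess or after running with --ignore.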