
E-MapReduce: Use Jindo DistCp

Last Updated:Mar 26, 2026

Jindo DistCp is a distributed copy tool developed by the Alibaba Cloud data lake storage team for transferring large volumes of data between storage systems—Hadoop Distributed File System (HDFS), OSS-HDFS, Object Storage Service (OSS), and Amazon Simple Storage Service (Amazon S3). It uses MapReduce to parallelize transfers, handle errors, and recover from failures. When copying from HDFS to OSS-HDFS, Jindo DistCp uses a custom CopyCommitter to copy files without renaming them, keeping copies consistent with the source. Jindo DistCp supports all features provided by Amazon S3 DistCp and HDFS DistCp. Compared with HDFS DistCp, Jindo DistCp significantly improves the efficiency, stability, and security of data copying.

Prerequisites

Before you begin, ensure that you have:

  • Java Development Kit (JDK) 1.8.0 installed

  • The Jindo DistCp JAR file (jindo-distcp-tool-x.x.x.jar):

    • On EMR V5.6.0 or later (minor versions), or EMR V3.40.0 or later (minor versions): the JAR is pre-installed at /opt/apps/JINDOSDK/jindosdk-current/tools.

    • On Hadoop 2.3 or later (non-EMR): download the jindosdk-${version}.tar.gz package from Download JindoData, extract it, and locate jindo-distcp-tool-x.x.x.jar in the /tools folder.

Parameters

Run Jindo DistCp with the hadoop jar command:

hadoop jar jindo-distcp-tool-${version}.jar --src <source> --dest <destination> [options]

The following table summarizes all parameters. See the sections below for details on each parameter.

| Parameter | Required | Default | Version | OSS | OSS-HDFS | Description |
| --- | --- | --- | --- | --- | --- | --- |
| --src | Yes | None | 4.3.0+ | Supported | Supported | Source path |
| --dest | Yes | None | 4.3.0+ | Supported | Supported | Destination path |
| --bandWidth | No | -1 | 4.3.0+ | Supported | Supported | Per-task bandwidth limit (MB); -1 means no limit |
| --codec | No | keep | 4.3.0+ | Supported | Supported | Compression codec for destination files |
| --policy | No | Standard | 4.3.0+ | Supported | Not supported | OSS storage class for destination files |
| --filters | No | None | 4.3.0+ | Supported | Supported | File containing regex patterns to exclude |
| --srcPrefixesFile | No | None | 4.3.0+ | Supported | Supported | File containing regex patterns to include |
| --parallelism | No | 10 | 4.3.0+ | Supported | Supported | Number of map tasks (equivalent to mapreduce.job.maps) |
| --jobBatch | No | 10,000 | 4.5.1+ | Supported | Supported | Maximum files per job |
| --taskBatch | No | 1 | 4.3.0+ | Supported | Supported | Files per map task |
| --tmp | No | /tmp | 4.3.0+ | Supported | Supported | HDFS temporary directory |
| --hadoopConf <key=value> | No | None | 4.3.0+ | Supported | Supported | Inline Hadoop configuration (for credentials) |
| --disableChecksum | No | false | 4.3.0+ | Supported | Supported | Skips post-copy checksum verification |
| --deleteOnSuccess | No | false | 4.3.0+ | Supported | Supported | Deletes source files after a successful copy |
| --enableTransaction | No | false | 4.3.0+ | Supported | Supported | Enables job-level atomicity |
| --ignore | No | false | 4.3.0+ | Supported | Supported | Continues the job when individual file copies fail |
| --enableCMS | No | false | 4.5.1+ | Supported | Supported | Enables CloudMonitor monitoring and alerting |
| --diff | No | DistCpMode.COPY | 4.3.0+ | Supported | Supported | Generates a file recording differences between source and destination |
| --update | No | DistCpMode.COPY | 4.3.0+ | Supported | Supported | Copies only files missing from or differing at the destination |
| --preserveMeta | No | false | 4.4.0+ | Not supported | Supported | Copies file metadata (Owner, Group, Permission, etc.) |

--policy applies to OSS only. --preserveMeta applies to OSS-HDFS only because OSS does not have an equivalent HDFS metadata model. All other parameters work with both OSS and OSS-HDFS.

--src and --dest

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

--src specifies the source path. --dest specifies the destination path. Both parameters support the following prefixes: hdfs://, oss://, s3://, cos://, obs://.

By default, Jindo DistCp copies the contents of the source directory—not the directory itself. If the destination directory does not exist, Jindo DistCp creates it.

Copy a directory:

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table

This copies all files under /data/hourly_table into oss://example-oss-bucket/hourly_table/.

Copy a single file (specify a destination directory, not a file path):

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /test.txt \
  --dest oss://example-oss-bucket/tmp

--bandWidth

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Limits the bandwidth used per map task. Unit: MB. Use this to prevent copy jobs from saturating the network. The default value -1 means no limit.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --bandWidth 6
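Because the limit applies to each map task independently, the job-wide throughput ceiling is roughly the per-task limit multiplied by the number of concurrent map tasks. A small sketch of that arithmetic (the 6 MB and 20-task figures are illustrative, not tool defaults):

```python
def aggregate_bandwidth_mb(band_width_mb: int, parallelism: int) -> float:
    """Estimate the aggregate bandwidth cap of a DistCp job.

    --bandWidth limits each map task separately, so the job-wide
    ceiling is roughly per-task limit x concurrent tasks.
    """
    if band_width_mb == -1:          # -1 means no limit
        return float("inf")
    return float(band_width_mb * parallelism)

# With --bandWidth 6 and --parallelism 20, the job is capped near 120 MB/s.
print(aggregate_bandwidth_mb(6, 20))   # 120.0
print(aggregate_bandwidth_mb(-1, 20))  # inf
```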

--codec

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Compresses or decompresses files during the copy. Valid values: gzip, gz, lzo, lzop, snappy, none, keep.

| Value | Behavior |
| --- | --- |
| keep (default) | Copies files as-is, without compressing or decompressing |
| none | Copies without compression; decompresses the source file if it is compressed |
| gzip, gz, lzo, lzop, snappy | Compresses the destination file using the specified codec |

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --codec gz

After the job completes, destination files have the .gz extension:

oss://example-oss-bucket/hourly_table/2017-02-01/03/000151.sst.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/1.log.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/2.log.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/OPTIONS-000109.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/emp01.txt.gz
oss://example-oss-bucket/hourly_table/2017-02-01/03/emp06.txt.gz
To use lzo on an open source Hadoop cluster, install the gplcompression native library and the hadoop-lzo package first. On clusters without these dependencies, use a different codec.

--policy

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Not supported

Sets the storage class for files copied to OSS. The default is Standard. Supported values:

| Value | Storage class | Notes |
| --- | --- | --- |
| (not set) | Standard | Default |
| ia | Infrequent Access (IA) | Lower cost for data accessed less than once a month |
| archive | Archive | For long-term archival; retrieval takes minutes |
| coldArchive | Cold Archive | For rarely accessed data; only supported in specific regions |

Cold Archive example (check region availability first):

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-bucket/hourly_table \
  --policy coldArchive \
  --parallelism 20

Archive example:

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-bucket/hourly_table \
  --policy archive \
  --parallelism 20

IA example:

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-bucket/hourly_table \
  --policy ia \
  --parallelism 20

--filters

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Excludes files from the copy based on regex patterns. Point --filters to a file containing one regex pattern per line. Files whose paths match any pattern are skipped.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --filters filter.txt

If filter.txt contains .*test.*, files with test anywhere in their path are excluded.
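The patterns are ordinary regular expressions matched against file paths. A quick local way to preview which paths a filter file would skip (a sketch assuming full-path regex matching; not part of the tool):

```python
import re

# Patterns as they would appear in filter.txt, one regex per line.
patterns = [re.compile(p) for p in [".*test.*"]]

paths = [
    "/data/hourly_table/2017-02-01/03/1.log",
    "/data/hourly_table/test/2.log",
    "/data/hourly_table/prod/test_backup.log",
]

# A path is skipped if it matches any pattern; the rest are copied.
kept = [p for p in paths if not any(rx.match(p) for rx in patterns)]
print(kept)  # ['/data/hourly_table/2017-02-01/03/1.log']
```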

--srcPrefixesFile

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Copies only the files whose paths match the regex patterns in the specified file. This is the inverse of --filters: only matching files are included.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --srcPrefixesFile prefixes.txt

If prefixes.txt contains .*test.*, only files with test in their path are copied.

--parallelism

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Sets the number of map tasks for the DistCp job, equivalent to the mapreduce.job.maps parameter. The default in EMR is 10.

Because files are the smallest unit of work in a DistCp job, increasing the number of map tasks beyond the total file count provides no benefit. Adding more tasks improves throughput only when the cluster has spare CPU and network capacity. Tune this value based on your cluster size, the number of files, and available bandwidth.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /opt/tmp \
  --dest oss://example-oss-bucket/tmp \
  --parallelism 20
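A starting heuristic that follows from the paragraph above can be sketched as follows. The "spare slots" figure is an assumption you would read from your own cluster, not a Jindo default:

```python
def suggest_parallelism(file_count: int, spare_slots: int) -> int:
    """Heuristic starting point for --parallelism.

    More map tasks than files gives no benefit, since a file is the
    smallest unit of work; tasks beyond the cluster's spare capacity
    just queue. Start from the smaller of the two, with a floor of 1.
    """
    return max(1, min(file_count, spare_slots))

print(suggest_parallelism(8, 20))     # 8  -> capped by the file count
print(suggest_parallelism(5000, 20))  # 20 -> capped by cluster capacity
```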

--jobBatch

Version: 4.5.1+ | OSS: Supported | OSS-HDFS: Supported

Sets the maximum number of files processed per DistCp job. Default: 10000. Increase this value when copying large datasets to reduce job overhead.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --jobBatch 50000

--taskBatch

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Sets the number of files processed per map task. Default: 1.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --taskBatch 1

--tmp

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Specifies the HDFS directory used to store temporary data. Default: /tmp (resolves to hdfs:///tmp/).

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --tmp /tmp

--hadoopConf

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Passes Hadoop configuration properties inline. Use this to provide credentials for OSS or OSS-HDFS in non-EMR environments, or when AccessKey-free access is not available.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --hadoopConf fs.oss.accessKeyId=<your-access-key-id> \
  --hadoopConf fs.oss.accessKeySecret=<your-access-key-secret>

To avoid passing credentials on every command, add them to the core-site.xml of the Hadoop-Common service in the EMR console:

<configuration>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>xxx</value>
    </property>
</configuration>

--disableChecksum

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Skips checksum verification after copying. By default, Jindo DistCp verifies file checksums to confirm data integrity. Disable this only when checksum verification causes false failures—for example, when copying between storage systems that use incompatible checksum algorithms.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --disableChecksum

--deleteOnSuccess

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Deletes source files after a successful copy—similar to a mv operation. Use this for data migration when source files are no longer needed after transfer.

Important

Deletion is irreversible. Verify that the copy completed successfully before using this flag in production.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --deleteOnSuccess

--enableTransaction

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Enables job-level atomicity. By default, Jindo DistCp ensures data integrity at the task level—a failed task does not affect other tasks. With --enableTransaction, the entire job succeeds or fails as a unit: no partial results are written to the destination.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --enableTransaction

--ignore

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Continues the job when individual file copies fail, rather than stopping the entire job. Failed files are recorded in the Jindo DistCp counters (see COPY_FAILED). If CloudMonitor is enabled, alerts are sent through your configured notification channels.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --ignore

--diff

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Enables DIFF mode. Instead of failing silently, Jindo DistCp compares the source and destination: if any source files were not successfully copied to the destination directory, it generates a file in the directory where the command is run that records the differences between source and destination files.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --diff

Sample output when differences exist:

JindoCounter
DIFF_FILES=1

To include metadata in the comparison, combine --diff with --preserveMeta:

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --diff --preserveMeta

Limitations:

  • File size differences are not accurate if Jindo DistCp applied compression or decompression during a previous copy.

  • When --dest is an HDFS path, use /path, hdfs://hostname:port/path, or hdfs://headerIp:port/path. The formats hdfs:///path and hdfs:/path are not supported.

--update

Version: 4.3.0+ | OSS: Supported | OSS-HDFS: Supported

Copies only files that are missing from the destination or differ from the corresponding destination files. Use this to resume an interrupted job or to sync new files added to the source.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --update
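The per-file decision --update makes can be sketched as follows. This is assumed logic based on the description above, not Jindo's actual implementation: a file is recopied when it is missing from the destination or differs (Jindo also compares checksums, omitted here for brevity):

```python
from typing import Optional

def needs_copy(src_size: int, dest_size: Optional[int]) -> bool:
    """Decide whether an incremental copy should transfer this file.

    dest_size is None when the file is missing from the destination.
    """
    return dest_size is None or dest_size != src_size

print(needs_copy(1024, None))  # True  -> missing at destination, copied
print(needs_copy(1024, 1024))  # False -> skipped (counted in FILES_SKIPPED)
print(needs_copy(1024, 512))   # True  -> differs, recopied
```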

--preserveMeta

Version: 4.4.0+ | OSS: Not supported | OSS-HDFS: Supported

Copies file metadata along with file content. The following metadata attributes are preserved: Owner, Group, Permission, Atime, Mtime, Replication, BlockSize, XAttrs, and ACL.

OSS does not support this parameter because it does not have an equivalent metadata model to HDFS. Use this parameter only when the destination is OSS-HDFS.

hadoop jar jindo-distcp-tool-${version}.jar \
  --src /data/hourly_table \
  --dest oss://example-oss-bucket/hourly_table \
  --preserveMeta

--enableCMS

Version: 4.5.1+ | OSS: Supported | OSS-HDFS: Supported

Enables monitoring and alerting through CloudMonitor. When active, CloudMonitor sends notifications through your configured alert channels when copy errors occur.

Jindo DistCp counters

After each job, Jindo DistCp reports execution statistics through counters. Use these to verify copy results and diagnose issues.

| Counter | Description |
| --- | --- |
| FILES_EXPECTED | Number of files expected to be copied |
| BYTES_EXPECTED | Number of bytes expected to be copied |
| FILES_COPIED | Number of files successfully copied |
| BYTES_COPIED | Number of bytes successfully copied |
| FILES_SKIPPED | Number of files skipped during incremental update (--update) |
| BYTES_SKIPPED | Number of bytes skipped during incremental update |
| COPY_FAILED | Number of files that failed to copy |
| CHECKSUM_DIFF | Number of files that failed checksum verification; included in COPY_FAILED (in copy mode) and in DIFF_FILES (in --diff mode) |
| DIFF_FILES | Number of files that differ between source and destination (--diff mode) |
| SAME_FILES | Number of files that are identical in source and destination (--diff mode) |
| DST_MISS | Files missing from the destination; included in DIFF_FILES |
| LENGTH_DIFF | Files with different sizes in source and destination; included in DIFF_FILES |
| DIFF_FAILED | Files that could not be compared |
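To sanity-check a run programmatically, you can parse the counter lines from the job output. A sketch, assuming counters appear as KEY=VALUE lines as in the --diff sample above; the consistency rule checked at the end (copied + skipped + failed == expected) is an assumption about how the counters relate, not documented behavior:

```python
import re

COUNTER_RE = re.compile(r"^([A-Z_]+)=(\d+)$")

def parse_counters(output: str) -> dict:
    """Extract KEY=VALUE counter lines (e.g. FILES_COPIED=42) from job output."""
    counters = {}
    for line in output.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

sample = """
JindoCounter
FILES_EXPECTED=100
FILES_COPIED=97
FILES_SKIPPED=2
COPY_FAILED=1
"""
c = parse_counters(sample)
# Assumed consistency check: every expected file is accounted for.
assert c["FILES_COPIED"] + c["FILES_SKIPPED"] + c["COPY_FAILED"] == c["FILES_EXPECTED"]
print(c["COPY_FAILED"])  # 1
```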