
E-MapReduce: Use Jindo DistCp

Last Updated: Mar 26, 2026

Use Jindo DistCp to copy data

Jindo DistCp is a distributed data copy tool built on E-MapReduce (EMR). Based on Apache DistCp, it is optimized for large-scale data transfers between Hadoop Distributed File System (HDFS) and Object Storage Service (OSS), with native support for OSS storage classes, Cloud Monitor alerts, and S3-compatible sources.

Prerequisites

Before you begin, ensure that you have:

  • Java Development Kit (JDK) 8 installed on your computer

  • An EMR cluster. For more information, see Create a cluster.

Parameter reference

All parameters are optional except --src and --dest.

  • --src: Source directory. Required.
  • --dest: Destination directory. Required. For HDFS destinations, use /path, hdfs://hostname:port/path, or hdfs://headerIp:port/path only.
  • --parallelism: Number of parallel reduce tasks (mapreduce.job.reduces). Default: 7.
  • --srcPattern: Regular expression to filter source files. Must match the full file path.
  • --deleteOnSuccess: Delete source files after a successful copy.
  • --outputCodec: Compression codec for copied files. Default: keep. Valid values: gzip, gz, lzo, lzop, snappy, none, keep. To use LZO in an open source Hadoop cluster, install the gplcompression native library and the hadoop-lzo package.
  • --outputManifest: Name of the manifest file to generate. GZ format only. Set --requirePreviousManifest=false to create a new manifest.
  • --previousManifest: Path to an existing manifest file. Used with --outputManifest for incremental tracking, or with --copyFromManifest to replay a manifest.
  • --requirePreviousManifest: Whether a previous manifest is required. Set to false when generating the first manifest.
  • --copyFromManifest: Copy files listed in a manifest instead of listing a directory. Requires --previousManifest.
  • --srcPrefixesFile: Path to a file containing source URI prefixes, one prefix per line.
  • --groupBy: Regular expression to group input files for merging. Cannot be used with --enableBalancePlan or --enableDynamicPlan.
  • --targetSize: Target size for merged output files, in MB. Cannot be used with --enableBalancePlan or --enableDynamicPlan.
  • --enableBalancePlan: Optimize task allocation when file sizes are similar. Cannot be used with --groupBy or --targetSize.
  • --enableDynamicPlan: Optimize task allocation when most files are small. Cannot be used with --groupBy or --targetSize.
  • --enableTransaction: Enable job-level transaction support.
  • --diff: Compare source and destination file lists. Does not return accurate size differences if files were compressed or decompressed during the copy.
  • --update: Copy only missing or changed files and directories (incremental update).
  • --filters: Path to a filter file; each line is a regular expression. Matching files are excluded from copy and --diff operations.
  • --queue: YARN queue name.
  • --bandwidth: Bandwidth limit per node, in MB/s.
  • --policy: OSS storage class for copied files. Valid values: coldArchive, archive, ia. Cold Archive is available only in select regions.
  • --key: AccessKey ID for OSS access. Required when accessing OSS from outside EMR or when AccessKey-free access is not supported.
  • --secret: AccessKey secret for OSS access. See --key.
  • --endPoint: OSS endpoint. See --key.
  • --s3Key: Amazon S3 access key.
  • --s3Secret: Amazon S3 secret key.
  • --s3EndPoint: Amazon S3 endpoint.
  • --enableCMS: Report task failures to Cloud Monitor. Requires Cloud Monitor environment variables to be set. See Use Cloud Monitor.
  • --cleanUpPending: Clean up incomplete multipart uploads after the task finishes. Alternatively, clean up in the OSS console.

Connect to the EMR cluster

SSH into the master node of your EMR cluster. For more information, see Connect to the master node of an EMR cluster in SSH mode.

Copy files

Basic copy

Copy all files from /opt/tmp in HDFS to an OSS bucket:

jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp

Replace <yourBucketName> with the name of your OSS bucket. If the directory specified in --dest does not exist, Jindo DistCp creates it automatically.

Adjust parallelism

--parallelism controls how many reduce tasks run in parallel. The default is 7. Increase it when you have more cluster resources available:

jindo distcp --src /opt/tmp --dest oss://<yourBucketName>/tmp --parallelism 20

Filter files by pattern

--srcPattern accepts a regular expression that must match the full file path. To copy only .log files from /data/incoming/hourly_table:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPattern '.*\.log' --parallelism 20

Verify the result:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03

Expected output — only .log files are copied:

Found 2 items
-rw-rw-rw-   1       4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log
-rw-rw-rw-   1       4891 2020-04-17 20:52 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log
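Because --srcPattern must match the full file path, it can save a failed job to sanity-check the pattern locally first. This sketch (the paths are hypothetical) applies the same regular expression with grep -E, anchored to the whole line the way Jindo DistCp anchors it to the whole path:

```shell
# Hypothetical source paths; print only those the --srcPattern regex '.*\.log' would select.
printf '%s\n' \
  /data/incoming/hourly_table/2017-02-01/03/1.log \
  /data/incoming/hourly_table/2017-02-01/03/000151.sst \
  /data/incoming/hourly_table/2017-02-01/03/emp01.txt \
| grep -E '^.*\.log$'
```

Only the 1.log path is printed; the .sst and .txt paths do not match and would be skipped by the copy.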

Delete source files after copy

--deleteOnSuccess removes source files once the copy succeeds:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --deleteOnSuccess --parallelism 20

Compress files during copy

--outputCodec compresses files as they are copied. Valid values: gzip, gz, lzo, lzop, snappy, none, keep (default).

  • none: decompress files that are already compressed

  • keep: copy files without changing their compression

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputCodec=gz --parallelism 20

Verify that files in the destination are compressed:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03

Expected output:

Found 6 items
-rw-rw-rw-   1        938 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/2.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/OPTIONS-000109.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp01.txt.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp06.txt.gz
To use the LZO codec in an open source Hadoop cluster, install the gplcompression native library and the hadoop-lzo package.

Copy files in multiple folders

--srcPrefixesFile points to a file where each line is a source URI prefix. Jindo DistCp copies all files under the listed prefixes:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --srcPrefixesFile file:///opt/folders.txt --parallelism 20

Sample folders.txt:

hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-01
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-02

Merge small files

Reading large numbers of small files from HDFS degrades query performance. Use --groupBy and --targetSize to merge them into larger files.

--groupBy is a regular expression that groups files by a captured pattern. --targetSize sets the maximum output file size in MB.

The following example merges all .txt files (grouped by name prefix) into files no larger than 10 MB:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --targetSize=10 --groupBy='.*/([a-z]+).*\.txt' --parallelism 20

Verify the result:

hdfs dfs -ls oss://<yourBucketName>/hourly_table/2017-02-01/03/

Expected output — two .txt files merged into one:

Found 1 items
-rw-rw-rw-   1       2032 2020-04-17 21:18 oss://<yourBucketName>/hourly_table/2017-02-01/03/emp2
--groupBy and --targetSize cannot be used with --enableBalancePlan or --enableDynamicPlan.
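The capture group in --groupBy is what decides which files land in the same merged output. You can inspect the group key a pattern extracts before running the job; this sketch (hypothetical paths) pulls out the captured group with sed:

```shell
# Extract the merge-group key that --groupBy='.*/([a-z]+).*\.txt' captures for each path.
printf '%s\n' \
  /data/incoming/hourly_table/2017-02-01/03/emp01.txt \
  /data/incoming/hourly_table/2017-02-01/03/emp06.txt \
| sed -E 's|.*/([a-z]+).*\.txt|\1|'
```

Both paths yield the key emp, so emp01.txt and emp06.txt belong to one group and are merged into a single output file, as the verification output above shows.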

Optimize task allocation

Use one of these options to improve copy performance based on your file size distribution. Neither option can be used with --groupBy or --targetSize.

  • --enableBalancePlan: Use when the sizes of the files to be copied do not differ significantly. Distributes tasks evenly across reducers.

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableBalancePlan --parallelism 20
  • --enableDynamicPlan: Use when file sizes vary significantly and most files are small. Allocates tasks dynamically to avoid reducer imbalance.

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableDynamicPlan --parallelism 20

Enable transaction support

--enableTransaction ensures job-level atomicity:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --enableTransaction --parallelism 20

Exclude files with filters

--filters points to a file where each line is a regular expression. Files with paths matching any expression are excluded from both copy and --diff operations.

hadoop jar jindo-distcp-3.5.0.jar --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --filters /path/to/filterfile.txt --parallelism 20

Sample filter file:

.*\.tmp.*
.*\.staging.*

Files whose paths contain .tmp or .staging are skipped.
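You can dry-run filter expressions locally with grep -Ev, which, like --filters, drops every line that matches any of the patterns. The paths below are hypothetical:

```shell
# Hypothetical paths; lines matching either filter regex are excluded, as --filters would do.
printf '%s\n' \
  /data/incoming/hourly_table/2017-02-01/03/1.log \
  /data/incoming/hourly_table/job_1.tmp.gz \
  /data/incoming/hourly_table/.staging/part-00000 \
| grep -Ev -e '.*\.tmp.*' -e '.*\.staging.*'
```

Only the 1.log path survives; the .tmp and .staging paths are filtered out.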

Limit bandwidth

--bandwidth caps the bandwidth used by each node (in MB/s), preventing the task from consuming all available network capacity:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --bandwidth 100 --parallelism 20

Specify a YARN queue

--queue assigns the DistCp job to a specific YARN queue:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --queue yarnqueue

Track changes with manifest files

A manifest file records all files copied by a job — their destination paths, source paths, and sizes. Use manifests to track incremental copies and verify completeness.

Generate a manifest

Set --requirePreviousManifest=false to create the first manifest. The file is always compressed in GZ format:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-17.gz --requirePreviousManifest=false --parallelism 20

Read the manifest:

hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz > before.lst
cat before.lst

Each line is a JSON object describing one file:

{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/000151.sst","baseName":"2017-02-01/03/000151.sst","srcDir":"oss://<yourBucketName>/hourly_table","size":2252}
{"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/1.log","baseName":"2017-02-01/03/1.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
...
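Because each manifest line is a JSON object with a numeric size field, the manifest is easy to post-process, for example to total the bytes a job recorded. This sketch uses hypothetical manifest lines in place of a real `hadoop fs -text` read:

```shell
# Sum the "size" fields of manifest entries to get the total bytes recorded by the job.
printf '%s\n' \
  '{"path":"oss://bucket/hourly_table/2017-02-01/03/000151.sst","baseName":"2017-02-01/03/000151.sst","srcDir":"oss://bucket/hourly_table","size":2252}' \
  '{"path":"oss://bucket/hourly_table/2017-02-01/03/1.log","baseName":"2017-02-01/03/1.log","srcDir":"oss://bucket/hourly_table","size":4891}' \
| grep -oE '"size":[0-9]+' | cut -d: -f2 | awk '{s+=$1} END {print s}'
```

For these two entries the total is 2252 + 4891 = 7143 bytes; on a real manifest, pipe the output of `hadoop fs -text` into the same filter.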

Copy newly added files

Pass the previous manifest to --previousManifest so that Jindo DistCp only copies files not already listed:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --outputManifest=manifest-2020-04-18.gz --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --parallelism 20

Compare manifests to see what was newly copied:

hadoop fs -text oss://<yourBucketName>/hourly_table/manifest-2020-04-18.gz > current.lst
diff before.lst current.lst

Output shows the two newly added files:

3a4,5
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/5.log","baseName":"2017-02-01/03/5.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}
> {"path":"oss://<yourBucketName>/hourly_table/2017-02-01/03/6.log","baseName":"2017-02-01/03/6.log","srcDir":"oss://<yourBucketName>/hourly_table","size":4891}

Replay a manifest

To copy exactly the files listed in a previous manifest:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=oss://<yourBucketName>/hourly_table/manifest-2020-04-17.gz --copyFromManifest --parallelism 20

Verify and sync

Check for differences

--diff compares source and destination file lists. Run it after a copy to verify completeness:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --diff

If all files are copied:

INFO distcp.JindoDistCp: Jindo DistCp job exit with 0

If some files are missing, Jindo DistCp generates a manifest in the destination directory listing the missing files. Use --copyFromManifest and --previousManifest to copy them:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
--diff does not return accurate file size differences if files were compressed or decompressed during the copy. For HDFS destinations, use /path, hdfs://hostname:port/path, or hdfs://headerIp:port/path. Formats like hdfs:///path and hdfs:/path are not supported.

Sync incrementally

--update skips files and directories that already match the destination and copies only new or changed ones:

hadoop jar jindo-distcp-3.5.0.jar --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --update --parallelism 20

Write to OSS storage classes

Use --policy to write data directly to a specific OSS storage class.

  • Cold Archive (available in select regions):

    hadoop jar jindo-distcp-3.5.0.jar --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy coldArchive --parallelism 20

    For supported regions, see Overview, Archive, and Cold Archive.

  • Archive:

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy archive --parallelism 20
  • Infrequent Access (IA):

    jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --policy ia --parallelism 20

Access OSS with an AccessKey pair

When accessing OSS from outside EMR, or when AccessKey-free access is not supported, specify credentials directly in the command:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --key <yourAccessKeyId> --secret <yourAccessKeySecret> --endPoint oss-cn-hangzhou.aliyuncs.com --parallelism 20

Replace <yourAccessKeyId> and <yourAccessKeySecret> with your Alibaba Cloud AccessKey ID and AccessKey secret.

Use Amazon S3 as a source

Specify S3 credentials with --s3Key, --s3Secret, and --s3EndPoint:

jindo distcp --src s3a://yourbucket/ --dest oss://<yourBucketName>/hourly_table --s3Key yourkey --s3Secret yoursecret --s3EndPoint s3-us-west-1.amazonaws.com

To avoid passing credentials on every command, add them to core-site.xml in your Hadoop configuration:

<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3-us-west-1.amazonaws.com</value>
    </property>
</configuration>

Then run without explicit credentials:

jindo distcp --src s3://smartdata1/ --dest s3://smartdata1/tmp --s3EndPoint s3-us-west-1.amazonaws.com

Use Cloud Monitor

Cloud Monitor can alert you when a DistCp job fails. To configure alerting:

  1. Create an alert contact or alert contact group. For more information, see Create an alert contact or alert group.

  2. Obtain an alert token:

    1. In the left-side navigation pane, choose Alerts > Alert Contacts.

    2. Click the Alert Contact Group tab.

    3. Find your alert contact group and click Access External alert.

    4. Record the alert token shown in the panel.

  3. In the panel, click Test Command, and then set the following environment variables:

    • cmsAccessKeyId: AccessKey ID of your Alibaba Cloud account
    • cmsAccessSecret: AccessKey secret of your Alibaba Cloud account
    • cmsRegion: Region ID of the cluster, for example, cn-hangzhou
    • cmsToken: Alert token obtained in step 2
    • cmsLevel: Alert level: INFO (email and DingTalk), WARN (SMS, email, and DingTalk), or CRITICAL (phone call, SMS, email, and DingTalk)

    Example:
    export cmsAccessKeyId=<your_key_id>
    export cmsAccessSecret=<your_key_secret>
    export cmsRegion=cn-hangzhou
    export cmsToken=<your_cms_token>
    export cmsLevel=WARN
    
    hadoop jar jindo-distcp-3.5.0.jar \
      --src /data/incoming/hourly_table \
      --dest oss://<yourBucketName>/hourly_table \
      --enableCMS

Clean up incomplete uploads

If a DistCp job is interrupted, partially uploaded files may remain in OSS. These files are tracked by OSS using an uploadId and are not visible in normal listings. Add --cleanUpPending to remove them automatically when the job finishes:

jindo distcp --src /data/incoming/hourly_table --dest oss://<yourBucketName>/hourly_table --cleanUpPending --parallelism 20

Alternatively, delete incomplete uploads in the OSS console.

Monitor job progress with DistCp counters

After a job completes, check the MapReduce counter output for the JindoDistcpCounter group:

JindoDistcpCounter
  BYTES_EXPECTED=10000
  BYTES_SKIPPED=10000
  FILES_EXPECTED=11
  FILES_SKIPPED=11
Counter descriptions:

  • COPY_FAILED: Files that failed to copy. An alert is triggered when this value is not 0.
  • CHECKSUM_DIFF: Files that failed checksum verification. Added to COPY_FAILED.
  • FILES_EXPECTED: Total number of files to copy.
  • FILES_COPIED: Files successfully copied.
  • FILES_SKIPPED: Files skipped during incremental updates.
  • BYTES_SKIPPED: Bytes skipped during incremental updates.
  • DIFF_FILES: Files with differences found by --diff. An alert is triggered when this value is not 0.
  • SAME_FILES: Files with no differences found by --diff.
  • DST_MISS: Files missing from the destination. Added to DIFF_FILES.
  • LENGTH_DIFF: Files with different sizes in source and destination. Added to DIFF_FILES.
  • CHECKSUM_DIFF: Files that failed checksum verification. Added to DIFF_FILES.
  • DIFF_FAILED: Files with errors during --diff. Check the log files for details.
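In automation, these counter lines are easy to scrape from the job output. A minimal sketch that fails a pipeline step when COPY_FAILED is nonzero; the counter text here is a hypothetical job log, and the exact layout may differ by version:

```shell
# Parse a JindoDistcpCounter block and fail if any file failed to copy.
counters='JindoDistcpCounter
  COPY_FAILED=0
  FILES_COPIED=11
  FILES_EXPECTED=11'
failed=$(printf '%s\n' "$counters" | grep -oE 'COPY_FAILED=[0-9]+' | cut -d= -f2)
if [ "$failed" -eq 0 ]; then
  echo "copy ok"
else
  echo "alert: $failed files failed to copy" >&2
  exit 1
fi
```

In a real pipeline you would pipe the captured job log into the same grep instead of the inline string, and wire the failure branch to your alerting.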