
E-MapReduce:Use Jindo DistCp

Last Updated:Jun 01, 2023

This topic describes how to use Jindo DistCp.

What is Jindo DistCp?

Jindo DistCp is a distributed copy tool developed by the Alibaba Cloud data lake storage team to copy files within or between large-scale clusters. Jindo DistCp uses MapReduce to distribute files, handle errors, and recover data. Files and directories are used as the input for map tasks, and each task copies part of the files and directories in the input list.

Jindo DistCp supports full data copy among Hadoop Distributed File System (HDFS), OSS-HDFS, Object Storage Service (OSS), and Amazon Simple Storage Service (Amazon S3), and provides various copy parameters and copy policies. When you use Jindo DistCp to copy data from HDFS to OSS-HDFS, you can use a custom CopyCommitter to copy files without renaming them, which ensures that the copies are consistent with the source files. Jindo DistCp supports all features provided by Amazon S3 DistCp and HDFS DistCp. Compared with HDFS DistCp, Jindo DistCp significantly improves the efficiency, stability, and security of data copying.

Environment requirements

  • You need to install Java Development Kit (JDK) 1.8.0.

  • If you use Hadoop 2.3 or later, download the latest jindosdk-${version}.tar.gz package from the link provided in the Download JindoData topic, extract the package, and then find the jindo-distcp-tool-x.x.x.jar package in the tools folder of the extracted package.

    Note

    Jindo DistCp is deployed in clusters of E-MapReduce (EMR) V5.6.0 or a later minor version and clusters of EMR V3.40.0 or a later minor version. You can find the jindo-distcp-tool-x.x.x.jar package in the /opt/apps/JINDOSDK/jindosdk-current/tools directory.
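The extraction step can be sketched as follows. The version number and paths are illustrative, and the first few lines only simulate the downloaded package layout so that the sketch is self-contained; download the real package from the link in the Download JindoData topic:

```shell
# Illustrative only: simulate the downloaded package layout locally so the
# extraction steps below can run anywhere. Replace 6.2.0 with your version.
version=6.2.0
mkdir -p "jindosdk-${version}/tools"
touch "jindosdk-${version}/tools/jindo-distcp-tool-${version}.jar"
tar -czf "jindosdk-${version}.tar.gz" "jindosdk-${version}"
rm -r "jindosdk-${version}"

# The actual steps after downloading: extract the package and locate the JAR.
tar -zxf "jindosdk-${version}.tar.gz"
ls "jindosdk-${version}"/tools/jindo-distcp-tool-*.jar
```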

Parameters

Jindo DistCp is provided as a JAR package. You run the package by using the hadoop jar command together with the parameters that are described in the following table.

| Parameter | Required | Description | Default value | Version | OSS | OSS-HDFS |
| --- | --- | --- | --- | --- | --- | --- |
| --src | Yes | Specifies the source directory. Supported prefixes: hdfs://, oss://, s3://, cos://, and obs://. | No default value | 4.3.0 and later | Supported | Supported |
| --dest | Yes | Specifies the destination directory. Supported prefixes: hdfs://, oss://, s3://, cos://, and obs://. | No default value | 4.3.0 and later | Supported | Supported |
| --bandWidth | No | Specifies the bandwidth that can be used by a DistCp job. Unit: MB. | -1 | 4.3.0 and later | Supported | Supported |
| --codec | No | Specifies the codec that is used to compress and decompress files. Supported codecs: gzip, gz, lzo, lzop, and snappy. | keep (the source files are copied without being compressed or decompressed) | 4.3.0 and later | Supported | Supported |
| --policy | No | Specifies the storage class of the files copied to OSS. Supported storage classes: Standard, Infrequent Access (IA), Archive, and Cold Archive. | Standard | 4.3.0 and later | Supported | Not supported |
| --filters | No | Specifies the file that contains filter conditions. | No default value | 4.3.0 and later | Supported | Supported |
| --srcPrefixesFile | No | Specifies the file that contains matching conditions. | No default value | 4.3.0 and later | Supported | Supported |
| --parallelism | No | Specifies the parallelism of a DistCp job. Equivalent to the mapreduce.job.maps parameter for MapReduce jobs. | 10 | 4.3.0 and later | Supported | Supported |
| --jobBatch | No | Specifies the number of files that can be processed by a DistCp job. | 10000 | 4.5.1 and later | Supported | Supported |
| --taskBatch | No | Specifies the number of files that can be processed by a DistCp task. | 1 | 4.3.0 and later | Supported | Supported |
| --tmp | No | Specifies the HDFS temporary directory that is used to store temporary data. | /tmp | 4.3.0 and later | Supported | Supported |
| --hadoopConf <key=value> | No | Specifies the AccessKey pair that is used to access OSS or OSS-HDFS. | No default value | 4.3.0 and later | Supported | Supported |
| --disableChecksum | No | Specifies whether to disable the file checksum feature. | false | 4.3.0 and later | Supported | Supported |
| --deleteOnSuccess | No | Specifies whether to delete the source files after they are copied. | false | 4.3.0 and later | Supported | Supported |
| --enableTransaction | No | Specifies whether to enable transactions for the DistCp job to ensure job-level atomicity. | false | 4.3.0 and later | Supported | Supported |
| --ignore | No | Specifies whether to ignore exceptions that occur during the copy process so that they do not interrupt the ongoing job. | false | 4.3.0 and later | Supported | Supported |
| --enableCMS | No | Specifies whether to enable the monitoring and alerting features of CloudMonitor. | false | 4.5.1 and later | Supported | Supported |
| --diff | No | Specifies whether to use the DIFF mode. In DIFF mode, a file is generated to record the differences between the source and destination files. | DistCpMode.COPY | 4.3.0 and later | Supported | Supported |
| --update | No | Specifies whether to use the UPDATE mode. In UPDATE mode, only incremental source files are copied, such as source files that do not exist in the destination directory or that differ from the destination files. | DistCpMode.COPY | 4.3.0 and later | Supported | Supported |
| --preserveMeta | No | Specifies whether to copy metadata. | false | 4.4.0 and later | Not supported | Supported |

--src and --dest (Required)

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

  • --src: specifies the directory of the source files.

  • --dest: specifies the directory of the destination files.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table

The --dest parameter specifies the directory to which the source files are copied. The preceding sample command copies the files in the /data/hourly_table directory to the /hourly_table directory of an OSS bucket named example-oss-bucket. Unlike HDFS DistCp, Jindo DistCp copies only the contents of the source directory to the specified destination directory by default; it does not copy the source directory itself. If you want the source directory to appear under the destination, include it in the path that you specify in the --dest parameter. If the specified directory does not exist in the destination, Jindo DistCp creates it.

If you want to copy a single file, specify a directory in the --dest parameter. Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /test.txt --dest oss://example-oss-bucket/tmp

--bandWidth

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--bandWidth: specifies the bandwidth used by a DistCp task. Unit: MB. This prevents a single DistCp task from occupying excessive bandwidth resources.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --bandWidth 6

--codec

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

Files copied to OSS or OSS-HDFS are often uncompressed text files, which is not ideal for storage cost control or data analysis. Jindo DistCp allows you to configure the --codec parameter to compress the destination files on the fly, which helps you save storage resources.

The --codec parameter specifies the codec that is used to compress and decompress files. Valid values: gzip, gz, lzo, lzop, snappy, none, and keep. Default value: keep. Descriptions of none and keep:

  • none: copies the source files without compression. If the source files are compressed, Jindo DistCp decompresses the files.

  • keep: copies the source files without compressing or decompressing the files.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --codec gz

After the command is run, you can check the files in the destination directory. The output shows that the files are compressed by using the gz codec.

[root@emr-header-1 opt]# hdfs dfs -ls oss://example-oss-bucket/hourly_table/2017-02-01/03
Found 6 items
-rw-rw-rw-   1        938 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/000151.sst.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/1.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/2.log.gz
-rw-rw-rw-   1       1956 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/OPTIONS-000109.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/emp01.txt.gz
-rw-rw-rw-   1        506 2020-04-17 20:58 oss://example-oss-bucket/hourly_table/2017-02-01/03/emp06.txt.gz
Note

If you want to use the lzo codec in a self-managed Hadoop cluster, you must install the gplcompression native library and the hadoop-lzo package. If you cannot install these dependencies in your environment, we recommend that you use another codec.

--filters

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--filters: specifies the file that contains filter conditions.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --filters filter.txt

For example, the filter.txt file contains the .*test.* string. In this case, the files whose paths contain the test string are not copied to OSS.
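For example, a filters file can be assembled as follows. Each line is one regular expression; the second pattern below is a hypothetical addition that excludes temporary files:

```shell
# Build filter.txt: files whose paths match any of these regular
# expressions are excluded from the copy.
cat > filter.txt <<'EOF'
.*test.*
.*\.tmp$
EOF
cat filter.txt
```

You would then pass the file to the job with --filters filter.txt, as in the sample command above.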

--srcPrefixesFile

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--srcPrefixesFile: specifies the file that contains matching conditions.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --srcPrefixesFile prefixes.txt

For example, the prefixes.txt file contains the .*test.* string. In this case, only the files whose paths contain the test string are copied to OSS.
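The file has the same one-condition-per-line layout as a filters file, but the effect is inverted: --filters excludes matching paths, while --srcPrefixesFile restricts the copy to matching paths. A minimal sketch:

```shell
# Build prefixes.txt: only files whose paths match a condition are copied.
cat > prefixes.txt <<'EOF'
.*test.*
EOF
cat prefixes.txt
```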

--parallelism

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--parallelism: specifies the parallelism of a DistCp job. This parameter is equivalent to the mapreduce.job.maps parameter for MapReduce jobs. The default value of the mapreduce.job.maps parameter is 10 in EMR. You can specify the value of the --parallelism parameter based on the usage of cluster resources.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /opt/tmp --dest oss://example-oss-bucket/tmp --parallelism 20

--taskBatch

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--taskBatch: specifies the number of files that can be processed by a DistCp task. Default value: 1.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --taskBatch 1

--tmp

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--tmp: specifies the HDFS temporary directory that is used to store temporary data. The default value is /tmp, which indicates hdfs:///tmp/.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --tmp /tmp

--hadoopConf

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--hadoopConf: specifies Hadoop configuration items in the key=value format. In a non-EMR environment, or in a scenario in which AccessKey-free access is not supported, use this parameter to specify the AccessKey pair that is used to access OSS or OSS-HDFS.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --hadoopConf fs.oss.accessKeyId=yourkey --hadoopConf fs.oss.accessKeySecret=yoursecret

You can also add the AccessKey pair that is used to access OSS or OSS-HDFS to the core-site.xml file of Hadoop. This saves you the need to enter the AccessKey pair each time. For example, you can add the following information to the core-site.xml file of the Hadoop-Common service in the EMR console:

<configuration>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>xxx</value>
    </property>

    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>xxx</value>
    </property>
</configuration>

--disableChecksum

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--disableChecksum: specifies whether to disable the file checksum feature.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --disableChecksum

--deleteOnSuccess

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--deleteOnSuccess: specifies whether to migrate data instead of copying data. When you configure this parameter, the copy operation is similar to the operation performed by using the mv command. Jindo DistCp copies the source files and then deletes the source files from the source directory.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --deleteOnSuccess

--enableTransaction

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--enableTransaction: specifies whether to enable transactions for jobs. Transactions help ensure data integrity at the job level. By default, Jindo DistCp ensures data integrity at the task level.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --enableTransaction

--ignore

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--ignore: specifies whether to ignore exceptions that occur during the data copy process to prevent the exceptions from interrupting the ongoing job. The exceptions are displayed in Jindo DistCp counters. If the monitoring and alerting features of CloudMonitor are enabled, the system sends notifications to you by using the specified methods when exceptions occur.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --ignore

--diff

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--diff: specifies whether to use the DIFF mode. In DIFF mode, a file is generated to record the differences between the source and destination files. If a source file fails to be copied to the destination directory, a file is generated in the directory where the command is run to record the differences between the source and destination files. If Jindo DistCp has performed compression or decompression operations during the copy process, --diff does not return accurate file size differences.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --diff

If the destination files differ from the source files, a file is generated in the directory where the command is run to record the file differences and the following information is displayed:

JindoCounter
DIFF_FILES=1

If the destination directory is an HDFS directory, you can specify the directory in the /path, hdfs://hostname:port/path, or hdfs://headerIp:port/path format. Other formats, such as hdfs:///path and hdfs:/path, are not supported.

If you want to view the metadata differences between the source and destination files, you can configure the --diff and --preserveMeta parameters. Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --diff --preserveMeta

--update

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Supported |

--update: specifies whether to use the UPDATE mode. In UPDATE mode, only incremental source files are copied, such as the source files that do not exist in the destination directory or the source files that differ from the destination files.

If a Jindo DistCp job is interrupted and files fail to be copied to the destination directory, you can use the --update parameter to copy these files. If specific files are added to the source directory, you can also use the --update parameter to copy the incremental files to the destination directory.

Sample command:

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --update
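The --diff and --update parameters combine naturally into a verify-and-repair workflow: copy, check for differences, then copy only what is missing. The following sketch only assembles the three commands (the source and destination paths match the samples in this topic, and the version number is a hypothetical placeholder; running the commands requires a cluster, so they are printed rather than executed here):

```shell
# Sketch of a copy -> verify -> repair sequence. On a real cluster, remove
# the echo and run each command directly.
version=6.2.0                               # hypothetical JindoSDK version
SRC=/data/hourly_table
DEST=oss://example-oss-bucket/hourly_table

echo "hadoop jar jindo-distcp-tool-${version}.jar --src ${SRC} --dest ${DEST}"          # 1. full copy
echo "hadoop jar jindo-distcp-tool-${version}.jar --src ${SRC} --dest ${DEST} --diff"   # 2. verify: DIFF_FILES=0 means consistent
echo "hadoop jar jindo-distcp-tool-${version}.jar --src ${SRC} --dest ${DEST} --update" # 3. copy only missing or changed files
```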

--policy

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.3.0 and later | Supported | Not supported |

--policy: specifies the storage class of the files copied to OSS, such as Cold Archive, Archive, or IA. If you do not configure this parameter, the Standard storage class is used by default.

  • Cold Archive

    Only specific regions support this storage class. For more information, see Overview. Sample command:

     hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-bucket/hourly_table --policy coldArchive --parallelism 20
  • Archive

     hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-bucket/hourly_table --policy archive --parallelism 20
  • IA

     hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-bucket/hourly_table --policy ia --parallelism 20

--preserveMeta

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.4.0 and later | Not supported | Supported |

--preserveMeta: specifies whether to copy metadata during the data copy process, including the Owner, Group, Permission, Atime, Mtime, Replication, BlockSize, XAttrs, and ACL metadata.

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --preserveMeta

--jobBatch

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.5.1 and later | Supported | Supported |

--jobBatch: specifies the number of files that can be processed by a DistCp job when you copy data to OSS. Default value: 10000.

hadoop jar jindo-distcp-tool-${version}.jar --src /data/hourly_table --dest oss://example-oss-bucket/hourly_table --jobBatch 50000

--enableCMS

| Version | OSS | OSS-HDFS |
| --- | --- | --- |
| 4.5.1 and later | Supported | Supported |

--enableCMS: specifies whether to enable the monitoring and alerting features of CloudMonitor.

Jindo DistCp counters

Jindo DistCp counters describe the execution results of Jindo DistCp jobs. The following table describes specific counters.

| Counter | Description |
| --- | --- |
| COPY_FAILED | The number of files that fail to be copied. |
| CHECKSUM_DIFF | The number of files that fail the checksum verification. This number is added to the value of COPY_FAILED. |
| FILES_EXPECTED | The number of files that are expected to be copied. |
| BYTES_EXPECTED | The number of bytes that are expected to be copied. |
| FILES_COPIED | The number of files that are copied. |
| BYTES_COPIED | The number of bytes that are copied. |
| FILES_SKIPPED | The number of files that are skipped during incremental updates. |
| BYTES_SKIPPED | The number of bytes that are skipped during incremental updates. |
| DIFF_FILES | The number of files that differ between the source and destination directories. |
| SAME_FILES | The number of files that are identical in the source and destination directories. |
| DST_MISS | The number of files that do not exist in the destination directory. This number is added to the value of DIFF_FILES. |
| LENGTH_DIFF | The number of files whose sizes differ between the source and destination directories. This number is added to the value of DIFF_FILES. |
| CHECKSUM_DIFF | The number of files that fail the checksum verification. This number is added to the value of DIFF_FILES. |
| DIFF_FAILED | The number of files that could not be properly compared. |
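Counters appear in the job output in KEY=VALUE form, as shown in the DIFF_FILES example earlier in this topic. A saved job log can therefore be checked with a simple grep; the log file name and counter values below are illustrative:

```shell
# Illustrative only: a saved job log containing a JindoCounter section in
# the KEY=VALUE format that Jindo DistCp prints.
cat > distcp-job.log <<'EOF'
JindoCounter
FILES_EXPECTED=120
FILES_COPIED=119
COPY_FAILED=1
DIFF_FILES=0
EOF

# Pull out the counters that indicate failures or differences.
grep -E '^(COPY_FAILED|CHECKSUM_DIFF|DIFF_FILES)=' distcp-job.log
```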